Older statistics methods being supplanted by more accurate ones

May 28, 2014 by David Warton, The Conversation

It's an exciting time to be doing statistics. You heard me – statistics: exciting.

It often gets a bad rap, but stats is after all at the business end of the scientific process. When I've collaborated on studies of megafauna, leopard seals, police confessions, a new casino game or climate change effects on biodiversity, the point where researchers find out their results and have those "Eureka!" moments is more often than not in front of a computer rather than out in the field.

Now is an especially good time to be a statistician because the technological revolution over the past couple of decades has blown the field wide open – but despite this, some researchers continue to use outdated and inadequate statistical methods.

The sooner we can change this, the better.

When I started high school (around 25 years ago), a typical computer had 64kB of memory. A compressed digital photo of one of those machines takes up more than 100kB, meaning the poor computer wouldn't even have enough memory to look at a picture of itself!

Modern computers are thousands of times faster and can have a million times as much memory. These and related technological advances have changed data analysis in two main ways:

  1. the increased capacity puts more computationally demanding methods of analysis on the table, so we can use more refined approaches to answer important questions better, and even tackle questions that previously couldn't be answered at all
  2. technology has led to the collection of new types of data, forcing statisticians to think in ways they hadn't before.

This has forced statisticians to think (amongst other things) about the challenging problem of high-dimensionality – when there are many more variables than subjects on which these variables were measured, leading to all sorts of methodological innovations.

The methodological advances over the past decade or two are hard enough to keep track of for someone with high-level training in statistics, so spare a thought for the applied scientist out there who has no clue what a sparse precision matrix is, let alone why you would want one or how an L1 penalty might help you get one.
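To unpack that jargon a little: a precision matrix is the inverse of a covariance matrix, and a sparse one (mostly zeros) says that most pairs of variables have no direct association once all the other variables are accounted for. An L1 penalty is the standard trick for forcing that sparsity. Here is a minimal sketch in Python using scikit-learn's graphical lasso on simulated data (the sizes and penalty value are purely illustrative):

```python
# Minimal sketch: estimating a sparse precision matrix via an L1 penalty
# (the "graphical lasso"), on simulated data.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))  # 50 observations, 20 variables

# alpha controls the L1 penalty: the larger it is, the more entries of
# the estimated precision matrix are forced to exactly zero.
model = GraphicalLasso(alpha=0.3).fit(X)

off_diag = model.precision_ - np.diag(np.diag(model.precision_))
print("Nonzero off-diagonal entries:", np.count_nonzero(off_diag))
```

When there are more variables than observations the sample covariance matrix can't even be inverted, which is exactly the high-dimensional situation the penalty is designed to rescue.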

Filling in the gaps

As an eco-statistician, an important part of what I do is to bridge the knowledge gap between statistics and ecology, which in one particular case has for some time been a huge gaping hole.

Ecologists often want to study whole communities rather than individual species (such as when looking for environmental impacts of human activities), and start by collecting data in the form of a big table of counts: which species were found, and in what numbers, at each site sampled.

One such dataset comes from a PhD student who has spent the past three years studying grassland communities of bugs (order Hemiptera) and how they change as temperature and rainfall change (with a view to predicting how they might shift in the future).

There are two important properties of data like these that you can see straight away:

  1. lots of zeros (most species not found in most places)
  2. lots of variables (many species recorded).

The first problem is relatively easy to deal with, but must be done with care (see the sketch below). Having lots of variables ("high dimensionality") is a much more difficult problem technically.
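To give a flavour of what "with care" means (this is a generic illustration, not the actual analysis from the bug study): the model-based approach is to fit a generalised linear model with a distribution built for overdispersed counts, such as the negative binomial, so that the zeros and the strong mean-variance relationship of abundance data are modelled directly rather than transformed away. A sketch in Python with statsmodels, on simulated data with a hypothetical temperature predictor:

```python
# Illustrative sketch (not the bug study's actual analysis): a negative
# binomial GLM for zero-heavy count data, fitted with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

n_sites = 100
temperature = rng.normal(20, 3, size=n_sites)  # hypothetical predictor

# Simulate overdispersed counts whose mean rises with temperature; the
# low means guarantee plenty of zeros, as in real abundance data.
mu = np.exp(-4 + 0.2 * temperature)
size = 1.0  # negative binomial dispersion parameter
counts = rng.negative_binomial(size, size / (size + mu))

X = sm.add_constant(temperature)
fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(fit.params)  # estimated intercept and slope on the log scale
print("Proportion of zero counts:", np.mean(counts == 0))
```

The point of the model-based route is that the zeros are a prediction of the model, not a nuisance to be massaged away before analysis.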

As for the second problem: if an ecologist walked into a statistician's office with this dataset in the 1980s there is a good chance they'd have been told to come back when they had fewer variables (and hence a more manageable problem).

Given the gap in the 1980s between what statistics could offer and what ecologists needed, some ecologists quite rightly went off and made up their own methods, going in quite a different direction to the statistical literature of the time.

These involved some quite clever innovations to deal with their computational constraints (think 64kB), but also some compromises which I've recently found lead to some undesirable properties (partly because they don't deal with the lots-of-zeros property well).
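To see how those compromises can bite, here is a toy simulation of my own construction (purely for illustration, not the published analysis): two groups of sites are given identical mean abundances for every species, but the smaller group is more variable. A distance-based permutation test in the PERMANOVA style can't separate differences in means from differences in variability, so it declares a "difference between groups" well above the nominal 5% rate, even though the means are identical:

```python
# Toy simulation (my own construction): two groups of sites share
# IDENTICAL mean abundances for every species, but the smaller group is
# more variable. A distance-based permutation test misreads that extra
# variability as a difference between groups.
import numpy as np

rng = np.random.default_rng(2)

def pseudo_f(d2, labels):
    """PERMANOVA-style pseudo-F from a matrix of squared distances."""
    n = len(labels)
    groups = np.unique(labels)
    ss_total = d2[np.triu_indices(n, 1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        pairs = d2[np.ix_(idx, idx)][np.triu_indices(len(idx), 1)]
        ss_within += pairs.sum() / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))

def one_trial(n_perm=199, n_species=30):
    mu = rng.uniform(0.5, 5.0, size=n_species)  # same means in both groups
    # negative_binomial(size, p) has mean size*(1-p)/p, so p = size/(size+mu)
    # gives mean mu; a smaller `size` means much more variability.
    y1 = rng.negative_binomial(10.0, 10.0 / (10.0 + mu), (15, n_species))
    y2 = rng.negative_binomial(0.5, 0.5 / (0.5 + mu), (5, n_species))
    y = np.vstack([y1, y2]).astype(float)
    labels = np.repeat([0, 1], [15, 5])
    d2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # sq. Euclidean
    obs = pseudo_f(d2, labels)
    worse = sum(pseudo_f(d2, rng.permutation(labels)) >= obs
                for _ in range(n_perm))
    return (1 + worse) / (1 + n_perm) <= 0.05  # reject at the 5% level?

rate = np.mean([one_trial() for _ in range(100)])
print(f"'Difference' declared in {rate:.0%} of datasets with identical means")
```

A researcher reading those rejections as changes in community composition would be drawing the wrong conclusion, and that's before the zeros have had their say.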

In science, solving the problem isn't the end of the story. You then have to find ways to communicate your results and show people how your solutions can be useful to them.

This step is especially challenging in my case because the problematic approaches ecologists have been using since the 1980s are entrenched in the discipline, and are often even taught to ecology students in second year at university.

Even when I can show that in realistic settings their methods give wrong answers as much as 100% of the time, it is a big ask for an ecologist to turn away from methods that have been standard in their discipline for their whole career, and used countless times by them and their contemporaries. Never gonna give you up?

Roll in Rick Astley

This is where Rick Astley comes in. In order to convince ecologists to think differently about how they look for patterns in ecological communities, I made a YouTube video comparing the methods they use to Rick Astley's music: it sounded like a great idea in the '80s, but these days there are better options out there.

Rick was huge between mid-1987 and 1989, but his music dated very quickly (albeit with a resurgence thanks to the Rick-Rolling phenomenon), and I've been making the case that the multivariate methods used in ecology have dated in much the same way, at least from a methodological perspective.

On the whole this seemed to go over pretty well – the video is no Charlie Bit My Finger but 46 likes on the topic of multivariate analysis just might be a record! But the message here is that just as the technology available for doing science has changed, the technology for science communication has changed too.

I give several talks each year to other scientists, but in small groups, and over a 10-year period I will probably have presented to maybe 3,000 scientists. But I've managed to reach that many people in a single YouTube video that only took two days to write, record and edit.

Whereas a decade ago I felt a bit like a "lone voice" on modern approaches to modelling ecological communities, the good news is that this is changing, with exciting new ideas on the topic now emerging from a number of groups around the world. Happily, we are still making some inroads at the UNSW end too.
