# The problem with p values: how significant are they, really?

##### November 12th, 2013
For researchers there's a lot that turns on the p value, the number used to determine whether a result is statistically significant. The current consensus is that if p is less than .05, a study has reached the holy grail of being statistically significant, and therefore likely to be published. Over .05 and it's usually back to the drawing board.

But today, Texas A&M University professor Valen Johnson, writing in the prestigious journal Proceedings of the National Academy of Sciences, argues that p less than .05 is far too weak a standard.

Using .05 is, he contends, a key reason why false claims are published and many published results fail to replicate. He advocates requiring .005 or even .001 as the criterion for statistical significance.

## What is a p value anyway?

The p value is at the heart of the most common approach to data analysis – null hypothesis significance testing (NHST). Think of NHST as a waltz with three steps:

1. State a null hypothesis: typically, that there is no effect.
2. Calculate the p value: the probability of obtaining results at least as extreme as ours, if the null hypothesis is true.
3. If p is sufficiently small, reject the null hypothesis and sound the trumpets: our effect is not zero, it's statistically significant!
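
As a concrete illustration, the three steps can be sketched with a two-sample t-test. This is a minimal, hypothetical example — the group names, sample sizes and effect size are invented for illustration, not taken from any study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Step 1: the null hypothesis -- the two groups have the same mean (no effect).
# Step 2: collect data and compute p, the probability of a result at least
#         this extreme *if* the null hypothesis were true.
control = rng.normal(loc=0.0, scale=1.0, size=30)
treated = rng.normal(loc=0.5, scale=1.0, size=30)  # simulated true effect of 0.5 SD
t_stat, p_value = stats.ttest_ind(treated, control)

# Step 3: compare p with the conventional .05 criterion.
significant = p_value < 0.05
print(f"p = {p_value:.4f}, statistically significant: {significant}")
```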

British statistician and geneticist Sir Ronald Fisher introduced the p value in 1925. He adopted .05 as a reference point for rejecting a null hypothesis. For him it was not a sharp cutoff: a thoughtful researcher should consider the context, and other results as well.

NHST has, however, become deeply entrenched in medicine and numerous other disciplines. The precise value .05 has become a bar to wriggle under to achieve publication in top journals. Generations of students have been inducted into the rituals of .05 meaning "significant", and .01 "highly significant".

## Sounds good. What's the problem?

The trouble is there are numerous deep flaws in NHST.

There's evidence that students, researchers and even many teachers of statistics don't understand NHST properly. More worryingly, there's evidence it's widely misused, even in top journals.

Most researchers don't appreciate that p is highly unreliable. Repeat your experiment and you can get a wildly different p value. Even more surprisingly, p remains highly unreliable even for very large samples.

NHST may be a waltz, but the dance of p is highly frenetic, and that volatility is why we simply shouldn't trust any single p value.
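
The volatility is easy to see by simulation. The sketch below (hypothetical numbers, assuming SciPy is available) reruns the identical experiment 20 times; even though the true effect never changes, the p values it produces can range widely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# The identical experiment, repeated 20 times: same true effect (0.5 SD),
# same sample sizes, fresh random samples each run.
p_values = np.array([
    stats.ttest_ind(rng.normal(0.5, 1.0, 32), rng.normal(0.0, 1.0, 32)).pvalue
    for _ in range(20)
])

print(f"smallest p: {p_values.min():.4f}, largest p: {p_values.max():.4f}")
print(f"runs 'significant' at .05: {(p_values < 0.05).sum()} of 20")
```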

Despite all those problems, NHST persists, perhaps because we yearn for certainty. Declaring a result "significant" suggests certainty, even though our results almost always contain considerable uncertainty.

## Should we require stronger evidence?

Johnson makes a cogent argument that .05 provides only weak evidence against the null hypothesis: perhaps only odds of 3 or 4 to 1 against reasonable alternative hypotheses.

He suggests we should require more persuasive odds, say 50 to 1 or even 200 to 1.

To do this, we need to adopt .005 or .001 as our p value criterion for statistical significance.
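
Johnson's own calibration uses uniformly most powerful Bayesian tests, which are beyond a short sketch. A simpler, widely cited device with the same flavour is the Sellke–Bayarri–Berger minimum Bayes factor, −e·p·ln(p), which bounds how strongly a given p value can ever count against the null:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger bound -e * p * ln(p), valid for p < 1/e.
    It is a lower bound on the Bayes factor in favour of the null, so
    1/bound caps the odds against the null that p can justify."""
    return -math.e * p * math.log(p)

for p in (0.05, 0.005, 0.001):
    bound = min_bayes_factor(p)
    print(f"p = {p}: odds against the null at most {1 / bound:.1f} to 1")
```

Under this bound, p = .05 can justify odds of at most about 2.5 to 1 against the null, while .005 and .001 allow roughly 14 to 1 and 53 to 1. The exact figures differ from Johnson's, but the message is the same: .05 is weak evidence.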

He recognises there's a price to pay for demanding stronger evidence. In typical cases, we'd need to roughly double our sample sizes to still have a reasonable chance of finding true effects. Using larger samples would indeed be highly desirable, but sometimes that's simply not possible. And are research grants about to double?
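
The "roughly double" can be checked with a standard power calculation. The sketch below uses the normal approximation for a two-sided, two-sample test; the effect size and power are illustrative assumptions, not figures from the article:

```python
import math
from scipy.stats import norm

def n_per_group(alpha, power, effect_size):
    """Approximate per-group sample size for a two-sided two-sample z-test
    (normal approximation to the t-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# Medium effect (0.5 SD), 80% power, under the old and proposed criteria.
for alpha in (0.05, 0.005, 0.001):
    n = n_per_group(alpha, power=0.80, effect_size=0.5)
    print(f"alpha = {alpha}: about {math.ceil(n)} participants per group")
```

For these assumptions the requirement grows from about 63 per group at .05 to roughly 107 at .005 and 137 at .001 — a factor of about 1.7 to 2.2, consistent with "roughly double".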

Johnson is correct that .05 corresponds to weak evidence, and .005 or .001 to evidence that's usefully stronger. Adopting his stricter criterion would, however, mean that the majority of all published research analysed using NHST would fail the new test, and suddenly be statistically non-significant!

More fundamentally, merely shifting the criterion does not overcome the unreliability of p, or most of the other deep flaws of NHST. The core problem is that NHST panders to our yearning for certainty by presenting the world as black or white—an effect is statistically significant or not; it exists or it doesn't.

In fact our world is many shades of grey—I won't pretend to know how many. We need something more nuanced than NHST, and fortunately there are good alternatives.

## A better way: estimation and meta-analysis

Bayesian techniques are highly promising and increasingly popular. The most readily available alternative, however, is estimation based on confidence intervals, which is already in wide use.

A confidence interval gives us the best estimate of the true effect, and also indicates the extent of uncertainty in our results. Confidence intervals are also what we need to use meta-analysis, which allows us to integrate results from a number of experiments that investigate the same issue.
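
On simulated data, the estimation approach looks like this: report the effect estimate together with a 95% confidence interval, rather than a bare significant/non-significant verdict. The numbers below are invented for illustration, and the interval uses the simple pooled-degrees-of-freedom t formula:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(0.0, 1.0, size=40)
treated = rng.normal(0.5, 1.0, size=40)

diff = treated.mean() - control.mean()           # best estimate of the effect
se = np.sqrt(treated.var(ddof=1) / 40 + control.var(ddof=1) / 40)
half_width = stats.t.ppf(0.975, df=78) * se      # df = n1 + n2 - 2
print(f"effect estimate {diff:.2f}, "
      f"95% CI [{diff - half_width:.2f}, {diff + half_width:.2f}]")
```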

We often need to make clear decisions—whether or not to licence a new drug, for example—but NHST provides a poor basis for them. It's far better to guide decisions by integrating all the available evidence, and estimation and meta-analysis provide exactly that.
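
That integration can be sketched with fixed-effect, inverse-variance weighting, the simplest form of meta-analysis. The three study results below are hypothetical, purely to show the mechanics:

```python
import numpy as np

# Hypothetical effect estimates and standard errors from three studies
# of the same question (invented numbers, for illustration only).
effects = np.array([0.42, 0.55, 0.31])
ses = np.array([0.20, 0.15, 0.25])

weights = 1.0 / ses**2                            # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"pooled effect {pooled:.2f}, 95% CI "
      f"[{pooled - 1.96 * pooled_se:.2f}, {pooled + 1.96 * pooled_se:.2f}]")
```

The pooled interval is narrower than any single study's, which is the point: combined evidence supports sharper decisions than any one p value can.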

Merely shifting the NHST goal posts simply won't do.

http://phys.org/news/2013-11-statistician-statistical-standards-amount-non-reproducible.html

Source: The Conversation

This story is published courtesy of The Conversation (under Creative Commons-Attribution/No derivatives).

