# The problem with p values: how significant are they, really?

##### November 12th, 2013
For researchers there's a lot that turns on the p value, the number used to determine whether a result is statistically significant. The current consensus is that if p is less than .05, a study has reached the holy grail of being statistically significant, and therefore likely to be published. Over .05 and it's usually back to the drawing board.

But today, Texas A&M University professor Valen Johnson, writing in the prestigious journal Proceedings of the National Academy of Sciences, argues that p less than .05 is far too weak a standard.

Using .05 is, he contends, a key reason why false claims are published and many published results fail to replicate. He advocates requiring .005 or even .001 as the criterion for statistical significance.

## What is a p value anyway?

The p value is at the heart of the most common approach to data analysis – null hypothesis significance testing (NHST). Think of NHST as a waltz with three steps:

1. State a null hypothesis: that is, there is no effect.
2. Calculate the p value, which is the probability of getting results at least as extreme as ours if the null hypothesis is true.
3. If p is sufficiently small, reject the null hypothesis and sound the trumpets: our effect is not zero, it's statistically significant!
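The three steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a two-sided z-test with a known population standard deviation, the function name `two_sided_p_value` is hypothetical, and the sample values are invented for the example rather than taken from any study.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sided_p_value(sample, null_mean, sigma):
    """p value for a two-sided z-test, assuming a known population SD `sigma`."""
    n = len(sample)
    # Step 2a: how many standard errors is the sample mean from the null value?
    z = (mean(sample) - null_mean) / (sigma / sqrt(n))
    # Step 2b: probability, if the null is true, of a |z| at least this large.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Step 1: the null hypothesis is "the true mean is 0" (no effect).
# Hypothetical measurements, invented purely for illustration.
sample = [0.8, 1.9, -0.3, 1.2, 0.7, 1.5, 0.1, 1.1]
p = two_sided_p_value(sample, null_mean=0.0, sigma=1.0)

# Step 3: compare p with the conventional .05 criterion.
print(f"p = {p:.4f}")
print("reject the null" if p < 0.05 else "do not reject the null")
```

Real analyses usually estimate the standard deviation from the data and use a t-test instead, but the logic of the three steps is the same.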

British statistician and geneticist Sir Ronald Fisher popularised the p value in 1925, adopting .05 as a reference point for rejecting a null hypothesis. For him it was not a sharp cutoff: a thoughtful researcher should consider the context, and other results as well.

NHST has, however, become deeply entrenched in medicine and numerous other disciplines. The precise value .05 has become a bar to wriggle under to achieve publication in top journals. Generations of students have been inducted into the rituals of .05 meaning "significant", and .01 "highly significant".

## Sounds good. What's the problem?

The trouble is there are numerous deep flaws in NHST.

There's evidence that students, researchers and even many teachers of statistics don't understand NHST properly. More worryingly, there's evidence it's widely misused, even in top journals.

Most researchers don't appreciate that p is highly unreliable. Repeat your experiment and you'll get a p value that could be extremely different. Even more surprisingly, p is highly unreliable even for very large samples.

NHST may be a waltz, but the dance of p is highly frenetic, and that is why we simply shouldn't trust any single p value.
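The dance is easy to reproduce in a small simulation. The sketch below is not the author's own demonstration. It assumes a two-group experiment with normally distributed data, a known standard deviation of 1, and a true effect of half a standard deviation. The function name `simulate_p_values` and all parameter values are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_p_values(n_experiments=25, n_per_group=32, true_effect=0.5, seed=1):
    """Replicate the same two-group experiment; return the p value of each run.

    Assumes normal data with known SD 1 in both groups, so a z-test applies.
    """
    rng = random.Random(seed)
    std_error = sqrt(2 / n_per_group)  # SE of the difference between group means
    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / std_error
        p_values.append(2 * (1 - NormalDist().cdf(abs(z))))
    return p_values


p_values = simulate_p_values()
for p in p_values:
    # An asterisk marks the runs that would count as "significant" at .05.
    print(f"p = {p:.3f}", "*" if p < 0.05 else "")
print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.4f}")
```

Every run samples from exactly the same populations, so any variation in the printed p values is due to sampling alone.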

Despite all those problems, NHST persists, perhaps because we yearn for certainty. Declaring a result "significant" suggests certainty, even though our results almost always contain considerable uncertainty.

## Should we require stronger evidence?

Johnson makes a cogent argument that .05 provides only weak evidence against the null hypothesis: perhaps only odds of 3 or 4 to 1 against reasonable alternative hypotheses.

He suggests we should require more persuasive odds, say 50 to 1 or even 200 to 1.

To do this, we need to adopt .005 or .001 as our p value criterion for statistical significance.
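To see how a p value criterion might translate into odds, here is a rough sketch. It assumes, for the special case of a one-sided z-test, that the evidence threshold equals exp(z²/2), where z is the critical value for the chosen criterion. That formula is an assumption brought in for illustration, not something stated in this article, and the function name `approx_odds_against_null` is hypothetical.

```python
from math import exp
from statistics import NormalDist


def approx_odds_against_null(alpha):
    """Evidence threshold exp(z**2 / 2) for a one-sided z-test of size `alpha`.

    Assumed correspondence for illustration only; other tests give other odds.
    """
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return exp(z * z / 2)


for alpha in (0.05, 0.005, 0.001):
    odds = approx_odds_against_null(alpha)
    print(f"criterion {alpha}: odds of roughly {odds:.0f} to 1")
```

Under this assumption the odds for .005 and .001 come out somewhat lower than the 50 to 1 and 200 to 1 quoted above, since the exact correspondence depends on the type of test. The general pattern still holds: .05 buys only single-digit odds, while the stricter criteria buy odds in the tens or hundreds.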

He recognises there's a price to pay for demanding stronger evidence. In typical cases, we'd need to roughly double our sample sizes to still have a reasonable chance of finding true effects. Using larger samples would indeed be highly desirable, but sometimes that's simply not possible. And are research grants about to double?
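The cost in sample size can be estimated with a standard textbook approximation. The sketch assumes a two-sided z-test and a target power of 80%, in which case the required sample size is proportional to the square of the sum of two normal quantiles. These settings, and the function name `relative_sample_size`, are assumptions for illustration and not figures from Johnson's paper.

```python
from statistics import NormalDist


def relative_sample_size(alpha_new, alpha_old=0.05, power=0.8):
    """Ratio of sample sizes needed to keep the same power at a stricter criterion.

    Uses the approximation n proportional to (z_{1-alpha/2} + z_{power})**2.
    """
    z = NormalDist().inv_cdf

    def required(alpha):
        return (z(1 - alpha / 2) + z(power)) ** 2

    return required(alpha_new) / required(alpha_old)


for alpha in (0.005, 0.001):
    ratio = relative_sample_size(alpha)
    print(f"criterion {alpha}: about {ratio:.2f} times the sample size needed at .05")
```

The ratio does not depend on the size of the effect, only on the two criteria and the target power, so it gives a quick sense of what "roughly double" means in practice.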

Johnson is correct that .05 corresponds to weak evidence, and .005 or .001 to evidence that's usefully stronger. Adopting his stricter criterion would, however, mean that the majority of all published research analysed using NHST would fail the new test, and suddenly be statistically non-significant!

More fundamentally, merely shifting the criterion does not overcome the unreliability of p, or most of the other deep flaws of NHST. The core problem is that NHST panders to our yearning for certainty by presenting the world as black or white—an effect is statistically significant or not; it exists or it doesn't.

In fact our world is many shades of grey—I won't pretend to know how many. We need something more nuanced than NHST, and fortunately there are good alternatives.

## A better way: estimation and meta-analysis

Bayesian techniques are highly promising and becoming widely used. The most readily available alternative, and one already in wide use, is estimation based on confidence intervals.

A confidence interval shows our best estimate of the true effect together with the extent of uncertainty in our results. Confidence intervals are also what we need for meta-analysis, which allows us to integrate results from a number of experiments that investigate the same issue.
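As a minimal sketch of how estimation and meta-analysis fit together, the code below computes a normal-approximation confidence interval and then pools several studies with inverse-variance weights (a simple fixed-effect model). The function names and the three study results are hypothetical, invented only to show the mechanics. Real meta-analyses often use random-effects models instead.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval around a point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def fixed_effect_meta_analysis(studies):
    """Pool (estimate, std_error) pairs using inverse-variance weights."""
    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical effect estimates and standard errors, for illustration only.
studies = [(0.42, 0.25), (0.10, 0.30), (0.35, 0.20)]
for est, se in studies:
    low, high = confidence_interval(est, se)
    print(f"study: {est:.2f} [{low:.2f}, {high:.2f}]")

pooled, pooled_se = fixed_effect_meta_analysis(studies)
low, high = confidence_interval(pooled, pooled_se)
print(f"pooled: {pooled:.2f} [{low:.2f}, {high:.2f}]")
```

The pooled interval is narrower than any single study's interval, which is the sense in which integrating evidence reduces uncertainty.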

We often need to make clear decisions (whether or not to license a new drug, for example), but NHST provides a poor basis for such decisions. It's far better to use the integration of all available evidence to guide decisions, and estimation and meta-analysis provide that.

Merely shifting the NHST goal posts simply won't do.

http://phys.org/news/2013-11-statistician-statistical-standards-amount-non-reproducible.html

Source: The Conversation

This story is published courtesy of The Conversation (under Creative Commons-Attribution/No derivatives).

