*Proceedings of the National Academy of Sciences*, that the statistical standard used to judge the soundness of research efforts be made more stringent. Doing so, he writes, would reduce the large numbers of non-reproducible findings by researchers and as a result prevent the undermining of confidence in such efforts.

Over the past few years, the number of published research papers whose claimed findings can't be reproduced by others in the field has increased, leading to calls for changes in how such efforts are judged.

The traditional approach is based on a P value, a number obtained by comparing an alternative hypothesis against a null hypothesis (what would be expected if the system were left alone). This number is supposed to give the researcher an idea of whether his or her efforts have resulted in a change to whatever is being investigated. By convention, a P value of 0.05 or lower is considered statistically significant enough to claim that something has indeed been changed, which means the researchers can claim success in their endeavor. But, Johnson argues, there is a serious flaw in this approach. The P value, he says, actually represents the likelihood of a value at least as extreme occurring in an experiment if the null hypothesis is true, and thus doesn't truly reflect the degree of variation from the norm that researchers believe it does.
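As a concrete illustration of what a P value measures, here is a minimal Python sketch. It assumes a hypothetical experiment that does not come from Johnson's paper: a coin flipped 100 times that lands heads 60 times, with a fair coin as the null hypothesis. The function name `two_sided_binomial_p` is made up for this example.

```python
from math import comb

def two_sided_binomial_p(heads: int, flips: int) -> float:
    """Exact two-sided P value for `heads` out of `flips` under a fair-coin null."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected number of heads
    # as the observed one; under the null each sequence has probability 2**-flips.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= observed_gap)
    return extreme / 2 ** flips

p = two_sided_binomial_p(60, 100)  # hypothetical data: 60 heads in 100 flips
print(f"P value: {p:.4f}")
print("below 0.05:", p < 0.05, "| below 0.005:", p < 0.005)
```

With these assumed numbers the P value lands a little above 0.05, so the same data would count as "not significant" under either the conventional or the proposed stricter threshold.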

In statistics, there is another way to calculate the difference between the norm and results obtained by causing a change to a system. It's called Bayesian hypothesis testing, and it, Johnson explains, offers a way to calculate a genuine comparison. To strengthen his point, he has devised a way to convert Bayes factors to P values. Doing so, he argues, shows just how weak P values can be.

The problem, he writes, is not that researchers use P values, but that they rely on thresholds for them that are not stringent enough. He suggests the research community change its standard of acceptance from 0.05 to 0.005 or even 0.001. That, he believes, would greatly reduce the number of research papers with unreproducible results being published, saving reputations and reducing money spent on wasted follow-up research efforts.

**More information:**
Revised standards for statistical evidence, *PNAS*, Published online before print November 11, 2013, DOI: 10.1073/pnas.1313476110

**Abstract**

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
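The correspondence between significance levels and evidence thresholds can be sketched numerically. The snippet below assumes the simplest case, a one-sided z test, where (as I read Johnson's paper) a test of size α matches a uniformly most powerful Bayesian test with evidence threshold γ = exp(z_α²/2), with z_α the upper α quantile of the standard normal. The function name is hypothetical, and other test types will give somewhat different numbers.

```python
from math import exp
from statistics import NormalDist

def evidence_threshold(alpha: float) -> float:
    """Bayes factor threshold matched to a one-sided z test of size alpha.

    Assumption: gamma = exp(z_alpha**2 / 2), the one-sided z-test case only.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper-alpha normal quantile
    return exp(z_alpha ** 2 / 2.0)

for alpha in (0.05, 0.005, 0.001):
    gamma = evidence_threshold(alpha)
    print(f"alpha = {alpha:<5} -> evidence threshold of about {gamma:.1f}:1")
```

Under this assumption, the 0.005 and 0.001 levels fall inside the 25 to 50:1 and 100 to 200:1 ranges quoted in the abstract, while 0.05 corresponds to odds of only about 4:1.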

* See also this article in The Conversation.

## Szkeptik

To explain the P value more simply than the article does:

A P value of 0.05 means that there is a 5% chance that you make a mistake if you claim that your result is significant. 0.005 would mean this chance is 0.5%.

So ideally, if nobody is lying about their results, the unrepeatable results would never exceed 5% of all significant results, as that would be a mathematical impossibility.

## antialias_physorg

That's a rather pointless demand. Some sciences would never be able to fulfill that norm (e.g. there would be no way you could get enough test subjects for a pharmaceutical study to have enough power to get a p value of 0.005 or 0.001). In other sciences, where the scope of an event is well defined, such standards are too low (e.g. particle physics had a p value of 0.0000001 before announcing that they had found the Higgs boson).

Science isn't about truth: it's about what works. Banking endeavours (or at the very least further investigation) on a 95% chance of success is adequately merited.

Yes, there will be some studies that get fluke positives (or fluke negatives) - but those are quickly found out when put to a repeat test.
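To put a rough number on the sample-size side of this, here is a back-of-the-envelope sketch. It assumes a two-sided, two-sample z test with equal group sizes and unit variance, uses the textbook approximation n = 2((z_{1-α/2} + z_power) / d)² per group, and picks an illustrative effect size d = 0.5 with 80% power. None of these numbers come from a real trial, and the function name is made up.

```python
from math import ceil
from statistics import NormalDist

def subjects_per_group(alpha: float, power: float = 0.8,
                       effect_size: float = 0.5) -> int:
    """Approximate per-group sample size for a two-sided two-sample z test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)              # quantile for target power
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)

baseline = subjects_per_group(0.05)
for alpha in (0.05, 0.005, 0.001):
    n = subjects_per_group(alpha)
    print(f"alpha = {alpha:<5} n per group = {n:>4} "
          f"({n / baseline:.2f}x the 0.05 requirement)")
```

Apart from rounding, the ratio to the 0.05 requirement depends only on alpha and the target power, not on the assumed effect size; what the effect size changes is the absolute number of subjects, which is where small or rare patient groups run into trouble.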

## thermodynamics

"So ideally -if nobody is lying about their results- the unrepeatable results would never exceed 5% of all significant results, as that would be a mathematical impossibility."

I agree with your view; however, I don't look to lying as the reason that we see the lack of repeatability. First, I find that a strong background in statistics is rare for science majors (this comes from someone who has a strong math and physics background but a statistician girlfriend and has his chops handed to him on a regular basis). What I used to consider a strong approach to statistics she showed me to be weak. I have improved, but only with a lot of thumping on the back of the head. Continued

## Urgelt

What nobody has yet mentioned is that Bayesian statistics are inherently wonky and often used to disguise unstated and even unwarranted assumptions about the data being evaluated. It's quite easy to have a statistically 'true' and apparently rigorous P value that gives a bad study result.

For reproducibility, care and transparency in selecting assumptions, and avoiding bias in data collection, are far more important than tightening up P - as is minimizing the need for assumptions in the first place.

## antialias_physorg

The trials aren't done sloppily.

Replication for pharmaceutical trials is easier said than done.

Remember that clinical trials go through various phases (for precisely that reason: to catch non-working substances early before you move on to the next, more costly phase).

Each phase has more test subjects than the last. The whole procedure lasts years and costs on the order of 100 million dollars for ONE substance that passes through all phases (the majority of the cost is incurred in the last phase where you have thousands of trial subjects). This cost is borne by companies, BTW - not taxpayers.

In recent years streamlined trial procedures have been approved for pharmaceuticals that are deemed critical (vaccines for newly arising epidemics, cancer cures, etc.).

## thermodynamics

What I was trying to make clear (and didn't do a great job) is that clinical trials do not start until some basic research has made it clear that there was a reasonable possibility of an approach paying off. It is that basic research that should be verified before spending the huge sums required for clinical trials. As I understand the process, a company finds a promising direction and then applies for stage 1 trials. Continued

## thermodynamics

This is open access on PLOS so you can read the entire article.

http://www.plosme....0040028

Here is one that led to three cancer clinical trials before they found the basic research was not reproducible.

http://magazine.a...cyjan11/

Here is an article about research that does not share the data and moves to make sure it is shared.

http://www.bmj.co...mj.e4383

That article also links to a number of other sites in the bibliography.

## antialias_physorg

The basic research is done before the phases start: in vitro and on animal models.

Replicating that research has only marginal value because what works in a mouse or a pig (or in vitro) doesn't necessarily work in live humans.

And you have to go through a LOT of red tape to get some animal studies approved (months and months of wrangling with the ethics commission; they usually downgrade the numbers you apply for, and there is no way to get more afterwards).

It's easiest with rats and mice since they aren't classified as animals but as 'vermin' (for exactly those legal/ethical reasons). But mouse models aren't good for a lot of types of pharmaceuticals. There you need hamsters, sheep, pigs, rabbits and whatnot. And you just don't get 1000 pigs to try something out on. You may get 10 (or fewer), if you're lucky enough to get any at all.

## antialias_physorg

Wikipedia has a good graphic on the trial phases and timelines for various approval processes

http://en.wikiped...l#Phases

Also note this part:

(That is before the phase 0 or phase I trial even starts)

It would be good to do the study again, but that would require another 3-5 years. You could get equivalent statistical power to a retest by upping the number of animals used in the original study. In either case you run into trouble with the ethics commission for frivolously using animal models (and rightfully so!).

## triplehelix

Actually, Pharma and medical sciences can achieve 0.001 easily in many circumstances.

## antialias_physorg

You'd need many thousands of (human) test subjects and only a very few where it doesn't work to call the results 'not statistically significant' (i.e.: don't go to market with the drug). For some illnesses it would even be OK if only, say, 20 percent were cured (e.g. HIV) to call it a success and have it marketed.

Many drugs are very specific to a certain type of illness. Getting the number of test subjects who are willing/able to test something for 3-7 years is difficult. You can imagine that a sizeable percentage will drop out of the study over such timeframes - either through non-compliance, moving home, death, secondary/other illness (which renders their results not admissible as the effect of that other illness on the primary cannot be easily judged), and a host of other factors.

The last one is especially tricky. It's not always a clear-cut case whether a subject should be excluded from a study or not.

## antialias_physorg

(E.g. in a study I was involved in we could show how arthrosis of the knee progressed with/without a given drug via analysis of CT images. However, the pain levels reported were not totally correlated with that progression. Some patients even reported less pain with objective worsening of the cartilage/bone situation.)