Statistician suggests raising statistical standards to reduce the number of non-reproducible studies

Nov 12, 2013, by Bob Yirka
A graphical depiction of the meaning of p-values. Credit: Repapetilto/Wikipedia.

(Phys.org) —Valen Johnson, a statistician at Texas A&M University, suggests in a paper published in Proceedings of the National Academy of Sciences that the statistical standard used to judge the soundness of research findings be made more stringent. Doing so, he writes, would reduce the large number of non-reproducible findings and, as a result, help prevent the erosion of confidence in such research.

Over the past few years, the number of papers being published that claim certain findings, but whose results cannot be reproduced by others in the field, has grown, leading to calls for changes in how such work is evaluated.

The traditional approach is based on a P value, a number obtained by testing the data against a null hypothesis (the assumption that nothing has changed). The P value is supposed to give researchers an idea of whether their intervention has had an effect on whatever it is they are investigating. By convention, a P value below 0.05 is considered statistically significant, entitling researchers to claim that something has indeed changed and that their endeavor succeeded. But, Johnson argues, there is a serious flaw in this approach: the P value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. It therefore does not measure the strength of evidence against the null hypothesis in the way many researchers believe it does.
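
To make the convention concrete, here is a minimal sketch (Python with SciPy) of the kind of test the 0.05 cutoff is applied to; the made-up data, group sizes and effect size are assumptions chosen purely for illustration.

```python
# Minimal illustration of the conventional significance test (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=30)   # group left alone
treated = rng.normal(loc=0.5, scale=1.0, size=30)   # group with a modest true effect

# Two-sample t-test: p is the probability, assuming no real difference,
# of seeing a test statistic at least this extreme.
t_stat, p_value = stats.ttest_ind(treated, control)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at 0.05: ", p_value < 0.05)
print("significant at 0.005:", p_value < 0.005)   # Johnson's stricter proposal
```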

In statistics there is another way to weigh results obtained by changing a system against the null hypothesis: Bayesian hypothesis testing, which, Johnson explains, offers a genuine comparison of the two hypotheses in the form of a Bayes factor. To strengthen his point, he has devised a way to convert Bayes factors into P values. Doing so, he argues, shows just how weak the evidence behind conventional P values can be.
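
The exact conversion depends on the testing setup; as a rough sketch of the idea for a one-sided z-test (an assumed special case, not the paper's full generality), a uniformly most powerful Bayesian test with evidence threshold γ places its point alternative at δ = √(2 ln γ), so its rejection region is z > √(2 ln γ) and the equivalent P value is 1 − Φ(√(2 ln γ)):

```python
# Sketch of the Bayes-factor-to-P-value correspondence for a one-sided z-test.
# Assumption: UMPBT setup with evidence threshold gamma and point alternative at
# delta = sqrt(2 * ln(gamma)); the matching rejection region is z > sqrt(2 * ln(gamma)).
import numpy as np
from scipy.stats import norm

def equivalent_p_value(gamma):
    """P value whose rejection region matches Bayes evidence threshold gamma."""
    z_cut = np.sqrt(2.0 * np.log(gamma))
    return float(norm.sf(z_cut))          # 1 - Phi(z_cut)

for gamma in (4, 25, 50, 100, 200):
    print(f"evidence {gamma:>3}:1  ->  p ~ {equivalent_p_value(gamma):.4f}")

# Evidence of roughly 25-50:1 lands near p = 0.005 and 100-200:1 near p = 0.001,
# the mapping quoted in the paper's abstract; the familiar p = 0.05 corresponds
# to evidence of only about 4:1.
```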

The problem, he writes, is not that researchers use P values, but that the thresholds they rely on are not stringent enough. He suggests the research community change its standard of acceptance from 0.05 to 0.005, or even to 0.001. That, he believes, would greatly reduce the number of research papers with unreproducible results being published, saving reputations and reducing the money spent on wasted follow-up research efforts.
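
To see why a stricter cutoff would thin out irreproducible "discoveries", here is a toy simulation (all numbers are assumptions for illustration: 10% of tested hypotheses are truly non-null, and each study is a one-sided z-test with modest power) that counts what fraction of significant results are false positives at each threshold:

```python
# Toy simulation: what fraction of "significant" findings are false positives?
# Assumptions (illustrative only): 10% of tested effects are real; each study
# yields a single one-sided z statistic, shifted by 2.5 when the effect is real.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_studies = 100_000
is_real = rng.random(n_studies) < 0.10
z = rng.normal(size=n_studies) + np.where(is_real, 2.5, 0.0)
p = norm.sf(z)                                   # one-sided P values

for alpha in (0.05, 0.005, 0.001):
    significant = p < alpha
    false_pos = significant & ~is_real
    share = false_pos.sum() / significant.sum()
    print(f"alpha = {alpha}: {significant.sum():6d} significant, "
          f"{share:.1%} of them false positives")
```

Under these assumed numbers, the false-positive share of "significant" findings drops from roughly a third at 0.05 to a few percent at 0.001, though fewer real effects clear the stricter bar as well.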

More information: Revised standards for statistical evidence, PNAS, published online before print November 11, 2013. DOI: 10.1073/pnas.1313476110

Abstract
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

* Read also this article at The Conversation.

User comments (15)


BobSage
2.6 / 5 (10) Nov 12, 2013
I think this is a great step in eliminating the glut of research findings, but most especially those that seem to promote a lifestyle change. For example, "Eating chocolate makes you skinnier." That's nice, but until one knows how much chocolate makes one how much skinnier one doesn't really know whether to eat more chocolate or not. The fact that the finding itself might not even be statistically significant, of course, makes matters worse.
Szkeptik
2.1 / 5 (8) Nov 12, 2013
As someone actually working in the field I can say that if we took .005 as the threshold of significance, very few experiments could be declared to have shown a significant result at all, and their costs would go through the roof as we would have to repeat them an insane number of times to get that kind of P value. The more datapoints you have, the lower a P value you can achieve, but of course having more datapoints requires more patients or lab animals or cell cultures or whatever you're studying, and all the extra costs of running a larger experiment.

To explain P value much simpler than the article does:
A P value of 0.05 means that there is a 5% chance that you make a mistake if you claim that your result is significant. 0.005 would mean this chance is 0.5%.
So ideally -if nobody is lying about their results- the unrepeatable results would never exceed 5% of all significant results, as that would be a mathematical impossibility.
antialias_physorg
3.6 / 5 (5) Nov 12, 2013
He suggests the research community change its standard of acceptance from .05 to .005 or even to .001.

That's a rather pointless demand. Some sciences would never be able to fulfill that norm (e.g. there would be no way you could get enough test subjects for a pharmaceutical study to have enough power to get a p value of 0.005 or 0.001.) In other sciences where the scope of an event is well defined such standards are too low (e.g. particle physics had a p value of 0.0000001 before announcing that they had found the Higgs boson)

Science isn't about truth: it's about what works. Banking endeavours (or at the very least further investigation) on a 95% chance of success is adequately merited.
Yes, there will be some studies that get fluke positives (or fluke negatives) - but those are quickly found out when put to a repeat test.
thermodynamics
3.7 / 5 (3) Nov 12, 2013
Szkeptik said: "your result is significant. 0.005 would mean this chance is 0.5%.
So ideally -if nobody is lying about their results- the unrepeatable results would never exceed 5% of all significant results, as that would be a mathematical impossibility."

I agree with your view, however, I don't look to lying as the reason that we see the lack of repeatability. First, I find that a strong background in statistics is rare for science majors (this comes from someone who has a strong math and physics background but a statistician girlfriend and has his chops handed to him on a regular basis). What I used to consider a strong approach to statistics she showed me to be weak. I have improved, but only with a lot of thumping on the back of the head. Continued
thermodynamics
3 / 5 (2) Nov 12, 2013
Continued: I still consider a P value of 0.05 worth publishing, and it also needs to be replicated. I think that a lot of the research that has not been able to be replicated in the medical area was just sloppy research and should be replicated before any investment. I think that the trials for medical research are the place where real uncertainty must be assessed. At that point if there is a P=0.05 for a treatment that can change lives, it would be unethical to pass it up. That does not mean it is a "law" but instead, it is something that probably has more to be learned about it. However, if it stops a specific cancer 95% of the time it would be a tragedy if you had to wait until it stopped the cancer 99% of the time. If we are looking for laws of nature then we need high standards for acceptance. If we are looking for useful tools we need to be sure the tool does what it is supposed to most of the time (and is replicable).
Urgelt
5 / 5 (2) Nov 12, 2013
Well, I agree with the points other commenters have offered. Auntie's right to point out that the appropriate P target depends on what's being studied and how much data is available. Others have mentioned affordability; following the recommendations of Johnson, many, many studies simply would never be conducted at all. Thermo is right to point out that sloppy studies, or even biased studies, are probably a more serious cause of irreproducibility than the current P standard.

What nobody has yet mentioned is that Bayesian statistics are inherently wonky and often used to disguise unstated and even unwarranted assumptions about the data being evaluated. It's quite easy to have a statistically 'true' and apparently rigorous P value that gives a bad study result.

For reproducibility, care and transparency in selecting assumptions, and avoiding bias in data collection, are far more important than tightening up P - as is minimizing the need for assumptions in the first place.
antialias_physorg
not rated yet Nov 13, 2013
I think that a lot of the research that has not been able to be replicated in the medical area was just sloppy research and should be replicated before any investment.

The trials aren't done sloppily.

Replication for pharmaceutical trials is easier said than done.
Remember that clinical trials go through various phases (for precisely that reason: to catch non-working substances early before you move on to the next, more costly phase).

Each phase has more test subjects than the last. The whole procedure lasts years and costs on the order of 100 million dollars for ONE substance that passes through all phases (the majority of the cost is incurred in the last phase where you have thousands of trial subjects). This cost is borne by companies, BTW - not tax payers.

In recent years streamlined trial procedures have been approved for pharmaceuticals that are deemed critical (vaccines for newly arising epidemics, cancer cures, etc.)
thermodynamics
1 / 5 (1) Nov 13, 2013
Anti: Let me clarify what I said so I might be a little more clear. I said: "I think that a lot of the research that has not been able to be replicated in the medical area was just sloppy research and should be replicated before any investment. I think that the trials for medical research are the place where real uncertainty must be assessed. At that point if there is a P=0.05 for a treatment that can change lives, it would be unethical to pass it up."

What I was trying to make clear (and didn't do a great job) is that clinical trials do not start until some basic research has made it clear that there was a reasonable possibility of an approach paying off. It is that basic research that should be verified before spending the huge sums required for clinical trials. As I understand the process, a company finds a promising direction and then applies for stage 1 trials. Continued
thermodynamics
1 / 5 (1) Nov 13, 2013
Continued: Prior to stage 1, the company should replicate the basic research. Even complex basic research is much less expensive than clinical trials. Stage 1, as I understand it, just looks at toxicity of the doses. Stage 2 then looks at how it affects a specific condition when compared to the standard therapy. Then stage 3 is application over a much larger population. There have been studies that showed that the basic research has been flawed in many cases and companies have jumped into stage 1 and followed up with stage 2 clinical tests prior to verifying the basic research the trials were based on. It would be much less resource intensive if they put much more effort in establishing the basic research prior to launching clinical trials.
thermodynamics
1 / 5 (1) Nov 13, 2013
I may as well also do some of the Google searching to point out some of the studies I was alluding to:

This is open access on PLOS so you can read the entire article.

http://www.plosme....0040028

Here is one that led to three cancer clinical trials before they found the basic research was not reproducible.

http://magazine.a...cyjan11/

Here is an article about research that does not share the data and moves to make sure it is shared.

http://www.bmj.co...mj.e4383

That article also links to a number of other sites in the bibliography.
antialias_physorg
not rated yet Nov 13, 2013
It is that basic research that should be verified before spending the huge sums required for clinical trials.

The basic research is done before the phases start: in vitro and on animal models.

Replicating that research has only marginal value because what works in a mouse or a pig (or in vitro) doesn't necessarily work in live humans.

And you have to go through a LOT of red tape to get some animal studies approved (months and months of wrangling with the ethics commission. They usually downgrade the numbers you apply for and there is no way to get more afterwards).
It's easiest with rats and mice since they aren't classified as animals but as 'vermin' (for exactly those legal/ethical reasons). But mouse models aren't good for a lot of types of pharmaceuticals. There you need hamsters, sheep, pigs, rabbits and whatnot. And you just don't get 1000 pigs to try something out on. You may get 10 (or fewer)... if you're lucky to get any at all.
antialias_physorg
not rated yet Nov 13, 2013
The time it takes to replicate these studies is also not trivial.

Wikipedia has a good graphic on the trial phases and timelines for various approval processes
http://en.wikiped...l#Phases
Also note this part:
For example, a new cancer drug has, on average, six years of research behind it before it even makes it to clinical trials.
(That is before the phase 0 or phase I trial even starts)

It would be good to do the study again, but that would require another 3-5 years. You could get equivalent statistical power to a retest by upping the number of animals used in the original study. In either case you run into trouble with the ethics commission for frivolously using animal models (and rightfully so!)
triplehelix
1 / 5 (6) Nov 13, 2013
He suggests the research community change its standard of acceptance from .05 to .005 or even to .001.

That's a rather pointless demand. Some sciences would never be able to fulfill that norm (e.g. there would be no way you could get enough test subjects for a pharmaceutical study to have enough power to get a p value of 0.005 or 0.001.) In other sciences where the scope of an event is well defined such standards are too low (e.g. particle physics had a p value of 0.0000001 before announcing that they had found the Higgs boson)

Science isn't about truth: it's about what works. Banking endeavours (or at the very least further investigation) on a 95% chance of success is adequately merited.
Yes, there will be some studies that get fluke positives (or fluke negatives) - but those are quickly found out when put to a repeat test.


Actually, Pharma and medical sciences can achieve 0.001 easily in many circumstances.
antialias_physorg
not rated yet Nov 13, 2013
Not really.

You'd need many thousands of (human) test subjects and only a very few where it doesn't work to call the results 'not statistically significant' (i.e.: don't go to market with the drug). For some illnesses it would even be OK if only, say, 20 percent were cured (e.g. HIV) to call it a success and have it marketed.

Many drugs are very specific to a certain type of illness. Getting the number of test subjects who are willing/able to test something for 3-7 years is difficult. You can imagine that a sizeable percentage will drop out of the study over such timeframes - either through non-compliance, moving home, death, secondary/other illness (which renders their results not admissible as the effect of that other illness on the primary cannot be easily judged), and a host of other factors.

Especially the last one is tricky. It's not always a clear cut case whether a subject should be excluded from a study or not.
antialias_physorg
not rated yet Nov 13, 2013
Calling these types of studies sloppy really misses something. They are incredibly difficult as you do not have a controlled/controllable setting as in most any other science. Especially when you do work on drugs where we're talking about 'improvement for the patient' we're entering very subjective territory (e.g. alleviation of symptoms) - as two trial patients with the same amount of objective betterment may report subjectively different improvement.
(E.g. in a study I was involved in we could show how arthrosis of the knee progressed with/without a given drug via analysis of CT images - however the pain levels reported were not totally correlated with that progression. Some patients even reported less pain with objective worsening of the cartilage/bone situation)
