We Should Redefine Statistical Significance

July 23rd, 2017, Dan Benjamin, Jim Berger, Magnus Johannesson, Valen Johnson, Brian Nosek, EJ Wagenmakers

Researchers representing a wide range of disciplines and statistical perspectives (72 of us in total) have posted a new paper on PsyArXiv describing a place of common ground. We argue that statistical significance should be redefined. The paper is forthcoming in Nature Human Behaviour.

For claims of discoveries of novel effects, the paper advocates a change in the P-value threshold for a “statistically significant” result from 0.05 to 0.005. Results currently called “significant” that do not meet the new threshold would be called suggestive and treated as ambiguous as to whether there is an effect. The idea of changing the statistical significance threshold to 0.005 has been proposed before, but the fact that this paper is authored by statisticians and scientists from a range of disciplines—including psychology, economics, sociology, anthropology, medicine, epidemiology, ecology, and philosophy—indicates that the proposal now has broad support.

The paper highlights a fact that statisticians have known for a long time but which is not widely recognized in many scientific communities: evidence that is statistically significant at P = 0.05 actually constitutes fairly weak evidence. For example, for an experiment testing whether there is some effect of a treatment, the paper reports calculations of how different P-values translate into the odds that there is truly an effect vs. not. A P-value of 0.05 corresponds to odds that there is truly an effect that range, depending on assumptions, from 2.5:1 to 3.4:1. These odds are low, especially for surprising findings that are unlikely to be true positives in the first place. In contrast, a P-value of 0.005 corresponds to odds that there is truly an effect that range from 14:1 to 26:1, which is far more convincing. 
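To make the translation from P-values to odds concrete, here is a minimal sketch using one well-known calibration, the Sellke-Bayarri-Berger bound 1 / (−e · p · ln p), which caps the Bayes factor in favor of an effect for a given P-value. Choosing this particular bound is our assumption for illustration (the paper reports a range across several approaches), and reading the Bayes factor as "odds" assumes prior odds of 1:1. The function name is hypothetical. Its values come out close to the low end of each range quoted above.

```python
import math


def max_odds_for_effect(p: float) -> float:
    """Upper bound on the Bayes factor favoring an effect: 1 / (-e * p * ln p).

    The bound only applies for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


for p in (0.05, 0.005):
    # With 1:1 prior odds (an assumption), the Bayes factor equals the posterior odds.
    print(f"P = {p}: odds of at most about {max_odds_for_effect(p):.1f}:1")
```

With less favorable prior odds, as for a surprising finding, the posterior odds would be lower than these figures by the same factor.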

An important impetus for the proposal is the growing concern that there is a “reproducibility crisis” in many scientific fields that is due to a high rate of false positives among the originally reported discoveries. Many problems (such as multiple hypothesis testing and low power) have contributed to this high rate of false positives, and we emphasize that it is important to address all of these problems. We argue, however, that tightening the standards for statistical significance is a simple step that would help. Indeed, the theoretical relationship between the P-value and the strength of the evidence is empirically supported: the lower the P-value of the reported effect in the original study, the more likely the effect was to be replicated in both the Reproducibility Project: Psychology and the Experimental Economics Replication Project.

Lowering the significance threshold is a strategy that has previously been used successfully to improve reproducibility in several scientific communities. The genetics research community moved to a “genome-wide significance threshold” of 5×10⁻⁸ over a decade ago, and the adoption of this standard helped to transform the field from one with a notoriously high false positive rate to one with a strong track record of robust findings. In high-energy physics, the tradition has long been to define significance for new discoveries by a “5-sigma” rule (roughly a P-value threshold of 3×10⁻⁷). The fact that other research communities have maintained a norm of significance thresholds more stringent than 0.05 suggests that transitioning to a more stringent threshold can be done.
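The correspondence between the 5-sigma rule and a P-value threshold can be checked directly. The sketch below assumes a one-sided test with a standard normal test statistic; the function name is hypothetical.

```python
import math


def one_sided_p_from_sigma(z: float) -> float:
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


print(f"{one_sided_p_from_sigma(5.0):.2e}")  # prints 2.87e-07
```

Under these assumptions the result rounds to the figure quoted above for the physics convention.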

Changing the significance threshold from 0.05 to 0.005 carries a cost, however: Apart from the semantic change in how published findings are described, the proposal also entails that studies should be powered based on the new 0.005 threshold. Compared to using the old 0.05 threshold, maintaining the same level of statistical power requires increasing sample sizes by about 70%. Such an increase in sample sizes means that fewer studies can be conducted using current experimental designs and budgets. But the paper argues that under realistic assumptions, the benefit would be large: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises. Increasing sample sizes is also desirable because studies with small sample sizes tend to yield inflated effect size estimates, and publication and other biases may be more likely in an environment of small studies.
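Both figures in this paragraph can be approximated with a short calculation. The sketch below assumes a two-sided z-test, for which the required sample size is proportional to (z for the threshold + z for the power) squared, and 80% power. For the false positive rate it assumes prior odds of 1:10 that a tested effect is real. The test type, the power, the prior odds, and the function names are illustrative assumptions, not values taken from the text above.

```python
from statistics import NormalDist


def sample_size_ratio(alpha_old: float, alpha_new: float, power: float = 0.80) -> float:
    """Ratio of required sample sizes for a two-sided z-test at fixed power."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_power = z(power)
    n_old = (z(1 - alpha_old / 2) + z_power) ** 2
    n_new = (z(1 - alpha_new / 2) + z_power) ** 2
    return n_new / n_old


def false_positive_rate(alpha: float, power: float, prior_odds_effect: float) -> float:
    """Share of significant results that are false positives.

    prior_odds_effect is P(effect) / P(no effect) before seeing the data.
    """
    false_positives = alpha                   # per unit of true nulls
    true_positives = power * prior_odds_effect
    return false_positives / (false_positives + true_positives)


ratio = sample_size_ratio(0.05, 0.005)
print(f"sample size increase: {100 * (ratio - 1):.0f}%")

for alpha in (0.05, 0.005):
    fpr = false_positive_rate(alpha, power=0.80, prior_odds_effect=0.1)
    print(f"alpha = {alpha}: false positive rate = {fpr:.1%}")
```

Under these assumptions the sample size increase comes out at about 70%, and the false positive rate falls from roughly 38% to roughly 6%, a factor well above two. Other prior odds or power levels change the exact numbers.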

In research communities where attaining larger sample sizes is simply infeasible (e.g., anthropological studies of a small-scale society), there is a related “cost”: most findings may no longer be statistically significant under the new definition. Our view is that this is not really a cost at all: calling findings with P-values in between 0.05 and 0.005 “suggestive” is actually a more accurate description of the strength of the evidence.

Indeed, the paper emphasizes that the proposal is about standards of evidence, not standards for policy action nor standards for publication.  Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods.  Evidence that does not reach the new significance threshold should be treated as suggestive, and where possible further evidence should be accumulated. Failing to reject the null hypothesis (still!) does not mean accepting the null hypothesis.

The paper anticipates and responds to several potential objections to the proposal. A large class of objections is that the proposal does not address the root problems, which include multiple hypothesis testing and insufficient attention to effect sizes, and that it might in fact reinforce some of the problems, such as the over-reliance on null hypothesis significance testing and bright-line thresholds. We essentially agree with these concerns. The paper stresses that reducing the P-value threshold complements, but does not substitute for, solutions to other problems, such as good study design, ex ante power calculations, pre-registration of planned analyses, replications, and transparent reporting of procedures and all statistical analyses conducted.

Many of the authors agree that there are better approaches to statistical analyses than null hypothesis significance testing and will continue to advocate for alternatives. The proposal is aimed at research communities that continue to rely on null hypothesis significance testing at a 0.05 threshold; for those communities, reducing the P-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility. Far from reinforcing the over-reliance on statistical significance, we hope that the change in the threshold, together with the growing practice of describing results with P-values between 0.05 and 0.005 as “suggestive”, will raise awareness of the limitations of relying so heavily on a P-value threshold and will thereby facilitate a longer-term transition to better approaches.

The proposed switch to a more demanding P-value threshold involves both a coordination problem (what threshold to use?) and a free-riding problem (why should I impose a more stringent threshold on myself unless others do?). The aim of the proposal is to help coordinate on 0.005 and to discourage free-riding on the old threshold. Ultimately, we believe that the new significance threshold will help researchers and readers to understand and communicate evidence more accurately.
