Editor’s note: This article was first published on 8 March 2016. It was republished on 6 January 2017 to become part 7 of the special series Top 100 most-discussed journal articles of 2016.
Amid rising concerns about the reproducibility and replicability of scientific conclusions, the American Statistical Association (ASA) has released a formal statement¹ clarifying several widely agreed-upon principles underlying the proper use and interpretation of the p-value:
Underpinning many published scientific conclusions is the concept of “statistical significance,” typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.
The six principles are:
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Wasserstein says the biggest mistakes being made include: misusing statistical significance as an arbiter of scientific validity; concluding that a null hypothesis is true because a computed p-value is large; and the logical fallacy of concluding something is true when you had to assume it was true in order to reach that conclusion.
The latter error relates to principle 2, which addresses a widespread misconception about p-values. The Retraction Watch interviewer asks Wasserstein:
Some of the principles seem straightforward, but I was curious about #2 – I often hear people describe the purpose of a p value as a way to estimate the probability the data were produced by random chance alone. Why is that a false belief?
Let’s think about what that statement would mean for a simplistic example. Suppose a new treatment for a serious disease is alleged to work better than the current treatment. We test the claim by matching 5 pairs of similarly ill patients and randomly assigning one to the current and one to the new treatment in each pair. The null hypothesis is that the new treatment and the old each have a 50-50 chance of producing the better outcome for any pair. If that’s true, the probability the new treatment will win for all 5 pairs is (½)⁵ = 1/32, or about 0.03. If the data show that the new treatment does produce a better outcome for all 5 pairs, the p-value is 0.03. It represents the probability of that result, under the assumption that the new and old treatments are equally likely to win. It is not the probability the new treatment and the old treatment are equally likely to win.
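A minimal sketch of this arithmetic in Python (the variable names are illustrative, not from the ASA statement or the interview):

```python
# Under the null hypothesis, the new and old treatments are equally
# likely to produce the better outcome in each of the 5 matched pairs,
# so each pair behaves like an independent fair coin flip.
n_pairs = 5
p_win_null = 0.5  # chance the new treatment "wins" a pair, if the null is true

# p-value: the probability of seeing the new treatment win all 5 pairs,
# computed under the assumption that the null hypothesis holds.
p_value = p_win_null ** n_pairs
print(p_value)  # 0.03125, i.e. 1/32
```

Note what the code conditions on: every quantity is computed assuming the null is true. Nothing in it yields the probability that the null itself is true, which is exactly the distinction principle 2 draws.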
- Wasserstein, R.L. & Lazar, N.A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, DOI: 10.1080/00031305.2016.1154108 ↩
Also published on Medium.