Introduction
While the p value has been used across almost all fields, including biomedical research, debates over its uses, misuses, and abuses have never ceased (Amrhein, et al. [1-3]). The published views have come mostly from mathematical statisticians, the group that produced the tool. Users of the tool, however, have developed, often unknowingly, functions that the producers did not design. Acknowledging the multiple functions of the p value, designed and de facto, legitimate and illegitimate, may be a necessary step toward a more comprehensive understanding of this quotidian tool (Zhao [4]).
Function 1
Probabilitizing Observations: P was designed to probabilitize observation, i.e., to indicate the probability of obtaining an effect equal to or more extreme than the one observed in a random sample, assuming the observed effect does not exist in the population (Pearson [5]).
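This designed function can be illustrated with a small simulation. The sketch below uses hypothetical two-group data, not figures from any cited study, and a permutation test: assuming the effect does not exist in the population, group labels are exchangeable, so p is estimated as the share of label shuffles that produce a difference at least as extreme as the one observed.

```python
import random

random.seed(0)

# Hypothetical outcome scores in two small groups (assumed for illustration).
treatment = [5.1, 6.3, 5.8, 7.0, 6.5]
control = [4.8, 5.2, 4.9, 5.5, 5.0]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# Under the null hypothesis the labels are exchangeable: shuffle them many
# times and count how often the shuffled difference is at least as extreme
# as the observed one (two-sided).
pooled = treatment + control
n_iter = 10_000
n_extreme = 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if abs(diff) >= abs(observed):
        n_extreme += 1

p_value = n_extreme / n_iter
print(round(p_value, 3))
```

The printed number approximates the exact permutation p value for these made-up data; it probabilitizes the observation without saying anything directly about the population effect itself.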
Function 2
Preferring Hypothesis: Function 1 serves Function 2. If p is smaller than a predetermined threshold α, i.e., p < α, the alternative hypothesis, that the effect exists in the population, is preferred over the null hypothesis, that the effect does not exist. Otherwise, if p ≥ α, the null hypothesis is preferred.
Functions 3 & 4
Projecting Population or Proxying Population: In practice, many have used p to indicate the probability that the effect in the population is in the same direction as observed in the sample (pp) (Hunter [6,7]). The probability of A (effect in the population) given B (effect in a sample) is not equal to the probability of B given A. Equating the two is therefore considered an illegitimate traverse, and it has been cited as a main justification for one journal's ban on p values (Nuzzo [8-12]). Users, however, need an indicator to proxy for, i.e., to project approximately, the population. Given the strong and positive correlation between p and pp, p appears to be the best available for the task (BAT) (Colquhoun [4,13-15]).
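The correlation between p and pp can be illustrated with a hedged simulation. The setup below is an assumption for illustration only, not a computation from any cited source: true effects are drawn from a standard normal prior, each study observes the effect with standard error 1, and a two-sided z-test p value is computed. Grouping studies by p then shows how often the population effect shares the direction of the sample effect.

```python
import math
import random

random.seed(1)

# Assumed setup: true effect delta ~ Normal(0, 1); each study observes
# estimate = delta + Normal(0, 1) noise, i.e., standard error 1.
n_studies = 50_000
bins = {"p < 0.05": [0, 0], "0.05 <= p < 0.5": [0, 0], "p >= 0.5": [0, 0]}

for _ in range(n_studies):
    delta = random.gauss(0.0, 1.0)        # true population effect
    estimate = random.gauss(delta, 1.0)   # observed sample effect
    z = estimate / 1.0                    # z statistic (SE = 1)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value
    same_sign = (delta > 0) == (estimate > 0)
    if p < 0.05:
        key = "p < 0.05"
    elif p < 0.5:
        key = "0.05 <= p < 0.5"
    else:
        key = "p >= 0.5"
    bins[key][0] += same_sign
    bins[key][1] += 1

agreement = {k: a / t for k, (a, t) in bins.items()}
for key, rate in agreement.items():
    print(key, round(rate, 3))
```

Under this assumed prior, the smaller the p value, the more often the population effect points the same way as the sample effect, which is the sense in which p can proxy for pp even though it does not equal it.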
Function 5
Prescreening Effect: As a pretest index, p represents the propensity, but not a precisely defined probability, that the direction of the observed effect contradicts the direction of the effect in a vaguely defined target population. Therefore, if p < α, we acknowledge that the observed direction of the effect is unlikely to be a fluke, and we are sufficiently confident that we may legitimately interpret the observed size of the effect. The pretest function of p value relies on
1) Negative correlation between p and effect size.
2) Negative correlation between p and data size.
3) Positive correlation between p and measurement variation.
These are desirable features. Users need an index that possesses them, yet no index has been designed to serve the need; the p value is the best available for the task. Researchers need to screen out trivial relations and fortuitous occurrences before they take a closer look at the data. Editors and reviewers need to screen out hopeless manuscripts. The test p < 0.05 fills these needs. After decades of interactions among researchers, reviewers, editors, and readers of science across disciplines, p < 0.05 now regularly serves as a relatively low hurdle that a researcher must clear before more serious investigation begins. This function of p does not assume that the observed data are a random sample, or that variable distributions are normal or independent and identically distributed (IID). This explains why users still find p values useful when
1) Observed data are not a random sample.
2) Observed data constitute the entire operational population.
3) Observed data are a substantial part of an operational population.
4) While the observed data and the operational population are from the past, the ultimate target population of most studies projects into the future.
For this function, p value has nothing to do with “significance” or “statistical significance.” Instead, p<α indicates “statistical acknowledgement” (Liu, et al. [16-18]). The pretest function of p value urges users to focus more on effect size and less on p value (Zhao, et al. [15]). It also implies that a main task of replication studies is to replicate effect direction, but not p<α.
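The three correlations underlying the pretest function can be checked directly with a two-sided z-test calculation. The numbers below are arbitrary illustrative values, not data from any cited study:

```python
import math

def z_test_p(effect, sd, n):
    """Two-sided p value for an observed mean effect, given measurement SD and n."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# 1) p falls as effect size grows (sd and n held fixed).
p_small_effect = z_test_p(0.1, 1.0, 25)
p_large_effect = z_test_p(0.5, 1.0, 25)

# 2) p falls as data size grows (effect and sd held fixed).
p_small_n = z_test_p(0.2, 1.0, 25)
p_large_n = z_test_p(0.2, 1.0, 400)

# 3) p rises as measurement variation grows (effect and n held fixed).
p_low_sd = z_test_p(0.2, 0.5, 25)
p_high_sd = z_test_p(0.2, 2.0, 25)

print(round(p_small_effect, 3), round(p_large_effect, 3))
print(round(p_small_n, 3), round(p_large_n, 5))
print(round(p_low_sd, 3), round(p_high_sd, 3))
```

Because the z statistic is effect × √n / sd, each of the three monotone relations holds by construction, which is exactly what makes p serviceable as a crude prescreening index.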
Function 6
Pretending Significance: “Statistically significant,” or simply “significant,” has become a synonym of “p < 0.05,” which many experts argue is a consequential misnomer (Amrhein, et al. [1]). Indicating or pretending “significance” is a misfunction of p value. One consequence of this misfunction is to mistake p < 0.05 for a main indicator that must be replicated if a study is to be replicable (Tackett, et al. [19-21]). P values vary with data size, measurement variation, and effect size, all of which expectedly vary across studies. Technically, therefore, p values are not meant to be replicated. More importantly, when proxying population (F4) or pretesting effects (F5), the main research findings are about effect direction and effect size, not about p value. Across studies, it is the effect direction that needs to be replicated, and the effect sizes that need to be averaged.
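A brief simulation shows why p is a poor replication target: with one and the same true effect, hypothetical replication studies that differ only in sample size produce widely varying p values, while the estimated effect direction is far more stable. All numbers below are assumed for illustration.

```python
import math
import random

random.seed(2)

# Hypothetical replication scenario: the same true effect (0.3, with SD 1)
# is studied repeatedly by labs that happen to use different sample sizes.
true_effect, sd = 0.3, 1.0
p_values, directions = [], []
for n in [20, 50, 100, 200, 400]:
    sample = [random.gauss(true_effect, sd) for _ in range(n)]
    mean = sum(sample) / n
    z = mean / (sd / math.sqrt(n))           # z statistic for the sample mean
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p value
    p_values.append(p)
    directions.append(mean > 0)

print([round(p, 3) for p in p_values])  # varies widely across "replications"
print(directions)                        # direction is the stabler finding
```

Even though every simulated study measures the identical population effect, the p values spread across orders of magnitude purely because n differs, while the sign of the estimate is what tends to recur.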
Conclusion
It’s time to acknowledge the multiple functions, especially the pretest function, of the p value and the statistical acknowledgement test, aka the significance test. It’s time to acknowledge that the p value plays a reduced but still important role in scientific enquiries, including biomedical research. It’s time to call for research aimed at developing better tools to serve each legitimate need.
References
1) Amrhein V, Greenland S, McShane B (2019) Retire statistical significance. Nature 567: 305-307.
2) Benjamini Y, De Veaux R, Efron B, Evans S, Glickman M, et al. (2021) ASA President’s Task Force Statement on Statistical Significance and Replicability. Harvard Data Science Review 3(3).
3) Editorial (2019) It’s time to talk about ditching statistical significance. Nature 567: 283.
4) Zhao X (2016) Four functions of statistical significance tests [Presentation at the School of Statistics and Center for Data Sciences, Beijing Normal University, 25th].
5) Pearson K (1900) X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50(302): 157-175.
6) Hunter JE (1997) Needed: A ban on the significance test. Psychological Science 8(1): 3-7.
7) Szucs D, Ioannidis JPA (2017) When null hypothesis significance testing is unsuitable for research: A reassessment. Frontiers in Human Neuroscience 11: 390.
8) Nuzzo R (2014) Statistical errors: P values, the “gold standard” of statistical validity, are not as reliable as many scientists assume. Nature 506(7487): 150-152.
9) Siegfried T (2015) P value ban: Small step for a journal, giant leap for science. Editors reject flawed system of null hypothesis testing. Science News.
10) Thompson B (1999) Journal editorial policies regarding statistical significance tests: Heat is to fire as p is to importance. Educational Psychology Review 11(2): 157-169.
11) Trafimow D, Marks M (2015) Editorial. Basic and Applied Social Psychology 37(1): 1-2.
12) Woolston C (2015) Psychology journal bans P values. Nature 519(7541): 9.
13) Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1(3): 140216.
14) Zhao X, Liu JS, Deng K (2013) Assumptions behind intercoder reliability indices. Annals of the International Communication Association 36(1): 419-480.
15) Zhao X, Zhang XJ (2014) Emerging methodological issues in quantitative communication research. In: Hong J (Ed.), New Trends in Communication Studies, pp. 953-978.
16) Liu PL, Zhao X, Wan B (2021) COVID-19 information exposure and vaccine hesitancy: The influence of trust in government and vaccine confidence. Psychology, Health and Medicine 7: 1-10.
17) Liu PL, Ao SH, Zhao X, Zhang L (2022) Associations between COVID-19 information acquisition and vaccination intention: The roles of anticipated regret and collective responsibility. Health Communication, Advance online publication.
18) Zhao X, Ye J, Sun S, Zhen Y, Zhang Z, et al. (2022) Best Title Lengths of Online Postings for Highest Read and Relay. Journalism and Communication Review 75(3): 5-20.
19) Tackett JL, Brandes CM, King KM, Markon KE (2019) Psychology’s Replication Crisis and Clinical Psychological Science. Annual Review of Clinical Psychology 15: 579-604.
20) Trafimow D (2018) An a priori solution to the replication crisis. Philosophical Psychology 31(8): 1188-1214.
21) Wiggins BJ, Christopherson CD (2019) The replication crisis in psychology: An overview for theoretical and philosophical psychology. Journal of Theoretical and Philosophical Psychology 39(4): 202-217.