In 2016, the American Statistical Association (ASA) gathered more than two dozen experts to develop a consensus on statistical significance and p-values, and issued a statement on the subject. What prompted the ASA, for the very first time, to issue a statement on a specific matter of statistical practice? The p-value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than the one actually observed. R. A. Fisher, the father of modern statistics, introduced p-values as a formal research tool and proposed attaching the term “significance” to low p-values. As the term suggests, he meant that such results are worthy of attention and warrant further experiments, not that they constitute proof.
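The definition of the p-value can be made concrete with a small calculation. The sketch below is only an illustration: the coin-toss scenario, the function name `binomial_p_value`, and the counts are assumptions chosen for the example, and it computes a one-sided p-value (results at least as high as the one observed).

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null).

    p_null is the success probability assumed by the null hypothesis.
    """
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 60 heads in 100 tosses of a coin assumed to be fair.
p = binomial_p_value(successes=60, trials=100)
print(f"p-value = {p:.4f}")
```

The number printed is the probability of seeing 60 or more heads if the coin really is fair. It is not the probability that the coin is fair.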
P-values and statistical significance are often misunderstood and misused. The ASA points out several limitations of p-values: a p-value cannot measure the size of an effect, the importance of a result, or the strength of the evidence. In spite of this, the p-value continues to be used as a tool for separating true findings from false ones. One of the biggest misinterpretations is that p-values tell us the probability that results are true or that they are due to random chance. Principle no. 2 of the ASA statement clearly rejects this reading: a p-value only tells us the probability of seeing one’s results under a particular hypothetical explanation.
(The ASA statement, principle no. 2: “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”)
Many researchers believe that p < 0.05 is a license to get their work published (in many fields, results with a p-value of 0.05 or less are considered statistically significant). The choice of 0.05 is an arbitrary cut-off with no statistical or logical justification. It has become a scientific convention through decades of common use. Researchers tend to snoop around in their data and try several different analyses offered by various statistical software packages until one of them passes the arbitrary cut-off.
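A short calculation shows why trying analysis after analysis is risky. The sketch below assumes that there is no real effect and that the analyses are independent of one another, which is a simplification (analyses run on the same data are usually correlated). The function name `chance_of_false_positive` is made up for this example.

```python
def chance_of_false_positive(num_analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent analyses gives
    p < alpha when no real effect exists (simplifying assumption: independence)."""
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 5, 10, 20):
    print(f"{k:2d} analyses -> {chance_of_false_positive(k):.0%} chance of a 'significant' result")
```

Under these assumptions, the chance of at least one spurious "significant" result rises from 5% with a single analysis to roughly 40% with ten analyses.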
Overreliance on the arbitrary cut-off of 0.05 has often led researchers to overlook the difference between statistical significance and practical significance. Principle no. 3 of the ASA statement clearly warns against basing scientific conclusions and business or policy decisions only on whether a p-value passes a specific threshold such as 0.05. Scientific claims that rest on p < 0.05 alone can lead to erroneous beliefs and poor decision making. Faulty statistical conclusions can also have real economic consequences.
Case in point: following the 1973 oil crisis, traffic agencies in the US allowed right turns on red signals to save the fuel wasted by commuters waiting at red lights. This was seen as an energy conservation measure. Several studies were conducted to determine the safety impact of this change. They reported a small increase in the number of crashes but did not have enough data to conclude that the increase was statistically significant. In effect, statistical insignificance was treated as practical insignificance. As more cities and states in the US began to allow right turns on red, more pedestrians were run over and the number of car collisions increased. Apparently, no one attempted to aggregate these small studies into larger data sets. Only several years later, with more data, did studies finally show that 60% more pedestrians and twice as many cyclists were hit at right turns.
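The effect of aggregating small studies can be sketched with a two-proportion z-test. All counts below are hypothetical and invented purely for illustration; they are not the figures from the right-turn-on-red studies. Simply adding up counts is also a crude stand-in for a proper meta-analysis. The function name `two_proportion_p_value` is an assumption of this example.

```python
from math import erfc, sqrt


def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    return erfc(abs(z) / sqrt(2))  # P(|Z| >= |z|) for a standard normal Z


# Hypothetical single study: crashes at 20 of 1,000 sites before the change
# and at 30 of 1,000 sites after it.
p_single = two_proportion_p_value(20, 1000, 30, 1000)

# Ten hypothetical studies of the same size with the same rates, counts added up.
p_pooled = two_proportion_p_value(200, 10_000, 300, 10_000)

print(f"single small study: p = {p_single:.3f}")
print(f"ten studies pooled: p = {p_pooled:.2g}")
```

With these made-up numbers, the same 50% relative increase in crashes fails to reach p < 0.05 in any single small study, yet is far below that threshold once the data are combined.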
Consider another case, that of Satoshi Kanazawa, who published a series of papers on the topic of gender ratios. One of his papers, titled “Beautiful parents have more daughters”, was published in 2007 in the Journal of Theoretical Biology. He asserted that beautiful parents have a daughter 52% of the time, while the least attractive parents have daughters 44% of the time. Kanazawa reported this 8 percentage-point difference as statistically significant. Biologically, a difference of about 0.3% would seem reasonable, so the effect size claimed by Kanazawa was inflated by a factor of more than 20. Later it was found that he had committed several errors in his statistical analysis. A correct analysis showed a much smaller effect, and the results were not statistically significant. Nothing new was learned from the study. This type of error is known as truth inflation: the true size of the effect gets inflated. It arises in studies with small sample sizes, which are underpowered (that is, they have a low probability of detecting an effect of practical importance). John Ioannidis, a professor of medicine and statistics at Stanford University and one of the most cited scientists worldwide, claims that such errors occur in many fields of medical science (especially pharmacology, epidemiology and gene association studies), in the social sciences (especially psychology), and in fields like ecology and evolution.
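Truth inflation can be demonstrated with a small simulation. The sketch below is not a reconstruction of Kanazawa's data. It assumes a made-up true effect of 0.1 standard deviations, 20 subjects per group, normally distributed outcomes with a known standard deviation of 1, and a simple two-sided z-test. All names and parameter values are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import mean


def simulate_significant_effects(true_effect=0.1, n_per_group=20,
                                 n_studies=2000, seed=0):
    """Simulate many small two-group studies and return the effect estimates
    of those that reach two-sided p < 0.05 (z-test, known standard deviation 1)."""
    rng = random.Random(seed)
    standard_error = sqrt(2 / n_per_group)  # SE of a difference of two means
    significant = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = mean(treated) - mean(control)
        if abs(estimate) / standard_error > 1.96:  # two-sided p < 0.05
            significant.append(estimate)
    return significant


TRUE_EFFECT = 0.1  # hypothetical true effect, in standard deviations
significant = simulate_significant_effects(true_effect=TRUE_EFFECT)
typical_size = mean([abs(e) for e in significant])
print(f"{len(significant)} of 2000 simulated studies were significant")
print(f"average size of a significant estimate: {typical_size:.2f} "
      f"(true effect: {TRUE_EFFECT})")
```

With samples this small, an estimate has to exceed about 0.62 in absolute value to be significant, so every study that passes the threshold overstates the assumed true effect of 0.1 by a factor of more than six. If only significant results are reported, the literature ends up containing only inflated estimates.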
In addition, Professor John Ioannidis and his colleague have argued that early publications in high-profile, rapidly moving scientific fields such as genetics tend to report more extreme, and sometimes contradictory, results, since journals are mostly interested in publishing new and exciting findings. Subsequent studies, however, usually find a much smaller effect. So always be cautious of studies that report a surprisingly large discovery from small samples.
A recent critical analysis of the UGC-approved list of journals, headed by Dr. Bhushan Patwardhan, a Pune-based researcher and professor at the Interdisciplinary School of Health Sciences, Pune University, found 88% of the journals in the UGC’s white list to be dubious and of low quality, which presents a grim picture of the situation in India. The latest development in this story is that the UGC has removed 4305 journals from its approved list. One possible solution is to make journals adopt high statistical reporting standards and promote more transparency. Leading journals in India should lead the charge. More focus on quality statistical education can play a key role. Readers should press journal editors to hold authors to more rigorous standards.
“The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives” by Deirdre McCloskey and Stephen Ziliak gives a vivid description of how statistical significance dominates many sciences today and documents the harmful consequences of this phenomenon. An additional 21 commentaries by various authors were published as supplementary material to the ASA statement on p-values and statistical significance. They offer a wide range of ideas and discuss possible solutions.
Of late, more and more voices have been calling for statistical reform, but there is disagreement about the best way to address the problem. A prominent group of 72 statisticians, psychologists, economists, sociologists, political scientists, biomedical researchers and others has proposed changing the default p-value threshold for statistical significance from 0.05 to 0.005. In response, a group of 88 researchers has proposed that academics justify the significance threshold they use in each study rather than adopt another arbitrary one. A recent National Bureau of Economic Research (NBER) working paper by Alberto Abadie argued that non-significant results can be more informative than significant results in empirical economics. Some people suggest that p-values should be abandoned completely in favour of confidence intervals, prediction intervals, and Bayesian methods. Finally, Darrell Huff, the author of the famous book “How to Lie with Statistics”, leaves us with an important message: the research coming out of our universities and laboratories is worthy of our trust, but not of unconditional trust.