The role of plausibility in the evaluation of scientific research

The paper discusses the impact of plausibility (the a priori probability) on the results of scientific research, according to the approach proposed by Ioannidis, concerning the percentage of null hypotheses erroneously classified as "positive" (statistically significant). The question "what fraction of positive results are true-positives?", which is equivalent to the positive predictive value, depends on the combination of true and false hypotheses within a given area. For example, consider an area in which 90% of hypotheses are false, α = 0.05 and power = 0.8: for every 1,000 hypotheses, 45 (900 × 0.05) are false-positives and 80 (100 × 0.8) are true-positives. Therefore, the probability of a positive result being a false-positive is 45/125. In addition, the reporting of negative results as if they were positive would contribute towards an increase in this fraction. Although this analysis is difficult to quantify, and these results are likely to be overestimated, it has two implications: i) plausibility should be considered in the analysis of the ethical adequacy of a research proposal, and ii) mechanisms aimed at registering studies and protocols should be encouraged.

DESCRIPTORS: Hypothesis-Testing. Reproducibility of Results. Statistical Methods and Procedures.


INTRODUCTION
In contemporary statistical methodology,11 which is widely used across all fields of science, the "null hypothesis" (H0) represents the nonexistence of a given effect (hence its name). H0 is either "rejected" or "not rejected" based on an appropriate test statistic (for example, Student's t for assessing the difference between two means). The analysis strategy then consists of calculating a probability, known as the p-value, associated with this statistic. When this value is lower than a threshold defined a priori (α), the effect is considered to exist, or to be "statistically significant." Two types of error are intrinsic to this procedure: Type I (rejecting H0 when it is true) and Type II (not rejecting H0 when it is false). These errors occur with probabilities α and β, respectively. In general, α is set arbitrarily at 5%, and experimental designs often aim at a β of up to 20% (that is, a probability of at least 80% of correctly rejecting H0 when it is false, known as the test's "power").
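The decision rule described above can be sketched in a few lines. This is only an illustration: a two-sided z-test is used in place of Student's t, because the normal distribution is available in Python's standard library, and the function names `two_sided_p_value` and `decide` are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under H0 (standard normal)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def decide(z, alpha=0.05):
    """Return the p-value and whether H0 is rejected at threshold alpha."""
    p = two_sided_p_value(z)
    return p, p < alpha


# Assumed example statistics, chosen only for illustration.
for z in (1.0, 2.5):
    p, rejected = decide(z)
    print(f"z = {z}: p = {p:.4f}, reject H0: {rejected}")
```

Note that the rule uses only the p-value and α; nothing in it reflects how plausible the tested hypothesis was beforehand.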
Among other things, a shortcoming of the traditional approach is that it does not consider plausibility when evaluating a hypothesis. Especially among statisticians of non-classical persuasion, there is the idea that a p-value may overestimate the evidence against a hypothesis, since the effect of plausibility is not evident in classical analyses (that is, p = 0.001 is considered the same evidence for rejection whether the hypothesis is plausible or implausible).3,4 The approach proposed by Ioannidis7-9 relates to the percentage of null hypotheses (H0) erroneously classified as "positive results" (statistically significant) in different fields of science. According to this approach, the question "what proportion of positive results is truly positive?" essentially depends on the proportion of true and false hypotheses tested within a given field of knowledge, that is, on the a priori plausibility. This analysis is important for our understanding of the limitations inherent to scientific research, especially with respect to the priors (initial probabilities) of a given study.

THE IOANNIDIS APPROACH
The central argument of Ioannidis7-9 was presented in an article with the provocative title "Why most published research findings are false,"6 in which the author states that "it can be proven that most claimed research findings in most areas of research are false." This article has been cited hundreds of times in the scientific literature.
Ioannidis systematized observations initially made by others, such as Browner & Newman1 and Sterne & Smith.15 Thus, the concepts of Type I and Type II errors were presented in a conceptually equivalent manner, such that the probability of a Type I error was defined as the percentage of all H0 hypotheses in a given field of research that are erroneously classified as statistically significant, and that of a Type II error as the percentage of false H0 erroneously classified as non-statistically significant. Given a positive finding (that is, a rejected H0), the probability that H0 is indeed false is conditional upon the initial fraction of true and false hypotheses tested. This statement, which is analogous to the concept of positive predictive value widely used in diagnostic testing,1,2 can be understood by considering the following examples: i) all hypotheses tested in a given area are in actuality false. In this case, 100% of positive results would be false. ii) 100% of hypotheses tested are true. Analogously, all positive results would be true. iii) In a field in which 90% of tested hypotheses are false (and maintaining the conventional values of α = 0.05 and power = 0.8), for every 1,000 hypotheses, 45 will be false-positives (900 × 0.05) and 80 will be true-positives (100 × 0.8). Thus, given a positive result, the probability of it being a false-positive is roughly one-third (45/125 = 0.36) (Figure).
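The arithmetic of examples i to iii can be reproduced with a minimal sketch. The function name `positive_result_breakdown` and its parameters are hypothetical, introduced only for illustration; the counts are expected values, not simulated outcomes.

```python
def positive_result_breakdown(n_hypotheses, fraction_false, alpha=0.05, power=0.8):
    """Expected false-positives, true-positives and share of false positives."""
    n_false = n_hypotheses * fraction_false   # hypotheses with no real effect (H0 true)
    n_true = n_hypotheses - n_false           # hypotheses with a real effect (H0 false)
    false_positives = n_false * alpha         # true H0 rejected by chance
    true_positives = n_true * power           # false H0 correctly rejected
    share_false = false_positives / (false_positives + true_positives)
    return false_positives, true_positives, share_false


# Example iii: 1,000 hypotheses, 90% of them false.
fp, tp, share = positive_result_breakdown(1000, 0.9)
print(fp, tp, round(share, 2))
```

Setting `fraction_false` to 1.0 or 0.0 gives the two extreme cases, in which all or none of the positive results are false.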
Between the extremes represented by cases i and ii, the ratio R = true hypotheses/false hypotheses (that is, false H0/true H0) alters the equivalent of the positive predictive value for a given field of knowledge (given a positive result, the larger R is, the greater the probability of a true-positive). In other words, the lower a study's plausibility is, the greater the probability of a positive result being false. This phenomenon, according to Ioannidis, would help explain why even high-impact scientific publications often publish contradictory and non-replicable results.6 Ioannidis also introduced the concept of "u bias," defined as the probability of a negative result being erroneously reported as positive through selective use of secondary outcomes, alteration of cutoff points, use of inappropriate statistical methods, or fraud. Based on these concepts, a simulation of R and u for different types of study led to the conclusion that "in the described framework, a positive predictive value exceeding 50% is quite difficult to get"6 (p.699), and that "even well powered epidemiological studies may have only a one in five chance of being true, if R = 1:10"6 (p.699). This underlies Ioannidis' claims regarding the prevalence of non-replicable results in science.
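The dependence of the positive predictive value on R and u can be sketched as follows. This is a minimal illustration, assuming the closed-form expression with bias as it is usually attributed to Ioannidis' article, with R taken as the ratio of true to false tested hypotheses; the function name `ppv` is hypothetical, and the value u = 0.30 is an assumption chosen for illustration.

```python
def ppv(R, alpha=0.05, power=0.8, u=0.0):
    """Positive predictive value for pre-study odds R and bias u.

    Assumed formula:
    PPV = ((1 - beta) * R + u * beta * R)
          / (R + alpha - beta * R + u - u * alpha + u * beta * R)
    With u = 0 this reduces to (1 - beta) * R / (R - beta * R + alpha).
    """
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# 100 true vs 900 false hypotheses, no bias: 80 of 125 positives are true.
print(round(ppv(100 / 900), 2))

# R = 1:10 with an assumed bias of u = 0.30.
print(round(ppv(1 / 10, u=0.30), 2))
```

Under these assumptions, the second case comes out at about one in five, which is consistent with the quoted statement for R = 1:10, and it shows how strongly the assumed value of u drives the result.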

ANALYSIS AND EVALUATION
Ioannidis' analysis is contingent on two fundamental aspects: i) that the number of false hypotheses in any field is much greater than that of true hypotheses; and ii) that the u rate is indeed high (Ioannidis assumed values ranging from 10% to 80%). The former may be justified by the innovative nature of the scientific endeavor, as well as by the constant pressure for results, even in sterile or slow-moving fields, but is difficult to extrapolate to the majority of scientific fields. On the other hand, in the absence of the so-called "file drawer effect,"13,14 the influence of the phenomenon discussed by Ioannidis is smaller, given that, if the pattern of R were known, the assessment of the true importance of a positive result would in principle be possible. However, as discussed above, the only way to determine R would be if all negative and positive results in a field were known (for example, if 100 hypotheses concerning a given phenomenon were tested and 95 were determined to be negative, global results would be compatible with a model that assumed the nonexistence of this phenomenon). As to the u parameter, Goodman & Greenland5,a pointed out that: i) the definition of u is misleading, since it equates the selective reporting of secondary outcomes with direct fraud; and ii) the values for u assumed by Ioannidis (10%-80%) are speculative and dominated the simulation, that is, his conclusions are dependent on a high prevalence of "fraud." It follows that Ioannidis' assumptions are difficult to quantify, and that it is impossible to claim that their effect applies to a substantial fraction of scientific results.
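The example of 100 tested hypotheses can be checked with a short calculation. This is a sketch under stated assumptions: the 100 tests are treated as independent, every H0 is taken to be true, and α = 0.05; the helper `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tests, alpha = 100, 0.05

# Expected number of false-positives if the phenomenon does not exist.
expected_false_positives = n_tests * alpha

# Probability of seeing 5 or more positives purely by chance.
p_five_or_more = prob_at_least(5, n_tests, alpha)

print(expected_false_positives, round(p_five_or_more, 3))
```

Under these assumptions, five positive results out of 100 are what chance alone would be expected to produce, which is why such a pattern is compatible with the nonexistence of the phenomenon. This can only be seen if the 95 negative results are also known.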
According to these authors, Ioannidis' analysis failed to distinguish between different levels of evidence (in terms of p-value) against H0.5,8 Ioannidis dichotomized results as either "statistically significant" or "non-statistically significant" based on the classically used α cutoff of 0.05. However, in practice, such dichotomization of results is unusual, the indication of specific p-values being generally preferred.
Irrespective of the criticisms made by Goodman & Greenland5,a (who in fact agree with the central points of Ioannidis' analysis), the effect discussed is highly dependent on the specific characteristics of each field of research. Ioannidis suggests two fields as being critical: genomic research and the search for associations between nutrients and epidemiological outcomes, in which hypotheses are often tested using a heuristic approach and effects are small and difficult to measure. Another important exampleb is that of the field known as Complementary and Alternative Medicine (CAM), given that it is not difficult to conclude, despite a number of attempts,10 that the only common, core principle among the countless trends that fall under the CAM denomination is the clear implausibility of their claims. This argument can be added to other points made in the literature (such as the lack of impact of negative results on this field, the inadequate legitimacy conferred to implausible ideas, and the inappropriate allocation of limited resources) to conclude that conducting CAM studies in human subjects is unjustifiable.12 On the other hand, absence of an effect is impossible to prove, for there is always the possibility of the effect being below the threshold of detection. Furthermore, the amount of resources for research is infinitely smaller than what would be necessary to analyze all phenomena that can theoretically be proposed. Therefore, at least in the realm of human research, phenomena should only be investigated when they are both relevant and plausible.

FINAL CONSIDERATIONS
Problems inherent to the methods of contemporary science facilitate the improper publication of results that are apparently positive. These problems are related to the plausibility of studies in a given field, but are also linked to the so-called "file drawer effect." As discussed above, quantifying these effects is difficult, since they are also dependent on the particular conditions of a given research field.
However, two implications are worth highlighting. The first relates to the importance of the operational principle recognized by the Helsinki declaration, which in its 11th article states that "medical research involving human subjects must conform to generally accepted scientific principles, be based on a thorough knowledge of the scientific literature, other relevant sources of information, and adequate laboratory and, as appropriate, animal experimentation."16 Lack of plausibility should thus be regarded as an important violation of research ethics. The second implication refers to the need to develop mechanisms for registering study protocols,13 so as to minimize, and facilitate the detection of, both the file drawer effect and protocol alterations. Such registration mechanisms would also help to identify duplicate studies and to expedite meta-analysis, thus contributing to greater transparency and efficiency in scientific research.

Figure. Proportion of false hypotheses in relation to the total number of statistically significant results.