ORIGINAL ARTICLES
The role of plausibility in the evaluation of scientific research
Renan M V R Almeida
Programa de Engenharia Biomédica. Coppe. Universidade Federal do Rio de Janeiro. Rio de Janeiro, RJ, Brasil
Correspondence: Renan M V R Almeida. Programa de Engenharia Biomédica, Universidade Federal do Rio de Janeiro, Caixa Postal 68510, Cidade Universitária, 21945-970 Rio de Janeiro, RJ, Brasil. E-mail: renan.m.v.r.almeida@gmail.com
ABSTRACT
The paper discusses the impact of plausibility (the a priori probability) on the results of scientific research, following the approach proposed by Ioannidis, which concerns the percentage of null hypotheses erroneously classified as "positive" (statistically significant). The question "what fraction of positive results are true positives?", which is equivalent to the positive predictive value, depends on the combination of true and false hypotheses within a given area. For example, consider an area in which 90% of hypotheses are false, with α = 0.05 and power = 0.8: for every 1,000 hypotheses, 45 (900 x 0.05) are false positives and 80 (100 x 0.8) are true positives. Therefore, the probability of a positive result being a false positive is 45/125. In addition, the reporting of negative results as if they were positive would further increase this fraction. Although this analysis is difficult to quantify, and its results are likely to be overestimated, it has two implications: i) plausibility should be considered in the analysis of the ethical adequacy of a research proposal, and ii) mechanisms aimed at registering studies and protocols should be encouraged.
Descriptors: Hypothesis Testing. Reproducibility of Results. Statistical Methods and Procedures.
INTRODUCTION
In contemporary statistical methodology,^{11} which is widely used across all fields of science, the "null hypothesis" (H_{0}) represents the absence of a given effect (hence its name). H_{0} is either "rejected" or "not rejected" based on an appropriate test statistic (for example, Student's t for assessing the difference between two means). The analysis strategy then consists of calculating a probability (known as the p-value) associated with this statistic. When this value is lower than a threshold defined a priori (α), the effect is considered to exist, or to be "statistically significant." Two types of error are intrinsic to this procedure: Type I (rejecting H_{0} when it is true) and Type II (not rejecting H_{0} when it is false). These errors occur with probabilities α and β, respectively. In general, α is arbitrarily set to 5%, and experimental designs often aim at a level of up to 20% for β (that is, an 80% probability of correctly rejecting H_{0} when it is false, the test's "power").
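The meaning of α and power can be illustrated by simulation. The following minimal sketch (not part of the original article; the function name and parameter values are illustrative) runs many two-sample experiments with unit-variance normal data and a z-test, showing that the nominal α ≈ 0.05 is recovered when H_{0} is true, and that power ≈ 0.8 is obtained for a suitably sized effect:

```python
import random

random.seed(42)

def z_test_significant(n=50, effect=0.0, alpha_z=1.96):
    """Simulate one two-sample experiment (unit-variance normal data) and
    test the difference in means with a two-sided z-test at alpha = 0.05."""
    xs = [random.gauss(effect, 1.0) for _ in range(n)]
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    diff = sum(xs) / n - sum(ys) / n
    se = (2.0 / n) ** 0.5          # standard error of the mean difference
    return abs(diff / se) > alpha_z

trials = 20_000
# H0 true (effect = 0): fraction rejected estimates the Type I error rate.
type_i = sum(z_test_significant(effect=0.0) for _ in range(trials)) / trials
# H0 false (effect = 0.57 with n = 50): fraction rejected estimates the power.
power = sum(z_test_significant(effect=0.57) for _ in range(trials)) / trials
print(f"empirical alpha ~ {type_i:.3f} (nominal 0.05)")
print(f"empirical power ~ {power:.3f} (designed to be near 0.8)")
```

The effect size 0.57 was chosen because, with n = 50 per group, it yields a theoretical power close to the conventional 80% target.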
Among other shortcomings, the traditional approach does not consider the effect of plausibility when evaluating a hypothesis. Especially among statisticians of non-classical persuasion, there is the idea that a p-value may overestimate the evidence against a hypothesis, since the effect of plausibility is not evident in classical analyses (that is, p = 0.001 is taken as equivalent evidence for rejecting both a plausible and an implausible hypothesis).^{3,4}
Thus, the present article discusses the impact of initial plausibility on the results of scientific research, based on the approach of Ioannidis.^{6-9} This approach concerns the percentage of null hypotheses H_{0} erroneously classified as "positive results" (statistically significant) in different fields of science. According to it, the question "what proportion of positive results is truly positive?" essentially depends on the proportion of true and false hypotheses tested within a given field of knowledge, that is, on the a priori plausibility. This analysis is important for understanding the limitations inherent to scientific research, especially with respect to the priors (initial probabilities) of a given study.
THE IOANNIDIS APPROACH
In a recent series of articles, Ioannidis analyzed the role of replication and initial plausibility on the results of scientific research.^{6-9} His central argument was presented in an article with the provocative title "Why most published research findings are false,"^{6} in which the author states that "it can be proven that most claimed research findings in most areas of research are false." This article has been cited hundreds of times in the scientific literature.
Ioannidis systematized observations initially made by others, such as Browner & Newman^{1} and Sterne & Smith.^{15} The concepts of Type I and Type II errors were presented in a conceptually equivalent manner: the probability of a Type I error was defined as the percentage of all true H_{0} hypotheses in a given field of research that are erroneously classified as statistically significant, and the Type II error as the percentage of false H_{0} erroneously classified as not statistically significant. Given a positive finding (that is, a rejected H_{0}), the probability that H_{0} is indeed false is conditional upon the initial fraction of true and false hypotheses tested. This statement, which is analogous to the concept of positive predictive value widely used in diagnostic testing,^{1,2} can be understood by considering the following examples: i) all hypotheses tested in a given area are in actuality false; in this case, 100% of positive results would be false. ii) 100% of hypotheses tested are true; analogously, all positive results would be true. And iii) in a field in which 90% of tested hypotheses are false (maintaining the conventional values of α = 0.05 and power = 0.8), for every 1,000 hypotheses, 45 will be false positives (900 x 0.05) and 80 will be true positives (100 x 0.8). Thus, given a positive result, the probability of it being a false positive is roughly one-third (45/125) (Figure).
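The arithmetic of example iii can be reproduced directly. This short Python sketch (an illustration, not code from the article) computes the expected counts and the resulting positive predictive value:

```python
# Worked example: 90% of tested hypotheses are false (H0 true),
# with the conventional alpha = 0.05 and power = 0.8, over 1,000 hypotheses.
n_hypotheses = 1000
share_false = 0.90                  # fraction of hypotheses with no real effect
alpha, power = 0.05, 0.80

false_pos = n_hypotheses * share_false * alpha        # 900 x 0.05 = 45
true_pos = n_hypotheses * (1 - share_false) * power   # 100 x 0.8  = 80

ppv = true_pos / (true_pos + false_pos)               # positive predictive value
print(f"false positives: {false_pos:.0f}, true positives: {true_pos:.0f}")
print(f"P(false positive | positive result) = {1 - ppv:.2f}")  # 45/125 = 0.36
```

As in the text, a positive result in such a field has roughly a one-in-three chance of being a false positive, even though α and power take their conventional values.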
Between the extremes represented by cases i and ii, the ratio R of true to false hypotheses tested determines the equivalent of the positive predictive value for a given field of knowledge (given a positive result, the larger R is, the greater the probability of a true positive). In other words, the lower a study's plausibility, the greater the probability of a positive result being false. This phenomenon, according to Ioannidis, would help explain why even high-impact scientific journals often publish contradictory and non-replicable results.^{6}
Ioannidis also introduced the concept of "u bias," defined as the probability of a negative result being erroneously reported as positive by selective use of secondary outcomes, alteration of cutoff points, use of inappropriate statistical methods, or fraud. Based on these concepts, a simulation of R and u for different types of study led to the conclusion that "in the described framework, a positive predictive value exceeding 50% is quite difficult to get"^{6} (p. 699), and that "even well powered epidemiological studies may have only a one in five chance of being true, if R = 1:10"^{6} (p. 699). This justifies Ioannidis' claims regarding the prevalence of non-replicable results in science.
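The joint effect of R and u can be sketched with the positive predictive value formula from Ioannidis' 2005 article (the function name below is illustrative; R is expressed as the pre-study odds that a tested relationship is true):

```python
def ppv(R, alpha=0.05, beta=0.20, u=0.0):
    """Positive predictive value in Ioannidis' (2005) framework.

    R: pre-study odds that a tested relationship is true
    u: bias -- probability that a would-be negative result is
       nevertheless reported as positive.
    """
    # Expected true positives: (1 - beta)*R found legitimately,
    # plus u*beta*R of the missed true effects reported through bias.
    num = (1 - beta) * R + u * beta * R
    # Expected total positives: the above plus alpha of the nulls
    # and u of the remaining (would-be negative) results.
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# A well-powered study with pre-study odds R = 1:10 and no bias...
print(round(ppv(R=0.1), 2))          # ~0.62
# ...but with moderate bias (u = 0.3) the PPV drops sharply:
print(round(ppv(R=0.1, u=0.3), 2))   # ~0.2
```

The second case reproduces the "one in five chance of being true" quoted above: with R = 1:10, it is bias, rather than α alone, that drags the PPV down, which is why the simulation is so sensitive to the assumed values of u.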
ANALYSIS AND EVALUATION
Ioannidis' analysis is contingent on two fundamental assumptions: i) that the number of false hypotheses in any field is much greater than that of true hypotheses; and ii) that the u rate is indeed high (Ioannidis assumed values ranging from 10% to 80%). The former may be justified by the innovative nature of the scientific endeavor, as well as by the constant pressure for results, even in sterile or slow-moving fields, but is difficult to extrapolate to the majority of scientific fields. On the other hand, in the absence of the so-called "file drawer effect,"^{13,14} the influence of the phenomenon discussed by Ioannidis is smaller, given that, if the pattern of R were known, assessing the true importance of a positive result would in principle be possible. However, the only way to determine R would be if all negative and positive results in a field were known (for example, if 100 hypotheses concerning a given phenomenon were tested and 95 were determined to be negative, the global results would be compatible with a model that assumed the inexistence of this phenomenon). As to the u parameter, Goodman & Greenland^{5,a} pointed out that: i) the definition of u is misleading, since it equates the selective reporting of secondary outcomes with direct fraud; and ii) the values for u assumed by Ioannidis (10%-80%) are speculative and dominated the simulation, that is, his conclusions are dependent on a high prevalence of "fraud." It follows that Ioannidis' assumptions are difficult to quantify, and that it is impossible to claim that their effect applies to a substantial fraction of scientific results.

^{a} Goodman S, Greenland S. Assessing the unreliability of the medical literature: response to "Why most published research findings are false". Baltimore: Johns Hopkins University; 2007 [cited 2010 Jul]. (Working Paper 135). Available from: http://www.bepress.com/jhubiostat/paper135
According to these authors, Ioannidis' analysis also failed to distinguish between different levels of evidence (in terms of p-value) against H_{0}.^{5,8} Ioannidis dichotomized results as either "statistically significant" or "not statistically significant" based on the classically used α cutoff of 0.05. In practice, however, such dichotomization of results is unusual, the reporting of specific p-values being generally preferred.
Irrespective of the criticisms made by Goodman & Greenland^{5,a} (who in fact agree with the central points of Ioannidis' analysis), the effect discussed is highly dependent on the specific characteristics of each field of research. Ioannidis suggests two fields as being critical: genomic research and the search for associations between nutrients and epidemiological outcomes, in which hypotheses are often tested using a heuristic approach and effects are small and difficult to measure. Another important example^{b} is the field known as Complementary and Alternative Medicine (CAM), given that it is not difficult to conclude, despite a number of attempts,^{10} that the only common, core principle among the countless trends that fall under the CAM denomination is the clear implausibility of their claims. This argument can be added to other points made in the literature (such as the lack of impact of negative results on this field, the inadequate legitimacy conferred to implausible ideas, and the inappropriate allocation of limited resources) to conclude that conducting CAM studies in human subjects is unjustifiable.^{12}

^{b} Novella S. Are most medical studies wrong? Neurologica Blog. 2007 [cited 2010 Jul]. Available from: http://theness.com/neurologicablog/?p=8
On the other hand, the absence of an effect is impossible to prove, since there is always the possibility of the effect being below the threshold of detection. Furthermore, the resources available for research are infinitely smaller than what would be necessary to analyze all phenomena that can theoretically be proposed. Therefore, at least in the realm of human research, phenomena should only be investigated when they are both relevant and plausible.
FINAL CONSIDERATIONS
Problems inherent to the methods of contemporary science facilitate the improper publication of results that are apparently positive. These problems are related to the plausibility of studies in a given field, but are also linked to the so-called "file drawer effect." As discussed above, quantifying these effects is difficult, since they are also dependent on the particular conditions of a given research field.
However, two implications are worth highlighting. The first relates to the importance of the operational principle recognized by the Declaration of Helsinki, which in its 11^{th} article states that "medical research involving human subjects must conform to generally accepted scientific principles, be based on a thorough knowledge of the scientific literature, other relevant sources of information, and adequate laboratory and, as appropriate, animal experimentation."^{16} Lack of plausibility should thus be regarded as an important violation of research ethics. The second refers to the need to develop mechanisms for registering study protocols,^{13} so as to minimize and facilitate the detection of both the file drawer effect and of protocol alterations. Such registration mechanisms would also help to identify duplicate studies and expedite meta-analyses, thus contributing to greater transparency and efficiency in scientific research.
REFERENCES
Received: 7/14/2010
Accepted: 11/14/2010
Study presented at the 8th Congresso Brasileiro de Bioética, held in Buzios, Brazil, in 2009.
The author declares no conflict of interests.
 1. Browner W, Newman TB. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. JAMA. 1987;257(18):2459-63.
 2. Dawson B, Trapp RG. Basic & Clinical Biostatistics. New York: McGraw-Hill; 2004.
 3. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130(12):995-1004.
 4. Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med. 1999;130(12):1005-13.
 5. Goodman S, Greenland S. Why most published research findings are false: problems in the analysis. PLoS Med. 2007;4(4):e168. DOI:10.1371/journal.pmed.0040168
 6. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124. DOI:10.1371/journal.pmed.0020124
 7. Ioannidis JPA. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005;294(2):218-28.
 8. Ioannidis JPA. Why most published research findings are false: author's reply to Goodman and Greenland. PLoS Med. 2007;4(6):e215. DOI:10.1371/journal.pmed.0040215
 9. Ioannidis JPA. Why most discovered true associations are inflated. Epidemiology. 2008;16(4):640-8. DOI:10.1097/EDE.0b013e31818131e7
 10. Manzini T, Martinez EZ, Carvalho ACD. Conhecimento, crença e uso de medicina alternativa e complementar por fonoaudiólogas. Rev Bras Epidemiol. 2008;11(2):304-14. DOI:10.1590/S1415-790X2008000200012
 11. Moore DS. Estatística Básica e sua Prática. Rio de Janeiro: LTC Editora; 2005.
 12. Renkens CNM. Some complementary and alternative therapies are too implausible to be investigated. Focus Alternat Complement Ther. 2003;8(3):307-8.
 13. Yamey G. Scientists who do not publish trial results are "unethical". BMJ. 1999;319(7215):939.
 14. Young NS, Ioannidis JPA, Al-Ubaydli O. Why current publication practices may distort science. PLoS Med. 2008;5(10):e201. DOI:10.1371/journal.pmed.0050201
 15. Sterne JAC, Smith GD. Sifting the evidence: what is wrong with significance tests? BMJ. 2001;322:226-31.
 16. World Medical Association. Declaration of Helsinki - Ethical Principles for Medical Research Involving Human Subjects, 2008 version. Ferney-Voltaire; 2008 [cited 2010 Jul]. Available from: http://www.wma.net/en/30publications/10policies/b3/index.html
Publication Dates
Publication in this collection: 29 Apr 2011
Date of issue: June 2011