To P or not to P, is that the question? Rethinking experimental design and data analysis to improve biological significance beyond the statistical significance

In an essay published in Nature, Valentin Amrhein (University of Basel) and his colleagues Sander Greenland (University of California) and Blake McShane (Northwestern University) present a series of arguments against the established “P value-based statistical significance dichotomania”.1 The authors use insightful practical examples, such as data on the unintended effects of anti-inflammatory drugs, and consider the influence of the human cognitive tendency to simplistically bucket results into “statistically significant” and “statistically nonsignificant” categories and to regard these categories as definitely different. Importantly, the essay clearly states that the authors are not advocating a ban on P values or statistical measures, but rather that P values should not be treated categorically or used to support a dichotomization into statistically significant or not. Similarly, a 2016 statement of the American Statistical Association2 warns against the misuse of statistical significance and P values, and includes the recommendation “don’t say statistically significant”. On the other hand, John P. A. Ioannidis (Stanford University) argues that “retiring statistical significance would give bias a free pass” and that “irrefutable nonsense would rule”.3 The author states that dichotomous conclusions can be useful for pinning down discoveries in different fields, but that the analysis of effect sizes “can often be better than determining whether an effect exists”. While this discussion is, at the least, thought-provoking, the goal of this Editorial is to take advantage of the questioning of the “statistical dogma” to draw attention to the steps that precede the statistical analysis and the generation of a given P value.
Indeed, a proper study design can measurably improve the statistical power of the findings, but several aspects of study design and of the subsequent data analysis may also have a significant (forget P values for a moment), although not statistically quantifiable, impact on the “significance” of the data. The big mistake would be to overvalue statistical analysis methods while undervaluing experimental design. In what follows, some practical examples (derived from data from our research group) will be used to illustrate how study design can increase both “statistical significance” and “biological significance”. Genetic studies are usually based on a classic case-control approach, in which controls and subjects presenting a given condition are compared with regard to the occurrence and frequency of genetic variants. In this context, the P value derived from the comparison of unaffected and affected individuals is essential to draw any conclusion. In such studies, the number of individuals in each group, the frequency of the target genetic variant and the frequency of the studied condition all affect the study power and the resulting P value and, evidently, the conclusions derived from such data. However, experimental design features that apparently cannot be computed in the determination of study power can also have a significant impact on the analysis outcome. In this situation, stratified sampling that takes possible confounding factors into account could balance the study groups and minimize the effect of external variables on the final data analysis.
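As a rough illustration of how group size, variant frequency and effect size enter a power estimate, the sketch below uses a normal approximation for comparing carrier frequencies between cases and controls. The function name, the parameter values and the choice of a two-proportion z-test are assumptions made for illustration; they are not the method or the data of the studies discussed in this Editorial.

```python
from math import sqrt
from statistics import NormalDist


def case_control_power(n_cases, n_controls, freq_controls, odds_ratio, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Hypothetical helper: freq_controls is the carrier frequency of the
    variant among controls, odds_ratio is the assumed true effect size.
    """
    # Carrier frequency among cases implied by the assumed odds ratio.
    odds_controls = freq_controls / (1 - freq_controls)
    odds_cases = odds_ratio * odds_controls
    freq_cases = odds_cases / (1 + odds_cases)

    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)

    # Standard errors under the null (pooled) and under the alternative.
    pooled = (n_cases * freq_cases + n_controls * freq_controls) / (n_cases + n_controls)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    se_alt = sqrt(freq_cases * (1 - freq_cases) / n_cases
                  + freq_controls * (1 - freq_controls) / n_controls)

    # Probability of rejecting in the direction of the true effect
    # (the negligible opposite tail is ignored).
    diff = abs(freq_cases - freq_controls)
    return z.cdf((diff - z_alpha * se_null) / se_alt)


# Hypothetical scenarios: same groups, different effect sizes and group sizes.
for n, odds_ratio in [(200, 1.5), (200, 3.0), (800, 1.5)]:
    power = case_control_power(n, n, freq_controls=0.20, odds_ratio=odds_ratio)
    print(f"n per group={n}, OR={odds_ratio}: power ~ {power:.2f}")
```

Under these assumptions, a larger odds ratio or larger groups both raise the estimated power, which is why a design choice that sharpens the contrast between groups can matter as much as recruiting more subjects.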
J Appl Oral Sci. 2019;27:e2019-ed001
In periodontal genetic studies, affected individuals (presenting some form of periodontitis) are generally compared with periodontally healthy subjects. 4-7 However, in this context, the possibility of controlling microbial exposure through oral hygiene interferes with the exposure factor, and consequently a periodontally healthy population is composed of subjects who routinely perform proper oral hygiene. Therefore, irrespective of their putative susceptible or resistant genotype, such subjects will not develop the disease phenotype because of proper oral care. This unique feature clearly differs from the usual characteristics of genetic studies of infectious diseases, where affected and unaffected individuals are typically recruited from endemic areas in which groups are naturally exposed to a pathogenic challenge, and the resistant and susceptible phenotypes are consequently exposed. 8,9 Therefore, the absence of the microbial factor in a periodontally healthy population clearly departs from the case-control study archetype, which determines whether an exposure is associated with an outcome. 8,9 In other words, the absence of an archetypical control with a defined resistance phenotype may limit the odds of identifying genotypic differences when compared with a susceptible group. In order to adapt the study design to the exposure concept, in periodontitis genetic studies the control group should comprise microbially exposed individuals with a phenotypic outcome distinct from chronic periodontitis. 10 Such features can be found in individuals presenting chronic gingivitis, who are characteristically exposed to a periodontal microbial challenge associated with a reversible, low-severity disease form characterized by the absence of attachment loss, which in theory represents a "resistance" phenotype/genotype.
Indeed, the "resistant versus susceptible" phenotype analysis, when compared to the traditional "healthy versus diseased" approach, significantly affected the study power and the odds of identifying genetic factors involved in periodontal disease (PD). 10 Overall, the study power was boosted from <30% to >85%, while the overall odds ratio values seem to double with this approach. This impact derives from proper observance of the archetypical study design, supported by exposure-based phenotype determination, which is comprehensively used in genetic studies of infectious diseases.
Phenotypic variation can also be a supporting factor for data analysis and interpretation in order to determine the possible involvement of a given factor [...] interpretation and "biological significance", for reasons similar to those highlighted for the genetic studies.
A healthy tissue represents homeostasis and a diseased tissue represents pathology. However, both allegedly destructive (RANKL) and protective (OPG) factors are upregulated in diseased tissues, as are numerous inflammatory and anti-inflammatory mediators. 11,12 Indeed, one could ask how increased levels of anti-inflammatory and anti-osteoclastogenic factors, associated with "highly significant" P values (in the healthy vs diseased tissue comparison), can be present in inflamed and diseased periodontal sites.
It is also important to consider that if a given molecule with an unknown role were identified as upregulated in diseased periodontal tissue, the immediate assumption regarding its role in disease pathogenesis would probably be to label such a molecule as "destructive".
In this scenario, the use of additional, distinct phenotypes can allow further data analysis and provide insight into how host inflammatory immune response mediators can affect periodontitis outcome. One possible approach is to compare periodontitis variants, such as aggressive and chronic, each characterized by unique features such as early vs late onset and different progression rates. 12 In such a comparison, it is possible to observe variations in the RANKL/OPG ratio, which can explain the possible variations between the forms; the comparison of each form with healthy tissue alone would not allow such an inference, despite its "very significant" P value, "more significant" than that of the aggressive vs chronic comparison. Importantly, the comparison of distinct disease forms points to a differential balance in the levels of pro- vs anti-osteoclastogenic and pro- vs anti-inflammatory mediators as determinants of disease outcome. However, determining the "tipping point" that separates homeostasis from pathology may require additional approaches, which will be explored further below.
One may argue that phenotypic variation may not be necessary, since it would be possible to perform [...] where the lesion size does not necessarily correlate with a "higher" activity signature (i.e., expression of tissue-destructive mediators) than that of smaller lesions (at the moment of sample collection). 19-22 Therefore, the comparison of periodontitis-derived data with other conditions can also support a better interpretation of the association of the RANKL/OPG ratio with active or inactive bone resorption. While the definition of active bone resorption is complex in periodontitis, in orthodontic tooth movement such patterns are more straightforwardly distinguishable. 23-25 Categorically, bone resorption activity is a hallmark of the pressure side, and can be comparatively analyzed against its tension side counterpart, where bone formation activity prevails. 26-28 Such data can provide a theoretical cutoff or threshold value that distinguishes the presence or absence of active bone resorption, which can be applied to periodontitis or other inflammatory osteolytic conditions, such as periapical lesions, to support additional analyses or assumptions. 22,29,30
In an additional example of how the use of distinct phenotypes can provide "biological significance" that [...] upregulated in diseased tissues when compared with healthy ones, associated with a "very significant" P value (P<0.001). 32 As previously mentioned for OPG and anti-inflammatory mediators, one could ask how increased levels of inflammation suppressors, associated with "highly significant" P values (in the healthy vs diseased tissue comparison), can be present in inflamed and diseased periodontal sites. However, the same study also demonstrates that SOCS levels are higher in chronic gingivitis than in chronic periodontitis, which provides an interesting additional biological clue, although this association presents a "less significant" P value (P<0.05) than the healthy vs diseased approach. 32
Based on the initial comparison, it is possible to recognize that SOCS are generally absent in healthy tissues and are upregulated in response to inflammation. However, the second scenario allows us to infer that a more pronounced upregulation in chronic gingivitis could reflect a more efficient suppressive feedback, which could account for some of the phenotypic variation between gingivitis and periodontitis. Even though the P value from the "healthy vs diseased" analysis (P<0.001) is lower, and therefore nominally "more significant", than that of the "phenotypic variation" comparison (P<0.05), the second comparison may be biologically "more significant", or more relevant for data interpretation. 32 It is also important to note that a result indicating statistically significant differences often has little or no biological impact.
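The distinction between a small P value and a meaningful effect can be illustrated with a minimal sketch that computes both an effect size and a P value from summary statistics. All numbers are hypothetical, the function name is illustrative, and a z approximation is assumed for simplicity (for small groups a t-test would be the more appropriate choice).

```python
from math import sqrt
from statistics import NormalDist


def effect_size_and_p(mean_a, mean_b, common_sd, n_per_group):
    """Return (Cohen's d, two-sided P) for two equal-sized groups.

    Hypothetical helper using a z approximation with a common SD.
    """
    cohens_d = (mean_b - mean_a) / common_sd
    z_score = cohens_d / sqrt(2 / n_per_group)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
    return cohens_d, p_value


# Hypothetical scenario 1: a trivial difference measured in very large groups.
d_trivial, p_trivial = effect_size_and_p(10.0, 10.2, 2.0, n_per_group=5000)

# Hypothetical scenario 2: a large difference measured in small groups.
d_large, p_large = effect_size_and_p(10.0, 12.0, 2.0, n_per_group=12)

print(f"trivial effect: d={d_trivial:.2f}, P={p_trivial:.2g}")
print(f"large effect:   d={d_large:.2f}, P={p_large:.2g}")
```

In this sketch the comparison with the smaller P value is the one with the smaller effect, so ranking results by P value alone would invert their practical relevance.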
Therefore, it is essential that researchers understand that, although statistics is methodologically important in a study, their biological knowledge and interpretation of the results should stand above it. In this context, it is possible to consider that "biological significance" can surpass "statistical significance" in supporting data interpretation, allowing a broader picture of the immunoregulatory scenario.
Indeed, the analysis of data generated from "phenotypic variation", rather than from the simple "healthy vs diseased" dichotomy, allows a series of correlation analyses that would yield false-positive results in the "healthy vs diseased" analysis. Recall that diseased periodontal tissues are characterized by high levels of theoretically destructive elements, such as osteoclastogenic factors (including RANKL) and pro-inflammatory molecules, but also by high levels of supposedly protective elements, such as OPG and anti-inflammatory molecules (such as IL-10), when compared to healthy tissues. Since the levels of such molecules vary widely between health and disease conditions, correlation analyses that include samples from both groups are highly prone to biased "false correlations". 21,32 It is known from experimental studies (whose importance will be considered later) that IL-10 induces OPG upregulation and RANKL downregulation. However, a "healthy vs diseased" correlation analysis can result in positive correlations between IL-10 and OPG, but also between IL-10 and RANKL, with "very significant R and P values" (unpublished data). When such correlation is performed observing the "phenotypic variation", or only with samples from a single disease form, the positive correlation between IL-10 and OPG is sustained, but with "less significant R and P values". Additionally, such analysis also reveals that IL-[...]
However, in some cases this theory may be beautiful on paper but impossible in practice, as the large minimum sample size required could make a study methodologically unviable.
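The "false correlation" that arises when healthy and diseased samples are pooled, as discussed above for IL-10 and RANKL, can be reproduced with purely synthetic numbers. In the sketch below, every value is simulated and the group means, slopes and noise levels are arbitrary assumptions; it is not a reanalysis of the unpublished data mentioned in the text. Within each simulated group RANKL falls as IL-10 rises, yet both are higher in the diseased group.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_group(n, il10_mean, il10_sd, rankl_base, rng):
    """Synthetic group in which RANKL decreases as IL-10 increases."""
    il10 = [rng.gauss(il10_mean, il10_sd) for _ in range(n)]
    rankl = [rankl_base - 0.5 * x + rng.gauss(0, 0.3 * il10_sd) for x in il10]
    return il10, rankl


rng = random.Random(0)  # fixed seed so the simulation is repeatable
healthy_il10, healthy_rankl = simulate_group(30, 1.0, 0.3, 2.0, rng)
diseased_il10, diseased_rankl = simulate_group(30, 5.0, 1.0, 10.0, rng)

r_healthy = pearson_r(healthy_il10, healthy_rankl)
r_diseased = pearson_r(diseased_il10, diseased_rankl)
r_pooled = pearson_r(healthy_il10 + diseased_il10, healthy_rankl + diseased_rankl)

print(f"healthy only:  r = {r_healthy:.2f}")
print(f"diseased only: r = {r_diseased:.2f}")
print(f"pooled:        r = {r_pooled:.2f}")
```

The pooled coefficient is driven by the gap between the group means rather than by the relationship inside each group, which is why restricting the analysis to a single phenotype can reverse its sign.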
Also, it is mandatory to consider that dichotomization or comparisons based on phenotypic data differ completely from dichotomization or stratification based on the arbitrary scores frequently introduced in the analysis process. Indeed, it is common for JAOS to receive submissions that use percentage scores derived from the quantification of cells positively stained for a given target, such as RANKL.
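A minimal sketch of how such percentage scores discard information, compared with data-driven quartiles. The cut-offs (5, 25, 50 and 75% of stained cells), the function names and the sample values are all hypothetical and serve only as an illustration; actual scoring schemes vary between papers.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Hypothetical upper bounds of scores 0..3; anything above 75% gets score 4.
SCORE_CUTOFFS = [5, 25, 50, 75]


def percent_to_score(percent_stained):
    """Map a percentage of stained cells to an arbitrary ordinal score."""
    return bisect_left(SCORE_CUTOFFS, percent_stained)


def quartile_group(value, cut_points):
    """Map a value to its quartile (0..3) given the three quartile cut points."""
    return bisect_left(cut_points, value)


# Hypothetical percentages of positively stained cells in 12 samples.
samples = [2, 4, 7, 12, 18, 25, 26, 33, 41, 50, 64, 80]

scores = [percent_to_score(s) for s in samples]
cut_points = quantiles(samples, n=4)  # cut points derived from the data
quartiles = [quartile_group(s, cut_points) for s in samples]

print("25% ->", percent_to_score(25), "| 26% ->", percent_to_score(26),
      "| 50% ->", percent_to_score(50))
print("samples per score:   ", dict(Counter(scores)))
print("samples per quartile:", dict(Counter(quartiles)))
```

With these assumed cut-offs, nearly identical samples can land in different scores while very different samples share one, and the score groups end up uneven in size; quartiles at least guarantee balanced groups, although keeping the original continuous values retains the most information.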
In this hypothetical scenario (roughly based on submitted papers), score zero refers to 0 to 5% of stained cells, score one to 5 to 25%, score two to 26 to 50%, and so on. Therefore, a 25% sample and a 26% sample would receive different scores, while a 26% sample and a 50% sample would receive the very same score. It is not necessary to apply complex statistical tools to realize that such a qualitative 'downgrade' may not be the best option for the subsequent data analysis, especially when the quantitative data are available. This strategy weakens the dependent variable and consequently leads to a less robust and less accurate data analysis. When stratification is necessary and phenotypic data are not available to guide it, the use of tertiles, quartiles, deciles or cluster analysis can be more adequate than the arbitrary assignment of samples into scores or [...]
While experimental data do not provide any kind of additional "statistical significance" to the associative data derived from human studies, including those previously mentioned in this essay, their "biological significance", despite being numerically unmeasurable, is remarkable. Despite its unprivileged position in the scientific evidence pyramid, "pre-clinical" research is essential for unraveling mechanistic evidence for biological and pathological processes, and for providing the basis for subsequent clinical interventions. 46 Mice with opposing maximal and minimal inflammatory responsiveness genotypes and phenotypes present distinct susceptibility/resistance patterns when