Familywise type I error of ANOVA and ANOVA on ranks in factorial experiments

ABSTRACT: This research evaluated the importance of a preliminary general analysis of variance (ANOVA) in the interpretation of data from factorial experiments under total nullity. For this, we evaluated the familywise type I error rate (accumulated FWER) of the F test for the unfolding of the factorial ANOVA and the factorial ANOVA on ranks, which were compared with the FWER for the global effect of treatments. In addition, we evaluated the FWER of Tukey's test under total nullity for factorial experiments in the presence or absence of preliminary ANOVA protection (omnibus F test). The study was conducted by simulating data from 2,000 experiments, which were separated into four representative agricultural research scenarios. For both the parametric factorial ANOVA and the non-parametric factorial ANOVA, the FWER significantly exceeded the nominal level of 5%, even under total nullity. As long as tests that control the total FWER in factorials are not in use, the factorial ANOVA should not be performed unless the preliminary ANOVA F test shows a significant effect. This, of course, does not apply to tests that are not multiple comparisons, such as Bonferroni, Dunn-Sidak and others, which do not need ANOVA protection. The same recommendation applies to the factorial ANOVA on ranks.

In agricultural experimentation, factorial designs stand out from unstructured experiments for at least two major advantages. First, they allow a formal estimation of the interaction effect, ensuring safer generalizations. Second, factorials reduce the total number of pairwise comparisons to be performed by multiple comparison tests. This increases the sensitivity of the multiple comparison tests (CARVALHO et al., 2023).
A non-parametric counterpart of the factorial ANOVA is the ANOVA on ranks with the aligned-rank transformation (ART) for the interaction, which allows a valid estimation of the significance of the interaction (DURNER, 2019). The rank transformation approach is a valid and very useful technique, since it unifies and simplifies a whole set of non-parametric procedures (CONOVER, 2012; MONTGOMERY, 2017).

the interaction (AxB). In the ANOVA of factorial experiments, it is common to proceed directly to the unfolding of the global effect of treatments, calculating only the F for the interaction between the factors and for the isolated effects of the factors. However, a few statisticians recommend that this unfolding, even when orthogonal, should not be performed if the general ANOVA does not point to a significant effect of the treatments as a whole (BANZATO & KRONKA, 2006; CRAMER et al., 2016). This recommendation is either unknown or largely ignored in mainstream experimental statistics textbooks and in leading analysis software, which can lead to high familywise type I error rates (FWER), since subsequent tests of means can indicate significant differences when no protection criterion, such as the F test, is used (in factorial experiments). Thus, this study evaluated the importance of the preliminary general ANOVA in the analysis of factorial experiments and the empirical accumulated FWER of the F test for the components of the unfolding of the factorial ANOVA and the factorial ANOVA on ranks. An additional aim was to evaluate the empirical accumulated FWER of the Tukey test under total nullity for factorial experiments in the presence or absence of the preliminary general ANOVA.
The study was conducted by simulating data from 2,000 experiments, which were separated into two initial groups: a group of experiments with a 2 × 5 factorial structure and five repetitions, and another group of experiments with a 5 × 5 factorial structure and three repetitions. Each group was subdivided into two groups: those with higher coefficient of variation (CV) values (between 25% and 40%) and those with lower CV values (between 1% and 10%), totaling four scenarios with 500 experiments each. This sample size was defined considering the power of the one-sided Binomial test (for a proportion) in relation to the magnitude of the expected FWER. The data were simulated by considering a completely randomized model for a factorial structure, $Y_{ijk} = m + a_i + b_j + a_i b_j + e_{ijk}$, where: $m$ is the overall average of the observations in each simulated experiment; $a_i$ is the average effect of factor a, level i; $b_j$ is the average effect of factor b, level j; $a_i b_j$ is the average effect of the interaction between factors a and b; and $e_{ijk}$ is the random error. Only the error component ($e_{ijk} \sim \mathrm{NID}(0, \sigma)$) was considered as a random variable.
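The sample-size reasoning above can be illustrated with a minimal sketch of the power of a one-sided exact Binomial test for 500 experiments. The alternative proportion of 0.10 is a hypothetical value chosen only for illustration (the source does not state the value it assumed), and the function names are not from the original study.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space for stability."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1.0 - p))


def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(c, n + 1))


def one_sided_power(n, p0, p1, alpha=0.05):
    """Return (critical count, actual size, power) of the exact one-sided test.

    H0: proportion = p0 against H1: proportion > p0, evaluated at p1.
    """
    c = 0
    # Smallest count c whose upper-tail probability under H0 is at most alpha.
    while upper_tail(c, n, p0) > alpha:
        c += 1
    return c, upper_tail(c, n, p0), upper_tail(c, n, p1)


if __name__ == "__main__":
    # p1 = 0.10 is an assumed alternative, not a value taken from the paper.
    critical, size, power = one_sided_power(n=500, p0=0.05, p1=0.10)
    print(f"reject H0 when at least {critical} of 500 experiments are significant")
    print(f"actual size = {size:.4f}, power at p1 = 0.10: {power:.4f}")
```

With these inputs the test rejects the nominal 5% rate only when the number of false positives among the 500 experiments reaches the critical count, which is the sense in which the scenario size relates to the expected FWER.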
Data were simulated in Apache OpenOffice Calc 4.1.7. We used the function "=NORMINV(RAND();mean;standard deviation)" to generate error values with a normal distribution, a procedure similar to that used by SOUSA et al. (2012). First, random values between 1 and 10 or between 25 and 40 (the CV ranges of the scenarios) were generated, which were then converted to feed the "standard deviation" parameter of the function described above.
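A minimal Python sketch of this generation step is given below. It is not the spreadsheet procedure itself: the uniform draw of the CV, the conversion sd = CV × mean / 100, the overall mean of 100 and all function names are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def norminv_rand(mean, sd, rng):
    """Analogue of =NORMINV(RAND(); mean; sd): inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is only defined for 0 < p < 1
        u = rng.random()
    return NormalDist(mean, sd).inv_cdf(u)


def simulate_experiment(a_levels, b_levels, reps, mean, cv_range, rng):
    """Simulate one experiment under total nullity as y[i][j][k].

    All treatment effects are zero, so every observation is mean + error.
    """
    cv = rng.uniform(*cv_range)      # assumed: CV drawn uniformly in the range
    sd = cv / 100.0 * mean           # assumed conversion from CV (%) to sd
    return [[[norminv_rand(mean, sd, rng) for _ in range(reps)]
             for _ in range(b_levels)]
            for _ in range(a_levels)]


if __name__ == "__main__":
    rng = random.Random(2024)
    # 2 x 5 factorial, five repetitions, lower-CV scenario (1% to 10%).
    data = simulate_experiment(2, 5, 5, mean=100.0, cv_range=(1.0, 10.0), rng=rng)
    print(len(data), len(data[0]), len(data[0][0]))
```

Passing a seeded `random.Random` instance keeps each simulated experiment reproducible, which the spreadsheet function `RAND()` does not offer.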
All data were previously submitted to the Jarque-Bera and Bartlett tests to verify the conditions of normality and homoscedasticity, respectively. When a simulated experiment did not meet one of these assumptions, it was discarded and replaced by another. The 2,000 simulated experiments were then individually submitted to an ANOVA following the factorial structure of the treatment decomposition into factors A, B, and the interaction. The preliminary F test (omnibus F test) was also performed for the global effect of treatments only, with a nominal α of 5% always used as the significance threshold. The means were then compared by the Tukey test to verify whether the false positives indicated by the F test were also indicated by this standard test. If at least one significant difference was identified by the test, the result was counted as a false positive (accumulated FWER or EWER). The same procedure was also performed for the data submitted to an ANOVA on ranks (CONOVER, 2012), with the ART used to estimate the interaction (DURNER, 2019). Analyses were performed using BioEstat 5.0, Microsoft Excel® and SPEED Stat 2.5 (CARVALHO et al., 2020).
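The counting procedure described above can be sketched for the 2 × 5 scenario with five repetitions as follows. This is an illustrative reimplementation, not the software used in the study: the critical values are copied from standard F tables for 40 error degrees of freedom, the mean of 100 and standard deviation of 10 are hypothetical, and the assumption checks (Jarque-Bera, Bartlett) and the Tukey test are omitted.

```python
import random

# Upper 5% critical values of F for a 2 x 5 factorial with 5 repetitions
# (error df = 40), taken from standard F tables; a statistics library
# would normally supply these.
F_CRIT = {"treatments": 2.124, "A": 4.085, "B": 2.606, "AxB": 2.606}


def factorial_f_tests(y):
    """Return the F statistics of a balanced two-factor layout y[i][j][k]."""
    a, b, r = len(y), len(y[0]), len(y[0][0])
    grand = sum(v for row in y for cell in row for v in cell) / (a * b * r)
    cell_mean = [[sum(cell) / r for cell in row] for row in y]
    a_mean = [sum(cell_mean[i]) / b for i in range(a)]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]

    ss_a = b * r * sum((m - grand) ** 2 for m in a_mean)
    ss_b = a * r * sum((m - grand) ** 2 for m in b_mean)
    ss_treat = r * sum((m - grand) ** 2 for row in cell_mean for m in row)
    ss_ab = ss_treat - ss_a - ss_b
    ss_err = sum((v - cell_mean[i][j]) ** 2
                 for i in range(a) for j in range(b) for v in y[i][j])
    ms_err = ss_err / (a * b * (r - 1))
    return {
        "treatments": ss_treat / (a * b - 1) / ms_err,   # omnibus F
        "A": ss_a / (a - 1) / ms_err,
        "B": ss_b / (b - 1) / ms_err,
        "AxB": ss_ab / ((a - 1) * (b - 1)) / ms_err,
    }


def empirical_fwer(n_experiments=2000, mean=100.0, sd=10.0, seed=1):
    """Estimate false-positive rates under total nullity (all effects zero)."""
    rng = random.Random(seed)
    omnibus = any_component = protected = 0
    for _ in range(n_experiments):
        y = [[[rng.gauss(mean, sd) for _ in range(5)] for _ in range(5)]
             for _ in range(2)]
        f = factorial_f_tests(y)
        sig = {name: f[name] > F_CRIT[name] for name in F_CRIT}
        component = sig["A"] or sig["B"] or sig["AxB"]
        omnibus += sig["treatments"]
        any_component += component
        # Protected rule: components are only read if the omnibus F is significant.
        protected += sig["treatments"] and component
    return {
        "omnibus": omnibus / n_experiments,
        "any_component": any_component / n_experiments,
        "protected": protected / n_experiments,
    }


if __name__ == "__main__":
    print(empirical_fwer())
```

The three rates correspond to the omnibus F test alone, to the usual reading in which any significant component counts, and to the reading in which components are only interpreted after a significant omnibus F.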
The analysis of the simulated data showed that the empirical FWER for each of the isolated components of the factorial was close to the 5% limit (Table 1). However, according to the most commonly used interpretation of an ANOVA, it is sufficient for only one of the factorial components (A, B, or AxB) to be significant to conclude that at least one mean differs from the others. In this case, the FWER oscillated between 10.4% and 14.0% for the different scenarios (Table 1). However, when a preliminary ANOVA was applied, the FWER ranged between 2.6% and 4.2% (Table 1). That is, the usual interpretation of a factorial ANOVA, which disregards the previous verification of the significance of the F test for treatments, led to inflated FWERs even under total nullity. Although this inflation was expected and is well known (since $\mathrm{FWER} = 1 - (1 - \alpha)^{k}$), it is important to demonstrate it empirically, as it is generally assumed that this problem occurs only under partial nullity (FRANE, 2021). These results, therefore, corroborate the recommendations of FLETCHER et al. (1989) and CRAMER et al. (2016). However, relying on the global F carries the problem that the F test for treatments loses power as the number of treatments increases (LAZIC, 2018), which is especially relevant for factorials.
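As a worked instance of the formula cited above, assume that the k = 3 F tests of the factorial decomposition (A, B and AxB) behave as independent tests at α = 0.05. This is only an approximation, since the three tests share the same error mean square.

```latex
\mathrm{FWER} = 1 - (1 - \alpha)^{k}
             = 1 - (1 - 0.05)^{3}
             = 1 - 0.857375
             = 0.142625 \approx 14.3\%
```

This value lies just above the upper end of the 10.4% to 14.0% range observed for the different scenarios.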
The analysis of the simulated data after the rank transformation (the ANOVA on ranks with ART to estimate the interaction) also showed that the FWER for each of the isolated components of the factorial was close to the nominal value of 5% (Table 2). Likewise, according to the usual interpretation of the factorial ANOVA on ranks, it is sufficient for only one of the factorial components (A, B, or AxB) to be significant to conclude that at least one mean differs from the others. In this case, the FWER fluctuated between 9.4% and 12.8% for the different scenarios considered (Table 2).
However, when the preliminary ANOVA on ranks was applied (to test the overall significance of the F for treatments), the FWER ranged between 2.4% and 5.2% (Table 2). That is, like the parametric factorial ANOVA, the factorial ANOVA on ranks also led to inflated FWERs when the significance of the omnibus F test was not considered beforehand.
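The two transformations behind the ANOVA on ranks can be sketched as below. The alignment shown (subtracting the estimated main effects and adding back the grand mean before ranking) is the form commonly described for the ART of an interaction in a balanced two-factor layout; whether the cited software uses exactly this form is an assumption, and the function names are illustrative.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks


def rank_layout(y):
    """Replace every observation of y[i][j][k] by its rank over all data."""
    flat = [v for row in y for cell in row for v in cell]
    ranks = iter(midranks(flat))
    return [[[next(ranks) for _ in cell] for cell in row] for row in y]


def align_for_interaction(y):
    """Remove the estimated main effects, keeping interaction plus residual."""
    a, b, r = len(y), len(y[0]), len(y[0][0])
    cell_mean = [[sum(cell) / r for cell in row] for row in y]
    a_mean = [sum(cell_mean[i]) / b for i in range(a)]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(a_mean) / a
    return [[[v - a_mean[i] - b_mean[j] + grand for v in y[i][j]]
             for j in range(b)]
            for i in range(a)]


if __name__ == "__main__":
    y = [[[12.1, 11.4], [15.0, 14.2]], [[9.8, 10.5], [13.3, 16.1]]]
    ranks_main = rank_layout(y)                          # used for A and B
    ranks_inter = rank_layout(align_for_interaction(y))  # used for AxB
    print(ranks_main)
    print(ranks_inter)
```

The usual factorial ANOVA is then computed on `ranks_main` for the main effects and on `ranks_inter` for the interaction.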
Furthermore, disregarding the preliminary ANOVA had an impact on the FWER of the Tukey test. If no protection criterion was applied to the Tukey test, its FWER fluctuated between 21.4% and 38.4% (because in factorials the Tukey test only controls the FWER within each subfamily of comparisons). If we considered that the test of means should only be applied when one of the factors (A, B, or the interaction) was significant, these error rates ranged between 10.4% and 13.4% (Table 1). Even if we only considered the F tests for the ramifications of the interaction (B's within A_i and A's within B_j), these error rates did not approach acceptable levels (Table 1). The FWER of Tukey's test was ≤ 5% when Fisher's protection criterion was applied using the significance of F for the global effect of treatments (preliminary ANOVA) (Table 1). These results, therefore, showed that the conclusions obtained by RODRIGUES (2015) do not apply to a factorial ANOVA under total nullity. As long as the tests of means are not adapted to control the total FWER in factorials, the general ANOVA continues to be useful to some extent.
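The reason why Fisher's protection criterion keeps the FWER of the Tukey test at or below the nominal level under total nullity can be written as a one-line bound. The notation is introduced here for illustration only, and the bound assumes that the omnibus F test has exact size α when the ANOVA assumptions hold.

```latex
P(\text{protected Tukey rejects})
  = P\bigl(F_{\text{treatments}}\ \text{significant} \,\cap\, \text{Tukey significant}\bigr)
  \le P\bigl(F_{\text{treatments}}\ \text{significant}\bigr)
  = \alpha = 0.05
```

A false positive from the Tukey test can only be declared in experiments where the omnibus F test has already rejected, so its rate cannot exceed that of the omnibus test.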
As with the parametric factorial ANOVA, if no protection criterion was applied to the Tukey on ranks test, the FWERs greatly exceeded the nominal level of 5% (data not shown). Similarly, this problem can be avoided by the simple inclusion of the omnibus F test (Table 2). The current discussion on the validity of non-parametric factorial ANOVA procedures (LUEPSEN, 2018; HARRAR et al., 2019) also requires this fact to be considered before suggesting the non-application of the ART for some situations.
Therefore, under total nullity in both the factorial ANOVA and the non-parametric factorial ANOVA, the control of the FWER will be guaranteed by the preliminary ANOVA (the significance of the F for the treatments) and not by the significance of the factorial components (A, B, and interaction). This could be included in the routines of several analysis software packages, such as Minitab, Assistat, Sisvar and R (packages Easyanova, FrF2, Agricolae, among others), to reduce the frequency of erroneous statistical conclusions. Furthermore, the full unfolding of the interaction by the F test does not replace the preliminary ANOVA to ensure the control of the FWER for multiple comparisons, even under total nullity. The use of the Fisher protection criterion in the preliminary ANOVA possibly eliminates further concerns with the full unfolding of the interaction in the ANOVA.

Table 1 - Percentage of experiments in which factorial ANOVA or preliminary ANOVA or Tukey's test indicated the existence of significant effects (P ≤ 0.05) (empirical accumulated FWER or EWER) in the 500 experiments of each of the four scenarios evaluated.

Table 2 - Percentage of experiments in which the factorial ANOVA on ranks or the preliminary ANOVA on ranks or the Tukey on ranks test indicated the existence of significant effects (P ≤ 0.05) (empirical accumulated FWER or EWER) in the 500 experiments of each of the four scenarios evaluated.