
Type I error in multiple comparison tests in analysis of variance

ABSTRACT.

In a hypothesis test, a researcher initially fixes the type I error rate, that is, the probability of rejecting the null hypothesis given that it is true. In the case of means tests, it is important that the type I error rate equal the pre-fixed nominal level and remain unchanged across scenarios, including different numbers of treatments, numbers of repetitions, and coefficients of variation. The purpose of this study is to analyse and compare the following multiple comparison tests with respect to their control of type I error rates, both conditional and unconditional on a significant F-test in the analysis of variance: Tukey, Duncan, Fisher's least significant difference, Student-Newman-Keuls (SNK), and Scheffé. As an application, we present a motivational study and develop a simulation study using the Monte Carlo method for a total of 64 scenarios. In each simulated scenario, we estimate the comparison-wise and experiment-wise error rates, conditional and unconditional on a significant result of the overall F-test of the analysis of variance, for each of the five multiple comparison tests evaluated. The results indicate that applying the means tests only after a significant F-test can change their error rates and should therefore be taken into account. In addition, we find that Fisher's test controls for the comparison-wise error rate, the Tukey and SNK tests control for the experiment-wise error rate, and the Duncan and Fisher tests control for the conditional experiment-wise error rate. Scheffé's test does not control for any of the error rates considered.

Keywords:
comparison-wise error rate; experiment-wise error rate; means tests; Monte Carlo simulation

Introduction

In agricultural experiments, a common problem arises when comparing treatments of interest to determine whether differences exist between them. The most common solution to this problem is the analysis of variance (ANOVA) (Girardi, Cargnelutti Filho, & Storck, 2009).

The overall F-test in ANOVA checks the hypothesis of equality of the population means of the treatments. If the F-test is significant, a multiple comparison test is performed to investigate possible differences between pairs of specific means or linear combinations of them (Saville, 2014).

One of the dilemmas involved in the means tests is their application conditional on a significant F-test. According to Cardellino and Siewerdt (1992), this is a controversial question and should be investigated further. Rodrigues, Piedade, and Lara (2016), for example, noticed divergent results between the overall F-test and the means tests evaluated in their simulation study. As a motivational study, we present the agronomic experiment of Henrique and Laca-Buendía (2010), which compares five cultivars and a new genotype of cotton, and show divergent results between the overall F-test and certain means tests commonly used in agricultural research. Nevertheless, many authors recommend applying means tests only after a significant result of the F-test in ANOVA. Therefore, many questions remain to be answered in this field of study.

However, this study cannot be dissociated from the errors that can occur in a hypothesis test. When the hypothesis for a contrast of means is tested, whether or not the test is applied only after a significant result of the overall F-test, it carries the probabilities of type I and type II errors, and the type I error rate can be of the comparison-wise or experiment-wise type (Ramos & Vieira, 2014). The comparison-wise error rate is the long-run proportion of erroneous inferences among the total number of comparisons made; the experiment-wise error rate is the long-run proportion of experiments with at least one erroneous inference among the total number of experiments conducted (Boardman & Moffitt, 1971).

Several studies have evaluated the means tests with respect to type I error rate control and have proposed modifications aimed at controlling this rate (Biase & Ferreira, 2011; Souza, Lira Junior, & Ferreira, 2012; Gonçalves, Ramos, & Avelar, 2015). However, the analysis of these concepts in association with the significance of the F-test requires additional investigation.

In this context, the aim of this study is to analyse and compare the Tukey, Duncan, Fisher's least significant difference (Fisher's LSD), Student-Newman-Keuls (SNK), and Scheffé tests, which are used for pair-wise comparisons between means, with respect to the control of type I error rates, conditional and unconditional on a significant result of the overall ANOVA F-test.

Material and methods

Motivational study

We present an experiment conducted by Henrique and Laca-Buendía (2010), whose aim was to compare five cultivars and a new genotype of cotton (Gossypium hirsutum L. r. latifolium Hutch). The experiment was conducted in Uberaba, Minas Gerais State, Brazil, located at longitude 47°57'22" WGR, latitude 19°44'6.82" S, and an altitude of 775 m.

The experiment was carried out in a randomised block design with six treatments and four replicates. The plots consisted of four 5-m lines spaced 0.7 m apart. The two central lines constituted the useful area of 3.5 m²; the other two lines, one on each side, served as borders, with each line containing the equivalent of 10 plants. In agricultural experiments, four repetitions are quite common, since the plots cover large areas and more than one 'individual' is used per plot.

In the experiment, the following varieties were compared: Delta Opal (Delta and Pine), Delta Penta (Delta and Pine), BRS-Cedro (EMBRAPA), IAC-25 (IAC), EPAMIG Precoce I, and the progeny IAC-06/191 (IAC). The height of the first productive branch (average distance from the soil to the first branch in which there were bolls, in centimetres) and final stand (total number of plants at harvest time) were among the evaluated traits.

In the context of the variable height of the first productive branch, the overall ANOVA F-test is found to be significant at the 5% level, but the Scheffé test shows no difference between the means of the treatments at the same level of significance (Table 1).

Table 1
Multiple comparisons applied to the data of the height of the first productive branch, in centimetres, for five cultivars and a new genotype of cotton.

For the variable final stand, the ANOVA F-test is not significant, but Duncan’s and Fisher’s LSD tests show differences between some of the treatment means (Table 2).

Table 2
Multiple comparisons applied to the data of the final stand of five cultivars and a new genotype of cotton.

These results serve as the basis for establishing scenarios to study the type I error rates presented here, since the means tests can be applied irrespective of whether the F-test is significant or not. A completely randomised design was used to facilitate the simulation process, although the results can be applied to any other type of experimental design.

Simulation study

A total of 128,000 experiments were simulated using the Monte Carlo method, 2,000 for each of the 64 scenarios formed by combining the following factors: 3, 5, 7, or 9 treatments; 3, 4, 10, or 20 replicates; and a coefficient of variation (CV) of 1, 5, 10, or 20%, with no treatment effect. The experiments were simulated under a completely randomised design according to the model

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, \ldots, a, \quad j = 1, \ldots, r,$$

where $y_{ij}$ represents the simulated value of the response obtained with the $i$-th treatment in its $j$-th repetition, $\mu$ is the mean, arbitrarily set at 100, $\tau_i$ is the fixed effect of the $i$-th treatment (considered null), and $\varepsilon_{ij}$ is the random error, generated independently from a normal distribution with zero mean and a standard deviation ($\sigma$) varying according to the desired CV.
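As a minimal illustration of this model, the following R sketch simulates one such experiment and applies the overall F-test together with the Tukey test. It is an assumed reconstruction, not the authors' original algorithm, and the function name simulate_experiment is ours; only the Tukey procedure (base R's TukeyHSD) is shown, while the other tests evaluated here are available, for example, in the agricolae package (LSD.test, duncan.test, SNK.test, scheffe.test).

```r
# Illustrative sketch (not the authors' original script): simulate one
# completely randomised experiment under H0 (all tau_i = 0) and return the
# overall ANOVA F-test p-value and the Tukey pairwise p-values.
simulate_experiment <- function(a = 3, r = 3, mu = 100, cv = 1) {
  sigma <- mu * cv / 100                           # CV(%) = 100 * sigma / mu
  trt   <- factor(rep(seq_len(a), each = r))       # a treatments, r replicates
  y     <- mu + rnorm(a * r, mean = 0, sd = sigma) # y_ij = mu + eps_ij under H0
  fit   <- aov(y ~ trt)
  list(f_p    = summary(fit)[[1]][["Pr(>F)"]][1],  # overall F-test p-value
       pair_p = TukeyHSD(fit)$trt[, "p adj"])      # one p-value per contrast
}
```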

In all simulated scenarios, the analyses were based on the same random seed, so that any differences observed were due to the differences between the tests rather than to the random error of the simulation process. The nominal significance level adopted in all cases was 5%. An algorithm was developed using the R software (R Core Team, 2020) to simulate the experimental data and perform the statistical analyses.
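Continuing the sketch above, one scenario (here a = 3, r = 3, CV = 1%) can be replicated 2,000 times from a fixed seed; the seed value below is arbitrary. The two summary vectors feed all of the error-rate estimates defined next.

```r
set.seed(2021)                                    # arbitrary fixed seed
n_exp <- 2000
sims  <- replicate(n_exp, simulate_experiment(a = 3, r = 3, cv = 1),
                   simplify = FALSE)
# Per experiment: was the overall F-test significant, and how many pairwise
# contrasts were (erroneously, since H0 holds) declared significant?
sig_F <- sapply(sims, function(s) s$f_p < 0.05)
n_err <- sapply(sims, function(s) sum(s$pair_p < 0.05))
```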

For each simulated scenario, the type I error rates were then estimated by comparison and by experiment for each of the five tests (Tukey, Duncan, Fisher's LSD, SNK, and Scheffé). These error rates were considered under two different regimes: (i) the multiple comparison procedure was applied regardless of the overall F-test result, and (ii) it was applied only when the F-test was significant.

The comparison-wise error rate (αc) is defined as the ratio of the number of erroneous inferences (declaring µi ≠ µi′ when in fact µi = µi′) to the total number of comparisons performed. Thus, taking the demonstrative scenario of a = 3, r = 3, and CV = 1%, the unconditional comparison-wise error rate can be estimated as the ratio of the total number of erroneous inferences to the total number of inferences (in this case, 2,000 experiments × 3 contrasts per experiment = 6,000 contrasts).
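In the notation of the sketch above, this estimate reduces to one line:

```r
n_contrasts <- choose(3, 2)                   # a = 3 gives 3 pairwise contrasts
alpha_c <- sum(n_err) / (n_exp * n_contrasts) # errors over all 6,000 contrasts
```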

When the means tests are applied only if the overall ANOVA F-test is significant, the conditional comparison-wise error rate (α1) can be estimated empirically as the total number of type I errors observed in the experiments with a significant F-test divided by the total number of inferences made. Taking the demonstrative scenario of a = 3, r = 3, and CV = 1%, and supposing that only 100 of the 2,000 experiments simulated for this scenario showed a significant overall F-test, this error rate is estimated by the ratio of the total number of erroneous inferences made in those 100 experiments to the total number of inferences (6,000 contrasts).
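Continuing the sketch, only the errors inside the significant-F experiments are counted, while the denominator remains the full 6,000 contrasts:

```r
alpha_1 <- sum(n_err[sig_F]) / (n_exp * n_contrasts)
```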

A second way to estimate the conditional comparison-wise error rate (α2) is to divide the total number of erroneous inferences within the experiments with a significant overall F-test by the total number of inferences made within these experiments. Considering the same scenario (a = 3, r = 3, and CV = 1%) and the same 100 experiments with a significant overall F-test, this rate is estimated by the ratio of the total number of erroneous inferences made in the 100 experiments to the total number of inferences made within them (in this case, 100 experiments × 3 contrasts per experiment = 300 contrasts).
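In the sketch, the denominator now shrinks to the contrasts made within the significant-F experiments:

```r
alpha_2 <- sum(n_err[sig_F]) / (sum(sig_F) * n_contrasts) # e.g. 100 x 3 = 300
```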

Accordingly, the experiment-wise error rate (αe) is defined as the ratio of the number of experiments with at least one erroneous inference (µi ≠ µi′ declared when µi = µi′) to the total number of experiments. Thus, considering the scenario where a = 3, r = 3, and CV = 1%, the unconditional experiment-wise error rate can be estimated by the ratio of the number of experiments with at least one erroneous inference among the three contrasts tested to the total number of experiments (in this case, 2,000).
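In the sketch, an experiment counts as erroneous as soon as it contains at least one falsely significant contrast:

```r
alpha_e <- mean(n_err > 0)  # share of the 2,000 experiments with >= 1 error
```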

The conditional experiment-wise error rate (α3) can be estimated by taking the total number of experiments that presented both a significant overall F-test and at least one comparison resulting in a type I error, divided by the total number of experiments. Considering that, for the scenario where a = 3, r = 3, and CV = 1%, only 100 of the 2,000 simulated experiments presented a significant overall F-test, this error rate is estimated by the ratio of the number of experiments (out of those 100) with at least one comparison resulting in a type I error to the total number of experiments (2,000).
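In the sketch:

```r
alpha_3 <- sum(sig_F & n_err > 0) / n_exp  # denominator: all 2,000 experiments
```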

Moreover, the same rate can be estimated relative to the experiments that presented a significant result in the overall F-test. This rate (α4) is calculated by dividing the number of trials that presented both a significant F-test and at least one comparison resulting in a type I error by the total number of trials with a significant F-test. For the scenario with a = 3, r = 3, and CV = 1%, considering that only 100 of the 2,000 simulated experiments presented a significant overall F-test, this rate is estimated by dividing the number of experiments (out of those 100) with at least one comparison resulting in a type I error by the total number of experiments with a significant overall F-test (in this case, 100).
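In the sketch, only the denominator changes relative to α3:

```r
alpha_4 <- sum(sig_F & n_err > 0) / sum(sig_F) # denominator: significant-F runs
```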

To verify whether each rate differed from the established nominal significance level (α = 5%), a lower limit of 0.0375 and an upper limit of 0.0625 were used, calculated from a 95% confidence interval (CI) for the proportion $\hat{p} = 0.05$, expressed as

$$CI = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\,(1-\hat{p})}{2{,}000}},$$

where $z_{\alpha/2}$ is the quantile of the standard normal distribution at the α level of significance. Thus, rates within this interval were not considered different from the established nominal level.
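For reference, this interval can be evaluated directly in R. Note that with the usual two-sided 95% quantile ($z_{0.025} \approx 1.96$) the computed limits are about 0.040 and 0.060, slightly narrower than the adopted limits of 0.0375 and 0.0625, which would correspond to a quantile of about 2.56:

```r
p_hat <- 0.05
z     <- qnorm(0.975)                         # two-sided 95% quantile, ~1.96
half  <- z * sqrt(p_hat * (1 - p_hat) / 2000)
c(lower = p_hat - half, upper = p_hat + half) # approx. 0.0404 and 0.0596
```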

Results and discussion

Across the 64 scenarios formed by the combination of the number of treatments (a = 3, 5, 7, and 9), the number of repetitions (r = 3, 4, 10, and 20), and the CV (1, 5, 10, and 20%), the error rates do not vary substantially with the number of repetitions or the CV (Tables 3, 4, 5, and 6). The same was noted by Girardi et al. (2009), who presented the comparison-wise and experiment-wise error rates for 80 scenarios formed by varying the number of treatments, the number of repetitions, and the CV of the experiments.

In all the simulated scenarios, the comparison-wise error rates are lower than the experiment-wise error rates (Tables 3 to 6) for the five multiple comparison tests evaluated, which is expected according to Girardi et al. (2009): equality between them would occur only if all contrasts were significant in every experiment with at least one significant contrast.

Regarding the comparison-wise error rates αc, the behaviour of the Tukey, SNK, and Scheffé tests is similar, since their estimates of this rate lie below the lower limit of the calculated 95% CI (0.0375) in all simulated scenarios (Tables 3 to 6). In general, an increase in the number of treatments causes a decrease in the error rate for these tests.

The Duncan test also presents estimates of αc below the lower limit of the 95% CI in most scenarios. However, for CV = 1%, the test controls this rate in most scenarios with a = 3 and a = 5, while for CV = 5, 10, and 20%, it controls this rate only in scenarios with a = 3. As with the Tukey, SNK, and Scheffé tests, an increase in the number of treatments causes a significant decrease in the Duncan test's error rate.

For Fisher’s LSD test, it is observed that the estimates obtained for the comparison-wise error rates αc always remain within the 95% CI calculated (Tables 3 to 6). In this case, the variation in the number of treatments does not cause significant changes to the results.

Thus, with respect to αc, Fisher’s LSD test proves to be the only one that can control for this error rate regardless of the number of treatments, repetitions, and the CV of the experiments; therefore, it is the most robust test in this case. The Duncan test, in turn, lies in an intermediate situation; it controls for this rate only in the scenarios in which the number of treatments is small, and is conservative in other cases. The Tukey, SNK, and Scheffé tests exhibit the worst performance; they are conservative in all simulated scenarios. In this case, the Scheffé test is the most conservative, followed by the Tukey and SNK tests.

Regarding the experiment-wise error rates αe, the Tukey and SNK tests exhibit similar behaviour, always presenting estimates of this rate within the 95% CI (Tables 3 to 6). In both cases, a variation in the number of treatments does not significantly influence the results.

Table 3
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 1%.
Table 4
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 5%.
Table 5
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 10%.
Table 6
Type I error rates of multiple comparison tests according to the number of treatments (a) and the number of repetitions of treatments (r) with coefficient of variation = 20%.

The Duncan and Fisher's LSD tests, in contrast, present values of αe above the upper limit of the 95% CI (0.0625), and an increase in the number of treatments causes a significant increase in this error rate (Tables 3 to 6).

For the Scheffé test, the estimates of αe mostly lie below the lower limit of 0.0375, except for two scenarios with a = 3 when CV = 1% and CV = 5% (Tables 3 and 4) and three scenarios with a = 3 when CV = 10% and CV = 20% (Tables 5 and 6), in which the test controls this rate. An increase in the number of treatments causes a significant decrease in this test's error rate.

In general, regarding αe, Tukey and SNK are the only tests that control for this rate and can therefore be considered robust under all experimental conditions, regardless of the number of treatments, the number of repetitions, and the CV. The Scheffé test, in turn, is in an intermediate situation, since it controls the experiment-wise error rate only in some scenarios with three treatments, while being conservative in the other cases. The Duncan and Fisher's LSD tests show the worst performance, being liberal in all simulated scenarios, with Fisher's LSD test the most liberal among them.

According to Girardi et al. (2009), equality of the comparison-wise and experiment-wise error rates at the established level of significance would be ideal for a multiple comparison test. However, according to Perecin and Barbosa (1988), a test that controls for the comparison-wise error rate can become very liberal when applied to the entire experiment, while a test that controls for the experiment-wise error rate can become conservative in a single comparison.

Indeed, the Tukey and SNK tests prove to be conservative in controlling the comparison-wise error rate while controlling the experiment-wise error rate at the nominal significance level. Similar behaviour was observed by Boardman and Moffitt (1971), Bernhardson (1975), and Girardi et al. (2009). Fisher's LSD test, which controls for the comparison-wise error rate, proves to be liberal in controlling the experiment-wise error rate, as observed by Boardman and Moffitt (1971), Bernhardson (1975), Perecin and Barbosa (1988), and Girardi et al. (2009).

Regarding the conditional error rates, the comparison-wise rates αc are always equal to the conditional rates α1 for the Scheffé test, while αc tends to be slightly higher than α1 for each of the other tests considered (Tables 3 to 6). Meanwhile, the conditional rates α2 are always higher than αc for all five tests. The differences between these last two rates are large when the number of treatments is small and decrease as the number of treatments increases.

Therefore, none of the means tests controls for the conditional comparison-wise error rates α1 or α2. For α1, all the tests present estimates below the lower limit of the calculated 95% CI, being conservative in controlling this rate. For α2, the tests generally show liberal behaviour, with estimates above the upper limit of the CI, except in some scenarios with a = 9 for the Tukey and SNK tests and in some scenarios with a = 7 and all scenarios with a = 9 for the Scheffé test, in which the tests control this error rate.

The unconditional experiment-wise error rates αe are always equal to the conditional rates α3 for the Scheffé test, whereas for each of the other tests, αe is slightly higher than α3 (Tables 3 to 6). Further, αe is always lower than the conditional rate α4, and although the differences between them are considerable, they decrease as the number of treatments increases. For the Scheffé test, however, the differences between these rates are not large in scenarios with a high number of treatments.

Regarding the control of the conditional experiment-wise error rates, the Duncan and Fisher's LSD tests control for the error rate α3 regardless of the number of treatments. The Tukey and SNK tests, in turn, control for this rate in all scenarios with a = 3 and with a = 3 or a = 5, respectively. In the other scenarios, when they do not control this rate, both tests are conservative. The Scheffé test proves to be conservative in most scenarios, controlling this rate only in certain scenarios with a = 3. For α4, all the tests show liberal behaviour, except the Scheffé test in certain scenarios with a = 9, where this rate is controlled.

Figures 1 and 2 summarise the behaviour of the means tests for each of the rates considered as the number of treatments increases. Only the variation in the number of treatments is shown, as this was the only factor that significantly altered the results; for simplicity, only the case with r = 10, CV = 10%, and α = 5% is reported, and the results for the other simulated scenarios are similar. The lines between the dots indicate how the rates change as the number of treatments increases.

Figure 1
(A) Unconditional comparison-wise error rates αc; (B) conditional comparison-wise error rates α1; and (C) conditional comparison-wise error rates α2 for the various multiple comparison tests with 10 replications, a nominal significance level of 5%, and coefficient of variation = 10%, according to the variation in the number of treatments.

Based on the above results, it can be inferred that the combined use of the overall ANOVA F-test and the multiple comparison tests can change the type I error rates of the means tests. The Duncan and Fisher's LSD tests do not control for the experiment-wise error rate αe, and an increase in the number of treatments leads to an increase in this rate. However, considering the conditional experiment-wise error rate α3, these tests control for this rate regardless of the number of treatments, because the nominal significance level of the ANOVA F-test determines the upper limit for the α3 error rates (Bernhardson, 1975).

Figure 2
(A) Unconditional experiment-wise error rates αe; (B) conditional experiment-wise error rates α3; (C) conditional experiment-wise error rates α4 for the various multiple comparison tests with 10 replications, nominal significance level of 5%, and coefficient of variation = 10%, according to the variation in the number of treatments.

In general, Fisher's LSD test controls for the comparison-wise error rate αc, whereas the Tukey and SNK tests control for the experiment-wise error rate αe. The Duncan and Fisher's LSD tests control for the conditional experiment-wise error rate α3. The Scheffé test does not control for any of the error rates considered, possibly because it is designed for all possible contrasts and not only for pair-wise contrasts of means (Boardman & Moffitt, 1971).

Conclusion

Since each multiple comparison test controls for a different error rate, the choice of test must depend on which error rate is to be controlled. If the decision is to control for the comparison-wise error rate αc, then Fisher's LSD test is the most suitable. If the decision is to control for the experiment-wise error rate αe, then the Tukey and SNK tests are recommended. If the intention is to control for the conditional experiment-wise error rate α3, then the Duncan and Fisher's LSD tests can be used. In general, the type I error rates did not change significantly with the number of repetitions or the CV, but did change with the number of treatments. When choosing the most applicable test, the power function should be considered in addition to the type I error rate: a well-performing test should maintain the nominal type I error rate while simultaneously having high power. An investigation of the power function is extensive and is therefore left to future studies.

Acknowledgements

We would like to thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for their financial support.

References

  • Bernhardson, C. S. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31(1), 229-232. DOI: https://doi.org/10.2307/2529724
  • Biase, N. G., & Ferreira, D. F. (2011). Testes de igualdade e de comparações múltiplas para várias proporções binomiais independentes. Revista Brasileira de Biometria, 29(4), 549-570.
  • Boardman, T. J., & Moffitt, D. R. (1971). Graphical Monte Carlo Type I error rates for multiple comparison procedures. Biometrics, 27(3), 738-744. DOI: https://doi.org/10.2307/2528613
  • Cardellino, R. A., & Siewerdt, F. (1992). Use and misuse of statistical tests for comparison of means. Revista da Sociedade Brasileira de Zootecnia, 21(6), 985-995.
  • Girardi, L. H., Cargnelutti Filho, A., & Storck, L. (2009). Erro tipo I e poder de cinco testes de comparação múltipla de médias. Revista Brasileira de Biometria, 27(1), 23-36.
  • Gonçalves, B. O., Ramos, P. S., & Avelar, F. G. (2015). Test Student-Newman-Keuls bootstrap: proposal, evaluation, and application productivity data of soursop. Revista Brasileira de Biometria, 33(4), 445-470.
  • Henrique, F. H., & Laca-Buendía, J. P. (2010). Comportamento morfológico e agronômico de genótipos de algodoeiro no município de Uberaba - MG. FAZU em Revista, 7, 32-36.
  • Perecin, D., & Barbosa, J. C. (1988). Uma avaliação de seis procedimentos para comparações múltiplas. Revista de Matemática e Estatística, 6, 95-103.
  • R Core Team (2020). R: A language and environment for statistical computing. Vienna, AT: R Foundation for Statistical Computing. Retrieved on Jan. 9, 2021 from https://www.R-project.org/
  • Ramos, P. S., & Vieira, M. T. (2014). Bootstrap multiple comparison procedure based on the F distribution. Revista Brasileira de Biometria, 31(4), 529-546.
  • Rodrigues, J., Piedade, S. M. S., & Lara, I. A. R. (2016). Aplicação condicional de testes de comparação de médias a um resultado significativo do teste F global na análise de variância. Revista Brasileira de Biometria, 34(1), 1-22.
  • Saville, D. J. (2014). Multiple comparison procedures - Cutting the Gordian knot. Agronomy Journal, 107(2), 730-735. DOI: https://doi.org/10.2134/agronj2012.0394
  • Souza, C. A., Lira Junior, M. A., & Ferreira, R. L. C. (2012). Avaliação de testes estatísticos de comparações múltiplas de médias. Revista Ceres, 59(3), 350-354. DOI: https://doi.org/10.1590/S0034-737X2012000300008
