Modifications for the Tukey test procedure and evaluation of the power and efficiency of multiple comparison procedures

Modificações no procedimento para o teste de Tukey e poder e eficiência de testes de comparações múltiplas

STATISTICS

Armando Conagin (I); Décio Barbin (II)*; Clarice Garcia Borges Demétrio (II)

(I) IAC - C.P. 28 - 13001-970 - Campinas, SP - Brasil

(II) USP/ESALQ, Depto. de Ciências Exatas, C.P. 09 - 13418-900 - Piracicaba, SP - Brasil

* Corresponding author <debarbin@esalq.usp.br>

ABSTRACT

Multiple pairwise comparison tests of treatment means are of great interest in applied research. Two modifications of the Tukey test are proposed. The power of the unilateral and bilateral Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett tests, and of the modified Sidak, Bonferroni 1 and 2, and Tukey 1 and 2 tests, was compared using the Monte Carlo method. Data were generated for 600 experiments with eight treatments in a randomized block design, of which 400 had four blocks and 200 had eight blocks. The differences between the treatment means and the control were 30%, 20%, 15%, 10% and 5%. Two extra treatments did not differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were adopted. The power of all tests decreased as the differences from the control decreased. The unilateral and bilateral Student t, Waller-Duncan and Duncan tests showed the greatest number of significant differences, followed by the unilateral Dunnett, modified Sidak, modified Bonferroni 1 and 2, modified Tukey 1, SNK, REGWF, REGWQ, modified Tukey 2, Tukey, Sidak and Bonferroni tests. Relative to the unilateral Student t test, all tests lost considerable efficiency for each treatment-control difference as the differences between means decreased. The modified tests were always more efficient than their original versions.

Key words: multiple comparison statistical tests, type I errors, Monte Carlo method, power of tests

INTRODUCTION

In applied research, a hypothesis under investigation can be evaluated by conducting experiments that include different treatments. The results are generally submitted to an analysis of variance, in which a global null hypothesis H0 is tested with the F test and the means are compared by multiple comparison procedures (Hochberg & Tamhane, 1987; Hsu, 1996). A common practice is to compare new treatments with a control. In corn or wheat breeding, for example, new cultivars have to be compared with the main cultivar. In animal husbandry, new feeding treatments have to be compared with the main treatment in use. In medical research, promising new medicines have to be compared with the one currently adopted before the FDA in the USA or ANVISA in Brazil approves their commercialization.

The rejection region for the global null hypothesis H0 is generally chosen so that the probability of a Type II error (accepting a false null hypothesis) is as small as possible, with the Type I error rate either fixed in advance or not. For the comparison of means, the Type I error rate may be comparisonwise or experimentwise. The latter may be defined under the global null hypothesis, under a partial null hypothesis, or as the maximum experimentwise error rate (MEER), which is the preferred one.
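To make the distinction between comparisonwise and experimentwise error rates concrete, the following minimal Python sketch computes the experimentwise rate for k comparisons and the per-comparison levels given by the standard Bonferroni and Sidak corrections. It assumes independent comparisons, which is a simplification (comparisons that share a control mean and a common residual variance are correlated), and the function names are illustrative, not taken from the paper.

```python
def experimentwise_rate(alpha, k):
    """Probability of at least one Type I error in k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-comparison level that bounds the experimentwise rate by alpha (Bonferroni)."""
    return alpha / k


def sidak_alpha(alpha, k):
    """Per-comparison level giving experimentwise rate alpha under independence (Sidak)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


# Seven treatment-versus-control comparisons, as in an experiment with
# eight treatments (one of them the control), each tested at alpha = 0.05.
k = 7
ew = experimentwise_rate(0.05, k)
a_bonf = bonferroni_alpha(0.05, k)
a_sidak = sidak_alpha(0.05, k)
print(f"experimentwise rate: {ew:.3f}")
print(f"Bonferroni level: {a_bonf:.5f}, Sidak level: {a_sidak:.5f}")
```

The Sidak level is always slightly larger than the Bonferroni level, so the Sidak test is slightly less conservative.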

The behavior of certain statistical tests and their performance in terms of Type I error rate have been evaluated, for example, by Gabriel (1964), Boardman & Moffitt (1971), O'Neill & Wetheril (1971), Bernardson (1975), Hsu (1996) and many others, but many questions remain to be answered in this research field (Hocking, 1985).

Studies by Boardman & Moffitt (1971) on the comparisonwise Type I error rate for experiments with two to eleven identical treatments, under a true global null hypothesis H0, revealed that the Student t test maintained a rejection frequency very near the adopted value of α = 0.05. The Duncan test had values varying from near 0.05 for t = 2 treatments to near 0.025 for t = 11. The SNK, Tukey and Scheffé tests showed gradually smaller values, from 0.05 for t = 2 to near 0.01 for t = 11, below the adopted Type I error rate of 0.05.

For the experimentwise Type I error rate, adopting α = 0.05, the frequency for the t test increased from 0.05 for t = 2 to near 0.55 for t = 11; the Duncan test had values varying from near 0.05 for t = 2 to 0.25 for t = 11; the other three tests maintained frequencies near the nominal value α or gave smaller values. Similar results were obtained by Bernardson (1975) and Perecin & Barbosa (1988). Conagin (1998), Conagin et al. (1999), Conagin (1999) and Conagin & Gomes (2004) compared a large number of tests using different combinations of experiment size, number of treatments, number of replications and coefficient of variation. Conagin & Barbin (2006a, 2006b) evaluated the behavior of various tests and introduced the modified Sidak and modified Bonferroni 1 and 2 tests.

The aim of this study is to propose two modifications for the Tukey test and to evaluate the power and the efficiency of the 11 classical and five modified multiple comparison tests.

MATERIAL AND METHODS

Two modifications of the Tukey test are suggested, and the power of the unilateral and bilateral Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett tests, and of the modified Sidak, Bonferroni 1 and 2, and Tukey 1 and 2 tests, was compared using the Monte Carlo simulation method. All classical tests were calculated using the SAS (2003) software.

Data were generated for 600 experiments with eight treatments in a randomized block design, of which 400 had four blocks and 200 had eight blocks. The differences between the treatment means and the control were 30%, 20%, 15%, 10% and 5%; two extra treatments did not differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were adopted. The power of each test was evaluated as the percentage of significant differences obtained relative to the number of experiments performed. The two modifications of the Tukey test are briefly described in the following sections.
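As an illustration of the simulation design described above, the following Python sketch generates randomized block experiments and estimates the power of the unilateral Student t test for each treatment-versus-control comparison. It is not the authors' SAS program. The control mean of 100, the positive direction of the differences, the normally distributed block effects, the definition of the coefficient of variation relative to the control mean, and the function name `simulate_power` are all assumptions made for this sketch.

```python
import numpy as np
from scipy import stats


def simulate_power(n_experiments, n_blocks, rel_diffs, control_mean=100.0,
                   cv=0.10, alpha=0.05, seed=0):
    """Estimate the power of the one-sided t test (treatment vs control) in an RCBD."""
    rng = np.random.default_rng(seed)
    # Control first, then one treatment per relative difference (assumed positive).
    means = control_mean * (1.0 + np.array([0.0] + list(rel_diffs)))
    t = means.size
    sigma = cv * control_mean            # assumption: CV taken relative to the control mean
    df = (t - 1) * (n_blocks - 1)        # residual degrees of freedom of the RCBD
    rejections = np.zeros(t - 1)
    for _ in range(n_experiments):
        block = rng.normal(0.0, sigma, size=n_blocks)   # hypothetical block effects
        y = means[:, None] + block[None, :] + rng.normal(0.0, sigma, size=(t, n_blocks))
        # Residuals after removing treatment means and block means.
        resid = (y - y.mean(axis=1, keepdims=True)
                 - y.mean(axis=0, keepdims=True) + y.mean())
        s2 = (resid ** 2).sum() / df     # Residual Mean Square
        m = y.mean(axis=1)
        t_stat = (m[1:] - m[0]) / np.sqrt(2.0 * s2 / n_blocks)
        rejections += stats.t.sf(t_stat, df) < alpha    # one-sided p-value below alpha
    return rejections / n_experiments


# Differences of 30%, 20%, 15%, 10%, 5% and two treatments equal to the control.
rel_diffs = [0.30, 0.20, 0.15, 0.10, 0.05, 0.0, 0.0]
power_r4 = simulate_power(400, 4, rel_diffs)           # 400 experiments, four blocks
power_r8 = simulate_power(200, 8, rel_diffs, seed=1)   # 200 experiments, eight blocks
print(np.round(power_r4, 3))
print(np.round(power_r8, 3))
```

For the two treatments equal to the control, the estimated rejection frequency approximates the comparisonwise Type I error rate rather than power.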

Modified Tukey Test 1, TuM1

If the global null hypothesis H0 (τ1 = τ2 = ... = τt = 0, where τi, i = 1, ..., t, is the i-th treatment effect) is rejected, the researcher's main interest is to know how the t treatment means differ.

The Tukey test determines, for every pair of means, whether they are significantly different, and is based on a familywise error rate for k = t(t-1)/2 comparisons. The procedure tests the hypotheses $H_0: \mu_i = \mu_{i'}$ versus $H_a: \mu_i \neq \mu_{i'}$, $i \neq i' = 1, \ldots, t$, and $H_0$ is rejected at the significance level $\alpha$ if

$$|m_i - m_{i'}| > q\, s \sqrt{\frac{1}{r}} \quad \text{or} \quad |m_i - m_{i'}| > q\, s \sqrt{\frac{1}{2}\left(\frac{1}{r_i} + \frac{1}{r_{i'}}\right)},$$

where $m_i$ and $m_{i'}$ are the estimates of the means, $r_i$ and $r_{i'}$ are the numbers of replicates of treatments $i$ and $i'$ (the first form applies when every treatment has $r$ replicates), and $q = q_{t,\nu,\alpha}$ is the value of the studentized range for $t$ means and $\nu$ degrees of freedom associated with $s^2$, the Residual Mean Square.
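The rejection rule above can be sketched in Python as follows. This is a minimal illustration that assumes SciPy 1.7 or later, which provides the `scipy.stats.studentized_range` distribution; the function name `tukey_lsd` and the numerical inputs are hypothetical and were chosen only to resemble an experiment with eight treatments and four blocks.

```python
from math import sqrt

from scipy.stats import studentized_range


def tukey_lsd(s2, n_means, df, r_i, r_j, alpha=0.05):
    """Least significant difference q * s * sqrt[(1/2)(1/r_i + 1/r_j)] of the Tukey test."""
    # Upper-alpha quantile of the studentized range for n_means means and df degrees of freedom.
    q = studentized_range.ppf(1.0 - alpha, n_means, df)
    return q * sqrt(s2) * sqrt(0.5 * (1.0 / r_i + 1.0 / r_j))


# Hypothetical inputs: t = 8 treatments, r = 4 blocks, so df = (8 - 1) * (4 - 1) = 21,
# and a Residual Mean Square of 100 (s = 10).
lsd = tukey_lsd(s2=100.0, n_means=8, df=21, r_i=4, r_j=4)

# Two means are declared different when their absolute difference exceeds the lsd.
m_i, m_j = 130.0, 100.0
is_significant = abs(m_i - m_j) > lsd
print(round(lsd, 2), is_significant)
```

With equal replication the factor sqrt[(1/2)(1/r + 1/r)] reduces to sqrt(1/r), which is the first form of the rule.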

One problem with the Tukey test is that it can be conservative (Carmer & Swanson, 1973) because it is based on the studentized range. A procedure similar to that employed for the BM2 and SiM tests (Conagin & Barbin, 2006a, 2006b) can be used here. The first modification proposed here for the Tukey test, called TuM1, is to carry out all the preliminary phases used for BM2 and SiM and determine â, an estimate of the number of significant differences a. Since the null hypothesis H0 is rejected, the new hypothesis H'0 should have a range of t - â τi's equal to zero. The value of q is now obtained for t - â means and ν degrees of freedom and used to calculate the least significant difference (lsd = q s √[1/2(1/ri + 1/ri')]). Differences between means larger than this lsd value are declared statistically significant according to the TuM1 test.

Modified Tukey Test 2, TuM2

The procedure is similar to that used for TuM1, but now â is obtained by applying the original Tukey test: it equals the number of significant differences found (with the global null hypothesis H0 rejected), and the new hypothesis H'0 will have a range of t - â τi's equal to zero. The value of q is now obtained for t - â means and ν degrees of freedom and used to calculate the lsd. Differences between means larger than this lsd value are declared statistically significant according to the TuM2 test.
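TuM1 and TuM2 share the same final step: once â is available, the studentized range value is taken for t - â means instead of t. The Python sketch below illustrates only that step and treats â as a given input, since the way it is estimated differs between the two tests. The function name `modified_tukey_lsd`, the equal replication r, and the numerical inputs are assumptions of this sketch; the error raised when fewer than two means remain is also a choice made here, because the studentized range is not defined for a single mean.

```python
from math import sqrt

from scipy.stats import studentized_range


def modified_tukey_lsd(s2, t, a_hat, df, r, alpha=0.05):
    """lsd computed with q for (t - a_hat) means and df degrees of freedom."""
    reduced_range = t - a_hat
    if reduced_range < 2:
        # The studentized range needs at least two means; this case is left to the caller.
        raise ValueError("t - a_hat must be at least 2")
    q = studentized_range.ppf(1.0 - alpha, reduced_range, df)
    return q * sqrt(s2) * sqrt(1.0 / r)


# Hypothetical inputs: t = 8 treatments, r = 4 blocks, df = 21, Residual Mean Square = 100.
lsd_original = modified_tukey_lsd(s2=100.0, t=8, a_hat=0, df=21, r=4)   # plain Tukey lsd
lsd_modified = modified_tukey_lsd(s2=100.0, t=8, a_hat=3, df=21, r=4)   # q for 5 means
print(round(lsd_original, 2), round(lsd_modified, 2))
```

Because q decreases as the number of means decreases, the modified lsd is smaller than the original one, which is why the modified tests can declare more differences significant.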

The argument for accepting that â is generally smaller than k is as follows: if the treatments are ranked, treatments that lie far apart have differences that are probably statistically significant. However, two treatments that are consecutive in the ordered set generally do not differ significantly, owing to the size of the experimental error, a small number of replications or other causes. One or more such situations suffice to make â smaller than k in the BM2 and SiM tests.

Regarding TuM1, for which the range is t (the number of treatments in the experiment), â may be larger than t when all pairwise comparisons between means are performed. In this case, and for coherence, a restriction is imposed: use TuM1 if â < (t-1) and use TuM2 if â > (t-1).

RESULTS AND DISCUSSION

The power of each test was higher for r = 8 than for r = 4; for the largest difference (30%) the power of all tests was high, although differences among tests occurred (Table 1). As the true value of the differences decreased, the power of each test decreased and the differences in power among the tests increased. The unilateral Student t test was somewhat more powerful than the bilateral Student t test, followed by the Waller-Duncan, Duncan, unilateral Dunnett, SiM, BM1, BM2, TuM1, SNK, REGWF, REGWQ, TuM2, Tukey, Sidak and Bonferroni tests.

The new modified tests are of the MEER type. The efficiency of each test relative to the unilateral Student t test is shown in Table 2. The discrepancy in efficiency always increased as the true difference (30%, 20%, 15%, 10% and 5%) decreased. This is very important because, in breeding programs and other types of research, further progress tends to become harder to obtain and the gains smaller. The power and the efficiency of the various tests were always greater for r = 8 than for r = 4, and the power of the modified tests was always greater than that of their original versions.
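The text does not give the formula used for efficiency. A minimal Python sketch, assuming that efficiency is the ratio between the power of a test and the power of the reference test (here the unilateral Student t test), is shown below. The function name and the power values are purely illustrative and are not taken from Tables 1 or 2.

```python
def relative_efficiency(power_test, power_reference):
    """Ratio of the power of a test to the power of the reference test (assumed definition)."""
    if power_reference <= 0:
        raise ValueError("the reference power must be positive")
    return power_test / power_reference


# Illustrative values only: a test that detects a difference in 40% of the
# experiments, against a reference test that detects it in 80% of them.
eff = relative_efficiency(0.40, 0.80)
print(eff)
```

Under this definition, a value below one means the test is less powerful than the reference for that difference.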

The efficiency of the Bonferroni, Sidak and Tukey tests relative to their respective modified versions was always smaller than one, and it decreased rapidly as the true differences (30%, 20%, 15%) decreased. If the error rate adopted (α = 0.05) for the comparison of two means satisfies the researcher, then a comparisonwise test such as the unilateral Student t test may be chosen. If the researcher wants an error rate α for the global H0 or H'0, then an experimentwise test must be applied. The values shown in Table 1 may help in this choice. When two means are compared, the software used generally gives the exact probability p of the test, which helps to better evaluate the degree of confidence in the result.

The efficiency of the modified tests BM2 and SiM is about the same as that of Dunnett's unilateral test (Tables 1 and 2), but their advantage increases when all paired comparisons are made. In this case they are the most efficient of all experimentwise MEER-type tests. The performance of TuM1 surpasses that of the experimentwise SNK, REGWF, REGWQ, Tukey, Sidak and Bonferroni tests.

Table 3

ACKNOWLEDGEMENTS

To Silvio Sandoval Zocchi, for his collaboration on the final edition of the present paper.

Received May 25, 2007

Accepted January 07, 2008

  • BERNARDSON, C.S. Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, v.31, p.229-232, 1975.
  • BOARDMAN, T.J.; MOFFITT, D.R. Graphical Monte Carlo Type I error rates for multiple comparison procedures. Biometrics, v.27, p.738-744, 1971.
  • CARMER, S.G.; SWANSON, M.R. An evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods. Journal of the American Statistical Association, v.68, p.66-74, 1973.
  • CONAGIN, A. Discriminative power of the modified Bonferroni's test. Revista de Agricultura, v.73, p.31-46, 1998.
  • CONAGIN, A. Discriminative power of the modified Bonferroni's test under general and partial null hypothesis. Revista de Agricultura, v.74, p.117-126, 1999.
  • CONAGIN, A.; BARBIN, D. Bonferroni's modified tests. Scientia Agricola, v.63, p.70-76, 2006a.
  • CONAGIN, A.; BARBIN, D. Poder e eficiência dos diferentes testes estatísticos para comparações múltiplas. Revista de Agricultura, v.81, p.118-137, 2006b.
  • CONAGIN, A.; IGUE, T.; NAGAI, V. Poder discriminativo de diferentes testes de médias. Campinas: Instituto Agronômico, 1999. (Boletim Científico, 44).
  • CONAGIN, A.; GOMES, F.P. Escolha adequada dos testes estatísticos para comparações múltiplas. Revista de Agricultura, v.79, p.288-295, 2004.
  • GABRIEL, K.R. A procedure for testing the homogeneity of all sets of means in analysis of variance. Biometrics, v.20, p.459-477, 1964.
  • HOCHBERG, Y.; TAMHANE, A.C. Multiple comparison procedures. New York: John Wiley, 1987. 450p.
  • HOCKING, R.R. The analysis of linear models. Belmont: Brooks/Cole, 1985. 385p.
  • HSU, J.C. Multiple comparisons. London: Chapman and Hall, 1996. 277p.
  • O'NEILL, R.; WETHERIL, G.B. The present state of multiple comparison methods. Journal of the Royal Statistical Society, v.33, p.218-250, 1971.
  • PERECIN, D.; BARBOSA, J.C. Uma avaliação de seis procedimentos para comparações múltiplas. Revista de Matemática e Estatística, v.6, p.95-103, 1988.
  • SAS INSTITUTE. System for Microsoft Windows, release 9.1 (TS2M0). Cary: SAS Institute, 2003. CD ROM.

Publication Dates

  • Publication in this collection
    21 July 2008
  • Date of issue
    2008

São Paulo - Escola Superior de Agricultura "Luiz de Queiroz" USP/ESALQ - Scientia Agricola, Av. Pádua Dias, 11, 13418-900 Piracicaba SP Brazil, Tel.: +55 19 3429-4401 / 3429-4486, Fax: +55 19 3429-4401 - Piracicaba - SP - Brazil
E-mail: scientia@usp.br