MODIFICATIONS FOR THE TUKEY TEST PROCEDURE AND EVALUATION OF THE POWER AND EFFICIENCY OF MULTIPLE COMPARISON PROCEDURES

Multiple pairwise comparison tests of treatment means are of great interest in applied research. Two modifications of the Tukey test are proposed. The power of the unilateral and bilateral Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett tests and of the modified tests (Sidak, Bonferroni 1 and 2, Tukey 1 and 2) was compared using the Monte Carlo method. Data were generated for 600 experiments with eight treatments in a randomized block design, of which 400 had four and 200 had eight blocks. The differences between the treatment means and the control were 30%, 20%, 15%, 10% and 5%. Two extra treatments did not differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were adopted. The power of all the tests decreased as the differences to the control decreased. The unilateral and bilateral Student t, Waller-Duncan and Duncan tests showed the greatest number of significant differences, followed by the unilateral Dunnett, modified Sidak, modified Bonferroni 1 and 2, modified Tukey 1, SNK, REGWF, REGWQ, modified Tukey 2, Tukey, Sidak and Bonferroni tests. All tests lose considerable efficiency relative to the unilateral Student t test, for each difference between treatment and control, as the differences between means decrease. The modified tests were always more efficient than their original versions.


INTRODUCTION
In applied research, the hypothesis under investigation can be evaluated by conducting experiments in which different treatments are included. Results are generally submitted to statistical analysis of variance, testing a global null hypothesis H0 using the F test and comparing the means by multiple comparison procedures (Hochberg & Tamhane, 1987; Hsu, 1996). A common practice is to compare new treatments to a control. In corn or wheat breeding, for example, new cultivars have to be compared to the main cultivar. In animal husbandry, new feeding treatments have to be compared to the main treatment in use. In medical research, promising new medicines have to be compared to the one currently adopted before the FDA in the USA or ANVISA in Brazil permits their commercialization.
Sci. Agric. (Piracicaba, Braz.), v.65, n.4, p.428-432, July/August 2008
The rejection region for the global null hypothesis H0 is generally chosen so that the probability of a Type II error (acceptance of a false hypothesis) is as small as possible, while the Type I error rate may or may not be fixed in advance. For the comparison of the means, the Type I error rate may be comparisonwise or experimentwise. The latter can be defined under the global null hypothesis, under a partial null hypothesis, or as the maximum experimentwise error rate (MEER), which is the preferred one.
The behavior of certain statistical tests and their performance in terms of Type I error rate have been evaluated, for example, by Gabriel (1964); Boardman & Moffitt (1971); O'Neill & Wetheril (1971); Bernardson (1975); Hsu (1996) and many others, but there are still many questions to be answered in this research field (Hocking, 1985).
Studies by Boardman & Moffitt (1971) on the Type I error rate per comparison, for experiments with two to eleven identical treatments under a true global null hypothesis H0, revealed that the Student t test maintained a frequency of rejection of the null hypothesis very near the adopted value of α = 0.05; the Duncan test had values varying from near 0.05 for t = 2 to near 0.025 for t = 11; the SNK, Tukey and Scheffé tests showed gradually smaller values, from 0.05 for t = 2 to near 0.01 for t = 11, below the adopted Type I error of 0.05.
For the experimentwise Type I error, adopting α = 0.05, the t test showed an increase in frequency from 0.05 for t = 2 to near 0.55 for t = 11; the Duncan test had values varying from near 0.05 for t = 2 to 0.25 for t = 11; the other three tests kept the frequencies near the nominal value α or gave smaller values. Similar results were obtained by Bernardson (1975) and Perecin & Barbosa (1988). Conagin (1998), Conagin et al. (1999), Conagin (1999) and Conagin & Gomes (2004) compared a great number of tests using different combinations of experiment size, number of treatments, replications and CVs. Conagin & Barbin (2006a, 2006b) evaluated the behavior of various tests and introduced the modified Sidak and modified Bonferroni 1 and 2 tests.
The aim of this study is to propose two modifications for the Tukey test and to evaluate the power and the efficiency of the 11 classical and five modified multiple comparison tests.

MATERIAL AND METHODS
Two modifications of the Tukey test are suggested, and the power of the unilateral and bilateral Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett tests and of the modified tests (Sidak, Bonferroni 1 and 2, Tukey 1 and 2) was compared using the Monte Carlo simulation method. All classical tests were calculated using the SAS (2003) software.
Data were generated for 600 experiments with eight treatments in a randomized block design, of which 400 had four and 200 had eight blocks. The differences between the treatment means and the control were 30%, 20%, 15%, 10% and 5%; two extra treatments did not differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were adopted. The power of each test was evaluated as the percentage of significant differences obtained relative to the number of experiments performed. A brief description of the modifications of the Tukey test is presented.
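The data generation and the power calculation described above can be sketched as follows. This is only an illustration under stated assumptions, not the SAS procedure used in the study: the control mean of 100, the block standard deviation, the direction of the differences (treatments above the control), the use of the control mean as the base for the 10% coefficient of variation, and all function names are hypothetical.

```python
import random

CONTROL_MEAN = 100.0   # assumed control mean (not given in the text)
CV = 0.10              # coefficient of variation of 10%
# Five treatments differing from the control plus two equal to it;
# together with the control this gives eight treatments.
DIFFS = [0.30, 0.20, 0.15, 0.10, 0.05, 0.0, 0.0]


def simulate_experiment(n_blocks, rng, block_sd=5.0):
    """Return data[block][treatment] for one randomized block experiment."""
    means = [CONTROL_MEAN] + [CONTROL_MEAN * (1.0 + d) for d in DIFFS]
    sigma = CV * CONTROL_MEAN  # residual standard deviation (assumption)
    data = []
    for _ in range(n_blocks):
        block_effect = rng.gauss(0.0, block_sd)  # assumed block variation
        data.append([m + block_effect + rng.gauss(0.0, sigma) for m in means])
    return data


def rbd_anova(data):
    """Return (treatment means, residual mean square, residual df)."""
    b, t = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (b * t)
    trt_means = [sum(row[j] for row in data) / b for j in range(t)]
    blk_means = [sum(row) / t for row in data]
    ss_res = sum(
        (data[i][j] - trt_means[j] - blk_means[i] + grand) ** 2
        for i in range(b)
        for j in range(t)
    )
    df = (b - 1) * (t - 1)
    return trt_means, ss_res / df, df


def power_percent(n_significant, n_experiments):
    """Power as the percentage of experiments with a significant difference."""
    return 100.0 * n_significant / n_experiments
```

With four blocks and eight treatments the residual has (4 - 1)(8 - 1) = 21 degrees of freedom, and with eight blocks it has 49.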

Modified Tukey Test 1, TuM 1
If the global null hypothesis $H_0: \tau_1 = \tau_2 = \dots = \tau_t = 0$, where $\tau_i$, i = 1, …, t, is the i-th treatment effect, is rejected, the main interest of the researcher is to know how the t treatment means differ.
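The Tukey test determines for every pair of means whether they are significantly different and is based on a familywise error rate for k = t(t-1)/2 comparisons. The procedure is to test the hypotheses $H_0: \mu_i = \mu_{i'}$ versus $H_a: \mu_i \neq \mu_{i'}$, for all $i \neq i'$, and $H_0$ is rejected at an α significance level if
$$|m_i - m_{i'}| > q \, s \sqrt{\tfrac{1}{2}\left(\tfrac{1}{r_i} + \tfrac{1}{r_{i'}}\right)}$$
where $m_i$ and $m_{i'}$ are the estimates of the means, $r_i$ and $r_{i'}$ are the numbers of replicates of treatments i and i', and $q = q_{t,n,\alpha}$ is the value of the studentized range for t means and n degrees of freedom associated with $s^2$, the Residual Mean Square.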
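As an illustration of the Tukey criterion just described, the sketch below computes the critical difference and lists the pairs of means declared different. It assumes SciPy 1.7 or later, which provides `scipy.stats.studentized_range`; the function names and the example numbers are hypothetical and are not taken from the study.

```python
import math

from scipy.stats import studentized_range  # requires SciPy >= 1.7


def tukey_lsd(q, s2, r_i, r_j):
    """Critical difference q * s * sqrt(1/2 * (1/r_i + 1/r_j))."""
    return q * math.sqrt(s2) * math.sqrt(0.5 * (1.0 / r_i + 1.0 / r_j))


def tukey_significant_pairs(means, s2, df, reps, alpha=0.05):
    """Return the index pairs (i, j) whose means differ by more than the lsd."""
    t = len(means)
    # Upper alpha quantile of the studentized range for t means and df
    # residual degrees of freedom; computed once because it is the same
    # for every pair.
    q = studentized_range.ppf(1.0 - alpha, t, df)
    pairs = []
    for i in range(t):
        for j in range(i + 1, t):
            if abs(means[i] - means[j]) > tukey_lsd(q, s2, reps[i], reps[j]):
                pairs.append((i, j))
    return pairs
```

For example, `tukey_significant_pairs([100, 130, 120, 115, 110, 105, 100, 100], 100.0, 21, [4] * 8)` applies the test to eight hypothetical means with a residual mean square of 100 and four replicates each.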
One problem of the Tukey test is that it can be conservative (Carmen & Swanson, 1973) because it is based on the studentized range. A procedure similar to that employed for the BM 2 and SiM tests (Conagin & Barbin, 2006a, 2006b) can be used here. The first modification proposed here for the Tukey test, called TuM 1, is to carry out all the preliminary phases used for BM 2 and SiM and determine â, an estimate of the number of significant differences a. As the null hypothesis H0 is rejected, the new hypothesis H'0 should have a range of t - â treatment effects $\tau_i$ equal to zero. The value of q is now obtained for t - â and n degrees of freedom and used to calculate the least significant difference, $\mathrm{lsd} = q \, s \sqrt{\tfrac{1}{2}(1/r_i + 1/r_{i'})}$. Differences between means larger than this lsd value are declared statistically significant according to the TuM 1 test.

Modified Tukey Test 2, TuM 2
The procedure to estimate a is similar to that used for TuM 1, but now â is obtained by applying the original Tukey test and is equal to the number of significant differences it detects (with the global null hypothesis H0 rejected). The new hypothesis H'0 will have a range of t - â treatment effects equal to zero. The value of q is now obtained for t - â and n degrees of freedom and used to calculate the lsd. Differences between means larger than this lsd value are declared statistically significant according to the TuM 2 test.
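The step shared by TuM 1 and TuM 2, obtaining q for a reduced range of t - â means, can be sketched as below. The value of â is taken as an input because the two tests differ only in how it is estimated. The function name is hypothetical, SciPy 1.7 or later is assumed, and the error raised when t - â is smaller than 2 is a choice made for this sketch: the text does not state how such cases are handled.

```python
import math

from scipy.stats import studentized_range  # requires SciPy >= 1.7


def modified_tukey_lsd(t, a_hat, df, s2, r_i, r_j, alpha=0.05):
    """lsd with q taken for a range of t - a_hat means instead of t."""
    k = t - a_hat
    if k < 2:
        # The studentized range needs at least two means; what to do here
        # is not specified in the text (assumption of this sketch).
        raise ValueError("t - a_hat must be at least 2")
    q = studentized_range.ppf(1.0 - alpha, k, df)
    return q * math.sqrt(s2) * math.sqrt(0.5 * (1.0 / r_i + 1.0 / r_j))
```

Because the studentized range quantile grows with the number of means, a positive â gives a smaller lsd than the original Tukey test with the same residual mean square and replication, which is why the modified tests detect more differences.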
The argument for accepting that â is generally smaller than k is as follows: if the treatments are ranked, treatments situated far apart have differences that are probably statistically significant. Nevertheless, two treatments that are consecutive in the ordered set generally do not differ significantly, because of the size of the experimental error, a small number of replications or other causes. One such situation is sufficient to make â smaller than k in the BM 2 and SiM tests.
Regarding TuM 1, for which the range is t (the number of treatments in the experiment), it is possible that, when all comparisons between two means are performed, â is larger than t. In this case, and for coherence, a restriction is imposed: use TuM 1 if â < (t-1) and use TuM 2 if â > (t-1).

RESULTS AND DISCUSSION
The power of each test was higher for r = 8 than for r = 4; for the largest difference (30%) the power of all tests is high, but differences occur (Table 1). When the real value of the differences decreases, the power of each test decreases and the difference in power among the tests increases. The unilateral Student t test was somewhat more powerful than the bilateral Student t test, followed by the Waller-Duncan, Duncan, unilateral Dunnett, SiM, BM 1, BM 2, TuM 1, SNK, REGWF, REGWQ, TuM 2, Tukey, Sidak and Bonferroni tests.
The new modified tests are of the MEER type. The efficiency of each test, calculated in relation to the unilateral Student t test, is shown in Table 2. The discrepancy in efficiency always increased as the true difference (30%, 20%, 15%, 10% and 5%) decreased.

Table 1 - Power of various statistical tests between treatments and the control for differences of 30%, 20%, 15%, 10%, 5% and 0% for eight treatments, four and eight replications and coefficient of variation CV = 10% (rounded values).
*The columns 0%c and 0%e show the comparisonwise and experimentwise Type I errors, respectively.