Adjusting the Scott-Knott cluster analyses for unbalanced designs

The Scott-Knott cluster analysis is an alternative approach to mean comparisons with high power and no subset overlapping. It is well suited for the statistical challenges in agronomy associated with testing new cultivars, crop treatments, or methods. The original Scott-Knott test was developed to be used under balanced designs; therefore, the loss of a single plot can significantly increase the rate of type I error. In order to avoid type I error inflation from missing plots, we propose an adjustment that maintains power similar to the original test while adding error protection. The proposed adjustment was validated from more than 40 million simulated experiments following the Monte Carlo method. The results indicate a minimal loss of power with a satisfactory type I error control, while keeping the features of the original procedure. A user-friendly SAS macro is provided for this analysis.


INTRODUCTION
A common problem in plant breeding is comparison of new genetic combinations.In order to detect significant difference among treatments, several Multiple Comparison Procedures (MCP) were developed: LSD (Fisher 1935), Tukey (1949), SNK (Student 1908, Newman 1939, Keuls 1952), Scheffé (1953), and Duncan (1955).Nonetheless, all these procedures can result in groups overlapping, where one treatment ends up belonging to two or more groups simultaneously (Calinski and Corsten 1985).This behavior usually prevents a clear division of the whole set into two or more groups of treatments and also leads to a more complex simultaneous analysis of multiple variables due to the presence of overlapping subsets.Thus, selection for advancement of new genetic combinations to the next step in the plant breeding program requires extra effort to overcome this statistical issue.
Cluster analysis is a promising solution to avoid subset overlapping from widely-used MCP (O 'Neill andWetherill 1971, Plackett 1971).One example of an intuitive and satisfactory approach, avoiding subset overlapping, is the use of cluster analysis over the Mahalanobis generalized distance (Rao 1952).Additionally, clustering techniques can be applied to taxonomy purposes since they have high affinity to Hotelling's Principal Component Analysis and Fisher's Discriminant Analysis (Hotelling 1933, Fisher 1936, Edwards and Cavalli-Sforza 1965).
TV Conrado et al.In 1974, Alastair J. Scott and Martin Knott publicized their idea of using the Maximum Likelihood (ML) ratio test to evaluate the significance of partitions from cluster analysis of sample treatment means in designs with an equal number of observations per treatment (Scott and Knott 1974).The first review of methods for Scott-Knott means separation suggesting their use in agronomics was provided several years afterward (Chew 1976).The Scott-Knott approach is an alternative to the MCP in a situation in which two or more internally homogenous subsets of sample treatment means are expected.It uses a univariate form of the divisive clustering procedure (Edwards and Cavalli-Sforza 1965) with a likelihood ratio test for determining when to stop the clustering process to create non-overlapping, distinct, and exclusive subsets of sample treatment means.The process orders the treatment means to minimize the number of possible treatment mean partitions to be pondered (Fisher 1958) and then maximizes the sum of squares between clusters to determine the best partitioning.Despite a significant increase in the calculation volume for every additional treatment even after the ordering of treatment means, it is still feasible, even manually, if the number of partitions remains lower than 12 (Scott and Knott 1974).Indeed, the computations are more onerous than an MCP (Carmer and Walker 1985).Nevertheless, it should not be a problem for any modern computer (Gates and Bilbro 1978).Some procedures with the same idea of partitioning means into non-overlapping groups were published after Scott-Knott (1974).These procedures presented variations in regard to the decision-making process and the clustering logic, ranging from agglomerative to divisive, hierarchical to non-hierarchical, but all of them ensure groups with no overlapping (Jolliffe 1975, Cox and Spjotvoll 1982, Calinski and Corsten 1985, Bozdogan 1986, Bautista et al. 1997, Di Renzo et al. 2002, Ciampi et al. 2008).
Many researchers prefer cluster analysis in order to facilitate interpretation and presentation of results since it results in non-overlapping, distinct, mutually exclusive groupings of the observed treatment means (Gates and Bilbro 1978, Carmer and Lin 1983, Calinski and Corsten 1985, Carmer and Walker 1985).This advantage is very clear when it is necessary to evaluate more than one variable simultaneously because the test easily allows for a positive selection of primary traits and a negative selection for any traits remaining to be evaluated.It can be effortlessly performed over the clustered data with multiple variables by initially applying filters to keep only higher performance clusters for the most important trait (i.e.yield) and then by removing some clusters of lower performance in the variables of secondary importance (i.e. plant height, biomass, etc.).This procedure should result in a highly reduced subset of treatments that present higher performance for the top priority trait, with a desirable level for the secondary traits.
An early evaluation of the Scott-Knott test with agglomerative procedures under scenarios where there is more than one true group of treatment means, or partial true null hypothesis (p-H 0 ), exposed the lack of an appropriate experimentwise type I error control.The result of simulations suggested that the test should be used only when the experiment has been performed with great precision, and it may be unsuitable for experiments where use of MCP would be considered inappropriate, such as those whose design and purpose suggest meaningful, orthogonal, linear contrasts with a single degree of freedom among the treatment means.However, the Scott-Knott test exhibited a higher ability to correctly reject the null hypothesis (power) and detect small differences between treatments than even the LSD test (Willavise et al. 1980).
Moreover, the Scott-Knott test has the highest rate of correct decisions and aptitude for improving performance as the number of treatments increases, in comparison with the SNK, Duncan, t-student, and Tukey tests (Silva et al. 1999, Borges andFerreira 2003).The test exhibits higher than nominal type I error rate when evaluated in simulated scenarios in which the null hypothesis (H 0 ) is false for some treatments (p-H 0 ), although for scenarios where the null hypothesis is true for all treatments, the empirical type I error rate is under nominal levels even for the experimentwise type I error rate (Di Renzo et al. 2002, Borges andFerreira 2003).
The Scott-Knott test also provides higher robustness compared to the MCP tests for mean separation in non-Gaussian distributions (Borges and Ferreira 2003).Despite the lack of control of type I error, the test demonstrates much higher Power than any MCP, although these two features, high robustness and power, are very common to most cluster analyses (Bautista et al. 1997, Silva et al. 1999, Di Renzo et al. 2002, Borges and Ferreira 2003).The Scott-Knott test displays similar type I and type II error in comparison to Bautista et al. (1997) andDi Renzo et al. (2002).However, its performance is superior to that of Jollieffe (1975) (Di Renzo et al. 2002).
Group homogeneity can be improved by changing the clustering approach from divisive to non-grouped treatment clustering (Bhering et al. 2008).It usually reduces the number of significantly different clusters -slightly increasing the number of treatments grouped in each one of the different clusters.In spite of this drawback, this consequence can be useful in plant breeding scenarios in which positive selection followed by retesting is applied, since it can shift a small number of treatments from an inferior cluster to a superior one.
Since most plant breeding designs are unbalanced, the objective of this research is to adjust and validate the Scott-Knott test in order to allow its use in experiments under partially balanced incomplete block designs or balanced designs with missing plots, since the non-adjusted procedure is only applicable to balanced designs.This paper proposes a novel solution for use of the Scott-Knott test under unbalanced designs followed by its validation.In order to ease its use, a user-friendly macro for the SAS/STAT ® software is also provided.

Description of the proposed adjustment procedure
The original Scott-Knott (1974) test begins by ranking all the k treatment means to be grouped and then by calculating B 0 from the k treatments partitioned in two smaller subsets.The B 0 value is calculated for every k -1 possible partition, and the partition with the highest value of B 0 is tested using λ as two distinct subsets of treatment means.The test uses the circumference constant π (=3.14159…) and related adjusts to approximate the λ distribution to the χ 2 distribution.
If the chi-square test with ( k π-2 ) degrees of freedom rejects the null hypothesis, the process repeats; each one of these distinct subsets is, in turn, further subdivided until each of the final clusters is shown to be homogeneous by a likelihood ratio test on λ. (i) The statistic λ (i) depends on B 0 , which is the maximum value from the sum of squares of all the possible partitions of k treatments into two groups, and on σ ˆ 2 0 , which is the maximum likelihood estimator of σ 2 for treatments under the null hypothesis.
Equation (ii) shows how υs 2 is used where s 2 represents an unbiased estimator of σ 2 associated with υ degrees of freedom, y i is the treatment mean i, and y is the mean of all k treatments.The variable n is the number of replications, or the total number of blocks according to the experimental design.Moreover, equation (iv) used in a balanced experimental design can be modified and expressed as equation (v), where a different number of observations for every treatment is also permitted.After the modification, the corrected unbiased estimator of s 2 c can change according to the SE yi of treatments in the partitioned set.Thus, in order to accommodate subsets of treatments with unequal and equal numbers of observations, s 2 c should be calculated for every null hypothesis before testing the statistic λ against a χ 2 distribution with the associated υ degrees of freedom.Hence, for every clustering step, s 2 c can change to adapt to the number of observation of each treatment in the current clustering process. (v) Along with correction of s 2 c , the raw treatment mean y i should be replaced by y ˆi , which is the treatment mean adjusted to the effect of the unequal number of replications/blocks.The following changes in the original procedure are minimal and are disclosed in equations (vi).The notation λ c should be used to identify λ statistics while using the correction even though the testing process against the χ 2 distribution remains the same as the original procedure.

(vi)
As expected, the correction increases the σ ˆ 2 0c value as the number of observations per treatment decreases -lowering the final λ c value.This leads to a lower probability of rejecting the null hypothesis, which protects the test from the type I error.The unbalanced treatment adjustment maintains the same features and results as the original method in balanced treatment scenarios.Indeed, s 2 c only changes for clusters in an unbalanced condition (i.e., missing plots).When clustering the same experiment, after partitioning all treatment means with missing plots, the remaining clusters should have the same s 2 c value.It is important to keep in mind that since the process follows a hierarchical clustering sequence, the very same subset of treatment means with an unequal number of observations can be partitioned multiple times before composing the final specific cluster.Indeed, the calculation of s 2 c for every candidate partition that challenges the χ 2 distribution makes the adjustment hard to be calculated manually, but it provides satisfactory protection to the original Scott-Knott test without a significant reduction in power.

Validation of the proposed adjustment procedure
The s 2 c deduction can indicate how the correction affects the Scott-Knott test; nevertheless, it is necessary to quantify and compare the power and type I error of the adjustment while using it.in order to validate the proposed adjustment, use of the Monte Carlo method (Metropolis and Ulan 1949) is a suitable option to simulate experiments with known parameters and then evaluate the results by comparing the original test against the adjusted solution for unbalanced designs (Carmer and Swanson 1971, Silva et al. 1999, Borges and Ferreira 2003).For that purpose, more than 40 million experiments were simulated for multiple unbalanced levels combined with several α values.The simulation scheme is composed of three main branches: complete H 0 (μ 1 = μ 2 = μ 3 = ... μ I ), partial H 0 (μ 1 = ... = μ I/2 ≠ μ (I/2+1) = ... μ I ), and complete alternative hypothesis H 1 ( μ 1 ≠ μ 2 ≠ μ 3 ≠ ... μ I ).The first branch was used only to quantify type I error and the third only to measure power, while the second branch measures type I error and power.
All three branches contained nine levels of α (0.01, 0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, and 0.20).Within each α level, there were ten levels of missing data (0.00, 0.01, 0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, and 0.20).Since the second and third branches were used to evaluate the test Power they also exhibited four (1, 2, 3, and 4) levels of δ (true difference between two treatment means).In order to improve the robustness of the study, 50,000 experiments were simulated for all 810 Monte Carlo simulation setups across all three branches, culminating in a total of 40.5 million simulated experiments.Furthermore, every simulated experiment was composed of a random number of blocks (3 to 20) and a random number of treatments (4 to 100).Experiments with a number of observations lower than 50 were replaced to avoid a small number of degrees of freedom after data removal at random to reach the required missing level.The number of both blocks and treatments were from a uniform distribution.The effects of block and observation error were from a normal distribution with a mean of zero and a standard deviation of one.The differences between subsets were defined as the product of δ and σ x (standard error of the mean).After each experiment was generated, some plot values were removed at random.As the simulation removed plots randomly with no restriction, the minimum number of plots was set to one per treatment to avoid treatments with no plots.
Instead of measuring type I error per comparison, it was measured per experiment, a situation in which rejection of a single incorrect null hypothesis in an experiment scores as experimentwise type I error.This approach is more severe and general because it does not consider the number of treatments in the experiment (i.e., a higher number of treatments promotes an even higher number of contrasts, and it implies a higher probability of type I error).However, this approach should be able to make a better distinction between the original and adjusted procedures.Converging results were expected for both procedures (original and adjusted) under balanced designs.Thus, contrast can be observed only between balanced and unbalanced designs.
All 40.5 million experiments were simulated in SAS/IML ® and analyzed with SAS System for Windows 9.3 (SAS Institute 2011).The data were evaluated using the Generalized Linear Models Procedure (Proc GLM).Output of the adjusted means was grouped by a compiled macro.A recursive SAS local host multithread approach with isolated workplaces was used to speed up the simulation run time.
Stability of the process and the ability to suspend it was ensured by the use of macros capable of error handling, also oriented to processing batches of 5,000 experiments and logging all the processing responses.
Regarding the accuracy of the estimated type I error rates using Monte Carlo simulations, the exact binomial test was applied, contrasting the nominal significance level against the obtained empirical rate (Leemis and Trivedi 1996).In scenarios in which the exact binomial test rejected the null hypothesis (p < 0.01), the performance of the Scott-Knott test should be considered conservative when the empirical rate is lower than the nominal rate and should be considered liberal if higher.In scenarios in which the exact binomial test did not reject the null hypothesis, the tests were classified as precise.The F-value was obtained using equation (vii), where y represents the number of experiments with at least one type I error, α is the nominal significance level, and N is the number of simulated experiments (50,000).The p-value was found using υ 1 = 2(Ny) and υ 2 = 2(y + 1) degrees of freedom (Santos et al. 2001). (vii)

RESULTS AND DISCUSSION
Table 1 summarizes the results of 4.5 million simulated experiments.These experiments were simulated under the complete H 0 hypothesis (no real difference among treatments).For experiments with a balanced design (no missing plots), as the nominal α level increased, the empirical experimentwise type I error became higher.This persisted under experiments with missing plots using the proposed Scott-Knott adjustment, but reduced when the level of imbalance increased.It can be observed that the empirical values obtained using the Monte Carlo method for a balanced design (0% TV Conrado et al. of missing plots), in which the adjusted and non-adjusted procedures exhibit the same results, are below the nominal α level for values smaller than 0.05, but according to the exact binomial test, the difference is not significant.In contrast, the empirical value is significantly higher than the nominal value for some α levels higher than 0.10, which means that the original procedure should be considered liberal at these levels since it does not properly control the type I error even under the complete H 0 hypothesis.The intermittent classification for the alpha levels of 0.12 and 0.15 as a trend for empirical rates to surpass nominal rates as the nominal alpha level increases could be caused by approximation to the χ 2 distribution used by Scott-Knott (1974), but this thesis should be evaluated in further studies and does not belong to the scope of this study.
Moreover, in half of the simulated combinations, the experimentwise type I error was evaluated as significantly different from the nominal value by the exact binomial test.As expected, the adjustment led to a more conservative approach as the level of missing plots increased.This result suggested that in order to use the proposed adjustment, the user must take into account the level of imbalance (either from the planned design or from random loss of plots) before selecting the nominal α level.
In contrast, the adjusted and non-adjusted (original) Scott-Knott test exhibited a higher empirical experimentwise type I error rate than the nominal rate under p-H 0 (Table 2).It also showed a small increase in the experimentwise type I error rate when the level of missing plots became higher, but the magnitude of the experimentwise type I error rate reduced as the α level increased.This result validated the findings of Silva et al. (1999) and exposed the weakest point of the Scott-Knott test -the lack of control of experimentwise type I error under a p-H 0 .
Additionally, lower values of δ culminated in smaller differences in the experimentwise type I error rate between the adjusted and non-adjusted results of the Scott-Knott procedure (Figure 1).This trend persisted upon increasing the nominal α.Increasing α or δ led to a reduction in the difference in Power among balanced and unbalanced experimental designs (Table 3).It is also important to keep in mind that a higher value of δ indicates larger differences among the treatment values.Hence, it is easier for both procedures to detect these differences and reject the null hypothesis for any level of imbalance.The adjusted and non-adjusted tests exhibited lower Power for δ < 1.No significant differences in Power between the adjusted and non-adjusted procedures were noticed for δ > 1.Additionally, the adjusted Scott-Knott test maintained very high Power, even with a small α value under a complete H 1 (Figure 2).
However, as the level of imbalance got higher, there was  a small loss of power when using the proposed adjustment.This performance was expected since missing information causes lower ability to reject the null hypothesis due to the additional protection required to control type I error.The small loss of power is a suitable indicator for adjustment efficiency, which is very important since the Scott-Knott test is recognized for its high power, with superior performance over the LSD and other widely used MCP (Willavise et al. 1980, Silva et al. 1999, Borges and Ferreira 2003).In spite of that, there is a trend of power reduction as the number of members per cluster decreases.This has already been pointed out and is similarities between hierarchical and non-hierarchical procedures (Tasaki et al. 1987), but it should not be assumed to be common to all clustering procedures since the clustering procedure of Bozdogan (1986) shows exactly the opposite response.
Although the loss of power lowers the total number of clusters, it is a tolerable deficiency for scenarios where the entries that are wrongly clustered together should be retested in the next stage of research.Since the retesting routine is often used in plant breeding programs, this error is preferable to the possibility of the error of discarding an entry without a satisfactory level of confidence.Thus, as for the non-adjusted Scott-Knott procedure, it is necessary to understand the error tolerance of the experiment under evaluation before using the proposed adjustment.
It is noteworthy that even using the proposed adjustment, the most common cause of the type I error under p-H 0 for the Scott-Knott test is late compensation for incorrect partitioning in the previous step, as a consequence of divisive binary partitioning.This usually occurs in scenarios where the true number of clusters is different from powers of 2 or from the geometric sequences with common ratio 2 (data not shown).This unsatisfactory compensation is very noticeable when the true number of clusters is 3, which is a weakness common to various clustering procedures (Tasaki et al. 1987).If the gap between clusters is not clear enough, the maximum likelihood test may select a splitting point around the median by mistake.Then, in the next step, while it seeks for the point that maximizes the likelihood, it has a chance to correctly split the subset between the first and second clusters.A clear demonstration of this is an experiment with 9 treatments  TV Conrado et al.
truly distributed in 3 clusters, for example ABC/DEF/GHI, in which the test incorrectly performs the first partitioning as ABCDE/FGHI and then it differentiates the first true cluster from the rest of the subset, resulting in (ABC/DE)/FGHI.In following, for the same reason, the test can correctly discriminate the third true cluster from treatment F, culminating in 4 clusters: (ABC/DE)/[F/GHI].Although the first and third clusters are correct, the second cluster is improperly divided, increasing the type I error rate.This type of result is a consequence of adoption of a divisive hierarchical approach in order to allow comparison of the selected critical value which was obtained by empirical approximation, and afterwards, to declare the computed statistic significant or not (Carmer and Lin 1983).Some approaches avoiding hierarchical clustering have been published to avert this undesirable feature by simply allowing the creation of completely new clusters in every step of evaluation (Cox and Spjotvoll 1982, Calinsk and Corsten 1985, Bozdogan 1986).Despite that, the divisive hierarchical approach is still used for clustering (Di Rienzo et al. 2002, Valdano andDi Rienzo 2007).
Within plant breeding applications, the use of non-overlapping, mutually-exclusive subsets such as Scott-Knott creates a clear cutoff for the genotype advancement procedure, while results with multiple distinct subsets can help in financial management by assigning the right subset to an appropriate testing pipeline.Using the proposed adjustment procedure, this distinguishing feature is extended to experiments with missing data, which are very common in yield trials.For example, using cluster analysis on an unbalanced yield trial that results in 6 distinct subsets, the breeder would be able to submit solely the genotype subset partitioned in the highest category, "Group A", to be tested in the most accurate and expensive Pipeline I (the maximum number of locations in a randomized complete block design).Group B of genotypes could be placed in the intermediate Pipeline II (a smaller set of locations), and Group C and D could be tested in the lower cost Pipeline III (augmented blocks in the same locations as Pipeline II), while discarding the genotypes in Groups E and F (that have inferior performance compared to the commercial checks, clustered in Group C).After harvesting, the breeder can choose to retest only the superior genotypes from Pipeline III together with the new entries to be tested in Pipeline II or I.
A small drawback to the use of the proposed adjustment procedure is the increased complexity and volume of calculations in comparison to the non-adjusted procedure.Thus, in order to promote better dissemination of the proposed adjustment, a free compiled SAS GLM macro was developed and can be downloaded at http://www.tconrado.com/sas/sk.zip.The compressed file also contains an example to provide better understanding of the macro options and about how to use the software.
The proposed adjusted Scott-Knott procedure had performance similar to the original procedure under unbalanced experimental designs, with minimal loss of power, while maintaining satisfactory control of the experimentwise type I error and improved performance at α > 0.05.This adjustment increases the spectrum for use of the test, providing the researcher with an alternative to the MCP, even under a significant loss of experimental data (missing plots), and it is readily available for use in SAS.
Means Square Error (MSE) model is a good measure of variance, it is used as a satisfactory term for estimation of s 2 .Equation (iii) shows the relation between the unbiased estimator s 2 and the Standard Error of the Mean SE y , where RMSE is Root Mean Square Error.It is valid only under an equal number of observations for every treatment (n 1 = n 2 = ... = n k ).Additionally, under a balanced experimental design, SE y has the very same value for every treatment and leads to equation (iv), the base of the proposed adjustment, where the mean of the sum of the squares of SE y estimates s 2 .(iii) (iv) TV Conrado et al.

Figure 1 .
Figure 1.Empirical experimentwise error under the partial null hypothesis in the combination of three significance levels (ɑ) by four levels of true difference between two treatment means (δ).

Figure 2 .
Figure 2. Empirical power under the complete H 1 hypothesis in nine significance levels (ɑ) across ten unbalance levels.

Table 1 .
Empirical experimentwise type I error under no true difference between treatments

Table 2 .
Empirical experimentwise type I error under true difference between treatments of four standard errors of the mean (4σ x )

Table 3 .
Power of Adjusted Scott-Knott in several unbalance levels under the partial null hypothesis (H 0 ) under four levels of true difference between two treatment means (δ)