Spatial statistical analysis and selection of genotypes in plant breeding

The objective of this study was to evaluate the efficiency of spatial statistical analysis in the selection of genotypes in a plant breeding program and, particularly, to demonstrate the benefits of the approach when experimental observations are not spatially independent. The basic material of this study was a yield trial of soybean lines, with five check varieties (of fixed effect) and 110 test lines (of random effects), in an augmented block design. The spatial analysis used a random field linear model (RFML), with a covariance function estimated from the residuals of the analysis considering independent errors. Results showed a residual autocorrelation of significant magnitude and extension (range), which allowed a better discrimination among genotypes (increase of the power of statistical tests, reduction in the standard errors of estimates and predictors, and a greater amplitude of predictor values) when the spatial analysis was applied. Furthermore, the spatial analysis led to a different ranking of the genetic materials, in comparison with the non-spatial analysis, and a selection less influenced by local variation effects was obtained.


Introduction
In plant breeding, two features indicate the preliminary phases of selective programs: the large numbers of new genotypes to be evaluated and the small amount of material for their propagation.Both of them limit the use of replications of these genetic treatments, which are frequently evaluated in a single experimental plot, i.e., without replications.Federer (1956) proposed the augmented experimental designs to deal with this type of limitation, which allow the adjustment of the test line (new treatment) means for environmental effects (blocks, lines, or columns) estimated on the basis of repeated check genotypes.The author also presented the corresponding methods of statistical analysis, based on ordinary least squares (OLS) and, therefore, on the assumption of independence among observations.
The limited availability of propagation material, such as seeds and tubers, on the other hand, forces the breeder to adopt small plots, usually with just one or two rows of plants.This increases the chance of violating the independence among observations assumed when using the OLS method, due to the likely similarity of observations of neighboring plots (Stroup et al., 1994).This phenomenon, referred to as spatial correlationalso called spatial dependence or autocorrelation -can seriously affect the comparison of treatments.Es & Es (1993) have demonstrated that when this correlation occurs, the statistical tests associated with contrasting treatments in plots nearer together have higher probabilities of type II error, which consists of different treatments appearing to be identical.On the other hand, higher probabilities of type I error, i.e., identical treatments appearing to be different, were observed in the contrasts between treatments in which plots were farther apart.
The traditional analysis of variance relies on randomization to neutralize the harmful effects of this type of correlation, but frequently this is not attained adequately (Stroup et al., 1994).For this reason, Kempton et al. (1994) support a greater use of methods that consider some accounting for spatial dependence to improve the precision of variety trials.Recent advances in statistics for spatially distributed data have provided a number of alternative methods.One interesting approach is that of Zimmerman & Harville (1991).In this analysis, the plot effect (trend + error) is modeled in such a way that the observations are collectively taken as a partial outcome of a random field, similar to predictive models used in geostatistical applications (Martínez, 1994).The model aims at estimating the general covariance function, which is used in estimation and prediction, through generalized least squares (GLS).Therefore, it is a mixed linear model with spatially correlated errors, called a random field linear model (RFLM).
Due to the relatively limited use of these techniques among plant breeders, it is necessary to assess their effects on selection of genotypes to finally demonstrate their true potential.This study illustrates the application of the RFLM approach, adapting it to the augmented block design, which is typical in the preliminary phases of the selection process in plant breeding.The attempt does not intend to represent the best spatial approach for the set of data analyzed, but, rather, to demonstrate the benefits of a less restrictive statistical analysis in comparison with the traditional one, based on spatially independent observations.

Material and Methods
The data used in this study were obtained in a soybean variety trial, with F 6:3 lines of the semi-early maturity group, conducted in the locality of Areão, municipality of Piracicaba, SP, Brazil, in 1999/1995.The trial is part of a selection program conducted to increase soybean yield, carried out by the Department of Genetics of Esalq/USP.Genetic materials were evaluated in augmented block design, witht t = 5 check varieties (Bossier, Davis-1, IAC-12, IAS-5 and Viçoja) and p = 110 test lines, distributed in b = 4 blocks with approximately 50 plots each.The plot corresponded to two rows of plants, spaced 0.6 m apart and 5 m long.Only grain yield data (kg ha -1 ) were considered here.For spatial statistical analysis, it was necessary, in addition, to obtain the distances (meters) among plots, which was done from the geographical coordinates of the center of each plot in the experimental field grid -COORDX represents the width coordinate of the plots and COORDY, the length coordinate.
Two mathematical models were used for statistical analysis: i) a model which assumes spatially independent observations; and ii) a model allowing spatial correlations among observations.In both cases, the effects of test lines were taken as random, and here were assumed to be derived from a single base population, that is, varying randomly about a common mean.For this reason, the independent error analysis here does not correspond to the fixed model (OLS).Thus, both are mixed models, despite the adjustment for checks.The only difference between them is the assumption on the experimental error.
In the case of spatially independent observations, this analysis is described as intergenotypic information recovery analysis (Wolfinger et al., 1997;Federer, 1998).One peculiarity of such analysis in the augmented designs is that the mathematical model needs to accommodate two types of treatment effects: fixed effects for the checks (t populations) and random effects for the test lines.These lines constitute the (t+1) th population, which is also assumed to have a fixed effect.Thus, in the first alternative (i), the observations can be individually characterized by the model (an adaptation to the model of Scott & Milliken, 1993, proposed by Duarte, 2000): Y ijk = µ + b j + c k + g i(k) + e ijk in which Y ijk is the observation in the plot with genotype i, stemming from population k in block j; µ is the constant common to all observations; b j is the fixed effect of the j th block (j=1,2,...,b); c k is the fixed effect of the k th population (k=1,2,...,t,t+1; here the check cultivars plus the population of lines); g i(k) is the effect of the i th genotype within the k th population, assumed to be fixed and with a null mean, if the genotype is a check (i (k) = 1), or random with independent distribution N(0, 2 g σ ), if the genotype is a test line (i (k) =1,2,...,p k , with Σp k = p); and e ijk is the random experimental error associated with the ijk th plot, which is assumed to be independent, that is, null covariance among the errors of different plots, and with distribution N(0, 2 e σ ).In model (ii), the term e ijk is assumed to have the distribution e ijk ~ N [0,C(h)], in which C(h) is the covariance between two errors of plots which are h units of distance apart (h≥0).If such errors are denoted by e (s) and e (s+h) , in which s represents the spatial position of the ijk th plot, in the RFLM approach, C(h) is defined as (Littell et al., 1996): Thus, it is assumed that the covariance of the errors is a function of the distance that separates the corresponding plots (f(h)).However, this is not predetermined, but is estimated from the "uniformity experiment" suggested by the residuals ( ijk e ˆ) of the independent error model adjustment.
The fixed effects are in parametric vector β; the random effects, in parametric vector γ, except the errors that are in vector ε; X and Z are incidence matrices of the effects contained in β and γ, respectively.The random genotypic effects (γ) are assumed, without loss of generality, to have a normal distribution with a null mean (φ) and matrix of covariance G = I 2 g σ (where I is an identity matrix).The experimental errors are presumed to have a normal distribution with a null mean and a generic matrix of covariance R. Thus, in the first model (i), R = I 2 e σ , while in the other (ii), R = Σ, i.e., a nondiagonal matrix with structure defined by the general covariance function and by the autocorrelation range.
The first step in this spatial analysis is the adjustment of the model which postulate spatial independence among observations.The components of variance 2 g σ and 2 e σ were estimated by restricted maximum likelihood (REML).The estimated residual vector of this adjustment is: representing the solution vectors of the mixed model equations (Henderson, 1984).The residuals were then used to estimate the spatial correlation structure.This was done graphically by means of a so-called semivariogram or simply variogram (Stroup et al., 1994).
In this graphic representation, estimated values of semivariance, (h) S ˆ, are plotted against their respective distances h, resulting in a scatter plot (sample variogram).The semivariance is defined as: S(h)=½Var[ which is estimated by N(h) being the number of differences at the distance h.In this graph, values (h) S ˆ that are distributed randomly as a function of h reflect independent observations (residuals).The typical configuration of spatial dependence among observations occurs if values (h) S tend to increase as h increases up to a certain distance (range), after which the semivariance stabilizes reaching a plateau (or sill).Less variability is associated with smaller distances.The spatial correlation range (a) is the mean distance influence of an observation (plot), asserted here to be uniform in all directions (isotropy).The sill ( σ 2 ) corresponds to the intrinsic variance of the variable under study (Var[e (s) ] = Cov[e (s) ,e (s) ]), which is also equivalent to the covariance between residuals of plots separated by a distance equal to or greater than the range (Cov[e (s) ,e (s+h) ], with h≥a).
There is an advantage in evaluating spatial dependence by means of the variogram.Under stationarity -spatial law unaffected by translation -the variogram has a direct and simple relation to the function of autocovariance C(h), that is: S(h) = σ 2 -C(h); in which σ 2 = C(h = 0) (Es & Es, 1993;Stroup et al., 1994;Pannatier, 1996).Thus, fitting a continuous model to the sample variogram, the corresponding spatial covariance function for this relation is obtained.The most commonly utilized variogram functions are the so-called spherical, exponential and Gaussian models (Grondona & Cressie, 1991;Zimmerman & Harville, 1991;Vieira, 2000).Due to the wide application of the variogram in geostatistics, software that facilitates this adjustment is available (ex: Variowin; Geo-Eas).In such programs, the search for the function that best fits the observational points is carried out by changing slightly the values of σ 2 and a.In the present case, the exponential model provided the best fit, corresponding to the following covariance function (for isotropic random fields): . After defining parameters ( σ 2 and a) and the general covariance function, the next step is fitting the model to account for spatial dependence (R = Σ).This involves obtaining estimates, predictors, and statistical tests related to treatment effects, which must be free from estimated autocorrelation effects.To evaluate only the effects of the spatial adjustment on the statistical analysis, the same estimate of 2 g σ obtained in the former analysis (under R = I 2 e σ ) was used.The following procedure consisted of resolving the mixed model equations (Henderson, 1984): , which solutions have already been reported.

Characterization of spatial covariance
The experimental data showed a positive spatial correlation of first-order to sixth-order, in the series of residuals (Table 1).This fact is indicative of the violation of spatial independence among observations postulated by the first model (under R = I 2 e σ ).Residuals of this analysis were not randomly distributed in the experimental field (Figure 1).Rather, there is a clear tendency for larger residual values ( ijk e ˆ) to be concentrated in the top of the field map graph, that is, to be associated with plots having smaller COORDX values.
This fact also determined a predominant gradient in the direction of plot widths (COORDX).Considering this is how the blocks were constructed, it is possible that such an orientation may not have been ideal.Given the features of the residual surface, which provides an estimate of the uniformity trial underlying the experiment, it is reasonable to suppose that a lengthwise blocking of the plots would have been more effective in controlling local variation.The possibility of making this diagnosis represents an advantage of the spatial approach, which creates perspectives for the application of alternative forms of a posteriori local control or post-blocking (Federer, 1998).
The variogram obtained for distances less than 30 m is showed in Figure 2. The configuration of the dots is typical of stochastic processes with spatial dependence, that is, with decreasing variability as distance decreases.After 20 m (range) the variability tends to stabilize.The value of this plateau represents the residual variance among independent plots, and the existence of the increasing variogram with a plateau is an indication that the intrinsic hypothesis of stationarity was satisfied (Vieira, 2000).Furthermore, on the assumption of isotropy, the continuous function that best fits the dots is the exponential semivariance model: , with: σ 2 =126450 (kg ha -1 ) 2 and a = 20.4 m.Consequently, the respective autocovariance function is expressed by . This defines the residual Table 1.First to tenth-order autocorrelations ( ρ ˆ) and the respective Durbin-Watson statistics (d) for the residual ( ijk e ˆ) series obtained from the adjustment of an augmented block design model with independent errors. (1)Marginal probability of the statistical test (H 0 : ρ =0) (one-sided to the left, in this case)., in which h is the distance that separates each two plots identified by a row and a column in the matrix.Thus, the spatial covariance inherent to the experiment was characterized.The implications of the use, or not, of this information in statistical analysis are evaluated in the following section.

Comparison of the spatial and non-spatial statistical analysis models
Models with a larger number of covariance parameters always exhibit better fit than those with a simpler structure.For this reason, comparative criteria that penalize the more parametrized models, such as the Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (BIC) should be adopted.Both are based on the value which maximizes the restricted likelihood logarithm, L REML (G,R), reduced from a function of the number of parameters.Thus, the model with the greatest AIC or BIC values should be preferred (Littell et al., 1996).The results in Table 2 show that the covariance structure R = Σ provides a better fit to the respective model in comparison with the independent error model (R = I 2 e σ ).With regard to statistical tests related to the genotypic effects, it was observed that variation among the six fixed populations was not significant in the first analysis (at the 5% level of probability), but reached high statistical significance (p-value<0.01) in the spatial analysis (Table 3).With further partitioning of the population effects, the F values were also higher under the spatial analysis, both in the detection of differences among checks (four degree of freedom) and in the contrast between checks and test lines (one degree of freedom).Considering the three contrasts chosen to illustrate the comparison among some of these lines, the superiority of the spatial analysis was again evident and even greater.While the analysis under R = I 2 e σ did not detect any difference among these genotypes (p-value>0.90), the spatial analysis showed that two of the three contrasts were significant (p-value<0.025).These results reflect greater genotypic discrimination ability under the spatial analysis, compared with the non-spatial procedure.
This superiority was confirmed when the predicted genotypic values (EBLUP) were considered.While in the first analysis these varied between -98.2 and 100.5 (complete data in Duarte, 2000), with a range of about Table 2. Characteristics of the fitting of non-spatial and spatial models for grain yield data (kg ha -1 ) in a soybean variety trial in an augmented block design.Table 3. Statistical tests of some genotypic effects obtained on the basis of spatial and non-spatial analysis models (1) . (1)Grain yield data, in kg ha -1 ; soybean variety trial in augmented block design. (2)NDF and DDF are the numbers of degrees of freedom of numerator and denominator of F (Snedecor) statistic, respectively; the last obtained by the Satterthwaite approximation.

Source
NDF (2)  Non-spatial analysis Spatial analysis Checks Checks vs. Test lines Line G1 vs. Line G3 Line G1 vs. Line G24 Line G3 vs. Line G24 200 kg ha -1 , in the spatial analysis the detected range was greater than 500 kg ha -1 (values between -337 and  200.5).This represents an increase of more than 150% in the differentiation among the test lines, in favor of the spatial analysis.The smaller standard errors associated with EBLUP also confirm the better genotypic discrimination of this analytic model.Pontes (2002) has demonstrated a gain of 7% in the efficiency of these predictors when an iterative process to estimate the variogram and its parameters (a and σ 2 ) was used.
When a selection intensity of 25% of the most productive lines was assumed (28 in 110 genotypes), a coincidence of only 46% between the two statistical analysis models was observed (Table 4).In addition, among the genotypes selected by the more traditional analysis (non-spatial), at least 30% would occupy poor ranking positions in the spatial analysis (up to fiftieth position).Examples include the following lines: USP 93-2048, USP 93-2393, USP 93-2153 and USP 93-2198.On the other hand, four lines classified in the spatial analysis as among the ten most productive would be discarded using the other analysis (under R = I 2 e σ ).
The disagreement between these selections can be better understood if the spatial positions of plots with the selected lines in the experimental field are considered.The evidence of the effect of spatial adjustment on selection can be seen in Figure 3.When the non-spatial model was used, the selected genotypes were located exclusively in the left side strip of the experimental field, probably its most fertile area.However, when the spatial adjustment was taken into account, the selected genotypes were detected in plots scattered throughout the whole experimental area.The predominance of genotypes from the left side-stripe can be explained as a result of possible remaining fertility effects or of the breeder's preference in allocating genotypes of the same parent side by side.In any event, what is expected from experiments of this nature is an outcome as shown in part (b) of Figure 3 rather than one displayed in its part (a).Similar results are also reported by Besag & Kempton (1986), Cullis et al. (1989), and Kempton & Gleeson (1997).
Considering that the cause of the divergence in the two selections was the genotypic adjustment for position effects, which are of purely environmental nature, it can be concluded that, in similar conditions, the use of spatial analysis can assure greater efficiency to the breeding programs.

Conclusions
1.In variety trials with large numbers of treatments and limited availability of propagation material, experimental observations can not be spatially independent; in such conditions, spatial analysis allows better discrimination among genotypes, because it provides increased power in statistical tests, reduced standard errors of genotypic estimates, and greater amplitudes among predicted values.
2. The spatial analysis can be led to a different ranking of the genetic materials, in comparison with the nonspatial analysis, and a selection less influenced by local variation; such differences may have important consequences for the final outcome of plant breeding programs.

Figure 1 .
Figure 1.Residuals (kg ha -1 ) of the augmented block design model adjustment, with recovery of test lines information, under independent errors as a function of the plot center coordinates, in meters (COORDX and COORDY).

Figure 2 .
Figure 2. Sample variogram (dots) of the residuals of an augmented block design analysis that assumed independent errors, and adjustment (continuous line) by the exponential semivariance model [a = 20.4m; s 2 =126450 (kg ha -1 ) 2 ].The dotted line illustrates the corresponding autocovariance function.

Figure 3 .
Figure 3. Location of plots with the most productive test lines (25%) among the 110 evaluated lines (unreplicated), using two models of statistical analysis: with R = I σ 2 (non-spatial) (a) and with R = Σ (spatial) (b) of a soybean variety trial, in augmented block design.