
Estimating gain by use of a classic selection index under multicollinearity in wheat (Triticum aestivum)


Samuel Pereira de Carvalho¹, Cosme Damião Cruz² and Claudio Guilherme Portela de Carvalho²

¹Departamento de Biologia Geral, Universidade Federal de Lavras, 37200-00 Lavras, MG, Brasil.

²Departamento de Biologia Geral, Universidade Federal de Viçosa, 36571-000 Viçosa, MG, Brasil. Send correspondence to C.D.C.

ABSTRACT

It was shown that the classic selection index, under multicollinearity, could not give simultaneous gains for wheat grain production and its primary components. This was due to the instability and, consequently, low precision of the coefficient index estimates. A modification of the prediction process of the index was proposed to avoid the adverse effects of multicollinearity, adopting a procedure based on ridge regression theory. The modified classic selection index, or ridge index, gave more statistically viable index coefficient estimates and gains for all of the characters evaluated. However, lower gains for number of grains per spike and grain yield were obtained, when compared to those obtained with selection for grain yield.

INTRODUCTION

Selection based on multiple traits has been shown to be more efficient than single-trait selection, because the net worth of the final product tends to be superior. Consequently, the improved breeding material has been more acceptable to farmers and consumers. The identification of superior genotypes, for a group of characteristics, can be done by the use of a selection index.

Selection indices classify individuals and progenies undergoing the selection process according to established criteria. The selection index proposed by Smith (1936) and Hazel (1943) is obtained by maximizing the correlation between the index itself and the aggregate genotype. In this case, reliable indices require well-estimated phenotypic and genetic variance-covariance matrices and correctly established relative economic values of the traits.

However, the index coefficient estimates can be adversely affected by multicollinearity among the traits involved. These effects can be understood by analogy with multiple regression analysis. Hoerl and Kennard (1970b) show that, in a multiple regression analysis, multicollinearity affects the squared distance between the least squares estimator $\hat{\beta}$ and the parameter $\beta$. When there are one or more close linear dependencies among the independent variables, the distance between $\hat{\beta}$ and $\beta$ is large. Webster et al. (1974) called attention to the elevated variance of $\hat{\beta}$ obtained under these conditions.

In the classic selection index, the coefficient vector is a function of the inverse of the phenotypic variance-covariance matrix (P matrix). If there is perfect multicollinearity among some of the variables considered, the P matrix will be singular, that is, a unique inverse will not exist. An infinite number of indices could be established from the generalized inverses of P, but none of them would have practical meaning. Perfect multicollinearity is an extreme, hypothetical case. Nevertheless, as P comes closer to this condition, the corresponding index becomes less reliable, because the variance associated with the index coefficients becomes larger as P approaches singularity.
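The loss of reliability as P approaches singularity can be illustrated numerically. The sketch below is a minimal illustration and not part of the original analysis: it assumes two traits with unit phenotypic variance and correlation r, a hypothetical vector Ga, and a small perturbation `eps` standing in for estimation error. It solves Pb = Ga before and after the perturbation and reports the largest change in b.

```python
import numpy as np

def index_coefficients(P, Ga):
    """Solve P b = G a for the classic index coefficients b."""
    return np.linalg.solve(P, Ga)

def sensitivity(r, eps=0.01):
    """Largest change in b when the first element of G a is perturbed by eps.

    Two traits with unit phenotypic variance and correlation r are assumed;
    the G a vector is hypothetical.
    """
    P = np.array([[1.0, r], [r, 1.0]])
    Ga = np.array([0.5, 0.4])
    b0 = index_coefficients(P, Ga)
    b1 = index_coefficients(P, Ga + np.array([eps, 0.0]))
    return float(np.max(np.abs(b1 - b0)))

for r in (0.5, 0.9, 0.99, 0.999):
    print(f"r = {r:<6} max change in b = {sensitivity(r):.4f}")
```

In this two-trait case the change equals eps / (1 - r^2), so the same small error in Ga moves the coefficients further and further as r approaches 1.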

In a multiple regression analysis, the adverse effects of multicollinearity can be avoided by ridge regression (Hoerl and Kennard, 1970a,b). The regression coefficients are estimated using a partially modified version of the normal equations:

$\hat{\beta}^{*} = (X'X + K I_n)^{-1} X'Y$,

where K is a small positive quantity added to the diagonal elements of the X'X matrix, taken in correlation form; $I_n$ is the identity matrix; and X'Y is the vector of correlations between the dependent variable and each independent variable. Since X'X is in correlation form, values of the constant K are generally taken in the interval 0 < K < 1. The correct determination of this constant is the essence of ridge regression, according to Hocking et al. (1976).

The ridge estimator ($\hat{\beta}^{*}$) is biased. Although its variance is a decreasing function of K, its squared bias is an increasing function of K. As K increases, the mean square error (variance plus squared bias) decreases to a minimum and then increases (Hoerl and Kennard, 1970b). The choice of K can be based on an examination of the ridge trace proposed by these authors, which is a plot of the estimate of each coefficient as a function of K. The chosen K should reduce the estimator variance enough to give more precise coefficient estimates, while introducing only a small bias and giving a mean square error (MSE) no larger than that of the least squares solution (LSS). Such a value exists, since there is always a ridge estimate with a smaller MSE than the LSS.
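The ridge estimator and its trace can be sketched in a few lines. The example below is an illustration on simulated data, not the authors' computation: the predictors, the response and the grid of K values are all hypothetical, and two of the predictors are deliberately made nearly collinear. X'X and X'Y are put in correlation form by standardizing the variables.

```python
import numpy as np

def ridge_coefficients(X, y, K):
    """Ridge estimate (X'X + K I)^-1 X'y with X'X and X'y in correlation form."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    n = X.shape[0]
    R = Xs.T @ Xs / n      # correlation matrix of the predictors
    r = Xs.T @ ys / n      # correlation of y with each predictor
    return np.linalg.solve(R + K * np.eye(X.shape[1]), r)

# Simulated (hypothetical) data with a near linear dependency
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.5 * x3 + rng.normal(size=n)

# Ridge trace: one row of coefficients per K value
for K in (0.0, 0.01, 0.05, 0.1, 0.2, 0.5):
    print(f"K = {K:<5}", np.round(ridge_coefficients(X, y, K), 3))
```

Plotting each coefficient against K gives the ridge trace; K = 0 reproduces the least squares solution for the standardized variables.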

In this study an attempt was made to adapt this process to the classic selection index calculation.

MATERIAL AND METHODS

The experimental material evaluated in this study consisted of 81 F3 families of wheat (Triticum aestivum L.) from plants selected in F2, derived from crosses between the cultivars EMBRAPA 16 x EMBRAPA 22. The F3 families were obtained by the Department of Genetic and Plant Breeding of Centro Federal de Educação Tecnológica do Paraná (CEFET/UNED-PR).

The selection trial was planted on May 21, 1996 in Pato Branco, PR, Brazil. A randomized complete block design with two replications was used. Each experimental plot included two 1-m long rows, spaced 30 cm apart, totaling an area of 0.6 m2. Ten seeds were sown per linear meter. Borders formed by planting an adapted cultivar (BR 26 - São Gotardo) were grown around the experiment.

The following traits were assessed:

a) Number of spikes (NS): the mean number of spikes per plant in the plot;

b) Number of grains per spike (NGS): the mean number of grains per spike;

c) Average weight of 1000 grains (P1000G): taken at the plot level this was determined by the formula:

d) Grain yield (GY): weight of the grain yield, expressed in grams per plant.

The variables NS, NGS, P1000G, and GY are interrelated by a multiplicative effect. Therefore, logarithmic transformation of the corresponding observations was used, resulting in LNS, LNGS, LP1000G, and LGY, respectively.
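The effect of the logarithmic transformation on a multiplicative relationship can be checked numerically. The sketch below assumes, for illustration only, that grain yield per plant is exactly NS x NGS x P1000G / 1000 and uses simulated trait values with base-10 logarithms; neither the data nor the choice of base comes from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 81                                     # number of families, as in the trial
ns = rng.uniform(3.0, 8.0, size=n)         # hypothetical spikes per plant
ngs = rng.uniform(20.0, 45.0, size=n)      # hypothetical grains per spike
p1000g = rng.uniform(28.0, 42.0, size=n)   # hypothetical 1000-grain weight (g)
gy = ns * ngs * p1000g / 1000.0            # assumed exact multiplicative model

logs = np.log10(np.column_stack([ns, ngs, p1000g, gy]))
lns, lngs, lp1000g, lgy = logs.T

# On the log scale the product becomes a sum, so LGY is an exact linear
# combination of the three components plus a constant.
residual = lgy - (lns + lngs + lp1000g - 3.0)

R = np.corrcoef(logs, rowvar=False)
print("max |residual|:", np.max(np.abs(residual)))
print("det(R):", np.linalg.det(R))
```

Under this assumption the correlation matrix of the four log-transformed traits is singular. With real measurements the relationship holds only approximately, so the matrix would be nearly singular rather than exactly so.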

An analysis of variance was carried out for each of the assessed traits, and estimates of heritability (h2%), genotypic coefficient of variation (CVg) and gains (GS%) by direct or indirect selection on GY and its primary components were obtained according to Cruz and Regazzi (1994). Expected gains from simultaneous trait selection were calculated using a selection index, the coefficients of which were obtained from a modification of the Smith (1936) and Hazel (1943) method. This modification was based on ridge regression theory.

The analysis of this experiment was carried out with the "Genes" software, developed at Universidade Federal de Viçosa. A selection intensity of 20% was used to calculate the expected gains by direct or indirect selection, as well as by use of the modified classic selection index.

RESULTS AND DISCUSSION

Methodology

A new variation of the original theory of the selection index proposed by Smith (1936) and Hazel (1943) is presented. In this presentation, fundamental adaptations are made for the understanding of the methodology proposed in this study.

The classic selection index (I) is defined as:

$I = b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$ (1)

where ($b_1, b_2, \ldots, b_n$) are the index coefficients, represented as the vector b', and ($X_1, X_2, \ldots, X_n$) are known phenotypic values, represented as the vector X'.

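By placing $\sigma_{fj}$ (the phenotypic standard deviation of the jth trait) in the numerator and denominator of each term of equation 1, it follows that:

$I = b_1 \sigma_{f1} \frac{X_1}{\sigma_{f1}} + b_2 \sigma_{f2} \frac{X_2}{\sigma_{f2}} + \cdots + b_n \sigma_{fn} \frac{X_n}{\sigma_{fn}}$

and considering

$x_j = X_j / \sigma_{fj}$, we obtain:

$I = p_1 x_1 + p_2 x_2 + \cdots + p_n x_n = p'x$

where $p_j = b_j \sigma_{fj}$

or, in matrix form,

$p = D_f b$,

where $D_f$ is the diagonal matrix whose non-zero elements are the phenotypic standard deviations of the studied traits.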

Similarly, the aggregate genotype (H) is defined as:

$H = a_1 G_1 + a_2 G_2 + \cdots + a_n G_n$ (2)

where ($a_1, a_2, \ldots, a_n$) are known relative economic values, represented as the vector a', and ($G_1, G_2, \ldots, G_n$) are unknown genetic values, represented as the vector d'.

By placing $\sigma_{gj}$ (the genetic standard deviation of the jth trait) in the numerator and denominator of each term of equation 2, we have:

$H = a_1 \sigma_{g1} \frac{G_1}{\sigma_{g1}} + a_2 \sigma_{g2} \frac{G_2}{\sigma_{g2}} + \cdots + a_n \sigma_{gn} \frac{G_n}{\sigma_{gn}}$

and considering

$g_j = G_j / \sigma_{gj}$, results in:

$H = q_1 g_1 + q_2 g_2 + \cdots + q_n g_n = q'g$

where $q_j = a_j \sigma_{gj}$

or, in matrix form,

$q = D_g a$

where $D_g$ is the diagonal matrix whose non-zero elements are the genetic standard deviations of the traits.

The following relationships were found:

where Rf, Rg and G are the phenotypic correlation, genetic correlation, and genetic variance-covariance matrices, respectively. The correlation between the index and the aggregate genotype is represented by rIH.

The classic selection index is obtained by maximizing rIH, resulting in the expression:

For practical purposes, the constant

does not affect the proportionality of the index. Therefore,

In the case of perfect multicollinearity among the variables, the Rf matrix will be singular, and consequently unsuitable values of the index coefficients will be obtained. An alternative way to calculate the vector of index coefficients is to add a constant K to the diagonal of the Rf matrix, similar to the ridge regression method proposed by Hoerl and Kennard (1970a,b):

in which b* is the vector of the modified classic selection index coefficients, or ridge index coefficients. In this case, the ridge index is defined as I* = b*'X. For K = 0, b* is equal to b. The newly calculated b* vector never gives larger values for the estimated correlation between the ridge index and the aggregate genotype (estimated rI*H) than those obtained using the classic selection index (estimated rIH). However, for some K values, it is possible to obtain stable estimates of $b_j$ ($j = 1, 2, \ldots, n$) and a satisfactory rI*H.
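The relationships behind this derivation can be written out as follows. This is a sketch under the usual Smith and Hazel assumptions (the phenotypic variance-covariance matrix is $P = D_f R_f D_f$, and the covariance between phenotypic and genetic values equals G); it is consistent with the definitions of p, q, $D_f$, $D_g$, $R_f$, $R_g$ and G given above, but the notation is not necessarily the authors' own.

```latex
\begin{align*}
V(I) &= p' R_f\, p = b' P\, b, \qquad P = D_f R_f D_f, \\
V(H) &= q' R_g\, q = a' G\, a, \qquad G = D_g R_g D_g, \\
\operatorname{Cov}(I, H) &= p' D_f^{-1} G D_g^{-1} q = b' G\, a, \\
r_{IH} &= \frac{b' G\, a}{\sqrt{(b' P\, b)(a' G\, a)}}, \\
b &\propto P^{-1} G\, a = D_f^{-1} R_f^{-1} D_f^{-1} G\, a, \\
b^{*} &= D_f^{-1} (R_f + K I_n)^{-1} D_f^{-1} G\, a .
\end{align*}
```

Setting K = 0 in the last line recovers the classic coefficients, in agreement with the statement that b* equals b when K = 0.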

Considering

, we have:

Under multicollinearity and adopting K = 0,

and under multicollinearity and adopting K ≠ 0,

in which

In this study, K was chosen by examining the ridge trace, obtained by plotting the b* estimates as a function of K in the interval 0 < K < 1. A K value was sought that stabilized the estimated b* and gave an estimate of rI*H not much lower than that obtained with the Smith (1936) and Hazel (1943) method.
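The search for K can be sketched as a loop over candidate values. The example below is a hypothetical illustration, not the computation performed in Genes: the matrices are invented so that the fourth trait is almost the sum of the other three, a small artificial error is added to G, and the correlation between the ridge index and the aggregate genotype is evaluated with the unmodified P, which is an assumption about how the estimated rI*H is obtained.

```python
import numpy as np

def ridge_index(Rf, sd_f, G, a, K):
    """Return (b_star, r) for b_star = Df^-1 (Rf + K I)^-1 Df^-1 G a.

    r is the correlation between I* = b_star'X and H, computed as
    b*'Ga / sqrt(b*'P b* . a'Ga) with the unmodified P = Df Rf Df.
    """
    Df_inv = np.diag(1.0 / sd_f)
    rhs = Df_inv @ G @ a
    b_star = Df_inv @ np.linalg.solve(Rf + K * np.eye(len(a)), rhs)
    P = np.diag(sd_f) @ Rf @ np.diag(sd_f)
    r = (b_star @ G @ a) / np.sqrt((b_star @ P @ b_star) * (a @ G @ a))
    return b_star, r

# Hypothetical matrices: trait 4 is the sum of traits 1 to 3 plus a little
# independent error, which makes P nearly singular.
A = np.vstack([np.eye(3), np.ones((1, 3))])
G = A @ np.diag([0.004, 0.003, 0.001]) @ A.T
E = A @ np.diag([0.006, 0.005, 0.002]) @ A.T + np.diag([0.0, 0.0, 0.0, 1e-5])
P = G + E
# Small artificial estimation error in G (hypothetical)
S = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, -1, 0]], dtype=float)
G_est = G + 1e-5 * S
sd_f = np.sqrt(np.diag(P))
Rf = P / np.outer(sd_f, sd_f)
a = np.array([3.0, 2.0, 3.0, 1.0])      # relative economic values 3:2:3:1

for K in (0.0, 0.01, 0.05, 0.1, 0.15, 0.3, 0.5, 0.9):
    b_star, r = ridge_index(Rf, sd_f, G_est, a, K)
    print(f"K = {K:<5} b* = {np.round(b_star, 3)}  r = {r:.3f}")
```

Each printed row is one point of the ridge trace together with its correlation. A researcher would look for the smallest K beyond which the coefficients change little while r remains close to its value at K = 0.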

The determinant and the condition number (NC) of the Rf matrix were calculated to diagnose multicollinearity. The condition number is the ratio between the largest and the smallest eigenvalue of the matrix. According to Montgomery and Peck (1981), multicollinearity becomes more intense as the determinant approaches zero. According to the same authors, multicollinearity is not a serious problem if NC < 100, is moderate to strong if 100 < NC < 1000, and is severe if NC > 1000.
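These diagnostics are straightforward to compute. The function below is a minimal sketch that applies the thresholds quoted above to any correlation matrix; the function name is hypothetical, and two choices are assumptions of this sketch: a non-positive smallest eigenvalue is reported as an infinite condition number, and the boundary values 100 and 1000 are assigned to the "moderate to strong" class.

```python
import numpy as np

def diagnose_multicollinearity(R):
    """Return (determinant, condition number, verdict) for a correlation matrix R."""
    eigvals = np.linalg.eigvalsh(R)          # ascending eigenvalues of symmetric R
    det = float(np.prod(eigvals))
    smallest, largest = eigvals[0], eigvals[-1]
    nc = float("inf") if smallest <= 0 else float(largest / smallest)
    if nc < 100:
        verdict = "not serious"
    elif nc <= 1000:
        verdict = "moderate to strong"
    else:
        verdict = "severe"
    return det, nc, verdict

# Two-trait examples with increasing correlation (hypothetical values)
for r in (0.5, 0.99, 0.9995):
    R = np.array([[1.0, r], [r, 1.0]])
    det, nc, verdict = diagnose_multicollinearity(R)
    print(f"r = {r:<7} det = {det:.6f}  NC = {nc:.1f}  {verdict}")
```

For two traits the eigenvalues are 1 - r and 1 + r, so the condition number is (1 + r) / (1 - r) and the determinant is 1 - r^2.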

APPLICATION

Selection by indices is especially important when indirect selection by correlated responses becomes inadequate. This can happen, for example, when the grain yield of a plant and its primary components are inversely correlated with each other. In wheat, the estimates of the gains by direct selection on LNS, LNGS, LP1000G and LGY show that selection for a single trait leads to gains that are not satisfactory for all traits (Table I).

Table I - Estimates of heritabilities (h2%) and gains from selection (GS%) obtained by direct and indirect selection on four traits evaluated in wheat, in one environment in Viçosa, MG, 1991/92.

h2%: Broad-sense heritability estimate of the trait; LNS: logarithm of the number of spikes; LNGS: logarithm of the number of grains per spike; LP1000G: logarithm of the average weight of 1000 grains; LGY: logarithm of the grain yield.

When selection is carried out to increase GY, for example, there is a reduction in P1000G. This can be understood by analyzing the estimated Rf matrix:

Number of grains per spike is correlated negatively with P1000G and positively with GY. Additionally, the correlation between P1000G and GY is practically zero. When GY increases, NGS also increases, and consequently P1000G is reduced. Satisfactory gains in all evaluated characteristics can be obtained by using selection indices.

An index that maximizes gains for GY and its primary components can be predicted using the Smith (1936) and Hazel (1943) method. However, the determinant and the NC of the Rf matrix are $-3.9044 \times 10^{-5}$ and 101214.2, respectively, which, according to Montgomery and Peck (1981), is evidence of severe multicollinearity. Thus the $b_j$ ($j = 1, 2, \ldots, n$) estimates obtained by this method are not statistically precise, nor do they make biological sense.

To remove the near singularity of the Rf matrix and allow a more precise estimation of the index coefficients, a constant quantity (K) was added to the diagonal of this matrix. Simulation studies were also carried out with this process to detect combinations of relative economic values that provided gains for all of the characteristics. Initially, following Cruz (1990), the genotypic coefficients of variation of LNS, LNGS, LP1000G and LGY (6.46:5.83:2.53:4.57, respectively) were used as relative economic values. Relative economic values can thus be established from statistics of the experimental data themselves, and the CVg is dimensionless and maintains the proportionality between the characters. The estimates of b* and of rI*H as a function of K, in the interval 0 < K < 1 and using CVg as the relative economic value, were determined (Figure 1 and Table II).


Figure 1 - Estimated ridge index coefficients (estimated b*) as a function of the K values, in the interval 0 < K < 1, using the genotypic coefficient of variation as the relative economic value.

Table II - Estimated correlation between the ridge index and the aggregate genotype (estimated rI*H) as a function of the K values, in the interval 0 < K < 1, using the genotypic coefficient of variation as the relative economic value.

Given the severe multicollinearity, the b* estimates changed sharply as K increased from zero and tended to stabilize beyond a certain value of K, as can be seen by examining the ridge trace (Figure 1). The value K = 0.15 was used for index prediction. This value stabilized the vector b* without drastically reducing rI*H (0.97 to 0.88). According to Kalil (1977), in the method based on the ridge trace the value of K is chosen following subjective criteria and can therefore vary from one researcher to another.

Expected gains based on the modified classic selection index, obtained by the addition of a constant K = 0.15 to the diagonal of the Rf matrix and by use of the genotypic coefficient of variation as a relative economic value, are found in Table III. The gain estimates obtained for the characters LNS, LNGS, LP1000G, and LGY, based on the index I = 2.19LNS + 2.23LNGS - 1.14LP1000G + 6.7LGY, were similar to those of selection for grain yield.

Table III - Expected gains (GS%) for four traits evaluated in wheat, based on the modified classic selection index, when K = 0.15 and the genotypic coefficient of variation is considered as a relative economic value (a).

For abbreviations see Table I.

The estimates of b* and of rI*H as a function of the K values, in the interval 0 < K < 1, with relative economic values equal to 3:2:3:1, were determined (Figure 2 and Table IV). For K = 0.1, the b* estimates tended to stabilize and rI*H was not drastically reduced (0.96 to 0.88).


Figure 2 - Estimated ridge index coefficients (estimated b*) as a function of the K values, in the interval 0 < K < 1, using relative economic values equal to 3:2:3:1.

Table IV - Estimated correlation between the ridge index and the aggregate genotype (estimated rI*H) as a function of the values assumed by the K constant, in the interval 0 < K < 1, using relative economic values equal to 3:2:3:1.

The use of the index I = 1.06LNS + 0.32LNGS + 1.09LP1000G + 2.28LGY made gains possible in all characters (Table V), although the gains for NGS and GY were proportionately lower than those obtained by selection for grain yield.

Trait      a    b*     GS(%)
LNS        3    1.06   1.74
LNGS       2    0.32   3.57
LP1000G    3    1.09   0.50
LGY        1    2.28   5.86

Table V - Expected gains (GS%) for four traits evaluated in wheat, based on the modified classic selection index, when K = 0.1 and relative economic values (a) equal to 3:2:3:1 are used.

For abbreviations see Table I.
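Applying such an index to family means is a matter of scoring and ranking. The sketch below uses the coefficients of the index reported above and the 20% selection intensity of the study, but everything else is an assumption: the family means are simulated, the function name is hypothetical, and the number of selected families is taken as 20% of the total rounded to the nearest integer.

```python
import numpy as np

TRAITS = ("LNS", "LNGS", "LP1000G", "LGY")
B_STAR = np.array([1.06, 0.32, 1.09, 2.28])   # coefficients of the index above

def select_families(log_means, coefficients=B_STAR, intensity=0.20):
    """Score families with the index and return (selected indices, scores).

    log_means: array (families x traits) of log-transformed family means,
    with columns ordered as in TRAITS.
    """
    scores = log_means @ coefficients
    n_selected = max(1, int(round(intensity * len(scores))))
    ranked = np.argsort(scores)[::-1]          # highest index score first
    return ranked[:n_selected], scores

# Simulated (hypothetical) log-transformed means for 81 families
rng = np.random.default_rng(3)
log_means = rng.normal(loc=[0.7, 1.5, 1.55, 0.75], scale=0.05, size=(81, 4))

selected, scores = select_families(log_means)
print(len(selected), "families selected")
print("selected family indices:", np.sort(selected))
```

With 81 families and a 20% intensity this keeps the 16 families with the highest index scores.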

RESUMO

Evidenciou-se a inviabilidade do uso do índice de seleção clássico, sob multicollinearidade, na obtenção de ganhos simultâneos para a produção de grãos de trigo e seus componentes primários. Esta inviabilidade foi devido à instabilidade e, consequentemente, pouca precisão das estimativas dos coeficientes do índice. A fim de contornar os efeitos adversos da multicollinearidade, propôs-se modificar o processo de predição do índice adotando-se um procedimento baseado na teoria de regressão em cristas. O índice de seleção clássico modificado proporcionou estimativas dos coeficientes do índice estatisticamente mais viáveis e ganhos em todos os caracteres avaliados. Contudo, com o uso deste índice, obtiveram-se ganhos inferiores para os caracteres número de grãos por espiga e rendimento de grãos, comparado aos obtidos pela seleção para rendimento.

REFERENCES

Cruz, C.D. (1990). Aplicação de algumas técnicas multivariadas no melhoramento de plantas. Doctoral thesis, Piracicaba, SP, USP/ESALQ.

Cruz, C.D. and Regazzi, A.J. (1994). Modelos Biométricos Aplicados ao Melhoramento Genético. Imp. Univ., Viçosa, MG, pp. 390.

Hazel, L.N. (1943). The genetic bases for constructing selection indexes. Genetics 28: 476-490.

Hocking, R.R., Speed, F.M. and Lynn, M.A. (1976). Class of biased estimators in linear regression. Technometrics 18: 426-437.

Hoerl, A.E. and Kennard, R.W. (1970a). Ridge regression: Applications to nonorthogonal problems. Technometrics 12: 69-82.

Hoerl, A.E. and Kennard, R.W. (1970b). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12: 55-68.

Kalil, M.N. (1977). Aplicação do método de regressão de cumeeira ("Ridge Regression") na estimação de funções de demanda e de produção. Master's thesis, Piracicaba, USP/ESALQ.

Montgomery, D.C. and Peck, E.A. (1981). Introduction to Linear Regression Analysis. John Wiley & Sons, New York, pp. 504.

Smith, H.F.A. (1936). A discriminant function for plant selection. Ann. Eugen. 7: 240-250.

Webster, J.T., Gunst, R.F. and Mason, R.L. (1974). Latent root regression analysis. Technometrics 16: 513-522.

(Received April 14, 1997)


Publication Dates

  • Publication in this collection
    02 June 1999
  • Date of issue
    Mar 1999

History

  • Received
    14 Apr 1997
Sociedade Brasileira de Genética Rua Cap. Adelmio Norberto da Silva, 736, 14025-670 Ribeirão Preto SP Brazil, Tel.: (55 16) 3911-4130 / Fax.: (55 16) 3621-3552 - Ribeirão Preto - SP - Brazil
E-mail: editor@gmb.org.br