Degree of multicollinearity and variables involved in linear dependence in additive-dominant models

The objective of this work was to assess the degree of multicollinearity and to identify the variables involved in linear dependence relations in additive-dominant models. Data on birth weight (n=141,567), yearling weight (n=58,124), and scrotal circumference (n=20,371) of Montana Tropical composite cattle were used. Diagnosis of multicollinearity was based on the variance inflation factor (VIF) and on the evaluation of the condition indexes and eigenvalues from the correlation matrix among explanatory variables. The first model studied (RM) included the fixed effect of dam age class at calving and the covariates associated with the direct and maternal additive and non-additive effects. The second model (R) included all the effects of the RM model except the maternal additive effects. Multicollinearity was detected in both models for all traits considered, with VIF values of 1.03–70.20 for RM and 1.03–60.70 for R. Collinearity increased with the increase in the number of variables in the model and with the decrease in the number of observations, and it was classified as weak, with condition index values between 10.00 and 26.77. In general, the variables associated with additive and non-additive effects were involved in multicollinearity, partially due to the natural connection between these covariables as fractions of the biological types in breed composition.


Introduction
In animal breeding studies, an obstacle to obtaining reliable results is the presence of linear correlations between explanatory variables, which is defined as multicollinearity. Multicollinearity is caused mainly by physical restrictions in the model or population, by sampling techniques, or by a model with excessive terms (Mason et al., 1975). In this situation, the ordinary least squares method, an important methodology used to estimate genetic parameters, yields unstable regression coefficients with large standard errors, leading to erroneous inferences (Bergmann & Hohenboken, 1995). Collinearity also makes the model outputs sensitive to changes in the database and to the addition or removal of variables in the model (Belsley, 1991). Moreover, it results in high variances, which are detrimental to hypothesis tests for regression coefficients, estimation, and prediction (Mansfield & Helms, 1982).
Problems related to multicollinearity in models for estimation of genetic effects in crossbred populations were reported in several studies (Cassady et al., 2002; Roso et al., 2005a; Pimentel et al., 2006; Toral et al., 2009; Lopes et al., 2010). Rodríguez-Almeida et al. (1997), in a study on direct and maternal additive effects for birth and weaning weights in multiracial populations, identified the presence of multicollinearity between direct and maternal heterosis, and between direct and maternal additive effects of the same biological type. Similarly, Roso et al. (2005b), working with purebred and crossbred animals from Angus, Blond d'Aquitaine, Charolais, Gelbvieh, Hereford, Limousin, Maine-Anjou, Salers, Shorthorn, and Simmental breeds, estimated high correlations between maternal dominant and direct epistatic effects as well as between direct and maternal additive effects. In those cases, multicollinearity was responsible for an overestimation of variance components, a bias in estimates of genetic effects, and greater standard errors associated with regression coefficients. Consequently, the efficiency of selection and crossbreeding strategies based on these results was affected.
The objective of this work was to assess the degree of multicollinearity and to identify the variables involved in linear dependence relations in additive-dominant models.

Materials and Methods
Data on birth weight (BW), yearling weight (YW), and scrotal circumference (SC) from 149,469 animals of the Montana Tropical breed born between 1994 and 2008 were used (Table 1). These individuals are progenies of 92,729 dams and 853 sires, providing genetic information from three generations (Brinks et al., 1961). The database is formed by animals reared in Brazil and Uruguay, and kept on tropical pastures, mostly on acid soils with Urochloa spp. grass.
Salt and mineral supplementation was given to the animals throughout the experimental period. Animals were grouped into contemporary groups (CG) that considered year of birth, herd, management group within herd, and sex. After initial selection, only animals with valid measurements and parentage information were kept in the database. Furthermore, records from CGs with fewer than five animals with valid measurements, with progenies of only one sire, or formed by individuals with only one breed composition were deleted from the database.
Since Montana Tropical is a multibreed population, the individuals from the different breed compositions were grouped according to the NABC system (Ferraz et al., 1999; Mourão et al., 2007). Two models were considered. The first one, denoted RM, included the fixed effects of dam age class at calving: AOD1 (less than 27 months of age), AOD2 (between 27 and 41 months), AOD3 (from 42 to 59 months), AOD4 (between 60 and 119 months), AOD5 (between 120 and 143 months), AOD6 (from 144 to 167 months), and AOD7 (more than 168 months). The covariates were associated with the direct (BTA, BTB, and BTC) and maternal (MBTA, MBTB, and MBTC) additive effects of the biological types and with the non-additive effects of direct (NxA, NxB, NxC, AxB, AxC, and BxC) and maternal (HM) heterozygosity. The second model, denoted R, considered the same effects as RM, with the exception of the maternal additive effects. For scrotal circumference, the age of the animal at measurement was also included in the models.
Coefficients for direct (BTA, BTB, and BTC) and maternal additive effects (MBTA, MBTB, and MBTC) of biological types were equal to the proportion of each biological type in the breed composition of the calf and in the breed composition of the dam, respectively. Because the sum of the proportions of biological types is equal to one, including all four types would create an exact linear dependence; therefore, the direct and maternal additive effects of the biological type N were excluded from the statistical models. The same strategy was adopted for dam age class at calving, for which the fourth class (AOD4) was excluded.
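The need to drop one biological type can be illustrated numerically. The following is a minimal sketch that uses simulated breed-composition fractions (hypothetical values, not the Montana Tropical records) and NumPy; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical breed compositions: each row holds the fractions of the
# biological types N, A, B, and C for one animal, and each row sums to one.
fractions = rng.dirichlet(alpha=[1.0, 1.0, 1.0, 1.0], size=50)
intercept = np.ones((fractions.shape[0], 1))

# Intercept plus all four fractions: the four fraction columns add up to the
# intercept column, so the design matrix is rank deficient.
X_all = np.hstack([intercept, fractions])
rank_all = np.linalg.matrix_rank(X_all)

# Dropping type N (the first fraction column) removes the exact dependence.
X_reduced = np.hstack([intercept, fractions[:, 1:]])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_all.shape[1], rank_all)          # 5 columns, rank 4
print(X_reduced.shape[1], rank_reduced)  # 4 columns, rank 4
```

Dropping one class removes only the exact dependence; the remaining fractions can still be strongly correlated, which is the near-dependence that the diagnostics in this study are meant to detect.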
The non-additive effects of heterozygosity were obtained by a linear relationship to the coefficients of direct (HD) and total maternal (HM) heterozygosity, which were calculated by the following equations (Roso et al., 2005b): $HD = 1 - \sum_{i=1}^{4} S_i D_i$ and $HM = 1 - \sum_{i=1}^{4} MGS_i \, MGD_i$, in which: the number 4 on top of the summation sign is the number of biological types (N, A, B, C); and $S_i$, $D_i$, $MGS_i$, and $MGD_i$ are the fractions of the i-th biological type of sire, dam, maternal grandsire, and maternal granddam, respectively.
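As an illustration, the heterozygosity coefficients can be sketched in Python, assuming they take the usual form of one minus the sum of products of parental fractions (HD from sire and dam, HM from the dam's own parents). The split into type-pair terms (NxA, NxB, and so on) shown in `pairwise_heterozygosity` is an assumption made here for illustration; the text only states that these covariates were obtained by a linear relationship to HD. Function names and the example animals are hypothetical.

```python
import numpy as np

TYPES = ("N", "A", "B", "C")

def heterozygosity(parent1, parent2):
    """Return 1 - sum_i p1_i * p2_i for two vectors of biological-type fractions."""
    p1 = np.asarray(parent1, dtype=float)
    p2 = np.asarray(parent2, dtype=float)
    return 1.0 - float(p1 @ p2)

def pairwise_heterozygosity(parent1, parent2):
    """Assumed split into type-pair terms: p1_i * p2_j + p1_j * p2_i for i < j."""
    p1 = np.asarray(parent1, dtype=float)
    p2 = np.asarray(parent2, dtype=float)
    out = {}
    for i in range(len(TYPES)):
        for j in range(i + 1, len(TYPES)):
            out[f"{TYPES[i]}x{TYPES[j]}"] = float(p1[i] * p2[j] + p1[j] * p2[i])
    return out

# Hypothetical mating: sire is 1/2 A and 1/2 B, dam is pure N (order N, A, B, C).
sire = [0.0, 0.5, 0.5, 0.0]
dam = [1.0, 0.0, 0.0, 0.0]

hd = heterozygosity(sire, dam)            # direct heterozygosity of the calf
pairs = pairwise_heterozygosity(sire, dam)
print(hd, pairs)
```

When both parental vectors sum to one, the six pairwise terms add up to the total coefficient, so these covariates are tied to each other and to the additive fractions by construction.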
Multicollinearity diagnosis was based on the variance inflation factor (VIF) and on the study of the condition indexes (CI) and eigenvalues from the correlation matrix among explanatory variables, all obtained with the REG procedure (PROC REG) of the SAS statistical software.
The variance inflation factor (VIF) for the predictor variable $X_i$ was obtained by the equation $VIF_i = 1/(1 - R_i^2)$, in which $R_i^2$ is the multiple determination coefficient for the linear regression of $X_i$ on the other covariates. The VIF describes the increase in the coefficient variance in the presence of multicollinearity (Freund & Littell, 2000). Therefore, the VIF was used to distinguish which covariates are possibly involved in quasi-dependence relations. Generally, values greater than ten for the covariates in the model suggest the existence of multicollinearity as the cause of estimation problems, such as ambiguity in the identification of important predictor variables, and direction and magnitude of regression coefficients contrary to the prior expectation or without biological meaning.
[Table 2, last row] Every breed composition that does not comply with the conditions above: 7,039; 2,456; 546.
(1) No animal with valid measurements for this trait complies with the criteria of the respective genetic group.
The determinant of the correlation matrix among the explanatory variables is equal to the product of the eigenvalues $\lambda_i$. In the presence of multicollinearity, one or more of these eigenvalues and, consequently, the determinant are close to zero. The condition index is calculated as $CI_i = (\lambda_{max}/\lambda_i)^{0.5}$, in which $\lambda_{max}$ is the largest eigenvalue and $\lambda_i$ is the i-th eigenvalue of the correlation matrix. Therefore, high CI values are indicators of dependence between the covariates because the corresponding $\lambda_i$ will be close to zero. Based on this, the CI was used to determine the number of collinearities in the model. CI values between 10 and 30 indicate weak multicollinearity, whereas CI values greater than 30 suggest strong multicollinearity (Belsley, 1991).
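The VIF and condition-index diagnostics described above can be sketched with NumPy. This is an illustration on simulated covariates, not a reproduction of the SAS output used in the study; the function name and the simulated data are assumptions. It relies on the fact that the i-th diagonal element of the inverse correlation matrix equals $1/(1 - R_i^2)$.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return VIFs, eigenvalues (descending), and condition indexes of corr(X)."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # The i-th diagonal element of the inverse correlation matrix equals
    # 1 / (1 - R_i^2), that is, the VIF of the i-th covariate.
    vif = np.diag(np.linalg.inv(R))
    eigvals = np.linalg.eigvalsh(R)[::-1]   # eigvalsh returns ascending order
    ci = np.sqrt(eigvals[0] / eigvals)      # CI_i = (lambda_max / lambda_i)^0.5
    return vif, eigvals, ci

# Simulated covariates: x3 is almost an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vif, eigvals, ci = collinearity_diagnostics(X)
print("VIF:", np.round(vif, 1))
print("Condition indexes:", np.round(ci, 2))
```

In this toy example all three covariates take part in the same near-dependence, so every VIF exceeds the threshold of ten and the largest condition index identifies a single collinearity.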
To detect which covariates are involved in linear dependences, the decomposition of variance associated with the eigenvalues was carried out according to Belsley (1991): $Var(\hat{\beta}) = \sigma^2 (X'X)^{-1} = \sigma^2 V \Lambda^{-1} V'$, in which: $\sigma^2$ is the estimated residual variance; $V$ is the matrix of eigenvectors of $X'X$; and $\Lambda$ is the diagonal matrix of its eigenvalues. With $V = (v_{ij})$, the variance of the i-th element of $\hat{\beta}$ can also be written as the sum of p components, each associated with one eigenvalue, as follows: $Var(\hat{\beta}_i) = \sigma^2 \sum_{j=1}^{p} v_{ij}^2/\lambda_j$, in which p is the number of explanatory variables. Because the eigenvalues are in the denominator, the variance components associated with linear dependences (small $\lambda_j$) will be relatively high compared with the other components. Therefore, a high proportion of the variance of two or more coefficients associated with the same small eigenvalue shows that the corresponding dependence is causing problems.
With $t_{ij} = v_{ij}^2/\lambda_j$ and $t_i = \sum_{j=1}^{p} t_{ij}$, the proportion of variance of the i-th regression coefficient associated with the j-th component of this decomposition is obtained by the equation $\pi_{ij} = t_{ij}/t_i$, for i = 1, 2, ..., p. To detect multicollinearity, Belsley et al. (2004) recommend the identification of the eigenvalues with CI greater than 30. The variables with variance decomposition proportion ($\pi_{ij}$) greater than 0.5 for each of these eigenvalues are candidates for involvement in linear dependence.
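The variance-decomposition proportions can be sketched as follows, working on the correlation matrix of simulated covariates. The function name, the simulated data, and the use of the 0.5 cutoff on the last component are illustrative assumptions, not the study's SAS workflow.

```python
import numpy as np

def variance_decomposition(R):
    """Return condition indexes and proportions pi[i, j] for correlation matrix R."""
    eigvals, V = np.linalg.eigh(R)          # columns of V are eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort components by descending eigenvalue
    eigvals, V = eigvals[order], V[:, order]
    t = V**2 / eigvals                      # t_ij = v_ij^2 / lambda_j (column-wise division)
    pi = t / t.sum(axis=1, keepdims=True)   # pi_ij = t_ij / t_i, each row sums to one
    ci = np.sqrt(eigvals[0] / eigvals)
    return ci, pi, t

# Simulated covariates: x3 is almost an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

ci, pi, t = variance_decomposition(R)
# Covariates with more than half of their variance tied to the smallest eigenvalue.
involved = np.where(pi[:, -1] > 0.5)[0]
print("Condition indexes:", np.round(ci, 2))
print("Covariates involved in the last component:", involved)
```

The row sums of `t` equal the diagonal of the inverse correlation matrix, so each $t_i$ coincides with the VIF of the i-th covariate; the proportions simply show how each VIF is distributed over the eigenvalues.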

Results and Discussion
With the reduction of covariates included in the analysis model (model R), a decrease was observed in the number of covariates involved in multicollinearity and in the VIF values. For birth weight, only BTC showed a VIF value greater than 10. For yearling weight, the covariates BTA, BTC, NxC, and AxC showed VIF values greater than the established threshold, whereas, for scrotal circumference, this was observed for the covariates BTC, NxC, AxC, and BxC. This result was expected, since in the presence of multicollinearity the results are sensitive to changes in the model and in the database (Belsley, 1991). Moreover, the exclusion of variables from the model is one alternative to mitigate multicollinearity effects on the results (Mason et al., 1975).
The same behavior was observed in the number of collinearities identified by CI values. For the RM model, two weak collinearities (CI=12.61 and 18.76) for birth weight, four weak collinearities (CI=10.00, 15.02, 19.79, and 26.77) for yearling weight, and three weak collinearities (CI=11.93, 18.33, and 24.73) for scrotal circumference were detected. Considering the R model, the following were observed: one weak collinearity (CI=11.28) for birth weight, two weak collinearities (CI=11.02 and 22.12) for yearling weight, and one weak collinearity (CI=21.17) for scrotal circumference. These CI values were associated with eigenvalues ranging from 0.01 to 0.03.
The fixed effect of dam age class at calving (AOD) and the covariate of maternal heterozygosity (HM) were not involved in multicollinearity, since these variables were not detected in linear dependence relations either by VIF or by the variance-decomposition proportions associated with the largest values of CI.
Multicollinearity can be a consequence of deficient sample data or of interrelationships among the variables that are inherent to the process under investigation (Chatterjee & Hadi, 2006). In these situations, not all combinations of predictor variables are represented by the data and, without data collected under all possible conditions, the effects of individual variables cannot be determined. Specifically for the present study, these circumstances occur due to the imbalance in the number of individuals in each genetic group, the restrictions imposed on the breed composition of the Montana Tropical breed (Table 2), and the natural connection between additive and non-additive effects, since this is a crossbred population and the fraction of a biological type in the animal breed composition depends on the proportions of the other biological types. As an example, no purebred animals from biological type C were measured for yearling weight and scrotal circumference, and only two individuals from this genetic group were considered for birth weight analyses. Therefore, no performance information was observed in the sample when BTC was high and the fractions of the biological types A, B, and N were low. Similarly, it is not possible to obtain records when the fractions of biological types are all high or all low, given that the sum of these proportions must be equal to one. As a result, independently of the diagnostic method used, the variables related to the additive and non-additive effects were detected as involved in a linear dependence relation, which was also reported by Rodríguez-Almeida et al. (1997) and Roso et al. (2005a, 2005b).
Based on the causes of multicollinearity, one solution to mitigate this regression problem is to collect additional data, so that more combinations among the explanatory variables are represented (Chatterjee & Hadi, 2006). In fact, the comparison of VIF and CI values between the analyses for scrotal circumference and birth weight showed that the degree of collinearity increased as the number of records decreased, which supports this strategy. In this case, when multicollinearity is related to additive and non-additive effects, the additional data should involve representative individuals of several breed compositions and arrangements of biological types. However, it is often not possible to collect more data because of constraints on budget, time, and staff. Differences in requirements and production related to the diversity of animal size and growth rate make it difficult to maintain and sell cattle from diverse breed compositions in the same herd. Moreover, not all breeds can be used for beef production under Brazilian conditions of management and climate. Therefore, it is fairly difficult to ensure a balanced population for genetic analysis.
Another option to minimize multicollinearity complications in regression analysis is to reduce the number of covariates in the model. The means of VIF for the RM model were 8.66, 18.52, and 14.60 for birth weight, yearling weight, and scrotal circumference, respectively; whereas for the R model, these means [...] is based on the addition of non-negative coefficients to the principal diagonal of the correlation matrix, which reduces or eliminates linear dependencies. Although biased, in the presence of multicollinearity, ridge estimators present lower standard errors and are more stable. Consequently, more accurate estimates are obtained than by using the ordinary least squares method. The least absolute shrinkage and selection operator (Lasso) regression (Tibshirani, 1996) is also a methodology used to deal with multicollinearity; it yields sparse solutions, in which regression coefficients associated with irrelevant or redundant variables are reduced to zero. Roso et al. (2005a), Pimentel et al. (2006), Dias et al. (2011), Long et al. (2011), and Li & Sillanpää (2012) showed the advantages of these methodologies in regression models for genetic analyses when multicollinearity is present.
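The ridge adjustment mentioned above, adding a non-negative constant to the principal diagonal of the correlation matrix, can be sketched for standardized variables as follows. The function name, the ridge constant k = 0.1, and the simulated data are assumptions chosen for illustration; no attempt is made to select k or to reproduce the cited studies.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Standardized coefficients solving (R + k I) b = r_xy; k = 0 gives least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized covariates
    yc = (y - y.mean()) / y.std()              # standardized response
    R = Z.T @ Z / n                            # correlation matrix among covariates
    r_xy = Z.T @ yc / n                        # correlations between covariates and response
    return np.linalg.solve(R + k * np.eye(p), r_xy)

# Simulated data: x3 is almost an exact linear combination of x1 and x2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(size=n)

b_ols = ridge_coefficients(X, y, k=0.0)     # ordinary least squares
b_ridge = ridge_coefficients(X, y, k=0.1)   # ridge with a hypothetical constant
print("OLS:  ", np.round(b_ols, 3))
print("Ridge:", np.round(b_ridge, 3))
```

For any k greater than zero, each eigenvalue of the matrix being inverted is raised by k, so the smallest eigenvalues no longer dominate the solution and the coefficient vector is shrunk relative to least squares, at the cost of some bias.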

Conclusions
1. In the presence of multicollinearity, the results are sensitive to changes in the database and in the model.
2. Additive and non-additive effects are commonly involved in collinearity relations due to the inherent relationship between these variables.
3. Since the estimates yielded by the least squares method are less accurate when multicollinearity occurs, it is important to consider multicollinearity diagnostics as a preliminary analysis in animal breeding programs to avoid erroneous inferences and low selection efficiency.
4. It is still necessary to evaluate the impact of multicollinearity in the estimation of genetic parameters and breeding values and to assess the efficiency of an alternative method to the least squares methodology in order to complement the available information on this subject.

Figure 1. Variance inflation factor (VIF) for the explanatory covariates considered in the design matrix X of the models RM (■) and R (■) for birth weight (A), yearling weight (B), and scrotal circumference (C). The dotted line marks the threshold (VIF=10) above which a covariate is considered to be involved in collinearity. AOD1 to AOD7 are the age classes of dam at calving; BTA, BTB, and BTC are the additive effects associated with the individual biological composition types for A, B, and C, respectively; MBTA, MBTB, and MBTC are the maternal additive effects associated with the maternal biological composition types for AM, BM, and CM, respectively; NxA, NxB, NxC, AxB, AxC, and BxC are the direct heterozygosity; and HM is the maternal total heterozygosity.
In the NABC system, breeds are classified into four biological types. The biological type N includes Bos indicus breeds, such as Gyr, Guzerat, Indubrazil, Nellore, Tabapuan, Boran, and other Zebu breeds. The biological type A is characterized by Bos taurus cattle adapted to the tropics by natural or artificial selection and descended from animals introduced by the colonizers, as well as breeds such as Bonsmara and Belmont Red. The biological type B is formed by Bos taurus breeds of British origin, such as Angus, Devon, and Hereford. The biological type C is typified by Bos taurus breeds from continental Europe, including Charolais, Limousin, and Simmental (Table 2).

Table 2. Number of observations for birth weight, yearling weight, and scrotal circumference in each genetic group based on the NABC system.