Contemporary groups in the genetic evaluation of Nellore cattle using Bayesian inference

The objective of this work was to evaluate the criteria for the formation of contemporary groups (CGs) in the genetic evaluation of body weight at weaning in Nellore cattle. A total of 713,474 records from 3,066 herds located in Midwestern and Northern Brazil were used. Data were obtained from the genealogical registry of zebu breeds of the Brazilian association of zebu breeders. Data structures were defined based on the number of standard deviations (SDs) for outlier removal (±2.0, ±2.5, ±3.0, and ±3.5) and on the minimal number of animals per CG (3, 7, and 15). Genetic evaluation was performed with an animal model using Bayesian inference. Data structures with ±3.5 SDs and CG with at least 15 animals presented the highest additive genetic variance (82.65±2.93), and those with ±2.0 SDs and CG with at least 3 animals showed the lowest one (60.23±1.96). The proper formation of CGs results in better-quality data archives, which allows more reliable estimates of the genetic parameters to be obtained. Better selection responses are obtained when the following criteria are adopted: 2.5, 3.0, or 3.5 standard deviations for the removal of outliers and a minimum of 15 animals per contemporary group.


Introduction
In the genetic evaluation of beef cattle, nongenetic factors have been identified as important sources of variation and may include: herd; year of birth; month of birth; nutritional, reproductive, and sanitary managements; and sex of the animal. These factors can be combined to form contemporary groups (CGs), which, in a statistical model, allow removing the effects of environment and of different animal managements, in order to evaluate the expression of animal phenotypes under the same environmental conditions (Cobuci et al., 2006).
However, there is no consensus regarding the size of the group of animals and the common conditions necessary to form them. These criteria aim to maximize the homogeneity within the CG, which results in a smaller number of animals per group. Therefore, the strategies used to define the CG might affect the prediction accuracy of the expected differences among progenies. The incorrect formation of CGs leads to incorrect decision making, because genetic values are overestimated for animals favored in the CG and underestimated for animals subjected to less favorable conditions to express their genetic potential (Cobuci et al., 2006).
Outlier removal based on phenotypic standard deviation (SD) can be used in genetic evaluation for data consistency. However, the number of SDs for outlier removal is determined empirically in genetic evaluations and there are no robust scientific studies on CG formation that indicate the most adequate number to maximize the accuracy of genetic values.
In Brazil, there is still no agreement in the literature on the use of SDs for outlier removal and on the minimum number of animals per CG for the genetic evaluation of Nellore cattle. Bignardi et al. (2011), Shiotsuki et al. (2012), and Ferriani et al. (2013), for example, used ±3.5 SDs and a minimum of 9, 3, and 4 animals per CG, respectively; Pedrosa et al. (2010), Matos et al. (2013), Santos et al. (2012), and Ambrosini et al. (2014), respectively, adopted ±3.0 SDs and a minimum of 3, 7, 4, and 5 animals per CG.
The objective of this work was to evaluate the criteria for the formation of CGs in the genetic evaluation of body weight at weaning in Nellore cattle.

Materials and Methods
The weaning weights of 713,474 Nellore cattle, aged between 165 and 255 days, standardized to 210 days (W210), were recorded from 1993 to 2014. This data set originated from 3,066 herds raised in the Midwestern and Northern regions of Brazil, and is included in the genealogical registry of the Brazilian association of zebu breeders.
For data consistency, records on the following animals were removed: offspring and parents with similar identification; bulls and cows less than 609 days old at calving time; cows over 25 years old at calving; siblings with age difference less than 315 days; and animals derived from embryo transfer or in vitro fertilization techniques.
Birth months were grouped into four seasons: from November and December of the previous year to January of the following year; from February to April; from May to July; and from August to October. Three breeding conditions were assessed: pre-weaning animals, weaning animals, and animals subjected to weight-gain tests. Regarding feeding management, animals were raised in extensive pasture, semiconfinement, and confinement. Cow age at calving was grouped into 14 classes: class 1, containing 1.15% of the animals, corresponded to cows with less than 30 months of age at calving; classes 2 to 13 were formed for every 12-month interval, starting at 30 months of age; and class 14, containing 2.31% of the animals, grouped cows over 174 months of age at calving. Classes 2 (15.82%) and 3 (13.18%) presented the highest percentage of animals; both were combined with the factor sex of the animals and included as a qualitative systematic (non-genetic) effect in the model.
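The cow-age classes described above can be illustrated with a short sketch. The function name `cow_age_class` is hypothetical, and the treatment of the class boundaries (each 12-month interval closed on the left, and exactly 174 months assigned to class 14) is an assumption, since the text does not state how boundary ages were handled.

```python
def cow_age_class(age_months: float) -> int:
    """Return the cow-age-at-calving class (1 to 14) for an age in months.

    Assumed boundaries: class 1 is < 30 months; classes 2 to 13 are the
    twelve 12-month intervals [30, 42), [42, 54), ..., [162, 174);
    class 14 is >= 174 months.
    """
    if age_months < 0:
        raise ValueError("age must be non-negative")
    if age_months < 30:
        return 1
    if age_months >= 174:
        return 14
    # 30-41.99 -> class 2, 42-53.99 -> class 3, and so on.
    return 2 + int((age_months - 30) // 12)
```

For example, a cow calving at 36 months falls in class 2 and one calving at 60 months falls in class 4 under these assumed boundaries.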
The CGs were formed considering herd, breeding condition, feeding management, sex, weighing date, year and season of birth. The CG was also included as a qualitative systematic effect in the model, whereas the age of the animal at weighing was considered as a quantitative systematic (linear and quadratic) effect.
Different criteria were assumed for outlier removal in each CG. The first criterion was the number of SDs around the mean (2.0, 2.5, 3.0, and 3.5) in each CG beyond which records were removed, and the second one was the removal of the CGs with fewer than 3, 7, and 15 animals. By combining these two criteria, 12 different data sets (E1 to E12) were generated (Table 1). Data consistency and editing were performed with the R software (The R Foundation, 2011).
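The two editing criteria described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R script: the function name `build_data_set`, the record layout, and the order of the steps (outliers removed once against the mean and SD of each CG, then small CGs dropped) are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_data_set(records, n_sd, min_animals):
    """Filter (animal_id, cg_id, weight) records.

    Assumed procedure: within each contemporary group (CG), drop weights
    farther than n_sd sample SDs from the CG mean (single pass), then
    drop any CG left with fewer than min_animals records.
    """
    groups = defaultdict(list)
    for animal_id, cg_id, weight in records:
        groups[cg_id].append((animal_id, weight))

    kept = []
    for cg_id, members in groups.items():
        weights = [w for _, w in members]
        if len(weights) >= 2:  # an SD needs at least two records
            m, s = mean(weights), stdev(weights)
            members = [(a, w) for a, w in members if abs(w - m) <= n_sd * s]
        if len(members) >= min_animals:
            kept.extend((a, cg_id, w) for a, w in members)
    return kept


# The 12 combinations of the two criteria, labelled E1 to E12 in the same
# order as the data sets listed in the Figure 1 legend.
SCENARIOS = {
    f"E{i + 1}": (sd, size)
    for i, (sd, size) in enumerate(
        (sd, size) for sd in (2.0, 2.5, 3.0, 3.5) for size in (3, 7, 15)
    )
}
```

With this layout, `build_data_set(records, *SCENARIOS["E12"])` would apply 3.5 SDs and a minimum of 15 animals per CG.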
Connectedness, defined according to the total number of direct genetic links, was determined for each data set. This number was calculated based on the number of offspring from bulls and cows in common among the CGs, using the AMC software (Roso et al., 2004).
For the genetic evaluation, a single-trait animal model was used. Under matrix notation, this model can be described as follows: $y = Xb + Z_1 a + Z_2 m + Z_3 pm + e$, in which y is the vector of the phenotypic observations on individual animals; b is the vector of systematic effects; a is the vector of direct additive genetic random effects; m is the vector of maternal additive genetic random effects; pm is the vector of maternal permanent environmental random effects; e is the vector of residual effects; and $X$, $Z_1$, $Z_2$, and $Z_3$ are the incidence matrices that relate the observations to the systematic, direct additive genetic, maternal additive genetic, and maternal permanent environmental effects, respectively. This model was fitted by Bayesian inference, assuming probability distributions for data and unknown parameters according to Sorensen & Gianola (2002).
The conditional distribution of y, given the parameters, was assumed to be: $y \mid b, a, m, pm, R \sim N(Xb + Z_1 a + Z_2 m + Z_3 pm, R)$, and the residual (co)variance matrix is $R = I\sigma_e^2$, in which I is an identity matrix; N is the multivariate normal distribution, with mean and covariance assumed between parentheses; and $G = A \otimes G_0$, in which A is the numerator relationship matrix of Wright's coefficients and $G_0$ is the additive genetic covariance matrix.
For the systematic effects, it was assumed that: $b \sim N(b_0, V_0)$, in which $V_0$ is the non-informative diagonal variance matrix, assuming $V_0 \to 10^{10}$.
The direct and maternal additive genetic effects were assumed as: $[a', m']' \mid G \sim N(0, G)$, in which $\sigma_a^2$ is the direct additive genetic variance and $\sigma_m^2$ is the maternal additive genetic variance, both on the diagonal of $G_0$; and $G$, $G_0$, and $A$ have been previously described. The covariance between these genetic effects was assumed to be zero, based on the results reported by Mallinckrodt et al. (1995), who suggested that fitting possibly negative genetic correlations between a and m may result in less accurate genetic predictions than assuming null correlations.
The maternal permanent environmental effect was assumed as: $pm \mid \sigma_{pm}^2 \sim N(0, I\sigma_{pm}^2)$, in which $\sigma_{pm}^2$ is the maternal permanent environmental variance.
The scaled inverse chi-squared distribution was assumed for $\sigma_{pm}^2$ and $\sigma_e^2$, with their respective densities. For the residual and maternal permanent environmental variances, the following distribution was used: $\sigma_i^2 \sim \chi^{-2}(v_i, S_i^2)$, for $i = e, pm$, in which $S_i^2$ is the scale value and $v_i$ is the degree of confidence. The scaled inverse-Wishart distribution ($IW_2$) was assumed for $G_0$, $G_0 \sim IW_2(\Sigma, v_a)$, so that $\Sigma$ is the covariance matrix between direct and maternal additive genetic effects, defined according to the literature for W210; and $v_a = 5$ is the degree of confidence for these values.
Table 1. Description of the evaluated data sets (E1 to E12), according to the standard deviations around the mean for outlier removal (SDOR), as well as minimum number of animals per contemporary group (MACG), number of observations (N), number of animals in the data archive, number of contemporary groups (CGs), means in kilograms, minimum and maximum values, and SD for weight at 210 days of age, in Nellore cattle(1).
Bayesian inference was performed by Markov chain Monte Carlo methods, using the GIBBS2F90 software (Misztal et al., 2014), which was run for a total of 500,000 iterations, of which the first 50,000 were discarded as burn-in. Samples were stored every 10 iterations (thin = 10), resulting in a total of 45,000 effective samples for inference. The convergence was checked by visual inspection of trace plots and by Geweke's Z criterion (Geweke, 1992).
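The sample count and the convergence check mentioned above can be illustrated with a short sketch. Both function names are hypothetical. The Z score below is a simplified stand-in for Geweke's criterion: it compares the mean of the first 10% of the chain with the mean of the last 50%, but uses naive standard errors that ignore autocorrelation, whereas Geweke (1992) uses spectral density estimates. It is not the GIBBS2F90 or post-processing implementation used in the study.

```python
from math import sqrt
from statistics import mean, variance


def retained_samples(total_iterations, burn_in, thin):
    """Number of samples kept after discarding burn-in and thinning."""
    if thin < 1 or not 0 <= burn_in <= total_iterations:
        raise ValueError("invalid chain settings")
    return (total_iterations - burn_in) // thin


def naive_geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke-style Z score for a list of posterior samples.

    Assumption: sample variances are used for the standard errors, so
    autocorrelation is ignored and |Z| tends to be overstated for
    strongly correlated chains.
    """
    n = len(chain)
    head = chain[: int(n * first)]
    tail = chain[n - int(n * last):]
    if len(head) < 2 or len(tail) < 2:
        raise ValueError("chain too short")
    se2 = variance(head) / len(head) + variance(tail) / len(tail)
    if se2 == 0:
        raise ValueError("chain segments have zero variance")
    return (mean(head) - mean(tail)) / sqrt(se2)
```

With the settings reported in the text, `retained_samples(500_000, 50_000, 10)` gives the 45,000 samples used for inference.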
The genetic gains for different data sets were calculated by multiplying heritability ($h^2$) and the selection differential. The latter is the difference between the mean of the selected animals and of the entire population, with the selection of 1% of the best bulls and 20% of the best cows (Top 1_20) or of 5% of the best bulls and 40% of the best cows (Top 5_40).
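The calculation described above can be sketched as follows. The function names are hypothetical, and several details are assumptions because the text does not specify them: the selected bulls and cows are pooled before averaging, the number selected is rounded up, and `bull_values` and `cow_values` stand for whatever measure the animals were ranked on.

```python
from math import ceil
from statistics import mean


def top_fraction(values, fraction):
    """Return the best `fraction` of values (count rounded up, at least one)."""
    n = max(1, ceil(round(len(values) * fraction, 9)))
    return sorted(values, reverse=True)[:n]


def genetic_gain(h2, bull_values, cow_values, bull_frac, cow_frac):
    """Genetic gain as heritability times the selection differential.

    Assumption: the selection differential is the mean of the pooled
    selected bulls and cows minus the mean of all bulls and cows.
    """
    selected = top_fraction(bull_values, bull_frac) + top_fraction(cow_values, cow_frac)
    population = list(bull_values) + list(cow_values)
    differential = mean(selected) - mean(population)
    return h2 * differential
```

Under these assumptions, the Top 1_20 scenario corresponds to `genetic_gain(h2, bulls, cows, 0.01, 0.20)` and Top 5_40 to `genetic_gain(h2, bulls, cows, 0.05, 0.40)`.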
The criteria for CG formation were evaluated and compared considering the percentage of common animals (ranking coincidence), according to the predicted breeding values, the differences in variance components and in genetic parameters, and the genetic gains. Statistical inference on the results was based on the overlap of credibility intervals (CI) among the different data sets.
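The ranking coincidence described above can be sketched as follows. The function names are hypothetical. Two details are assumptions not stated in the text: the number of selected animals is rounded up, and the percentage is expressed relative to the smaller of the two selected sets so that the measure is symmetric.

```python
from math import ceil


def top_ids(ebv, fraction):
    """IDs of the best `fraction` of animals in a dict of id -> breeding value."""
    if not ebv:
        raise ValueError("no animals to rank")
    n = max(1, ceil(round(len(ebv) * fraction, 9)))
    # Sort by decreasing breeding value; ties are broken by animal ID.
    ranked = sorted(ebv, key=lambda animal: (-ebv[animal], animal))
    return set(ranked[:n])


def percent_common(selected_a, selected_b):
    """Percentage of animals selected in both data sets.

    Assumption: the denominator is the size of the smaller selected set.
    """
    smaller = min(len(selected_a), len(selected_b))
    if smaller == 0:
        raise ValueError("empty selection")
    return 100.0 * len(selected_a & selected_b) / smaller
```

For a Top 1_20 comparison, the selected set of each data set could be built as `top_ids(bull_ebv, 0.01) | top_ids(cow_ebv, 0.20)` before calling `percent_common`.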

Results and Discussion
For all data sets, the percentage of CG connectedness was higher than 99%. This indicates that the number of offspring per bull or cow in each CG was sufficient to ensure their genetic links. This result is explained by the intense commercialization of semen from proven Nellore bulls, whose genetic material is widely distributed among different herds.
The main criterion for obtaining the phenotypic records used to estimate variance components and genetic parameters was based on the minimum number of animals per CG. When the data sets with at least 15 animals per CG (E3, E6, E9, and E12) were compared with those with at least 7 and 3 animals, respectively, there was a decrease of 25.9 and 37.11% in the number of animals in the data archives. Moreover, when the CGs with at least 7 animals were compared with those with at least 3, a decrease of 15.09% was observed (Table 1).
Regarding the number of CGs, there was a decrease of 46.23 and 74.84%, respectively, for the data sets with at least 7 and 15 animals, in comparison with those with at least 3 animals. When the number of the CGs with at least 7 animals was compared with those with at least 15, a decrease of 53.21% was verified (Table 1).
Due to their greater flexibility, the data sets with 3.5 SDs (E10, E11, and E12) had more phenotypic records than those with 2.0, 2.5, and 3.0 SDs, which showed a reduction of 3.81, 1.05, and 0.23%, respectively, regarding the number of observations. The number of animals in the data archives decreased, respectively, 2.7, 0.74, and 0.16% for 2.0, 2.5 and 3.0 SDs. Note that, as expected, the data sets with 3.5 SDs also presented the highest numbers of bulls and cows (Table 1).
According to the CI overlapping, there was no significant difference between data sets and within each tested SD for direct (Figure 1 A) and maternal (Figure 1 B) genetic variances. Higher values of direct additive genetic variances were observed for the data sets with higher SD (Figure 1 A). Higher direct and maternal additive genetic variances were found in the E9 and E12 data sets (Figure 1 A and B).
There was also no statistical difference between data sets for 2.5, 3.0, and 3.5 SDs regarding the maternal permanent environmental variance (Figure 1 C). However, by using 2.0 SDs, the E3 and E1 data sets were similar to E2, but differed between each other due to the non-overlapping of the CI. In general, the maternal permanent environmental variance increased as the number of SDs increased. Higher values of maternal permanent environmental variances were observed for 3.0 and 3.5 SDs.
The data sets with a lower number of phenotypic records in each CG, i.e., fewer animals per CG, presented lower residual variance estimates regardless of the SD used (Figure 1 D). When the number of animals in the CG increased from 3 to 7 and from 3 to 15, residual variances were reduced by 6.28 and 10.38% for 2.0 SDs; 4.26 and 8.49% for 2.5 SDs; 3.47 and 6.72% for 3.0 SDs; and 3.0 and 5.51% for 3.5 SDs, respectively. In addition, a significant increase was observed in residual variances as the number of SDs increased (Figure 1 D). The residual variances from the data sets with 3.5 SDs were 3.0, 8.7, and 18.85% higher than those of the data sets with 3.0, 2.5, and 2.0 SDs, respectively. The E3 data set had the lowest residual variance (206.91±1.4633), whereas E7 had the highest one (276.42±1.4658).
The data sets with at least 15 animals per CG presented higher direct and maternal genetic variances regardless of SD (Figure 1 A and B). However, these data sets also presented lower maternal permanent environmental and residual variances (Figure 1 C and D). These results may be responsible for the observed trend towards higher heritability estimates and genetic gain (Figure 1 E and 2). Furthermore, the data sets with at least 3 animals per CG presented lower direct genetic variance, as well as higher residual variances, with a trend towards lower $h^2$ estimates and genetic gain (Figure 1 E and 2). The obtained results show that the values of all variance components increased as the number of SDs increased (Figure 1 A, B, C, and D). This explains the similar $h^2$ estimates (direct and maternal) and genetic gain regardless of the number of SDs (Figure 1 E and F, and Figure 2).
Figure 1. A, direct additive genetic variance; B, maternal additive genetic variance; C, maternal permanent environmental variance; D, residual variance; E, heritability ($h^2$); and F, maternal heritability ($h_m^2$), with their respective credibility intervals for the different data sets (E1 to E12) evaluated. For each data set, the following standard deviations and number of animals per contemporary groups were considered, in between parentheses, respectively: E1 (2.0 and 3), E2 (2.0 and 7), E3 (2.0 and 15), E4 (2.5 and 3), E5 (2.5 and 7), E6 (2.5 and 15), E7 (3.0 and 3), E8 (3.0 and 7), E9 (3.0 and 15), E10 (3.5 and 3), E11 (3.5 and 7), and E12 (3.5 and 15).
Pesq. agropec. bras., Brasília, v.52, n.8, p.643-651, ago. 2017. DOI: 10.1590/S0100-204X2017000800010
The variation in the total number of animals between data sets allowed a redistribution of variances. The data sets with higher SDs (2.5, 3.0, and 3.5) had higher direct and maternal genetic variances, but lower maternal permanent environmental and residual variances. Therefore, the data sets with 2.0 SDs and at least 3 animals per CG would not be recommended for genetic evaluations in this population.
There were no significant statistical differences based on CI overlapping among or within the different SDs for direct ($h^2$) and maternal heritability ($h_m^2$) estimates (Figure 1 E and F). Higher $h^2$ values (Figure 1 E) were observed for E3, E6, E9, and E12 due to the combination of greater direct genetic variances and lower residual variances (Figure 1 A and D). This shows that, by using at least 15 animals per CG in genetic evaluations, a higher genetic gain is expected (Figure 2).
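The link between the variance components and the heritabilities discussed here can be made explicit. The expressions below are the usual definitions for an animal model with direct, maternal, and maternal permanent environmental effects and a null direct-maternal covariance; they are an assumption, since the text does not state the formulas used, and $\sigma_p^2$ is a symbol introduced here for the phenotypic variance.

```latex
\sigma_p^2 = \sigma_a^2 + \sigma_m^2 + \sigma_{pm}^2 + \sigma_e^2, \qquad
h^2 = \frac{\sigma_a^2}{\sigma_p^2}, \qquad
h_m^2 = \frac{\sigma_m^2}{\sigma_p^2}
```

Under these definitions, a larger $\sigma_a^2$ or a smaller $\sigma_e^2$ raises $h^2$, with the other components held constant.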
Similar $h_m^2$ estimates were reported by Boligon et al. (2010), who assumed 3.0 SDs for outlier removal and a minimum of 4 animals per CG; and by Laureano et al. (2011) and Lopes et al. (2013), who used a CG with at least 3 animals. Silva et al. (2013) adopted at least 9 animals per CG and found maternal heritability estimates lower (0.03) than those observed in all the data sets evaluated in the present study.
It should be highlighted that, even if outlier removal and CG formation are standardized for all genetic evaluation in beef cattle, there would still be differences in genetic parameter estimates due to the specific features inherent to each population and to the environmental conditions that animals are subjected to. Therefore, genetic parameter estimates with greater accuracy are expected in genetic evaluations when appropriate criteria are used and the specific characteristics of each population are taken into account.
For each SD, the CGs with at least 3 and 7 animals had the highest percentages of common animals, with values of 78.93 and 81.72% for Top 1_20 and Top 5_40, respectively. The lowest percentages were observed between the CGs with at least 3 and 15 animals, with values of 58.47 and 61.47% for Top 1_20 and Top 5_40, respectively. In addition, higher percentages of common animals were verified by using 3.0 and 3.5 SDs (Table 2).
Evaluating the number of common animals in different data sets enabled understanding the effects of outlier removal. It also showed that the empirical outlier removal by using different SD values and a minimum number of animals per CG may affect individual genetic merit and, consequently, the selection process.
Higher genetic gain was observed for the CGs with at least 15 animals associated with 2.5, 3.0, or 3.5 SDs, i.e., for E6, E9, and E12, respectively (Figure 2). However, when these data sets were compared with E1, with a lower genetic gain, the differences were, on average, 1.06 and 0.77 kg for Top 1_20 and Top 5_40, respectively. When the E6, E9, and E12 data sets were compared with E3, with a lower genetic gain, the differences were, on average, 0.54 and 0.28 kg for Top 1_20 and Top 5_40, respectively.
Limitations on the number of animals per CG tended to affect genetic gain. The data sets with at least 15 animals per CG showed a trend towards higher genetic gain, whereas those with at least 3 animals showed a trend towards lower genetic gain. Therefore, the data sets with at least 15 animals per CG may be recommended for genetic evaluations of this population. This may be justified by the concept of population sampling; i.e., the higher the number of phenotypic records in each CG, the higher is the chance of this random sampling being close to the average of the population. This way, the representativeness of the genetic parameters will be higher for this population (Cobuci et al., 2006).
Table 2. Percentage of common animals between data sets (E1 to E12) for 1% of the best bulls and 20% of the best cows (Top 1_20) above the diagonal, and 5% of the best bulls and 40% of the best cows (Top 5_40) below the diagonal(1).
(1)For each data set, the criteria considered, in between parentheses, were: standard deviations around the mean for outlier removal and minimum number of animals per contemporary group, respectively.
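For the scaled inverse chi-squared distributions assumed for the residual and maternal permanent environmental variances, the scale values were based on the variance estimates previously reported in the literature for W210, and $v_e = v_{pm} = 5$ is the degree of confidence for these values.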