Generalized linear mixed models for the genetic evaluation of binary reproductive traits: a simulation study1

The objective of this study was to evaluate the use of probit and logit link functions for the genetic evaluation of early pregnancy using simulated data. The following simulation/analysis structures were constructed: logit/logit, logit/probit, probit/logit, and probit/probit. The percentages of precocious females were 5, 10, 15, 20, 25 and 30% and were adjusted based on a change in the mean of the latent variable. The parametric heritability (h2) was 0.40. Simulation and genetic evaluation were implemented in the R software. Heritability estimates (ĥ2) were compared with h2 using the mean squared error. Pearson correlations between predicted and true breeding values and the percentage of coincidence between true and predicted ranking, considering the 10% of bulls with the highest breeding values (TOP10) were calculated. The mean ĥ2 values were underand overestimated for all percentages of precocious females when logit/probit and probit/logit models used. In addition, the mean squared errors of these models were high when compared with those obtained with the probit/probit and logit/logit models. Considering ĥ2, probit/probit and logit/logit were also superior to logit/probit and probit/logit, providing values close to the parametric heritability. Logit/probit and probit/logit presented low Pearson correlations, whereas the correlations obtained with probit/probit and logit/logit ranged from moderate to high. With respect to the TOP10 bulls, logit/probit and probit/logit presented much lower percentages than probit/probit and logit/logit. The genetic parameter estimates and predictions of breeding values of the animals obtained with the logit/logit and probit/probit models were similar. In contrast, the results obtained with probit/logit and logit/probit were not satisfactory. There is need to compare the estimation and prediction ability of logit and probit link functions.


Introduction
One of the factors that determine the economic viability of sustainable beef cattle enterprises is the reproductive performance of the herd.Therefore, the identification and evaluation of reproductive traits that can be easily measured and present a potential for selection are essential.In this respect, early pregnancy is an easily measured trait since it comprises only two categories (pregnant and non-pregnant).In addition, the evaluation of phenotypic expression does not result in additional costs, since the diagnosis of pregnancy is a routine practice in beef cattle farming.
However, one of the main challenges of using categorical traits in breeding programs is the development of adequate statistical methods for the estimation of parameters and the prediction of breeding values.One of the methods frequently used for this purpose is generalized linear mixed models (GLMM) (Thompson, 1979;Gianola & Foulley, 1983;Harville & Mee, 1984;Gilmour et al., 1985Gilmour et al., , 1987;;Tempelman, 1998;Pereira et al., 2006Pereira et al., , 2007)).According to McCullagh & Nelder (1989), these models can be defined based on the specification of three components: i) a random component represented by independent random variables that belong to the same distribution, which is part of an exponential family such as binomial and Poisson; ii) a systematic component, called linear predictor, in which explanatory variables enter in the form of the linear sum of their effects; iii) a link function that combines the random and systematic components, i.e., relates the mean to the linear predictor and thus permits modeling, for example, the probability of a female being precocious.
The link function for binary data most widely used in animal breeding is the probit link, also known as the threshold model (Gianola & Foulley, 1983;Abdel-Azim & Berger, 1999;Kadarmideen et al., 2000Kadarmideen et al., , 2001;;Silva et al., 2005;Shiotsuki et al., 2009) Other functions can also be where Φ (.) is the standard normal distribution.
Once parametric values are assumed for β and σ 2 u , the probabilities of early pregnancy can be derived from [4] and [5] and are then used in the Monte Carlo simulation to generate phenotypic expressions of the early pregnancy trait.These expressions correspond to values 0 and 1, which are generated from binomial distributions, with the probability of success being given by P According to the method used, the source of variation not explained by the model in the generation of phenotypic expressions, which may be called residual, is implicit in the randomness of the values generated for Y i by the binomial distributions with probability π i .Therefore, when a probit link function is used, the residual variance (σ 2 ep ), since it is unknown, is generally assumed to be 1.The objective of this value is to render the likelihood function, which is constructed based on the approach of probit regression, identifiable (Sorensen & Gianola, 2002).For the logit function, the residual variance is an approximation based on the variance of the logistic distribution (Southey et al., 2003) and is described by expression [6]: [6] where b is the standard deviation of the logistic distribution, which is generally assumed to be 1 (Southey et al., 2003), as in the present study.
The parametric heritability used in the simulation was 0.40 and reflects the results of recent studies involving Nellore animals, which reported moderate to high estimates (Eler et al., 2002;Silva et al., 2003;Eler et al., 2004;Silva et al., 2005;Pereira et al., 2007;Shiotsuki et al., 2009).Different additive genetic variances were employed according to the link function to be used in the analysis of the simulated data.Thus, expressions [7] and [8] can be used to calculate the respective additive genetic variances explored, such as the logit link, which is widely used in the area of Biometrics (Breslow & Clayton, 1993;Demétrio, 2001;Nunes et al., 2004).When modeling a random variable using probit and logit functions, it is assumed that the latent variable (liability) possesses a standard normal and logistic distribution, respectively.Although important, studies comparing these functions are scarce and are based on the analysis of true data (Gilmour et al., 1987;Kadarmideen et al., 2004).Therefore, an interesting alternative to evaluate the estimation and prediction ability of probit and logit link functions is the use of simulated data.
The objective of the present study was to evaluate the application of a GLMM to the estimation of genetic parameters and prediction of breeding values for early pregnancy by comparing probit and logit link functions using simulated data.

Material and Methods
The simulated population was structured based on the random mating of 50 bulls to 1500 unrelated cows, resulting in 1500 females derived from 50 groups of 30 half-sibs, which were submitted to the evaluation of early pregnancy.
Fixed and random effects were introduced into the simulation through the specification of a linear predictor given by: η η η η η = Xβ β β β β + Zu, [1] where η η η η η px1 is the linear predictor and p is the number of heifers (1500); β β β β β 3x1 is the vector of fixed effects, β β β β β' = [μ,β 1 ,β 2 ], whose values adopted for the simulation were β 1 = 1and β 1 = -1, and μ is the overall mean (Table 2); u px1 is the vector of additive genetic values, with u ~Np (0, Aσ 2 u ), where A is the relationship matrix between animals and σ 2 u is the additive genetic variance of the population; X px3 and X pxp are incidence matrices of effects β and u, respectively.
Considering that π i = P[Y i = 1] is the probability of heifer i (i = 1,2,…,1500) to be precocious, the logit and probit link functions are given by expressions [2] and [3], respectively: [3] where Φ -1 (.) is the inverse function of the standard normal cumulative distribution.
According to the theory of GLMM, the probabilities of interest (π i ) can be isolated by matching the linear predictor [1] to the link functions [2] and [3].Considering the logit and probit link functions, these probabilities are given by expressions [4] and [5]: for the logit (σ 2 ul ) and probit (σ 2 up ) link functions: [7] [8] The simulation was structured based on the following components (Table 1): i) inverse of the link function (expressions [4] and [5]); ii) link function used for estimation and prediction (expressions [2] and [3]).
One practical aspect that can influence the estimation of genetic parameters and prediction of breeding values for categorical traits when GLMM are used is the variation in the frequencies of observations for each category (Abdel-Azim & Berger, 1999).For this reason, different percentages of precocious females were adopted in the simulation since the results obtained with the logit and probit functions may vary depending on these percentages, which were modeled according to the values assumed for μ (Table 2).The following percentages of precocious females were adopted: 5, 10, 15, 20, 25, and 30%.
The different simulation scenarios were defined by combinations of the distinct models (logit/logit, probit/ logit, logit/probit and probit/probit) and percentages of precocious females.This simulation process was implemented in the R software (R Development Core  ) through a binomial function, considering the specifications (link = "probit")$linkinv and (link = "logit")$linkinv for probit and logit links, respectively.These probabilities corresponded to the parameters of the binomial distribution necessary for the generation of binary data obtained with the rbinom function.
The data sets generated for each scenario, corresponding to 100 repetitions, were analyzed by the GLMM method using the pedigreemm package (Vazquez et al., 2009) of the R software (R Development Core Team, 2009).This package uses the restricted maximum likelihood method for the estimation of variance components.In the present study, due to the adoption of a sire model, the sire  Isolation of 2 ĥ from expressions [9] and [10], which correspond to the logit and probit link functions, respectively, was necessary to obtain the estimates of h 2 and to compare them with the parametric value of 0.40: [9] [10] where σ ˆis the sire variance.
The heritability estimates obtained were compared with the parametric value using the mean squared error.According to Casela & Berger (1990), it is a function that quantifies the expected value of the squared difference between the estimator ( 2 ĥ ) and parameter ( 2h ).The mean squared error is defined as follows: [11] where N is the number of repetitions per scenario, in this case 100.The predictions of additive breeding values were compared with parametric values using Pearson correlations.In addition, the percentage of coincidence between true and predicted ranking, considering only the 10% of bulls with the highest breeding values (TOP10) was measured.

Results and Discussion
The logit/probit and probit/logit simulation/analysis structures yielded mean estimates of h 2 below and above the parametric value, respectively, for all percentages of precocious females (Table 3).In addition, the mean squared errors obtained with these models were high when compared with those obtained with the probit/probit and logit/logit models.Considering only bias, the last two structures also showed a better performance than probit/logit and logit/ probit since they provided estimates that were much closer to the parametric value.
The h ^2 values that showed the greatest bias in the probit/probit (0.3658) and logit/logit (0.3554) simulation/ analysis structures were obtained at percentages of precocious females of 25 and 30%, respectively.However, these estimates can be considered close to the simulated parametric value (0.40).For these models, the highest mean squared error was obtained at a percentage of precocious females of 5% (0.0478 for probit/probit and 0.0392 for logit/logit), indicating that a lower proportion of early pregnancy success may increase the variation in h 2 estimates.
The logit/probit and probit/logit models presented low correlations (close to zero), indicating a poor ability of prediction (Table 4).In contrast, the probit/probit and logit/ logit models showed correlations ranging from moderate to high.The correlations obtained with the logit/logit model were slightly higher than those obtained with the probit/ probit model, except for the percentage of precocity of 5% (0.5936 and 0.6025).These differences are probably related to the fact that inadequate likelihood functions are used when the data are analyzed as probit and logit, and vice-versa.It is understood that the use of an inadequate distribution for the data (likelihood functions) may compromise the predictions of breeding values.
Correlations between predicted and true breeding values increased with increasing percentage of precocious females when the probit/probit and logit/logit simulation models were used (Table 4).This behavior can be attributed to the fact that logit and probit link functions yield better results when the proportions of zero and one tend to be the same, i.e., 50%.According to Abdel-Azim & Berger (1999)  of multicategorical (probit) threshold models increases as the differences between the frequencies of observations for each category decrease.Since the percentages of precocious females adopted in the simulation did not include the value of 50%, the best results were reported for the closest percentage, i.e., 30%.The values adopted here were chosen according to previous studies involving Nellore cattle (Eler et al., 2002;Silva et al., 2003;Eler et al., 2004;Silva et al., 2005;Pereira et al., 2007;Shiotsuki et al., 2009).
The percentages of animals coinciding with the TOP10 bulls obtained with the logit/probit and probit/logit models were lower than those obtained with the probit/probit and logit/logit models (Table 5).These findings suggest that comparison between logit and probit link functions is necessary for the ranking of the best animals, in agreement with the results obtained for the estimation of heritability (Table 3) and for the prediction of additive genetic effects (Table 4).In addition, the most satisfactory results were observed with increasing percentage of precocious females, a fact also demonstrated by Pearson correlations (Table 4).
Taken together, the results obtained with the simulation models proposed show the need to compare the estimation and prediction ability of logit and probit link functions, irrespective of the percentage of precocious females.Simulating data on the logit scale and smoothing them using a probit function (logit/probit), or vice-versa (probit/ logit), provided less satisfactory results than those obtained when the data were generated and analyzed using the same scale and a link function (probit/probit and logit/logit).

Conclusions
The generalized linear mixed models using logit and probit link functions provided similar genetic parameter estimates and predictions of breeding values of the animals when logit/logit and probit/probit simulation/analysis models were used.However, the results obtained with the probit/logit and logit/probit models were not satisfactory.It is therefore necessary to compare the estimation and prediction ability of link functions in order to determine which function is the most appropriate for the genetic evaluation of binary traits.

Table 1 -
Team, 2009)analysis structures used according to the inverse function applied to obtain the probabilities of early pregnancy in heifers (π i ) and the link function used in the analysisTeam, 2009), considering 100 repetitions per scenario.The mvrnorm function of the MCMCpack was used to generate additive genetic values (u) in [1] of a multivariate normal distribution.The numerator relationship matrix A, which comprises the covariance matrix of this distribution, was obtained with the kinship function of the same package.The rbinom function was used to determine to which fixed effect of the female it would be submitted, with a probability of 0.5 of the animal being under the effect of β 1 or β 2 .Values of η i were then determined based on the sum of fixed effects and the breeding value (x´iβ + u i ).These values of η i were used to calculate the probabilities P[Y i = 1] = π i derived from the inverse link functions(expressions [4] and [5]

Table 2 -
Mean values (μ) used to determine the percentage of precocious females for each simulation/analysis structures variance estimates correspond to ¼ of the estimated value for the additive genetic random component.

Table 3 -
Mean heritability estimates (h ^2) and mean squared errors (MSE) obtained with each simulation structure according to the different percentages of precocious females (%PF)

Table 5 -
Mean percentage of coincidence in top 10% for true and predicted ranking obtained with each simulation/analysis model for the different percentages of precocious females

Table 4 -
Mean Pearson correlations between true and predicted breeding values obtained with each simulation/analysis structures for the different percentages of precocious females