**BREEDING, GENETIC AND REPRODUCTION**

**Generalized linear mixed models for the genetic evaluation of binary reproductive traits: a simulation study ^{1}**

**Diogo Anastácio Garcia ^{I}; Idalmo Garcia Pereira^{II}; Fabyano Fonseca e Silva^{III}; Guilherme Jordão de Magalhães Rosa^{IV}; Aldrin Vieira Pires^{II}; Roseli Aparecida Leandro^{V}**

^{I}Faculdade de Ciências Agrárias e Veterinárias/UNESP - Jaboticabal, SP

^{II}Departamento de Zootecnia - FCA/UFVJM - Diamantina, MG

^{III}Departamento de Estatística - UFV - Viçosa, MG

^{IV}Department of Dairy Science, University of Wisconsin - Madison, WI, USA ]]>
^{V}Departamento de Ciências Exatas - LCE/ESALQ - Piracicaba, SP

**ABSTRACT**

The objective of this study was to evaluate the use of probit and logit link functions for the genetic evaluation of early pregnancy using simulated data. The following simulation/analysis structures were constructed: logit/logit, logit/probit, probit/logit, and probit/probit. The percentages of precocious females were 5, 10, 15, 20, 25 and 30% and were adjusted based on a change in the mean of the latent variable. The parametric heritability (*h ^{2}*) was 0.40. Simulation and genetic evaluation were implemented in the R software. Heritability estimates (

*ĥ*) were compared with

^{2}*h*using the mean squared error. Pearson correlations between predicted and true breeding values and the percentage of coincidence between true and predicted ranking, considering the 10% of bulls with the highest breeding values (TOP10) were calculated. The mean

^{2}*ĥ*values were under- and overestimated for all percentages of precocious females when logit/probit and probit/logit models used. In addition, the mean squared errors of these models were high when compared with those obtained with the probit/probit and logit/logit models. Considering

^{2}*ĥ*, probit/probit and logit/logit were also superior to logit/probit and probit/logit, providing values close to the parametric heritability. Logit/probit and probit/logit presented low Pearson correlations, whereas the correlations obtained with probit/probit and logit/logit ranged from moderate to high. With respect to the TOP10 bulls, logit/probit and probit/logit presented much lower percentages than probit/probit and logit/logit. The genetic parameter estimates and predictions of breeding values of the animals obtained with the logit/logit and probit/probit models were similar. In contrast, the results obtained with probit/logit and logit/probit were not satisfactory. There is need to compare the estimation and prediction ability of logit and probit link functions.

^{2}**Key Words:** beef cattle, early pregnancy, genetic parameters, GLMM, link function

**Introduction**

One of the factors that determine the economic viability of sustainable beef cattle enterprises is the reproductive performance of the herd. Therefore, the identification and evaluation of reproductive traits that can be easily measured and present a potential for selection are essential. In this respect, early pregnancy is an easily measured trait since it comprises only two categories (pregnant and non-pregnant). In addition, the evaluation of phenotypic expression does not result in additional costs, since the diagnosis of pregnancy is a routine practice in beef cattle farming.

]]> However, one of the main challenges of using categorical traits in breeding programs is the development of adequate statistical methods for the estimation of parameters and the prediction of breeding values. One of the methods frequently used for this purpose is generalized linear mixed models (GLMM) (Thompson, 1979; Gianola & Foulley, 1983; Harville & Mee, 1984; Gilmour et al., 1985, 1987; Tempelman, 1998; Pereira et al., 2006, 2007). According to McCullagh & Nelder (1989), these models can be defined based on the specification of three components: i) a random component represented by independent random variables that belong to the same distribution, which is part of an exponential family such as binomial and Poisson; ii) a systematic component, called linear predictor, in which explanatory variables enter in the form of the linear sum of their effects; iii) a link function that combines the random and systematic components, i.e., relates the mean to the linear predictor and thus permits modeling, for example, the probability of a female being precocious.The link function for binary data most widely used in animal breeding is the probit link, also known as the threshold model (Gianola & Foulley, 1983; Abdel-Azim & Berger, 1999; Kadarmideen et al., 2000, 2001; Silva et al., 2005; Shiotsuki et al., 2009) Other functions can also be explored, such as the logit link, which is widely used in the area of Biometrics (Breslow & Clayton, 1993; Demétrio, 2001; Nunes et al., 2004). When modeling a random variable using probit and logit functions, it is assumed that the latent variable (liability) possesses a standard normal and logistic distribution, respectively. Although important, studies comparing these functions are scarce and are based on the analysis of true data (Gilmour et al., 1987; Kadarmideen et al., 2004). Therefore, an interesting alternative to evaluate the estimation and prediction ability of probit and logit link functions is the use of simulated data.

The objective of the present study was to evaluate the application of a GLMM to the estimation of genetic parameters and prediction of breeding values for early pregnancy by comparing probit and logit link functions using simulated data.

**Material and Methods**

The simulated population was structured based on the random mating of 50 bulls to 1500 unrelated cows, resulting in 1500 females derived from 50 groups of 30 half-sibs, which were submitted to the evaluation of early pregnancy.

Fixed and random effects were introduced into the simulation through the specification of a linear predictor given by:

where **η*** _{px1}* is the linear predictor and

*p*is the number of heifers (1500);

**β**

*is the vector of fixed effects,*

_{3x1}**β**' = [µ,β

_{1},β

_{ 2}], whose values adopted for the simulation were β

_{1}= 1and β

_{1}= -1, and µ is the overall mean (Table 2);

*u**is the vector of additive genetic values, with*

_{px1}

*u**~N*(

_{p}**0**,

*A*), where

*A*is the relationship matrix between animals and is the additive genetic variance of the population;

*X*and

_{px3}*X*are incidence matrices of effects β and

_{pxp}*u, r*espectively.

Considering that π* _{i}* = P[

*Y*=

_{i}*1*] is the probability of heifer

*i*(

*i*=

*1,2,...,1500*) to be precocious, the logit and probit link functions are given by expressions [2] and [3], respectively:

where Φ^{-1}(.) is the inverse function of the standard normal cumulative distribution.

According to the theory of GLMM, the probabilities of interest (π_{i}) can be isolated by matching the linear predictor [1] to the link functions [2] and [3]. Considering the logit and probit link functions, these probabilities are given by expressions [4] and [5]:

where Φ (.) is the standard normal distribution.

Once parametric values are assumed for β and , the probabilities of early pregnancy can be derived from [4] and [5] and are then used in the Monte Carlo simulation to generate phenotypic expressions of the early pregnancy trait. These expressions correspond to values 0 and 1, which are generated from binomial distributions, with the probability of success being given by P[*Y _{i}* = 1] = π

*.*

_{i}*Y*by the binomial distributions with probability π

_{i}*.Therefore, when a probit link function is used, the residual variance (), since it is unknown, is generally assumed to be 1. The objective of this value is to render the likelihood function, which is constructed based on the approach of probit regression, identifiable (Sorensen & Gianola, 2002). For the logit function, the residual variance is an approximation based on the variance of the logistic distribution (Southey et al., 2003) and is described by expression [6]:*

_{i}where *b* is the standard deviation of the logistic distribution, which is generally assumed to be 1 (Southey et al., 2003), as in the present study.

The parametric heritability used in the simulation was 0.40 and reflects the results of recent studies involving Nellore animals, which reported moderate to high estimates (Eler et al., 2002; Silva et al., 2003; Eler et al., 2004; Silva et al., 2005; Pereira et al., 2007; Shiotsuki et al., 2009). Different additive genetic variances were employed according to the link function to be used in the analysis of the simulated data. Thus, expressions [7] and [8] can be used to calculate the respective additive genetic variances for the logit () and probit () link functions:

The simulation was structured based on the following components (Table 1): i) inverse of the link function (expressions [4] and [5]); ii) link function used for estimation and prediction (expressions [2] and [3]).

One practical aspect that can influence the estimation of genetic parameters and prediction of breeding values for categorical traits when GLMM are used is the variation in the frequencies of observations for each category (Abdel-Azim & Berger, 1999). For this reason, different percentages of precocious females were adopted in the simulation since the results obtained with the logit and probit functions may vary depending on these percentages, which were modeled according to the values assumed for µ (Table 2). The following percentages of precocious females were adopted: 5, 10, 15, 20, 25, and 30%.

The different simulation scenarios were defined by combinations of the distinct models (logit/logit, probit/logit, logit/probit and probit/probit) and percentages of precocious females. This simulation process was implemented in the R software (R Development Core Team, 2009), considering 100 repetitions per scenario. The *mvrnorm* function of the MCMCpack was used to generate additive genetic values (*u*) in [1] of a multivariate normal distribution. The numerator relationship matrix *A*, which comprises the covariance matrix of this distribution, was obtained with the *kinship* function of the same package. The *rbinom* function was used to determine to which fixed effect of the female it would be submitted, with a probability of 0.5 of the animal being under the effect of β_{1} or β_{2} . Values of η_{i} were then determined based on the sum of fixed effects and the breeding value (*x´ _{i}*β +

*u*). These values of η

_{i}_{i}were used to calculate the probabilities P[

*Y*= 1] = π

_{i}*derived from the inverse link functions (expressions [4] and [5]) through a binomial function, considering the specifications*

_{i}*(link = "probit")$linkinv*and

*(link = "logit")$linkinv*for probit and logit links, respectively. These probabilities corresponded to the parameters of the binomial distribution necessary for the generation of binary data obtained with the

*rbinom*function.

The data sets generated for each scenario, corresponding to 100 repetitions, were analyzed by the GLMM method using the pedigreemm package (Vazquez et al., 2009) of the R software (R Development Core Team, 2009). This package uses the restricted maximum likelihood method for the estimation of variance components. In the present study, due to the adoption of a sire model, the sire variance estimates correspond to ¼ of the estimated value for the additive genetic random component.

Isolation of *ĥ ^{2}* from expressions [9] and [10], which correspond to the logit and probit link functions, respectively, was necessary to obtain the estimates of

*h*and to compare them with the parametric value of 0.40:

^{2}where is the sire variance.

The heritability estimates obtained were compared with the parametric value using the mean squared error. According to Casela & Berger (1990), it is a function that quantifies the expected value of the squared difference between the estimator (*ĥ ^{2}*) and parameter (

*h*). The mean squared error is defined as follows:

^{2}where *N* is the number of repetitions per scenario, in this case 100.

The predictions of additive breeding values were compared with parametric values using Pearson correlations. In addition, the percentage of coincidence between true and predicted ranking, considering only the 10% of bulls with the highest breeding values (TOP10) was measured.

**Results and Discussion**

The logit/probit and probit/logit simulation/analysis structures yielded mean estimates of *h ^{2}* below and above the parametric value, respectively, for all percentages of precocious females (Table 3). In addition, the mean squared errors obtained with these models were high when compared with those obtained with the probit/probit and logit/logit models. Considering only bias, the last two structures also showed a better performance than probit/logit and logit/probit since they provided estimates that were much closer to the parametric value.

The *ĥ ^{2}* values that showed the greatest bias in the probit/probit (0.3658) and logit/logit (0.3554) simulation/analysis structures were obtained at percentages of precocious females of 25 and 30%, respectively. However, these estimates can be considered close to the simulated parametric value (0.40). For these models, the highest mean squared error was obtained at a percentage of precocious females of 5% (0.0478 for probit/probit and 0.0392 for logit/logit), indicating that a lower proportion of early pregnancy success may increase the variation in

*h*estimates.

^{2}

Correlations between predicted and true breeding values increased with increasing percentage of precocious females when the probit/probit and logit/logit simulation models were used (Table 4). This behavior can be attributed to the fact that logit and probit link functions yield better results when the proportions of zero and one tend to be the same, i.e., 50%. According to Abdel-Azim & Berger (1999), the quality of multicategorical (probit) threshold models increases as the differences between the frequencies of observations for each category decrease. Since the percentages of precocious females adopted in the simulation did not include the value of 50%, the best results were reported for the closest percentage, i.e., 30%. The values adopted here were chosen according to previous studies involving Nellore cattle (Eler et al., 2002; Silva et al., 2003; Eler et al., 2004; Silva et al., 2005; Pereira et al., 2007; Shiotsuki et al., 2009).

The percentages of animals coinciding with the TOP10 bulls obtained with the logit/probit and probit/logit models were lower than those obtained with the probit/probit and logit/logit models (Table 5). These findings suggest that comparison between logit and probit link functions is necessary for the ranking of the best animals, in agreement with the results obtained for the estimation of heritability (Table 3) and for the prediction of additive genetic effects (Table 4). In addition, the most satisfactory results were observed with increasing percentage of precocious females, a fact also demonstrated by Pearson correlations (Table 4).

]]>

Taken together, the results obtained with the simulation models proposed show the need to compare the estimation and prediction ability of logit and probit link functions, irrespective of the percentage of precocious females. Simulating data on the logit scale and smoothing them using a probit function (logit/probit), or vice-versa (probit/logit), provided less satisfactory results than those obtained when the data were generated and analyzed using the same scale and a link function (probit/probit and logit/logit).

**Conclusions**

The generalized linear mixed models using logit and probit link functions provided similar genetic parameter estimates and predictions of breeding values of the animals when logit/logit and probit/probit simulation/analysis models were used. However, the results obtained with the probit/logit and logit/probit models were not satisfactory. It is therefore necessary to compare the estimation and prediction ability of link functions in order to determine which function is the most appropriate for the genetic evaluation of binary traits.

**Acknowledgements**

The authors thank the state funding agency Fundação de Amparo a Pesquisa do Estado de Minas Gerais for financial support and for the Masters fellowship granted.

**References**

ABDEL-AZIM, G.A.; BERGER, P.J. Properties of threshold model predictions. **Journal of Animal Science**, v.77, p.582-590, 1999. [ Links ]

BRESLOW, N.E.; CLAYTON, D.G. Approximate inference in generalized linear mixed models. **Journal of the American Statistical Association**, v.88, n.421, p.9-25, 1993. [ Links ]

CASELA, G.; BERGER, R.L. **Statistical inference**. Belmont: Duxbury Press, 1990. 660p. [ Links ]

DEMETRIO, C.G.B. **Modelos lineares generalizados em experimentacão agronômica**. Piracicaba: ESALQ, 2001. 113p. [ Links ]

ELER, J.P.; SILVA, J.A. II V.; FERRAZ, J.B.S. et al. Genetic evaluation of the probability of pregnancy at 14 months for Nellore heifers. **Journal of Animal Science**, v.80, p.4951-4954, 2002. [ Links ]

ELER, J.P.; SILVA, J.A. II V.; EVANS, J.L. et al. Additive genetic relationships between heifer pregnancy and scrotal circumference in Nellore cattle. **Journal of Animal Science**, v.82, n.9, p.2519-2527, 2004. [ Links ]

GIANOLA, D.; FOULLEY, J.L. Sire evaluation for ordered categorical data with threshold model. **Genetics Selection Evolution**, v.15, n.2, p.201-224, 1983. [ Links ]

GILMOUR, A.R.; ANDERSON, R.D.; RAE, A.L. The analysis of binomial data by generalized linear mixed model. **Biometrika**, v.72, n.3, p.539-599, 1985. [ Links ]

GILMOUR, A.R.; ANDERSON, R.D.; RAE, A.L. Variance components on an underlying scale for ordered multiple threshold categorical data using a generalized linear mixed model. **Journal of Animal Breeding and Genetics**, v.104, p.149-155, 1987. [ Links ]

HARVILLE, D.A.; MEE, R.W. A mixed model procedure for analysing ordered categorical data. **Biometrics**, v.40, p.393-408, 1984. [ Links ]

KADARMIDEEN, H.N., THOMPSON, R.; SIMM, G. Linear and threshold model genetic parameters for disease, fertility and milk production in dairy cattle. **Animal Science**, v.71, p.411-419, 2000. [ Links ]

KADARMIDEEN, H.N.; REKAYA, R.; GIANOLA, D. Genetic parameters for clinical mastitis in Holstein-Friesians: A Bayesian analysis. **Animal Science**, v.73, p.229-240, 2001. [ Links ]

KADARMIDEEN, H.N.; SCHWORER, D.; ILAHI, H. et al. A. Genetics of osteochondral disease and its relationship with meat quality and quantity, growth, and feed conversion traits in pigs. **Journal of Animal Science**, v.82, p.3118-3127, 2004. [ Links ]

McCULLAGH, P.; NELDER, J.A. **Generalized linear models**. 2.ed. London: Chapman & Hall, 1989. 511p. [ Links ]

NUNES, J.A.R.; MORAIS, A.R.; BUENO FILHO, J.S.S. Modelagem da superdispersão em dados por um modelo linear generalizado misto. **Revista de Matemática e Estatística**, v.22, n.1, p.55-70, 2004. [ Links ]

PEREIRA, I.G.; SILVA, F.F.; CARNEIRO, A.P.S. et al. Genetic evaluation of the binary reproductive trait via generalized linear mixed models (GLMM). In: WORLD CONGRESS ON GENETICS APPLIED TO LIVESTOCK PRODUCTION, 2006, Belo Horizonte. **Proceedings...** Belo Horizonte, v.8, p.233-233, 2006. [ Links ]

PEREIRA, I.G.; SILVA, F.F.; GARCIA, D.A. et al. Estimativa de herdabilidade para prenhez de novilhas aos 14 meses em um rebanho Nelore por meio de modelos lineares generalizados mistos. In: REUNIÃO ANUAL DA SOCIEDADE BRASILEIRA DE ZOOTECNIA, 44., 2007, Jaboticabal. **Anais...** Brasília: Sociedade Brasileira de Zootecnia, 2007. (CD-ROM). [ Links ]

R Development Core Team. **R:** A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2009. Available at: <http://www.R-project.org>. Accessed on: March 5, 2009. [ Links ]

SHIOTSUKI, L.; SILVA, J.A. II V.; ALBUQUERQUE, L.G. Associação genética da prenhez aos 16 meses com o peso à desmama e o ganho de peso em animais da raça Nelore. **Revista Brasileira de Zootecnia**, v.38, n.5, p.1211-1217, 2009. [ Links ]

SILVA, J.A.II V.; MELIS, M.H.V.; ELER, J.P. et al. Estimação de parâmetros genéticos para probabilidade de prenhez aos 14 meses e altura da garupa em bovinos da raça Nelore. **Revista Brasileira de Zootecnia**, v.32, n.5, p.1141-1146, 2003. [ Links ]

SILVA, J.A. II V.; DIAS L.T.; ALBUQUERQUE L.G. Estudo genético da precocidade sexual de novilhas em um rebanho Nelore. **Revista Brasileira de Zootecnia**, v.34, n.5, p.1568-1572, 2005. [ Links ]

SOUTHEY, B.R.; RODRIGUEZ-ZAS, S.L.; LEYMASTER, K.A. Discrete time survival analysis of lamb mortality in a terminal sire composite population. **Journal of Animal Science**, v.81, p.1399-1405, 2003. [ Links ]

SORENSEN, D.; GIANOLA, D. **Likelihood, Bayesian, and MCMC Methods in quantitative genetics**. New York: Springer-Verlag, 2002. p.607. [ Links ]

TEMPELMAN, R.J. Generalized linear mixed models in dairy cattle breeding. **Journal of Dairy Science**, v.81, p.1428-1444, 1998. [ Links ]

THOMPSON, R. Sire evaluation. **Biometrics**, v.35, p.339-353, 1979. [ Links ]

VAZQUEZ, A.I.; BATES, D.M.; ROSA, G.J.M. et al. Technical Note: An R package for fitting generalized linear mixed models in animal breeding. **Journal of Animal Science**, v.88, p.497-504, 2009. [ Links ]

Received August 23, 2010 and accepted July 6, 2011.

Corresponding author: diogo.agarcia@gmail.com

1 Project financed by FAPEMIG.