BREEDING, GENETIC AND REPRODUCTION
Mixture models in quantitative genetics and applications to animal breeding
Daniel Gianola^{I, III, *}; Paul J. Boettcher^{II}; Jørgen Ødegård^{III}; Bjørg Heringstad^{III}
^{I}Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA
^{II}Institute of Biology and Biotechnology of Agriculture, National Research Council, Segrate 20090, Italy
^{III}Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, P.O. Box 5025, N-1432 Ås, Norway
ABSTRACT
Finite mixture models are helpful for uncovering heterogeneity due to hidden structure; for example, unknown major genes. The first part of this article gives examples and reviews quantitative genetics issues of continuous characters having a finite mixture of Gaussian components. The partition of variance in a mixture, the covariance between relatives under the supposition of an additive genetic model and the offspring-parent regression are derived. Formulae for assessing the effect of mass selection operating on a mixture are given. Expressions for the genetic correlation between a mixture and a Gaussian trait are presented. If there is heterogeneity in a population at the genetic or environmental levels, then genetic parameters based on theory treating distributions as homogeneous can lead to misleading interpretations. Subsequently, methods for parameter estimation (e.g., maximum likelihood) are reviewed, and the Bayesian approach is illustrated via an application to somatic cell scores in dairy cattle.
Key words: Bayesian methods, dairy cattle, maximum likelihood, mixture distributions, quantitative genetics, somatic cell scores
Introduction
Linear models with mixed effects have been used extensively in animal breeding since the early 1950s (e.g., Henderson, 1973). An account of the theory can be found in Searle et al. (1992), whereas Bayesian treatments are in Gianola & Fernando (1986) and Sorensen & Gianola (2002). Mixed linear models are flexible and can be fitted in cross-sectional, longitudinal, spatial or multiresponse settings. In animal breeding, these models are used to infer genetic parameters such as heritability and genetic correlations, to estimate linear combinations of fixed effects (e.g., differences in mean value of cohorts born in successive generations), and to predict breeding values of candidates for selection. Animal breeding implementations typically involve large data sets and hundreds of thousands of correlated random effects due to coancestry relationships between animals (Gianola, 2001).
However, linear models do not accommodate discrete and censored variates well, and may not be robust enough when there is "concealed" structure in the data.
Finite mixture models, used in biology and in genetics since Pearson (1894), are helpful for uncovering heterogeneity due to hidden structure or incorrect assumptions. The objective of this article is to review some aspects related to analysis with finite mixture models in animal breeding contexts.
The paper is organized as follows. First, examples are presented where mixture models can play a useful role, followed by some basic statistical and quantitative genetics issues. Subsequently, maximum likelihood estimation of parameters of a mixture is discussed. Since use of Bayesian methods is exploding in biology, a brief account of a Bayesian application of mixtures to dairy cattle data is presented. The paper ends with concluding remarks.
Examples
Arguably, finite mixture models can play an increasingly important role in animal breeding. To illustrate, consider four problems arising in animal genetics. Unknown loci with major effects can create "bumps" (sometimes quite subtle) in a phenotypic distribution, and this heterogeneity may be resolved by fitting a mixture, i.e., by calculating conditional probabilities that a datum is drawn from one of the several potential, yet unknown, genotypes. A brief review of the use of mixtures for uncovering major genes is in Lynch & Walsh (1998). Also, many quantitative trait loci (QTL) detection procedures are based on ideas from mixture models (Haley & Knott, 1992). Basically, given marker and phenotypic data, one computes the probability that an individual has genotype QQ, Qq or qq at a putative QTL.
The second example is from dairy cattle breeding. Mastitis is an inflammation of the mammary gland of cows associated with bacterial infection. Genetic variation in susceptibility to the disease exists, and genetic selection for resistance is a feasible strategy (Heringstad et al., 2000). However, routine recording of mastitis events is not conducted in most nations. Instead, milk somatic cell scores (SCS) measured in cows have been used as a proxy for the disease in genetic evaluation of artificial insemination bulls (with Gaussian mixed effects models), much as the prostate-specific antigen is treated as a proxy for prostatic cancer. It is not obvious how SCS information should be treated optimally in genetic evaluation, because normal, clinical and different types of subclinical cases are hidden. Some of the challenges may be met using finite mixture models, as suggested by Detilleux & Leroy (2000), Ødegård et al. (2003, 2005), Gianola et al. (2004) and Boettcher et al. (2005, 2007).
Another example is from transcriptional analysis and genomics. In microarray studies, messenger RNA samples are collected from 2 target tissues, converted into complementary DNA, labelled with dyes of different colors (typically red and green), and hybridized against thousands of known pieces of DNA (genes) spotted on a slide. If a gene is expressed in the targets, hybridization is detected via fluorescence. Some spots are green, others red, plus every color in between! Image analysis is used for quantifying the extent of hybridization and differential expression. Observed expression may not reflect true differential expression, however. One can think in terms of a mixture of at least 2 distributions:
1) if there is differential expression, the distributions of measurements in the red and green channels for a given gene would have different parameters, and
2) in the absence of differential expression, these parameters should be equal.
The fourth example is that of assessing genetic change in populations of animals subject to artificial selection. Animals are born, die or are culled at any point in time, so generations overlap. Since many such animals have unknown parents, it is difficult to give a crisp definition of "generation". Animal breeders "group" individuals into more or less arbitrary cohorts (Quaas, 1988; Westell et al., 1988). However, the "true" group structure might be finer or coarser. An alternative to arbitrary grouping is to assume that unobservable genetic effects of unknown parents are drawn from a mixture of distributions.
Statistical and quantitative genetic issues
Density, mean and variance. A random variate y (the distinction between variables and realized values is ignored in the notation) is drawn from one of K mutually exclusive and exhaustive distributions (groups), without knowing which of these underlies the draw. For instance, the observed SCS in the milk of a cow may be from a healthy or from an infected animal; if the disease is mastitis, the case may be clinical or subclinical. In the absence of a precise veterinary diagnosis, there is uncertainty about which group the observed SCS pertains to. Here, K = 3 and the underlying groups are: uninfected, clinical and subclinical. The density of y can be written as

p(y|θ) = Σ_{i=1}^{K} P_{i} p_{i}(y|θ_{i}),

where K (assumed known) is the number of components of the mixture; P_{i} is the probability that the draw is from the i^{th} component; p_{i}(y|θ_{i}) is the density of the distribution of y under component i; θ_{i} is a parameter vector indexing such distribution, and θ = [θ'_{1}, θ'_{2}, ..., θ'_{K}, P_{1}, P_{2}, ..., P_{K}]' is the collection of all distinct parameters, subject to Σ_{i=1}^{K} P_{i} = 1. In general, y may be either scalar or vector valued, or may be discrete. Here, we consider the situation where component distributions are Gaussian. In what follows, the notation N(y|m, σ^{2}) will denote a univariate normal distribution or density with mean m and variance σ^{2}, whereas N(y|µ, Σ) pertains to a multivariate normal setting, where µ is the mean vector and Σ is the variance-covariance matrix.
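As a concrete numeric sketch of the mixture density above (not from the paper; the three-component SCS parameters below are invented purely for illustration), one can evaluate p(y|θ) = Σ_{i} P_{i} N(y|µ_{i}, σ^{2}_{i}) directly:

```python
import math

def normal_pdf(y, mean, var):
    """Univariate Gaussian density N(y | mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(y, probs, means, variances):
    """Density of a K-component Gaussian mixture: sum_i P_i N(y | mu_i, sigma_i^2)."""
    assert abs(sum(probs) - 1.0) < 1e-12, "mixing proportions must sum to 1"
    return sum(p * normal_pdf(y, m, v) for p, m, v in zip(probs, means, variances))

# Hypothetical K = 3 example: uninfected, subclinical and clinical groups.
probs, means, variances = [0.80, 0.15, 0.05], [2.0, 4.0, 6.0], [1.0, 1.5, 2.0]
density_at_3 = mixture_pdf(3.0, probs, means, variances)
```

With K = 1 the mixture collapses to a single Gaussian, which gives a simple sanity check.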
The mean and variance of a finite mixture of K Gaussian distributions, with now θ = [P_{1}, P_{2}, ..., P_{K}, µ_{1}, µ_{2}, ..., µ_{K}, σ^{2}_{1}, σ^{2}_{2}, ..., σ^{2}_{K}]', are

E(y) = Σ_{i=1}^{K} P_{i} µ_{i} = µ̄   (1)

and

Var(y) = Σ_{i=1}^{K} P_{i} σ^{2}_{i} + Σ_{i=1}^{K} P_{i} (µ_{i} − µ̄)^{2}.   (2)

The first term in (2) can be construed as the average variance, whereas the second term measures dispersion between group means; if the µ's are equal, this second term is null. Note that the variance of the mixture depends not only on the group variances, but on the group means as well.
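The two-term variance decomposition just described can be checked numerically; the following is a minimal sketch with made-up component parameters, not an analysis from the paper:

```python
def mixture_mean(probs, means):
    """E(y) = sum_i P_i mu_i."""
    return sum(p * m for p, m in zip(probs, means))

def mixture_variance(probs, means, variances):
    """Var(y) = sum_i P_i sigma_i^2   (average variance)
              + sum_i P_i (mu_i - mu_bar)^2   (dispersion between group means)."""
    mu_bar = mixture_mean(probs, means)
    within = sum(p * v for p, v in zip(probs, variances))
    between = sum(p * (m - mu_bar) ** 2 for p, m in zip(probs, means))
    return within + between

# With equal means the between-means term vanishes; with distinct means it does not.
v_equal_means = mixture_variance([0.5, 0.5], [1.0, 1.0], [1.0, 3.0])   # 2.0 + 0.0
v_mixed_means = mixture_variance([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])   # 1.0 + 1.0
```

Both examples give a total variance of 2, but for different reasons: the first from the average of the group variances alone, the second partly from the spread of the group means.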
The additive genetic mixture model
The quantitative genetics of characters distributed as mixtures has not been studied extensively, although the idea underlies work of, e.g., Latter (1965) and Kimura & Crow (1978). What follows is a summary of results in Gianola et al. (2006). Suppose an observable random variable (y_{i}; phenotype of individual i) is drawn from the finite mixture of G_{E} Gaussian components

p(y_{i}|p_{e}, µ_{e}, σ^{2}_{e}, a_{i}) = Σ_{k=1}^{G_{E}} Pe_{k} N(y_{i}|µ_{k} + a_{i}, σ^{2}_{e_{k}}),   (3)

where p_{e} is a vector containing the mixing proportions Pe_{k} (summing to 1); µ_{e} and σ^{2}_{e} are each G_{E} x 1 vectors of means and variances with typical elements µ_{k} and σ^{2}_{e_{k}}, respectively; a_{i} is the genetic value of i. The mean and variance of this conditional (given the genetic effect) distribution are

E(y_{i}|a_{i}) = Σ_{k=1}^{G_{E}} Pe_{k} µ_{k} + a_{i} = µ̄_{e} + a_{i}   (4)

and

Var(y_{i}|a_{i}) = Σ_{k=1}^{G_{E}} Pe_{k} σ^{2}_{e_{k}} + Σ_{k=1}^{G_{E}} Pe_{k} (µ_{k} − µ̄_{e})^{2} = σ̄^{2}_{e} + σ^{2}_{µ_{e}},   (5)

respectively, where σ̄^{2}_{e} is the residual or environmental variance. Informally, σ^{2}_{µ_{e}} is the part of the environmental variance contributed by population heterogeneity.
Assume that the genetic effect a_{i} is also drawn from the mixture with G_{A} components

p(a_{i}|p_{a}, α, σ^{2}_{a}) = Σ_{m=1}^{G_{A}} Pa_{m} N(a_{i}|α_{m}, σ^{2}_{a_{m}}),   (6)

where p_{a} = [Pa_{1}, ..., Pa_{G_{A}}]', α = [α_{1}, ..., α_{G_{A}}]' and σ^{2}_{a} = [σ^{2}_{a_{1}}, ..., σ^{2}_{a_{G_{A}}}]' are the vectors of mixing proportions, component means and component variances, respectively. The mean and variance of this distribution are

E(a_{i}) = Σ_{m=1}^{G_{A}} Pa_{m} α_{m} = ᾱ and Var(a_{i}) = Σ_{m=1}^{G_{A}} Pa_{m} σ^{2}_{a_{m}} + Σ_{m=1}^{G_{A}} Pa_{m} (α_{m} − ᾱ)^{2} = σ̄^{2}_{a} + σ^{2}_{α},   (7)

where σ̄^{2}_{a} is the genetic variance, and σ^{2}_{α} is interpretable as variance between genetic means. In Gaussian linear models the distribution of the random genetic effects is often taken to be N(a|0, σ^{2}_{a}), where σ^{2}_{a} is the additive genetic variance, so it may be reasonable to introduce the restriction ᾱ = 0 in the mixture (Verbeke & Lesaffre, 1996). The joint density of a_{i} and y_{i} is obtained by multiplication of (3) and (6), yielding

p(a_{i}, y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Pe_{k} Pa_{m} N(y_{i}|µ_{k} + a_{i}, σ^{2}_{e_{k}}) N(a_{i}|α_{m}, σ^{2}_{a_{m}}),   (8)

which is a finite mixture of G_{E} x G_{A} bivariate normal distributions, with mixing proportion Pe_{k}Pa_{m} for the km^{th} component; note that Σ_{k}Σ_{m} Pe_{k}Pa_{m} = 1. From standard Gaussian linear models theory, given the km^{th} component (let the indicator δ_{km} = 1 denote such situation)

[a_{i}, y_{i}]'|δ_{km} = 1 ~ N_{2}([α_{m}, µ_{k} + α_{m}]', [σ^{2}_{a_{m}}, σ^{2}_{a_{m}}; σ^{2}_{a_{m}}, σ^{2}_{a_{m}} + σ^{2}_{e_{k}}]),
where N_{2}(·, ·) denotes a bivariate normal distribution (the covariance matrix is written row by row). Further,

a_{i}|y_{i}, δ_{km} = 1 ~ N(E_{km}, (1 − b_{km})σ^{2}_{a_{m}}),

where

E_{km} = α_{m} + b_{km}(y_{i} − µ_{k} − α_{m})

and

b_{km} = σ^{2}_{a_{m}}/(σ^{2}_{a_{m}} + σ^{2}_{e_{k}}).
Under the standard additive genetic model, this regression of "genotype on phenotype", b_{km}, is the heritability of the character under the km^{th} component of the bivariate mixture. The joint density (8) is also expressible as

p(a_{i}, y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Pe_{k} Pa_{m} N(a_{i}|E_{km}, (1 − b_{km})σ^{2}_{a_{m}}) N(y_{i}|µ_{k} + α_{m}, σ^{2}_{a_{m}} + σ^{2}_{e_{k}}).   (9)

The marginal density of y_{i} is arrived at by integrating (9) over a_{i}, yielding

p(y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Pe_{k} Pa_{m} N(y_{i}|µ_{k} + α_{m}, σ^{2}_{a_{m}} + σ^{2}_{e_{k}}).   (10)
This is a finite mixture of G_{E} x G_{A} univariate normal distributions with mixing proportions Pe_{k}Pa_{m}. The mean and variance of the phenotypic distribution are

E(y_{i}) = Σ_{k=1}^{G_{E}} Pe_{k} µ_{k} + Σ_{m=1}^{G_{A}} Pa_{m} α_{m} = µ̄_{e} + ᾱ   (11)

and

Var(y_{i}) = σ̄^{2}_{a} + σ^{2}_{α} + σ̄^{2}_{e} + σ^{2}_{µ_{e}}.   (12)
A standard problem in quantitative genetics is that of inferring genetic values from phenotypes. From (9) and (10), the density of the conditional distribution of a_{i} given y_{i} is

p(a_{i}|y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Q_{km} N(a_{i}|E_{km}, (1 − b_{km})σ^{2}_{a_{m}}),   (13)

where

Q_{km} = Pe_{k} Pa_{m} N(y_{i}|µ_{k} + α_{m}, σ^{2}_{a_{m}} + σ^{2}_{e_{k}})/p(y_{i}).

Hence, the conditional distribution of a_{i} given y_{i} is a mixture of G_{E} x G_{A} normal distributions, where the mixing proportion Q_{km} is the conditional probability that the datum is drawn from the km^{th} component, given the observation y_{i}. The best predictor of genetic value is the conditional expectation function (Henderson, 1973; Bulmer, 1980; Fernando & Gianola, 1986)

E(a_{i}|y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Q_{km}[α_{m} + b_{km}(y_{i} − µ_{k} − α_{m})],   (14)

which is a weighted average of the conditional expectations peculiar to each of the G_{E} x G_{A} components of mixture (13). This result is important: the regression of genotype on phenotype is not linear in y_{i}. Therefore, standard linear models give less than optimal predictions of genetic effects for traits distributed as mixtures. Further, the variance of the conditional distribution is

Var(a_{i}|y_{i}) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Q_{km}[(1 − b_{km})σ^{2}_{a_{m}} + E^{2}_{km}] − [E(a_{i}|y_{i})]^{2}.   (15)
In the standard additive genetic linear model, the variance of the conditional distribution of genotypes given phenotypes is (1 − h^{2})σ^{2}_{a}, where h^{2} is the coefficient of heritability; this conditional variance is homogeneous and does not depend on the data. In a mixture model, however, the dispersion about the regression function is heteroscedastic and nonlinear on the phenotypic value. Hence, both point and interval prediction of genetic value in mixtures involve strikingly different formulae.
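The nonlinearity of the best predictor can be seen numerically. The sketch below (illustrative values only, not from the paper) implements the weighted-average form of E(a_{i}|y_{i}); with a single component it collapses to the familiar linear predictor h^{2}(y − µ):

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def best_predictor(y, Pe, mu, Pa, alpha, var_a, var_e):
    """E(a | y) for a mixture with G_E environmental and G_A genetic components:
    a weighted average of component-wise linear regressions, with weights Q_km."""
    num, den = 0.0, 0.0
    for k in range(len(Pe)):
        for m in range(len(Pa)):
            w = Pe[k] * Pa[m] * normal_pdf(y, mu[k] + alpha[m], var_a[m] + var_e[k])
            b_km = var_a[m] / (var_a[m] + var_e[k])           # component heritability
            e_km = alpha[m] + b_km * (y - mu[k] - alpha[m])   # component regression
            num += w * e_km
            den += w
    return num / den

# Two environmental components: the regression of genotype on phenotype bends.
e0 = best_predictor(0.0, [0.9, 0.1], [0.0, 3.0], [1.0], [0.0], [1.0], [1.0, 1.0])
e2 = best_predictor(2.0, [0.9, 0.1], [0.0, 3.0], [1.0], [0.0], [1.0], [1.0, 1.0])
e4 = best_predictor(4.0, [0.9, 0.1], [0.0, 3.0], [1.0], [0.0], [1.0], [1.0, 1.0])
```

Equal increments in y produce unequal increments in E(a|y), which is exactly the nonlinearity noted in the text.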
Truncation selection
Consider the standard truncation selection setting in which individuals kept as parents are such that y_{i} > t, with the proportion of individuals selected being Pr(y_{i} > t) = γ. From (10), the distribution of phenotypic values among selected individuals has density

p(y_{i}|y_{i} > t) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Pe_{k} Pa_{m} N(y_{i}|µ_{k} + α_{m}, σ^{2}_{a_{m}} + σ^{2}_{e_{k}})/γ, for y_{i} > t,   (16)

where

γ = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} Pe_{k} Pa_{m} γ_{km}, with γ_{km} = 1 − Φ[(t − µ_{k} − α_{m})/(σ^{2}_{a_{m}} + σ^{2}_{e_{k}})^{1/2}].

Above, γ_{km} is the proportion selected within the km^{th} mixture component and Φ(·) is the standard normal distribution function. The proportion selected, γ, is thus a weighted average of the individual component selection proportions γ_{km}. Since the threshold is fixed, the components that are most prevalent, have largest means and are most variable will be influential.
The mean value of selected individuals is

E(y_{i}|y_{i} > t) = Σ_{k=1}^{G_{E}} Σ_{m=1}^{G_{A}} w_{km}[µ_{k} + α_{m} + i_{km}(σ^{2}_{a_{m}} + σ^{2}_{e_{k}})^{1/2}],   (17)

where i_{km} = φ(t_{km})/γ_{km} is the selection intensity factor under the km^{th} component, with t_{km} = (t − µ_{k} − α_{m})/(σ^{2}_{a_{m}} + σ^{2}_{e_{k}})^{1/2} and φ(·) the standard normal density, and

w_{km} = Pe_{k} Pa_{m} γ_{km}/γ

are relative weights summing to 1. The phenotypic superiority of selected individuals, or selection differential (S), is given by the difference between (17) and (11). Further, the mean genetic value of selected parents is

E(a_{i}|y_{i} > t) = E[E(a_{i}|y_{i})|y_{i} > t].

This expression cannot be evaluated analytically, because the inner expectation is a highly nonlinear function of the phenotypic values. Finally, the genetic superiority of accepted parents over the unselected population is

Δ_{a} = E(a_{i}|y_{i} > t) − ᾱ.
The expected fraction of the selection differential that is realized can be assessed as Δ_{a}/S, and this will differ from what could be expected from the regression of offspring on mid-parent, because of nonlinearity.
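The selected proportion γ and the mean of the selected group can be computed component by component using standard truncated-normal results; the following is an illustrative sketch for arbitrary component parameters (not code from the paper):

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def selected_mean(t, Pe, mu, Pa, alpha, var_a, var_e):
    """Overall proportion selected (gamma) and mean phenotype of individuals
    with y > t, accumulated over the G_E x G_A mixture components."""
    gamma, num = 0.0, 0.0
    for k in range(len(Pe)):
        for m in range(len(Pa)):
            s = math.sqrt(var_a[m] + var_e[k])
            x = (t - mu[k] - alpha[m]) / s
            g_km = 1.0 - std_normal_cdf(x)    # proportion selected in component km
            i_km = std_normal_pdf(x) / g_km   # selection intensity factor
            w = Pe[k] * Pa[m] * g_km
            gamma += w
            num += w * (mu[k] + alpha[m] + i_km * s)
    return gamma, num / gamma
```

With a single standard-normal component and t = 0, half the population is selected and the mean of the selected group is φ(0)/0.5 ≈ 0.798, the textbook truncated-normal result.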
Heritability
The fraction of variance attributable to additive genetic effects (the usual definition of heritability) is location invariant for a Gaussian trait, i.e., it does not involve mean values. In a mixture, heritability becomes

h^{2} = (σ̄^{2}_{a} + σ^{2}_{α})/(σ̄^{2}_{a} + σ^{2}_{α} + σ̄^{2}_{e} + σ^{2}_{µ_{e}}).

The partition of variance depends on component-specific variances (σ^{2}_{a_{m}} and σ^{2}_{e_{k}}), on mixing proportions (Pa_{m} and Pe_{k}) and on mean values (µ_{k} and α_{m}) as well. In the simpler case in which the genetic distribution is the homogeneous process N(a_{i}|0, σ^{2}_{a}), "heritability" becomes

h^{2} = σ^{2}_{a}/(σ^{2}_{a} + σ̄^{2}_{e} + σ^{2}_{µ_{e}}),

and this is expected to be lower than in a homogeneous population, because differences between component means ("fixed effects") contribute to the phenotypic variance.
Offspring-parent regression
The standard formula for the regression of the phenotypic value of a progeny (O) on that of a parent (P) gives

β_{OP} = Cov(O, P)/Var(P) = (1/2)(σ̄^{2}_{a} + σ^{2}_{α})/(σ̄^{2}_{a} + σ^{2}_{α} + σ̄^{2}_{e} + σ^{2}_{µ_{e}}).

If the distribution of genetic effects is homogeneous, this simplifies to

β_{OP} = (1/2)σ^{2}_{a}/(σ^{2}_{a} + σ̄^{2}_{e} + σ^{2}_{µ_{e}}).   (20)
The consequences are clear: if there is heterogeneity either in the distribution of sampling model residuals or of genetic effects, then β_{OP} is affected by the mixing proportions and by the means µ_{k}. To illustrate, suppose that the genetic distribution is homogeneous; let G_{E} = 2, take µ_{1} = 0 as origin, put Δ = µ_{2} − µ_{1} and P_{e} = Pe_{1}, and assume a common residual variance σ^{2}_{e} in the two components. Then (20) is expressible as

β_{OP} = (1/2)σ^{2}_{a}/[σ^{2}_{a} + σ^{2}_{e} + P_{e}(1 − P_{e})Δ^{2}].

When P_{e} = 1, the formula gives half of heritability, which is a standard result. The function is symmetric with respect to P_{e}; since P_{e}(1 − P_{e}) is maximum at P_{e} = 1/2, the regression is minimum at this value. As an example, consider the offspring-parent regression as a function of P_{e} for four situations with different additive genetic variance (σ^{2}_{a}) and distances between means (Δ) in the two distributions of the mixture: 1) σ^{2}_{a} = 1, Δ = 1; 2) σ^{2}_{a} = 1, Δ = 2; 3) σ^{2}_{a} = .10, Δ = 1, and 4) σ^{2}_{a} = .10, Δ = 2. Situations (1) and (2) correspond to a trait with a heritability of .50 under homogeneity, while (3) and (4) are for a lowly heritable trait (h^{2} ≈ .09). In (1) and (2), the regression decreases from 0.25 to about 0.22 and 0.17, respectively, representing relative decreases in heritability of 12 and 32%. The relative decreases in heritability are 18 and 47% in cases (3) and (4), respectively. In brief, heritability in heterogeneous or admixed populations depends on the mixing proportion, on the mean difference between mixture components and on the "homogeneous situation" heritability.
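The four numeric cases above can be reproduced in a few lines, assuming (as the quoted heritabilities imply) a residual variance of 1 in both environmental components; this is a sketch, not code from the paper:

```python
def offspring_parent_regression(var_a, var_e, p_e, delta):
    """b_OP = (1/2) var_a / (var_a + var_e + p_e (1 - p_e) delta^2),
    for a homogeneous genetic distribution and a 2-component
    environmental mixture with mean difference delta."""
    return 0.5 * var_a / (var_a + var_e + p_e * (1.0 - p_e) * delta ** 2)

# The four cases, at the most heterogeneous point p_e = 0.5, with var_e = 1 assumed.
b1 = offspring_parent_regression(1.0, 1.0, 0.5, 1.0)   # case (1)
b2 = offspring_parent_regression(1.0, 1.0, 0.5, 2.0)   # case (2)
b3 = offspring_parent_regression(0.1, 1.0, 0.5, 1.0)   # case (3)
b4 = offspring_parent_regression(0.1, 1.0, 0.5, 2.0)   # case (4)
```

At p_e = 1 (no heterogeneity) the function returns h²/2, and the relative drop of b1 through b4 against that baseline matches the approximate percentages quoted in the text.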
Correlations with a Gaussian trait
Correlations between a mixture trait and a normally distributed character (ω) may be of interest. For example, the mixture trait could be SCC in dairy cattle, with several component distributions corresponding to different unknown statuses of mammary gland disease. The Gaussian trait could be milk yield of a cow. Is the genetic correlation between the two traits affected by heterogeneity of somatic cell count?
The effect of Pa_{m} on the genetic correlation is illustrated next for a 2-component mixture. Let λ = σ^{2}_{a_{2}}/σ^{2}_{a_{1}} be a heteroscedasticity factor, where σ^{2}_{a_{1}}, the genetic variance under the first component of the mixture, is viewed as "baseline" genetic variance, i.e., a measure of variability in the absence of heterogeneity. Then, it can be shown that, under some simplifying assumptions, the genetic correlation takes the form

ρ_{a_{y},a_{w}} = ρ_{homo} f(Pa_{1}, λ),

where ρ_{homo} is the genetic correlation in the absence of a mixture and f(Pa_{1}, λ) is the factor by which ρ_{homo} is modified by heterogeneity. Since the sign of ρ_{a_{y},a_{w}} is invariant with respect to Pa_{1}, it suffices to examine its values only under positive ρ_{homo}. Figure 1 displays the relationship between the genetic correlation and Pa_{1} for two values of ρ_{homo} (0.7 and 0.3) and of λ (1.5 and 2). As Pa_{1} increases, the proportion of the component with the smaller genetic variance (m = 1) increases. The genetic correlation increases monotonically with Pa_{1}, the proportion of the less variable component, and more rapidly so at the largest value of genetic heteroscedasticity. Similar algebra and considerations hold for the environmental correlation between traits.
Summary remarks
If there is heterogeneity in a population either at the genetic or environmental levels, then genetic parameters based on theory treating distributions as homogeneous can lead to misleading interpretations.
Some peculiarities of mixture characters are: heritability depends on the mean values of the component distributions, the offspring-parent regression is nonlinear, and genetic or phenotypic correlations cannot be interpreted without reference to the mixing proportions and the parameters of the component distributions.
Maximum likelihood estimation
Motivation
Detilleux & Leroy (2000) pointed out advantages of a mixture model for analysis of SCS in dairy cows. The mixture model can account for effects of infection status on SCS and produce an estimate of the prevalence of infection, plus a probability of status (infected versus uninfected) for individual cows, given the data and values of the parameters. Detilleux & Leroy (2000) proposed a 2-component mixture model, referred to as DL hereafter. Although additional components may be required for finer statistical modelling of SCS, our focus will be on a 2-component specification, as a reasonable point of departure. An important issue is that of parameter identification. In likelihood inference this can be resolved by introducing restrictions on parameter values, although this creates computational difficulties. In Bayesian settings, proper priors solve the identification problem; a Bayesian analysis with Markov chain Monte Carlo procedures is straightforward, but the priors must be proper. However, many geneticists are refractory to using Bayesian models with informative priors, so having alternative methods of analysis available is desirable. Hereafter, a normal mixture model with correlated random effects is presented from a likelihood-based perspective.
Hierarchical DL
The mixture model is developed hierarchically. Let P be the probability that a SCS record is from an uninfected cow. Unconditionally to group membership, but given the breeding value of the cow, the density of observation i (i = 1, 2, ..., n) is

p(y_{i}|β, a_{i}, P, σ^{2}_{e}) = P N_{0} + (1 − P) N_{1},

where N_{0} and N_{1} denote the densities of the uninfected and infected components, respectively (their Gaussian forms are given below); y_{i} and a_{i} are the SCS and additive genetic value, respectively, of the cow on which the record is taken, and β = [β'_{0}, β'_{1}]'. The probability that the draw is made from distribution 0 is supposed constant from individual to individual.
Assuming that records are conditionally independent, the density of all n observations, given the breeding values, is

p(y|β, a, P, σ^{2}_{e}) = Π_{i=1}^{n} [P N_{0} + (1 − P) N_{1}].   (22)

The joint density of y and a is then

p(y, a|β, P, σ^{2}_{e}, σ^{2}_{a}) = p(y|β, a, P, σ^{2}_{e}) p(a|σ^{2}_{a}),

and the marginal density of the data is

p(y|β, P, σ^{2}_{e}, σ^{2}_{a}) = ∫ p(y|β, a, P, σ^{2}_{e}) p(a|σ^{2}_{a}) da.   (23)

When viewed as a function of the parameters, (23) is Fisher's likelihood. This can be written as the product of n integrals only when individuals are genetically unrelated; in that case, σ^{2}_{a} would not be identifiable. On the other hand, if a_{i} represents some cluster effect (e.g., a sire's transmitting ability), the between-cluster variance can be identified.
DL assume normality throughout and take N_{0} ≡ N(y_{i}|x'_{0i}β_{0} + a_{i}, σ^{2}_{e}) and N_{1} ≡ N(y_{i}|x'_{1i}β_{1} + a_{i}, σ^{2}_{e}). Here, x'_{0i} and x'_{1i} are known incidence vectors relating fixed effects to observations. The assumption about the genetic effects is a|A, σ^{2}_{a} ~ N(0, Aσ^{2}_{a}). Let now z_{i} be an independent (a priori) Bernoulli random variable taking the value z_{i} = 0 with probability P if the datum is drawn from process N_{0}, or the value z_{i} = 1 with probability 1 − P if from N_{1}. Assuming all parameters are known, one has

Pr(z_{i} = 1|y_{i}, a_{i}, β, P, σ^{2}_{e}) = (1 − P)N_{1}/[P N_{0} + (1 − P)N_{1}].   (24)

Thus, (24) is the probability that the cow belongs to the infected group, given the observed SCS, her breeding value and the parameters.
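Numerically, the membership probability is just a ratio of weighted component densities. In the sketch below the component means (which in the model would include x'β + a_{i}) and the variance are invented for illustration:

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def prob_infected(y, p_healthy, mean0, mean1, var_e):
    """Pr(z = 1 | y, parameters): posterior probability that a record
    comes from the infected component of a 2-component normal mixture."""
    f0 = p_healthy * normal_pdf(y, mean0, var_e)
    f1 = (1.0 - p_healthy) * normal_pdf(y, mean1, var_e)
    return f1 / (f0 + f1)
```

With a prior probability of health of 0.95 and component means of, say, 2 and 5, a record near the healthy mean is assigned a near-zero infection probability, and the probability rises monotonically with the observed SCS.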
A linear model for an observation (given z_{i}) can be written as

y_{i} = (1 − z_{i})x'_{0i}β_{0} + z_{i}x'_{1i}β_{1} + a_{i} + e_{i}.

A vectorial representation is

y = [I − Diag(z_{i})]X_{0}β_{0} + Diag(z_{i})X_{1}β_{1} + a + e,

where Diag(z_{i}) is a diagonal matrix with typical element z_{i}; X_{0} is an n x p_{0} matrix with typical row x'_{0i}; X_{1} is an n x p_{1} matrix with typical row x'_{1i}; a = {a_{i}} and e = {e_{i}}. Specific forms of β_{0} and β_{1} (and of the corresponding incidence matrices) are context-dependent, but care must be exercised to ensure parameter identifiability and to avoid what is known as label switching. For example, DL take X_{0}β_{0} = 1µ_{0} and X_{1}β_{1} = 1µ_{1}.
EM Algorithm
One can extremize (23) with respect to Θ via the expectation-maximization algorithm, or EM. An EM version with stochastic steps was developed by Gianola et al. (2004). The EM algorithm augments (22) with n binary indicator variables z_{i} (i = 1, 2, ..., n), taken as independently and identically distributed Bernoulli variables with Pr(z_{i} = 0) = P. If z_{i} = 0, the SCS datum is generated from the uninfected component; if z_{i} = 1, the draw is from the other component. Let z = [z_{1}, z_{2}, ..., z_{n}]' denote the realized values of all z variables. The complete data is the vector [a', y', z']', with [a', z']' constituting the missing part and y representing the observed fraction. The joint density of a, y and z can be written as

p(a, y, z) = p(y|a, z) Pr(z) p(a).   (25)

Given z, the component of the mixture generating the data is known automatically for each observation. Now

p(y_{i}|a_{i}, z_{i}) Pr(z_{i}) = [P N_{0}]^{1 − z_{i}} [(1 − P) N_{1}]^{z_{i}},

for i = 1, 2, ..., n. Then, (25) becomes

p(a, y, z) = {Π_{i=1}^{n} [P N_{0}]^{1 − z_{i}} [(1 − P) N_{1}]^{z_{i}}} p(a|σ^{2}_{a}).   (26)
The form of (26) leads to conditional distributions needed for implementing the Monte Carlo EM algorithm. Details (very tedious) are in Gianola et al. (2004).
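To convey the flavour of EM for a 2-component Gaussian mixture, the sketch below fits the mixing proportion, the two component means and a common variance to simulated data. It deliberately ignores the correlated genetic effects, so it is a much simplified stand-in for the Monte Carlo EM of Gianola et al. (2004), with all data and starting values invented:

```python
import math
import random

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_two_component(y, p, mu0, mu1, var, n_iter=200):
    """EM for a 2-component normal mixture with common variance.
    p is the probability of the first ("uninfected") component."""
    for _ in range(n_iter):
        # E-step: posterior probability that each record is from component 1
        w = [(1.0 - p) * normal_pdf(yi, mu1, var)
             / (p * normal_pdf(yi, mu0, var) + (1.0 - p) * normal_pdf(yi, mu1, var))
             for yi in y]
        # M-step: weighted closed-form updates of the parameters
        s1 = sum(w)
        s0 = len(y) - s1
        p = s0 / len(y)
        mu0 = sum((1.0 - wi) * yi for wi, yi in zip(w, y)) / s0
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / s1
        var = sum((1.0 - wi) * (yi - mu0) ** 2 + wi * (yi - mu1) ** 2
                  for wi, yi in zip(w, y)) / len(y)
    return p, mu0, mu1, var

# Simulated data: 80% "healthy" records around 2, 20% "infected" around 6.
random.seed(1)
data = ([random.gauss(2.0, 1.0) for _ in range(800)]
        + [random.gauss(6.0, 1.0) for _ in range(200)])
p_hat, mu0_hat, mu1_hat, var_hat = em_two_component(data, 0.5, 1.0, 7.0, 2.0)
```

Starting the means on the correct sides of the two clusters sidesteps label switching; the estimates land close to the simulation values (p ≈ 0.8, means near 2 and 6, variance near 1).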
Genetic evaluation for SCS could be based on â_{i}, the i^{th} element of â, the mean vector of the distribution of a given y and the maximum likelihood estimates of the parameters. While estimates of the fixed effects, of P and of the variance components follow from the maximum likelihood procedure, â is calculated more conveniently using Monte Carlo methods. Another issue is how the SCS information is translated into chances of a cow belonging to the uninfected group. A simple option is to estimate (24) as

P̂r(z_{i} = 1|y_{i}, â_{i}) = (1 − P̂)N̂_{1}/[P̂ N̂_{0} + (1 − P̂)N̂_{1}],   (27)

where the hats denote evaluation at the maximum likelihood estimates. Statistically, (27) does not take into account the error of the maximum likelihood estimates of the parameters. If the likelihood function is sharp and unimodal (large samples), this is a minor concern.
Bayesian analysis with case study
Ødegård et al. (2003) developed a Bayesian approach for analysis of a 2-component mixture model for SCS with heterogeneous residual variances, and applied it to simulated data. Their model considered heterogeneity of variances for residual effects only; it was subsequently extended by Ødegård et al. (2005) to derive a criterion suitable for selection against putative mastitis. If SCS is a trait that differs genetically between infected and uninfected cattle, allowing for heterogeneity of genetic and permanent environmental (PE) variances may be appropriate. Boettcher et al. (2007) allowed for heterogeneous variances of genetic and PE effects, and applied the mixture models to data on SCS collected in US Holsteins. Several models of increasing complexity were compared for fit, in an attempt to assess which model was most appropriate for use in genetic evaluation of SCS. A brief account of this study is given here.
Data
Test-day records from primiparous Holstein cattle in 105 large (>200 cows) herds, primarily in Wisconsin, were used. The somatic cell count records had been converted to linear somatic cell scores (SCS) using a standard log2 transformation. Because herds were well-managed, the mean SCS of around 2.20 was less than the US national average of approximately 3.00. The dataset analyzed included 177,846 records from 31,040 cows, daughters of 3,082 different sires. An additive relationship file was created by tracing pedigrees at least 3 generations, including ancestors that were related to at least 2 animals with records. The pedigree file included 54,143 animals.
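The standard log2 transformation referred to here is commonly SCS = log2(SCC/100,000) + 3, with SCC in cells/mL; the sketch below illustrates that widely used formula (the specific dataset's transformation is assumed, not stated, to be this one):

```python
import math

def scs(scc_per_ml):
    """Linear somatic cell score from a somatic cell count (cells/mL),
    using the common transformation SCS = log2(SCC / 100,000) + 3."""
    return math.log2(scc_per_ml / 100_000.0) + 3.0
```

A count of 100,000 cells/mL maps to a score of 3, and each doubling of the count adds one unit of SCS, which is what makes the scale roughly Gaussian.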
Models
Five different models were fitted (Figure 2). Model 1 was a standard testday repeatability model. Fixed effects of systematic nongenetic factors and random additive genetic and PE effects were fitted. The other 4 specifications were 2component Gaussian mixture models differing according to the type of heterogeneity of variances considered. All 3 variances (additive, PE, and residual) were homogeneous for Model 2, whereas all variances were heterogeneous for Model 5. Analyses were based on previous work of Ødegård et al. (2003), with some extensions to accommodate Models 4 and 5.
For the mixture models, observations of SCS were assigned to 1 of the 2 components, assumed to be indicative of health status. Assignments were defined by an (unknown) vector z, where z_{i} = 0 for a record i from a healthy cow and z_{i} = 1 for records from infected cows. Following the notation used by Ødegård et al. (2003), the equations for the various models can be written, given z, as shown in Figure 3. The fixed effects in β_{0} included 3 regression coefficients for effects of days in milk on SCS, 17 effects of age at calving, and 3,361 herd-test-day effects. Regression coefficients for days in milk were based on the curve by Wilmink (1987). Age-at-calving effects were one for each age from 20 through 36 months. The β_{1} vector included a single element, the mean difference (shift) between components 1 and 2. For the non-mixture model (Model 1), all elements of M_{z} were zero. For models with homogeneous genetic and PE variances (i.e., Models 1, 2, and 3), a_{0} = a_{1} and p_{0} = p_{1}. For these models, a_{0} ~ N(0, Aσ^{2}_{a}), where A is the numerator relationship matrix and σ^{2}_{a} is the additive genetic variance, and p_{0} ~ N(0, Iσ^{2}_{p}), where I is an identity matrix of order 31,040 and σ^{2}_{p} is the permanent environmental variance. When genetic and PE effects were heterogeneous, expectations of a_{0}, a_{1}, p_{0} and p_{1} were all zero. The covariance structure of genetic and PE effects was as in Figures 4 and 5, respectively. There, G is the variance-covariance matrix between additive genetic values under the healthy and diseased statuses. Further, P is the variance-covariance matrix between corresponding PE effects. Conditionally on the breeding values and PE effects, the variance matrix of the observation vector (residual variance matrix) was expressed as in Figure 6, where I is an identity matrix of order n, and σ^{2}_{e_{1}} and σ^{2}_{e_{2}} are residual variances for observations from the first and second components, respectively.
For models with homogeneous residual variance (i.e., Models 1, 2, and 4), the residual variance matrix in Figure 6 simplifies to R = Iσ^{2}_{e}.
Bayesian Analysis
Briefly, a Gibbs sampler was run in which all unknown parameters and the indicator vector z were drawn from their conditional posterior distributions. Five sampling chains of 205,000 cycles each were generated for each model. For each chain, the first 5,000 cycles were discarded as a burn-in period, so that a total of 1,000,000 posterior samples were available for each model. Convergence was assessed by the approach of Gelman et al. (2004). Posterior distributions of (co)variances were assessed based on sampling every 20th cycle. Posterior means for breeding values were obtained by averaging realizations from every 500th cycle.
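The sample bookkeeping described above (5 chains of 205,000 cycles, a 5,000-cycle burn-in, and thinning of every 20th cycle for variances and every 500th for breeding values) can be verified with a couple of lines:

```python
def retained_samples(chains, cycles, burn_in, thin):
    """Posterior samples kept after discarding burn-in and thinning each chain."""
    kept_per_chain = (cycles - burn_in) // thin
    return chains * kept_per_chain

total_samples = retained_samples(5, 205_000, 5_000, 1)      # all post-burn-in cycles
variance_draws = retained_samples(5, 205_000, 5_000, 20)    # every 20th cycle
ebv_draws = retained_samples(5, 205_000, 5_000, 500)        # every 500th cycle
```

This confirms the 1,000,000 total posterior samples quoted in the text, with 50,000 draws used for (co)variances and 2,000 for breeding values.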
where y = vector of n observations for test-day SCS; β_{0} = vector of fixed effects common to all records; β_{1} = vector of fixed effects corresponding to observations from infected cows; I = identity matrix of order n; M_{z} = matrix with diagonal elements corresponding to vector z; a_{0} = vector of random additive genetic effects on SCS in the healthy state; a_{1} = vector of random additive genetic effects on SCS in the infected state; p_{0} = vector of random PE effects in the healthy state; p_{1} = vector of random PE effects in the infected state; e = vector of residual effects; and X_{0}, X_{1}, Z_{a} and Z_{p} = incidence matrices corresponding to fixed (X.) and random (Z.) effects, respectively.
Comparison of models
The models were compared based on the Deviance Information Criterion (DIC) proposed by Spiegelhalter et al. (2002). The model with the lowest DIC is considered to be the most appropriate statistically.
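For reference, DIC combines the posterior mean deviance with a penalty p_{D}, the effective number of parameters; a minimal sketch with hypothetical deviance values follows:

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Deviance Information Criterion (Spiegelhalter et al., 2002):
    DIC = mean posterior deviance + p_D,
    where p_D = mean deviance - deviance at the posterior mean."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d
```

Lower DIC indicates better fit after penalizing complexity, which is the basis of the model comparison reported below.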
Estimated breeding values (EBV) resulting from the different models were evaluated for similarity. The posterior means of additive genetic effects (calculated by sampling every 500th cycle) were used as EBV. Pearson correlation coefficients were calculated between all pairs of the 7 sets of animal solutions: 1 set of EBV from each of the 3 models (1, 2, and 3) with homogeneous genetic variance and 2 sets each from the 2 models (4 and 5) with heterogeneous genetic variance. Correlation coefficients were calculated for 2 sets of animals: 1) all animals, N = 54,143, and 2) only sires with at least 10 offspring (N = 541). To examine changes in rank, all sires with at least 10 offspring were sorted in ascending order based on each of the 7 sets of EBV. Then, the top and bottom 50 sires were identified for each set. Finally, the number of animals in common between each pair (high ranking with high ranking and low vs. low) of these sets was observed. Low numbers of mismatches were assumed to indicate high similarity among evaluation models.
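The rank-overlap comparison can be sketched as follows. The sire ids and EBV below are hypothetical; for SCS a low EBV is favorable, so the "top" group is taken as the lowest scores:

```python
def top_bottom_overlap(ebv_a, ebv_b, k=50):
    """Sires in common among the top-k (lowest EBV, favorable for SCS)
    and bottom-k of two sets of sire solutions keyed by sire id."""
    order_a = sorted(ebv_a, key=ebv_a.get)
    order_b = sorted(ebv_b, key=ebv_b.get)
    top = len(set(order_a[:k]) & set(order_b[:k]))
    bottom = len(set(order_a[-k:]) & set(order_b[-k:]))
    return top, bottom

# Hypothetical check: identical solutions agree perfectly; reversed ones do not.
same = {i: float(i) for i in range(200)}
reverse = {i: -float(i) for i in range(200)}
```

A count of animals in common close to k signals little reranking between two evaluation models, mirroring the "low numbers of mismatches" criterion used in the study.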
Results
Most models produced similar estimates of the mixing proportion (p_{2}), with around 5% of the observations in the second component (presumably associated with mastitis), and 95% in the healthy group. Results from Model 4, however, were strikingly different. First, the proportion of records assigned to the second (high) component was much greater, at about 8%, versus around 5% for the other 3 mixture models. All mixture models (2 to 5) had much lower residual variance than did the standard linear model (Model 1). Residual variance was generally around 1.00 for the mixture models (with the exception of a residual variance of 1.20 for the second component of Model 3), versus 1.60 for the linear Gaussian model. This difference is due to the variability in means between the 2 components in the mixture models, which is unaccounted for in the linear model specification.
When heterogeneous variance was allowed, the residual variance estimate was somewhat larger for the diseased component of the mixture. No obvious trend was observed for genetic variance when comparing the standard model with the 4 mixture models. In Model 4, the estimates of genetic and PE variances for the second component were much larger than the variances obtained by either of the other 3 mixture models. The genetic variance of the second component was 2.41 in Model 4, versus 0.52 for Model 5; corresponding PE variances were 3.22 and 0.76, respectively.
According to the DIC, Model 4 was favored by far; recall that the model with the lowest DIC is preferred. The DIC of Model 1 was twice as large as that of any of the mixture models. The correlations among EBV from the different pairs of models were all about 0.90. Despite these high correlations, the degree of sire reranking among models indicated that use of a mixture model instead of the linear model would lead to real changes in sire selection. For all mixture models (Models 2 through 5), the top 50 sires (low SCS) differed by at least 10 sires (>20%) from the top 50 identified by the linear model (Model 1). Eleven sires were in common among the top 50, and 13 were in common among the bottom 50.
Conclusion
Based strictly on statistical considerations, mixture models are more appropriate than standard linear models for the analysis of, e.g., SCS data of dairy cattle. In the case studied, a reshuffling of the highest-ranked sires was observed, demonstrating that practical differences would be realized if a mixture model were adopted for genetic evaluation.
Although the statistical evidence supporting the use of mixture models is strong, questions remain about the biological ramifications of applying a mixture model, and about the precise meaning of the different EBV resulting from a mixture model with heterogeneous genetic effects. Another issue is how a genetic evaluation for SCS can be translated into a selection criterion, as discussed in Ødegård et al. (2005).
A challenge for scientists confronted with massive data sets, such as those in animal breeding and in gene expression analysis, is making the computations needed to implement mixed effects mixture models (MMM) feasible. Gianola et al. (2004) described a maximum likelihood analysis of a Gaussian mixture with random effects using the EM algorithm with a Monte Carlo E-step. Ødegård et al. (2003) presented a Gibbs sampling scheme for a Bayesian hierarchical two-component mixture model (with thousands of random effects), and retrieved accurate estimates of parameters in simulation studies. While Markov chain Monte Carlo may be the only way of carrying out a fully Bayesian analysis, diagnosing convergence to the equilibrium distribution is a serious problem for models with thousands of unobservable random effects. Similarly, non-Bayesian analysis may be carried out more efficiently with algorithms based on second derivatives than with EM; in the latter, augmenting the likelihood with indicator variables (so that the missing-data fraction becomes very large) can slow convergence painfully.
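To illustrate the EM idea on the simplest possible case, the sketch below fits a two-component univariate Gaussian mixture with a common variance (fixed and random genetic effects are omitted for brevity; the indicator variables mentioned above appear as the posterior membership probabilities `w`, and the simulated data are hypothetical):

```python
import numpy as np

def em_mixture(y, n_iter=200):
    """EM for a two-component Gaussian mixture with common variance s2.

    p is the mixing proportion of the second (higher-mean) component.
    """
    p, mu1, mu2 = 0.5, np.quantile(y, 0.25), np.quantile(y, 0.75)
    s2 = np.var(y)
    for _ in range(n_iter):
        # E-step: posterior probability each record belongs to component 2
        # (normalizing constants cancel because the variance is common)
        d1 = (1 - p) * np.exp(-0.5 * (y - mu1) ** 2 / s2)
        d2 = p * np.exp(-0.5 * (y - mu2) ** 2 / s2)
        w = d2 / (d1 + d2)
        # M-step: update mixing proportion, component means, common variance
        p = w.mean()
        mu1 = np.sum((1 - w) * y) / np.sum(1 - w)
        mu2 = np.sum(w * y) / np.sum(w)
        s2 = np.sum((1 - w) * (y - mu1) ** 2 + w * (y - mu2) ** 2) / len(y)
    return p, mu1, mu2, s2

# Simulated data: 95% "healthy" records around 0, 5% elevated around 3
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(3.0, 1.0, 50)])
print(em_mixture(y))
```

With thousands of random effects added, each M-step becomes a mixed-model equation solve, which is exactly where the computational burden discussed above arises.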
Standard models for quantitative traits can lead to erroneous results if fitted to heterogeneous data. If a mixture is suspected, two of the most suitable methods for inferring unknown mixture parameters are maximum likelihood and Bayesian analysis. Procedures for likelihood-based or posterior-based inference applied to mixtures are discussed extensively in Titterington et al. (1985) and McLachlan and Peel (2000), including situations in which the component distributions are non-normal, e.g., skewed survival processes.
Implementations suitable for fitting different types of quantitative genetic mixture models have been described and applied by Ødegård et al. (2003, 2005), Gianola et al. (2004), and Boettcher et al. (2005, 2007). Prediction of breeding values is discussed in Gianola (2005). Convenient software for the analysis of mixtures with random effects is available in a forthcoming update of Version 6.0 of the DMU package described in Madsen & Jensen (2002).
Literature cited
Correspondence should be sent to: gianola@ansci.wisc.edu
Running head: Mixture models in animal breeding
BOETTCHER, P.J.; MORONI, P.; PISONI, G. et al. Application of a finite mixture model to somatic cell scores of Italian goats. Journal of Dairy Science, v.88, p.2209-2216, 2005.
BOETTCHER, P.J.; CARAVIELLO, D.; GIANOLA, D. Genetic analysis of somatic cell scores in US Holsteins with a Bayesian mixture model. Journal of Dairy Science, v.90, p.435-443, 2007.
BULMER, M.G. The mathematical theory of quantitative genetics. Clarendon Press, Oxford, 1980.
DETILLEUX, J.; LEROY, P.L. Application of a mixed normal mixture model to the estimation of mastitis-related parameters. Journal of Dairy Science, v.83, p.2341-2349, 2000.
FERNANDO, R.L.; GIANOLA, D. Optimal properties of the conditional mean as a selection criterion. Theoretical and Applied Genetics, v.72, p.822-825, 1986.
GELMAN, A.; CARLIN, J.B.; STERN, H.S. et al. Bayesian data analysis. 2.ed. Chapman and Hall, Boca Raton, 2004. 539p.
GIANOLA, D.; FERNANDO, R.L. Bayesian methods in animal breeding theory. Journal of Animal Science, v.63, p.217-244, 1986.
GIANOLA, D. Inferences about breeding values. In: BALDING, D.J.; BISHOP, M.; CANNINGS, C. (Eds.) Handbook of statistical genetics. Wiley, Chichester, 2001. p.645-672.
GIANOLA, D. Prediction of random effects in finite mixture models with Gaussian components. Journal of Animal Breeding and Genetics, v.122, p.145-160, 2005.
GIANOLA, D.; ØDEGÅRD, J.; HERINGSTAD, B. et al. Mixture model for inferring susceptibility to mastitis in dairy cattle: a procedure for likelihood-based inference. Genetics, Selection, Evolution, v.36, p.3-27, 2004.
GIANOLA, D.; HERINGSTAD, B.; ØDEGÅRD, J. On the quantitative genetics of mixture characters. Genetics, v.173, p.2247-2255, 2006.
HALEY, C.S.; KNOTT, S.A. A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity, v.69, p.315-324, 1992.
HENDERSON, C.R. Sire evaluation and genetic trends. In: Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. Jay L. Lush. American Society of Animal Science and American Dairy Science Association, Champaign, IL, 1973. p.10-41.
HERINGSTAD, B.; KLEMETSDAL, G.; RUANE, J. Selection for mastitis resistance in dairy cattle: a review with focus on the situation in the Nordic countries. Livestock Production Science, v.64, p.95-106, 2000.
KIMURA, M.; CROW, J.F. Effect of overall phenotypic selection on genetic change at individual loci. Proceedings of the National Academy of Sciences, v.75, p.6168-6171, 1978.
LATTER, B.D.H. The response to artificial selection due to autosomal genes of large effect. Australian Journal of Biological Sciences, v.18, p.585-598, 1965.
LYNCH, M.; WALSH, B. Genetics and analysis of quantitative traits. Sinauer, Sunderland, 1998.
MADSEN, P.; JENSEN, J. A user's guide to DMU. A package for analysing mixed models. Version 6, Release 4.3. Danish Institute of Agricultural Sciences, 2002. 19p.
McLACHLAN, G.; PEEL, D. Finite mixture models. Wiley, New York, 2000.
ØDEGÅRD, J.; JENSEN, J.; MADSEN, P. et al. Mixture models for detection of mastitis in dairy cattle using test-day somatic cell scores: a Bayesian approach via Gibbs sampling. Journal of Dairy Science, v.86, p.3694-3703, 2003.
ØDEGÅRD, J.; MADSEN, P.; GIANOLA, D. et al. Threshold-normal mixture model for analysis of a continuous mastitis-related trait. Journal of Dairy Science, v.88, p.2652-2659, 2005.
PEARSON, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A, v.185, p.71-110, 1894.
QUAAS, R.L. Additive genetic model with groups and relationships. Journal of Dairy Science, v.71, p.1338-1345, 1988.
SEARLE, S.R.; CASELLA, G.; McCULLOCH, C.E. Variance components. Wiley, New York, 1992.
SORENSEN, D.; GIANOLA, D. Likelihood, Bayesian and MCMC methods in quantitative genetics. Springer, New York, 2002.
SPIEGELHALTER, D.J.; BEST, N.G.; CARLIN, B.P. et al. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, v.64, p.583-639, 2002.
TITTERINGTON, D.M.; SMITH, A.F.M.; MAKOV, U.E. Statistical analysis of finite mixture distributions. Wiley, Chichester, 1985.
Publication Dates

Publication in this collection
05 Aug 2008 
Date of issue
July 2007