Cubic-spline interpolation to estimate effects of inbreeding on milk yield in first lactation Holstein cows

Milk yield records (305d, 2X, actual milk yield) of 123,639 registered first lactation Holstein cows were used to compare linear regression (y = β0 + β1X + e), quadratic regression, (y = β0 + β1X + β2X2 + e) cubic regression (y = β0 + β1X + β2X2 + β3X3 +e) and fixed factor models, with cubic-spline interpolation models, for estimating the effects of inbreeding on milk yield. Ten animal models, all with herd-year-season of calving as fixed effect, were compared using the Akaike corrected-Information Criterion (AICc). The cubic-spline interpolation model with seven knots had the lowest AICc, whereas for all those labeled as “traditional”, AICc was higher than the best model. Results from fitting inbreeding using a cubic-spline with seven knots were compared to results from fitting inbreeding as a linear covariate or as a fixed factor with seven levels. Estimates of inbreeding effects were not significantly different between the cubic-spline model and the fixed factor model, but were significantly different from the linear regression model. Milk yield decreased significantly at inbreeding levels greater than 9%. Variance component estimates were similar for the three models. Ranking of the top 100 sires with daughter records remained unaffected by the model used.


Introduction
Commercial adoption of artificial insemination by the dairy industry has facilitated the intensification of selection for traits such as milk, fat and protein yields. Selection is largely based on Predicted Transmitting Ability (PTA), resulting from best linear unbiased prediction (BLUP) and selection indices. Both the moderate heritability of production traits and the use of BLUP have lead to the predominant use of specific family-lines of bulls (Weigel, 2001), especially in the Holstein population, which in turn has resulted in mating of related animals. A consequence of such matings is an increase in average inbreeding. The development of multiple ovulation and embryo transfer technology have also contributed to increasing levels of inbreeding by intensifying selection for high producing cows used as dams of AI bulls (Kearney et al., 2004). The average inbreeding coefficient in the Holstein population in the United States, as reported by the Animal Improvement Program Laboratory for animals born between January and June of 2009 (using 1960 as base year), was 5.50% (ARS-AIPL, 2009). Several studies have shown the detrimental effects of inbreeding on several production, reproduction and health traits, even at low levels. For example, Hudson and Van Vleck (1984a,b) studied ef-fects of inbreeding on first lactation milk and fat production in five dairy breeds, using a linear regression approach. They reported decreased milk yield in inbred animals in all the five breeds, with an estimated loss of 21 kg of milk per 1% increase in the inbreeding coefficient in Holsteins. They also noted that the effects of inbreeding were non-linear, when inbreeding was modelled as a classification variable. With similar statistical models, Miglior et al. (1992) reported a significant reduction in total milk yield of 10 kg for each 1% increase in inbreeding in Canadian Jersey cattle. The Canadian group also reported the non-linear effects of inbreeding. Thompson et al. (2000) found a decrease in milk yield of 35 kg per 1% increase in inbreeding between 0 and 7%, compared to a 55 kg per 1% increase for higher levels of inbreeding in Holstein cows, this again indicating a non-linear effect on milk yield.
The most common method used to estimate effects of inbreeding is linear regression of production on inbreeding coefficient (e.g., Falconer and MacKay, 1996). The regression coefficient is the expected change in the trait of interest per 1% increase in inbreeding and is a measure of inbreeding effects. Few studies have used non-linear regression of production on inbreeding coefficient. McParland et al. (2007), when comparing models with higher order polynomials and a classification model, reported significant quadratic effects when considering inbreeding as either a continuous variable or a fixed factor. Nevertheless, they urged caution in interpreting nonlinear results, due to large-standard errors in estimates at higher levels of inbreeding. Croquet et al. (2007) compared the fit of linear (y = b 0 + b 1 X + e), quadratic (y = b 0 + b 1 X + b 2 X 2 + e) and cubic (y = b 0 + b 1 X + b 2 X 2 + b 3 X 3 + e) regression models for estimating the effects of inbreeding based on milk yield. The coefficients in all the three methods were significantly different from zero. The largest t-value was for the simple linear regression coefficient. They proposed using the linear regression model for estimating the effects of inbreeding, especially for animals less than 10% inbred. Gulisija et al. (2007) used a non-parametric approach to estimate inbreeding effects on production in Jerseys. They reported that, on including inbreeding as a linear covariate, the fit of the model at low levels of inbreeding improved (< 7%). The lack of fit was detected in the linear regression model at higher levels of inbreeding (> 10%), for which a third-order regression model seemed to adequately fit the data. Another method which could be used is cubic-spline interpolation to estimate the non-linear effects of inbreeding on milk yield. The rationale is that this method, besides providing better coverage of data points (Harrell Jr, 2001), may result in more accurate estimates of the effects of inbreeding on milk yield.
The objective of this study was to compare cubicspline interpolation to estimate effects of inbreeding on milk yield with traditional linear regression and fixed factor models.

Data and edits
Records of actual milk yield adjusted to 305-d in lactation (2X) for first-calf Holstein heifers, freshening between 2002 and 2006, were used. The records were provided by the Dairy Herd Information Association (DHIA-North Carolina). Only records of registered Holstein heifers were included in the analysis. Animals with less than five recorded test-day milk yields were deleted from the data. Records of annual milk yield included in the data set, consisted of milk records within four times the standard deviation from the average (approximately 9,100 kg). As standard deviation was approximately 1,660 kg, rounding up the confidence range to the nearest 100 kg resulted in some yields that were less than 2,400 kg, and others more than 15,800 kg. Through being considered as outliers, these were omitted from the data set. Contemporary groups were herd-year-seasons (HYS) of freshening, with season 1 defined as October 1 through March 31, and season 2 defined as April 1 through September 30. Records of HYS with less than 10 freshening heifers were deleted. Pedigree information was provided by the Holstein Association, USA. The pedigree file used was extended backwards, so that at least one of the paternal and maternal grandsires or granddams was known for all the heifers included in the analysis. Re-cords of animals that did not fit the minimum requirements were omitted. Nonetheless, these animals were included in the pedigree file, which finally consisted of 541,249 animals. Individual inbreeding coefficients (%F) were provided by AIPL, as calculated by Wiggans et al. (1995). After edits, the analysis included records of 123,639 heifers in 5,839 HYSs, with individual inbreeding coefficients ranging from 0 to 40%. Few animals (approximately 0.1%) had levels of inbreeding greater than 18.75%. The distribution of inbreeding coefficients was similar to that found in previous studies (e.g., Hudson and Van Vleck, 1984a,b;Miglior et al., 1992). Figure 1 presents average unadjusted milk yields with values within one standard deviation) by inbreeding class. Fluctuation of averages for inbreeding coefficients greater than 12% may be due to the small number of animals in those categories. In fact, milk yield records of animals with inbreeding greater than 12% constituted less than 1% of the available data, with several of these inbreeding levels (8 out of 19) having less than 10 observations each.

Statistical models
All the models included a random animal genetic effect with HYS treated as a fixed factor. The ten models compared were grouped as follows: Fixed factor models The first model was a saturated fixed-factor model, where the inbreeding coefficients (F) were included as levels of a fixed factor. Inbreeding coefficients rounded to the nearest integer were considered as unique levels. Thus, 32 different inbreeding levels were formed (0% to 40%, with levels 22, 24 and 33 through 39 missing).
The second fixed factor model consisted of grouping F into seven classes as shown in Table 1. This classification has been used in several studies (e.g., Hudson and Van Vleck, 1984a,b;Miglior et al., 1992;McParland et al., 2007).The general form of the fixed factor models is: with y ijkl , milk yield of Animal k in HYS j with an inbreeding coefficient F falling within inbreeding level i; m, a con- 444 Geha et al. stant; F I, the effect of the i th level of inbreeding for i = 1, ..., 32 with the saturated fixed factor model, and i = 1, ..., 7 with the fixed factor model; HYS j , the effect of the j th HYS contemporary group treated as a fixed factor; Animal k , a random additive genetic value of the k th animal with s a 2 = 0.5s e 2 , and e ijkl , a random error effect normally and independently distributed with a mean of zero and a variance, s e 2 .

Regression models
Inbreeding coefficients were included as covariates with linear, linear and quadratic or linear, quadratic and cubic effects, resulting in three models. The linear regression model was: with y ijk , milk yield of Animal j in HYS i with an inbreeding coefficient F; b 0 , intercept; b 1 , regression coefficient for the linear effect of inbreeding; F, inbreeding coefficient of animal j; HYS, Animal and e effects as defined earlier.
The linear and quadratic regression model was: with y ijk , milk yield of Animal j in HYS i with an inbreeding coefficient F; b 0 , intercept; b 1 , coefficient of the linear effect of inbreeding; F, inbreeding coefficient of animal j; b 2 , coefficient for the quadratic effect of inbreeding; and HYS, Animal and e effects as defined earlier. The cubic regression model was: with y ijk , milk yield of Animal j in HYS i with an inbreeding coefficient F; b 0 , intercept; b 1 , coefficient of the linear effect of inbreeding; F, inbreeding coefficient of animal k; b 2 , coefficient for the quadratic effect of inbreeding; b 3 , coefficient for the cubic effect of inbreeding; and HYS, Animal and e effects as defined earlier.

Cubic-spline models
The spline models had three to seven knots (t 3 … t 7 ) for F. Choice of knots was based on consideration of the different possible inbreeding coefficients that could be attained from mating of directly related animals such as full-sibs, half-sibs,parent-progeny and grandparent-grandprogeny matings. Some knots were also chosen to divide the inbreeding levels into equally spaced groups. For the cubic-spline model with seven knots, the knots were chosen to match the seven levels of the fixed factor model. The knot positions are presented in Table 2. This approach resulted in five different models, the most complete being: with y ijk , milk yield of Animal j in HYS i with an inbreeding coefficient F; b 0 , intercept; b 1 , coefficient for the linear effect of inbreeding; F, inbreeding coefficient of animal j; b 1 , ..., b 6 , the Z-spline coefficients of the cubic-spline interpolation for F; F 2 , ..., F 6 , the knot functions; and HYS, Animal and e effects as previously described.
Depending on the number of knots in the model, some terms were removed.
Other models could have also been used; for example, higher order linear polynomials, such as quartic and quintic. Inclusion of other fixed and /or random effects might improve the fit of the model. The focus of the project was to compare cubic-spline interpolation with traditionally used models. Thus, the models used were kept as similar as possible to the traditional models. Furthermore, as discussed later, comparisons required fixed genetic variance. Animal genetic variance was chosen to match heritability estimates from other studies (e.g., Swalve and VanVleck, 1987).

Model comparisons
Models were compared using Akaike's Information Criterion (AIC; Akaike, 1974), one of the several model selection methods based on the principle of parsimony. Other methods based on the same principle could have been used but the AIC is one of the easiest methods to compute and does not require extensive computation. Model selection could have been also conducted in the context of null hypothesis testing, however this approach has relatively poor performance given its dependence on the significance level specified to test if effects should be included or omitted Inbreeding effects on milk yield 445 from a model (a = 0.05 or 0.01 or 0.15). The null hypothesis testing approach also has a relatively poor performance when testing non-nested models (Burnham and Anderson, 2002). The AIC is an approximate measure of the Kullback-Leibler information of a model based on the estimate of log-likelihood and number of parameters estimated: AIC = 2p -2l, where p is the number of parameters estimated, and l is the log-likelihood of the model used.
When the ratio of the number of observations, n, to the number of parameters estimated, p, is less than 40, it is recommended that corrected AIC be used: AICc = AIC + [2p(p + 1)/(n -p -1)] (Burnham and Anderson, 2002). As the number of observations becomes large, AICc converges to AIC. The idea behind AIC is that the difference between two competing models, A and B, can be detected by differences in estimates of residual error variances from the two models. As, by increasing the number of parameters in a model, the goodness of fit improves, i.e., there is a reduction in the residual sum of squares and an increase in log-likelihood in nested models, a growing penalty function, equal to twice the number of parameters included, is deducted to discourage "overfitting". One disadvantage of overfitting is that, given enough parameters, an irrational model may fit the data perfectly, even if it does include nonsensical parameters. Models were ranked according to their AICc because the ratio of the number of observations to the number of parameters estimated, was always less than 40. The model with the smallest AICc was considered the best. The significance of differences between models was based on their AICc values, as described by Burnham and Anderson (2002) for comparison of models. In general, a small difference in AICc between two competing models (less than 2) indicates that neither of the two is adequate. A moderate difference (4 to 7) indicates that the model with the higher AICc does not fit the data as well as the one with the lower AICc rank. A large difference (greater than 10) indicates that the model with the highest AICc is inadequate compared to that with the lowest. The log-likelihood used in the AIC, and subsequently the AICc, is based on maximizing likelihood (ML). The software used in the analysis however maximized the restricted maximum likelihood (REML). Comparison of the different models therefore required manipulation of the likelihood estimates from REML to be converted into ML estimates. Therefore, the animal genetic variance was fixed for all models to facilitate estimating random residual variance. Heritability of milk yield at first lactation was assumed to be 0.33 (Swalve and Van Vleck, 1987;Miglior et al., 1992 =( / ) . Milk yield was assumed to be approximately normally distributed, y~N(Xb, Vs e 2 ), so that the likelihood function given y is: where y is the n x 1 vector of observations; n is total number of observations; V is a matrix of constants, since the animal genetic variance was fixed at one-half residual-error variance; s e 2 is the residual-error variance; and s e 2 is the product of the design matrix, X, and the vector of fixed effects, b. Thus, the log-likelihood function is: The constant is the same for all models because n, 2p and V are the same for all models.

Comparison of models for inbreeding effects
The analyses were re-run without restrictions on the animal genetic variance to obtain, not only estimates of variance components and predicted animal breeding values for traditional linear regression and fixed factor models, but also the best cubic-spline model for comparing estimates of the effects of inbreeding on milk yield and of variance components and heritability. For each model, the predicted breeding values (EBV) for sires of daughters with records were compared to detect changes in sire-ranking due to the model used. The pedigree file contained 9,618 sires of which 1,409 had daughters with records. The proc reg procedure in SAS © 9.2 was used to regress EBVs of these sires from the cubic-spline model, on the corresponding EBV from the fixed factor and linear regression models. The calculated correlations among EBV of sires of daughters with records from all three methods were examined using the proc corr procedure in SAS © 9.2. Ranking of the top 100 sires for the three models were also compared. Finally, the estimated milk yields by inbreeding level using the best model were regressed on estimated milk yields by inbreeding level from the linear regression and fixed factor models using the proc reg procedure in SAS © 9.2.

Results and Discussion
Estimates of residual-error variance, log-likelihood, and AICc for each model, as well as differences in AICc from the model with the lowest AICc (assumed best), are given in Table 3. Based on AICc, the best model was cubic-spline interpolation of inbreeding with seven knots. The model with the highest AICc was the saturated fixed factor model (a difference of +50.447 from the best model). The fixed factor model and simple linear regression model, i.e. the traditional methods of analysis, ranked second and third to last, respectively. The cubic-spline model with 4 knots had 0.48 higher AICc than the model with seven knots indicating that the two models are nearly equivalent. The cubic-spline models with five and six knots as well as the cubic regression model had AICc differences from the best model ranging between 2 and 6 reflecting that these models do not fit the data as well as the best model. The remaining models (i.e. linear regression, linear and quadratic regression, cubic-spline with three knots, fixed factor and saturated models) all had differences of AICc larger than 10 units compared with the best model indicating a poor fit to the data compared to the cubic-spline with seven knots.
The counter-intuitive result of the cubic-spline models with five and six knots, which performed worse than that with four, is explained by differences in log-likelihoods and the number of parameters between the models. Compared to the model with four knots, one additional parameter was estimated in the model with five knots, and two in that with 6. Even though this difference was small, it lead to AICc ranking differing from what was to be expected.
Inbreeding effects on milk yield 447 Estimates of the effects of inbreeding on milk yield for a cubic-spline model with seven knots are represented in Figure 2. Table 4 contains t-values to test for significance of reduction in 305d milk yield between each inbreeding level and no inbreeding. Significant differences at the 0.05 rejection level were at |t-value| > 1.96. These results showed no significant decrease in milk production up to an inbreeding level of 8%. Milk yields significantly decreased at inbreeding levels of 9% to 21%. No significant differences in milk yields were found for inbreeding levels of 23% to 27%. Significant differences in milk yields were detected for inbreeding levels of 28% and beyond. The linear regression model indicated a decrease of 21.49 kg of milk per 1% increase in inbreeding, in agreement with estimates in the literature for Holstein cattle, which ranged from 9.84 to 26.00 kg (Thompson et al., 2000). Estimates of losses due to inbreeding using the fixed factor model are represented in Table 5. These estimates generally agree with previous studies in which five inbreeding classes were used (Hudson and Van Vleck, 1984a,b;Miglior et al., 1992;Thompson et al., 2000).
Estimates of variance components and heritability for the cubic-spline model with seven knots, for the linear regression model and for the fixed factor model with seven levels are shown in Table 6. The three models resulted in similar estimates of variance components and heritability 448 Geha et al.   of approximately 0.31 (SD = 0.01) which agrees with estimates of heritability reported in the literature of 0.33 (e.g., Swalve and Van Vleck, 1987).
Regression of EBV from the cubic-spline model on EBV from the linear regression model for the 1,409 sires having daughters with records, resulted in R 2 of 0.99, a correlation of 0.99 and a regression coefficient of 0.97. Regression of EBV from the cubic-spline model on EBV from the fixed factor model had a R 2 of 0.99, a correlation of 0.99 and a regression coefficient of 0.99. The regression coefficient from regression of EBV from the cubic-spline model on the EBV from the linear regression model, was significantly different from 1.00 (p < 0.0001). The regression coefficient of the regression of EBV from the cubic-spline model on the EBV from the fixed factor model, was not significantly different from 1.00 (p = 0.31). Correlations between EBV from the cubicspline model and EBV from the linear regression and fixed factor models were both significantly different from 1.00 (p < 0.0001), when the Fisher (rho0 = 1.00) option was specified in the CORR procedure in SAS © . Thus, estimates of breeding values could possibly be different for the three models. Nonetheless, ranking of the top 100 sires by each method revealed that only the rank of one sire changed between the linear regression model and the two other models indicating that selection for breeding value was minimally affected by the model used.
The regression of milk yield estimates by inbreeding level from the cubic-spline model on milk yield estimates by inbreeding level from the linear regression model had a R 2 of 0.99, a correlation of 0.79 and a regression coefficient of 0.97 that was significantly different from 1.00 (p = 0.02). The regression of estimates by inbreeding level from the cubic-spline model on estimates by inbreeding level from the fixed factor model had a R 2 of 0.99, a correlation of 0.88 and a regression coefficient of 0.98 that was not significantly different from 1.00 (p = 0.13).
Consequently, the linear regression model is apparently not the best for estimating the effects of inbreeding on milk yield. Despite its lower AICc ranking, the fixed factor model appears to be a simpler and more effective alternative to the more complex cubic-spline. Estimates of the effects of inbreeding on milk production were similar for the cubic-spline model and the fixed factor model. This may be due to the positions of the knots in the cubic-spline model which coincide, for the most part, with the inbreeding levels for the fixed factor model.
A substantial reduction in profit is reflected from losses derived from the detrimental effects of inbreeding. At an inbreeding level of 9%, the estimated loss in milk yield would be 95 kg, thereby reflecting a potential loss of US$ 26.18 per lactation at the average milk-price of US$ 12.5 per hundred weight. The average inbreeding in the U.S. Holstein population was 5.50% for the year 2009, and is on the increase, as reported by the Animal Improvement Program Laboratory.
In this study, knots were positioned at inbreeding levels reflecting the possible mating of directly related animals (within a family line). Therefore, given the similarity in estimates of effects of inbreeding on milk yield, the fixed factor model, despite poorer AICc, presents a simple and effective alternative to using the complex cubic-spline.
If adjustment of milk yields for inbreeding is to be based on estimates using a cubic-spline model, it would be advantageous to develop an algorithm, useful in detecting the positions of knots providing the best fit to the data. Their positioning would first need to be defined in one or several reference data sets, and then validated in others. Cubic-spline coefficients would require periodical re-validation.
Inbreeding effects on milk yield 449