Prediction of egg weight from egg quality characteristics via ridge regression and regression tree methods

This study was conducted on 2049 eggs collected from commercial white layer hybrids, with the aim of predicting egg weight (EW) from egg quality characteristics, namely shell weight (SW), albumen weight (AW), and yolk weight (YW). Ridge regression (RR), multiple linear regression (MLR), and the regression tree method (RTM) were used to predict EW. The predictive performance of RR and MLR was evaluated using the coefficient of determination (R2), and multicollinearity was assessed using the variance inflation factor (VIF). The R2 values for RR and MLR were 93.15% and 93.4%, respectively, with no multicollinearity problem, as VIF values were very low (between 1 and 2). As a visual, non-parametric technique, RTM based on the CHAID algorithm achieved a very high predictive accuracy (R2 = 99.988%) in the prediction of EW. The highest EW (71.963 g) was obtained from eggs with AW > 41 g and YW > 17 g. Given this accuracy, RTM can be recommended in practice over ridge regression and multiple linear regression, and may be a valuable tool for the quality classification of eggs in the poultry sector.


Introduction
In poultry production, animal products such as poultry meat and eggs are of cardinal economic significance; they are essential both for a country's economic development and for meeting human nutrient requirements worldwide. The egg is the basic product of poultry enterprises and an inexpensive food for consumers, and its quality depends on internal (albumen weight and yolk weight) and external (shell weight) quality traits. Aktan (2004) reported significant correlation coefficients of 0.489 and 0.796 for the weight of inner thick albumen and the weight of yolk area, respectively (P<0.001). Alkan et al. (2010) remarked that egg weight, shell weight, shell thickness, and the weights of egg yolk and albumen were important egg traits affecting egg quality, chick weight, and hatching performance under optimal management conditions. In an earlier study on the influence of line and age on albumen quality and egg weight in commercial white layer hybrids, Aktan (2011) reported a positive correlation between EW and albumen quality.
In recent years, RTM has been adopted extensively in animal science. RTM has been applied to predict body weight and milk yield in different sheep and cattle breeds (Eyduran et al., 2008; Bakir et al., 2010). Using RTM, Khan et al. (2014) estimated body weight from several body measurements in Harnai sheep. Similarly, Mohammad et al. (2012) employed RTM to predict body weight from withers height, chest girth, and body length in indigenous sheep of Pakistan. Eyduran et al. (2013) applied RTM to predict the 305-d milk yield of Brown Swiss cattle. Yilmaz et al. (2013) fitted RTM to birth records of Brown Swiss cattle to determine the effect of non-genetic factors on calf birth weight. RTM has also been used to predict the slaughter weight of Ross 308 broiler chickens (Mendeş and Akkartal, 2009). In poultry science, however, previous studies using RTM remain scarce.
RTM is easy to apply to ordinal, nominal, and continuous variables. Among data mining algorithms (CART, CHAID, and Exhaustive CHAID), the CHAID algorithm is used to construct decision trees for regression problems because it reveals non-linear and interaction effects among independent variables, and it is a favorable alternative to RR and MLR when both the dependent and the independent variables are continuous, as in this investigation. The CHAID algorithm uses merging, splitting, and stopping stages in the construction of a decision tree and converts continuous variables into ordinal variables. It forms homogeneous groups (nodes) by recursively splitting nodes so as to maximize the variability among nodes (Nisbet et al., 2009).
The use of RTM in poultry science has been limited (Mendes and Akkartal, 2009). Considering earlier studies on RTM and its advantages, RTM based on the CHAID algorithm can be a useful tool for classifying eggs according to egg quality criteria, as an alternative to traditional regression methods. However, no study has been reported on the prediction of EW using the regression tree method. Hence, the aim of the present investigation was to predict EW from AW, YW, and SW in commercial layer hybrids by means of MLR, RR, and RTM (based on the CHAID algorithm), with a view to developing egg quality standards.

Material and Methods
The data of this study were taken from Saygici (2004) to show how to apply MLR, RR, and, in particular, RTM constructed by the CHAID algorithm, one of the data mining algorithms, and how to interpret their outputs. A total of 228 commercial layer hybrids at 21 weeks of age provided the eggs used to predict EW from different egg traits in the current study. See Saygici (2004) for more detailed information on the animal material.
The variables recorded on the 2049 eggs used in the experiment were all continuous (quantitative): shell weight (SW, g), yolk weight (YW, g), albumen weight (AW, g), and egg weight (EW, g).
In the current study, EW was taken as a dependent variable (target), and YW, AW, and SW were considered independent (explanatory) variables, in order to predict the EW via MLR, RR, and RTM (CHAID algorithm) analysis methods.
In general, MLR can be written in matrix notation as $Y = X\beta + e$. In this model, $Y$ is the vector of the dependent variable; $X$ is the matrix of independent variables; $\beta$ is the vector of regression coefficients; and $e$ is the vector of residuals. Based on the least squares method, the regression coefficients are estimated as $\hat{\beta} = (X'X)^{-1}X'Y$, where $\hat{\beta}$ is the least squares estimator of $\beta$. Ridge regression is regarded as more effective than the least squares method in the event of multicollinearity. In RR analysis, a small constant is added to each diagonal element of the cross-product matrix $X'X$ of the independent variables (SW, AW, and YW).
Ridge regression is an alternative estimator with a lower mean square error. Its estimator is indexed by a parameter $0 \le k \le 1$ such that $\hat{\beta}_{R} = (X'X + kI)^{-1}X'Y$, in which $kI$ is a diagonal matrix whose diagonal elements all equal a small constant $k$.
As k increases, the bias of the estimator increases. Higher values of k reduce the effect of multicollinearity and decrease the variance of the estimates, whereas k = 0 yields the least squares estimates. RR therefore aims to find the best value of k.
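The role of k can be illustrated with a minimal sketch in plain Python. The data below are hypothetical centered values chosen only to make two predictors strongly correlated; they are not the egg data, and the helper names (`transpose`, `matmul`, `solve`, `ridge`) are introduced here for illustration only and are not part of the SPSS procedure used in the study.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ridge(X, y, k):
    """Return (X'X + kI)^(-1) X'y; k = 0 gives the least squares estimates."""
    Xt = transpose(X)
    xtx = matmul(Xt, X)
    for j in range(len(xtx)):
        xtx[j][j] += k  # add k to every diagonal element of X'X
    xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(xtx, xty)


# Hypothetical centered data with two highly correlated predictors.
X = [[-2.0, -1.9], [-1.0, -1.2], [0.0, 0.1], [1.0, 0.9], [2.0, 2.1]]
y = [-4.1, -2.0, 0.2, 1.9, 4.0]

for k in (0.0, 0.005, 0.1, 1.0):
    print(k, [round(b, 4) for b in ridge(X, y, k)])
```

In this sketch the coefficient vector shrinks as k grows, which is the trade of a small bias for lower variance described above.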
When k is chosen according to VIF, a value of k is acceptable if it gives VIF<10.
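For orientation, the VIF of predictor j is conventionally defined as 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on the remaining predictors. The short sketch below uses this standard definition (the function names are illustrative) to translate VIF values, including the bounds 1.459 and 1.984 reported in the Results, back into R_j^2.

```python
def vif_from_r2(r2_j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others."""
    return 1.0 / (1.0 - r2_j)


def r2_from_vif(vif_j):
    """Invert the VIF definition to recover R_j^2."""
    return 1.0 - 1.0 / vif_j


# 1.459 and 1.984 are the VIF bounds reported for the MLR model; 10 is the usual cut-off.
for vif in (1.459, 1.984, 10.0):
    print(f"VIF = {vif:6.3f} -> R_j^2 = {r2_from_vif(vif):.3f}")
```

Under this definition, the reported VIF values correspond to R_j^2 of roughly 0.31 to 0.50, well below the 0.90 implied by VIF = 10.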
As a tree-based model, RTM identifies the independent variables that most influence the target variable (Camdeviren et al., 2005; Mohammad et al., 2012). RTM is more advantageous than classical methods, particularly when analysts have large, complex data sets and a great number of independent variables (Camdeviren et al., 2005). Because RTM is a non-parametric method, it requires no assumptions such as normality, constant variance, linearity, or absence of multicollinearity. RTM continues the splitting process recursively in order to reduce the variation in the dependent variable (Camdeviren et al., 2007). In this study, RTM was fitted with the CHAID algorithm in the IBM SPSS 22 program, and a decision tree structure was thereby established.
There are three steps (merging, splitting, and stopping) in the CHAID algorithm, which allows multiple splits of any node for a regression problem (Nisbet et al., 2009). A decision tree is grown by repeatedly applying these three steps to each node, starting from the root node (Ali et al., 2015). The root node is partitioned recursively so that the variability of the dependent variable is minimized within nodes and the variation among nodes is maximized.
The CHAID algorithm handles only nominal or ordinal categorical independent variables. For this reason, continuous independent variables are converted into ordinal independent variables before the following algorithm is applied. For a given set of break points $a_1, a_2, \ldots, a_{K-1}$ (in ascending order), a given $x$ is mapped into category $C(x)$ as follows:
$$C(x) = \begin{cases} 1, & x \le a_1 \\ k+1, & a_k < x \le a_{k+1},\ k = 1, \ldots, K-2 \\ K, & a_{K-1} < x \end{cases}$$
When $K$ is the desired number of bins, the break points are estimated from the ranks of the $x_i$; frequency weights are incorporated in calculating the ranks. In the case of ties, the average rank is used. The ranks and the corresponding values in ascending order are denoted as $\{r_{(i)}, x_{(i)}\}_{i=1}^{n}$.
For $k = 0$ to $(K-1)$, set $I_k = \{ i : \lfloor r_{(i)} K / (N_f + 1) \rfloor = k \}$, in which $\lfloor x \rfloor$ indicates the floor integer of $x$ and $N_f$ is the sum of the frequency weights. If $I_k$ is not empty, $i_k = \max\{ i : i \in I_k \}$. The break points are set equal to the $x$ values corresponding to the $i_k$, excluding the largest (Breiman et al., 1984).
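The binning step above can be sketched as follows. This is a simplified illustration that assumes unit frequency weights (so the total frequency equals the number of observations) and uses hypothetical albumen weights; the function names are introduced here and do not reproduce the SPSS implementation.

```python
import math
from bisect import bisect_left


def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def break_points(values, n_bins):
    """Rank-based break points, assuming every observation has frequency weight 1."""
    n = len(values)
    largest_in_bin = {}
    for x, r in zip(values, average_ranks(values)):
        k = math.floor(r * n_bins / (n + 1))  # bin index in 0 .. n_bins - 1
        largest_in_bin[k] = max(x, largest_in_bin.get(k, x))
    cuts = sorted(largest_in_bin.values())
    return cuts[:-1]  # the largest value is not used as a break point


def category(x, cuts):
    """C(x): 1 if x <= a_1, k + 1 if a_k < x <= a_(k+1), K if x > a_(K-1)."""
    return bisect_left(cuts, x) + 1


# Hypothetical albumen weights (g), for illustration only.
aw = [30.1, 31.5, 33.0, 34.2, 35.8, 36.4, 37.9, 39.0, 40.6, 42.3]
cuts = break_points(aw, n_bins=5)
print(cuts)
print([category(x, cuts) for x in aw])
```

Each continuous value is thereby replaced by an ordinal category, which is the form of predictor that the merging and splitting steps operate on.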
A Bonferroni adjustment was applied in RTM based on the CHAID algorithm to obtain adjusted P values for the F values. The tree-based algorithm, which automatically prunes unnecessary nodes from the decision tree, uses the F significance test when the dependent variable is continuous. Ten-fold cross-validation was applied in the statistical evaluation.
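As an illustration of the adjustment, the sketch below uses the Bonferroni multiplier commonly given for CHAID with an ordinal predictor, namely the number of ways of merging I ordered categories into r adjacent groups, C(I - 1, r - 1). That the software applies exactly this multiplier is an assumption here, and the category counts and P value in the example are hypothetical.

```python
from math import comb


def bonferroni_multiplier_ordinal(n_original, n_merged):
    """Number of ways to merge n_original ordered categories into n_merged adjacent groups."""
    return comb(n_original - 1, n_merged - 1)


def adjusted_p(p_value, n_original, n_merged):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * bonferroni_multiplier_ordinal(n_original, n_merged))


# Hypothetical case: a predictor binned into 10 ordinal categories and merged into 8 groups.
print(bonferroni_multiplier_ordinal(10, 8))
print(adjusted_p(1e-4, 10, 8))
```

The adjustment makes a split harder to accept when many alternative groupings of the predictor were searched.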
Initially, Pearson correlations were estimated between pairs of egg traits. The predictive power of RR, MLR, and RTM was measured using the coefficient of determination (%R2), the proportion of the variability in EW explained by the model. All statistical analyses were performed with the IBM SPSS 22 program.
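The two summary measures named above can be written out in a few lines. The sketch below uses the standard definitions of the Pearson correlation and of R2 as one minus the ratio of residual to total sum of squares; the observed and predicted values are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r2_percent(observed, predicted):
    """Coefficient of determination as a percentage of explained variability."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 100.0 * (1.0 - ss_resid / ss_total)


# Hypothetical values (g), for illustration only.
aw = [33.0, 35.5, 37.2, 39.8, 42.1]
ew_obs = [55.2, 58.9, 61.0, 65.3, 70.4]
ew_pred = [55.0, 59.0, 61.5, 65.0, 70.0]

print(round(pearson_r(aw, ew_obs), 3))
print(round(r2_percent(ew_obs, ew_pred), 2))
```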

Results
In the presence of multicollinearity, RTM algorithms allow a considerably easier, visual interpretation of the data through decision trees than traditional methods such as MLR and RR. Among the traditional methods, RR is conventionally recommended over multiple linear regression analysis.
To explain the total variability in EW, the collected egg data were analyzed using MLR. The MLR results are summarized in Table 1. The ANOVA result (F value) showed that the MLR model built in the present paper was statistically significant (P<0.01). The independent variables (AW, YW, and SW) together accounted for 93.4% of the total variability in EW, the response (dependent) variable, with no multicollinearity problem, as VIF values ranged from 1.459 to 1.984. Given the positive coefficients, EW would be expected to increase as AW, YW, and SW increase.
Ridge regression was also fitted to the collected data as an alternative to MLR for explaining the total variability in EW. The results indicated that the model explained 93.15% of the variability and showed a tendency very similar to that of the MLR results above (Table 2).
A decision tree diagram was constructed via the CHAID algorithm to obtain detailed information on the independent variables significantly affecting EW (Figure 1). Albumen weight, YW, and SW were the significant independent variables in the decision tree diagram and accounted for nearly all (99.988%) of the total variation in EW. Of these independent variables, AW (F = 1885.446, df1 = 7, and df2 = 2041) was the most influential variable for EW, followed by YW and SW in order of significance. With increasing AW, average EW ranged widely from 40.195 to 70.125 g from node 1 through node 8 (Figure 1).
Node 0, the root node at the top of the RTM diagram, contained all the studied eggs. The average EW for node 0 was 58.764 (S = 7.872) g from 2049 eggs. Node 0 was divided into eight child nodes (nodes 1-8) on the basis of AW. Of these eight nodes, nodes 1, 2, 4, 6, and 7 were terminal nodes in the RTM diagram drawn via the CHAID algorithm.
The average EW of 61.629 (S = 2.213) g was obtained from node 6, a cluster of eggs with 36 < AW ≤ 38 g (n = 251 eggs).Node 7, consisting of a cluster of eggs with 38 < AW ≤ 41 g, provided the average EW of 64.983 (S = 2.045) g from 232 eggs.
An average EW of 70.125 (S = 3.001) g was predicted for node 8, a cluster of eggs with AW > 41 g (n = 184), which was further split into two child nodes (nodes 13 and 14) according to YW. This shows that YW had a statistically significant effect on the EW of eggs included in node 8 (F = 76.738, df1 = 1, and df2 = 182; Adjusted P<0.01). Node 13, comprising eggs with AW > 41 g and YW ≤ 17 g, had an average EW of 68.680 (S = 2.991) g (n = 103 eggs). Node 14, containing eggs with AW > 41 g and YW > 17 g, had the heaviest average EW of 71.963 (S = 1.757) g (n = 81 eggs), whereas the lightest EW was produced by node 1.
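The split of node 8 can be checked from the reported summary statistics alone. The sketch below recomputes the parent mean, the parent standard deviation, and the one-way ANOVA F value from the sizes, means, and standard deviations of nodes 13 and 14, assuming that S denotes the sample standard deviation with an n - 1 denominator; small differences from the published values are expected because the inputs are rounded.

```python
from math import sqrt

# node id: (n, mean EW in g, S in g), as reported for the children of node 8
nodes = {13: (103, 68.680, 2.991), 14: (81, 71.963, 1.757)}

n_total = sum(n for n, _, _ in nodes.values())
grand_mean = sum(n * m for n, m, _ in nodes.values()) / n_total

# Between-node and within-node sums of squares
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in nodes.values())
ss_within = sum((n - 1) * s ** 2 for n, _, s in nodes.values())

df1 = len(nodes) - 1
df2 = n_total - len(nodes)
f_value = (ss_between / df1) / (ss_within / df2)

# Standard deviation of the parent node recovered from the total sum of squares
parent_sd = sqrt((ss_between + ss_within) / (n_total - 1))

print(f"n = {n_total}, mean = {grand_mean:.3f} g, S = {parent_sd:.3f} g")
print(f"F = {f_value:.3f} with df1 = {df1} and df2 = {df2}")
```

The recomputed mean (about 70.125 g), standard deviation (about 3.001 g), and F value (about 76.7) agree with the figures reported for node 8 up to rounding.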
There was very close agreement between the observed and predicted values obtained by RTM based on the CHAID algorithm, with an R2 of almost 100%.

Discussion
In the current study, there were positive correlations among all independent variables (P<0.001), which confirmed the results of Ratherd et al. (2011), who reported a high correlation between internal and external egg quality traits in Japanese quails and acknowledged that multicollinearity among independent variables occurs in regression models. On the basis of their findings, they applied principal component regression to eliminate the multicollinearity problem. The benefits of ridge regression in the presence of multicollinearity have also been noted.
In many studies, the correlations between internal and external egg quality variables were reported to be strong, positive, and highly significant (P<0.01) (Alkan et al., 2010; Narinç et al., 2011; Kul and Şeker, 2004; Üçkardeş et al., 2010). Akbaş et al. (1996) reported positive correlations between pairs of hen age, egg weight, yolk width, albumen width, and albumen length. However, they recorded some negative correlations between egg weight and shell strength, shell thickness, yolk index, albumen index, yolk height, albumen height, and Haugh unit.
Regression, correlation, and MLR analyses have been employed to identify the relationship between egg weight and these quality traits in different layer strains. In the presence of multicollinearity, ridge regression can give more accurate estimates than MLR analysis. Indeed, choosing the most effective statistical method is the most important issue in the estimation of EW from egg quality traits. In comparison with the statistical methods highlighted above, RTM, which can be understood and interpreted more easily in visual form, has been reported to be unaffected by multicollinearity, outliers, and missing observations (Mendes and Akkartal, 2009; Karabağ et al., 2010). However, the present results showed that no multicollinearity problem was detected in either the MLR or the RR analysis. Because there was no multicollinearity problem in MLR, ridge regression analysis was not necessary for interpreting this type of data.

Conclusions
Egg quality is very important for poultry farming and the poultry sector. Regression tree analysis based on the CHAID algorithm, with a much higher predictive accuracy (R2 = 99.988%), is a powerful approach for detecting the relationship between egg weight and the internal (albumen and yolk weights) and external (shell weight) quality traits that indicate egg quality. The decision tree from regression tree analysis shows that the highest egg weight (71.963 g) is obtained from eggs with albumen weight >41 g and yolk weight >17 g. Consequently, regression tree analysis is expected to find wide use for layer hybrids in the poultry sector because, being a non-parametric technique, it makes no assumptions about the independent variables.

Figure 1 - Decision tree diagram for the studied model.

Table 1 - Estimated parameters and significance levels in multiple linear regression analysis
VIF - variance inflation factor.

Table 2 - Results of ridge regression analysis
Ridge regression coefficient section for k = 0.005000.