Introduction

In poultry production, animal products such as poultry meat and eggs are of cardinal economic significance: they are essential to a country's economic development and to meeting human nutrient requirements worldwide. The egg is regarded by breeders as the basic product of poultry activities and is supplied cheaply to consumers; its quality depends on internal (albumen weight and yolk weight) and external (shell weight) quality traits. ^{Aktan (2004}) reported that significant correlation coefficients between the weight of inner thick albumen and the weight of yolk area were found to be 0.489 and 0.796, respectively (P<0.001). ^{Alkan et al. (2010}) remarked that egg weight, shell weight, shell thickness, and the weights of egg yolk and albumen were significant egg traits affecting egg quality, chick weight, and hatching performance under optimal management conditions. In an earlier survey of the influence of line and age on albumen quality and egg weight in commercial white layer hybrids, Aktan (2011) reported a positive correlation between egg weight (EW) and albumen quality.

In recent years, the regression tree method (RTM) has been adopted extensively in animal science. RTM has been applied to predict body weight and milk yield in different sheep and cattle breeds (^{Eyduran et al., 2008}; ^{Bakir et al., 2010}). Applying RTM, ^{Khan et al. (2014}) estimated body weight from several body measurements in Harnai sheep. Similarly, ^{Mohammad et al. (2012}) employed RTM to predict body weight from withers height, chest girth, and body length in indigenous Pakistan sheep. Eyduran et al. (2013) applied RTM to predict the 305-d milk yield of Brown Swiss cattle. In a previous study by ^{Yilmaz et al. (2013}), RTM was fitted to birth records of Brown Swiss cattle to determine the effect of non-genetic factors on calf birth weight. RTM was also used to predict the slaughter weight of Ross 308 broiler chickens (Mendeş and Akkartal, 2009). In poultry science, however, earlier investigations using RTM remain scarce.

RTM is easy to apply to ordinal, nominal, and continuous variables. Among the data mining algorithms used in RTM (CART, CHAID, and Exhaustive CHAID), the CHAID algorithm is used to construct a decision tree for regression problems because it reveals non-linear and interaction effects among independent variables. It is therefore a favorable alternative to ridge regression (RR) and multiple linear regression (MLR) for continuous dependent and independent variables, as in the present investigation. The CHAID algorithm uses merging, splitting, and stopping stages to construct a decision tree, and converts continuous variables into ordinal variables. It forms homogeneous groups (nodes) by recursively splitting nodes so as to maximize the variability among nodes (^{Nisbet et al., 2009}).

Applications of RTM in poultry science are limited (^{Mendes and Akkartal, 2009}). Considering earlier RTM studies and the advantages of the method, RTM based on the CHAID algorithm may be a useful alternative to traditional regression methods for classifying eggs according to egg quality criteria. However, no published study has reported the prediction of EW using the regression tree method. Hence, the aim of the present investigation was to predict EW from albumen weight (AW), yolk weight (YW), and shell weight (SW) in commercial layer hybrids by means of MLR, RR, and RTM (based on the CHAID algorithm), with a view to developing egg quality standards.

Material and Methods

The data of this study were taken from ^{Saygici (2004}) to show how to apply MLR, RR, and especially RTM constructed by the CHAID algorithm, one of the data mining algorithms, and how to interpret their outputs. A total of 228 commercial layer hybrids at 21 weeks of age were used for predicting EW from different egg traits in the current study. See Saygici (2004) for more detailed information on the animal material.

The variables recorded on the 2049 eggs used in the experiment can be summarized briefly as follows: shell weight (SW, g) - continuous (quantitative) variable; yolk weight (YW, g) - continuous (quantitative) variable; albumen weight (AW, g) - continuous (quantitative) variable; and egg weight (EW, g) - continuous (quantitative) variable.

In the current study, EW was taken as a dependent variable (target), and YW, AW, and SW were considered independent (explanatory) variables, in order to predict the EW via MLR, RR, and RTM (CHAID algorithm) analysis methods.

In general, MLR can be written in matrix notation as: *Y* = *Xβ* + *e*. In this model, *Y* is the vector of the dependent variable; *X* is the matrix of the independent variable(s); *β* is the vector of regression coefficient(s); and *e* is a vector of residuals. Based on the least squares method, the regression coefficients can be expressed as $\hat{\beta} = (X'X)^{-1}X'Y$, the least squares estimator of *β*.
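The least squares step can be illustrated with a minimal Python sketch. It assumes NumPy is available and uses simulated toy data (not the eggs of the study); the variable names and the simulation settings are hypothetical. The coefficients are obtained by solving the normal equations (X'X)β = X'Y.

```python
import numpy as np

# Hypothetical toy data (NOT the study's eggs): SW, YW, AW in grams.
rng = np.random.default_rng(0)
n = 200
sw = rng.normal(6.0, 0.5, n)
yw = rng.normal(15.0, 1.5, n)
aw = rng.normal(35.0, 3.0, n)
# Egg weight is simulated as the sum of the parts plus small noise.
ew = sw + yw + aw + rng.normal(0.0, 0.3, n)

# Design matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), sw, yw, aw])
Y = ew

# Solve the normal equations (X'X) beta = X'Y instead of inverting X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual vector e = Y - X beta_hat.
residuals = Y - X @ beta_hat
```

Solving the linear system is numerically preferable to forming the inverse explicitly, although both correspond to the same estimator.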

Ridge regression is used as a more effective method than least squares in the event of multicollinearity. In RR analysis, a small constant is added to the diagonal elements of the cross-product matrix of the independent variables (SW, AW, and YW).

Ridge regression is an alternative estimator with a lower mean square error. Its estimator is indexed by a parameter 0 ≤ *k* ≤ 1 such that:

$$\hat{\beta}_{RR} = (X'X + kI)^{-1}X'Y$$

in which *kI* is a diagonal matrix whose diagonal elements all equal a small constant *k*.

Bias increases with *k*. Higher values of *k* reduce the effect of multicollinearity and decrease the variance of the estimates, whereas *k* = 0 yields the least squares estimates. RR therefore seeks the best value of *k*. When *k* is chosen according to the variance inflation factor (VIF), a value of *k* for which VIF < 10 can be adopted.
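The role of *k* can be illustrated with a short Python sketch. It assumes NumPy, uses simulated data with two nearly collinear predictors (not the egg data), and puts X'X in correlation form so that values of *k* between 0 and 1 are meaningful. The ridge VIF shown here, the diagonal of (R + kI)^(-1) R (R + kI)^(-1), is one commonly used definition and is an assumption of this sketch, not a formula taken from the study.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Ridge coefficients (X'X + kI)^(-1) X'y with X'X in correlation form."""
    n, p = X.shape
    # Center and scale each column so that Xs'Xs equals the correlation matrix.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
    yc = y - y.mean()
    return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ yc)

def ridge_vif(X, k):
    """Assumed definition: diagonal of (R + kI)^(-1) R (R + kI)^(-1)."""
    R = np.corrcoef(X, rowvar=False)
    A = np.linalg.inv(R + k * np.eye(R.shape[0]))
    return np.diag(A @ R @ A)

# Hypothetical data with two nearly collinear predictors (not the egg data).
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(scale=0.5, size=n)

k_grid = (0.0, 0.01, 0.1, 1.0)
# Length of the coefficient vector and largest VIF for each k.
norms = [float(np.linalg.norm(ridge_estimate(X, y, k))) for k in k_grid]
max_vif = [float(ridge_vif(X, k).max()) for k in k_grid]
```

In this sketch the coefficient vector shrinks and the largest VIF falls as *k* grows, while *k* = 0 reproduces the least squares solution on the standardized data.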

As a tree-based model, RTM identifies the independent variables that best explain the target variable (^{Camdeviren et al., 2005}; ^{Mohammad et al., 2012}). RTM is more beneficial than classical methods, particularly when analysts have large, complex data sets and a great number of independent variables (Camdeviren et al., 2005). Because RTM is a nonparametric method, it requires no assumptions such as normality, constant variance, linearity, and absence of multicollinearity. RTM continues the splitting process recursively in order to reduce the variation in the dependent variable (Camdeviren et al., 2007). In the study, RTM was implemented with the CHAID algorithm in the IBM SPSS 22 program, and a decision tree structure was thus established.

There are three steps (merging, splitting, and stopping) in the CHAID algorithm, which allows multiple splits of any node for a regression problem (^{Nisbet et al., 2009}). A decision tree is grown by repeatedly applying these three steps to each node, starting from the root node (^{Ali et al., 2015}). The root node is partitioned recursively so that the variability of the dependent variable is minimized within the nodes and maximized among the nodes.
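The merging step can be sketched in simplified form in Python. This is an illustration under stated assumptions, not the SPSS implementation: adjacent ordinal categories are compared with a two-group F statistic, and the pair with the smallest F is merged while that F stays below a hypothetical threshold `f_threshold`. CHAID itself compares (adjusted) P values against a merge significance level, and the toy values below are invented.

```python
from statistics import mean

def pair_f(a, b):
    """One-way ANOVA F statistic for two groups of target values."""
    n_a, n_b = len(a), len(b)
    grand = mean(a + b)
    ss_between = n_a * (mean(a) - grand) ** 2 + n_b * (mean(b) - grand) ** 2
    ss_within = (sum((v - mean(a)) ** 2 for v in a)
                 + sum((v - mean(b)) ** 2 for v in b))
    return ss_between / (ss_within / (n_a + n_b - 2))

def merge_adjacent(groups, f_threshold):
    """Merge the most similar adjacent categories until all pairs differ."""
    groups = [list(g) for g in groups]
    while len(groups) > 1:
        f_values = [pair_f(groups[i], groups[i + 1])
                    for i in range(len(groups) - 1)]
        i = min(range(len(f_values)), key=f_values.__getitem__)
        if f_values[i] >= f_threshold:
            break  # every adjacent pair is sufficiently different
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups

# Hypothetical target values (g) in five ordered bins of a predictor.
bins = [[40, 41, 39], [40.5, 41.5, 39.5], [55, 56, 54],
        [55.5, 56.5, 54.5], [70, 71, 69]]
merged = merge_adjacent(bins, f_threshold=4.0)
```

With these toy values, the first two bins and the next two bins are merged, leaving three groups that would become the child nodes of the split.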

The CHAID algorithm handles only nominal or ordinal categorical independent variables. For this reason, continuous independent variables are converted into ordinal independent variables before the following algorithm is used. For a given set of break points $a_1, a_2, \ldots, a_{K-1}$ (in ascending order), a given $x$ is mapped into category $C(x)$ as follows:

$$C(x) = \begin{cases} 1 & x \le a_1 \\ k+1 & a_k < x \le a_{k+1},\; k = 1, \ldots, K-2 \\ K & a_{K-1} < x \end{cases}$$

When $K$ is the desired number of bins, the break points are estimated from the ranks of the $x_i$; frequency weights are incorporated in calculating the ranks, and in the case of ties the average rank is used. The ranks and the corresponding values in ascending order are denoted by $\{r_{(i)}, x_{(i)}\}_{i=1}^{n}$.

For $k = 0$ to $(K-1)$, set $I_k = \{\, i : \lfloor r_{(i)} K / (N_f + 1) \rfloor = k \,\}$, in which $\lfloor x \rfloor$ indicates the floor integer of $x$ and $N_f$ is the sum of the frequency weights. If $I_k$ is not empty, $i_k = \max\{\, i : i \in I_k \,\}$. The break points are set equal to the $x$ values corresponding to the $i_k$, excluding the largest (^{Breiman et al., 1984}).
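The binning of a continuous predictor can be sketched in Python as follows. The sketch assumes unit frequency weights (so the total weight equals the number of values) and assigns each value to bin floor(rank × K / (n + 1)), which is our reading of the rank-based rule described above; the function names are hypothetical.

```python
import math
from bisect import bisect_left

def average_ranks(values):
    """Ranks 1..n of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2 + 1  # mean of 1-based positions in the tie
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks

def break_points(values, n_bins):
    """Rank-based break points, assuming unit frequency weights."""
    n = len(values)
    largest_in_bin = {}
    for x, r in zip(values, average_ranks(values)):
        k = math.floor(r * n_bins / (n + 1))
        largest_in_bin[k] = max(x, largest_in_bin.get(k, x))
    cuts = sorted(largest_in_bin.values())
    return cuts[:-1]  # the largest value is not a break point

def category(x, cuts):
    """C(x): 1 if x <= a_1, k+1 if a_k < x <= a_(k+1), K if x > a_(K-1)."""
    return bisect_left(cuts, x) + 1

cuts = break_points(list(range(1, 21)), n_bins=4)
```

For the integers 1 to 20 and four bins, the break points are 5, 10, and 15, so that a value of 5 falls in category 1 and a value of 16 in category 4.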

A Bonferroni adjustment was applied in RTM based on the CHAID algorithm to obtain adjusted P values for the F values. The tree-based algorithm, which prunes automatically by ignoring unnecessary nodes in the decision tree, uses the F significance test when the dependent variable is continuous. Ten-fold cross-validation was applied in the statistical evaluation.
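The Bonferroni adjustment for an ordinal predictor can be sketched in Python. For a predictor with c ordered categories merged into r groups, the multiplier commonly used in CHAID is the number of ways to form r contiguous groups, C(c - 1, r - 1). The numbers of categories and the P value in the example are hypothetical, since the number of initial bins is not reported in the study.

```python
from math import comb

def bonferroni_multiplier_ordinal(c, r):
    """Ways to merge c ordered categories into r contiguous groups."""
    return comb(c - 1, r - 1)

def adjusted_p(p_value, c, r):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * bonferroni_multiplier_ordinal(c, r))

# Hypothetical example: 10 initial bins merged into 8 groups.
multiplier = bonferroni_multiplier_ordinal(10, 8)
p_adj = adjusted_p(0.0002, 10, 8)
```

The adjustment penalizes predictors for which many alternative groupings were examined before the final split was chosen.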

Initially, Pearson correlations were estimated between pairs of egg traits. The predictive power of RR, MLR, and RTM was measured by the coefficient of determination (R^{2}, %), the proportion of the variability in EW that is explained. The IBM SPSS 22 program was used for the statistical analyses.

Results

In the case of multicollinearity, the decision trees constructed by RTM algorithms allow a considerably easier visual interpretation of the data than traditional methods such as MLR and RR. In terms of statistical performance under multicollinearity, RR analysis is traditionally preferred to MLR analysis.

Significant positive correlations were found among egg quality traits (P<0.01). The Pearson correlations estimated in the present egg data were as follows: SW and YW (r = 0.470), SW and AW (r = 0.539), YW and AW (r = 0.654), SW and EW (r = 0.642), YW and EW (r = 0.777), and AW and EW (r = 0.932).

To explain the total variability in EW, the collected egg data were analyzed using MLR. The MLR results are summarized in Table 1. The ANOVA result (F value) showed that the MLR model built in the present paper was statistically significant (P<0.01). Together, the independent variables (AW, YW, and SW) accounted for 93.4% of the total variability in EW as the response (dependent) variable, and the VIF values, which ranged from 1.459 to 1.984, indicated no multicollinearity problem. Given the positive coefficients, EW would be expected to increase as AW, YW, and SW increase.
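As a consistency check, the VIF values and the R^{2} of the MLR model can be approximated from the Pearson correlations reported above. The Python sketch below assumes NumPy and relies on two standard identities: the VIF of each predictor is the corresponding diagonal element of the inverse predictor correlation matrix, and R^{2} equals the product of the standardized coefficients with the predictor-response correlations.

```python
import numpy as np

# Reported Pearson correlations among predictors (order: SW, YW, AW).
R_xx = np.array([
    [1.000, 0.470, 0.539],
    [0.470, 1.000, 0.654],
    [0.539, 0.654, 1.000],
])
# Reported correlations of SW, YW, and AW with EW.
r_xy = np.array([0.642, 0.777, 0.932])

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R_xx))

# Standardized regression coefficients and the implied R^2.
beta_std = np.linalg.solve(R_xx, r_xy)
r_squared = float(beta_std @ r_xy)
```

Because the correlations are rounded to three decimals, the results agree with the reported VIF range (1.459 to 1.984) and R^{2} (93.4%) only approximately.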

SE - standard error; t - t test value; VIF - variance inflation factor.

SW - shell weight; YW - yolk weight; AW - albumen weight.

S = 2.01925; R-Sq = 93.4%; R-Sq(adj) = 93.4%.

Ridge regression, preferred as an alternative method to MLR, was also applied to the collected data to explain the total variability in EW. The RR model explained 93.15% of the variability, a result very similar to the MLR results above (Table 2).

A decision tree diagram was constructed via the CHAID algorithm to obtain detailed information on the independent variables significantly affecting EW (Figure 1). Albumen weight, YW, and SW were the significant independent variables in the decision tree diagram, and accounted for nearly all (99.988%) of the total variation in EW. Of these independent variables, AW (F = 1885.446, df1 = 7, and df2 = 2041) was the primary variable for EW, followed by YW and SW in order of significance. As AW increased, the average EW ranged widely, from 40.195 to 70.125 g, from node 1 through node 8 (Figure 1).

Node 0, the root node at the top of the RTM diagram, contained all the studied eggs. The average EW for node 0 was 58.764 (S = 7.872) g from 2049 eggs. Node 0 was divided into eight child nodes (nodes 1-8) on the basis of AW. Of these eight nodes, nodes 1, 2, 4, 6, and 7 were terminal nodes in the RTM diagram drawn via the CHAID algorithm.

Node 1 (a cluster of eggs with AW ≤ 27 g) had an average EW of 40.195 (S = 6.169) g (n = 174 eggs). Node 2 (a cluster of eggs with 27 < AW ≤ 31 g) had an average EW of 50.065 (S = 2.903) g (n = 169 eggs). Node 3 (a cluster of eggs with 31 < AW ≤ 33 g), with an average EW of 56.983 (S = 2.172) g (n = 235 eggs), was branched into nodes 9 and 10 on the basis of YW. Yolk weight had a very significant influence on the EW of eggs in node 3 (F = 52.243, df1 = 1, and df2 = 233) (Adjusted P<0.01). Node 9 (a cluster of eggs with YW ≤ 14 g among eggs with 31 < AW ≤ 33 g) gave an average EW of 56.083 (S = 2.189) g (n = 121 eggs). Node 10, a cluster of eggs with YW > 14 g among eggs with 31 < AW ≤ 33 g, yielded an average EW of 57.939 (S = 1.700) g (n = 114 eggs). Node 4 (a terminal node of eggs with 33 < AW ≤ 34 g) had an average EW of 58.961 (S = 2.241) g (n = 415 eggs). The average EW for node 5 (a cluster of n = 389 eggs with 34 < AW ≤ 36 g) was 60.784 (S = 2.257) g. Node 5 was branched into nodes 11 and 12 on the basis of YW. Node 11 (a cluster of eggs with YW ≤ 17 g among eggs with 34 < AW ≤ 36 g) had an average EW of 59.946 (S = 1.654) g from n = 299 eggs. Node 12, a cluster of eggs with YW > 17 g among eggs with 34 < AW ≤ 36 g, had an average EW of 63.567 (S = 1.690) g from 90 eggs, which was heavier than the average of node 11.

This indicates a profound impact of YW on the EW of eggs in node 5 (F = 328.139, df1 = 1, and df2 = 387) (Adjusted P<0.01). Node 11, significantly influenced by SW (F = 43.027, df1 = 1, and df2 = 297), was divided into two child nodes, 15 and 16. The average EW values predicted for nodes 15 and 16, 59.244 (S = 1.333) g and 60.438 (S = 1.682) g, respectively, were similar.

The average EW of 61.629 (S = 2.213) g was obtained from node 6, a cluster of eggs with 36 < AW ≤ 38 g (n = 251 eggs). Node 7, consisting of a cluster of eggs with 38 < AW ≤ 41 g, provided the average EW of 64.983 (S = 2.045) g from 232 eggs.

The average EW of node 8, a cluster of eggs with AW > 41 g (n = 184), was 70.125 (S = 3.001) g; this node was again branched into two child nodes, 13 and 14, on the basis of YW. The statistically significant effect of YW on EW thus reappeared for the eggs in node 8 (F = 76.738, df1 = 1, and df2 = 182) (Adjusted P<0.01). Node 13, comprising eggs with AW > 41 g and YW ≤ 17 g, had an average EW of 68.680 (S = 2.991) g (n = 103 eggs). Node 14, containing eggs with AW > 41 g and YW > 17 g, had the heaviest average EW of 71.963 (S = 1.757) g (n = 81 eggs), whereas the lightest average EW was produced by node 1.
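The F statistics of the splits can be cross-checked from the node sizes, means, and standard deviations reported in this section. The Python sketch below computes a one-way ANOVA F statistic from such summary values; the function name is hypothetical, and the inputs are the rounded figures given in the text.

```python
def f_from_summary(nodes):
    """One-way ANOVA from (n, mean, sd) triples of the child nodes.

    Returns (grand mean, F, df1, df2).
    """
    n_total = sum(n for n, _, _ in nodes)
    grand_mean = sum(n * m for n, m, _ in nodes) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in nodes)
    ss_within = sum((n - 1) * s ** 2 for n, _, s in nodes)
    df1 = len(nodes) - 1
    df2 = n_total - len(nodes)
    return grand_mean, (ss_between / df1) / (ss_within / df2), df1, df2

# (n, mean EW in g, S) for nodes 1-8, as reported in the text.
root_children = [
    (174, 40.195, 6.169), (169, 50.065, 2.903), (235, 56.983, 2.172),
    (415, 58.961, 2.241), (389, 60.784, 2.257), (251, 61.629, 2.213),
    (232, 64.983, 2.045), (184, 70.125, 3.001),
]
# Children of node 3 (nodes 9 and 10).
node3_children = [(121, 56.083, 2.189), (114, 57.939, 1.700)]

root_mean, root_f, root_df1, root_df2 = f_from_summary(root_children)
_, node3_f, _, _ = f_from_summary(node3_children)
```

The weighted mean of nodes 1-8 recovers the root mean of 58.764 g, and the F values for the root split and for node 3 come out close to the reported 1885.446 and 52.243; small differences arise from rounding of the summary statistics.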

There was very close agreement between the observed and predicted values obtained by RTM based on the CHAID algorithm, with an R^{2} of almost 100%.
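The decision rules described for Figure 1 can be written as a small Python function that returns the node and its average EW for a given egg. This is a sketch built only from the cut-offs and node means reported in the text; the function name is hypothetical, and the split of node 11 on SW is omitted because its cut-off is not given in the text.

```python
def predict_ew(aw, yw):
    """Return (node, average EW in g) from the reported AW and YW splits."""
    if aw <= 27:
        return 1, 40.195
    if aw <= 31:
        return 2, 50.065
    if aw <= 33:
        return (9, 56.083) if yw <= 14 else (10, 57.939)
    if aw <= 34:
        return 4, 58.961
    if aw <= 36:
        # Node 11 is split further on SW (nodes 15 and 16), but the SW
        # cut-off is not reported, so this sketch stops at node 11.
        return (11, 59.946) if yw <= 17 else (12, 63.567)
    if aw <= 38:
        return 6, 61.629
    if aw <= 41:
        return 7, 64.983
    return (13, 68.680) if yw <= 17 else (14, 71.963)

# Example: an egg with AW = 42 g and YW = 18 g falls in node 14.
node, ew_hat = predict_ew(42, 18)
```

Each egg is assigned the mean EW of the node it reaches, which is how a regression tree produces its predictions.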

Discussion

In the current study, there were positive correlations among all independent variables (P<0.001), which agrees with the result of Ratherd et al. (2011), who reported a high correlation between internal and external egg quality traits in Japanese quails and acknowledged that multicollinearity occurs among independent variables in regression models. They applied principal component regression to overcome the multicollinearity problem. The benefits of ridge regression in the presence of multicollinearity have also been noted.

In many studies, the correlations between internal and external egg quality variables were reported to be strongly positive and highly significant (P<0.01) (^{Alkan et al., 2010}; ^{Narinç et al., 2011}; ^{Kul and Şeker, 2004}; ^{Üçkardeş et al., 2010}). ^{Akbaş et al. (1996}) remarked on positive correlations between pairs of hen age and egg weight, yolk width, albumen width, and albumen length. However, they recorded some negative correlations between egg weight and shell strength, shell thickness, yolk index, albumen index, yolk height, albumen height, and Haugh unit.

Regression, correlation, and MLR analyses have been employed to identify the relationship between egg weight and these quality traits in different layer strains. Under multicollinearity, ridge regression can give more accurate estimates than MLR analysis. Indeed, choosing the most effective statistical method is the most important matter in the estimation of EW from egg quality traits. In comparison with the statistical methods highlighted above, RTM, which can be understood and interpreted more easily in visual form, has been reported to be unaffected by multicollinearity, outliers, and missing observations (^{Mendes and Akkartal, 2009}; ^{Karabağ et al., 2010}). However, the present results showed no multicollinearity problem in either the MLR or the RR analysis. Because there was no multicollinearity problem in MLR, ridge regression analysis was not necessary for interpreting this type of data.

Conclusions

Egg quality is very important for poultry farming and the poultry sector. Regression tree analysis based on the CHAID algorithm, with a much higher predictive accuracy (R^{2} = 99.988%), is a powerful approach for detecting the relationship between egg weight and internal (albumen and yolk weights) and external (shell weight) quality traits, which are indicative of egg quality. The decision tree from regression tree analysis shows that the highest egg weight (71.963 g) is obtained from eggs with albumen weight >41 g and yolk weight >17 g. Consequently, regression tree analysis is expected to find wider use in the poultry sector because, as a non-parametric technique, it makes no assumptions about the independent variables.