DATA MINING TO ESTIMATE BROILER MORTALITY WHEN EXPOSED TO HEAT WAVE

Heat waves usually result in losses in animal production, since the animals are exposed to thermal stress that increases mortality and leads to consequent economic losses. Animal science and meteorological databases from recent years contain enough data in the poultry production business to allow the modeling of mortality losses due to heat wave incidence. This research analyzes a database of broiler production associated with climatic data, using data mining techniques such as attribute selection and data classification (decision tree) to model the impact of heat wave incidence on broiler mortality. The temperature and humidity index (THI) was used for screening environmental data. The data mining techniques allowed the development of three comprehensible models for estimating specifically high mortality during broiler production. Two models yielded a classification accuracy of 89.3% by using Principal Component Analysis (PCA) and Wrapper feature selection approaches. Both models obtained a class precision of 0.83 for classifying high mortality. When the feature selection was made by the domain experts, the model accuracy reached 85.7%, while the class precision of high mortality was 0.76. Meteorological data and the calculated THI from meteorological stations were helpful to select the range of harmful environmental conditions for broilers between 29 and 42 days old. The data mining techniques were useful for building animal production models.


INTRODUCTION
Brazilian poultry production has been developing in a very competitive scenario. This scenario requires loss control to reduce production costs and improve productivity. The main factors that affect the thermal comfort of broilers are environmental temperature, relative humidity and wind speed, which may lead to mortality under extreme conditions (Teeter et al., 1985; Yahav et al., 1995; Macari & Furlan, 2001). Broiler mortality patterns in commercial houses exposed to high environmental temperatures present higher values for the first two weeks up to 30 days of rearing (Xin et al., 1994; Cony & Zocche, 2004; Tabler et al., 2006). One way to assess bird thermal comfort for studying mortality is by using an index that combines several factors (Tao & Xin, 2003; Chepete et al., 2005).
A heat wave is a meteorological event with extreme dry bulb temperatures that may impact animal production. These events have become more frequent lately due to global climatic changes; however, very little is known about their impact on Brazilian broiler production. COPA/COGECA (2004), an agricultural committee of the European Union that produced a report on the impact of the European heat wave and the resulting losses in agriculture, shows general economic losses of 15-30% in poultry production for the heat wave that hit European producer countries in 2003. St-Pierre et al. (2003) estimated that in the United States the production loss can reach 128 million dollars when environmental conditions depart from the thermoneutral zone, based on the temperature and humidity index (THI) calculated from meteorological station databases.
Data mining is a promising approach to estimate poultry production mortality. This new research area has emerged as a means of extracting hidden patterns or previously unknown implicit information from large repositories of data (Fayyad et al., 1996; Fayyad & Stolorz, 1997; Rezende et al., 2005). The promise of analyzing large volumes of data has led to an increasing number of successful applications of data mining in recent years. For instance, Zhang et al. (2005) showed how data mining techniques can be used to model native pasture productivity. The results revealed that data mining techniques were very efficient in predicting the main critical points and pasture productivity.
The objective of this research was to build up a model (decision tree) using data mining techniques such as feature selection and data classification, for predicting broiler mortality caused by heat wave incidence, using a database composed of poultry production and meteorological attributes.

MATERIAL AND METHODS
The database was organized from two similar experimental data sets (mortality with no significant differences, P > 0.05 by t test) using 1,000 broilers each, from November to December of 1997 and 1998 in two similar poultry houses. The results showed mortalities above normal in the 5th and 6th weeks of age due to heat wave incidence.
The broiler houses where the flocks were reared had natural ventilation and open sides with lateral curtains, and were East-West solar oriented, at 22°42' S, 47°38' W, and an altitude of 528 m. The meteorological data were taken from USP/ESALQ (2005). The database used for building the model was organized using the following data: broiler mortality and corresponding housing and outside environmental data (dry bulb temperature, DBT; humid bulb temperature, HBT; wind speed, WS; maximum and minimum temperatures, TMAX, TMIN; relative humidity, RH; calculated black globe temperature index, BGTI; and calculated temperature and humidity index, THI). The broiler database referred to the 5th and 6th weeks (between 29 and 42 days old), with a total of 28 instances (observations), screened when the heat wave occurred. In this research, attribute and feature have the same meaning.
Some attributes were derived from the original data for each observation in the database, including the thermal amplitudes inside and outside the housing up to five days prior to the heat wave incidence. The final database used for the analysis contained 70 attributes, distributed as follows: five coming from experimental original data, 16 from the original data from the meteorological stations, 34 derived attributes from the original meteorological database, 12 derived attributes from the housing database, two derived attributes from the interaction between meteorological and housing database, and one class attribute which classified the mortality in one of the following classes: high mortality (HM) and normal mortality (NM), as shown in Table 1.
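As an illustration of how derived attributes of this kind might be computed, the sketch below takes the daily thermal amplitude as TMAX minus TMIN and collects its values for up to five preceding days. The function names, the record layout and the sample temperatures are hypothetical and are not taken from the study's database.

```python
def thermal_amplitude(tmax, tmin):
    """Daily thermal amplitude: maximum minus minimum temperature (deg C)."""
    return tmax - tmin


def lagged_amplitudes(daily, day_index, max_lag=5):
    """Return the amplitudes of the `max_lag` days before `day_index`.

    `daily` is a list of (tmax, tmin) tuples ordered by date.
    Lags that fall before the start of the series are reported as None.
    """
    lags = {}
    for lag in range(1, max_lag + 1):
        i = day_index - lag
        lags[f"amp_lag{lag}"] = thermal_amplitude(*daily[i]) if i >= 0 else None
    return lags


# Hypothetical daily (TMAX, TMIN) readings in deg C, for illustration only.
daily = [(30.0, 18.0), (31.5, 19.0), (33.0, 20.5), (32.0, 21.0)]
features = lagged_amplitudes(daily, day_index=3)
print(features)
```

Each lag becomes one column of the final data set, which is one plausible way a small number of raw measurements can expand into many derived attributes.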
The data mining techniques were applied according to the CRISP-DM methodology comprising the following steps: domain understanding, data acquisition, understanding, preparation, modeling and evaluation according to the knowledge from the domain experts (Chapman et al., 2000).
The software used for the analysis was Weka® 3-4 (Witten & Frank, 2005), which is composed of a collection of machine learning algorithms for data mining tasks (e.g., classification). In particular, the classification algorithm chosen was J48, which generates a decision tree for classifying broiler mortality as normal or high. J48 is the Weka implementation of C4.5, an algorithm introduced by Ross Quinlan (1993) for inducing classification models, also called decision trees. The decision tree generated by J48 can be used for classification and for this reason it is often referred to as a statistical classifier. A decision tree is the representation of recognizable patterns that describe a large number of instances of the training data in a concise and general way, to allow the best possible classification of unknown data. For the tree construction process, information-theoretical concepts (Shannon, 1948) are used to define the best attributes according to the largest information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attributes define the possible branches of the growing tree. Attributes assigned early are more important than attributes assigned later during tree growth. In this way the "most important" attribute, whose values divide the data items into nearly pure subsets with respect to the classification, represents the tree root. Thus the tree construction offers a ranking of the significance of each attribute regarding the classification. At each node, the attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurses on the smaller sublists. The pseudocode of the algorithm J48 can be found in Quinlan (1993) and Quinlan (1996).
The algorithm J48 is one of the best approaches found in the literature for mining rules through decision trees (Han & Kamber, 2006). Apart from that, this algorithm is available in free and commercial software, and several experimental results using J48 for variable selection show that it maintains classification accuracy in many benchmark problems while significantly reducing running times (Martínez & Fuentes, 2005).
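The splitting criterion described above can be sketched in a few lines. The code below is a simplified illustration, not the J48 implementation: it computes the Shannon entropy of the class labels and the normalized information gain (gain ratio) of a single binary split on a numeric attribute at a given threshold. The attribute values, the threshold and the function names are hypothetical; the full algorithm also searches candidate thresholds, handles missing values and prunes the tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_by_threshold(values, labels, threshold):
    """Partition labels by whether the numeric attribute is <= threshold."""
    left = [c for v, c in zip(values, labels) if v <= threshold]
    right = [c for v, c in zip(values, labels) if v > threshold]
    return left, right


def gain_ratio(values, labels, threshold):
    """Information gain of a binary split divided by its split information."""
    left, right = split_by_threshold(values, labels, threshold)
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * log2(w) for w in weights)
    return gain / split_info


# Hypothetical attribute values and classes (HM = high, NM = normal mortality).
attr = [20.1, 21.0, 22.4, 23.0, 24.2, 25.1]
cls = ["NM", "NM", "NM", "NM", "HM", "HM"]
print(gain_ratio(attr, cls, threshold=23.0))
```

In this toy example the threshold separates the two classes completely, so both subsets are pure and the split attains the maximum gain ratio; an attribute that produced such a split would be placed at the root.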
Due to the large number of attributes generated in the first data pre-processing, a feature selection was used to remove the attributes with low correlation values. The tools used for the attribute selection were: (i) Principal Component Analysis (PCA), which involves a mathematical procedure to transform a number of (possibly) correlated variables into a smaller number of uncorrelated variables called principal components; (ii) Chi-squared test, which evaluated the dependence between the attribute and its classifier (the class attribute); (iii) Wrapper, which evaluates the attribute subset in a machine learning process and verifies the classification accuracy by cross-validation; (iv) Correlation Feature Selection (CFS), which searches for the subset of correlated attributes while avoiding re-use of the same information; (v) InfoGain, which evaluates the gain in information in relation to the classifier; and (vi) GainRatio, which analyzes the information gain rate related to the specific class, correcting biased measurements. Alternatively, a new feature selection approach was used considering the knowledge of the domain experts, who selected the main attributes based on their expertise. The evaluation of the models was made by two domain experts (specialized in poultry production) analyzing the generated models. Their evaluation took into account how comprehensible the models are, the selected attributes that were used to build other models with another feature selection approach, and the importance of the models concerning the mortality estimation (the model accuracy and the class precision for high mortality).
Table 1 - Summary of used data and features assumed for organizing the final data set.
Feature type: 0 - Class attribute calculated by mortality and specialist judgment; 1 - Attribute of the experiment data registered inside the housing; 2 - Attribute of meteorological station (registered data); 3 - Attribute derived from meteorological station (calculated data); 4 - Attribute derived from housing (calculated data); 5 - Attribute derived from housing and meteorological (calculated data).
The model accuracy was calculated from a confusion matrix (Table 2) and is expressed as the percentage of correctly classified test instances (true positives plus true negatives) over all test instances. The class precision was also calculated from the confusion matrix (Table 2) and is expressed as a rate ranging from 0 to 1, representing, for each class, the proportion of the instances assigned to that class that were correctly classified (Gomes, 2002).
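A minimal sketch of both measures follows, assuming the usual definitions of accuracy and of per-class precision. The confusion matrix counts are hypothetical and do not reproduce Table 2; HM is treated as the positive class.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all test instances that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of instances predicted as the class that truly belong to it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts with HM as the positive class; not the study's Table 2.
tp, fp, fn, tn = 6, 2, 1, 11
acc = accuracy(tp, fp, fn, tn)
prec_hm = precision(tp, fp)
prec_nm = precision(tn, fn)  # NM precision: swap the roles of the two classes
print(f"accuracy={acc:.3f} HM precision={prec_hm:.2f} NM precision={prec_nm:.2f}")
```

The two measures can diverge: a model may classify most instances correctly while still assigning many normal days to the HM class, which is why the class precision for HM is reported separately.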
The classes (HM and NM) have different numbers of members. For example, the class NM has 25 members, while the class HM has only three. Thus, before building the model it was necessary to balance the number of members per class; otherwise the outcomes could be biased toward the classes with more members. One strategy to deal with this problem was to produce a random subsample of the database using sampling with replacement (Breiman, 1996). Two restrictions were applied: preserve the total number of elements and generate a uniformly distributed subsample. To accomplish that, the module Resample of Weka® was used.
The class attribute was chosen as a function of daily broiler mortality. Using daily and weekly broiler mortalities, two classes were selected (HM and NM), which took into account the predicted mortality for that specific breed. The values of daily mortality were compared with those of the weekly mortality in order to avoid classification error.
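The two restrictions can be illustrated with a short sketch that draws, with replacement, an equal share of instances from each class while keeping the total size unchanged. This is an assumption about the general idea only and is not the code of the Weka Resample module; the function name, the seed and the instance identifiers are hypothetical.

```python
import random
from collections import defaultdict


def resample_uniform(instances, labels, seed=0):
    """Draw len(instances) items with replacement, balancing the classes.

    Each class receives an (almost) equal share of the draws; any
    remainder is given to the first classes in sorted order.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)

    classes = sorted(by_class)
    share, extra = divmod(len(instances), len(classes))
    sample = []
    for k, c in enumerate(classes):
        n_draws = share + (1 if k < extra else 0)
        sample += [(x, c) for x in rng.choices(by_class[c], k=n_draws)]
    rng.shuffle(sample)
    return sample


# Class sizes follow the text (25 NM, 3 HM); the instance ids are placeholders.
ids = list(range(28))
labels = ["NM"] * 25 + ["HM"] * 3
balanced = resample_uniform(ids, labels)
counts = {c: sum(1 for _, y in balanced if y == c) for c in ("HM", "NM")}
print(counts)
```

Because only three distinct HM instances exist, the balanced sample necessarily repeats them several times, which should be kept in mind when interpreting accuracy figures obtained on resampled data.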

RESULTS AND DISCUSSION
In the first approach, a decision tree was built without using feature selection and class balance, generating a model with low class precision for HM (Table 3). To improve the model accuracy, the resample technique was used to balance the number of members per class, as recommended by Breiman (1996). Thus, using the feature selection approaches and resample, three comprehensible models were generated, allowing the classification of high mortality as a function of the other attributes of the database.
Two of the comprehensible models were generated using Principal Component Analysis (PCA) and Wrapper feature selection, both yielding a model accuracy of 89.3%. The third screening of classifiers was based on the knowledge of domain experts, in which the attributes that are not related to mortality were removed, generating the third decision tree, which reached a model accuracy of 85.7% (Table 3).
The HM classification yielded a class precision of 0.83 when using the PCA and Wrapper selection techniques, while the selection from the experts' point of view reached a class precision of 0.76. The class precision for classification in NM was greater than 0.95 for all models, but this is less important for this research, which focuses on high poultry mortality. Models generated using the other feature selection approaches (Chi-squared, CFS, InfoGain, GainRatio) were discarded because of the low model accuracy they yielded (less than 75%) and, most importantly, because of their low comprehensibility in the judgment of the experts in poultry. In addition, the accuracy of such models was similar to that obtained without applying feature selection and resampling. The methodology implies discarding models judged less important by the domain experts and retrying other models built with a different approach (Chapman et al., 2000).
The use of feature selection was fundamental to identify the most salient features for building the poultry production model. In general, the use of feature selection reduces the complexity of models and improves their predictive accuracy and comprehensibility (Kim et al., 2002; Guyon & Elisseeff, 2003). This helps explain the low accuracy of the model generated without using feature selection.
The attribute selection was performed using the PCA and Wrapper approaches, which greatly improved the model accuracy, reduced the complexity and improved the precision of the most important class (HM) when compared to the model without feature selection and resampling. On the other hand, when comparing the models built using PCA and Wrapper with the one generated by the experts, it was observed that the models based on feature selection yielded better results. The main reason is that experts frequently use an empirical approach to select features, which may result in a set of redundant or noisy attributes. Such attributes compromise the accuracy and the complexity of a model (Kim et al., 2002; Guyon & Elisseeff, 2003).
The decision tree generated from the model without feature selection and class balance is depicted in Figure 1. The root of this tree is the attribute maximum wind speed (measured daily), which was interpreted by the experts as a noise attribute. Even though the variable wind speed is well known as an important mitigating factor for heat stress (Tao & Xin, 2003; Sevegnani et al., 2001), the class precision found for HM was very low (0.50). The maximum wind speed always occurred during the night, after bird mortality had already taken place (from 15h00 to 18h00), which led to this variable being discarded from the model.
The decision trees best understood by the domain experts were generated using PCA and Wrapper; however, the model built using Wrapper presented less complexity (Figure 2, Decision-Tree a), which may be an advantage for monitoring mortality based on historical and regional data.
Regarding the decision tree shown in Figure 2, Decision-Tree a, only the attribute average THI was used, which is calculated using the equation given in Chepete et al. (2005). The attribute average THI was computed considering the data from the meteorological station. The decision tree built using PCA showed the average daily dry bulb temperature and the average daily wind speed as the main attributes for predicting high broiler mortality due to heat stress, in agreement with various authors (Teeter et al., 1985; Macari & Furlan, 2001; Tao & Xin, 2003; Chepete et al., 2005). The effect of wind speed on broiler performance was investigated in the range of 0.2 to 1.2 m s⁻¹ by Tao & Xin (2003) and within the limits of 0.3 to 1.0 m s⁻¹ by Sevegnani et al. (2001). All the authors found that thermal comfort in adult broilers is dependent on average daily wind speed. In this particular experiment, the association of average local wind speed below 1.4 m s⁻¹ with dry bulb temperature above 24°C was responsible for the high incidence of broiler mortality.
The decision tree in which the feature selection was made by the domain experts was built considering the average THI (Chepete et al., 2005) associated with the average daily wind speed (Figure 3). However, the model accuracy was 85.7%, while the class precision for HM was 0.76, lower than the values obtained with the PCA and Wrapper approaches.
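The condition reported for this experiment can be written as a simple rule. The sketch below encodes only the two thresholds stated in the text (average wind speed below 1.4 m s⁻¹ combined with dry bulb temperature above 24°C) and is not a transcription of the full decision tree in Figure 2; the function name and the example readings are hypothetical.

```python
def high_mortality_risk(avg_wind_speed_ms, avg_dry_bulb_c,
                        wind_limit=1.4, temp_limit=24.0):
    """Return True when both conditions reported in the text hold.

    wind_limit (m/s) and temp_limit (deg C) default to the thresholds
    stated for this particular experiment; they are not general limits.
    """
    return avg_wind_speed_ms < wind_limit and avg_dry_bulb_c > temp_limit


# Hypothetical daily averages: (wind speed in m/s, dry bulb temperature in deg C).
days = [(0.9, 26.5), (1.8, 27.0), (1.0, 22.0)]
flags = [high_mortality_risk(ws, dbt) for ws, dbt in days]
print(flags)
```

Keeping the thresholds as parameters makes it straightforward to refit the rule when regional data suggest different limits.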
Analyzing the meteorological data, it was observed that in only one of the three cases of high broiler mortality the daily maximum absolute temperature was at or above 32°C for more than one day. In this case two consecutive days with maximum absolute temperature reaching 33 and 32°C were observed. In the other incidences, only one day with daily maximum absolute temperature equal to or greater than 32°C was sufficient to lead to high broiler mortality. The conventional definition of heat wave used for humans, as shown in INMET (2005), is not adequate for poultry production. Thus, characterizing a heat wave, including its magnitude and intensity, in terms of its effect on broiler production deserves further exploration. This confirms the findings of Abaurrea et al. (2006) that the heat waves in Europe and in the United States have different profiles.
One effective way to build simpler models would be the inclusion of another attribute based on the temperature, humidity and wind speed index (THVI; Tao & Xin, 2003). The index THVI should be adjusted to broilers in the range of 29 to 42 days of age. The models shown in Figures 2 and 3 can be applied to predict broiler mortality, and the choice between them may depend on the available data.
In general, meteorological stations generate a large amount of data that is seldom used for animal production. Thus, the development of models using regional or local data to predict production losses may be useful for producers who use mitigation actions to reduce the economic impact of heat wave losses.
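One way to screen a temperature series for such events is to measure the length of each run of consecutive days with daily maximum temperature at or above a threshold. The sketch below uses the 32°C value mentioned above as the default; the function name and the temperature series are hypothetical and are not data from the study.

```python
def hot_day_runs(tmax_series, threshold=32.0):
    """Return the lengths of consecutive runs of days with TMAX >= threshold."""
    runs, current = [], 0
    for t in tmax_series:
        if t >= threshold:
            current += 1
        elif current:
            runs.append(current)  # a run has just ended
            current = 0
    if current:
        runs.append(current)  # series ended while a run was still open
    return runs


# Hypothetical daily maximum temperatures in deg C, for illustration only.
tmax = [29.5, 33.0, 32.0, 30.0, 28.0, 32.5, 29.0]
runs = hot_day_runs(tmax)
print(runs)
```

A definition of heat wave suited to broilers could then be explored by varying the threshold and the minimum run length and comparing the flagged periods with the observed mortality.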

CONCLUSION
It was possible to build a predictive model of broiler mortality using a historical flock database and local meteorological data. This model can be applied to situations in which the internal environment directly reflects the outside environmental conditions, when the housing does not use cooling systems. In particular, data mining techniques such as feature selection and data classification (decision tree) are useful for building animal production models that exhibit low complexity in terms of the number of attributes, better predictive accuracy and improved comprehensibility.

Figure 1 - Decision tree generated from the data without feature selection (model accuracy = 82.1% and class precision HM = 0.50).
1 N is equal to the number of instances in the test set.

Table 3 - Model accuracy and HM class precision as a function of the feature selection.