Heuristic methods applied in reference evapotranspiration modeling

The importance of the precise estimation of evapotranspiration is directly related to sustainable water usage. Since agriculture represents 70% of Brazil’s water consumption, adequate and efficient application of water may reduce the conflicts over the use of water among the multiple users. Considering the importance of accurate estimation of evapotranspiration, the objective of the present study was to model and compare the reference evapotranspiration from different heuristic methodologies. The standard Penman-Monteith method was used as reference for evapotranspiration, however, to evaluate the heuristic methodologies with scarce data, two widely known methods had their performances assessed in relation to Penman-Monteith. The methods used to estimate evapotranspiration from scarce data were PriestleyTaylor and Thornthwaite. The computational techniques Stepwise Regression (SWR), Random Forest (RF), Cubist (CB), Bayesian Regularized Neural Network (BRNN) and Support Vector Machines (SVM) were used to estimate evapotranspiration with scarce and full meteorological data. The results show the robustness of the heuristic methods in the prediction of the evapotranspiration. The performance criteria of machine learning methods for full weather data varied from 0.14 to 0.22 mm d-1 for mean absolute error (MAE), from 0.21 to 0.29 mm d-1 for root mean squared error (RMSE) and from 0.95 to 0.99 coefficient of determination (r2). The computational techniques proved superior performance to established methods in literature, even in scenarios of scarce variables. The BRNN presented the best performance overall.


INTRODUCTION
Evapotranspiration is a fundamental parameter in comprehending Earth's hydrological cycle, as well as agricultural crops' water demand (French;Hunsaker;Thorp, 2015;Zhang;Kimball;Running, 2016).Since agriculture is one of the bases of Brazil's economy, the determination of evapotranspiration has great relevance in decision-making, especially for irrigation management.In this context, the evapotranspiration reflects directly on the calculations of crop's daily demand (Feng et al., 2017a;Guermazi;Bouaziz;Zairi, 2016;Toureiro et al., 2017).
The precise estimation of evapotranspiration is interconnected to sustainable water usage, where the supply of adequate amounts of water to crops is crucial (Djaman et al., 2018;Petropoulos et al., 2016Petropoulos et al., , 2018)).Since agriculture represents 70% of Brazil's water consumption, the conflict over water resources in many regions, such as the Brazilian Cerrado, has increased with the expansion of irrigated areas (Maneta et al., 2009).Taking in consideration a specific volume of water, an adequate and efficient application of water quantities allows the attendance of larger irrigated areas, reducing the conflicts over the use of water among the multiple users (Abdullah et al., 2015;Karimi;Bastiaanssen, 2015;Khanal;Fulton;Shearer, 2017).
The economical point of view should also be noted, for an example, if one chooses to apply a larger amount of water than necessary, not only will the expenses with electric energy be greater, but there may be losses related to soil erosion and leaching of soil nutrients, which leads to eutrophication of water bodies (Reddy;Cunha;Kurian, 2018).On the other hand, when a quantity smaller than demanded by the crop is applied, these may present a decrease in productivity, resulting in a lower profitability of the agribusiness.
According to the Food and Agriculture Organization (FAO), the standard method for the determination of reference evapotranspiration (ETo) is the Penman-Monteith methodology (Allen et al., 1998).However, this methodology requires a wide range of climatic variables measured at site, such as temperature (maximum, average and minimum), solar radiation, wind speed and relative humidity, variables that aren't readily available for many regions (Abdullah et al., 2015;Feng et al., 2017a).This fact makes it relevant to investigate more parsimonious and efficient possibilities for the estimation of reference evapotranspiration.
Heuristic methods, also known as machine learning or pattern recognition algorithms, have presented promising results in the modeling of meteorological parameters (Adnan;Latif;Nazir, 2017;Feng et al., 2017b;Yao et al., 2017;Zhou et al., 2017).These authors have, for other regions, estimated evapotranspiration from the same meteorological data required by the Penman-Monteith, as well as from scarce data.Methods such as neural networks, regression trees, support vector machines and many others present, for the most part, a heuristic characteristic coupled with a high ability to generalize and model patterns (Wang et al., 2017b(Wang et al., , 2017c)).
Considering the importance of accurate estimation of evapotranspiration for the region and the high generalization capacity of machine learning algorithms, the objective of the present study was to model and compare the reference evapotranspiration from different heuristic methodologies, using as reference the standard Penman-Monteith equation.

Study area
The study area comprises the mesoregions Noroeste de Minas and Triângulo Mineiro/Alto Paranaíba in the state of Minas Gerais, Brazil (Figure 1).Altogether, 85 municipalities compose the two mesoregions, with a total area and population of approximately 154 thousand km 2 and 2.7 million inhabitants.

R G
Ws e e T ETo Ws The Noroeste de Minas and Triângulo Mineiro/ Alto Paranaíba are similar mesoregions in their extensive plateau in regions of savannah, which facilitates the planting and use of agricultural machinery.For these reasons, medium and large sized entrepreneurs have widely explored the region while developing local agriculture with extensive use of technology (Bastos;Gomes, 2010).
The mesoregions have a characteristic climate of the Brazilian Cerrado, with annual rainfall varying from 1000 to 1400 mm, and presenting two well defined seasons, where 80 to 90% of the annual rainfall predominates during the rainy season (Mello et al., 2007).During the dry season, the water scarcity strongly affects the region, whereas the production of the second harvest is only possible through intensive use of irrigation.

Data
For the development of the study, daily data were used referring to 11 meteorological stations spaced in the study area (Figure 1).The adopted study period was between 1987 and 2016, summing 30 years of climatological information.The data were obtained from the Meteorological Database for Teaching and Research (BDMEP), a database of the National Institute of Meteorology (INMET) of Brazil.The stations are located in the mesoregions of the Triângulo Mineiro/ Alto Paranaíba and Noroeste de Minas.The information concerning the geographic coordinates, referenced in the Datum WGS-84, and hypsometry of each station are presented in Table 1.
The data provided by the meteorological stations, which were used in the present research are: average (T avg , °C), maximum (T max , °C) and minimum air temperature (T min , °C); relative humidity (RH, %); solar radiation (SR, MJ m -2 d -1 ); and, wind speed (Ws, m s -1 ).

Reference evapotranspiration
The reference evapotranspiration is the evapotranspiration referring to a hypothetical crop that completely covers the soil, is in active growth, does not present water and nutritional restriction, and presents specific characteristics such as albedo equal to 0.23 and height between 8 and 15 cm (Allen et al., 1998).Among the various methods of ETo estimation, the Penman-Monteith (Equation 1), presented by the FAO, is recommended as the standard (Torres;Walker;McKee, 2011): where: ETo = reference evapotranspiraton by Penman Monteith, mm d -1 ; R n = net radiation, MJ m -2 d -1 ; G = soil heat flux, MJ m -2 d -1 ; T avg = mean air temperature, °C; Ws = wind speed at 2 m height, m s -1 ; e s = saturation vapour pressure, kPa; e a = actual vapour pressure, kPa; Δ Δ = slope of the saturation vapour pressure function, kPa °C-1 ; and, (2) 7 3 5 2 2 6.75 10 7.71 10 1.79 10 0.49 10 10 1, 6 In order to evaluate the performance of machine learning methods using scarce meteorological data, the methodologies adopted in this research were the Priestley-Taylor (Priestley;Taylor, 1972) and Thornthwaite  (Thornthwaite, 1948) (Equations 2 and 3): where: ETpt = reference evapotranspiration by Priestley-Taylor, mm d -1 ; and, α = empirical coefficient equivalent to 1.26. in which: ETtw = reference evapotranspiration by Thornthwaite, mm month -1 ; d = days in a month; Ta = monthly mean temperature, °C; and, I = annual heat index.

Stepwise Regression
One of the difficulties when working with linear regressions is selecting the predictors to be employed in the model.The Stepwise Regression (SWR) is a numerical modeling that may be applied by computational techniques, in which the independent variables are repeatedly removed and added to the multiple linear regressions in order to verify which equation will adapt and perform the best (Abraham et al., 2017).The final model is shown in Equation 4: where: β 0 = linear regression gain; β 1 ...β n = offsets for every independent variable; X 1 ...X n = independent variables.

Random Forest
Random Forest (RF) is a machine learning method capable of performing both classification and regression procedures.A large number of regressions trees are created each from a "Bootstrap" sampling of the original dataset (Breiman, 2001).As an inherent characteristic of machine learning methods are the ability to solve complex interactions between predictor variables, even when they present high collinearity (Brokamp et al., 2017).In the case of regressions, the final prediction is performed through the means between the predictions of each tree (Rahman et al., 2016).

Cubist Regression
The Cubist regression (CB) is also based in regression trees.At the end of the model are generated several rules, where each rule is associated with a linear model.The model has the characteristic of being based on multiple regression models, making the final product the average of these.Although the prediction is made based on a multiple linear regression, the result is smoothed by the means of the predictions realized in previous knots of the tree (Im et al., 2012).

Bayesian Regularized Neural Network
The neural network is formed by simple elements operating in parallel.Inspired by a biological neural network, the neural network receives its independent neurons in its input.The variables are passed to subsequent layers of neurons, where, passing through a transfer function, the weighted sum of input values are calculated, providing an output for the neuron in analysis (Wang et al., 2017a).The bayesian regularized neural networks (BRNN) are more robust than the networks that use the back propagation of the errors, besides avoiding the overfitting of the model (Ticknor, 2013).

Support Vector Machines
Support vector machines (SVMs) are considered as supervised learning methods, which can be used both to classify a set of samples and to regress them.When a dataset is submitted to SVM analysis, they separate the data by constructing hyperplanes, aiming at the implantation of surfaces with the greatest possible margin between the datasets considered as different (Wang et al., 2017a).The larger the margin of the hyperplane, the greater is the generalization capacity of the model created to predict or classify the data set.

Performance criteria
In this study, five methods of pattern recognition were evaluated and compared to each other based on mean absolute error (MAE), root mean squared error (RMSE) and coefficient of determination (r²), which can be expressed by Equations 5 to 7:  in which: O i = observed evapotranspiration data of order i; and, P i = simulated evapotranspiration of order i.The Penman-Monteith ETo will be used as reference for observed values.
The absolute mean error indicates the mean amplitude of the errors, whereas the root of the mean squared error adds a greater weight to the errors of greater magnitude, being important for highlighting possible outliers.The coefficient of determination indicates how much the generated model explains the observed variable.

Models validation
For heuristic methods, datasets are commonly divided in proportions such as 70-30 or 80-20 for training and test sets.However, the ideal ratio isn't a consensus.For this reason, the data set was initially randomly divided into a training set containing only 1% of the data, which was gradually increased in order to verify how much longer the models would take to train and if performance improvement was significant.The training set was used to train and obtain optimal tuning parameters for each model, after that, a repeated k-fold cross validation was performed, where k = 10, and the number of repetitions = 10.The models were then used to predict the ETo for the test set.

Input combination scenarios
To evaluate the performance of the methods based on machine learning, five scenarios were simulated by different input combinations of variables for the models.The combinations used in this study are presented in Table 2, where, as can be seen, the numbers in front of each method indicate the input combination.For the scenarios "2" through "5", with the intention to reduce possible collinearity effects, only the temperature variable with highest correlation to ETo (T) was used.

Data analysis
The statistics of daily climatic parameters, X mean , X max , X min , S x , C v e C x , which denote the mean, maximum, minimum, standard deviation, coefficient of variation and skewness, respectively, are shown in Table 3.The daily ETo ranged from 0.80 to 10.10 mm, with a mean of 4.15 mm and C v of 0.31.SR presented a daily average of 18.95 MJ m -2 , ranging from 6.00 to 33.00 MJ m -2 throughout the study period.Along with SR, the T max and T avg showed greater correlation to ETo than Ws, RH and T min .The temperature variables showed overall low coefficient of variation.In Figure 2 it is possible to verify the behavior of the variables in relation to ETo as well as among themselves for the training phase.Besides asymmetry, the kurtosis may also be observed in the diagonal plot of Figure 2.
Although presenting high correlation among themselves, the maximum, mean and average temperature were kept for the computational modeling of the scenario "1".This was done in the expectation that they could increase the explanatory power of these methods, given that machine learning methods are known to present high capacity of finding patterns in the data, even if they have high collinearity.For the remaining scenarios, only the T max was used, due to its higher correlation to ETo (Table 3).

Thornthwaite and Priestley-Taylor methods
The Priestley-Taylor method presented superior performance over the Thornthwaite method (Table 4).This was expected because the Thornthwaite is a monthly evapotranspiration method based solely on the average air temperature, while the Priestley-Taylor method takes in account the solar radiation, the climatic variable of highest correlation to ETo in the region (Table 3).The Priestley-Taylor method presented MAE and RMSE of 0.58 and 0.80 mm d -1 , respectively, about 38% lower than the errors obtained through the Thornthwaite method.
The Priestley-Taylor method showed a greater efficiency than the Thornthwaite method, which corresponds to the lower data dispersion and is represented by the coefficient of determination (Figure 3).The Priestley-Taylor was more effective as well, as its fit line is closer to the ideal (1:1).

Heuristic methods
The size of training set was gradually increased during training phase.However, by the time the training set was composed by 20% of the original dataset, the time required to train the models had increased over 20 times while little performance enhancement was observed.Using 20% of dataset as training set improved MAE and RMSE by only 0.01 to 0.02 mm d -1 .For these reasons and because the performance was already judged satisfactory, further analysis and discussion were based on the training set composed by 1% of full dataset.
While modeling evapotranspiration in each scenario, different optimal parameters were obtained for the heuristic methods.Optimal tuning parameters for the scenarios "1", "2", "3", "4" and "5" were: for RF, 5, 2, 2, 2 and 2 variables randomly chosen to build each tree, respectively, with 500 trees for all the scenarios; for the hidden layer of the BRNN model, 5, 5, 5, 5 and 4 neurons, respectively; for the number of neighbors and committees in the CB models, equal to 20 and 9, 20 and 9, 10 and 9, 20 and 0, and, 20 and 0, respectively.SWR final equations were composed by T max , T med , RH, SR and Ws for scenario "1", T max , SR and Ws for scenario "2", and for the other scenarios, all available variables were used.SVM models kept the C parameter at a constant 1 for all scenarios.The validation and testing performance of the SWR, RF, CB, BRNN and SVM methods for predicting ETo are presented in Table 5.The methods based on all available data (input combination "1") presented higher accuracy (Table 5).The methods using only T max , RH, SR e Ws still presented high performance in the estimation of ETo, with r² varying between 0.93 and 0.98.The best performing methods, from best to worst, were BRNN, CB, RF, SWR and SVM.In addition, the training time of the models for the methods was considered (machine cost), which is highly dependent on the computer processor speed and number of cores available.Although providing the third best prediction performance, the RF method presented the highest machine cost, taking up to 10 times longer to train its models when compared to the CB and BRNN models.The methods with lowest machine cost were the SVM and SWR; on the other hand, these methods had lower performance when compared to the others.Gocić et al. (2015) estimated the reference evapotranspiration in Serbia using heuristic methods and obtained, at their best result, MAE and RMSE equal to 0.054 and 0.233 mm d -1 , respectively.The values of MAE were lower than the findings of the present study; on the other hand its RMSE was higher, which may indicate that, although the models are more accurate, they are not as efficient.
When comparing the Priestley-Taylor method to the heuristic methods while using only temperature and radiation as input, the mean MAE and RMSE obtained by the heuristic methods were 28% and 15% lower, respectively, which is given by the superiority of the methods based in patterns recognition over traditional methods.Using scarce data, the authors Shiri et al. (2014) estimated reference evapotranspiration for Iran using several methods, including SVM and artificial neural networks (ANN).While using temperature and radiation data as input for SVM and ANN methods, Shiri et al. (2014) obtained values of MAE ranging from 0.33 to 0.64 mm d -1 , and RMSE ranging from 0.45 to 0.80 mm d -1 , where the intermediate values referred to semi-arid regions, results similarly to the ones presented here.
It is important to highlight the robustness of the machine learning methods, since the validation performance presented better metrics than conventional equations diffused in the literature, such as the Priestley-Taylor and Thornthwaite equations.The way the training and test sets were divided shows the high generalization capacity of the applied methods, since 1% of the meteorological station data (training set) was able to explain all the rest of the data (test set -99%).This indicates that even with a dataset composed of a few years, the models would still be able to obtain similar performance as presented here.
Figure 4 illustrate the dispersions of the estimates obtained for the test phase by the optimal models.The fit line for the BRNN and CB models were very close to the ideal line (1:1) (Figure 4), however, for the input combinations 3, 4 and 5, BRNN performed better.The dispersion of the data, represented by r², was lower for the BRNN and CB methods, with similar performance.The SVM and SWR methods showed similar data dispersion, and the worst results among all methods here tested.The input combinations "4" presented higher explanatory power over the input combination "5", which was expected, since SR presented higher correlation to ETo than RH (Table 3).For this reason, the use of RH instead of SR influenced on a much more significant dispersion of the data for the scenario "5".Thus, using only T max and RH in the daily estimate of evapotranspiration isn't as efficient, though its performance is still superior to estimating average evapotranspiration monthly, as in the case of the Thornthwaite method.
In general, the greater the number of variables available, the greater was the efficiency and effectiveness of modeling by the computational methods.This implies that a complete database (T max , T avg , T min , RH, SR and Ws) are necessary for a precise estimation of ETo.However, the results obtained while using fewer climatic variables indicates that the high generalization capacity of the models presented here are a viable reality for regions that lack meteorological datasets, or even have only a couple of years of data.The ETo calculated by the Penman-Monteith method was adopted in this study as reference, though this method isn't a direct measure of evapotranspiration.For this reason, the Penman-Monteith method also presents errors when compared to methods such as the lysimeter.In this context, the ideal approach would be fitting the models to evapotranspiration measured directly, where the methods could present even better results.
It is important to emphasize that for these regions, even though some meteorological stations are available, those are punctually acquired data, losing representativeness the farther the agricultural properties are.Therefore, the heuristic methodologies presented here open up the opportunity for irrigation managers to access precise estimations of evapotranspiration according to local data availability.

CONCLUSIONS
The present work evaluated and compared the performance of the pattern recognition methods RandomForest, Cubist, Bayesian Regularized Neural Networks, Support Vector Machines and Stepwise Regression in the estimation of ETo using the climatic variables maximum, average and minimum temperature, relative humidity, solar radiation and wind speed.The climatic data from 11 meteorological stations present in the mesoregions of the Triângulo Mineiro/Alto Paranaíba and Noroeste de Minas were used.In general, the heuristic methods presented better results when a greater number of variables were available.The results show the robustness of the methods in the prediction of the ETo, and, even in scenarios of scarce data, presented smaller errors than alternative methods established in the literature.The climatic variables that presented the highest correlations and explanatory power over the evapotranspiration phenomenon were solar radiation and temperature.To conclude, the Bayesian Regularized Neural Network model presented, for all combinations of data entry, the most accurate and precise results.The results can be applied within the water resource management field, mainly in aid to estimation of evapotranspiration for irrigation and water balance.

Figure 1 :
Figure 1: Localization of area of study and meteorological stations.

Figure 2 :
Figure 2: Correlation and interaction between climatic variables for the training phase.

Figure 3 :
Figure 3: Comparison between the reference evapotranspiration estimated by the Priestley-Taylor and Thornthwaite methods in relation to the Penman-Monteith method.

Figure 4 :
Figure 4: Comparison between the observed values and those estimated by the optimal models of the heuristic methods.

Table 1 :
Meteorological stations, geographic coordinates and altitude.Denomination of the meteorological station according to the world meteorological organization. *

Table 2 :
Combinations of input variables for each artificial intelligence method.

Table 3 :
Statistical parameters for the daily data set.

Table 4 :
Performance criteria of the Priestley-Taylor and Thornthwaite models in relation to Penman-Monteith.

Table 5 :
Comparisons of the different methods in ETo prediction.
RF: Random Forest; CB: cubist; BRNN: Bayesian Regularized Neural Network; SVM: Support Vector Machine; SWR: Stepwise Regression; MAE is in mm per day; RMSE is in mm per day; r² is dimensionless; Time is in seconds.