Generalizability of machine learning models and empirical equations for the estimation of reference evapotranspiration from temperature in a semiarid region

The Penman-Monteith equation is recommended for the estimation of reference evapotranspiration (ETo). However, it requires meteorological data that are commonly unavailable. Thus, this study evaluates artificial neural network (ANN), multivariate adaptive regression splines (MARS), and the original and calibrated Hargreaves-Samani (HS) and Penman-Monteith temperature (PMT) equations for the estimation of daily ETo using temperature. Two scenarios were considered: (i) local, in which models were calibrated/developed and evaluated using data from individual weather stations; and (ii) regional, in which models were calibrated/developed using pooled data from several stations and evaluated independently in each one. Local models were also evaluated outside the calibration/training station. Data from 9 stations were used. The original PMT outperformed the original HS, but after local or regional calibrations, they performed similarly. The locally calibrated equations and the local machine learning models exhibited higher performances than their regional versions. However, the regional models had higher generalization capacity, with a more stable performance between stations. The machine learning models performed better than the equations evaluated. When comparing the ANN models with the HS equation, mean RMSE reduced from 0.96 to 0.87 and from 0.84 to 0.73, in regional and local scenarios, respectively. ANN and MARS performed similarly, with a slight advantage for ANN.


INTRODUCTION
Quantification of evapotranspiration is of vital importance for irrigation scheduling. The FAO-56 Penman-Monteith (FAO-PM) equation is widely recommended for the estimation of reference evapotranspiration (ETo) (Allen et al. 1998). However, it requires meteorological variables that are commonly unavailable or unreliable (Almorox et al. 2018, Pinheiro et al. 2019). Thus, equations that require only air temperature can be used as an alternative, since temperature is commonly measured. The Hargreaves-Samani (HS) equation can be used when only air temperature data are available (Allen et al. 1998, Zanetti et al. 2019). In addition, several studies have shown that the FAO-56 Penman-Monteith equation using only measured temperature data, commonly named Penman-Monteith temperature (PMT), can also be used (Raziei & Pereira 2013, Alencar et al. 2015, Almorox et al. 2018). However, the performance of both the HS and PMT equations varies according to the climatic conditions of the place where they are used. Thus, the calibration of these equations is extremely important (Zanetti et al. 2019).
In recent years, machine learning methods, such as artificial neural network (ANN), support vector machine (SVM) and gene expression programming (GEP), have been used to estimate environmental, hydrologic and climatological parameters (Ferreira et al. 2019a, Mehdizadeh et al. 2017, Ozoegwu 2019, Saggi & Jain 2019). These methods are known for their ability to handle complex problems, which makes them powerful tools for ETo modeling. Among machine learning models, ANN has been used for the estimation of ETo by several authors (Antonopoulos & Antonopoulos 2017, Kumar et al. 2011, Ferreira et al. 2019a). Wang et al. (2011), using ANN to estimate ETo in arid regions of Africa, reported that this technique outperformed empirical equations. Ferreira et al. (2019a), evaluating temperature-based ANN in several places of Brazil, reported better results for this technique than for empirical equations. Kumar et al. (2011) reviewed several studies and concluded that ANN is superior to conventional methods.
Another promising technique for the estimation of ETo is multivariate adaptive regression splines (MARS). This is a nonparametric regression analysis used to study nonlinear relations between a response variable and a set of predictor variables (Koc & Bozdogan 2015). Mehdizadeh et al. (2017), working with several data availability scenarios, found that MARS estimated ETo more accurately than empirical equations, SVM and GEP. Ferreira et al. (2019b) reported better results for MARS than for empirical equations in several climate types and data availability scenarios. In contrast with ANN, a MARS model, once developed, is applied through an algebraic equation, which may facilitate the use of the final model. Despite its potential, there are limited studies using MARS for the estimation of ETo (Ferreira et al. 2019b, Mehdizadeh et al. 2017).
ETo models can be calibrated/developed with local or regional data. The first case is the most common approach in the literature. However, a local model can show good performance in the station where it was developed and poor performance in other stations, which can limit its real applicability or even make it useless (Kiafar et al. 2017, Reis et al. 2019). Thus, it is important to evaluate the generalization capacity of the models, assessing their performance outside the calibration/training station. On the other hand, regional models (i.e., models calibrated/developed with pooled data from several weather stations) can be key options in places without data for calibration or development of local models. In contrast with local models, regional models are developed to be used at any place of a particular region. In Brazil, studies addressing the development of regional models are scarce (Ferreira et al. 2019a, Reis et al. 2019, Zanetti et al. 2019).
In northern Minas Gerais, Brazil, a semiarid climate prevails. In this region, in addition to a large number of farms, there are public irrigation perimeters, where irrigation plays a fundamental role in sustaining profitable agriculture. Thus, studies that can contribute to better irrigation and water resources management are essential. In this context, this study evaluated the performance of ANN and MARS and the original and calibrated HS and PMT equations to estimate daily ETo in a semiarid region of Minas Gerais, Brazil, considering two scenarios: (i) local, in which models were calibrated/developed and evaluated using data from individual weather stations; and (ii) regional, in which models were calibrated/developed using pooled data from several weather stations and evaluated independently in each one (leave-one-out cross-validation). The local models were also evaluated considering a cross-station approach.
Maximum and minimum air temperature, relative humidity, sunshine duration and wind speed were used. Wind speed, measured at 10 m height, was converted to 2 m and solar radiation was estimated based on sunshine duration, according to Allen et al. (1998). Days with missing data were removed. The dataset was divided into a training set (2002-2011) and a test set (2012-2016), which were used to develop/calibrate the models and to test them, respectively. The mean numbers of samples (for each weather station) contained in the training and test sets were 3186 and 1312, respectively.
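The two preprocessing steps mentioned above can be sketched as follows. This is a minimal illustration assuming the standard FAO-56 expressions (logarithmic wind profile and the Angstrom formula with the default coefficients a_s = 0.25 and b_s = 0.50); the function names and the example inputs are hypothetical, and the coefficients actually adopted in the study are not stated in the text.

```python
import math


def wind_speed_at_2m(u_z, z=10.0):
    """Convert wind speed u_z (m/s) measured at height z (m) to 2 m height.

    Assumes the FAO-56 logarithmic wind profile.
    """
    return u_z * 4.87 / math.log(67.8 * z - 5.42)


def solar_radiation_from_sunshine(n_hours, day_length, ra, a_s=0.25, b_s=0.50):
    """Estimate solar radiation Rs from sunshine duration (Angstrom formula).

    n_hours: actual sunshine duration (h); day_length: maximum possible
    sunshine duration (h); ra: extraterrestrial radiation (MJ m-2 d-1).
    a_s and b_s are the FAO-56 default coefficients (an assumption here).
    """
    return (a_s + b_s * n_hours / day_length) * ra


# Hypothetical inputs, for illustration only
u2 = wind_speed_at_2m(3.0)                            # 3 m/s measured at 10 m
rs = solar_radiation_from_sunshine(8.0, 12.0, 35.0)   # 8 h of sun in a 12 h day
print(round(u2, 2), round(rs, 2))
```

For a 10 m anemometer the conversion factor is approximately 0.748, so wind speed at 2 m is roughly three quarters of the measured value.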

Methods for the estimation of ETo
To calibrate the PMT and HS equations and to develop the ANN and MARS models, as well as to evaluate these models, daily ETo estimated using the FAO-PM equation with all required data (Equation 1) was adopted as reference.

$$ET_{o,FAO\text{-}PM} = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,u_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,u_2)} \qquad (1)$$

where: $ET_{o,FAO\text{-}PM}$ - reference evapotranspiration calculated by Penman-Monteith, mm d⁻¹; $R_n$ - net radiation, MJ m⁻² d⁻¹; $G$ - soil heat flux, MJ m⁻² d⁻¹ (considered as null for daily estimates); $T$ - daily mean air temperature, °C; $u_2$ - wind speed at 2 m height, m s⁻¹; $e_s$ - saturation vapour pressure, kPa; $e_a$ - actual vapour pressure, kPa; $\Delta$ - slope of the saturation vapour pressure function, kPa °C⁻¹; $\gamma$ - psychrometric constant, kPa °C⁻¹.
Two data management scenarios were used in this study: (i) local scenario, in which models were calibrated/developed and evaluated using data from each weather station individually; and (ii) regional scenario, in which models were calibrated/developed using pooled data from all the weather stations except the station in which the model was evaluated, performing a 9-fold cross-validation (leave-one-out cross-validation) (Figure 2). The local models were also evaluated considering a cross-station evaluation, that is, outside the calibration/training station (Figure 2). All the models studied were calibrated/developed using data from 2002 to 2011 (ten years) and evaluated using data from 2012 to 2016 (five years). Both cross-station evaluation for local models and leave-one-out cross-validation for regional models are important strategies to assess the performance of the models outside the training station, allowing a more robust evaluation.
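The evaluation design described above can be sketched as below. The station labels are placeholders (the text names only some of the nine stations), and the function names are hypothetical; the sketch only enumerates which stations are used for training and testing in each case.

```python
# Placeholder labels for the nine weather stations (not the real names)
STATIONS = [f"station_{i}" for i in range(1, 10)]

TRAIN_YEARS = range(2002, 2012)  # 2002 to 2011, used for calibration/training
TEST_YEARS = range(2012, 2017)   # 2012 to 2016, used for evaluation


def regional_folds(stations):
    """Leave-one-station-out: train on all other stations, test on the held-out one."""
    return [([s for s in stations if s != held_out], held_out)
            for held_out in stations]


def cross_station_pairs(stations):
    """Cross-station evaluation: a local model from `source` is tested at `target`."""
    return [(source, target)
            for source in stations
            for target in stations
            if source != target]


folds = regional_folds(STATIONS)
pairs = cross_station_pairs(STATIONS)
print(len(folds), len(pairs))  # 9 regional folds, 72 cross-station pairs
```

With nine stations, each regional model is trained on eight stations, and each local model is tested at the eight stations other than its own.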
To estimate ETo with the PMT equation, Equation 1 was used, with actual vapour pressure and solar radiation estimated using Equations 2 and 3, respectively, and wind speed set at 2 m s⁻¹, as recommended by Allen et al. (1998).

$$e_a = 0.6108\,\exp\!\left(\frac{17.27\,T_{min}}{T_{min} + 237.3}\right) \qquad (2)$$

where: $e_a$ - actual vapour pressure, kPa; $T_{min}$ - minimum air temperature, °C.
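A minimal sketch of how the PMT approach plugs temperature-based estimates into the Penman-Monteith equation is given below. It assumes the FAO-56 expressions for saturation vapour pressure and its slope, a mean temperature computed as (Tmax + Tmin)/2, and a fixed psychrometric constant (the default `gamma` is a hypothetical value, since it depends on site elevation). Net radiation is passed in directly, because the temperature-based radiation estimate is not reproduced here. All function names are illustrative.

```python
import math


def sat_vapour_pressure(t):
    """Saturation vapour pressure (kPa) at air temperature t (deg C)."""
    return 0.6108 * math.exp(17.27 * t / (t + 237.3))


def slope_vapour_pressure(t):
    """Slope of the saturation vapour pressure curve (kPa/deg C) at t (deg C)."""
    return 4098.0 * sat_vapour_pressure(t) / (t + 237.3) ** 2


def fao_pm(rn, g, t, u2, es, ea, delta, gamma):
    """FAO-56 Penman-Monteith reference evapotranspiration (mm/d)."""
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator


def pmt_eto(tmax, tmin, rn, gamma=0.0666, u2=2.0):
    """PMT sketch: ea taken from Tmin, wind speed fixed at 2 m/s, G = 0.

    rn (MJ m-2 d-1) is assumed to come from a temperature-based radiation
    estimate; gamma is a hypothetical default.
    """
    t = (tmax + tmin) / 2.0
    es = (sat_vapour_pressure(tmax) + sat_vapour_pressure(tmin)) / 2.0
    ea = sat_vapour_pressure(tmin)
    return fao_pm(rn, 0.0, t, u2, es, ea, slope_vapour_pressure(t), gamma)


# Hypothetical daily inputs, for illustration only
print(round(pmt_eto(tmax=32.0, tmin=18.0, rn=14.0), 2))
```

Keeping `fao_pm` separate from `pmt_eto` makes the difference between the two approaches explicit: the equation is the same, and only the source of its inputs changes.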
To estimate ETo using the HS equation, the following equation was implemented:

$$ET_{o,HS} = 0.0023\,R_a\,(T + 17.8)\,(T_{max} - T_{min})^{0.5} \qquad (4)$$

where: $ET_{o,HS}$ - reference evapotranspiration calculated by Hargreaves-Samani, mm d⁻¹; $R_a$ - extraterrestrial radiation, mm d⁻¹; $T_{max}$ - maximum air temperature, °C; $T_{min}$ - minimum air temperature, °C; $T$ - mean air temperature, °C.
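The Hargreaves-Samani calculation can be sketched as below, assuming the standard form of the equation with its original coefficients (0.0023, 17.8 and exponent 0.5), a mean temperature computed as (Tmax + Tmin)/2, and the usual 0.408 factor to convert radiation from MJ m⁻² d⁻¹ to mm d⁻¹. Function names and inputs are illustrative.

```python
import math


def mj_to_mm(radiation_mj):
    """Convert radiation from MJ m-2 d-1 to equivalent evaporation in mm/d."""
    return 0.408 * radiation_mj


def hargreaves_samani(tmax, tmin, ra_mm):
    """Hargreaves-Samani reference evapotranspiration (mm/d).

    tmax, tmin: daily maximum and minimum air temperature (deg C);
    ra_mm: extraterrestrial radiation expressed in mm/d.
    Mean temperature is assumed to be (tmax + tmin) / 2.
    """
    tmean = (tmax + tmin) / 2.0
    return 0.0023 * ra_mm * (tmean + 17.8) * math.sqrt(tmax - tmin)


# Hypothetical day: Tmax = 30 deg C, Tmin = 20 deg C, Ra = 15 mm/d
print(round(hargreaves_samani(30.0, 20.0, 15.0), 2))  # 4.67
```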
The calibrations of the PMT and HS equations were performed by simple linear regression, as suggested by Allen et al. (1998). For this, a linear regression was fitted with ETo values estimated using the FAO-PM equation as the dependent variable and those estimated using the equation under evaluation as the independent variable. The obtained intercept (a) and slope (b) were used as calibration parameters, according to Equation 5.

$$ET_{o,cal} = a + b\,ET_o \qquad (5)$$

where: $ET_{o,cal}$ - calibrated reference evapotranspiration, mm d⁻¹; $a$ and $b$ - calibration parameters; $ET_o$ - reference evapotranspiration estimated by the original equation (equation under study), mm d⁻¹.
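The calibration procedure amounts to an ordinary least-squares fit followed by a linear correction. A minimal sketch is shown below; the function names and the sample values are hypothetical, and the closed-form least-squares solution is used so that no external library is needed.

```python
def fit_linear_calibration(eto_equation, eto_reference):
    """Fit eto_reference = a + b * eto_equation by ordinary least squares.

    eto_equation: ETo from the equation under study (independent variable);
    eto_reference: ETo from FAO-PM (dependent variable).
    Returns the intercept a and slope b.
    """
    n = len(eto_equation)
    mean_x = sum(eto_equation) / n
    mean_y = sum(eto_reference) / n
    sxx = sum((x - mean_x) ** 2 for x in eto_equation)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(eto_equation, eto_reference))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def apply_calibration(eto, a, b):
    """Apply the linear correction to an original ETo estimate (mm/d)."""
    return a + b * eto


# Hypothetical training pairs (mm/d), for illustration only
eto_hs = [2.0, 3.0, 4.0, 5.0]
eto_fao = [2.6, 3.4, 4.2, 5.0]
a, b = fit_linear_calibration(eto_hs, eto_fao)
print(round(a, 3), round(b, 3))
```

In the local scenario the pairs would come from a single station, and in the regional scenario from the pooled training stations.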
Regarding the machine learning methods, ANN and MARS were developed considering maximum temperature, minimum temperature and extraterrestrial radiation as input variables.
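Extraterrestrial radiation does not need to be measured, since it depends only on latitude and day of the year. The sketch below assumes the FAO-56 daily formulation (solar constant of 0.0820 MJ m⁻² min⁻¹); the function name and the example location are hypothetical.

```python
import math

SOLAR_CONSTANT = 0.0820  # MJ m-2 min-1


def extraterrestrial_radiation(latitude_deg, day_of_year):
    """Daily extraterrestrial radiation Ra (MJ m-2 d-1).

    latitude_deg: latitude in decimal degrees (negative in the Southern
    Hemisphere); day_of_year: 1 to 365 (or 366).
    """
    phi = math.radians(latitude_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0
    inverse_distance = 1.0 + 0.033 * math.cos(angle)    # Earth-Sun distance factor
    declination = 0.409 * math.sin(angle - 1.39)        # solar declination (rad)
    sunset_angle = math.acos(-math.tan(phi) * math.tan(declination))
    return (24.0 * 60.0 / math.pi) * SOLAR_CONSTANT * inverse_distance * (
        sunset_angle * math.sin(phi) * math.sin(declination)
        + math.cos(phi) * math.cos(declination) * math.sin(sunset_angle)
    )


# Illustrative location at 20 degrees South on day 246 (early September)
print(round(extraterrestrial_radiation(-20.0, 246), 1))  # 32.2
```

Because this input is deterministic, the three-variable models still rely only on measured temperature.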
ANN is a supervised machine learning model inspired by the human brain that can be used for classification and regression tasks. It typically consists of layers of neurons, with weights representing the connections between neurons. Further details regarding ANN and its usage for ETo modeling can be found in Kumar et al. (2011).
Feed-forward multilayer perceptron ANNs were used, trained with stochastic gradient descent with a momentum term. The ANN architecture (i.e., number of layers and neurons), momentum term and learning rate were defined by trial and error. The resulting ANNs were composed of an input layer, one hidden layer and an output layer. The input layer was composed of three variables, the hidden layer of ten neurons, and the output layer of one neuron, as shown in Figure 3. The hyperbolic tangent function was used as activation function in the hidden layer and the identity function was used in the output layer. Learning rate and momentum term were set to 0.001 and 0.9, respectively. The number of training epochs was 500 in the local scenario and 400 in the regional scenario. ANN models were implemented using the TensorFlow and Keras libraries for the Python programming language.
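To make the architecture concrete, the sketch below reproduces a 3-10-1 network in plain Python rather than in Keras, so it is not the implementation used in the study. It shows the forward pass (hyperbolic tangent in the hidden layer, identity in the output layer) and one classical momentum update with the reported learning rate and momentum term. The weight initialization range, the function names and the assumption that inputs are scaled beforehand are all illustrative.

```python
import math
import random

N_INPUTS, N_HIDDEN = 3, 10  # Tmax, Tmin, Ra -> 10 hidden neurons -> 1 output


def init_params(seed=0):
    """Random initial weights; the range used here is an arbitrary assumption."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUTS)]
          for _ in range(N_HIDDEN)]
    b1 = [0.0] * N_HIDDEN
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(x, params):
    """Forward pass: tanh hidden layer, identity output layer."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def count_params():
    """Number of trainable weights and biases in the 3-10-1 network."""
    return N_INPUTS * N_HIDDEN + N_HIDDEN + N_HIDDEN + 1


def momentum_step(weight, velocity, grad, learning_rate=0.001, momentum=0.9):
    """One SGD-with-momentum update for a single weight."""
    velocity = momentum * velocity - learning_rate * grad
    return weight + velocity, velocity


params = init_params()
# Hypothetical scaled inputs (Tmax, Tmin, Ra), for illustration only
print(count_params(), forward([0.6, 0.4, 0.7], params))
```

The network has only 51 trainable parameters, which is small relative to the thousands of daily samples available per station.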
Multivariate adaptive regression splines (MARS) is a regression technique initially proposed by Friedman (1991). This technique is able to model nonlinearities and interactions and to automatically select the input variables that are most relevant. A MARS model is composed of basis functions, which are defined over different intervals of the independent variables. Basis functions work according to the following equations:

$$y = \max(0,\; x - c)$$

$$y = \max(0,\; c - x)$$

where: $c$ - constant called knot; $x$ - input variable; $y$ - output variable.
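The pair of basis functions can be written directly in code. The sketch below also combines them into a toy one-dimensional model; the knot and coefficients are invented for illustration and are not taken from the models fitted in the study.

```python
def hinge_right(x, c):
    """Basis function max(0, x - c): active only to the right of the knot c."""
    return max(0.0, x - c)


def hinge_left(x, c):
    """Basis function max(0, c - x): active only to the left of the knot c."""
    return max(0.0, c - x)


def toy_mars(x, intercept=2.0, knot=25.0, coef_right=0.3, coef_left=-0.1):
    """Toy one-dimensional MARS model with a single knot (hypothetical values)."""
    return (intercept
            + coef_right * hinge_right(x, knot)
            + coef_left * hinge_left(x, knot))


for x in (20.0, 25.0, 30.0):
    print(x, toy_mars(x))  # piecewise linear, with a change of slope at x = 25
```

A fitted MARS model is a sum of such terms (and products of them when interactions are allowed), which is why it can be reported as an algebraic expression.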
To build a MARS model, two steps are required: the forward step and the backward step. In the forward step, an over-fitted model is built, with a large number of knots; in the backward step, a pruning technique is used to remove redundant knots (Kisi 2015). More details regarding MARS can be found in Cheng & Cao (2014). As an example, a one-dimensional model is illustrated in Figure 4. MARS models were implemented using the py-earth library for the Python programming language. Hyperparameter tuning was done by grid search with k-fold cross-validation (k=3). The following hyperparameters and their respective values were assessed: penalty (3.0, 5.0, 10.0, 20.0, 30.0), endspan_alpha (0.01, 0.05, 0.1, 0.5), and minspan_alpha (0.01, 0.05, 0.1, 0.5). The order of interaction (max_degree) was limited to four to avoid extremely complex equations.
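The size of this search can be made explicit by enumerating the grid. The sketch below does not call py-earth; it only expands the reported hyperparameter values and builds fold indices. The helper names are hypothetical, and the use of contiguous, unshuffled folds is an assumption, since the text does not say how the folds were formed.

```python
from itertools import product

PARAM_GRID = {
    "penalty": [3.0, 5.0, 10.0, 20.0, 30.0],
    "endspan_alpha": [0.01, 0.05, 0.1, 0.5],
    "minspan_alpha": [0.01, 0.05, 0.1, 0.5],
}
K_FOLDS = 3


def expand_grid(grid):
    """Return every combination of hyperparameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def kfold_indices(n_samples, k):
    """Split sample indices into k contiguous folds of nearly equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


candidates = expand_grid(PARAM_GRID)
n_fits = len(candidates) * K_FOLDS
print(len(candidates), n_fits)  # 80 candidate settings, 240 model fits
```

Each candidate would be fitted on two folds and scored on the third, with the best-scoring setting retained for the final model.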

Performance comparison criteria
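The performance of the models was evaluated for each weather station, with data from the test set, using root mean square error (RMSE), coefficient of determination (R²) and mean bias error (MBE), according to the following equations.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2}$$

$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(P_i - \bar{P}\right)\left(O_i - \bar{O}\right)\right]^2}{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2\,\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}$$

$$MBE = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)$$

where: RMSE - root mean square error, mm d⁻¹; R² - coefficient of determination; MBE - mean bias error, mm d⁻¹; $P_i$ - value predicted by the model, mm d⁻¹; $O_i$ - observed value, mm d⁻¹; $\bar{P}$ - mean of values predicted by the model, mm d⁻¹; $\bar{O}$ - mean of observed values, mm d⁻¹; $n$ - number of data pairs.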
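The three criteria can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between predicted and observed values, which is consistent with the definitions of the two means given above; function names and sample values are hypothetical.

```python
import math


def rmse(predicted, observed):
    """Root mean square error (same unit as the inputs)."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mbe(predicted, observed):
    """Mean bias error: positive values indicate overestimation."""
    n = len(predicted)
    return sum(p - o for p, o in zip(predicted, observed)) / n


def r_squared(predicted, observed):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov ** 2 / (var_p * var_o)


# Hypothetical values (mm/d), for illustration only
pred = [1.0, 2.0, 3.0]
obs = [1.0, 2.0, 4.0]
print(round(rmse(pred, obs), 3), round(mbe(pred, obs), 3),
      round(r_squared(pred, obs), 3))
```

Note that R² defined this way is insensitive to systematic bias, which is why it is reported together with RMSE and MBE.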

Empirical equations
The original PMT and HS equations had a wide performance variation between the weather stations, with RMSE ranging from 0.64 to 1.45 and from 0.70 to 1.29 mm d⁻¹ and MBE ranging from -0.99 to 0.63 and from -0.68 to 0.92 mm d⁻¹, respectively for the PMT and HS equations (Figure 5). For R², a variation from 0.34 to 0.77 was observed for both equations. According to Raziei & Pereira (2013)
Regarding MBE, the PMT and HS equations obtained negative values only at the Espinosa and Monte Azul stations. This is possibly explained by the higher mean wind speed and lower mean relative humidity found at these sites (Table I). According to Allen et al. (1998), wind has a great effect on ETo in dry and hot environments due to the greater removal of water vapour stored in the air. In addition, Gavilán et al. (2006) found that the HS equation underestimated ETo in cases in which wind speed exceeded 1.5 m s⁻¹.
The lowest R² values were observed at the Espinosa, Janaúba and Monte Azul weather stations, which have the highest standard deviations of wind speed: 1.1, 1.1 and 1.2 m s⁻¹, respectively (Table I). This is probably due to the difficulty of the PMT and HS equations in capturing the effect of large wind speed oscillations, which cause ETo fluctuations that these equations do not directly capture. Previous work in a hyperarid region also concluded that wind speed is one of the variables that most affects ETo.
In the cross-station evaluation, an unstable behavior was observed after local calibration, with gains and losses of performance in relation to the original equations. For the PMT and HS equations, the models calibrated at the Januária, Montes Claros, Pedra Azul and Pirapora stations had a relatively good generalization capacity, showing performance improvements outside the calibration stations. These models exhibited RMSE values lower than or close to those obtained for the original equations in most stations; however, they performed worse at the Espinosa and Monte Azul stations. The models calibrated at the other stations showed performance improvements over the original equation only for some stations.
Regarding regional calibrations, the regional HS showed more pronounced performance gains over its original version than the regional PMT over its original version. The regional HS reduced RMSE at all stations except Espinosa and Monte Azul. Mean RMSE over the stations was reduced from 1.02 to 0.96 (6%) and median RMSE from 0.93 to 0.77 (17%). For the PMT equation, although regional calibration reduced RMSE for some stations, it increased RMSE for the Espinosa, Monte Azul and Montes Claros stations. Mean RMSE over the stations increased from 0.95 to 0.97 (2%), whereas median RMSE decreased from 0.83 to 0.80 (4%). Overall, the regional HS and PMT performed similarly.
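The percentages quoted above are relative changes with respect to the RMSE of the original equation. A short check, using a hypothetical helper function, is sketched below.

```python
def percent_reduction(before, after):
    """Relative reduction in percent; negative values indicate an increase."""
    return 100.0 * (before - after) / before


# RMSE values (mm/d) reported in the text: (original, regionally calibrated)
reported = {
    "HS mean": (1.02, 0.96),
    "HS median": (0.93, 0.77),
    "PMT mean": (0.95, 0.97),
    "PMT median": (0.83, 0.80),
}

for label, (before, after) in reported.items():
    print(label, round(percent_reduction(before, after), 1))
```

The negative result for the PMT mean corresponds to the 2% increase mentioned in the text.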

Machine learning methods
The ANN and MARS models obtained similar performances in local and regional scenarios, but the ANN models performed a little better, with slightly lower RMSE values and slightly higher R² values (Figure 6). In the cross-station evaluation, as reported for the empirical equations, there was an unstable behavior. The models with the best results were those developed at the Januária, Montes Claros and Pirapora stations. The models developed at Espinosa, Janaúba and Monte Azul had the worst results outside the training stations. On the other hand, the regional models had a more stable performance, with RMSE values higher than those obtained with the local models, but lower than some of the values observed in the cross-station evaluation.

Overall evaluation
It is important to highlight that the Espinosa, Janaúba and Monte Azul stations had the worst performances for all models studied. This is probably because wind speed oscillates more (greater standard deviations) at these sites (Table I), and none of the models evaluated directly captures these oscillations, since they do not use wind speed as input. In addition, the models calibrated/developed at these stations had the worst generalization capacities, not showing good performances outside the calibration/training stations.
The empirical equations, ANN and MARS models developed with local data outperformed the models developed with regional data when they were evaluated in the same station that they were calibrated/developed (Figure 7). Shiri et al. (2014) also obtained superior performance for models developed with local data. However, despite the higher performance of local models, they are commonly required in places where there are no data available to calibrate/develop them. Thus, a local model should be applied in places with climatic characteristics similar to the place where it was calibrated/developed, which limits its use. If this requirement is not met, the calibrated model can perform even worse than the original one.
In this study, it was observed that, in some cases, a model developed or calibrated at a more distant station can provide better results than a model from a nearer station. For example, all the models developed at Januária station performed better at Montes Claros station than those developed at Juramento station, which is much closer to Montes Claros station. On the other hand, models calibrated/developed on a regional scale can be a more flexible approach, allowing a single model to be used across an entire region and avoiding problems with highly site-specific models. Therefore, regional models can be an interesting approach, especially for places without data for calibration/development of local models. In addition, according to Pereira et al. (2015), machine learning models remain empirical and may not translate well in time and space. Thus, since regional models are developed with a larger amount of data, they can be more stable in time and space than local models.
When comparing the machine learning models and the empirical equations, the former showed better performances in both regional and local scenarios (Figure 7). When comparing the ANN models with the HS equation, mean RMSE reduced from 0.96 to 0.87 (9%) and from 0.84 to 0.73 (13%), in regional and local scenarios, respectively. It should be noted that at the Espinosa, Janaúba and Monte Azul stations, there was a large increase in R² values, mainly for the local models, indicating the higher capacity of machine learning models to capture complex relations between input variables and ETo (Figures 5 and 6). This behavior reaffirms the superiority of machine learning models over traditional equations, reported by Kumar et al. (2011) and Mehdizadeh et al. (2017), among others.
Comparing the performance of the regional ANN and MARS with the PMT and HS equations, it was noted that the machine learning models performed better than the original and regionally calibrated versions of these equations (Figure 7) and, at the Janaúba and Pirapora stations, even better than the locally calibrated equations. Shiri et al. (2014) and Feng et al. (2017) also reported superior performance of machine learning methods developed with regional data in relation to empirical equations. Thus, the regional ANN and MARS are good options to estimate ETo in the study region, outperforming traditional equations. Future studies should focus on the development of regional models with higher performances, trying to get even closer to the performance of local models.
To make the models obtained in this study available for future studies or practical applications, the local and regional calibration parameters of the PMT and HS equations, as well as the regional and local MARS models, are presented in Tables II and III, respectively.

CONCLUSIONS
ANN, MARS, and empirical equations (PMT and HS) in their original and calibrated forms were evaluated for the estimation of daily ETo based on temperature data. Two scenarios were considered: (i) local, in which models were calibrated/developed and evaluated using data from individual weather stations; and (ii) regional, in which models were calibrated/developed using pooled data from several weather stations and evaluated independently in each station (leave-one-out cross-validation). The local models were also evaluated considering a cross-station approach.
The original PMT equation exhibited better performance than the original HS equation; however, after local or regional calibration, the two equations performed similarly.
The local calibration of the PMT and HS equations promoted higher performance gains than those obtained with regional calibration. Similarly, the local ANN and MARS had better performance than their regional versions. However, the regional empirical equations, ANN and MARS models had higher generalization capacity, showing a more stable performance between the stations evaluated.
The machine learning techniques studied had better performance than the PMT and HS equations in their original and calibrated forms in local and regional scenarios. The ANN and MARS models showed similar performances, although the ANN models performed slightly better. On the other hand, MARS has the advantage that it can be used in the form of an algebraic expression.