Air temperature estimation techniques in Minas Gerais state, Brazil, Cwa and Cwb climate regions according to the Köppen-Geiger climate classification system

Air temperature significantly affects the processes involving agricultural and human activities. The knowledge of the temperature of a given location is essential for agricultural planning. It also helps to make decisions regarding human activities. However, it is not always possible to determine this variable. It is necessary to make a precise estimate, using methods that are capable of detecting the existing variations. The aim of this study was to develop models of multiple linear regression (MLR), artificial neural network (ANN), and random forest (RF) to estimate the mean (Tmean), maximum (Tmax), and minimum (Tmin) monthly air temperatures as a function of geographic coordinates and altitude for different localities in Minas Gerais state, Brazil, with climatic classification Cwa or Cwb. The average monthly data (Tmean, Tmax, and Tmin), over a period of 30 years, were collected from 20 climatological stations. The MLR was able to estimate the Tmax with accuracy. However, the predictive capacity of estimating Tmean and Tmin was low. The algorithms RF and ANN were used to estimate Tmean, Tmax, and Tmin with high accuracy. The best results were obtained using the RF model.


INTRODUCTION
It is important to monitor the meteorological elements to achieve proper growth and yield of crops. Efficient monitoring can help in evapotranspiration estimates, irrigation planning, pest and disease risk zoning, animal comfort index mapping, etc. One of the most important meteorological elements is air temperature, which influences plant physiology. Changes in air temperature can lead to change in the growth and development of plants (Benlloch-González et al., 2016;Cardoso et al., 2012;Wahid et al., 2007). The air temperature influences various physiological processes occurring in a plant, such as the speed of chemical reactions (Benavides et al., 2007) that occur in the temperature range of 0 -40 °C. The extent of influence exerted depends on the plant species. When the air temperature exceeds the ideal range for each species, morphological, physiological, and biochemical changes may be induced, leading to adverse effects on plant growth (Wahid et al., 2007). Studies on the characterization of air temperature, precipitation, and the climatic classification of the regions where agriculture predominates should be conducted to improve crop yields (Cardoso et al., 2015;Costa et al., 2012).
Coffee is one of the main crops grown in Minas Gerais state, Brazil (Companhia Brasileira de Abastecimento - CONAB, 2020). The optimum mean annual temperature for the growth of the C. arabica species falls in the range of 18-23 °C, and that for C. canephora falls in the range of 22-26 °C (Damatta et al., 2018). Temperatures outside these ranges affect the growth and yield of the crop. When the temperature is extremely low, the activity of the coffee plant is reduced, the photosynthetic performance is noticeably affected, and the net photosynthetic activity ceases almost completely (Batista-Santos et al., 2011;Partelli et al., 2009). On the other hand, very high temperatures may decrease the net photosynthetic rates of the leaves (Cannell, 1985). Temperatures within the ideal interval favor high crop yields over the years, whereas temperatures outside the optimal range reduce crop yield. Therefore, it is important to determine the mean air temperature and the extreme temperatures (maximum and minimum). Furthermore, considering the relief and location of Minas Gerais state, the accurate estimation of extreme temperatures is important because the topographic conditions of the state allow frosts to form on an annual basis in the southern region, whereas maximum temperatures of 40-42 °C are recorded in the northern regions.
The mean, maximum, and minimum air temperatures can be monitored on a daily basis in weather stations. However, in the Minas Gerais region, the coverage of the official network of surface weather stations is limited. Besides, interruptions and errors in the database generated by these stations are quite common. The errors can be attributed to reading errors, damaged devices, and other unintended observational problems (Dumedah;Coulibaly, 2011;Mwale;Adeloye;Rustum, 2012). These factors limit climatic studies, e.g., studies on the climatic characterization of the region and on meteorological elements, which in turn slows down the development of agriculture.
Considering the fact that the average monthly air temperature varies with geographic coordinates and altitude, several researchers working in different regions of Brazil have been trying to develop techniques and models for estimating the air temperature. The multiple linear regression (MLR) model considers the latitude, longitude, and altitude of the location as independent variables (Alvares et al., 2013;Maluf;Matzenauer, 2008;Pezzopane et al., 2004;Sediyama;Melo Júnior, 1998). These estimates have been made with different levels of precision and accuracy. However, the development of new tools such as the Artificial Neural Network and Random Forests technique can maximize the performance, precision, and accuracy of estimating the air temperature.
The new techniques have been developed with the aim of achieving higher accuracy in the estimation of variables. The Artificial Neural Network (ANN) is a promising and effective tool for modeling non-linear relationships and complex time series. It has been used in different fields of science such as medicine (Muhammad et al., 2019), hydrology (Asadi et al., 2019), and agriculture (De Oliveira Aparecido et al., 2020). The ANN is a mathematical model whose architecture is analogous to brain functioning. The interconnected processing elements are arranged in several layers (Kumar;Raghuwanshi;Singh, 2011). The ANN method helps to capture and generalize the relationships within complex datasets, which expands the scope of application of the method (Dandy;Maier, 2014).
ANNs have been used for the estimation of meteorological variables with good accuracy. Reference evapotranspiration (Antonopoulos;Antonopoulos, 2017;Kumar;Raghuwanshi;Singh, 2011), solar radiation (Bou-Rabee et al., 2017), and air temperature (Moreira;Cecílio, 2016) have been estimated using this technique. However, reports on the use of ANNs to estimate the temperature in the region under study are scarce. It is therefore important to verify the applicability and efficiency of the ANN method for estimating the mean, maximum, and minimum air temperature in this region.
The Random Forest (RF) is a non-parametric statistical data modeling method (Breiman, 2001). It has been used to analyze data in different fields of science, such as medicine (Xie et al., 2020), biology (Fabris et al., 2018), and geoprocessing (Vogels et al., 2017). According to James et al. (2013), decision trees detect non-linear relationships in the evaluated system, where the use of linear approaches, e.g., linear regression analysis, is restricted. According to Seyedhosseini and Tasdizen (2015), RF is a classification and regression technique that grows an ensemble of decision trees such that the correlation between the trees remains as low as possible. This condition can be achieved by bootstrap sampling. In this method, resamples are drawn with replacement from a single random sample, which must be representative of the original population. Data from previously conducted analytical experiments are required to enhance the predictive and generalization abilities (Hesterberg et al., 2002).
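The bootstrap step described above can be illustrated with a minimal sketch. It assumes a hypothetical list of 20 station labels and a fixed random seed. It is not the WEKA implementation, only an illustration of sampling with replacement and of the "out-of-bag" rows that a given resample leaves out.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement.

    Returns (in_bag, out_of_bag): the sampled indices (repeats allowed)
    and the indices that were never drawn.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


# Hypothetical labels standing in for the 20 climatological stations.
stations = [f"station_{i:02d}" for i in range(20)]
rng = random.Random(42)  # fixed seed so the sketch is repeatable

in_bag, out_of_bag = bootstrap_sample(len(stations), rng)
print("rows used to grow one tree:", sorted(in_bag))
print("rows left out of this tree:", out_of_bag)
```

Each tree of the ensemble would be grown on a different resample of this kind, which is one of the ways the correlation between trees is kept low.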
RF has also been adopted to predict meteorological variables such as solar radiation (Benali et al., 2019) and air temperature (Noi;Degener;Kappas, 2017). RF has been found to be a more efficient predicting tool compared to other tools like ANN (Benali et al., 2019;Zhou et al., 2016). The RF is still little applied, and the interest in this predictive tool is increasing as it exhibits a good practical performance (Scornet, 2016). Therefore, it is important to evaluate the RF potential for estimating air temperature and to compare it with different methods.
The objective of this study was to develop and compare the performances of multiple linear regression (MLR), Artificial Neural Networks (ANN), and Random Forests (RF) models for estimating the mean, maximum, and minimum monthly air temperatures using input variables such as geographical coordinates and altitude for different areas in the Minas Gerais State with climatic classification Cwa or Cwb (Köppen;Geiger, 1928).

Study area and data sources
The present study was developed for municipalities in the Minas Gerais state that are within the regions classified as Cwa (humid temperate climate with dry winter and hot summer) and Cwb (humid temperate climate with dry winter and moderately hot summer). This classification was proposed by Köppen and Geiger (1928) (Figure 1). This Climatic Classification System (CCS) was developed by Köppen in 1918, and its most popular version was published in 1928 in collaboration with Rudolf Oskar Robert Williams Geiger. The Köppen and Geiger (1928) CCS is a simple and comprehensive system, and hence it is widely used. The mean annual rainfall recorded in the region under study is 1379 mm (Brasil, 1992). The study was limited to the areas classified as Cwa and Cwb to maximize the efficiency of the models tested, because models are expected to be more accurate when applied to regions with similar climatic characteristics.
According to De Sá Júnior et al. (2012), the regions classified as Cwa and Cwb represent 21% and 11% of the area of the Minas Gerais state, respectively. There are 20 climatological stations located in the region under study that belong to the national network of climatological stations of the National Institute of Meteorology (INMET). Their geographical coordinates and climatic classification are presented in Table 1. The average monthly data (mean (Tmean), maximum (Tmax), and minimum (Tmin) air temperature) of each conventional station over a period of 30 years, from 1987 to 2017, were used in the study. The data were extracted from the Meteorological Database for Teaching and Research (BDMEP) of INMET. Although some locations do not have a record of 30 years of data (Table 1), all stations presented more than 90% of consistent data.

Multiple linear regression (MLR) method
Based on the independent variables (geographic coordinates and altitude), MLR models were developed to estimate the mean, maximum, and minimum temperature of each month of the year for each location. The temperatures were calculated as follows (Equation 1):
$$Y_i = \beta_0 + \beta_1\,\mathrm{ALT} + \beta_2\,\mathrm{LAT} + \beta_3\,\mathrm{LON} \qquad (1)$$
where Yi is Tmean, Tmax, or Tmin in °C and is the dependent variable. ALT represents the altitude in m, LAT represents the latitude in degrees, and LON represents the longitude in degrees, which are the independent variables. β0, β1, β2, and β3 are the regression coefficients. MLR was implemented using the data analysis tool in Microsoft Excel ®. Contrary to the methodology applied for ANN and RF, the month was not used as an input variable. Therefore, the data for Tmean, Tmax, and Tmin were grouped by month, and an MLR was then adjusted for each group. Each month thus had its own equation and its own statistical results. The methodology reported by Sediyama and Melo Júnior (1998) was used. This methodology increases the predictive capacity of MLR and facilitates the analysis of the influence of each independent variable in each month.

Development of the Artificial Neural Network (ANN) model
The input data consisted of the month, latitude, longitude, and altitude of each evaluated location, so that each ANN setting estimated the Tmean, Tmax, or Tmin for all the months. These variables were chosen for the following reasons. The month is the temporal component, which is required to execute the projections. The latitude and longitude are the variables related to position: the temperature gradually increases from the poles to the equator. The altitude is regarded as the surface component: the higher the altitude, the lower the temperature.
The ANN follows a mathematical structure of connected processing nodes (neurons), in which the output of a neuron is the input of the neurons it is subsequently connected to. The final model is built based on various assumptions on the activation function (Equations 2, 3, 4, 5, 6, 7 and 8). Equations 2-5 represent the mathematical abstraction of the ANN shown in Figure 2, giving the equations of the neurons. Equations 6-8 are the estimate vectors of each output. $W_{i,j}$ represents the weights estimated using the backpropagation algorithm during ANN processing, and $B_{i,j}$ represents the bias associated with each neuron. The activation function applied was sigmoidal with non-linear output.
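To make the neuron structure concrete, the following is a minimal sketch of a forward pass through a network with four inputs (month, latitude, longitude, altitude), one hidden layer of sigmoidal neurons, and one output. The weights, biases, and scaled input values are hypothetical placeholders, not the fitted values of this study. The output unit is kept linear here for simplicity, which is an assumption of the sketch rather than a description of the configuration used.

```python
import math
import random


def sigmoid(z):
    """Logistic (sigmoidal) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Return (estimate, hidden activations) for one input vector x."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Assumed linear output unit: weighted sum of hidden activations plus bias.
    estimate = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return estimate, hidden


rng = random.Random(0)
n_inputs, n_hidden = 4, 6  # six hidden neurons, one of the sizes discussed in the text

# Hypothetical (untrained) weights and biases; backpropagation would adjust these.
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = rng.uniform(-1, 1)

# Hypothetical inputs (month, latitude, longitude, altitude) rescaled to [0, 1].
x = [0.5, 0.3, 0.6, 0.8]
estimate, hidden = forward(x, w_hidden, b_hidden, w_out, b_out)
print("hidden activations:", [round(h, 3) for h in hidden])
print("untrained network output:", round(estimate, 3))
```

With untrained weights the output has no physical meaning. The point is only that each hidden neuron applies the sigmoid to a weighted sum of the inputs plus a bias, and that the output combines the hidden activations.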
All adjustments were cross-assessed. Twenty folds of the sample set were used for the assessment for training to compensate for the reduced number of instances. Two different configurations were evaluated (Table 2). Results from the preliminary tests indicated that changes in the number of training epochs and the number of neurons present in the hidden layer interfered with the performance of the models. However, changes in the other parameters did not significantly influence the model performances.
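The 20-fold assessment can be sketched as below. The sketch assumes 240 instances (20 stations times 12 months), a count inferred from the description rather than stated in the text, and it uses a simple shuffled partition. WEKA's own fold assignment may differ.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists for k folds of roughly equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


n_instances = 20 * 12  # assumed: 20 stations x 12 monthly values
splits = list(k_fold_indices(n_instances, k=20))

for fold_number, (train, test) in enumerate(splits[:3], start=1):
    print(f"fold {fold_number}: {len(train)} training rows, {len(test)} test rows")
```

Every instance is used for testing exactly once and for training in the remaining folds, which is why a large number of folds helps when the number of instances is small.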

Development of the Random Forest (RF) model
The implementation of RF in WEKA is based on Breiman (2001). Two configurations of RF were used, with the input variables being the month, latitude, longitude, and altitude of each evaluated location. Thus, each RF setting could be used to estimate Tmean, Tmax, or Tmin for all the months under study. The steps followed are presented in Figure 3.
In this study, preliminary tests were conducted for several configurations. The configurations with 100 and 500 iterations exhibited better performance than the other values evaluated in the preliminary analysis. The preliminary tests also revealed that changes in the other parameters did not improve the model performance. Two distinct configurations were therefore selected (Table 3).

Statistical tests
Various statistical indices were used to assess the predictive quality of each technique in terms of variation, precision, accuracy, and performance. The mean absolute error (MAE) and root mean square error (RMSE) indicate how close the predicted values were to the observed values, and thus the accuracy of each model. The variation was quantified by the determination coefficient (R²), which represents the percentage of the variation of the dependent variable explained by the independent variables. The best model should produce an R² value close to unity. The precision of the models was quantified based on Pearson's correlation coefficient (r), which indicates the degree of dispersion of the data obtained in terms of the mean. Accuracy was quantified using Willmott's index of agreement (d), and the overall performance of each model was quantified using the performance index (c) (Camargo;Sentelhas, 1997), calculated as c = r · d. The performances were classified as: Excellent (1 -
WEKA provides a tool, the WEKA Experiment Environment (Figure 4), for comparing different configurations and different algorithms. This tool was used to compare the performance of each algorithm and configuration used in the present study by means of cross-validation. According to Noi, Degener, and Kappas (2017), cross-validation is one of the most popular validation methods used to compare different combinations and different algorithms. In the cross-validation method, the dataset is divided into k groups (k-fold) of approximately the same size. Due to the number of observations, a 20-fold cross-validation method was used. The algorithms were applied to each fold, generating statistical performance values. The average performance values were then compared by Tukey's test at 5% probability. The statistical software Sisvar (Ferreira, 2019) was used for the analysis.
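The indices above can be computed as in the following sketch. Two assumptions are made: R² is taken as the square of Pearson's r, and Willmott's d follows the usual form, one minus the ratio of the squared errors to the "potential error" around the observed mean. The observed and predicted temperatures are made-up values for illustration only.

```python
import math


def performance_indices(observed, predicted):
    """Return MAE, RMSE, r, R2, Willmott's d and c = r * d as a dict."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    ss_o = sum((o - o_mean) ** 2 for o in observed)
    ss_p = sum((p - p_mean) ** 2 for p in predicted)
    r = cov / math.sqrt(ss_o * ss_p)

    # Willmott's index of agreement (assumed standard form).
    potential = sum(
        (abs(p - o_mean) + abs(o - o_mean)) ** 2
        for o, p in zip(observed, predicted)
    )
    d = 1.0 - sum(e * e for e in errors) / potential

    return {"MAE": mae, "RMSE": rmse, "r": r, "R2": r * r, "d": d, "c": r * d}


# Made-up monthly temperatures in degrees Celsius, for illustration only.
observed = [18.0, 20.0, 22.0, 24.0]
predicted = [19.0, 20.0, 21.0, 25.0]
indices = performance_indices(observed, predicted)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Because c multiplies precision (r) by accuracy (d), a model must score well on both to reach a high performance class.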
The MLR method was not implemented in WEKA, and its approach was different from that used for the ANN and RF methods. Hence, it was not possible to compare the MLR method with the other techniques using Tukey's test. The comparison between MLR and the other techniques was made by comparing the statistical performance indicators.
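Since the MLR models were fitted outside WEKA, one equation per month, a minimal sketch of that workflow may help. It fits Equation 1 by ordinary least squares through the normal equations, using only the standard library. The station coordinates and the "true" coefficients are synthetic and arbitrary, chosen only so that the fit can be checked. They are not values from this study, and the study itself used the data analysis tool in Microsoft Excel.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows):
    """rows: (alt, lat, lon, temp) tuples. Returns [b0, b1, b2, b3]."""
    X = [[1.0, alt, lat, lon] for alt, lat, lon, _ in rows]
    y = [row[3] for row in rows]
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(4)] for i in range(4)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(4)]
    return solve(xtx, xty)


def fit_by_month(records):
    """records: (month, alt, lat, lon, temp). One equation per month."""
    by_month = {}
    for month, alt, lat, lon, temp in records:
        by_month.setdefault(month, []).append((alt, lat, lon, temp))
    return {month: fit_mlr(rows) for month, rows in by_month.items()}


# Synthetic stations: altitude (m), latitude and longitude (decimal degrees).
rng = random.Random(1)
stations = [
    (rng.uniform(500, 1300), rng.uniform(-22, -16), rng.uniform(-47, -42))
    for _ in range(8)
]
# Arbitrary coefficients (b0, b1, b2, b3) used to generate noise-free data.
true_coefs = {1: (30.0, -0.006, 0.2, -0.1), 7: (25.0, -0.005, 0.4, -0.1)}
records = [
    (month, alt, lat, lon, b0 + b1 * alt + b2 * lat + b3 * lon)
    for month, (b0, b1, b2, b3) in true_coefs.items()
    for alt, lat, lon in stations
]

fitted = fit_by_month(records)
for month, coefs in sorted(fitted.items()):
    print(month, [round(c, 4) for c in coefs])
```

Fitting each month separately is what allows the coefficients, and their significance, to differ from one month to the next.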

RESULTS AND DISCUSSION
The MLR coefficients were adjusted to estimate the Tmean, Tmax, and Tmin monthly air temperatures. The respective mean absolute error (MAE), root mean square error (RMSE), determination coefficient (R²), Pearson's correlation coefficient (r), Willmott's index of agreement (d), and performance index (c) are shown in Table 4.
The models used to estimate Tmean (Table 4) had R² values in the range of 0.38-0.93 and r values ranging from 0.62 to 0.97. The models for the months of July and August exhibited a "Bad" and "Poor" performance (Camargo;Sentelhas, 1997), respectively. For these months, these models are not recommended to estimate the Tmean values. The model performances were "Good" for the other months. The linear coefficients of altitude (β1) and latitude (β2) were significant. A negative correlation was observed between altitude and Tmean and between latitude and Tmean, that is, Tmean decreased with increasing altitude and latitude. These results were expected and are in accordance with the vertical thermal gradient in the troposphere. Cargnelutti Filho, Maluf and Matzenauer (2008) and Gomes et al. (2014) reported a negative correlation between altitude and Tmean (Rio Grande do Sul state and Rio de Janeiro state, respectively). However, there was no significant influence of latitude.
During the estimation of Tmax, RMSE was found to be in the range of 0.51-0.74. The R² values ranged between 0.63 and 0.86, and the r values ranged between 0.80 and 0.93 (Table 4). The model for February exhibited the lowest statistical indicators, yet its performance was still "Good" (Camargo;Sentelhas, 1997). The linear coefficient of altitude (β1) was significant in all models. The linear coefficient of longitude (β3) had no significant influence in the months of January, February, and March. In the other months, a significant influence of β1, β2, and β3 was observed. Gomes et al. (2014) analyzed models to estimate the maximum monthly air temperature of Rio de Janeiro. R² values were found to be in the range of 0.51-0.71. A significant influence of the altitude and latitude was observed. However, the linear coefficient of longitude did not significantly affect the data of most months.
This difference can be explained by the small longitudinal difference between the meteorological stations in Rio de Janeiro state compared to the region evaluated in this study. The meteorological stations under consideration are at a sufficient longitudinal distance to be influenced by the continentality effect.
While estimating Tmin, it was observed that the r values ranged between 0.32 and 0.93. The R² values ranged between 0.10 and 0.86, and the RMSE values ranged between 0.36 -1.92 (Table 4). The models used for estimating the Tmin values for the months between February and October exhibited a "Poor", "Bad", or "Terrible" performance index (Camargo;Sentelhas, 1997), reflecting the low precision and degree of accuracy. Furthermore, significant β1, β2, and β3 values were not recorded when these models were used to study the data corresponding to the abovementioned months.
The Tmin corresponding to these months varied due to other factors, such as wind, ocean currents, local topographic conditions, rain, cloudiness, and the passage of cold fronts (Aguado;Burt, 2010). According to Silveira et al. (2019), in addition to the static factors (vegetation, maritime influence, continentality, geographic coordinates, etc.), climatic conditions are influenced by dynamic atmospheric systems such as cold fronts. After the passage of a cold front, under conditions of clear skies and low atmospheric humidity, the heat loss by radiation during the night is very high. This results in a drop in temperature, mainly during winter, autumn, and spring. In some cases, this facilitates the occurrence of radiative frosts (Escobar, 2007). Therefore, the Tmin values could not be estimated with high precision using these models. In the other months (November, December, and January), the models performed well, and a significant influence of altitude was observed. Medeiros et al. (2005)
The ANN and RF statistical performance indicators for estimating Tmean, Tmax, and Tmin in the regions classified as Cwa and Cwb (Minas Gerais state) are shown in Table 5. Contrary to the MLR model, which used separate equations for each month, the architectures chosen for the ANN and RF models could be used to estimate the Tmean, Tmax, and Tmin of all months together. Thus, to estimate the Tmean, Tmax, or Tmin of a given location, the latitude, longitude, altitude, and month were used as the input data. Accordingly, the statistics for each configuration (Table 5) refer to all the months of the year and, unlike for the MLR model, are not broken down by month. Lower RMSE and MAE values were observed with the RF technique than with the ANN technique, and the difference between the results of the two techniques was significant.
There was no significant difference between the different configurations tested within each technique. The RMSE and MAE were higher when estimating Tmin than when estimating Tmax and Tmean, suggesting more variation within the Tmin estimates. The r values obtained using the RF method were higher than those obtained using the ANN method for Tmean, Tmax, and Tmin. The values of the coefficient r did not differ significantly between the two techniques (or between the different configurations of each technique) for Tmean and Tmin. Nevertheless, a significant difference between the two techniques was observed for Tmax. The other indices indicate that the RF model was superior to the ANN model. However, both techniques could be used to estimate the Tmean, Tmax, and Tmin values with very high accuracy (Table 5). The fit quality of both models is confirmed by the high values of the performance index (c). These values were "Excellent" according to the evaluation criteria proposed by Camargo and Sentelhas (1997).
There was no significant difference between the RF configurations. However, enabling the WEKA option "Break ties randomly when several attributes look equally good" increased the predictive capacity of the model. That option gets triggered when the output reaches a local optimum. When this condition becomes true, the algorithm initializes a random process to escape from the local optimum and reach the best solutions. This procedure has been explained in detail by Breiman (2001). Previous studies suggest the execution of 100 iterations; however, 500 iterations were required to improve the RF performance. Were et al. (2015) reported more stable results using a higher number of iterations.
The changes made to the ANN parameters did not significantly influence the Tmean, Tmax, and Tmin values. However, increasing the number of training epochs from 500 to 1000 improved the Tmean and Tmin predictive capacity of ANN. The number of training epochs is a hyperparameter that defines the number of times the learning algorithm works through the entire training dataset. The best results were obtained when six neurons were integrated into one hidden layer during the estimation of Tmean and Tmin. However, the best result was obtained when five neurons were integrated into the hidden layer during the estimation of Tmax. The choice of the size of the hidden layer is very important because underestimated numbers of neurons can lead to poor approximation and generalization capabilities, while the use of excessive neurons can potentially result in overfitting. This can eventually make the search for the global optimum more difficult (Lee;Lam, 1995).
Ciência e Agrotecnologia, 45:e023920, 2021
Although the MLR model could be used to estimate the Tmean, Tmax, and Tmin for some months of the year, in general, the RF and ANN models exhibited superior predictive abilities (for all the analyzed statistical indices) than the MLR models. The RF model was found to be superior to the ANN model. Moreover, the low MLR predictive capacity (Tmin estimation) can cause problems for producers who need this information because the regions categorized as Cwa and Cwb are more suitable for the development of agricultural activities that require lower temperatures and average temperatures during the winter (below 20 °C; De Sá Júnior et al., 2012). Therefore, RF and ANN methods are more suitable for this region.
Several literature reports (reporting various applications) have indicated the superiority of the RF model in the regression estimation (Benali et al., 2019;Noi;Degener;Kappas, 2017;Rodríguez-Lado et al., 2015). The superiority of the RF model can be attributed to the advantages of the method, which include not making distributive assumptions about the predictors.
The importance of each variable can be determined using this model, and the method is less sensitive to noise or overfitting (Armitage;Ober, 2010;Ismail;Mutanga, 2010). Even though RF is superior to ANN, the ANN method can be used to determine the Tmean, Tmax, and Tmin values with high accuracy. This has also been reported by Hasni et al. (2012). They concluded that the ANN technique could be reliably used for determining the temperatures.
The plot shown in Figure 5 indicates the importance of each input attribute for the response variable of the evaluated algorithms. The month made the most important contribution to the estimation of Tmean, followed by the altitude (for all the evaluated models). In the estimate of Tmax by RF1 and RF2, the altitude exerted the maximum effect. However, when the ANN1 and ANN2 methods were used, the month was found to exert the maximum effect on the results, followed by the altitude. The trend was similar to that observed with the MLR method, for which a significant influence of the altitude was observed in all months. The month attribute had the largest contribution to the Tmin estimate. These results can potentially explain the low capacity of the MLR model to estimate Tmin, as the month is not considered a variable in this model.
The results revealed that, for locations where it is difficult to collect data from weather stations (due to lack of infrastructure, reading errors, or use of damaged devices), the use of RF and ANN models is recommended for estimating the Tmean, Tmax, and Tmin values. In addition, researchers and producers can use such methods to create a risk zoning of pests and diseases, develop works related to plant growth, and develop crop varieties based on the temperature of the region.
An estimation of the Tmin values can help to anticipate frost events and mitigate their damage in the locations under study, because the region under analysis is susceptible to the occurrence of this phenomenon. According to Pimenta, Angélico and Chalfoun (2018), adverse weather conditions (such as the formation of frost) can harm the production of the coffee fruit, affecting productivity and thereby changing the market value of the product. It is important to develop an efficient technique to determine the Tmax and Tmin values in order to develop a more accurate agricultural zoning of climatic risk. This can assist producers in the choice of sowing time and in harvest planning, so that the impacts of extreme weather conditions, especially in less developed regions, can be reduced. However, no statistical method can produce results that are exactly the same as the observed and/or recorded data. Hence, it is important that the weather stations function continuously (Alves et al., 2020). Furthermore, computational knowledge is required to implement the RF and ANN models. Therefore, mobile applications are needed to facilitate the use of these techniques. Further studies in the area are needed, and the results of the present study may support future forecasts.

CONCLUSIONS
The results of this study can help farmers, researchers, technicians, and local government officials in urban planning. Urbanization is characterized by surface alterations. Vegetated areas are replaced with impervious surfaces and buildings. This surface change alters the energy balance, increasing absorption and heat transfer between the earth's surface and the lower atmosphere, resulting in increased surface air temperatures (Song;Wu, 2016). Accelerated urban growth has been observed in the region under study. An effective tool for estimating the air temperature can assist in the application of new technologies that can potentially reduce the surface heating process. The RF model exhibited a greater predictive performance than the ANN and MLR models for estimating the Tmean, Tmax, and Tmin values. The RF model explains at least 94% of the variability of the estimated variables, i.e., at most 6% of the variability of the response variable could not be explained by the model. The RF is the most suitable technique for estimating the air temperature, and the input attributes were sufficient for the estimation. Therefore, this model is recommended for conducting studies in this specific region.