Machine-learning methods for hydrological imputation data: analysis of the goodness of fit of the model in hydrographic systems of the Pacific - Ecuador

Computational methods based on machine learning have been extensively developed and applied in hydrology, especially for modelling systems that lack sufficient data. A related problem is that many data series contain gaps; such series need not be discarded, because the missing values can be imputed to obtain complete data sets. This research therefore compares machine-learning techniques to identify those best suited to the hydrographic systems of the Pacific slope of Ecuador. The study used the hydro-meteorological records of monitoring stations located in the watersheds of the Esmeraldas, Cañar and Jubones Rivers over a 22-year period, from 1990 to 2012. The imputed variables were precipitation and flow. The learning algorithms of the Python scikit-learn module were used; this module integrates a wide range of machine-learning algorithms, such as Linear Regression and Random Forest. The results show that Random Forest achieved the lowest mean square error, making it the machine-learning imputation method that best fits the systems and data analyzed.


INTRODUCTION
In recent years, methods based on machine learning have advanced considerably and have been applied in several areas of science and technology. In hydrology, they have been widely applied to develop models of basin behavior, especially for basins that lack enough information to apply physical models. A further problem is that much of the available information is incomplete, and series with gaps are often treated as unusable, which further reduces the data available for hydrological modelling.
Hydrologists and water managers have used observed relationships between rainfall and runoff to predict streamflow ever since the creation of the rational method in the 19th century (Beven, 2012). A properly designed monitoring network with optimal data makes it possible to characterize this relationship and to apply it in water-related studies. However, streamflow and rainfall records suffer from missing observations, mostly resulting from unexpected causes including the loss of records, sensor problems, or disruption of data collection (Ng et al., 2009).
In the study area, one of the problems is that monitoring systems are neither sufficient nor adequate, and those that exist show a large amount of missing data. This makes modelling these watersheds complicated and inaccurate, even though such studies generate knowledge of the area that supports its subsequent use for different water-related activities. Missing data can lead to an incorrect response of hydrological models, but it is illogical to ignore abnormal or missing values when limited data are available; substantial uncertainty in hydrologic and water quality modelling can be driven by these missing records. Several statistical methods based on linear regression, already validated in other investigations, exist to address the problem of missing observations. These depend on the amount of data in the series and on the data from, and relationships with, neighbouring weather stations (Mwale et al., 2012; Rees, 2009). For these reasons, authors such as Adeloye (2009) indicate that regression methods might only be applicable when all predictors exist.
Artificial neural networks (ANNs), regression trees, and support vector machines have been shown to be powerful tools for predictive modelling and exploratory data analysis, particularly in areas that do not meet the conditions for using traditional statistical methods (Shortridge et al., 2016). These methods have mathematical formulations with a high computational cost, but they are very effective when non-linear relationships prevent the use of traditional statistical methods (Dawson et al., 2010). These strengths make them very useful, especially in countries with poor monitoring traditions, where information gaps in climatological and hydrologic time series are ubiquitous (Campozano et al., 2014).

MATERIALS AND METHODS
The analysis stations are located in three main river basins of the coastal zone of Ecuador: the Cañar River, the Jubones River and the Esmeraldas River. These basins have around 318 weather stations and 106 hydrological stations. The three systems were chosen because they have the largest monitoring network in the country and represent a significant area of analysis (see Figure 1). The sample of stations to be analyzed was obtained by simple random sampling. One meteorological station and one hydrological station were selected per basin, together with two nearby reference stations for each meteorological station and one reference station for each hydrological station. These nearby reference stations were used as predictors in the data analysis.

Data Imputation Methods
For the development of the imputation model using machine learning, we work with a pattern search to optimize parameters, followed by cross-validation over the analysis periods of the research (Carpenter, 1999). This imputation method is based on supervised learning models; that is, the machine is presented with the response information at the same time as the input information, and it learns to arrive at the answer through an iterative process. In operation, a learning machine receives an input data vector and passes it to the network, where the complexity of the training is determined, and it produces an output data vector as the result of the model (Guo et al., 2015).
The data are divided into training, test, and validation sets; this division is established through cross-validation, which distributes the data between the test and validation sets so that the model does not over-fit the training data and perform poorly on the validation data (see Figure 2). The supervised learning machine then processes this information and establishes a linear model based on the least-squares method. The multiple iterations that the learning machine performs with the training and test data allow us to identify the best linear model, which is then used to impute the hydrometeorological data. The Tansig function is used as the transfer function since it gives efficient results in hydrological studies (Akhter, 2017). The network output is given by Equation 1:
$$ y = f\left(\sum_{i=1}^{N} w_i x_i + b\right) \tag{1} $$
where x_i is the input to the network, y is the output of the network, N is the number of neurons in the input vector, w_i is the connection weight between input and output, f is the transfer function, and b is the bias term.
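As an illustration of the neuron relation and its symbols (x_i, w_i, b, f), the following minimal sketch evaluates one neuron with a tansig transfer function. It assumes the usual weighted-sum form y = f(sum of w_i x_i, plus b) and the common definition tansig(n) = 2 / (1 + exp(-2n)) - 1, which is mathematically the hyperbolic tangent. The input values, weights and bias are hypothetical and are not taken from the study.

```python
import math


def tansig(n):
    """Hyperbolic tangent sigmoid: 2 / (1 + exp(-2n)) - 1, bounded in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def neuron_output(x, w, b, f=tansig):
    """Return y = f(sum_i w_i * x_i + b) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    net = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return f(net)


# Hypothetical scaled inputs from two reference stations (illustrative only)
x = [0.5, -0.2]
w = [0.8, 0.4]
b = 0.1
y = neuron_output(x, w, b)  # net input is 0.8*0.5 + 0.4*(-0.2) + 0.1 = 0.42
```

Because the tansig output is bounded between -1 and 1, inputs and targets are normally rescaled to a comparable range before training; that scaling step is not shown here.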
To adjust the weight of each connection, the neural networks use a back-propagation algorithm, in which the error between the output data and the observed data is analyzed. It is a type of supervised learning based on the generalization of the delta rule (Veintimilla-Reyes and Cisneros, 2015; Hsu et al., 1995; Bisoyi et al., 2019). This algorithm updates the weights by moving along the negative gradient of the error function, which is the direction of steepest decrease. The advantages of this algorithm are its ability to adjust the learning rate through the learning-rate parameter and the reduced oscillation obtained with the momentum constant. The process is repeated until the error is minimized. This method is widely used in hydrological studies (Dawson and Wilby, 2001).
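The weight update described here can be sketched for the simplest case, a single linear neuron trained with the delta rule, which is the rule that back-propagation generalizes to multi-layer networks. This is not the network used in the study: the data are synthetic, and the function name, learning rate, momentum constant and number of epochs are assumptions chosen for illustration.

```python
import numpy as np


def train_delta_rule(X, t, lr=0.05, momentum=0.9, epochs=500):
    """Fit y = X @ w + b by gradient descent with momentum on the mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    vw = np.zeros(n_features)  # previous weight update (velocity)
    vb = 0.0                   # previous bias update
    history = []
    for _ in range(epochs):
        err = X @ w + b - t                      # output minus observed
        history.append(float(np.mean(err ** 2)))
        grad_w = 2.0 * X.T @ err / n_samples     # gradient of the MSE w.r.t. w
        grad_b = 2.0 * float(np.mean(err))       # gradient of the MSE w.r.t. b
        # Step against the gradient; the momentum term reuses the previous step
        vw = momentum * vw - lr * grad_w
        vb = momentum * vb - lr * grad_b
        w = w + vw
        b = b + vb
    return w, b, history


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.3   # noise-free synthetic target
w, b, history = train_delta_rule(X, t)
```

The `history` list holds the error at each epoch, so it can be inspected to see how the learning rate and momentum constant affect convergence and oscillation.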
In this research, we analyze and compare two methods. The first method is autonomous learning based on linear regression, which integrates statistical models for relating responses to linear combinations of predictor variables (Ahmad et al., 2010; Srivastava et al., 2013). The second method is the random forest algorithm, which is widely used for the study of water resources. Applications in this category include streamflow modeling using data-driven rainfall-runoff models, while streamflow imputation of missing values is also generating increased interest (Tyralis et al., 2019).
Random Forest is a supervised machine-learning algorithm based on a stochastic model that relates a result to explanatory variables or characteristics. Each decision within a tree can be viewed as a set of conditions, organized hierarchically and applied successively to the data set. For regression applications, the individual trees provide independent numerical predictions of the phenomenon of interest. The final result corresponds to the mean forecast of all individual trees (Muñoz et al., 2018).
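A minimal sketch of Random Forest imputation with scikit-learn is given below. It assumes that two reference stations act as predictors for one target station; all series are synthetic, and the gap fraction and hyperparameters are illustrative rather than those of the study. The last lines check the statement that the forest prediction is the mean of the individual tree predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300
# Synthetic precipitation at two hypothetical reference stations
ref = rng.gamma(shape=2.0, scale=5.0, size=(n, 2))
# Synthetic target station, related non-linearly to the references
target = 0.6 * ref[:, 0] + 0.3 * np.sqrt(ref[:, 1]) + rng.normal(0.0, 0.5, n)

# Mark roughly 20 % of the target record as missing
missing = rng.random(n) < 0.2
target_obs = target.copy()
target_obs[missing] = np.nan

# Train only on time steps where the target station was observed
known = ~np.isnan(target_obs)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(ref[known], target_obs[known])

# Fill the gaps with the forest prediction
imputed = target_obs.copy()
imputed[~known] = model.predict(ref[~known])

# The forest output equals the mean of the individual tree outputs
tree_preds = np.stack([tree.predict(ref[~known]) for tree in model.estimators_])
forest_pred = model.predict(ref[~known])
```

Observed values are left untouched; only the time steps flagged as missing receive a model estimate.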

RESULTS AND DISCUSSIONS
The learning machines based on linear regression and on random decision forests produced models that allowed the imputation of missing data in the hydrometeorological records of the stations located in the study basins, i.e., the basins of the Esmeraldas, Cañar and Jubones Rivers. The models are calibrated to impute the missing data of each station over the 22-year record period, from 1990 to 2012, for both meteorological and hydrological stations. It should also be noted that we worked only with hydro-meteorological stations near the hydrographic basin whose mutual correlation, in the correlation analysis, was greater than or equal to 0.75.
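The correlation screening mentioned here could be sketched as follows. The text does not state which correlation coefficient was used, so Pearson's r is an assumption, as are the function name, the station names and the synthetic series. Only time steps observed at both stations enter the calculation.

```python
import numpy as np


def select_reference_stations(target, candidates, threshold=0.75):
    """Return {name: r} for candidates whose Pearson r with the target is >= threshold.

    Only time steps where both series are observed (non-NaN) are used.
    """
    selected = {}
    for name, series in candidates.items():
        both = ~np.isnan(target) & ~np.isnan(series)
        if both.sum() < 3:
            continue  # too few common observations to estimate a correlation
        r = np.corrcoef(target[both], series[both])[0, 1]
        if r >= threshold:
            selected[name] = float(r)
    return selected


rng = np.random.default_rng(1)
base = rng.gamma(2.0, 5.0, size=500)
target = base + rng.normal(0.0, 1.0, 500)
target[::25] = np.nan  # simulate gaps in the target record
candidates = {
    "station_A": base + rng.normal(0.0, 2.0, 500),  # strongly related series
    "station_B": rng.gamma(2.0, 5.0, size=500),     # unrelated series
}
selected = select_reference_stations(target, candidates)
```

With these synthetic series only `station_A` passes the 0.75 threshold, so it would be the one retained as a predictor.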
The best regressions obtained for the imputed data at the selected meteorological stations are presented in Figure 3. These models were established with the Linear Regression learning machine of the scikit-learn (Sklearn) Python library, and cross-validation was applied to their data sets as a fundamental pillar for the validation of results (Hastie et al., 2017). The analyses relate the test values to the predicted values, and the resulting linear models allowed the imputation of missing data in the hydrometeorological records. The figure shows a linear relationship for each data set and station. This relationship has to be validated by statistical indicators of goodness-of-fit between observed and predicted data (Tyralis et al., 2019; Zambrano-Bigiarini, 2017; 2011). These analyses are presented in Section 3.1, where the two methods presented in this research are compared. Table 1 shows the equations of the linear models obtained for each station, together with the stations with which it was correlated in the previous spatial analysis. Each linear regression model is obtained by the machine-learning Linear Regression algorithm between the test values and the predicted values of the stations.
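A hedged sketch of this workflow with scikit-learn follows: a `LinearRegression` model is scored by k-fold cross-validation and then used to fill the gaps. The synthetic series, the five folds and the 15 % gap fraction are assumptions for illustration; the fitted coefficients play the role of the station equations reported in Table 1 but are not those equations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
n = 400
# Two hypothetical reference stations and a linearly related target station
ref = rng.gamma(2.0, 5.0, size=(n, 2))
target = 1.0 + 0.7 * ref[:, 0] + 0.2 * ref[:, 1] + rng.normal(0.0, 1.0, n)
missing = rng.random(n) < 0.15  # time steps treated as gaps
known = ~missing

# Cross-validated mean squared error on the observed part of the record
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), ref[known], target[known],
    cv=cv, scoring="neg_mean_squared_error",
)
cv_mse = -scores.mean()

# Fit on all observed data and impute the gaps
model = LinearRegression().fit(ref[known], target[known])
filled = target.copy()
filled[missing] = model.predict(ref[missing])

# Linear model in the form y = a + b1*x1 + b2*x2
equation = (
    f"y = {model.intercept_:.2f} + {model.coef_[0]:.2f}*x1 + {model.coef_[1]:.2f}*x2"
)
```

The scoring string `"neg_mean_squared_error"` returns negated errors, which is why the sign is flipped when computing `cv_mse`.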
Similarly, the analysis of the imputation models based on the Random Forest learning machine is presented (see Figure 4). The relationship between the data of the reference station and the data of the analysis station is evaluated, and each parameter of the model is calibrated so that the statistical indicator is as reliable as possible.
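One common way to calibrate Random Forest parameters is an exhaustive grid search scored by cross-validation, sketched below with scikit-learn's `GridSearchCV`. This stands in for the pattern search mentioned in the methods and is not necessarily the procedure the authors used; the parameter grid, the fold count and the synthetic flow series are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(3)
# One hypothetical reference station and a non-linearly related flow series
ref = rng.gamma(2.0, 5.0, size=(200, 1))
flow = 2.0 * np.sqrt(ref[:, 0]) + rng.normal(0.0, 0.3, 200)

# Candidate hyperparameter values (illustrative only)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(ref, flow)

best_params = search.best_params_   # combination with the lowest CV error
best_mse = -search.best_score_      # cross-validated MSE of that combination
```

After the search, `search.best_estimator_` is refitted on all the supplied data and can be used directly for imputation.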

Analysis of statistical index for model validation
To interpret, compare, and analyze the applicability of the methods for large hydrographic systems, where the basin shows many variations in both meteorological and hydrological behaviour, it is necessary to obtain goodness-of-fit indicators that identify the optimal model. The statistical significance of the performance statistics is an aspect that is generally ignored, although it helps to reduce subjectivity in the interpretation of model performance (Ritter and Muñoz-Carpena, 2013). To obtain these indicators, the data observed at the meteorological stations were compared with the simulated data (imputed through the techniques analyzed). This process allows us to determine which of the two methods has the better fit and, therefore, which would be more efficient and perform better in subsequent applications.
Among the indicators analyzed, the efficiency coefficient of Nash and Sutcliffe (NSE) (Nash and Sutcliffe, 1970) has received considerable attention in hydrological modelling (Gupta and Kling, 2011; Moussa, 2010). It has already been used for the imputation of missing data and is generally used in other fields of science (Schaefli and Gupta, 2007). The fit is also tested with the Kling-Gupta Index of Efficiency (KGE) (Galleguillos et al., 2017), which takes values between -1 and 1, with 1 as the ideal value and values greater than 0.5 regarded as sufficiently robust. The Mean Square Error is a standard indicator for this type of analysis (Gupta et al., 2009).
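The three indicators can be computed with a few lines of NumPy, as in the sketch below. The KGE follows the 2009 formulation (linear correlation r, variability ratio alpha, bias ratio beta); whether the study used this or a later variant is not stated, so that choice is an assumption. The example series are illustrative, not station data.

```python
import numpy as np


def mse(obs, sim):
    """Mean squared error between observed and simulated series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.mean((sim - obs) ** 2))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squares about the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))


def kge(obs, sim):
    """Kling-Gupta efficiency (2009 form) from correlation, variability and bias ratios."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # ratio of standard deviations
    beta = sim.mean() / obs.mean()       # ratio of means
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))


obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sim = obs + 1.0  # simulated series with a constant bias of one unit

mse_value = mse(obs, sim)
nse_value = nse(obs, sim)
kge_value = kge(obs, sim)
```

A perfect simulation gives MSE = 0 and NSE = KGE = 1. The constant-bias example lowers the NSE and the KGE even though the correlation remains perfect, which shows why several indicators are compared rather than relying on one.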
To determine the best method for data imputation, the indices are compared (Table 2). The best method in all three reliability analyses is Random Forest. The Mean Square Error indicates the closest agreement between imputed and observed values, at 0.01, while the NSE and KGE indices have values of 0.76 and 0.67, respectively, indicating a good fit. The evaluations of the indices verify that the models have given good results. Both artificial-intelligence tools yield reliable indices. Looking at each indicator, the NSE for Linear Regression is 0.69 compared with 0.76 for Random Forest; both values show a good fit between the data (Waseem et al., 2017). The KGE shows a more noticeable difference between Linear Regression (0.48) and Random Forest (0.67); the Random Forest values support (Näschen et al., 2018; Pool et al., 2018) the use of that model as the best for the analysis of large river basins subject to the variations of the Pacific climatology.
For several decades, the need for complete time series to validate subsequent studies has prompted many studies using various techniques. Aissia et al. (2017) review multivariate methods and their application to reduce the loss of information; these range from simple relationships, such as linear regressions based on spatial approximations, to artificial intelligence techniques and even much more innovative methods, such as those presented by Williams et al. (2018), who combined two methods with Bayesian structures to generate an algorithm that represented the signal of the time series of temperatures. In the research of Teegavarapu (2019), spatial interpolation methods were evaluated, although Euclidean distances were substituted to improve the goodness-of-fit indicators.
Rev. Ambient. Água vol. 16 n. 3, e2708 - Taubaté 2021
In another study carried out by Chen et al. (2019), several techniques were tested to impute precipitation data on the premise that having complete series improves analyses within hydrographic basins. After using several methods, they decided to build the hydrological model with the best-fitting one. This leads us to consider that there is no "best" technique, but rather that the analysis must be based on several determining factors such as the type of hydrometeorological variable, the years of the series, the time scale, spatial variations, conditioning factors or external phenomena.
The apparent difference compared to traditional methods is that the response to abnormal weather patterns can be better exploited, which is of great interest for rainfall patterns as variable as that of the Pacific in Ecuador, which is influenced by various external phenomena.

CONCLUSIONS
In this study, we analyzed two traditional methods of artificial intelligence to assess their accuracy and usefulness for the imputation of missing data in hydrographic systems on the Pacific slope of Ecuador. The two methods proposed and evaluated were Linear Regression and Random Forest, which were tested in the three most representative hydrographic systems of the country, the basins of the Esmeraldas, Cañar and Jubones Rivers, with the aim of covering the surface and the water of Ecuador from north to south.
After a preliminary analysis of the data, we worked with 9 test stations in the three systems and observed goodness-of-fit indicators for each station and each model: Mean Squared Error (MSE), Nash and Sutcliffe efficiency (NSE) and Kling-Gupta Index of Efficiency (KGE). The values obtained fall within ranges that indicate good efficiency, but the Random Forest model has the best values of all three indicators, both on average and for each of the analysis stations. This type of analysis is important in watersheds of the Pacific slope of Ecuador, since available information is scarce and the hydro-meteorological behaviour differs from that of the Amazon slope. These systems are also very large in extent, but their hypsometry is characterized by very marked altitudinal differences over a small area of land, and they carry less water than the Amazon slope.
After the data analysis and the discussion process with authors who have carried out similar works worldwide for several decades, it can be seen that techniques that can reproduce atypical effects should be evaluated in the first place and then validated before their application to the management of water resources.

REFERENCES
ADELOYE, A. The relative utility of regression and artificial neural networks models for rapidly predicting the capacity of water supply reservoirs.