Application of Artificial Neural Networks (ANNs) in the Gap Filling of Meteorological Time Series

This study estimates and fills actual gaps in series of meteorological data from four regions of the state of Rio de Janeiro. For this purpose, a Multilayer Perceptron (MLP) Artificial Neural Network (ANN) was applied. To evaluate its adequacy, the monthly variables of maximum air temperature and relative air humidity for the period between 05/31/2002 and 12/31/2014 were estimated, and the estimates were compared with those obtained by Multiple Linear Regression (MLR) and Regions Average (RA), as well as with the recorded data. To analyze the estimated values and define the best model for gap filling, statistical measures such as the correlation coefficient (r) and the Mean Percentage Error (MPE), among others, were applied. The results showed close agreement with the recorded data: for maximum air temperature, (r) ranged from 0.94 to 0.98 and (MPE) from 1.05% to 2.32%, which corresponds to a precision between 97% and 99%. For relative air humidity, (r) with the MLP remained between 0.77 and 0.94 and (MPE) between 1.85% and 2.41%, which corresponds to a precision between 97% and 98%. These results highlight the MLP as effective in estimating and filling missing values.


Introduction
Studying climatic processes and atmospheric phenomena may require large amounts of data, which are obtained through a set of devices such as satellites, balloons, radars, sensors and meteorological stations. These devices operate in a constant-collection regime, obtaining data at various time intervals such as minutes, hours, days or months, and thus generate a large volume of data.
These data have great value, both historical and practical, for governmental organizations, private companies and academic institutions. Such importance is due to their wide field of application, which includes areas such as civil security, agriculture, energy, transportation, ecology and health, among others.
However, problems in the devices lead to various measurement errors and generate inconsistent data or even gaps. According to Wanderley et al. (2014), the lack of a continuous series of climatological data may limit the understanding of the spatial and temporal variability of various meteorological and hydrological processes, and also impairs the characterization of the climate of a region.
Although these gaps are usual, they require great attention in the application of these data in studies that demand continuous time series. The availability of a reliable series, without gaps, is fundamental for the application of these data in different areas.
Various methodologies have been commonly employed in the reconstruction of these time series. This filling is performed through the substitution of missing data by values estimated through statistical and mathematical methods, such as means, spatial and/or temporal interpolations, linear regressions and others (Wanderley et al., 2012). Fernandez (2007) compared the techniques of simple mean, Steurer, normal proportion and multiple linear regression in the prediction of missing means of air temperature, relative air humidity and rainfall of thirteen stations of Rio Grande do Sul, Brazil. In 2012, the interpolation by kriging was applied in the gap filling of pluviometric data of the state of Alagoas, Brazil (Wanderley et al., 2012). Oliveira et al. (2010) applied the methods of regional means, linear regression, potential regression and multiple regression in the gap filling of historical series of annual rainfall of six pluviometric stations of the state of Goiás, Brazil.
However, although these methodologies are regularly applied, some of them may require a large set of historical data and ignore the local spatial variation of the studied variables, which may ultimately generate large errors (Huth and Nemesová, 1995). For this reason, the reconstruction of incomplete time series remains the subject of numerous scientific studies, which has stimulated the search for methodologies that are able to improve this process, such as the computational intelligence techniques known as artificial neural networks (ANNs) (Wanderley et al., 2012).
ANNs are inspired by the neural structure of intelligent organisms and are characterized by the recognition of patterns and generalization of information, besides the capacity to learn and acquire knowledge through experience (Haykin, 2001). These characteristics have led ANNs to be widely used to model a series of meteorological processes, such as the filling and estimation of missing data, prediction of floods, prediction of reservoir levels and climatic classifications, among others. In regard to the filling and prediction of meteorological data, the application of models based on ANNs has aroused the interest of various researchers. Wanderley et al. (2014) applied an ANN in the gap filling of monthly pluviometric data of the state of Alagoas, Brazil. Gomes and Montenegro (2010) applied ANNs in the prediction of natural inflow rates and in the treatment of pluviometric and fluviometric data of the reservoir of Três Marias, of the São Francisco River, Brazil. Sobrinho et al. (2011) applied an ANN to estimate the reference evapotranspiration (ETo) of Dourados, Mato Grosso do Sul, Brazil. Maqsood et al. (2004) utilized ANNs to provide 24-h predictions of air temperature, wind speed and relative air humidity at the Regina Airport, in Canada. Olcese et al. (2015) applied ANNs to predict and fill missing data of aerosols in the South portion of the Coast of the United States and in the Iberian Peninsula. Bustami et al. (2007) also used an ANN to estimate missing data of rainfall and water levels in the state of Sarawak, in Malaysia. Depiné et al. (2013) used an MLP-type ANN to fill gaps in historical series of hourly rainfall of the Taboão stream basin, in Rio Grande do Sul. Correia et al. (2016) used an ANN to fill gaps in the series of four pluviometric stations located in the mountainous region of Espírito Santo. Ventura et al. (2013) applied ANNs in the gap filling of temperature series of a Cerrado region in the state of Mato Grosso, in Brazil.
All of these applications showed satisfactory results regarding the utilized statistical parameters. Therefore, the present study aimed to apply a Multilayer Perceptron (MLP) ANN model and compare it with Multiple Linear Regression (MLR) in the reconstruction of time series of meteorological data from the state of Rio de Janeiro, in Brazil.
The municipality of Campos dos Goytacazes is located in the North Fluminense region of the state. The basin in which the municipality lies is responsible for more than 80% of the petroleum production of the country and also stands out as a center of sugarcane cultivation (Miranda et al., 2010; Reis Junior and Monnerat, 2002). Cordeiro belongs to the Serrana region, which is responsible for a large part of the vegetable production of the state. In addition, in 2011, this region suffered the largest climatic catastrophe of the country. Itaperuna belongs to the Northwest Fluminense region, which is responsible for a great part of the agricultural production. Rio de Janeiro belongs to the Metropolitan region, where the commercial center of the state is concentrated. However, approximately 28.9% of its territory is still Atlantic Forest, of which the main areas are Tijuca Forest, Gericinó Forest, Pedra Branca Forest, Restinga da Marambaia and Grumari Municipal Natural Park, among others (SMAC, 2016).

Utilized data
The data set used in this study was provided by the National Institute of Meteorology (INMET). The following variables were used: monthly means of maximum air temperature and relative air humidity. The data were recorded in the period from May 31, 2002, to December 31, 2014, totaling 152 records for each variable of each station. These variables were selected based on the survey of gaps that occurred in their historical series.

Proposed model for filling the gaps
First, the data collected by the stations were inserted in a spreadsheet program, and the gaps and inconsistencies were then identified (Table 1).
After this step, the missing data were removed from all stations, i.e., if one station of the set did not have the record of the monthly mean of maximum air temperature or relative air humidity for the period of April 30, 2008, the record for that period was also removed from all other stations. This procedure guaranteed the creation of a homogeneous series, in which all stations have the same number of data and the same recorded months. Figure 2 illustrates the applied sorting process. After the missing data were removed, the data set was normalized by rescaling the values to an interval between zero and one (Coutinho et al., 2016). Then, a correlation matrix between the stations was calculated to confirm the degree of correlation of the utilized data sets (Table 2 and Table 3).
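The sorting, normalization and correlation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature values are hypothetical, `None` is used here to mark a gap, the function names are invented for this sketch, and the normalization is assumed to be applied per station (the text does not say whether the scale was per station or global).

```python
from math import sqrt

# Hypothetical monthly maximum air temperatures (°C); None marks a gap.
raw = {
    "Campos dos Goytacazes": [30.1, None, 28.4, 27.0, 31.2],
    "Rio de Janeiro":        [29.5, 30.0, None, 26.8, 30.9],
    "Cordeiro":              [26.0, 26.4, 25.1, 23.9, 27.3],
    "Itaperuna":             [31.0, 31.5, 29.9, 28.2, 32.4],
}

def drop_incomplete_months(series):
    """Keep only the months for which every station has a record."""
    n_months = len(next(iter(series.values())))
    kept = [m for m in range(n_months)
            if all(values[m] is not None for values in series.values())]
    complete = {name: [values[m] for m in kept] for name, values in series.items()}
    return complete, kept

def min_max_normalize(values):
    """Rescale values to the interval [0, 1] (assumed: per-station scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long series."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

complete, kept_months = drop_incomplete_months(raw)
normalized = {name: min_max_normalize(v) for name, v in complete.items()}
correlation = {a: {b: pearson_r(normalized[a], normalized[b]) for b in normalized}
               for a in normalized}
```

In this toy example, the months at positions 1 and 2 are dropped from every station because one station lacks a record in each of them, which mirrors the homogenization rule described in the text.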
To compare observed values with those estimated by the techniques, the data set of each variable was divided into two parts, 70% for training/adjusting and 30% for validation, and subjected to the models of Mean of the Regions (MD), Multiple Linear Regression (MLR) and Multilayer Perceptron (MLP). After this division, the data sets had the same size for all stations: for maximum air temperature, each station had 88 records for training and 38 for validation; for relative air humidity, each station had 62 records for training and 27 for validation.
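The set sizes reported above are consistent with a 70/30 split of 126 homogenized records for maximum air temperature and 89 for relative air humidity. A minimal sketch follows, assuming the split keeps the original record order and truncates the training size to an integer; the text does not state whether the records were shuffled, and `split_70_30` is a hypothetical helper name.

```python
def split_70_30(records, train_fraction=0.7):
    """Split a list of records into training and validation parts.

    Assumption: the first 70% of the records (in their original order)
    are used for training and the remainder for validation.
    """
    n_train = int(len(records) * train_fraction)
    return records[:n_train], records[n_train:]

# Placeholder records standing in for the homogenized monthly series.
temperature_records = list(range(88 + 38))  # 126 records after sorting
humidity_records = list(range(62 + 27))     # 89 records after sorting

temp_train, temp_valid = split_70_30(temperature_records)
hum_train, hum_valid = split_70_30(humidity_records)
```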
The validation step consisted in subjecting the validation data set to the models in order to estimate each value of maximum air temperature or relative air humidity; the efficiency of each model was then evaluated through statistical measures applied to the results obtained. After confirming the capacity of the model to predict the variable, the actual gaps were filled using the data of the estimator stations that had been removed in the sorting process. In other words, if the station of Rio de Janeiro does not have the measurement of maximum air temperature for the period of April 30, 2006, but the remaining stations have it, data from Campos dos Goytacazes, Cordeiro and Itaperuna are used; in case of a gap in Campos dos Goytacazes, data from Rio de Janeiro, Cordeiro and Itaperuna are used, and so on.
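The estimator selection and gap-filling rule described above can be sketched as follows. The names `estimators_for`, `fill_gaps` and `stand_in_model` are hypothetical, the data are invented, and the stand-in model (a plain mean of the three estimator values) only takes the place of a trained model for illustration. In this sketch, a month in which one of the estimator stations also lacks a record cannot be estimated and is left as a gap.

```python
STATIONS = ["Campos dos Goytacazes", "Rio de Janeiro", "Cordeiro", "Itaperuna"]

def estimators_for(target):
    """The three remaining stations act as estimators for the target station."""
    return [s for s in STATIONS if s != target]

def fill_gaps(series, target, predict):
    """Fill gaps (None) of the target station using the estimator stations.

    `predict` is any callable that maps the three estimator values of a
    month to an estimate, e.g. a trained MLP or MLR model.
    """
    filled = list(series[target])
    unfilled = []
    for month, value in enumerate(filled):
        if value is not None:
            continue
        inputs = [series[s][month] for s in estimators_for(target)]
        if any(x is None for x in inputs):
            unfilled.append(month)  # an estimator is missing too
        else:
            filled[month] = predict(inputs)
    return filled, unfilled

def stand_in_model(inputs):
    """Placeholder for a trained model: mean of the estimator values."""
    return sum(inputs) / len(inputs)

# Hypothetical monthly values (°C); None marks a gap.
data = {
    "Campos dos Goytacazes": [30.0, None, None],
    "Rio de Janeiro":        [29.0, 30.0, None],
    "Cordeiro":              [26.0, 27.0, 25.0],
    "Itaperuna":             [31.0, 33.0, 30.0],
}
filled, unfilled = fill_gaps(data, "Campos dos Goytacazes", stand_in_model)
```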

Multiple Linear Regression (MLR)
Multiple linear regression is a technique that relates one dependent variable to several independent variables (Fonseca et al., 2012). The relationship between the dependent variable $Y$ and the independent variables ($X_1$, $X_2$, $X_3$) is formulated by the following linear model, Eq. (1) (Sousa et al., 2007):
$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + e_i \qquad (1)$$
In the present study, $Y_i$ is the variable to be estimated, which can be maximum air temperature or relative air humidity; $X_{1i}$, $X_{2i}$ and $X_{3i}$ are the values of maximum air temperature or relative air humidity observed by the stations used for estimation; $a$, $b_1$, $b_2$ and $b_3$ are the regression coefficients; and $e_i$ represents the independent random disturbances or random errors (Lyra et al., 2011).
The resolution of this problem consists in estimating the values of the parameters $a$, $b_1$, $b_2$ and $b_3$, which can be performed by the least squares method, which determines the values of $a$ and $b$ that minimize the sum of the squared errors (Sousa et al., 2007; Lyra et al., 2011).
These values can be found considering the matrix notation of the data, described in Eqs. (2) and (3).
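As an illustration of the least squares estimation, the sketch below solves the normal equations $(X^T X)\,\beta = X^T Y$, where $X$ is the data matrix with a leading column of ones and $\beta = (a, b_1, b_2, b_3)$. This is the usual matrix form of the problem and is assumed here; it is not taken from the study. The function names and the data are hypothetical, and the example values are built from known coefficients so that the fit can be checked.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_mlr(X, y):
    """Return [a, b1, b2, b3] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in X]  # column of ones for a
    k = len(design[0])
    XtX = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

def predict_mlr(coeffs, row):
    """Estimate Y from the three estimator values of one month."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))

# Hypothetical normalized values of the three estimator stations.
X = [[0.8, 0.6, 0.9], [0.4, 0.5, 0.7], [0.2, 0.1, 0.3],
     [1.0, 0.9, 1.0], [0.6, 0.3, 0.8], [0.1, 0.4, 0.0]]
true_coeffs = [0.1, 0.5, 0.3, 0.2]  # a, b1, b2, b3 chosen for the example
y = [predict_mlr(true_coeffs, row) for row in X]

coeffs = fit_mlr(X, y)
```

Because the example targets contain no noise, the fitted coefficients reproduce the chosen ones up to rounding error; with real station data the residuals $e_i$ would be nonzero.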

Multilayer Perceptron (MLP) networks
Artificial Neural Networks (ANNs) try to emulate the biological neurons of the human brain through massively parallel and distributed processing, capable of learning from examples and generalizing the acquired information. ANNs calculate mathematical functions and have a natural propensity to store knowledge from experience and make it useful. In this sense, they are similar to the human brain (Härter and Velho, 2005; Robles et al., 2008; Haykin, 2001). There are various architectures of ANNs, but the present study used the Multilayer Perceptron (MLP) network.
The MLP-type ANN is a universal approximator of functions that belongs to the feedforward class. It has been applied in different problems, such as the processing of information, recognition of patterns, weather forecast, problems of classification, reconstruction of missing information, processing of images and others (Shah and Ghazali, 2011).
The structure of the model is constituted by one input layer, one or more hidden layers, and one output layer. Each one of the neurons of the input layer is connected to all neurons in the hidden layer. Likewise, each neuron of the hidden layer is connected to all neurons of the output layer (Wanderley et al., 2014).
The present study adopted an architecture with four layers: one input layer, two hidden layers and one output layer. Figure 3 shows the applied model.
In the model, $x_1$, $x_2$ and $x_3$ are the values of maximum air temperature or relative air humidity recorded by the stations used for estimation, $w_i$ are the weights associated with the layers, and $y_1$ is the variable to be estimated, which can be maximum air temperature or relative air humidity.
Many tests and simulations were conducted to define the configuration of the network. Based on their results, the first hidden layer was established with 30 neurons and a hyperbolic tangent activation function, Eq. (4):
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (4)$$
Simulations were also performed with different training algorithms, and the backpropagation Quasi-Newton algorithm was selected. This algorithm is a variation of the classic backpropagation algorithm described by Haykin (2001). It is based on Newton's method but does not require the calculation of the second derivative, because it updates an approximation of the Hessian matrix in each iteration. The update is calculated as a function of the gradient. This algorithm requires more computation in each iteration and more storage space than the backpropagation method, but converges to a solution in fewer iterations (Gill et al., 1982).
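The forward pass of the four-layer architecture described above can be sketched as follows. Only the three inputs, the 30 neurons and hyperbolic tangent function of the first hidden layer, and the single output come from the text. The size and activation of the second hidden layer (10 neurons, hyperbolic tangent), the linear output neuron, the bias terms and the random initial weights are assumptions made for this illustration, and no training (Quasi-Newton or otherwise) is performed.

```python
import math
import random

def make_layer(n_inputs, n_neurons, rng):
    """Create a fully connected layer with small random weights and biases."""
    return {
        "weights": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                    for _ in range(n_neurons)],
        "biases": [rng.uniform(-0.5, 0.5) for _ in range(n_neurons)],
    }

def layer_output(layer, inputs, activation):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(layer["weights"], layer["biases"])]

def mlp_forward(network, inputs):
    """Propagate the three station values through the network to obtain y1."""
    hidden1 = layer_output(network[0], inputs, math.tanh)     # 30 neurons, tanh
    hidden2 = layer_output(network[1], hidden1, math.tanh)    # assumed size/activation
    return layer_output(network[2], hidden2, lambda v: v)[0]  # assumed linear output

rng = random.Random(0)            # fixed seed so the sketch is reproducible
network = [
    make_layer(3, 30, rng),       # input layer -> first hidden layer
    make_layer(30, 10, rng),      # first -> second hidden layer (assumed 10 neurons)
    make_layer(10, 1, rng),       # second hidden layer -> output
]

x = [0.62, 0.48, 0.71]            # hypothetical normalized values x1, x2, x3
y1 = mlp_forward(network, x)      # untrained estimate, for illustration only
```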
As for the training period of the MLP model, it was defined as 4000 epochs and took an average of 10 to 20 min to converge.

Performance evaluation
To evaluate the capacity of the models to estimate maximum air temperature and relative air humidity, the following statistical measures were used: Pearson's correlation coefficient (r), Eq. (6), applied to evaluate the degree of association between estimated and observed data; the mean absolute error (MAE), Eq. (7); the root-mean-square error (RMSE), Eq. (8); the mean percentage error (MPE), Eq. (9); the index of agreement (D), Eq. (10); and the index of confidence (C), Eq. (11) (Fonseca et al., 2012; Deshmukh and Ghatol, 2010; Pezzopane et al., 2012). The index of confidence (C) allows the precision and the accuracy of the obtained results to be analyzed simultaneously. It is calculated as the product of the correlation coefficient (r) and the index of agreement (D). Its values vary from zero (0), for no agreement, to one (1), for perfect agreement (Pezzopane et al., 2012). Table 4 presents the criteria for performance evaluation. In Eqs. (6) to (11), $n$ or $N$ represents the number of utilized data, $O_j$ the observed value, $x_j$ the value estimated by the employed techniques, $\bar{O}$ the mean of the observed data and $\bar{x}$ the mean of the estimated data.
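The measures listed above can be computed as in the sketch below. The formulas are the commonly used definitions and are assumptions with respect to this study: in particular, MPE is taken here as the mean of the absolute percentage errors, and D as Willmott's index of agreement. The function name `evaluate` and the example values are hypothetical.

```python
from math import sqrt

def evaluate(observed, estimated):
    """Return r, MAE, RMSE, MPE (%), D and C for paired observed/estimated data."""
    n = len(observed)
    o_mean = sum(observed) / n
    x_mean = sum(estimated) / n
    pairs = list(zip(observed, estimated))

    # Pearson's correlation coefficient (r)
    cov = sum((o - o_mean) * (x - x_mean) for o, x in pairs)
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_x = sum((x - x_mean) ** 2 for x in estimated)
    r = cov / sqrt(var_o * var_x)

    # Error measures
    mae = sum(abs(x - o) for o, x in pairs) / n
    rmse = sqrt(sum((x - o) ** 2 for o, x in pairs) / n)
    mpe = 100.0 * sum(abs((o - x) / o) for o, x in pairs) / n  # assumed absolute form

    # Index of agreement (D, assumed Willmott's form) and index of confidence (C)
    squared_error = sum((x - o) ** 2 for o, x in pairs)
    potential_error = sum((abs(x - o_mean) + abs(o - o_mean)) ** 2 for o, x in pairs)
    d = 1.0 - squared_error / potential_error
    c = r * d

    return {"r": r, "MAE": mae, "RMSE": rmse, "MPE": mpe, "D": d, "C": c}

# Hypothetical observed and estimated values.
observed = [10.0, 20.0, 30.0]
estimated = [11.0, 19.0, 33.0]
scores = evaluate(observed, estimated)
perfect = evaluate(observed, observed)
```

For estimates identical to the observations, the sketch returns zero errors and r, D and C equal to one, which matches the interpretation of C as ranging from no agreement (0) to perfect agreement (1).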
In addition to the measures described above, the average (M), maximum (MAX), minimum (MIN) and standard deviation (SD) of the actual data and of those estimated by the models were also calculated. Table 5 shows the values of M, MAX, MIN and SD of the actual data and the errors obtained by each model.
322 Coutinho et al.
Although the differences among the statistical measures seem small, the error measures RMSE and MAE show that the MLP obtained the smallest errors in the estimates of maximum air temperature. A comparison of the mean percentage error (MPE) also shows that the estimates of the MLP were between 23% and 45% more accurate than those of the MD model and between 12% and 15% more accurate than those of the MLR.

Results of maximum air temperature estimates
For the region of Campos dos Goytacazes, the correlation coefficient (r) between the actual data and those estimated by the MLR and MLP models was 0.97, demonstrating a high correlation with the observed data. In addition, the RMSE ranged from 0.49 for the MLP to 0.67 for the MD. The values of MAE, MPE, (D) and (C) for the MLP model were 0.40°C, 1.34%, 0.98 and 0.96, respectively, which characterizes a precision above 98% and an optimal index of confidence for the results of the MLP model in this region (Fig. 4).
In the estimates of maximum air temperature for the region of Rio de Janeiro, the MLP and MLR models showed (r) of 0.94, which demonstrates a high correlation with the observed data. Additionally, the MLP and MLR models exhibited MAE of 0.69°C and 0.80°C, and MPE of 2.32% and 2.66%, respectively, indicating a precision above 97% in the estimates. Besides these parameters, the index of confidence of the results for the MLP model remained optimal (Fig. 4).
The estimates of maximum air temperature for the region of Cordeiro also showed high correlation between actual data and those estimated by the MLP model, with a correlation coefficient (r) of 0.98. The RMSE ranged from 0.41 for the MLP to 0.85 for the MD; the MAE was 0.34°C for the MLP and 0.40°C for the MLR; the MPE was 1.25% for the MLP and 1.48% for the MLR; and (C) was 0.97 for the MLP, which indicates an optimal index of confidence for the estimated values. Another important aspect observed in Table 5 is the proximity between the actual values and those estimated by the MLP model: the mean (M), maximum (MAX) and minimum (MIN) were 27.43°C, 32.08°C and 23.21°C for the actual data, and 27.47°C, 31.88°C and 23.54°C for those estimated by the MLP model. This proximity can be observed in Fig. 4.
For the estimates of maximum air temperature of the region of Itaperuna, the lowest values of RMSE and MAE were 0.42 and 0.32°C, obtained with the MLP model. The results generated by the MLR model were almost equivalent to those of the MLP. Nevertheless, comparing M, MAX and MIN, the data estimated by the MLP model are more precise and reached values considerably close to the actual data. In addition, according to the MPE of 1.05%, the values estimated by the MLP model show a precision of almost 99% (Fig. 4).
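The precision figures quoted above appear to follow from the MPE as precision = 100% minus MPE. The short sketch below applies this assumed reading to the MPE values reported in the text for the MLP estimates of maximum air temperature. It also shows one plausible way to express how much more accurate one model is than another, as the relative reduction in MPE, using the values reported for Rio de Janeiro; both function names and both interpretations are assumptions of this sketch.

```python
# MPE (%) of the MLP estimates of maximum air temperature, as reported in the text.
mpe_mlp = {
    "Campos dos Goytacazes": 1.34,
    "Rio de Janeiro": 2.32,
    "Cordeiro": 1.25,
    "Itaperuna": 1.05,
}

def precision_from_mpe(mpe_percent):
    """Assumed reading: precision (%) = 100 - MPE (%)."""
    return 100.0 - mpe_percent

def relative_improvement(mpe_model, mpe_reference):
    """Assumed reading: relative reduction (%) of MPE with respect to a reference."""
    return 100.0 * (mpe_reference - mpe_model) / mpe_reference

precision = {region: round(precision_from_mpe(mpe), 2)
             for region, mpe in mpe_mlp.items()}

# Rio de Janeiro: MPE of 2.32% (MLP) against 2.66% (MLR), as reported in the text.
rio_gain_over_mlr = relative_improvement(2.32, 2.66)
```

Under these assumptions, the precision values are 98.66%, 97.68%, 98.75% and 98.95% for the four regions, and the MLP is about 12.8% more accurate than the MLR for Rio de Janeiro.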
After analyzing the results and confirming the capacity of the MLP model, the data removed in the sorting stage from the estimator stations of each region were applied to the MLP model to fill the actual gaps of the estimated station. For example, the gaps of Campos dos Goytacazes were filled using the data removed in the sorting stage from its estimators, Rio de Janeiro, Cordeiro and Itaperuna.
For the region of Campos dos Goytacazes, it was possible to fill eight of the nine gaps; for Rio de Janeiro, seven of the eight; for Cordeiro, six of the seven; and, for Itaperuna, three of the four (Fig. 5).

Results of relative air humidity estimates
Table 6 presents the results of the estimation of relative air humidity, obtained with the same methodology used for maximum air temperature. According to the measures (r), RMSE, MAE, (D) and (C) in Table 6, the MLP model was superior to the other models in all the estimates. This can also be verified by comparing the mean percentage error (MPE): the estimates of the MLP were between 23% and 35% more accurate than those of MD and between 12% and 18% more accurate than those of MLR.
For the region of Campos dos Goytacazes, the estimates of all models showed a high relation with the observed data. However, the estimates generated with the MLP model were more precise (Table 6). The value of (r) generated by the MLR model was lower than that of the MLP model, 0.77 and 0.80, respectively, which demonstrates that the values estimated with the MLP model remained closer to the observed data.
Another important factor demonstrated by the MLP in the estimate of relative air humidity is the set of values of RMSE, MAE and MPE, which were 2.08, 1.57 and 2.11%, respectively. These values confirm that the MLP reached lower errors than the other applied models and that its estimates reached a precision close to 98%. Figure 6 shows the data estimated by the MLP model.
For Rio de Janeiro, the data estimated by the MLP model also showed lower RMSE, MAE and MPE than those of the MLR and MD models, equal to 2.18, 1.74 and 2.41%, respectively, which indicates more accurate estimates by the MLP model.
In the estimation of relative air humidity for the region of Cordeiro, the MLP model also demonstrated superior performance, with the lowest errors and the highest correlation coefficient (r) (Table 6). According to the results, the mean (M) and minimum (MIN) of the values estimated by the MLP model also remained relatively close to the actual values (Fig. 6).
In regard to the estimates of relative air humidity for the region of Itaperuna, according to the parameters M, MAX and MIN, the values estimated by the MLP model are similar to the actual data. In addition, the value of (r) obtained by the MLP model was 0.94, which demonstrates a high correlation. Furthermore, according to Table 6, the error parameters RMSE, MAE and MPE of the estimates generated by the MLP model, 2.10, 1.67 and 2.35%, respectively, were lower than those of the MD and MLR. This indicates that the MLP model was more accurate in its estimates (Fig. 6).
After analyzing the results of the estimates of relative air humidity and confirming the capacity of the MLP model, the data removed in the sorting stage from the estimator stations of each region were applied to fill the actual gaps of the estimated stations, adopting the same methodology employed in the filling of maximum air temperature data.
Thus, for Campos dos Goytacazes, it was possible to fill three of the seven gaps; for Rio de Janeiro, six of the eight; for Cordeiro, forty-two of the forty-seven; and, for Itaperuna, six of the eight (Fig. 7).

Conclusions
From the analysis of the results achieved by the MLP model, it can be concluded that it performed convincingly and was superior to the MD and MLR models. Nevertheless, the MD and MLR models also presented satisfactory results, with high correlation coefficients (r) and low mean percentage errors (MPE) relative to the actual data. This may have been influenced by the treatment applied to the variables, which standardized the historical series and made them homogeneous.
Moreover, the comparison with the actual data shows that, in terms of the measures (M), (MIN), (MAX) and (SD), most of the values estimated by the MLP model were closer to reality than those of the other models. Thus, it can be stated that the MLP-type ANN stands out as an effective tool to reliably estimate and fill gaps in the meteorological variables of maximum air temperature and relative air humidity.