MANNGA: A Robust Method for Gap Filling Meteorological Data

Abstract This paper presents Mannga (Multiple variables with Artificial Neural Network and Genetic Algorithm), a method designed for gap filling meteorological data. The main approach is to estimate missing data from the values of other meteorological variables measured at the same time and location, since meteorological variables are strongly related. Experimental tests compared the performance of Mannga with that of two other methods typically used by researchers in this area. Good results were achieved, with high accuracy even for sequential failures, which are a major challenge for researchers. The core advantages of Mannga are its flexibility in handling different types of meteorological data, its ability to select the best variables to assist the gap filling, and its capacity to deal with sequential failures. Moreover, a Java implementation of the method is publicly available.


Introduction
Meteorological data has an important position in scientific research. Explanations of climatic phenomena are based on meteorological data, allowing us to understand several characteristics of our planet. To aid the process of data acquisition, many types of equipment are installed in meteorological stations. Commonly, the equipment works 24 hours per day, for years. Therefore, a huge quantity of data is generated. Unfortunately, not all of this data is complete, because failures appear in the data series.
Missing or rejected data in these measurements is a ubiquitous problem due to equipment failures (system/sensor breakdown), maintenance and calibration, spikes in the raw data, and physical and biological constraints (e.g. storms, hurricanes, and non-optimal wind directions) (Hui et al., 2004). In any case, the gap created in the data series can lead to misinterpretation when the data is studied. Thus, it is important to apply a gap filling method to fix the dataset.
One of the methods used for gap filling is Multiple Imputation (MI), used by Sullivan et al. (2015). MI is a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions, where m is typically small, for example, between 3 and 10 (Schafer, 1999). Horton and Lipsitz (2001) comment on several systems that facilitate the use of the method, such as Solas, SAS, S-Plus, MICE, and others. Hui et al. (2004) used the MI method for gap filling eddy covariance data, which describe the exchange of carbon dioxide, water vapor and heat between a vegetated surface and the atmosphere.
Other methods of gap filling are the Mean Diurnal Variation (MDV) and the Look-up Tables (Falge et al., 2001). MDV replaces the gap using an average calculated from values of adjacent days (Kato et al., 2006). This method was also used in Hu et al. (2009), Alavi et al. (2006) and Mohan and Rao (2016). The look-up table approach consists of creating a table with the flux values binned, based on the corresponding values of the external parameters. The determination of the relevant parameters and their critical values is a crucial step if this technique is to be successful (Mishurov and Kiely, 2011). This method was used in Zhou et al. (2015), Rodrigues et al. (2005), Wilson and Baldocchi (2001) and Shao et al. (2011).
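As an illustration of the MDV idea described above, the following Java sketch replaces a missing value with the mean of the valid values recorded at the same time of day on adjacent days. The class name, the window parameter, and the use of NaN to mark a gap are assumptions made for this example; they are not taken from the cited works.

```java
/** Minimal sketch of Mean Diurnal Variation (MDV) gap filling; names and the NaN gap marker are assumptions. */
final class MeanDiurnalVariationSketch {

    /**
     * Returns a copy of the series in which every NaN is replaced by the mean of the
     * valid values observed at the same time of day within +/- windowDays days.
     * Gaps with no valid neighbour inside the window stay NaN.
     */
    static double[] fill(double[] series, int recordsPerDay, int windowDays) {
        double[] filled = series.clone();
        for (int i = 0; i < series.length; i++) {
            if (!Double.isNaN(series[i])) {
                continue; // value is present, nothing to do
            }
            double sum = 0.0;
            int count = 0;
            for (int d = -windowDays; d <= windowDays; d++) {
                int j = i + d * recordsPerDay; // same time of day, d days away
                if (d == 0 || j < 0 || j >= series.length || Double.isNaN(series[j])) {
                    continue;
                }
                sum += series[j];
                count++;
            }
            if (count > 0) {
                filled[i] = sum / count;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        // Three days with four records per day; the second record of day 2 is missing.
        double[] series = {10, 20, 30, 40, 12, Double.NaN, 32, 42, 14, 24, 34, 44};
        double[] filled = fill(series, 4, 1);
        System.out.println(filled[5]); // mean of the neighbours 20 and 24
    }
}
```

Because the estimate relies only on the same variable at neighbouring days, this kind of method degrades when a gap spans several consecutive days.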
Regression analysis is performed in order to determine the correlations between two or more variables having cause-effect relations, and to make predictions for the topic by using the relation (Uyanık and Guler, 2013). Multiple Linear Regression (MLR) can be used to simulate meteorological data, as shown in Malik and Kumar (2015).
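To make the MLR approach concrete, the sketch below fits y = b0 + b1*x1 + ... + bk*xk by ordinary least squares and uses the fitted coefficients to estimate a value. It is a minimal illustration, assuming the normal equations are solved by Gaussian elimination with partial pivoting; the class and method names are hypothetical and do not come from Malik and Kumar (2015) or from Mannga.

```java
/** Minimal sketch of Multiple Linear Regression by ordinary least squares; all names are assumptions. */
final class MultipleLinearRegressionSketch {

    /** Returns the coefficients {b0, b1, ..., bk}; x holds one row of k predictors per record. */
    static double[] fit(double[][] x, double[] y) {
        int p = x[0].length + 1;             // +1 for the intercept
        double[][] a = new double[p][p + 1]; // augmented matrix [X'X | X'y]
        for (int r = 0; r < x.length; r++) {
            double[] row = new double[p];
            row[0] = 1.0;
            System.arraycopy(x[r], 0, row, 1, p - 1);
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    a[i][j] += row[i] * row[j];
                }
                a[i][p] += row[i] * y[r];
            }
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException("Predictors are collinear");
            }
            for (int r = col + 1; r < p; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        // Back substitution.
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double s = a[i][p];
            for (int j = i + 1; j < p; j++) {
                s -= a[i][j] * beta[j];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }

    /** Estimates the target value for one record of predictors. */
    static double predict(double[] beta, double[] x) {
        double yHat = beta[0];
        for (int i = 0; i < x.length; i++) {
            yHat += beta[i + 1] * x[i];
        }
        return yHat;
    }
}
```

For gap filling, the model would be fitted on records where all variables are present and then applied to the records where the target variable is missing.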
Moffat et al. (2007) compared several gap filling techniques on the same dataset of net carbon fluxes, including interpolation, probabilistic filling, look-up tables, non-linear regression, artificial neural networks, and process-based models in a data-assimilation mode. In addition, Ooba et al. (2006) evaluated the performance of three methods for gap filling net ecosystem CO2 exchange data and concluded that a method using an Artificial Neural Network offers better performance for gap filling.
In all of these works, the gap filling methods are limited to a specific climatic variable. In some cases the method is complicated to apply, since different settings must be made for each data type. These disadvantages are common in gap filling methods. Therefore, the purpose of this work is to present the development of Mannga (Multiple variables with Artificial Neural Network and Genetic Algorithm), an optimized method that combines two Artificial Intelligence techniques: Genetic Algorithm and Artificial Neural Network. Mannga works with several climatic variables at the same time and spares the user from making a specific configuration for each variable. The method was implemented in the Java programming language.

Proposed method
The proposed method, Mannga, takes advantage of two techniques to perform gap filling on meteorological data: Artificial Neural Network (ANN) and Genetic Algorithm (GA). An Artificial Neural Network is a computational technique inspired by the neurons of the human brain. An ANN is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use (Haykin, 1999).
The structure of an ANN has several parameters and can be configured in many different ways. For each dataset, some configuration of the ANN solves the problem better than the others. Finding the optimal ANN structure means investigating an entire space of possible states. This task requires a great amount of processing, so a search algorithm is needed to find a satisfactory solution.
A GA is a computational analogy of adaptive systems that is used to generate useful solutions to optimization and search problems. In this context, a Genetic Algorithm was used to assist in defining the structure of the ANN, as a search method that finds optimal or good solutions by examining only a small fraction of the possible candidates (Mitchell, 1998).
The main idea of the proposed method is that climatic variables are related to each other. Thus, Mannga estimates the missing data based on the values of the other available climatic variables. For example, if the temperature value at 10:30 AM is missing, the method calculates the temperature at that moment from the values of incoming shortwave radiation, wind speed and relative humidity measured at 10:30 AM. Even if there are several sequential gaps, the method can fill them as long as the other variables contain valid data for the same period.
Thus, the ANN is responsible for calculating the missing data. However, as mentioned, there are countless configurations of an ANN, each one better or worse depending on the data series. For this reason, the GA was used to determine the best ANN for the current data series. With this approach, the method is more likely to work with different types of meteorological data, because the ANN is optimized in each test.
Based on Ventura et al. (2015), the ANN parameters determined by the GA were: training algorithm, activation functions, learning rate, momentum rate and number of neurons. Sometimes there are many climatic variables in the data series. Thus, in addition to the ANN parameters, the GA determines which variables should or should not be used in the estimation. In this way, only the most correlated variables are used, improving the performance of the method and decreasing the error in the final estimate.
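One way to picture the search space described above is as a chromosome that holds the ANN parameters together with an on/off flag for each candidate input variable. The Java sketch below is hypothetical: the field names, the enum values for training algorithm and activation function, and the value ranges are assumptions for illustration, not the encoding actually used by Mannga.

```java
import java.util.Random;

/** Hypothetical chromosome layout for the GA search; field names and value ranges are assumptions. */
final class AnnChromosomeSketch {
    enum Training { BACKPROPAGATION, RESILIENT_PROPAGATION } // assumed options
    enum Activation { SIGMOID, TANH }                        // assumed options

    final Training training;
    final Activation activation;
    final double learningRate;
    final double momentum;
    final int hiddenNeurons;
    final boolean[] useVariable; // one flag per candidate input variable

    AnnChromosomeSketch(Training training, Activation activation, double learningRate,
                        double momentum, int hiddenNeurons, boolean[] useVariable) {
        this.training = training;
        this.activation = activation;
        this.learningRate = learningRate;
        this.momentum = momentum;
        this.hiddenNeurons = hiddenNeurons;
        this.useVariable = useVariable.clone();
    }

    /** Draws a random candidate, as a GA would do for its initial population. */
    static AnnChromosomeSketch random(Random rng, int candidateVariables) {
        boolean[] mask = new boolean[candidateVariables];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = rng.nextBoolean();
        }
        mask[rng.nextInt(candidateVariables)] = true; // keep at least one input variable
        return new AnnChromosomeSketch(
                Training.values()[rng.nextInt(Training.values().length)],
                Activation.values()[rng.nextInt(Activation.values().length)],
                0.01 + 0.99 * rng.nextDouble(), // learning rate in [0.01, 1.0)
                rng.nextDouble(),               // momentum in [0.0, 1.0)
                1 + rng.nextInt(20),            // 1 to 20 hidden neurons
                mask);
    }

    /** Keeps only the values of one record whose variables are switched on. */
    double[] selectInputs(double[] record) {
        if (record.length != useVariable.length) {
            throw new IllegalArgumentException("Record length must match the variable mask");
        }
        int selected = 0;
        for (boolean use : useVariable) {
            if (use) {
                selected++;
            }
        }
        double[] inputs = new double[selected];
        int k = 0;
        for (int i = 0; i < record.length; i++) {
            if (useVariable[i]) {
                inputs[k++] = record[i];
            }
        }
        return inputs;
    }
}
```

Encoding the variable mask in the same chromosome lets a single search tune the network and choose its inputs at the same time.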
The method is shown in Figure 1. Initially, one dataset (without failures) is given to the GA. The GA uses these data to learn the patterns of the climatic variables and to search for the best ANN settings for that specific data. This is achieved by creating several neural networks with different parameters. The networks created are evaluated, and those with greater precision have a higher chance of being selected. After several iterations, the chosen ANN is used to perform gap filling on other datasets that have failures. Finally, the dataset with failures is fixed.
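The statement that more precise networks have a higher chance of being selected can be illustrated with fitness-proportional (roulette-wheel) selection. The source does not state which selection operator Mannga uses, so the operator, the error-to-fitness mapping, and all names in this Java sketch are assumptions.

```java
import java.util.Random;

/** Sketch of fitness-proportional selection; the operator and the fitness mapping are assumptions. */
final class RouletteSelectionSketch {

    /** Maps an error such as the MAE to a fitness in (0, 1]; a lower error gives a higher fitness. */
    static double fitness(double error) {
        return 1.0 / (1.0 + error);
    }

    /** Returns the index of one candidate, chosen with probability proportional to its fitness. */
    static int select(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) {
            total += f;
        }
        double threshold = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (threshold < cumulative) {
                return i;
            }
        }
        return fitness.length - 1; // guards against floating-point round-off
    }

    public static void main(String[] args) {
        // Errors of three hypothetical candidate networks.
        double[] errors = {1.5, 4.0, 9.0};
        double[] fit = new double[errors.length];
        for (int i = 0; i < errors.length; i++) {
            fit[i] = fitness(errors[i]);
        }
        System.out.println("Selected candidate: " + select(fit, new Random()));
    }
}
```

Candidates with a larger error can still be picked occasionally, which keeps some diversity in the population across iterations.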

Experimental setup
Simulations were performed in order to evaluate the performance of Mannga. The datasets used were obtained from AmeriFlux, which provides continuous measurements from forests, grasslands, wetlands, and croplands in North, Central and South America (Boden et al., 2013). We also evaluated a dataset from INMET. Three sites were chosen from AmeriFlux and one from INMET. The quality of several variables was not good, with invalid and missing data. Therefore, only variables and months with a minimum level of quality were selected to test the performance of Mannga. More information about the datasets is shown in Table 1.
Meteorological data often vary strongly over the annual cycle. Therefore, estimating values of this type of data requires a specific period of data in order to build good models. Leauthaud et al. (2017) considered only the 30 days closest to the gap to perform gap filling. In this way the data to be processed are similar, which increases the probability of a good estimate. Staub et al. (2017) present another advantage of using only a specific amount of data, namely a decrease in the computational effort needed to build models. For these reasons, it is a good approach to select only a small sample of meteorological measurements (1 to 3 months) to perform gap filling. We do the same in this work for each dataset.
The processing time varies depending on the amount of data and the computer used. In these tests, a dual-core computer with only 1 GB of RAM was used, and the Mannga gap filling method took approximately 19 minutes to process each month of data.
For each site, several variables were selected to perform gap filling. The accuracy of Mannga was checked by simulating gaps in the data series. Three simulations were run for each dataset:
• 5% of failures inserted randomly, to test regular scenarios in the dataset.
• 10% of failures inserted in sequence, to test the accuracy of the method when gaps persist for a long period of time.
• 30% of failures inserted randomly, to test the behavior of the method when the dataset contains many gaps.
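The three scenarios above can be reproduced with a small gap simulator. The Java sketch below is an illustration under stated assumptions: the class and method names are hypothetical, the number of gaps is obtained by rounding to the nearest integer, and the sequential failures are inserted as one contiguous block at a random position. The source does not specify either of these choices.

```java
import java.util.Random;

/** Sketch of gap simulation for accuracy tests; names, rounding and block placement are assumptions. */
final class GapSimulatorSketch {

    /** Number of records to mark as missing for a given fraction (rounded to the nearest integer). */
    static int gapCount(int records, double fraction) {
        return (int) Math.round(records * fraction);
    }

    /** Marks `count` distinct, randomly chosen positions as missing (partial Fisher-Yates shuffle). */
    static boolean[] randomGaps(int records, int count, Random rng) {
        int[] index = new int[records];
        for (int i = 0; i < records; i++) {
            index[i] = i;
        }
        boolean[] missing = new boolean[records];
        for (int i = 0; i < count; i++) {
            int j = i + rng.nextInt(records - i);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
            missing[index[i]] = true;
        }
        return missing;
    }

    /** Marks one contiguous block of `count` positions as missing, starting at a random offset. */
    static boolean[] sequentialGaps(int records, int count, Random rng) {
        boolean[] missing = new boolean[records];
        int start = rng.nextInt(records - count + 1);
        for (int i = start; i < start + count; i++) {
            missing[i] = true;
        }
        return missing;
    }

    public static void main(String[] args) {
        int records = 2928;
        Random rng = new Random();
        boolean[] fivePercent = randomGaps(records, gapCount(records, 0.05), rng);
        boolean[] tenPercentBlock = sequentialGaps(records, gapCount(records, 0.10), rng);
        boolean[] thirtyPercent = randomGaps(records, gapCount(records, 0.30), rng);
        System.out.println(fivePercent.length + " " + tenPercentBlock.length + " " + thirtyPercent.length);
    }
}
```

Since the original values at the masked positions are known, the estimates produced by a gap filling method can be compared directly against them.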
To compare the accuracy of Mannga, the same tests were performed with two other methods: Average (commonly used because of its simplicity) and Multiple Linear Regression.

Mannga implementation
To facilitate the use of the gap filling method, Mannga was developed in the Java programming language, and all the complex procedures involving Artificial Intelligence were abstracted internally. A complex process such as gap filling can therefore be performed with a few function calls. Code 1 is an example of the procedure to perform gap filling. Some parameters can be set to improve the performance of the method. The accepted error is one of these parameters and is set on lines 1 and 2. The other parameters mainly control the ANN and the GA. On lines 4 and 5 the method is created and configured. After the initial configuration, the structure must be trained. Line 6 shows that a single command loads the data (given the file name, the number of sensors, the column containing the failures, and whether the file has a header in the first line) and trains the method to recognize the patterns in the data. Finally, line 8 collects the results of the gap filling, and lines 9 and 10 show each estimated value.
As can be seen, running Mannga is not a difficult task. The Mannga implementation can also be easily incorporated into other software, even if the developer has no knowledge of the techniques used by the method.

Test results
On the first site, one month of data from the New Jersey station was used, containing 2928 records (without failures) measured every 15 minutes. Failures were inserted into these records, randomly or in sequence, in the variables incoming shortwave radiation, net radiation, humidity and temperature: 146 (5% random), 293 (10% in sequence), and 878 (30% random). The results of the processing are shown in Table 2 with their respective mean absolute error (MAE).
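For reference, the MAE over n simulated gaps is the mean of |observed - estimated| at the gap positions. The Java sketch below computes it from the original series, the gap-filled series, and a mask of the simulated gaps; the class name and the mask representation are assumptions made for this example.

```java
/** Sketch of the mean absolute error (MAE) over simulated gaps; names and mask layout are assumptions. */
final class MeanAbsoluteErrorSketch {

    /**
     * Averages |observed - estimated| over the positions flagged as simulated gaps.
     * Positions that were never masked are ignored, since no estimate was needed there.
     */
    static double mae(double[] observed, double[] estimated, boolean[] gap) {
        double sum = 0.0;
        int n = 0;
        for (int i = 0; i < observed.length; i++) {
            if (gap[i]) {
                sum += Math.abs(observed[i] - estimated[i]);
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("No simulated gaps to evaluate");
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[] observed = {10, 20, 30, 40};
        double[] estimated = {11, 20, 27, 40};
        boolean[] gap = {true, false, true, false};
        System.out.println(mae(observed, estimated, gap)); // (1 + 3) / 2
    }
}
```

The MAE is expressed in the unit of the variable, so values for different variables, such as radiation and temperature, are not directly comparable.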
On the second site, two months of data from the Florida station, measured every 20 minutes, were used, with 4289 records. The failures inserted were: 214 (5% random), 429 (10% in sequence), and 1287 (30% random). The variables chosen were incoming shortwave radiation, net radiation, temperature, humidity and carbon concentration. The results of the processing are shown in Table 3 with their respective mean absolute error (MAE).
On the third site, six months of data from the Kansas station, collected every 30 minutes, were used, and 11307 records were processed, with 565 (5%) random failures, 1131 (10%) sequential failures, and 3392 (30%) random failures inserted to test how the proposed method handles multiple failures. The variables used to perform the gap filling were incoming shortwave radiation, temperature, soil temperature, carbon concentration and carbon flux. The results of the processing are shown in Table 4 with their respective mean absolute error (MAE).
On the last site, three months of hourly data from the station at Rio Grande do Sul, Brazil, were used, and 2112 records were processed, with 105 (5%) random failures, 211 (10%) sequential failures, and 633 (30%) random failures inserted to test the proposed method. Temperature and humidity were used to perform gap filling. The results of the processing are shown in Table 5.
The gap-filled values were estimated from the values of other sensors, obtained in the same place and at the same time as the detected failures. The GA, in addition to determining the configuration parameters of the ANN, also evaluates which of the available sensors should be used as input for training the neural network. This is relevant because a sensor that represents a particular climatic variable may behave in a totally different way from the climatic variable being estimated, affecting the accuracy of the simulation.
The results showed that Mannga performed well with different climatic variables. Sensors such as air temperature and soil temperature obtained errors of around 1.42. The carbon flux also obtained good results in the experiments (lowest error 2.04). However, sensors such as incoming shortwave radiation and net radiation had poor results (109.88 for the Kansas dataset), with MAE values far from the average. The robustness of Mannga was observed in all simulations, i.e., its performance and behavior were uniform across the different scenarios.
It was also observed that, in the experiments using data from the Kansas and Florida sites, the carbon concentration variable needed only one sensor to estimate the missing value: carbon flux and temperature, respectively. Unfortunately, the estimates of carbon concentration did not have good accuracy (9.88 on average). It may be possible to improve the precision by using other climatic variables in the data series. New tests should be performed in the future to verify this.
Regarding the processing time needed to train the method, for the biggest dataset, with 6 months of data, the average training time was 67 minutes and 11 seconds. This is a large difference in processing time compared with the statistical methods, as can be seen in Table 6, Table 7, and Table 8. Even so, it is an acceptable time for processing this amount of data.

Comparison with other methods
In order to evaluate the performance of Mannga, other gap filling methods were tested with the same datasets. The results can be seen in Table 6, Table 7 and Table 8, which show the MAE obtained in each test with Mannga, the Average method and the Multiple Linear Regression (MLR) method.
In the simulation with 5% of random failures, Mannga was better than both the Average and MLR methods in only two cases (incoming shortwave radiation and net radiation at the Florida site). Mannga was better than the MLR method in all cases except when there were just a few failures in the data series. The Average method proved to be very successful in this scenario.
In the simulation with 10% of failures in sequence, Mannga was better than the other methods in ten cases. Good precision was obtained with several variables, such as incoming shortwave radiation, net radiation, humidity and carbon concentration. In all these tests, Mannga was better than the Average method.
In the last simulation, with 30% of random failures, Mannga showed consistent results. It was the best method in three cases and the second best in all the other tests. Therefore, Mannga can be used in scenarios where there are many failures in the dataset. In general, Mannga proves to be a good option for gap filling meteorological data.

Mannga public availability
As mentioned, Mannga was implemented in the Java programming language. It was included in the FICSED framework and can be downloaded from the CEDA website as free software. The website provides the documentation necessary to use the method.

Conclusions
In this paper we propose a novel method for gap filling meteorological data called Mannga. The great advantage of this method is its flexibility in handling different types of meteorological data, adjusting its structure for each dataset. Another advantage is the possibility of selecting the best sensors to estimate the missing value, which increases the accuracy and saves processing time. In addition, if failures occur in sequence, for example gaps lasting hours, days or even months, the values can still be estimated, provided that the other sensor variables contain valid data for the period of the failure.
The main disadvantage of the method is the time needed to process the data. While Mannga takes minutes to perform the gap filling, other statistical methods take just seconds. On the other hand, Mannga achieved higher accuracy than the other methods mainly when failures occur in sequence in the dataset.
In general, the tests performed to evaluate the proposed method achieved good results. Therefore, combined with its public availability, the product of this work is expected to assist several research projects in the meteorological area by making meteorological data series more consistent.