Comparison of data mining models applied to a surface meteorological station

This paper presents the application of data mining techniques for pattern identification obtained from the analysis of meteorological variables and their correlation with the occurrence of intense rainfall. The used data were collected between 2008 and 2012 by the surface meteorological station of the Polytechnic Institute of Rio de Janeiro State University, located in Nova Friburgo RJ, Brazil. The main objective is the automatic prediction related to extreme precipitation events surrounding the meteorological station location one hour prior its occurrence. Classification models were developed based on decision trees and artificial neural networks. The steps of consistency analysis, treatment and data conversion, as well as the computational models used are described, and some metrics are compared in order to identify their effectiveness. The results obtained for the most accurate model presented a rate of 82. 9% of hits related to the prediction of rainfall equal to or greater than 10 mm h-1 one hour prior its occurrence. The results indicate the possibility of using this work to predict risk events in the study region.


INTRODUCTION
Flooding and landslides are some of the consequences related to high precipitation rates.Predicting these volumes automatically and in advance has been the focus of several studies (AMENT et al., 2011;ALFIERI et al., 2012;KNEIS et al., 2012;VERKADE et al., 2013).
According to NOAA (2010), a warning system for heavy rains must be differentiated when compared with other systems related to hydrometeorological events.Most current alert systems are focused on predicting events that involve a broader area; however, more effective warning systems are needed to cover smaller geographical areas (HAYDEN et al., 2007).Borga (2008) shares this view by suggesting that warning systems for heavy rains should be developed on a local scale, since the meteorological phenomena that cause very heavy rains usually have scales of less than 100 km 2 .Kobiyama et al. (2006) argue that the main factors that cause disasters must be continuously monitored.The data must, in parallel, feed a model capable of simulating phenomena in real time, so that, when the system identifies the possibility of occurrence of a critical condition, alerts are issued and the withdrawal of the population and assets from the risk location begins.
According to Fayyad et al. (1996), Knowledge Discovery in Databases (KDD) is a non-trivial process of identifying valid, new, potentially useful and understandable patterns, embedded in the data.This process involves three phases: data collection and preparation; data mining, with the application of techniques that allow the extraction of patterns; and, finally, the post-processing, where the validation of the results is done.
Works like those developed by Onwubolu et al. (2007), Petre (2009), Salvador et al. (2009), Olaiya and Adeyemo (2012), Pessoa (2004), Pessoa et al. (2012), Joshi et al. (2015), Taksande and Mohod (2015) and Yadav and Khatri (2016), can be cited as examples of the application of KDD in climate-related studies.In a recent study, Krishna (2015) reviews meteorological forecasting surveys that apply different data mining models in KDD processes.This work, developed in the city of Nova Friburgo, located in the mountain region of the State of Rio de Janeiro in Brazil, aims to apply the KDD process to identify patterns of correlation between different meteorological variables that can indicate, with one hour in advance, high precipitation rates.In January 2011, the city was affected by an event considered one of the biggest climatic tragedies in Brazil (MEDEIROS; BARROS, 2011).

MATERIAL AND METHODS
The work used the historical records of the automatic surface meteorological station maintained by the Center for Technology in the Environment (Cetema), located at the Polytechnic Institute of the Rio de Janeiro State University (IPRJ) building, in the central region of Nova Friburgo.These records were collected from November-2008to April-2012.It was decided to create models that would allow the prediction of rainfalls equal to or greater than 10 mm in the hour following the collection of meteorological data.In situations where there is a large accumulation of rainfall in a short span of time, this level of precipitation may be a risk factor for the population.
Therefore, these models should be a relevant contribution to the authorities responsible for the prevention of natural accidents, and can be applied for forecasting and issuing alerts.
This work included the tasks of data collection, data preparation, data mining and the validation of the results obtained, aiming at the identification of patterns related to the occurrence of high precipitation rates.

Data collection and preparation
The IPRJ weather station is an automatic surface station that collects the meteorological data (temperature, humidity, wind speed and direction, atmospheric pressure, solar radiation and precipitation) every 10 minutes using its sensors.Data are sent automatically to the servers of IPRJ Information Technology Laboratory (LTI).This data is stored in a database and a website (CETEMA, 2009) has been built since 2009.This website provides many facilities, including glossary, statistics and graphs.This station is located at latitude -22º 17 ' 10,59", longitude -42º 32' 31,86" at an altitude of 846 meters above sea level.
Because it is an automatic station, the IPRJ weather station has a data logger that stores all the information collected by the sensors.This ensures that the data will be saved, even if the connection to the server is lost for some reason.This redundancy made it possible, for example, that records referring to the heavy rains that hit the region in January 2011 were stored, despite the loss of connection to the database in that period, allowing subsequent access to those records.Table 1 presents the variables provided by the IPRJ meteorological station sensors.
One of the tasks implemented in the preparation of data was the grouping of the records in larger time intervals.As the focus of analysis was on hourly variations, the records, originally divided into 10-minute time intervals, were grouped in one-hour intervals.In order to avoid the insertion of noise into the values, as a consequence of the grouping process, it was established that the accumulated precipitation value would be stored (precipitation attribute) while all other attributes would store their averages at the respective time.The original number of 165539 records, corresponding to the observations made by the meteorological station between the period of 10/13/2008 and 04/27/2012, was reduced by 83.5%, to 27232 records.
Works such as that of Redman (2001) and Wang et al. (2001) present different aspects related to data treatment and the guarantee of data quality.In this sense, inconsistent information in the databases was corrected so that the quality of the models was not compromised.Records that indicated a possible failure in the collected data by the station, such as negative rainfall values, or off-range precipitation values (e.g.records with rainfall indices of 425 mm h -1 ) were deleted.The same procedure was adopted in all variables of the meteorological station.Registers having any of the variables not filled (null value) were considered inconsistent and were also excluded.More details on the process can be found in Charles (2015).
It is known that precipitation caused by frontal displacements is usually preceded by changes in atmospheric pressure (RODRIGUES et al., 2004).Thus, understanding the relationship between pressure and the occurrence of precipitation is very important.Within this perspective, it became necessary to create new attributes that indicated the changes suffered by the meteorological variables in the analysed period of time.In addition to the creation of attributes which represented the meteorological variations at the same measuring hour, new attributes were created expressing the behaviour of such variables one, two and three hours prior to measuring.The updating of the created attributes was done through the execution of SQL queries (Structured Query Language), as detailed in Charles (2015).
In the present work, it was defined that, in circumstances such as a rainy sequence that causes soil flooding, an accumulation of precipitation equal to or greater than 10 mm in the interval of one hour should be considered a threat.Therefore, a last variable was created, here denominated intense rain, indicating if the accumulation of precipitation would exceed this threshold in the next hour.This binary variable indicated whether there would be precipitation values equal to or greater than 10 mm h -1 or values less than 10 mm h -1 -It was used as a target variable in the classification algorithms used, since the research's main objective is to predict the occurrence of precipitations with accumulation equal to or greater than 10 mm in the hour after the measurement of the variables.The prediction would allow the generation of alerts for the population, related to the occurrence of risk events.
Table 2 presents the attributes generated from the originally existing ones (Table 1) and which were used as input to the data mining process.In the table are presented several attributes related to the hour before the measurement of the variables (e.g.rain_1hourbefore); however, attributes referring also to the period of 2 and 3 hours prior to the observation were created (e.g.rain_2hoursbefore, rain_3hoursbefore), although they are not presented in the table.

Data mining
The second phase of the KDD process consists of data mining.In this stage we used the Weka (Waikato Environment for Knowledge Analysis), open source software developed in Java, consisting of a set of implementations of algorithms, corresponding to different techniques of data mining (BOUCKAERT et al., 2010).
Identifying interesting patterns of correlation between data is the primary goal of data mining.Two classification models were used in this work: decision trees and artificial neural networks.
According to Han et al. (2012), classification is the process of finding models that describe and distinguish data classes.
Classes represent the values of a given attribute named as the target attribute.For the case in question, the target attribute was related to the accumulated rainfall in the hour after the measurement of the meteorological variables, having a value indicating rainfall equal to or greater than 10 mm h -1 or rainfall less than 10 mm h -1 .Thus, the goal was to identify the combination of attributes that led to the classification of precipitation as intense (> = 10 mm) in the hour after its registration.Average humidity variation in the last three hours average_temperature_variation Average temperature variation in the last three hours average_solarradiation_variation Average solar radiation variation in the last three hours average_pressure_variation Average pressure variation in the last three hours average_winddirection_variation Average wind direction variation in the last three hours average_windspeed_variation Average wind speed variation in the last three hours intense_rain Binary attribute, indicating the occurrence of precipitation greater than/equal 10 mm h -1 or less than 10 mm h -1 in the hour after the measurement.Target attribute for the classification algorithms.

Decision trees
A decision tree is a tree-shaped structure, where each inner node denotes a test, related to a variable (attribute).
Each branch refers to the test result; and each leaf node stores a class label.The top node in a tree is its root (HAN et al., 2012).In the present work the target class refers to the attribute intense_rain.
The decision tree construction procedure is called induction.There are many decision tree induction algorithms; most of them adopt a recursive and top-down approach, that is, from the root node towards the leaf nodes.Among these, we find the Hunt Algorithm, which serves as the basis for many other induction algorithms.According to Tan et al. (2009), this algorithm consists basically of two steps: 1) If all records belong to the same class, then associate this class with a leaf node; 2) If the set contains records that belong to more than one class, then: -Select an attribute to partition records into smaller subsets; -Create a child node for each value of this attribute and distribute the records to these nodes according to the value they satisfy; -Apply the algorithm recursively to each child node.
Weka provides several decision tree algorithms in its environment.We used the Decision Tree algorithm J48, similar to the known induction algorithm called C4.5 (details in WITTEN et al., 2011).

Artificial neural networks
An artificial neural network simulates the biological neural system in a simplified way, constructing mathematical models capable of learning, generalizing and making associations.Artificial neural networks, like the human brain, have a parallel structure, composed of processing units, or neurons, connected by communication channels associated with weights.The function of these weights is the same as the dendrites, responsible for the synapses in the human brain.The weights, when have their values alternated during the stimuli, influence the result of the output signal (TAFNER, 1998).Neural networks can be composed of several layers, being formed by: -Input layer: receives the input data to the neural network; -Intermediate layers: are responsible for processing; -Output layer: presents the final result processed by the neural network.
The multilayer perceptron (MLP) was used in our work.It is a neural network that has several layers, being an input layer, one or more hidden layers and an output layer, which are connected through their nodes, to which the weights are assigned (HAYKIN, 1999).Further details regarding the MLP elaborated in the present application can be found in Charles (2015).

Class imbalance
It should be noted that there was a large difference between the number of records classified as intense precipitation (> = 10 mm h -1 ) compared with the number of the other class (<10 mm h -1 ).The former contained a much lower number of occurrences (99 records out of a total of 27232), due to their own characteristics related to the history of precipitation.According to Han et al. (2012) this situation, known as class imbalance can compromise the accuracy of the classificatory model.
To reduce the problem of class imbalance, we used the SMOTE algorithm, available in the weka tool.This algorithm adds to the set a number of fictitious records with behavior similar to the records of the class with lower number of occurrences (oversampling).More details about the algorithm can be found in Blagus and Lusa (2013).

Classifier performance evaluation
A classificatory model is composed of the learning phases, where a classifier algorithm is applied on a set of training data; and test, which aims to evaluate the effectiveness of the results through a set of test data.The training set is used to construct the classification model (decision tree or MLP).Subsequently, this model is applied to the test set, containing records with unknown class labels.Here the Holdout Method, known as the Test-Training method, was adopted.In this method 2/3 of the data are destined for training; the remaining base, corresponding to 1/3 of the data, is used for model validation.
The effectiveness of the classifier represents the percentage of observations from the test set that are correctly classified by the model.If this efficacy is high, the classification model is considered efficient and can be used to classify new cases.
For a binary classification problem, where the target attribute contains only two values, it is usual to denote the classes as positive and negative.As previously mentioned, the target attribute used in the model corresponded to the precipitation in the hour after the measurement of the variables, which could be classified as intense precipitation, with values equal to or greater than 10 mm h -1 , or not (precipitation less than 10 mm h -1 ).
An important measure is that which indicates the number of true positives (TP), and rresents the number of true occurrences correctly classified.In the present case, the true positives would be the cases extracted from the test database correctly classified as from intense precipitations.Another important indicator is the number of true negatives (TN).These would be the cases correctly classified as not occurrence of intense precipitations, that is, the forecast obtained by the application of the model would be similar to the information of the instance collected in the test base, which would indicate no rainfall equal to or greater than 10 mm in the next hour.
Measures concerning false positives (FP) and false negatives (FN) would indicate inaccuracies in the predictions made by the classification model, representing, respectively, cases of badly anticipated intense precipitations and the failure to predict intense rainfall occurrences, when in fact they should be foreseen.

Charles et al.
The following measures of performance evaluation were collected and presented as results in the software Weka: (1) (2) where As can be seen in Equations 1 to 4, the F-measure is given by the harmonic average between precision and recall.Given that the harmonic mean between two numbers tends to have a value closer to the smallest of the numbers, then larger values of the F-measure indicate higher values of both recall and precision, characterizing better model performance.
Here it was also applied the ROC (Receiver Operating Characteristics) analysis, a graphical method for evaluating, organizing and selecting classifiers based on their performances.The space where the ROC is plotted is two-dimensional for a two-class problem.In the axis of the abscissa (x) is plotted the value of the true positive rate (TP Rate) and in the axis of ordinates the value of the false positive rate (FP Rate), defined previously in Equations 1 and 2. The performance of a classifier is evaluated by means of the ROC curve, analyzing how distant it is from the diagonal y = x, where the classifier presents a stochastic behavior (Prati et al., 2008).In addition, through the ROC curve you can define an evaluation measure, which is the area below the curve (ROC Area).In practical terms, the closer this number is to the unit, the better the classifier performance.More details on ROC analysis can be found in Majnik and Bosnić (2013), Fawcett (2006) and Tan et al. (2009).

RESULTS AND DISCUSSION
Decision tree models (J48 algorithm) and neural networks (MLP) were developed and tested, initially using records containing a large imbalance between classes.Subsequently, the SMOTE method was applied to the training base, which allowed the use of records containing classes in a balanced way, for the construction of the models.Table 3 presents the values of the main evaluation metrics, obtained from the test base, being related to the class that indicates the occurrence of intense precipitations (> = 10 mm) in the hour after the measurement.These data were extracted from the generated models, allowing the comparative analysis of the results related to the different methods.
According to the data of Table 3, it is verified that the algorithms J48 and MLP behaved very poorly, when applied to the original database.It can be noticed a great imprecision in the forecast of intense precipitations, from the rate of true positive (TP Rate), with only 2.9% of hits for the J48 algorithm and no adjustment for the MLP.It should be noted that the true positive and false positive rates related to the MLP algorithm applied to the original (unbalanced) database had zero value, since there was not even any prediction of intense precipitation made by the model when applied to the test set.
It can also be observed that the application of the SMOTE algorithm, which added to the data set effective records with similar behavior to the records of the class that presented lower number of occurrences, gave a significant improvement in the results of both models.The MLP obtained a rate of 82.9% of hits related to the forecast of intense rainfall (> = 10 mm), higher than the rate of 71.4% obtained by the J48 algorithm.
Figures 1 and 2 show the graphs indicating the application of the MLP and J48 models, integrated to the SMOTE algorithm to the test base, which considers a period of 42 months, comprising the months of November 2008 to April 2012.Looking at the graphs is possible to verify the actual occurrences of intense rains and the correct forecasts made during the evaluated 42 months.
Despite to present a higher precision in the forecast of precipitation equal or higher than 10 mm the MLP algorithm applied to the balanced base presented higher number of false alerts (FP Rate equal to 12.6%) when compared to the algorithm J48 (FP rate equal to 9.9%).
According to Andrews, Diederich and Tickle (1995), artificial neural networks can be considered black-box processing models, that is, despite solving highly complex problems, they do not provide adequate interpretation of the results due to the incomprehensibility of model generated.Decision trees, on the other hand, have an intuitive representation, facilitating the interpretation of the model.In this perspective, a rule extracted from the decision tree generated by the J48 algorithm will be presented with the application of the SMOTE method.The presentation of a single rule allows to illustrate the correlation between some microscale variables obtained in the surface meteorological station and the occurrence of heavy rains.Further details on the various rules extracted from the decision tree can be found in Charles (2015).The rule presented here is in the form X => Y, where X is composed of the different meteorological variables that imply Y, the latter representing a situation of intense precipitation in the hour after the measurement of the variables.• Average temperature 3 hours before the measurement >16.4; solar radiation variation 2 hours before the measurement < 22.5; it was recorded any precipitation in the current hour = not; average pressure 1 hour before the measurement < 901.3; average temperature 1 hour before the measurement > 18.9; it was recorded any precipitation 1 hour before = yes => intense precipitation.
As can be observed in the above rule, if no precipitation is occurring at the moment of the measurement, but 1 hour before rainfall is registered and, concurrently, the average temperature and pressure recorded in the previous hour is, respectively, higher than 18.9 °C and lower than 901.3 mb.If, in addition, the average temperature in the preceding three hours and the solar radiation variation in the preceding two hours are respectively above 16.4 °C and below 22.5 W m -2 , the model will forecast precipitation equal or higher than 10 mm.It should be noted that this rule referred to 96 instances of the database, 93 being correctly classified and 3 incorrectly classified by the model.
Analyzing Figures 1 and 2, it can be observed that the events studied here are concentrated between the months of October and March.There is only one occurrence of rains larger than 10 mm outside these months, precisely on 05/04/2010.It can be seen that neither of two models was able to correctly predict the event that occurred on that date.It is believed that this fact is due to the reduced number of records in the database containing variables related to the occurrence of intense rains outside the period of october-march.In this context, the constructed models were unable to identify patterns related to events occurring outside  this period.When analyzing the above-mentioned rule and, more specifically, the variable indicating the temperatures at the time of data collection and three hours before, it can be seen that temperatures (higher than 18.9 ºC and 16.4 °C, respectively), are not characteristics in the month of May in the city of Nova Friburgo.In short, it is understood that the models were adjusted based on the seasonality of the occurrences, not being able to predict situations outside the period characterized by the occurrence of intense rains.
It should be noted that the models developed in the present work differ in relation to other models related to the forecast of extreme rains due to the use of micro scale data collected in a surface meteorological station.More recent works, such as those by Ruivo et al. (2015), who study extreme precipitation events in Santa Catarina and Amazonia, or Yucel and Onen (2014), focusing on events of the same category in different regions of Turkey, or even from Pessoa et al. (2012), who tried to predict occurrences of severe convective events in three mini-regions of the Brazilian territory, use as base mesoscale data, involving a broader area.It is believed that the contribution of the present proposal is to enable the more precise elaboration of forecasts of extreme phenomena, considering regions with smaller area of coverage.

CONCLUSIONS
The present work had the objective of analyzing the data recorded by a surface meteorological station in the period between 2008 and 2012 to extract patterns of behavior of the meteorological variables that indicated the formation of precipitations equal or greater than 10 mm for the hour after the collection of the data.
Among the obtained conclusions, it can be mentioned that the class balancing carried out by the SMOTE algorithm made the efficiency in the identification of the cases of intense precipitations increase considerably; on the other hand, the number of false positives, i.e. false alerts, also increased considerably.It is believed, however, that from the perspective of population safety that may potentially be affected by the precipitations being predicted, the emission of false alarms has fewer consequences than the omission of correct alarms.
Another finding was that both models were not able to make correct forecasts outside the rainy season.It is believed that this fact would not have so much relevance, since between the months of April and September does not occur a great accumulation of rains.Failure to predict precipitation in excess of 10 mm h -1 during these months would not have as much impact as in the period from October to March, characterized by a higher accumulation of precipitation and, consequently, a greater probability of flooding or landslides as consequences of heavy rains.
The obtained results showed the satisfactory predictive capability of the models used here, motivating their future incorporation to the servers that collect the data of the meteorological station used in this study.Thus, automatic predictions could be made, with an advance of one hour, related to extreme events of precipitation in the surroundings of the place where the meteorological station is installed.These forecasts could support decision-making by government agencies, such as Civil Defense, in addition to preventing the population in general, with the availability of information, in real time, on the site www.clima.iprj.uerj.br.It is also expected that the work may contribute to the development of new research related to meteorological forecasting based on microscale data analysis and to the development of more effective alert systems, considering smaller areas.

Figure 1 .
Figure 1.Predictions of intense rains compared with actual occurrences (MLP after SMOTE application).

Figure 2 .
Figure 2. Predictions of intense rains compared with actual occurrences (J48 after SMOTE application).

Table 2 .
Set of attributes created from existing attributes.

Table 3 .
Metrics related to the different classification methods.