1. Introduction
Research in agrometeorology depends on complete meteorological data collected from weather stations, remote sensors or climate models without spatial or temporal missing data.
The formation of a solid meteorological database often requires reconstruction of time series, which involves quality control methods, including gap filling or data imputation (^{Vicente-Serrano et al., 2010}). In most cases, filling gaps in daily rainfall data is a difficult task. Rainfall data observed at various locations during different times typically show intrinsic spatial and temporal variations.
A typical example would be a network of weather stations in which data are collected at regular, daily, weekly, monthly or yearly intervals. Data analysis must consider the spatial dependence of the seasons, but also that observations at each station are usually not independent from each other, but form a time series. Therefore, temporal and spatial correlations have to be considered in the analysis. Articles of ^{Ruiz-Cárdenas et al. (2009)} studying pest distributions, ^{Lou and Obradovic, (2011)} developing more accurate spatio-temporal predictors, ^{Haworth and Cheng (2012)} defining a non-parametric spatio-temporal kernel regression model to forecast the future unit journey time values of road links, ^{Li and Parker (2008)} developing a novel method to estimate missing observations in wireless sensor networks, Lau et al. (2014) defining a novel approach for diagnosing mis-specifications of a general spatio-temporal transmission model, ^{Franciscon et al. (2008)} proposing strategies applied to the analysis of citrus leprosis incidence, through the use of spatio temporal autologistic model, have this concern.
Agriculture is the human activity most dependent on weather and climate conditions. Weather conditions affect all stages of agricultural activities and climate adversities often lead to serious social impacts and huge economic losses that are often difficult to quantify. As adverse weather conditions are common and difficult to control, agriculture constitutes a high-risk activity (^{Pereira et al., 2002}). Therefore, climate monitoring and weather forecasting are becoming increasingly important for agricultural decisions.
Due to the usually-sequential nature of the calculations in crop simulation models, interruptions or lack of data at any point in a time series of a crop cycle prevents completion of the simulation of that cycle or harvest. Therefore, such errors, will often result in the loss of results for an entire crop cycle or year of simulation, at a given site.
Meteorological databases maintained by several Brazilian institutions such as the Institute of Meteorology -INMET and the Weather Center and Climate Studies -CPTEC, for example, often have missing data. This requires that series with gaps be rebuilt for end-user applications and further analysis.
A number of approaches have been used to produce complete time series including multiple discriminant analysis (^{Young, 1992}) nearest neighbor analysis (^{Vicente-Serrano et al,. 2010}), regression methods (^{Lo Presti et al., 2010}), geostatistical methods (^{Bajat et al., 2013}), multiple linear regressions or neural networks (^{Fowler et al., 2007}).
Often, the primary interests in the analysis of spatio-temporal data is a prediction of development time of response variable in a given spatial domain (^{Lasinio et al., 2007}). In recent years, there has been an increase in research on statistical models and techniques to solve this problem.
Spatio-temporal models have been successfully applied in several areas such as hydrology (^{Rouhani and Myers, 1990}; ^{Amisigo and Giesen, 2005}), meteorology (^{Soares et al., 2014}; ^{Haslett, 1989}) and environmental systems (^{Goodall and Mardia, 1994}; ^{Mardia et al., 1998}; ^{Fassò and Cameletti, 2009}, ^{2010}).
These models can be represented in state-space form and their parameters can be estimated using a Kalman filter (^{Cressie and Wikle, 2002}). However, for the most common configuration, where the model parameters are unknown, the standard approach uses the EM algorithm (Expectation-Maximization) to estimate the model parameters (^{Shumway and Stoffer, 1982}).
In this work, a spatio-temporal model was used to estimate daily rainfall data. We used data from weather stations located in two rainfall homogeneous zones in Brazil. The results were compared in two specific days with geo-statistical kriging and cokriging techniques.
2. Material and Methods
We used rainfall data from several weather stations, in two homogeneous areas in Brazil, identified according to ^{Keller Filho et al. (2005)}, obtained through the Agritempo System for all Brazilian regions (^{Embrapa Informática Agropecuária, 2014}). Agritempo is an agro-meteorological monitoring system that allows users to access weather and agro-meteorological information from various Brazilian municipalities and states via the Internet.
The homogeneous zones are identified according to the similarity of their rainfall probability distributions and delimited using the hierarchical cluster analysis obtaining 25 rainfall homogeneous zones in Brazil. Two homogeneous zones were chosen. The first zone covers the states of São Paulo within a rectangle with latitude ranging from −23.0 to −20.5 and longitude of −49.5 and −47.5 with 103 weather stations. The second zone included 51 weather stations located in the northeast of Brazil, within the rectangle defined with latitude ranging from −14.0 to −8.0 and longitude between −46.5 to −43.0. The first homogeneous zone is a climatic transition zone with average rainfall. In the second homogeneous area dominates the semi-arid tropical climate of low rainfall.
The spatio-temporal model for each set of data from each zone was adjusted using programs developed in R language (^{R Development Core Team, 2011}) with the support of the Stem library (Spatio-temporal models in R) (^{Cameletti, 2009}). In order to improve the estimates of the model parameters, the altitude of the stations was used as a covariate.
The spatio-temporal model used is discussed following ^{Fassò and Cameletti (2009)}. Let Z (s, t) be a scalar spatio-temporal process observed at time t and geographical location s. Let Z_{t} = {(s_{1}, t), …, Z(sn, t)} be a dataset at time t for n geographic locations s_{1}, …, s_{n}. Moreover, let Y_{t} = {Y_{1}(t), …, Y_{p} (t)} be a vector of p dimension of a not-observed temporal process at time t with p ≤ n. The hierarchical three stage model for t = 1,…, T is defined as follows:
In Eq. (1), the error ε_{t} is introduced so that U_{t} is regarded as a smoothed version of the spactio-temporal process Z_{t}. In Eq. (2), the not-observed spatio-temporal process U_{t} is defined as the sum of three components: the X_{t} matrix of observed covariates for the time t for n locations, the latent spatio-temporal process Y_{t} and the ω_{t} error model. Finally, the Eq. (3) is modeled as an autoregressive process where G is the matrix of transition and η_{t} is the innovation error. The errors ε_{t}, ω_{t} and η_{t} have zero mean and are time-independent and independent from each other. Substituting Eq. (2) into Eq. (1) results in the following hierarchical two stages model:
Equations (4) and (5) are the state-space model equations (^{Durbin and Koopman, 2001}; ^{Carvalho et al, 2011}; ^{Chui and Chen, 2009}). Eq. (4) is the equation of measurements and Eq. (5) is the equation of state. The Y_{t} temporal process can be estimated using the Kalman filter or Kalman smoother.
In Eq. (4), the error
In geostatistics, the error
Among the known approaches to perform parameter estimations, one can cite the maximum likelihood techniques involving the use of scoring techniques or the New-ton-Raphson method for solving non-linear equations arising from the differentiation of the log likelihood function. The likelihood methods usually have several undesirable characteristics such as inversion of large Hessian matrices, the instability of the numerical maximization process and the resulting non-positive definite matrices (^{Fassò and Cameletti, 2009}, ^{2010}). To avoid these problems, the Stem library uses the EM algorithm (^{Shumway and Stoffer, 1982}; ^{Mclachlan and Krishnan, 1997}), which is widely used for problems with missing values, as it is the case for Eqs. (4) and (5) where the component of missing values is given by the latent variable Y_{t}.
Since the EM algorithm does not use a Hessian matrix in the log-likelihood function, it does not provide standard errors for use with the estimated parameters, as the New-ton-Raphson algorithm does. Hence, the bootstrap method is used mainly for the estimation of EM to obtain an estimate of the standard errors.
Bootstrap methods are computationally intensive methods of statistical analysis that uses simulation to calculate standard errors and confidence intervals. These methods are applicable to any level of modeling, and thus can be used in both parametric and non- parametric analysis (^{Efron and Tibshirani, 2003}).
To validate the values obtained using the spatio-temporal model, for the study zones, two dates were chosen, January 1, 2012 and May 1, 2012. For these fixed times, ordinary kriging and cokriging geostatistical techniques (^{Yamamoto and Landim, 2013}) were used with altitude as a covariate, to estimate missing values by cross-validation assuming that one of the sample elements, was not observed.
The first date corresponds to the rainy season in the first homogeneous zone of Brazil. In the second homogeneous zone, this date is during the dry season. The second date, represents the dry season in the first homogeneous zone of the study area and the rainy season for the second homogeneous zone. According to these criteria, the best interpolation for each variable, is one that has the lowest value for the Mean Squared Error (MSE), that is, the ratio of the squared difference between the observed value and the estimated value divided by the number of observations. It is expected that by setting two distinct dates, the best performance of the estimates are obtained by the spatio-temporal model.
It is quite common in verification studies to use the statistic SS (Skill Score) to summarize the quality of the forecasting system. This score quantifies the relative variation of the mean square error of the spatio-temporal model (MSE_{mod}) with regard to the kriging and cokriging (MSE_{krig} and MSE_{cockri}). Positive values of SS indicate that the model improved the predictions (^{Carvalho et al., 2011}; ^{Libonati et al., 2008}).
3. Results and Discussion
For 2012, 103 weather stations with complete data series were selected in the first homogeneous zone and 51 weather stations for the second homogeneous zone (365 days of precipitation values). Figure 1 shows the spatial distribution of rainfall observations used in this study by zone.
For the first and second homogeneous zones, random data samples of six weather stations were taken out and estimates of the same values for the spatio-temporal pattern and geostatistical interpolation using values of neighboring points were obtained. Estimates of daily rainfall for January 1, 2012 and May 1, 2012 using the spatio-temporal model, kriging and cokriging methods are shown in Tables 1 and 2 respectively.
Zone | MSE_{mod} | MSE_{krig} | SS_{1} (%) | MSE_{cokrig} | SS_{2} (%) |
---|---|---|---|---|---|
First | 19.77 | 26.15 | 24.41 | 24.04 | 17.77 |
Second | 4.60 | 5.44 | 18.51 | 5.62 | 18.20 |
Zone | MSE_{mod} | MSE_{krig} | SS_{1} (%) | MSE_{cokrig} | SS_{2} (%) |
---|---|---|---|---|---|
First | 17.75 | 26.15 | 32.12 | 24.04 | 26.16 |
Second | 2.99 | 5.44 | 82.31 | 5.62 | 46.83 |
The statistics Skill Score (SS_{1} and SS_{2}) (Eqs. (9) - (10), respectively) are used to quantify, as a percentage, the improvement that occurred in the estimation of daily rainfall data for the two dates, using the spatio-temporal model in relation to the estimates obtained by kriging and cokriging. For the first date in the first zone, the estimate obtained by the spatio-temporal model was 24.41% better than the estimate obtained by kriging (SS_{1}) and 17.77% better than the estimate obtained by cokriging (SS_{2}). For the second date in the same zone, the spatio-temporal model was 32.12% and 26.16% better. The statistical skill score has been used in several studies with good results, as in ^{Carvalho et al. (2011)} in Kalman filter and correction of the temperatures estimated by PRECIS model.
For the second zone, the estimates obtained by the spatio-temporal model were always better than the estimates obtained by kriging and cokriging. In all cases, the mean squared errors obtained by the model (MSE_{mod}) were considerably less than the mean squared errors obtained by kriging (MSE_{krig}) and cokriging (MSE_{cokrig}).
In both zones the result obtained by the spatio-temporal model is always better indicating that neither the year nor the station influences the results. However, the results of the three methods for second zone are considerably lower compared to the first zone. ^{Amisigo and Giesen (2005)} also obtained better results by estimating gaps in daily riverflow series using spatio-temporal modelling. As well as, ^{Franciscon et al. (2008)} when she modeled the incidence of citrus leprosis using spatio-temporal autologistic model and ^{Ruiz-Cárdenas et al. (2009)} using the same model to estimate better values when the temporal series is inflated with zeroes and missing values.
4. Conclusions
The application of a spatio-temporal model produced better estimates of daily precipitation values compared with results obtained by geostatistical kriging and cokriging models for the study period.
The spatio-temporal model proved to be a versatile technique, adapting to different seasons, and should be considered as an alternative to fill gaps in time series.
The mean squared errors obtained by the model were considerably less than the mean squared errors obtained by kriging and cokriging.
For the two time periods studied, predictions using the spatio-temporal model were more than 17% better for first zone and more than 26% better for second zone, compared to the forecasts obtained by geostatistical techniques.