Spatio-Temporal Modeling of Data Imputation for Daily Rainfall Series in Homogeneous Zones

Spatio-temporal modelling is an area of increasing importance in which models and methods have often been developed to deal with specific applications. In this study, a spatio-temporal model was used to estimate daily rainfall data. Rainfall records from several weather stations, obtained from the Agritempo system for two climatic homogeneous zones, were used. Rainfall values obtained for two fixed dates (January 1 and May 1, 2012) using the spatio-temporal model were compared with the geostatisticals techniques of ordinary kriging and ordinary cokriging with altitude as auxiliary variable. The spatio-temporal model was more than 17% better at producing estimates of daily precipitation compared to kriging and cokriging in the first zone and more than 18% in the second zone. The spatio-temporal model proved to be a versatile technique, adapting to different seasons and dates.


Introduction
Research in agrometeorology depends on complete meteorological data collected from weather stations, remote sensors or climate models without spatial or temporal missing data.
The formation of a solid meteorological database often requires reconstruction of time series, which involves quality control methods, including gap filling or data imputation (Vicente-Serrano et al., 2010).In most cases, filling gaps in daily rainfall data is a difficult task.Rainfall data observed at various locations during different times typically show intrinsic spatial and temporal variations.
A typical example would be a network of weather stations in which data are collected at regular, daily, weekly, monthly or yearly intervals.Data analysis must consider the spatial dependence of the seasons, but also that observations at each station are usually not independent from each other, but form a time series.Therefore, temporal and spatial correlations have to be considered in the analysis.Articles of Ruiz-Cárdenas et al. (2009) studying pest distri-butions, Lou and Obradovic, (2011) developing more accurate spatio-temporal predictors, Haworth and Cheng (2012) defining a non-parametric spatio-temporal kernel regression model to forecast the future unit journey time values of road links, Li and Parker (2008) developing a novel method to estimate missing observations in wireless sensor networks, Lau et al. (2014) defining a novel approach for diagnosing mis-specifications of a general spatio-temporal transmission model, Franciscon et al. (2008) proposing strategies applied to the analysis of citrus leprosis incidence, through the use of spatio temporal autologistic model, have this concern.
Agriculture is the human activity most dependent on weather and climate conditions.Weather conditions affect all stages of agricultural activities and climate adversities often lead to serious social impacts and huge economic losses that are often difficult to quantify.As adverse weather conditions are common and difficult to control, agriculture constitutes a high-risk activity (Pereira et al., 2002).Therefore, climate monitoring and weather forecasting are becoming increasingly important for agricultural decisions.
Due to the usually-sequential nature of the calculations in crop simulation models, interruptions or lack of data at any point in a time series of a crop cycle prevents completion of the simulation of that cycle or harvest.Therefore, such errors, will often result in the loss of results for an entire crop cycle or year of simulation, at a given site.
Meteorological databases maintained by several Brazilian institutions such as the Institute of Meteorology -INMET and the Weather Center and Climate Studies -CPTEC, for example, often have missing data.This requires that series with gaps be rebuilt for end-user applications and further analysis.
Often, the primary interests in the analysis of spatiotemporal data is a prediction of development time of response variable in a given spatial domain (Lasinio et al., 2007).In recent years, there has been an increase in research on statistical models and techniques to solve this problem.
These models can be represented in state-space form and their parameters can be estimated using a Kalman filter (Cressie and Wikle, 2002).However, for the most common configuration, where the model parameters are unknown, the standard approach uses the EM algorithm (Expectation-Maximization) to estimate the model parameters (Shumway and Stoffer, 1982).
In this work, a spatio-temporal model was used to estimate daily rainfall data.We used data from weather stations located in two rainfall homogeneous zones in Brazil.The results were compared in two specific days with geostatistical kriging and cokriging techniques.

Material and Methods
We used rainfall data from several weather stations, in two homogeneous areas in Brazil, identified according to Keller Filho et al. (2005), obtained through the Agritempo System for all Brazilian regions (Embrapa Informática Agropecuária, 2014).Agritempo is an agro-meteorological monitoring system that allows users to access weather and agro-meteorological information from various Brazilian municipalities and states via the Internet.
The homogeneous zones are identified according to the similarity of their rainfall probability distributions and delimited using the hierarchical cluster analysis obtaining 25 rainfall homogeneous zones in Brazil.Two homogeneous zones were chosen.The first zone covers the states of São Paulo within a rectangle with latitude ranging from -23.0 to -20.5 and longitude of -49.5 and -47.5 with 103 weather stations.The second zone included 51 weather stations located in the northeast of Brazil, within the rectangle defined with latitude ranging from -14.0 to -8.0 and longitude between -46.5 to -43.0.The first homogeneous zone is a climatic transition zone with average rainfall.In the second homogeneous area dominates the semi-arid tropical climate of low rainfall.
The spatio-temporal model for each set of data from each zone was adjusted using programs developed in R language (R Development Core Team, 2011) with the support of the Stem library (Spatio-temporal models in R) (Cameletti, 2009).In order to improve the estimates of the model parameters, the altitude of the stations was used as a covariate.
The spatio-temporal model used is discussed following Fassò and Cameletti (2009).Let Z (s, t) be a scalar spatio-temporal process observed at time t and geographical location s.Let Z t = {(s 1 , t), ..., Z(s n , t)} be a dataset at time t for n geographic locations s 1 , ..., s n .Moreover, let Y t = {Y 1 (t), ..., Y p (t)} be a vector of p dimension of a not-observed temporal process at time t with p £ n.The hierarchical three stage model for t = 1, ..., T is defined as follows:

U X KY
In Eq. ( 1), the error e t is introduced so that U t is regarded as a smoothed version of the spactio-temporal process Z t .In Eq. (2), the not-observed spatio-temporal process U t is defined as the sum of three components: the X t matrix of observed covariates for the time t for n locations, the latent spatio-temporal process Y t and the w t error model.Finally, the Eq. ( 3) is modeled as an autoregressive process where G is the matrix of transition and h t is the innovation error.The errors e t , w t and h t have zero mean and are time-independent and independent from each other.Substituting Eq. ( 2) into Eq.( 1) results in the following hierarchical two stages model: Y GY Equations ( 4) and ( 5) are the state-space model equations (Durbin and Koopman, 2001;Carvalho et al., 2011;Chui and Chen, 2009).Eq. ( 4) is the equation of measurements and Eq. ( 5) is the equation of state.The Y t temporal process can be estimated using the Kalman filter or Kalman smoother.
In Eq. ( 4), the error e t t t = + r r w e is normally distributed with zero mean and variance and covariance matrix r r r S G where r G is the spatial covariance function defined as In geostatistics, the error s e 2 is the nugget effect for the spatial process (s, t) for a fixed t.The vector of parameters to be estimated is { , , , , , Among the known approaches to perform parameter estimations, one can cite the maximum likelihood techniques involving the use of scoring techniques or the Newton-Raphson method for solving non-linear equations arising from the differentiation of the log likelihood function.The likelihood methods usually have several undesirable characteristics such as inversion of large Hessian matrices, the instability of the numerical maximization process and the resulting non-positive definite matrices (Fassò andCameletti, 2009, 2010).To avoid these problems, the Stem library uses the EM algorithm (Shumway and Stoffer, 1982;Mclachlan and Krishnan, 1997), which is widely used for problems with missing values, as it is the case for Eqs. ( 4) and ( 5) where the component of missing values is given by the latent variable Y t .
Since the EM algorithm does not use a Hessian matrix in the log-likelihood function, it does not provide standard errors for use with the estimated parameters, as the Newton-Raphson algorithm does.Hence, the bootstrap method is used mainly for the estimation of EM to obtain an estimate of the standard errors.
Bootstrap methods are computationally intensive methods of statistical analysis that uses simulation to calculate standard errors and confidence intervals.These methods are applicable to any level of modeling, and thus can be used in both parametric and non-parametric analysis (Efron and Tibshirani, 2003).
To validate the values obtained using the spatio-temporal model, for the study zones, two dates were chosen, January 1, 2012 and May 1, 2012.For these fixed times, ordinary kriging and cokriging geostatistical techniques (Yamamoto and Landim, 2013) were used with altitude as a covariate, to estimate missing values by cross-validation assuming that one of the sample elements, was not observed.
The first date corresponds to the rainy season in the first homogeneous zone of Brazil.In the second homogeneous zone, this date is during the dry season.The second date, represents the dry season in the first homogeneous zone of the study area and the rainy season for the second homogeneous zone.According to these criteria, the best interpolation for each variable, is one that has the lowest value for the Mean Squared Error (MSE), that is, the ratio of the squared difference between the observed value and the estimated value divided by the number of observations.It is expected that by setting two distinct dates, the best performance of the estimates are obtained by the spatio-temporal model.
It is quite common in verification studies to use the statistic SS (Skill Score) to summarize the quality of the forecasting system.This score quantifies the relative variation of the mean square error of the spatio-temporal model (MSE mod ) with regard to the kriging and cokriging (MSE krig and MSE cockri ).Positive values of SS indicate that the model improved the predictions (Carvalho et al., 2011;Libonati et al., 2008).

Results and Discussion
For 2012, 103 weather stations with complete data series were selected in the first homogeneous zone and 51 weather stations for the second homogeneous zone (365 days of precipitation values).Figure 1 shows the spatial distribution of rainfall observations used in this study by zone.
For the first and second homogeneous zones, random data samples of six weather stations were taken out and estimates of the same values for the spatio-temporal pattern and geostatistical interpolation using values of neighboring points were obtained.Estimates of daily rainfall for January 1, 2012 and May 1, 2012 using the spatio-temporal model, kriging and cokriging methods are shown in Tables 1 and 2 respectively.
The statistics Skill Score (SS 1 and SS 2 ) (Eqs. ( 9) -( 10), respectively) are used to quantify, as a percentage, the improvement that occurred in the estimation of daily rainfall data for the two dates, using the spatio-temporal model in relation to the estimates obtained by kriging and cokriging.For the first date in the first zone, the estimate obtained by the spatio-temporal model was 24.41% better than the estimate obtained by kriging (SS 1 ) and 17.77% better than the estimate obtained by cokriging (SS 2 ).For the second date in the same zone, the spatio-temporal model was 32.12% and 26.16% better.The statistical skill score has been used in several studies with good results, as in Carvalho et al. (2011) in Kalman filter and correction of the temperatures estimated by PRECIS model.
For the second zone, the estimates obtained by the spatio-temporal model were always better than the estimates obtained by kriging and cokriging.In all cases, the mean squared errors obtained by the model (MSE mod ) were considerably less than the mean squared errors obtained by kriging (MSE krig ) and cokriging (MSE cokrig ).
In both zones the result obtained by the spatio-temporal model is always better indicating that neither the year nor the station influences the results.However, the results of the three methods for second zone are considerably lower compared to the first zone.Amisigo and Giesen (2005) also obtained better results by estimating gaps in daily riverflow series using spatio-temporal modelling.As well as, Franciscon et al. (2008) when she modeled the incidence of citrus leprosis using spatio-temporal autologistic model and Ruiz-Cárdenas et al. (2009) using the same model to estimate better values when the temporal series is inflated with zeroes and missing values.

Conclusions
• The application of a spatio-temporal model produced better estimates of daily precipitation values compared with results obtained by geostatistical kriging and cokriging models for the study period.
• The spatio-temporal model proved to be a versatile technique, adapting to different seasons, and should be considered as an alternative to fill gaps in time series.
• The mean squared errors obtained by the model were considerably less than the mean squared errors obtained by kriging and cokriging.
• For the two time periods studied, predictions using the spatio-temporal model were more than 17% better for first zone and more than 26% better for second zone, compared to the forecasts obtained by geostatistical techniques.
Figura 1 -Spatial distribution of rainfall observations for first zone and second zone.
effect, and q = spatial process.
Tabela 1 -Mean Square Error for estimates of gaps obtained from the spatio-temporal model (MSE mod ), kriging (MSE krig ) and cokriging (MSE cokrig ) on January 1, 2012 -SS 1 and SS 2 are the Skill-Score statistics when comparing the model with kriging and cokriging, respectively.-Mean Square Error for estimates of gaps obtained by the spatio-temporal model (MSE mod ), kriging (MSE krig ) and cokriging (MSE cokrig ) on May 1, 2012 -SS 1 and SS 2 are the Skill-Score statistics when comparing the model with kriging and cokriging, respectively.