A SARIMA forecasting model to predict the number of cases of dengue in Campinas, State of São Paulo, Brazil

Introduction: Forecasting dengue cases in a population by using time-series models can provide useful information to facilitate the planning of public health interventions. The objective of this article was to develop a forecasting model for dengue incidence in Campinas, southeast Brazil, using the Box-Jenkins modeling approach. Methods: The forecasting model for dengue incidence was fitted in R software using the seasonal autoregressive integrated moving average (SARIMA) model. We fitted a model based on the reported monthly incidence of dengue from 1998 to 2008, and we validated the model using the data collected between January and December of 2009. Results: SARIMA (2,1,2)(1,1,1)12 was the model with the best fit to the data. This model indicated that the number of dengue cases in a given month can be estimated from the number of dengue cases occurring one, two and twelve months prior. The predicted values for 2009 are relatively close to the observed values. Conclusions: The results of this article indicate that SARIMA models are useful tools for monitoring dengue incidence. We also observe that the SARIMA model is capable of representing with relative precision the number of cases in the following year.


INTRODUCTION
Dengue is a disease of great importance for public health in tropical and sub-tropical areas of the world. The disease is transmitted by the bites of infected Aedes mosquitoes, and its symptoms, which include headache and muscle and joint pain, are very similar to those of other fever-causing illnesses. It is estimated that between 50 and 100 million cases of dengue fever occur each year [1,2], and about two-thirds of the world's population live in areas infested with dengue vectors [3]. In the first decade of the 21st century, Brazil ranked among the countries with the highest dengue incidence in the world [4]. In Brazil, more than three million cases were reported from 2000 to 2005, comprising approximately 70% of reported dengue fever cases in the Americas [5].
Dengue can be caused by any of the four serotypes of dengue virus, designated DEN-1, DEN-2, DEN-3, and DEN-4. In Brazil, the first laboratory-confirmed dengue outbreak was reported in 1981-1982 in the State of Roraima [6], and no further dengue activity was reported until 1986, with the introduction of DEN-1 in the State of Rio de Janeiro [7]. The DEN-2 serotype was introduced in 1990 in Rio de Janeiro during a period of DEN-1 serotype circulation [8]. In the following years, the DEN-2 serotype spread to other Brazilian regions, with more severe clinical presentations [9]. In 1994, DEN-3 virus was reintroduced in the Americas after an absence of 16 years, and in 2000, it was introduced in Rio de Janeiro, causing a large epidemic of dengue fever [10,11]. The first report of DEN-4 in Brazil was in the State of Roraima in 1982 [12].
Mathematical and statistical models can contribute substantially to the understanding of the dynamics of dengue transmission and of trends in the number of cases of the disease. Recently, statistical tools such as time series analyses [13,14] have been used by several authors to describe and forecast the number of cases of dengue in specific populations [15-19]. Among these models, the seasonal autoregressive integrated moving average (SARIMA) model is useful when the time series data exhibit seasonality, that is, periodic fluctuations that recur with about the same intensity each year. This characteristic makes the SARIMA model adequate for studies concerning monthly dengue data, given that the number of dengue cases in a population tends to be subject to seasonal variations, with a maximum in the rainy season and a minimum during the dry season.
The objective of this study was to develop time series models to forecast the monthly dengue incidence in Campinas, a city located in the State of São Paulo, Brazil, on the basis of reported incidence rates available from 1998 to 2008; these models were then validated using the data collected between January and December of 2009. Forecasting dengue cases in a population using time-series models can provide useful information to facilitate the planning of public health interventions.

METHODS
Campinas is a city of nearly one million inhabitants located in the southeastern part of Brazil, in the State of São Paulo. Campinas is 100 km from the City of São Paulo, which is the state capital and the largest metropolitan area in Brazil. Economic and demographic growth in recent decades has transformed the city into an important industrial and commercial center. The city has an international airport, several universities and an extensive public health network. According to the 2000 Brazilian Demographic Census (IBGE Foundation), Campinas has a Gini index of relative inequality of 42% and a poverty incidence of 9.8%. In Campinas, dengue transmission was identified for the first time in 1996 [20].
The monthly number of confirmed cases of dengue in Campinas was obtained from the Municipal Health Secretariat of Campinas (available at http://2009.campinas.sp.gov.br/saude/). The dataset was divided into two parts: the data observed from January 1998 to December 2008, which were used to develop the time series model, and the monthly number of dengue cases during the year 2009, which was used to validate the model.
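The split described above can be sketched in a few lines. This is a minimal illustration in Python (the study itself used R); the names `months`, `cases`, `train` and `valid` are hypothetical, and the counts are zero placeholders rather than the Campinas data.

```python
from datetime import date

# One entry per month, January 1998 .. December 2009.
months = [date(year, month, 1) for year in range(1998, 2010) for month in range(1, 13)]
cases = [0] * len(months)  # placeholder counts, NOT the reported data

# Fitting period: January 1998 to December 2008; validation period: 2009.
train = [(m, c) for m, c in zip(months, cases) if m.year <= 2008]
valid = [(m, c) for m, c in zip(months, cases) if m.year == 2009]

print(len(train), len(valid))  # 132 fitting months and 12 validation months
```

Holding out the final year in full keeps one complete seasonal cycle for validation.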
Let $Y' = (Y_1, Y_2, \ldots, Y_n)$ be a time series of data. A seasonal ARIMA (SARIMA) model [13,14,21,22] with S observations per period, denoted by SARIMA(p,d,q)(P,D,Q)$_S$, is given by

$$\phi(B)\,\Phi(B^{S})\,(1-B)^{d}\,(1-B^{S})^{D}\,Y_t = \theta(B)\,\Theta(B^{S})\,\varepsilon_t,$$

where $B$ is the backshift operator ($B^{k}Y_t = Y_{t-k}$), $\phi(B)$ and $\theta(B)$ are polynomial functions of order p and q, respectively, $\Phi(B^{S})$ and $\Theta(B^{S})$ are seasonal polynomial functions of order P and Q, respectively, that satisfy the stationarity and invertibility conditions, d is the number of differencing passes needed to stationarize the series, D is the number of seasonal differences and $\varepsilon_t$ are error terms assumed to be independent identically distributed random variables sampled from a distribution with mean zero and variance $\sigma^2_\varepsilon$. In time series analyses, the variables $\varepsilon_t$ are commonly referred to as white noise, and they are interpreted as an exogenous effect that the model is not able to explain. Considering the time series of monthly dengue incidence, this white noise can be, for example, an effect of climatic variables, occasional campaigns of prevention and education, the introduction or reintroduction of a dengue serotype in a susceptible population, or random factors.
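The roles of d and D can be made concrete with a small sketch of ordinary and seasonal differencing. It is written in Python for illustration only; the helper `difference` and the toy series `y` are hypothetical and do not come from the study.

```python
def difference(series, lag=1):
    """Apply the operator (1 - B^lag): y[t] - y[t - lag] for t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy monthly series: a linear trend plus a fixed 12-month pattern (illustrative only).
pattern = [0, 1, 3, 5, 3, 1, 0, 0, 0, 0, 0, 0]
y = [0.5 * t + pattern[t % 12] for t in range(48)]

# d = 1 removes the trend; D = 1 with S = 12 removes the repeating yearly pattern.
w = difference(difference(y, lag=1), lag=12)

print(len(w))                   # 48 - 1 - 12 observations remain
print(max(abs(v) for v in w))   # nothing is left of the trend or the seasonal pattern
```

Each pass shortens the series by its lag, so a monthly series loses 13 observations when d = 1 and D = 1.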
Thus, in the present article, we used the statistical software R [23] to fit SARIMA models to dengue incidence from 1998 to 2008 in Campinas using the Box-Jenkins modeling approach [24]. The adequacy of each model was verified by plots of the histogram and the autocorrelation function (ACF) of the standardized residuals and by the Ljung-Box test [25], which tests the hypothesis of no correlation across a specified number of time lags. The ACF of the residuals and the Ljung-Box statistics are useful for testing the randomness of the residuals. The Akaike information criterion (AIC) [26] was employed to compare the goodness-of-fit of different models. Lower AIC values indicate better fit.
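The two diagnostics named above have simple closed forms, sketched here in Python as an illustration rather than as the code used in the study. The function names are hypothetical. The Ljung-Box statistic over h lags is Q = n(n+2) Σ r_k² / (n − k), and the AIC is 2k − 2 ln L for k estimated parameters and maximized likelihood L.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]

def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    r = sample_acf(residuals, max_lag)
    return n * (n + 2) * sum(r[k - 1] ** 2 / (n - k) for k in range(1, max_lag + 1))

def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# A strongly alternating sequence is far from white noise, so Q is large.
alternating = [1, -1] * 10
print(ljung_box_q(alternating, max_lag=1))
```

Under the null hypothesis of uncorrelated residuals, Q is compared with a chi-square distribution; for residuals of a fitted model the degrees of freedom are usually reduced by the number of estimated ARMA parameters. The p-value needs a chi-square distribution function, which this standard-library sketch leaves out.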

RESULTS
Table 1 and panel (a) of Figure 1 show the monthly number of dengue cases in Campinas between 1998 and 2009. Observing the graph in panel (a) of Figure 1, we note a peak in the dengue incidence in 1998, followed by two non-epidemic years. In 2001 and 2002, there were two yearly peaks, followed by one small yearly peak and two non-epidemic years (2004 and 2005). The large number of cases observed in 2001 and 2002 coincides with the introduction of dengue virus serotype 3 (DEN-3). This virus serotype was introduced in 2000 [5,27], and it led to a large and severe epidemic of the disease in Brazil [28], with more than 1.2 million cases reported in 2001 and 2002 while DEN-1 and DEN-2 continued to circulate. A relatively large number of cases of dengue was observed in Campinas in 2007 (9,218 cases), again followed by two non-epidemic years (2008 and 2009). Considering the time series in Table 1, March and April are of particular interest, because these are the months with the highest number of dengue cases.
We took logarithms of the data in Table 1 to induce constant variance. Thus, $Y' = (Y_1, Y_2, \ldots, Y_n)$ is the vector of the natural logarithms of the monthly number of cases of dengue from 2000 to 2008, in which we added 1 to each count to avoid taking the logarithm of zero in months without reported cases of dengue. Considering a plot of the series $Y_1, Y_2, \ldots, Y_n$ against time (not shown here), we note that there is still some trend, but we should be able to obtain a more stationary series from first differencing. Thus, we consider d = 1. Panels (b) and (c) of Figure 1 show graphs of the estimated autocorrelation function (ACF) and partial autocorrelation function (PACF) of the transformed series using data from 1998 to 2008. The ACF of the logarithmically transformed series exhibits periodicity of length S = 12. This result was expected, because the dengue incidence shows a seasonal cycle. The PACF suggests that p should be equal to 2, given that partial autocorrelations are near zero at all lags that exceed 2, and the ACF suggests a moving-average component of order q equal to 2 or 3, given that its autocorrelations are close to zero at lags that exceed 3.
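The transformation ln(cases + 1) can be illustrated with a short Python sketch. The counts below are invented for illustration and are not taken from Table 1.

```python
import math

# Illustrative monthly counts only (NOT the Campinas series); note the zero months.
monthly_cases = [0, 3, 25, 400, 1200, 80, 0]

# ln(c + 1): defined for zero counts, and it compresses the large epidemic peaks.
log_cases = [math.log1p(c) for c in monthly_cases]

print(log_cases[0])                                  # a zero count maps to 0.0
print(max(monthly_cases) - min(monthly_cases))       # spread on the original scale
print(round(max(log_cases) - min(log_cases), 2))     # much smaller spread on the log scale
```

Without the added 1, months with no reported cases would have no finite logarithm and would have to be dropped or imputed.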
Table 2 shows values of AIC and the estimates of the variance $\sigma^2_\varepsilon$ for the SARIMA models fitted to the monthly number of cases of dengue from 2000 to 2008, considering different choices of p and q. Problems with convergence were encountered when using D = 0. Therefore, considering that one seasonal difference is usually sufficient (D = 1), we set D to 1 in all models in Table 2. The model with the lowest AIC value for this data set, and therefore the best-fit model, was SARIMA (2,1,2)(1,1,1)12 (Table 2). Considering this model,
After estimating the parameters of this model, we assessed its adequacy by analyzing its residuals. Figure 2 shows the standardized residuals, their histogram, the respective ACF graph and p-values for the Ljung-Box statistic. Panel (a) of Figure 2 suggests that the standardized residuals estimated from this model behave as an independent and identically distributed sequence with a mean of zero and a constant variance. The histogram in panel (b) of Figure 2 shows that the standardized residuals for the model approximate a normal distribution. In addition, the Kolmogorov-Smirnov test gives no reason to reject the assumption that the distribution of residuals is normal (p-value 0.21). The ACF of the residuals shown in panel (c) suggests that the autocorrelations are close to zero. This result means that the residuals did not deviate significantly from a zero-mean white noise process. Panel (d) shows p-values for the Ljung-Box statistic. Given the high p-values associated with the statistics, we cannot reject the null hypothesis of independence in this residual series. Thus, we can say that the SARIMA (2,1,2)(1,1,1)12 model fits the data well.
Out-of-sample predicted values for 2009 considering the SARIMA (2,1,2)(1,1,1)12 model are shown in Table 3, where we compare these values with the observed number of dengue cases. The predicted values are relatively close to the observed values; this result indicates that the model provides an acceptable fit to predict the number of dengue cases.

The authors declare that there is no conflict of interest.

DISCUSSION
In this study, the SARIMA (2,1,2)(1,1,1)12 model reflected the trend in the incidence of dengue in Campinas well. We showed that the number of dengue cases in a given month can be estimated from the number of dengue cases occurring 1, 2 (p = 2) and 12 (S = 12 and P = 1) months prior, and we found that a moving-average component of order q equal to 2 is adequate for the data. The highest peaks of the time series observed in Figure 1, panel (a), can be a direct consequence of the introduction or reintroduction of different serotypes, but we noted that the SARIMA model produced good estimates for each month, even though the time series contains periods with relatively large numbers of dengue cases. This result suggests that the model fits the data adequately, despite the introduction and reintroduction of different viral serotypes within the studied period.
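The dependence on months 1, 2 and 12 can be read off by multiplying out the autoregressive operators of a SARIMA(2,1,2)(1,1,1)12 model. The expansion below is a sketch under an assumed sign convention (autoregressive coefficients entering with minus signs; software packages differ), where $W_t = (1-B)(1-B^{12})Y_t$ denotes the differenced log series and the coefficient symbols are generic, not the fitted estimates.

```latex
\begin{align*}
\phi(B)\,\Phi(B^{12})
  &= (1 - \phi_1 B - \phi_2 B^{2})(1 - \Phi_1 B^{12}) \\
  &= 1 - \phi_1 B - \phi_2 B^{2} - \Phi_1 B^{12}
     + \phi_1\Phi_1 B^{13} + \phi_2\Phi_1 B^{14}, \\[4pt]
W_t &= \phi_1 W_{t-1} + \phi_2 W_{t-2} + \Phi_1 W_{t-12}
       - \phi_1\Phi_1 W_{t-13} - \phi_2\Phi_1 W_{t-14}
       + \theta(B)\,\Theta(B^{12})\,\varepsilon_t .
\end{align*}
```

The lags 1, 2 and 12 enter directly, and the multiplicative structure also brings in product terms at lags 13 and 14. Because $W_t = Y_t - Y_{t-1} - Y_{t-12} + Y_{t-13}$, the relationship on the scale of $Y_t$ involves further lags beyond these.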
When we used this model to produce out-of-sample predictions of the number of dengue cases in Campinas, we observed that the SARIMA model was capable of representing the number of cases in a subsequent year with relative precision.However, these predictions may not be credible for forecasting the number of dengue cases in epidemic years, when the observed monthly incidence is significantly higher than the expected number of new cases for that period.This large number of cases may be a consequence of the lack of immunity in the population, because many people in these circumstances are exposed to a dengue viral serotype for the first time.
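One way to make "relative precision" concrete is to back-transform the forecasts from the log scale and summarize the errors. The sketch below is in Python and uses invented numbers, not the values in Table 3. The function names are hypothetical, and the two error measures are illustrative choices, not measures reported by the study.

```python
import math

def back_transform(log_value):
    """Invert y = ln(cases + 1), giving cases = exp(y) - 1."""
    return math.expm1(log_value)

def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def root_mean_squared_error(observed, predicted):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Made-up numbers for illustration only; these are NOT the values in Table 3.
observed_cases = [10, 40, 90, 60]
forecast_log = [math.log1p(12), math.log1p(35), math.log1p(100), math.log1p(50)]
predicted_cases = [back_transform(v) for v in forecast_log]

print(mean_absolute_error(observed_cases, predicted_cases))
print(root_mean_squared_error(observed_cases, predicted_cases))
```

The root mean squared error weights large misses more heavily than the mean absolute error, so it is the more sensitive of the two when a forecast fails to anticipate an epidemic peak.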
These results indicate that statistical time series models should lead to a better understanding of the disease mechanism and that they can assist in the planning of public health programs and interventions.
In addition, considering the potential impacts of climate change on dengue transmission, more accurate predictions could be made by introducing meteorological variables such as temperature, pressure, humidity and rainfall into the model, and these variables should be taken into account in a future study. These variables are known to be associated with an increase in the number of available breeding places for Aedes aegypti and, with that, in the risk of dengue transmission.

This work was supported by FAEPA (Fundação de Apoio, Ensino, Pesquisa e Assistência, Hospital das Clínicas, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo). E.Z.M. had investigator grants from CNPq.

FIGURE 1 - (a) Monthly number of cases of dengue from 1998 to 2009 in Campinas, Southeast Brazil (Source: http://2009.campinas.sp.gov.br/saude/). Autocorrelation (b) and partial autocorrelation (c) functions calculated using the log-transformed number of cases of dengue from 1998 to 2008 in Campinas. The dashed horizontal lines are 95% confidence limits assuming a white noise input.