Predicting the Number of Cases of Dengue Infection in Ribeirão Preto, São Paulo State, Brazil, Using a Sarima Model

This study aimed to develop a forecasting model for the incidence of dengue in Ribeirão Preto, São Paulo State, Brazil, using time series analysis. The model was performed using the Seasonal Autore-gressive Integrated Moving Average (SARIMA). Firstly, we fitted a model considering monthly notifications of cases of dengue recorded from 2000 to 2008 in Ribeirão Preto. We then extracted predicted values for 2009 from the adjusted model and compared them with the number of cases observed for that year. The SARIMA (2,1,3) (1,1,1)12 model offered best fit for the dengue incidence data. The results showed that the seasonal ARIMA model predicts the number of dengue cases very effectively and reliably, and is a useful tool for disease control and prevention.


Introduction
Dengue is a disease of great importance to public health in tropical nations, particularly in Southeast Asia and Central and South America.It is caused by four serotypes of a flavivirus -DENV1, DENV2, DENV3 and DENV4 -classified on biological and immunological criteria.Dengue is transmitted between human hosts by several species of day-feeding mosquitoes, such as the Aedes aegypti.Infection can be asymptomatic or it can manifest as an undifferentiated febrile illness, known as dengue fever, characterized by symptoms including fever, headaches, myalgia and retro-orbital pain 1 .Some infections result in dengue hemorrhagic fever (DHF), a syndrome that, in its most severe form, can be life-threatening 2 .
By the final decade of the twentieth century, Ae. aegypti and the four dengue viruses had spread to nearly all countries of the tropical world, and tens of millions are infected annually 3 .In the 21 st century Brazil became the country with the most reported cases of dengue fever in the world 4 : more than three million cases were reported there from 2000 to 2005; that is approximately 70% of reported dengue fever cases in the Americas 5 .The Southeast region of Brazil -and, as a special case, the city of Ribeirão Preto -has been most affected by dengue 6,7,8,9 .
The first reported dengue outbreak in Brazil occurred in 1922, in Niteroi, Rio de Janeiro Sta-Cad.Saúde Pública, Rio de Janeiro, 27 (9):1809-1818, set, 2011 te 10 , and the first laboratory-confirmed dengue outbreak was reported in 1981-1982, in Roraima State 11 .After a period with no cases being diagnosed, the disease reappeared in Rio de Janeiro State in 1986, when the DENV1 virus was introduced 12 .In 1990, during a period of high DENV1 activity, the DENV2 virus serotype was isolated in Niteroi 13 , and the first cases of DHF were documented there 14 .Thus, a major dengue epidemic in Rio de Janeiro State was caused by the simultaneous circulation of DENV1 and DENV2, with a total of 140,000 reported cases 15 .In the following years, the DENV2 serotype spread to other regions of Brazil, with more severe clinical presentations 16 .In Ceará State, northeastern Brazil, DENV2 was first identified in 1994, at which time the first cases of DHF were notified 17 .The DENV3 serotype was first isolated in December 2000 in Nova Iguaçu, Rio de Janeiro State 18 , marking the start of a period of co-circulation of DENV1, DENV2 and DENV3.In 2002, the number of dengue cases increased in susceptible populations that had only experienced DENV1 and DENV2 epidemics 19 , and DENV3 virus later spread more broadly in Brazil.In 2008, DENV4-positive samples were obtained from patients in Amazonas State 20 , the first time this serotype was isolated in Brazil in 25 years.By the end of March 2010, the São Paulo State health authorities reported more than 34,000 cases of dengue 21 .In 2010, approximately 30,000 confirmed cases of dengue were reported in Ribeirão Preto, the largest outbreak to date in that municipality (see http://www.cve.saude.sp.gov.br/).
Statistical tools used in epidemiology to monitor and predict dengue and other infectious diseases have included time series analysis techniques 22 , such as autoregressive integrated moving average (ARIMA) models 23,24,25 .In the epidemiological literature, recent articles have used ARIMA models to describe the temporal pattern of diseases such as influenza 26 , malaria 27 and dengue 28,29,30,31,32 .
In this paper, we study the performance of the seasonal ARIMA model (SARIMA) in describing and predicting the monthly number of notified cases of dengue in Ribeirão Preto (São Paulo).Using dengue incidence data from 2000 to 2008, and the Box-Jenkins modeling approach 33 , we fit a SARIMA 34 model to dengue incidence, and then used the fitted model to out-of-sample predict dengue incidence for the year 2009.

Methods
Ribeirão Preto is a municipality in northeastern São Paulo State, Brazil (21°10'42" South latitude and 47°48'24" West longitude), with an economy based on agribusiness.The frequency of confirmed cases of dengue in Ribeirão Preto was obtained from the Divisão de Vigilância Epidemiológica of the Secretaria Municipal de Saúde de Ribeirão Preto (available at http://www.ribeiraopreto.sp.gov.br).The dataset includes the monthly number of cases, and the study period was from 2000 to 2009.
Given a stationary time series of data Y' = (Y 1 , Y 2 , …, Y n ), an autoregressive moving average (ARMA) model, denoted by ARMA (p,q), consists of two parts, an autoregressive (AR) part of order p and a moving average (MA) part of order q.Thus, the ARMA model of order p and q, denoted by ARMA (p,q), is given by where μ is a constant, φ' = (φ 1 , φ 2 ,…, φ p ) is a vector of autoregressive coefficients, θ' = (θ 1 , θ 2 ,…, θ q ) is a vector of moving average coefficients, and ε t are error terms assumed to be independent, identically-distributed random variables sampled from a distribution with mean equal to zero and variance σ 2 ε .In time series analyses, the variables ε t are commonly referred to as white noise, and they are interpreted as an exogenous effect that the model is not able to explain.Considering the time series of monthly dengue incidence, these white noises can be, for example, an effect of climatic variables, prevention and education campaigns, introduction/reintroduction of a dengue serotype in a susceptible population, or random factors.
If the time series show evidence of nonstationarity, the data can be stationarized by introducing difference operators in the model.The first difference operator is given by DY where B is the lag operator given by B k = Y t -k / Y t .Thus, we obtain the autoregressive integrated moving average (ARIMA), denoted by ARIMA (p,d,q), where d is the number of differencing passes.The mathematical form of the ARIMA (p,d,q) model is . Thus, an important issue in fitting an ARIMA model is to identify the appropriate order of differencing needed to stationarize the series.
A seasonal ARIMA model (SARIMA) with S observations per period, denoted by SARIMA (p,d,q)(P,D,Q) S , is given by are seasonal polynomial functions of order P and Q, respectively, which satisfy the stationarity and invertibility conditions.
In order to analyze the time series for dengue incidence in Ribeirão Preto over the years 2000 to 2008, we defined S = 12, given that we have 12 observations per year.Our first step used plots of the autocorrelation and partial autocorrelation functions 35,36 (correlograms of the time series) to identify possible values for the autoregressive or moving average components.The second step was to obtain maximum likelihood estimates for the parameters of the SARIMA models, according to the different values of p, d, q, P, D and Q. Thirdly, we verified the goodness of fit of each model by plotting the autocorrelation and partial autocorrelation of residuals, and by using the Ljung-Box test 37 .The fourth step compared the models by the Akaike information criterion (AIC) 38 , where the preferred model is the one with the lowest AIC value.Finally, we extracted predicted values for 2009 from the best SARIMA model, and compared them with the number of new cases observed in that year.
As a criterion for comparing the predictive ability of the models, let K be a measure defined by the sum of squared differences between predicted and observed values at each month, each divided by its respective predicted value.Thus, the preferred model is the one with the lowest K value.
All analyses were performed using R software (The R Foundation for Statistical Computing, Vienna, Austria; http://www.r-project.org) 39.

Results
The monthly numbers of cases of dengue notified in Ribeirão Preto from 2000 to 2009 are shown in Table 1.Note that 2001, 2006 and 2007 are years with large numbers of individuals with the disease, and a graphical description of monthly cases of the disease (Figure 1) identifies a peak of cases in March, April, and May.
Let X' = (X 1 , X 2 , …, X n ) = (8, 22, 31, 73,…, 3, 16) be the vector containing the monthly cases of dengue between the years 2000 and 2008, as shown in Table 1.In order to obtain a more stationary time series, let Y' = (log(X 1 +1), log(X 2 +1),…, log(X n +1)), instead of the original values X.In order to avoid taking natural logarithms of zero values, we added 1 to the number of dengue cases reported in each month, given that there are months with no cases recorded.In the absence of mathematical formalisms, a time series is said to be station-ary if it oscillates around a constant mean value, and with a constant variance.However, a graph of the series Y 1 , Y 2 ,…, Y n against time (not shown in this article) describes a rising trend, but we ascertained that this series can be stationary after one difference operation.This suggests that it is appropriate to consider an order d = 1 in fitting the model to the data.
Figure 2 shows graphs of the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the transformed series.The shape of the ACF describes a seasonal effect with a period of S = 12 months.The PACF suggests that p should be equal to 2 or 3, given that partial autocorrelations are near to zero at all lags that exceed 3, and the ACF suggests a moving average of order q equal to 2 or 3, given that its autocovariances are close to zero at all lags that exceed 3.
Also for the SARIMA (2,1,3)(1,1,1) 12 model, Figure 3 shows the standardized residuals, their histogram and ACF graphs and p-values for the Ljung-Box 37 statistic.Panel (a) of Figure 3 suggests that the standardized residuals estimated from this model should behave as an independent and identically distributed sequence with mean zero and constant variance.Panel (b) suggests that the residuals are normally distributed (in addition, the p-value Kolmogorov-Smirnov test is 0.26, and we thus do not reject the null hypothesis of normality).The ACF of the residuals shown in Panel (c) suggests autocorrelations close to zero.This means that the residuals did not deviate significantly from a zero-mean whitenoise process.Panel (d) shows p-values for the Ljung-Box statistic.Given the high p-values associated with the statistics, we cannot reject the null hypothesis of independence in this residual series.We can therefore say that the model identified fits the data well.

Discussion
Efforts to model dengue incidence in various parts of the world have used statistical approaches for time series analysis.Wongkoon et al. 30 developed a SARIMA model on the monthly data collected between 2003 and 2006 in Northern Thailand, and found that the SARIMA (2,0,1) (0,2,0) 12 was appropriate to predict the number of cases of dengue hemorrhagic fever for 2007.Promprou et al. 29 forecasted the monthly number of dengue hemorrhagic fever cases in South-ern Thailand by an ARIMA (1,0,1) model.Silawan et al. 31 showed that a SARIMA (2,1,0)(0,1,1) 12 model was suitable to determine temporal patterns and forecast dengue incidence in Northeastern Thailand.Choudhury et al. 32 showed that a SARIMA (1,0,0)(1,1,1) 12 model was suitable for forecasting dengue incidence in Dhaka, Bangladesh.Luz et al. 28 developed a SARIMA (2,0,0) (1,0,0) 12 model for monitoring dengue incidence in Rio de Janeiro, Brazil, from 1997 to 2004.All these studies showed that the number of dengue cases in a given month can be estimated by the number of dengue cases occurring one (when p = 1) or two (when p = 2) months prior.In the present article considering data from Ribeirão Preto, we also showed that the number of dengue cases in a given month can be estimated from the number of dengue cases occurring one and two (p = 2) months prior, and twelve (S = 12 and P = 1) months prior, but we found that a movingaverage model of order q equal to 3 is suitable for the data between 2000 and 2008.
The results from this study show that the seasonal ARIMA model is a very effective and reliable predictive model for determining the number of dengue cases in a population, and is a useful tool for disease control and prevention.Allard 40 claims that ARIMA models are a useful tool for interpreting surveillance data, and that the usefulness of forecasting expected numbers of infectious disease reports consists not so much in detecting outbreaks or providing probability statements, but in giving decision makers a clearer idea of the variability to be expected among future observations.We found that the SARIMA model's predictions for 2009 agree reasonably well with the observed incidence of dengue.However, these out-of-sample predictions may not be credible for forecasting the number of dengue cases in epidemic years, when the observed monthly incidence is significantly higher than the expected number of new cases for the period.This large number of cases may be a consequence of a lack of immunity in a population exposed for the first time to a given dengue viral serotype.A dengue virus type 1 (DENV1) outbreak is known to have started in Ribeirão Preto in November 1990 7 , and DENV2 and DENV3 were introduced in São Paulo State in 1997 and 2003, respectively (see http:// www.cve.saude.sp.gov.br/).Thus, the highest peaks in the time series shown in Figure 1 may be a direct consequence of the introduction or reintroduction of different serotypes.Note, however, in Figure 4, that the SARIMA (2,1,3)(1,1,1) 12 model produced good estimates at each month, even though the time series contains periods with large numbers of dengue cases.Figure 4 shows that the model failed to estimate the number of dengue cases in April 2006, but in the following months the model again provided estimates with good precision.These results suggest that the out-of-sample forecast values for 2009 obtained from the SARIMA model are not subject to an effect due to the introduction or reintroduction of different dengue serotypes.
In addition, climate changes have potential impact on dengue transmission, and in a future study more accurate predictions should be made by introducing meteorological variables such as temperature, pressure, humidity and rainfall into the model.These variables are known to be associated with an increase in the number of available breeding places for Ae.aegypti and, accordingly, in the risk of transmission of dengue.

Figure 1 Number
Figure 1 Number of notifi ed cases of dengue between 2000 and 2009 in Ribeirão Preto, São Paulo State, Brazil.

Table 1
Number of recorded cases of dengue between 2000 and 2009 in Ribeirão Preto, São Paulo State, Brazil.

Table 3
Observed number of dengue cases in 2009 and corresponding out-of-sample predicted values obtained from seasonal autoregressive integrated moving average (SARIMA) models.Observed number of dengue cases in 2009 (represented by triangles) and the respective out-of-sample predicted values (represented by circles) obtained from seasonal autoregressive integrated moving average (SARIMA) (2,1,3)(1,1,1) 12 model.