Evaluating How the Social Restriction, the Government Response, the Health, and Economic Indices Affected the Prediction of the Number of Deaths Provoked by COVID-19 in Brazil Using Classical Statistical and Machine Learning Models

HIGHLIGHTS Thirty statistical and machine learning models have been used to predict COVID-19 in Brazil. Each model has been trained and tested 266 times. The time series of accumulated deaths produces better estimates than daily deaths. The Cubist nonlinear regression model provides better predictions of accumulated deaths. Abstract The COVID-19 death predictions are helpful for the formulation of public policies, allowing the use of more effective social isolation strategies with less economic and social impact. This article evaluates a wide range of forecasting methods to identify the best models for predicting cumulative and daily deaths caused by COVID-19 in Brazil, considering a forecast for a seven-day horizon. With the seven-day horizon, the predictions have more accuracy. The dataset is from Oxford Covid-19 Government Response Tracker. The jackknife resampling technique was implemented, thus providing an accurate estimate for evaluating the predictive capacity of the models. Each model was fitted with 266 jackknife samples considering 30-day training bases. The comparison between predictions was made using the average results, considering R2, MAPE, RMSE, and MAE. Models from different classes were adopted: 1 ETS, 4 ARIMA, 18 regression models, and 7 machine learning algorithms. The cumulative death models produce better results than daily deaths, as the cumulative death models are less influenced by time series components: cycle and seasonality. The best results for predicting daily deaths were attained by the Ridge regression method. The best results for predicting cumulative deaths were obtained by the Cubist regression method.


INTRODUCTION
Several cases of pneumonia patients have been associated with new human coronavirus disease (SARS-CoV-2, COVID-19) starting in December 2019 [1]. The virus demonstrated a large capacity for interhuman transmission spread rapidly worldwide, and became a pandemic with millions of deaths. Infected patients showed significantly varied symptoms, and their cases ranged from asymptomatic individuals to death. Studies and research on this new disease have become constant. Reports and articles with the prediction of cases and deaths emerged daily.
Virus outbreak forecasts for different countries are helpful for effectively allocating health resources and acting as an early warning system for government policymakers. The forecasts improve epidemiological surveillance and help stakeholders make timely decisions, allowing more specific social isolation strategies with less economic and social impact. Such responses can lead to correct political decisions that generate action protocols to contain future pandemics. Thus, the objective of this article is to predict deaths from COVID-19 in Brazil without the influence of vaccinating people and to guide public policy agents for future decision-making in other possible epidemics or pandemics.
Between 1927 and 1933, William Kermack and Anderson McKendrick created a model in which a fixed population (N) is considered with only three classes of individuals: Susceptible (S), Infected (I), and Recovered/Removed (R) [2][3][4]. Even though it was developed a long time ago, the SIR model is still widely used today in scientific articles, research, works, and studies in general about viruses and epidemics.
Each outbreak of viral infectious diseases exhibits specific patterns that need to be identified based on transmission dynamics. Machine learning algorithms have emerged as a new paradigm in recent scientific research due to their flexibility with data specifics. The versatility of this approach allowed its application in different contexts, from the forecast of financial variables to the analysis of sentiment in texts and medical applications [5].
Due to the highly complex nature of COVID-19, machine learning emerges as an effective tool to predict the outbreak been an alternative to the SIR model. An absolute novelty in forecasting outbreaks can be achieved by integrating machine learning and the SIR model [6]. Within a short period since the outbreak of COVID-19, advanced machine learning techniques have been used in the taxonomic classification of genomes, virus detection assay, and survival prediction in critically ill patients [7]. Machine learning methods are fundamental in screening, prediction, contact tracking, and drug development for epidemics [8].
Predictive models for the propagation of COVID-19 were developed using different characteristics such as climatic conditions (temperature and humidity), demographic data, health center data, et al. The experimental results show that climate variables are more relevant in predicting the mortality rate when compared to other census variables such as population, age, and urbanization. Thus, it could be concluded that temperature and humidity are essential characteristics for predicting the virus mortality rate. Furthermore, it is indicated that the higher the temperature value, the lower the number of infectious cases [9].
A COVID-19 outbreak prediction model was developed in Canada using state-of-the-art deep learning based on public datasets provided by the John Hopkins University and the Canadian Health Authority. Data patterns reveal that prompt and effective approaches taken by Canadian public health authorities to minimize human exposure have shown a positive impact compared to other countries. Based on the results, provinces that implemented social distancing guidelines before the pandemic had fewer confirmed cases [10].
Prediction models based on genetic algorithms have been developed for confirmed cases and deaths in India's three most affected states and across the country. The proposed models were developed from COVID-19 daily case reports published by the Government of India since the first lockdown in the country, which took place on March 24, 2020. The results found that the proposed genetic algorithm-based models are highly reliable for predicting COVID-19 time series in India. Models satisfy all external validation requirements and, therefore, can be used to predict future cases. Another relevant feature of genetic algorithm modeling is that it can work with less time series data and still provide reliable results [11].
The published works use a fixed training set and a fixed test set in a specific period in time series [11][12][13]. However, for a better assessment of the predictive capacity of the models, it is necessary to use many training and test sets. Thus, this work has as its differential the use of multiple sets of training and testing.
The jackknife method was programmed, generating a total of 266 training and test sets for each model, thus providing a more accurate estimate of the predictive capacity of the models.
Furthermore, the other works published for the time series prediction of COVID-19 use a minimal scope of predictive methods [6,[9][10][11][12][13]. This work has a differential in using a broad scope of forecasting methods, a total of 30 methods, making it possible to identify the methods that produce better predictions of daily and accumulated deaths. This work also differs from the others by using, in addition to the time series, the exogenous variables: cases; restriction index; government response index; health index; and economic index.

MATERIAL AND METHODS
In addition to non-pharmacological measures, containment measures were very used to contain the spread of the virus. Most countries recommended their population stay at home. Other measures used were: to do extra investments in the health and the economy, closed schools, airports, workplaces, etc. The database used was Oxford COVID-19 Government Response Tracker (OxCGRT) [14]. This database systematically collects daily information on various public policies that governments have adopted to respond to the pandemic, such as lockdown, travel restrictions, etc. This work was done with nine variables (one dependent and eight independent): • Dependent variable (Y): deaths.
The restriction index is the restriction adopted by the region, such as closing airports and other places. The government index is the response of the government to the pandemic of COVID-19. The health index is the investment done by the government in health policies. The economic index is the financial support that the government made during the pandemic, such as financial aid to people who have lost their jobs or cannot work. All variables are completely described in a global panel of the pandemic policies [14].
These variables were chosen because they are measures used by Oxford to analyze countries worldwide concerning the level of restriction against COVID-19. The lags of variables were selected according to the seven-day horizon, which offers greater precision. Furthermore, in an application of a time series study, it is interesting to initially use the period to 7 days forecast because of the small number of observations.
The period analyzed was from March 17, 2020, to January 20, 2021. It is because the first death by COVID-19 in Brazil was notified on March 17, 2020, and the vaccination started on January 17, 2021. All data analysis, modeling, and programming were conducted using the R [15] program, with the packages: stats [15], ggcorrplot [16], forecast [17], and caret [18].

Time series
The primary purpose of time series analysis is forecasting. This methodology allows forecasting future values through the present and past values [19]. Therefore, models are essential to provide the necessary support for statistical inference [20]. A time series can have four components: trend, cycle, seasonality, and residual. The trend describes the behavior of the variable over time. The cycle is a periodic fluctuation concerning the trend. Seasonality is a change that occurs in specific periods of a time series. The residual is represented by random fluctuations resulting from unexpected facts [19].

Exploratory data analysis
The first analysis of time series is visual. Some inferences and hypotheses can be suggested. The boxplot is a standardized way of displaying data distribution based on the five values: minimum, first quartile, median, third quartile, and maximum. Outliers can be displayed as individual points. This technique makes no assumptions about the statistical distribution involved in the data. The spaces between the different parts of the box indicate the degree of dispersion, the asymmetry in the data, and the outliers [21]. Scatter plots are used to observe relationships between two variables. Pearson's correlation measures the degree of linear association between variables [22].
Many statistical methods make assumptions that the data are from a population with a specific probability distribution. The characteristic of this distribution can be one of the purposes of the analysis. There are statistical tests responsible for determining the theoretical distribution of data. The following tests were used in this work: Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, Cramer-von Mises, Lilliefors, Pearson, and Shapiro-Francia to test whether the data follow the Gaussian distribution. In these tests, the null hypothesis (H0) is that data follow a normal distribution. The alternative hypothesis (H1) is that data do not follow a normal distribution [23].

ETS and Box-Jenkins models
The ETS model is a "special" class of exponential smoothing. The point forecasts produced by the models are identical if they use the same smoothing parameter values [24]. The characterization of the model following the terminology of [25] and [26] is done using a three-character string. The first letter indicates the type of error (A or M); the second letter indicates the type of trend (N, A, or M); and the third letter indicates the type of seasonality (N, A, or M). In all cases, N = none, A = additive and M = multiplicative.
The Box-Jenkins methodology consists of adjusting Autoregressive Integrated Moving Averages (ARIMA) models. The strategy for building the model is based on an iterative cycle. The stages of the interactive cycle are specification, identification, estimation, and diagnosis. In general, the postulated models are parsimonious, as they contain a small number of parameters, and the predictions obtained are pretty accurate [15]. The ARIMA models assume that the values of a time series have a dependency relationship where each value can be explained by the previous value of the series data [20]. The purpose of the Box-Jenkins methodology is to determine the three components that make up the structure: p (autoregressive parameters), d (differentiation processes), and q (moving average parameters), thus forming the ARIMA (p,d,q) [19].
Autocorrelation is the correlation of the variable X(t) with itself at the last instant X(t-k), which is called the time lag k. The Autocorrelation Function (ACF) measures the dynamics of the correlation between a variable and its lags. The Partial Autocorrelation Function (PACF) is a measure of the correlation between observations of a time series that are separated by k time units (X(t) and X(t-k)) [19,20].

Regression models
Seek to identify the relationship between the dependent and independent variables. This relationship can be linear or nonlinear. In regression models with cross-sectional data, the order of observations is irrelevant for the analysis. In time series, the order of the data is fundamental. A significant feature of this type of data is that neighboring observations are generally dependent over time, so it is interesting to analyze and model this dependence [20].
Typically, the data sets have many independent variables, so it is necessary to know which are relevant to explain the dependent variable. In these cases, mechanisms are needed to choose the best subset of independent variables to explain the dependent variable. For this, Regularization Methods are recommended. These methods incorporate a constraint into the model, limiting the model's coefficients and therefore selecting the most important independent variables [27].
Dynamic regression models are also called ARIMA models with exogenous variables (ARIMAX). In linear regression models, it is assumed that noise has zero mean, constant variance, normal distribution, and independence, thus having no serial correlation [20]. ARIMAX combines the dynamics of time series and the effect of explanatory variables. The dependent variable is explained by its lagged values and current and past values of exogenous variables.
In addition to ARIMAX, the following linear regression models were used in this work: Multiple (MLR), Stepwise, Stepwise with lower Akaike value (Stepwise AIC), Lasso, Ridge, Elastic Net, Boosted, Boosted Tree, and Robust. The nonlinear models were: Cubist, Multivariate Adaptive Regression Splines (MARS), and MARS with cross-validation pruning (MARS gCV).

Machine learning
Explores the study and construction of algorithms that can learn from data and make predictions [28]. Studies with Support Vector Machines (SVM) started to be developed in the 60s, in Russia, by Vapnik, Lemer, and Chervonenkis. However, it can be said that the SVM had its starting point along with the development of the theory of statistical learning by Vapnik in 1979. The current form was developed by Vapnik in the late 1990s and aimed to find a hyperplane that maximizes the margin between classes. Support Vector Regression (SVR) maintains the same characteristics as SVM [29].
Random Forests (RF) are formed by several decision trees. All trees are used, each of which provides an estimate. The final classification is given by the most frequent result in all trees [27,28,30].
Artificial Neural Networks (ANN) are computational techniques that present a mathematical model inspired by the human brain [28]. The most crucial property of ANNs is their ability to learn from their environment and improve their performance. It is done through an iterative process of adjusting the weights of the network [30]. The Autoregressive Neural Networks technique combines the autoregressive statistical model and neural networks, resulting in an AR-NN(p) model. In the AR-NNX model, exogenous variables are included, providing new data to improve prediction performance. In addition to the lagged values of the dependent variable, independent variables can be added that will also be used [31].
Boosting belongs to the machine learning category called an ensemble. Ensemble techniques involve groups of predictive models to achieve better model accuracy and stability. Boosting refers to a family of algorithms that convert weak learning into strong learning. The prediction of each learning is combined to convert it into strong learning [27,30]. The eXtreme Gradient Boosting (XGBoost) algorithm combines the Boosting and tree models.

Assessment metrics
The Jackknife resampling method was implemented, in which the entire database was used. The strategy removes a sample from the total observed set, recalculating the estimator from the remaining values. The use of this technique promotes the reduction of uncertainties, thus having an accurate estimate for evaluating the predictive capacity of the models. Cross-validation uses 30 values (days) for training, and the forecast is given considering a 7-day horizon, where forecast metrics are applied. Each model has been trained and tested 266 times.
Forecast metrics are averaged to evaluate the models. In this work, the following figures of merit have used the Coefficient of Determination (R 2 ), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE).

RESULTS
Between March 17, 2020, and January 20, 2021, the average number of deaths from COVID-19 was 687 ± 391. The mean of cases was 27,865 ± 18,461. The first death occurred when there were 51 reported cases (Table 1). Both time series show a cycle, with trends of growth and reduction. The boxplot reveals no outliers for deaths and one outlier for cases ( Figure 1). All p-values were below 0.05, indicating that the data are not normally distributed considering the seven tests performed (Table 2).  The scatter plot reveals that deaths and cases have a positive correlation -the greater the number of cases, the greater the number of deaths (Figure 2). Deaths and cases showed a coefficient of correlation equal to 0.78. Concerning the correlogram, which shows the correlation between all the variables considered in this study. Only the health index and the government index have a coefficient of correlation greater than 0.9 ( Figure 3). ACF and PACF indicate a need to carry out at least one differentiation for the ARIMA model for the time series of daily and accumulated deaths (Figure 4). After 266 training and test sets for each model, the averages of the prediction metrics of the test bases were calculated. Tables 3 and 4 show the evaluations for the regression models, Random Forest, Support Vector Regression, Artificial Neural Networks, and eXtreme Gradient Boosting. R 2 shows that accumulated deaths produced better estimates because their values are more significant than daily.  Robust (considering R 2 ) and Ridge (considering RMSE and MAE) regressions had the best fit to predict COVID-19 daily deaths (Table 3). Robust regression has a set of procedures that resist minor violations of a model's requirements, and Ridge regression has a penalty that decreases the complexity of a model. MARS (considering R 2 ), and Cubist (considering RMSE and MAE) regressions had the best fit to predict COVID-19 accumulated deaths (Table 4). These methods use a rules-based approach, like decision trees.
Tables 5 and 6 present the evaluations for the ARIMA, ETS, ARIMAX, AR-NN, and AR-NNX models. The ARIMA and ARIMAX models were fitted with different p and q components and were up to 5 each (i.e., at each one of fitted 266 training bases, the p and q values can be changed). The differences (d) were up to 4 for ARIMA and up to 5 for ARIMAX. The MAPE shows that the accumulated deaths produced better estimates because their values are more significant than daily.   (Table 5). ARIMA modeling contains a small number of parameters, and predictions are pretty accurate. In this case, one differentiation was necessary to obtain the best model. ETS (considering MAPE) and ARIMA(p,2,q) (considering RMSE and MAE) had the best fit to predict COVID-19 daily deaths (Table 6). ETS modeling gives greater weight to past observations over recent ones. This model also considers elements such as trend and seasonality. In ARIMA(p,2,q), two differences were necessary to obtain the best model.

DISCUSSION
The COVID-19 pandemic has killed millions of people since the end of 2019. With the evolution of data systems and the high contagion, information began to be published daily on the number of cases and deaths caused by the virus [14]. Thus, it became essential to use forecasting techniques and models to project this count in regions and countries with high indexes [9][10][11][12]. In Brazil, there have already been more than 650,000 deaths (it is about 0.3% of the population). Vaccination in Brazil began on January 17, 2021, but few individuals were vaccinated during this work. It strengthens the hypothesis that there was no external influence on the predictions and certifies the results obtained. The Jackknife methodology was applied to reduce possible temporal variations, with many training and test sets.
Several prediction models were observed for daily and accumulated deaths. Time series do not have a normal distribution. The scatter plot and the correlogram show that cases and deaths have a positive correlation (0.78). Health and restriction (0.84) indices and government and restriction (0.75) indices also have a positive correlation. However, when these variables are tested with others, only health and government indices have a strong correlation (0.95).
The predicted numbers at the end of the epidemic are highly dependent on the length of the time series used in the predictive models [32]. Therefore, was used a 7-day prediction. The exogenous variables cases, restriction index, government response index, health index, and the economic index were used to get more precision. To get more accurate, the averages of the 266 prediction metrics of the test bases revealed the best fit of the models. The forecast for accumulated deaths produced better estimates than the daily ones (this result corroborates with [33]), as seen by R 2 and MAPE. Possibly, this is because temporal elements such as cycle, seasonality, and randomness have less influence on accumulated deaths than on daily deaths.
Considering R 2 for the prediction of daily deaths, the best models were the Robust (0.796), Lasso (0.790), Ridge (0.772), and Elastic Net (0.752) regressions. For accumulated deaths, the best models were the MARS gCV (0.995), MARS (0.995), Cubist (0.993), and Lasso (0.989) regressions. Looking at the MAPE, the models that showed better results for daily deaths were ARIMA(p,1,q) ( Considering deaths by COVID-19 in Brazil, as the time series does not have a normal distribution, there is no outlier, and the variables do not have a high correlation, nonlinear regressions had the best fit for predicting accumulated deaths.
The proposed models were used for the COVID-19 pandemic. However, they can be used for other epidemic or pandemic situations with possible good results, because in epidemiology the historical context is extremely important [34]. The change in daily numbers of COVID-19 is affected by many factors, such as the population's adherence to prevention measures, vaccination, social isolation, and new variants of the virus. Analysis suggests that COVID-19 shows chaotic behavior, like in previous epidemics [35]. Government campaigns are very important to avoid the possible underreporting of cases and deaths. Delays in notifications can also bring biased results.
Due to these reasons, to have better accuracy of the predictions, it is necessary to use many training and test bases. It is important to emphasize that this study was designed without the influence of vaccination of people. Therefore, these models may be interesting for use at the beginning of an epidemic or a pandemic. In practice, this work proposes that at the beginning of an epidemic, the forecast is made by the non-linear model. In addition, predictions can be made daily, as long as the data used is accumulated.
The results presented in this study differ from some previously published works [36][37][38][39]. The Jackknife resampling method used here has better accuracy, because used 266 training and testing bases. So as data on the pandemic are published daily, the forecasts must also be updated periodically. For future works, it is recommended to include some other variables like the vaccination of people and the virus reproduction rate. In addition, to have the best fit of the models, it may be interesting to consider smaller regions such as some districts.

CONCLUSION
This work used multiple sets of training and testing to predict the number of deaths from COVID-19 in Brazil. Furthermore, exogenous variables were used. This procedure helps to produce a more accurate estimate of the predictive capacity of the models. A total of 30 forecasting methods were used, making it possible to identify the methods that produce better predictions of daily and accumulated deaths. Therefore, this work showed that the time series of accumulated deaths produced better estimates than daily deaths. The cubist regression had the best fit for cumulative deaths, and ridge regression had the best fit for daily deaths. The contribution of this work revealed that nonlinear regressions are the best methodology to predict the number of accumulated deaths from COVID-19 in Brazil.
Funding: This research received no external funding.