COVID-19 prediction of tendency for 2021 in northwestern Argentina

ABSTRACT: Using a lagged polynomial regression model built on COVID-19 data from 2020, when no vaccines were available, COVID-19 cases were predicted for Tucumán in 2021 in a scenario with vaccine administration. The modeling included the identification of a contagion breakpoint between both series with the best correlation. Beforehand, cross-correlation was used to identify the lag that gave the smallest error between the expected and observed values. The model was validated with real data. Over 21 days, 18,640 of the 20,400 reported COVID-19 cases were predicted. The maximum peak of COVID-19 was estimated 21 days in advance with the expected intensity.


INTRODUCTION
In March 2020, the World Health Organization (WHO) declared the coronavirus disease (COVID-19) a pandemic 1. It urged the activation of various protocols to contain its spread 2. In Argentina, the first case was detected in March 2020 in Buenos Aires, and a mandatory quarantine was declared by Decree of Necessity and Urgency 3.
At the beginning of 2021, no vaccines had been administered to the population, and after activities reopened, the second wave of COVID-19 began.
The objective was to predict the trend of COVID-19 cases during 2021, including the maximum peak, for a scenario with vaccine administration, by studying the statistical behavior of the 2020 COVID-19 data, when no vaccines were applied.

METHODS
The study was carried out in the province of Tucumán, in northwestern Argentina, which was chosen because no predictions of COVID-19 cases were available for it and because it is the second most densely populated province in the country, with a reported 1,338,523 inhabitants 4.
The prediction model for COVID-19 cases was built by identifying, in the 2020 COVID-19 data, a lag (a segment of t days) that best correlates with a lag of t days in the 2021 COVID-19 data, using a contagion breakpoint in the first series as a reference. Once this lag was identified, a cross-correlation was performed between the two lags to find the best delay with which to fit the data with a lagged polynomial regression model and predict the current COVID-19 trend.
Data conversion: first-order differencing was used to stabilize the mean and reduce the trend. The p value was calculated from the t statistic with n-2 degrees of freedom, where n is the number of samples that overlap in the cross-correlations. The analysis was performed with Past 3.22 5,6.
Two COVID-19 data sets were used, both published daily by the Ministry of Public Health of the province of Tucumán (Ministerio de Salud Pública de la provincia de Tucumán, MSPT) 7. The first set ran from 03/18/2020 to 11/27/2020 and the second from 03/19/2021 to 05/20/2021. A matrix of lags of different lengths in days was created for the 2020 COVID-19 data. Beforehand, the start and end dates of the lags were obtained from a contagion breakpoint, indicated by a 50% increase in all reported cases before the 2020 COVID-19 peak. A 15-day moving average was used. Lag lengths (l) of 30, 35, 40, and 45 days were explored.
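The data preparation described above can be illustrated with a minimal Python sketch. It is not the procedure run in Past 3.22. The function names are hypothetical, the breakpoint index `start` is taken as given rather than computed, and having every candidate lag begin at that index is an assumption.

```python
from itertools import accumulate

def difference(series):
    """First-order differencing: d[i] = series[i + 1] - series[i]."""
    return [b - a for a, b in zip(series, series[1:])]

def undifference(first_value, diffs):
    """Invert first-order differencing, given the first original value."""
    return list(accumulate(diffs, initial=first_value))

def moving_average(series, window=15):
    """Simple moving average; returns len(series) - window + 1 values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def lag_matrix(series, start, lengths=(30, 35, 40, 45)):
    """Candidate lags (segments) of several lengths.

    Assumption: every segment begins at index `start` (the breakpoint);
    lengths that would run past the end of the series are skipped.
    """
    return {l: series[start:start + l]
            for l in lengths if start + l <= len(series)}
```

Because `undifference(series[0], difference(series))` returns the original series, predictions made on the differenced scale can be converted back to daily case counts.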
The 2020 COVID-19 lag used to build the training set was identified by Pearson's correlation (r_p), with p<0.05, with the 2021 COVID-19 lag. Once the lag was identified, a cross-correlation (r_d), with p<0.05, was performed between them. This gave the position in the predictor series y_i (COVID-19 2020) for its best delay m. Appendix 1 shows the methodology in detail as a flow diagram.
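The search for the best delay m can be sketched as follows. This is an illustrative pure-Python version, not the Past 3.22 routine. The function names are hypothetical, and choosing the delay with the largest absolute correlation is an assumption. The sketch returns the t statistic with n-2 degrees of freedom for each delay; the two-sided p value would then be read from the t distribution, for instance with `scipy.stats.t.sf` if SciPy is available.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for a correlation r over n samples (n - 2 d.o.f.)."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def cross_correlation(x, y, max_delay):
    """Correlate x[i] with y[i - m] for m = 0..max_delay.

    Only the overlapping samples are used, so n shrinks as m grows.
    Returns a list of (m, r, n, t) tuples.
    """
    results = []
    for m in range(max_delay + 1):
        n = min(len(x) - m, len(y))
        xs, ys = x[m:m + n], y[:n]
        r = pearson_r(xs, ys)
        results.append((m, r, n, t_statistic(r, n)))
    return results

def best_delay(x, y, max_delay):
    """Delay with the largest absolute correlation (an assumed criterion)."""
    return max(cross_correlation(x, y, max_delay), key=lambda row: abs(row[1]))
```

Here `x` stands for the differenced 2021 lag and `y` for the differenced 2020 lag, so a positive m means that the 2020 series leads the 2021 series by m days.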
Model used: the 2020 COVID-19 lag identified at delay m and the 2021 COVID-19 data lag were fitted with a lagged polynomial regression model. This type of model was used because COVID-19 cases are random and non-linear. The polynomial regression model 8 expresses x_i as a polynomial function of y at day i-m, with coefficients a, b, and c, plus an error term e, where x_i represents the differenced, predicted COVID-19 cases for 2021 on day i; y is the 2020 COVID-19 predictor series, taken at its best delay i-m, that best predicts x as a function of y; and e represents the estimated error. The differencing applied was invertible. The autocorrelation of the residuals of the best model was null. The model was evaluated against real COVID-19 data using the mean absolute percentage error (MAPE). A forecast horizon was subsequently assessed in the same way.
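A minimal sketch of the fitting and evaluation steps is given below. It assumes a second-degree polynomial written as x_i = a + b*y_(i-m) + c*y_(i-m)^2 + e_i; the degree and the ordering of the coefficients are assumptions inferred from the three coefficients a, b, and c. The function names are hypothetical, and skipping zero observations in the MAPE is a choice made here to avoid division by zero.

```python
def align(x, y, m):
    """Pair x[i] with y[i - m]; returns (y_lagged, x_aligned)."""
    n = min(len(x) - m, len(y))
    return y[:n], x[m:m + n]

def fit_quadratic(y_lagged, x):
    """Least-squares fit of x = a + b*y + c*y**2 via the normal equations."""
    # Power sums of y and cross-moments of x with powers of y.
    s = [sum(v ** k for v in y_lagged) for k in range(5)]
    t = [sum(xv * yv ** k for xv, yv in zip(x, y_lagged)) for k in range(3)]
    # Augmented 3x3 system: row k is [s[k], s[k+1], s[k+2] | t[k]].
    A = [[s[k], s[k + 1], s[k + 2], t[k]] for k in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    return tuple(A[k][3] / A[k][k] for k in range(3))  # (a, b, c)

def predict(coeffs, y_lagged):
    """Evaluate the fitted polynomial on the lagged predictor values."""
    a, b, c = coeffs
    return [a + b * v + c * v * v for v in y_lagged]

def mape(observed, predicted):
    """Mean absolute percentage error (%), skipping zero observations."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o != 0]
    return 100.0 * sum(abs((o - p) / o) for o, p in pairs) / len(pairs)
```

In use, `align(x, y, m)` would be applied to the differenced series with the selected delay, the coefficients fitted on the training lag, and the MAPE computed between the reported cases and the predictions.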
RESULTS
The 2021 COVID-19 data lag used to build the model ran from 03/30/2021 to 05/13/2021, and the 2020 COVID-19 data lag ran from 08/18/2020 to 10/01/2020. Between the series, r_p = -0.296 was obtained, with p=0.04, while for a delay of m=17 days r_d = 0.488, with p=0.008 (Figure 1).

DISCUSSION
The results showed that the model underestimated the number of cases that occurred before 05/27/2021, when strict social restrictions were imposed 9, and that it tracked the real cases toward the maximum peak once the restrictions were in place. The underestimation may have been influenced by society's relaxation following the administration of vaccines. Before the model was started on 04/22/2021, the vaccination campaign had accumulated 230,000 applied doses 10, while two days after the first peak of COVID-19, on 05/06/2021, a cumulative 306,000 doses had been applied 11. Another possible source of underestimation is the absence of social restrictions. We jointly compared the increase in COVID-19 cases reported by the MSPT, the predicted cases, and an Index of Movement of People in Supermarkets and Pharmacies 12, and we observed that they behaved in a similar way (Figure 2).
Figure caption: The red line indicates probability. The lag 10 was significant with non-null model residuals.
The accuracy of the model is similar to that of other reported investigations, such as the one calculated with a parsimonious and robust survival and convolution model 13. The duration of the prediction obtained is similar to that achieved with the extended susceptible-exposed-infectious-recovered model 14.
The model presented here was able to predict the trend in the dynamics of expected COVID-19 cases toward the maximum peak. However, it predicted only a single COVID-19 peak, on 06/03/2021, whereas two COVID-19 peaks actually occurred in 2021, one on 06/04/2021 and the other on 06/08/2021.
In conclusion, we highlight that the trends of COVID-19 cases in 2021 in Tucumán could be predicted by analyzing the statistical behavior of the first wave of COVID-19 that occurred in 2020.