Imputation method to reduce undetected severe acute respiratory infection cases during the coronavirus disease outbreak in Brazil

Abstract INTRODUCTION: The coronavirus disease (COVID-19) outbreak has overburdened the surveillance of severe acute respiratory infections (SARIs), including the laboratory network. This study aimed to correct for the absence of laboratory results among reported SARI deaths. METHODS: An imputation method was applied to SARI deaths without laboratory information using clinico-epidemiological characteristics. RESULTS: Of 84,449 SARI deaths, 51% were confirmed as COVID-19 and 3% as other viral respiratory diseases. After imputation, 95% of deaths were reclassified as COVID-19 and 5% as other viral respiratory diseases. CONCLUSIONS: The imputation method was a useful and robust solution (sensitivity and positive predictive value of 98%) for missing values, based on clinical and epidemiological characteristics.

The coronavirus disease (COVID-19) pandemic had caused more than 10 million cases and 500,000 deaths worldwide by June 2020 1. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been spreading fast globally, causing many severe cases and deaths. This virus has a higher basic reproduction number (R0) and case fatality rate (CFR) than influenza (R0: 2.5-3.3 and CFR: 0.4%-2.9% versus R0: 1.2-2.3 and CFR: 0.15%-0.25%, respectively) [2][3][4][5]. In Brazil, the first confirmed case was reported on February 25 in Sao Paulo City, and at least one case has since been reported in all Brazilian states and almost all municipalities (96%) 6.
Brazil has a surveillance system for severe acute respiratory illness (SARI) that operates at three levels of government (federal, state, and municipal) and is installed in public and private health units; notification of SARI has been mandatory since 2009. The reported cases include patients hospitalized because of SARI at any health service and mild respiratory cases reported by sentinel networks, all recorded in an online database (Influenza Epidemiological Surveillance Information System in Brazil, SIVEP-Gripe). After the discovery of SARS-CoV-2 in China, suspected cases in Brazil were reported using the REDCap platform until the country reached 1,000 confirmed cases; subsequently, a new system (e-SUS) was developed and used to report mild respiratory cases, while SARI continued to be reported on SIVEP. Because of its continuity and consistency, SIVEP has been maintained as the official system to report and monitor severe cases of COVID-19, including deaths from COVID-19 regardless of hospitalization.
Although SIVEP is an online platform, inconsistencies in monitoring and in the timeliness of case closure persist. In addition, the Ministry of Health has reported a high percentage of deaths from SARI without a diagnosis, called "non-specified SARI," and has alerted health authorities to possible activity of other respiratory viruses in the Brazilian population. Therefore, this study aimed to investigate the clinico-epidemiological characteristics of deaths from SARI reported in SIVEP-Gripe in order to correct for the absence of robust laboratory results for COVID-19.
We used deaths from SARI reported in the Influenza Epidemiological Surveillance Information System in Brazil (SIVEP-Gripe) during the COVID-19 outbreak from January 1 to June 28, 2020. The death registers were selected using the case evolution variable.
All reported cases were classified as follows: (i) COVID-19, with laboratory confirmation through the reverse-transcriptase polymerase chain reaction (RT-PCR) for SARS-CoV-2; (ii) undetected, with laboratory confirmation through RT-PCR for other viruses; and (iii) missing value, with no confirmation through RT-PCR, an indeterminate result, or a test still in processing. This classification was our response variable for the regression model and was subsequently imputed.
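The three-way classification above can be sketched as a small function. This is only an illustration: the field names `sars_cov2_pcr` and `other_virus_pcr` and the value `"positive"` are hypothetical and do not correspond to actual SIVEP-Gripe variable names.

```python
from typing import Optional


def classify_sari_death(sars_cov2_pcr: Optional[str],
                        other_virus_pcr: Optional[str]) -> str:
    """Assign one SARI death to a response category.

    Inputs are hypothetical RT-PCR result strings; anything other than
    "positive" (None, "indeterminate", "in processing", ...) is treated
    as not confirmed.
    """
    if sars_cov2_pcr == "positive":
        return "covid19"      # (i) RT-PCR confirmed SARS-CoV-2
    if other_virus_pcr == "positive":
        return "undetected"   # (ii) RT-PCR confirmed another virus
    return "missing"          # (iii) no usable laboratory result
```

Records returned as `"missing"` are the ones that would later be passed to the imputation step.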
Before applying the data imputation method, we performed a logistic regression analysis to identify the variables related to the response. First, we fitted univariate models using the following predictors: signs and symptoms (fever, cough, throat pain, dyspnea, respiratory distress, O2 saturation < 95%, diarrhea, and vomiting), comorbidities (chronic cardiovascular disease, chronic hematological disease, chronic liver disease, asthma, diabetes mellitus, chronic neurological disease, other chronic pneumopathy, immunodeficiency/immunodepression, chronic kidney disease, and obesity), hospitalization (yes/no), intensive care unit stay (yes/no), ventilation support (invasive, non-invasive, and none), chest X-ray, sex, and age group (<10 years, 10 to 39 years, 40 to 59 years, 60 to 69 years, and 70 years or more). The multiple logistic regression model was built from variables with a p-value below 10% in the univariate models, and a stepwise method was applied using the Akaike information criterion, Bayesian information criterion, and deviance. Subsequently, cases classified as "missing value" were subjected to the data imputation method, using as predictors the variables selected in the multiple logistic regression.
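For a yes/no predictor, the univariate screening step can be illustrated from a 2x2 table, because the logistic regression coefficient of a single binary predictor equals the log odds ratio and its Wald standard error equals sqrt(1/a + 1/b + 1/c + 1/d). The sketch below assumes that setting only; the function names and the counts in the usage example are hypothetical, and it does not cover multi-level predictors or the stepwise selection.

```python
import math


def wald_test_2x2(a: int, b: int, c: int, d: int):
    """Odds ratio and two-sided Wald p-value for a binary predictor.

    a: predictor present, COVID-19     b: predictor present, other virus
    c: predictor absent,  COVID-19     d: predictor absent,  other virus
    All four cells must be non-zero.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(log_or), p_value


def screen_predictors(tables: dict, alpha: float = 0.10) -> list:
    """Keep predictors whose univariate p-value is below alpha."""
    return [name for name, cells in tables.items()
            if wald_test_2x2(*cells)[1] < alpha]
```

With made-up counts such as `{"x": (30, 10, 10, 30), "y": (20, 20, 20, 20)}`, only `"x"` would pass the 10% threshold and move on to the multiple model.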
We applied the multiple imputation method to obtain complete information on the classification of SARI deaths for the "missing value" cases. Imputation was performed using the additive regression method, in which a flexible additive model (a nonparametric regression method) is fitted on samples taken with replacement from the original data, and the missing values (dependent variable) are then predicted from the non-missing values (independent variables selected by the multiple logistic regression) 7-9.
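The general idea of bootstrap-based multiple imputation can be sketched as follows. This is a deliberately simplified stand-in and not the additive regression procedure used in the study: the flexible additive model is replaced by a frequency table per combination of predictor values, and all function and field names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def impute_once(records, predictors, target, rng):
    """One imputation: bootstrap the observed rows, then draw each
    missing label from the class frequencies of its predictor stratum."""
    observed = [r for r in records if r[target] is not None]
    boot = [rng.choice(observed) for _ in observed]  # sample with replacement

    by_stratum = defaultdict(Counter)
    overall = Counter()
    for r in boot:
        by_stratum[tuple(r[p] for p in predictors)][r[target]] += 1
        overall[r[target]] += 1

    completed = []
    for r in records:
        row = dict(r)  # copy, so the input records stay untouched
        if row[target] is None:
            # Fall back to overall frequencies for strata absent from the bootstrap
            counts = by_stratum.get(tuple(row[p] for p in predictors)) or overall
            labels = sorted(counts)
            weights = [counts[label] for label in labels]
            row[target] = rng.choices(labels, weights=weights)[0]
        completed.append(row)
    return completed


def multiple_impute(records, predictors, target, n_imputations=5, seed=0):
    """Return n_imputations completed copies of the data set."""
    rng = random.Random(seed)
    return [impute_once(records, predictors, target, rng)
            for _ in range(n_imputations)]
```

Each completed copy differs because both the bootstrap sample and the random draws differ, which is what lets multiple imputation reflect the uncertainty of the imputed class.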
To validate the data imputation method, we selected a random sample of SARI deaths with a confirmed classification (COVID-19 or other viral respiratory diseases). We randomly set the classification to missing for 30% of these cases and applied the imputation method. Subsequently, the imputed values were compared with the observed values. The sensitivity, specificity, positive predictive value, and negative predictive value were calculated to quantify this validation. Furthermore, the Kappa test was performed to measure the concordance between the imputed and observed values. The significance level was set at 5% for all analyses. All data were processed using R software, and the data imputation method was performed using the R package Hmisc.
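The validation measures named above follow directly from the 2x2 table of imputed versus observed classes. A minimal sketch, treating COVID-19 as the positive class and using a hypothetical function name:

```python
def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement between imputed and observed classes.

    tp: imputed COVID-19, observed COVID-19
    fp: imputed COVID-19, observed other virus
    fn: imputed other virus, observed COVID-19
    tn: imputed other virus, observed other virus
    """
    n = tp + fp + fn + tn
    observed_agreement = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals
    expected_agreement = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed_agreement - expected_agreement)
                 / (1 - expected_agreement),
    }
```

The counts passed to this function would come from comparing the imputed values with the values that were deliberately masked in the 30% validation subset.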
In Brazil, from January 1 to July 28, 2020, 84,449 deaths from SARI were reported. Of these, 45,321 (54%) were confirmed using RT-PCR for a respiratory virus, of which 42,981 (95%) were confirmed as COVID-19. The proportion of confirmed COVID-19 cases differed across Brazilian states, with the lowest in Mato Grosso do Sul (19%) and the highest in Acre (91%) (Table 1).
In the univariate logistic regression model, age group was associated with COVID-19, with the odds ratio increasing with age. The signs and symptoms that showed significant associations were respiratory distress, fever, cough, throat pain, and dyspnea, all indicating lower odds of being detected with COVID-19. Only four underlying health conditions showed significant associations with COVID-19: chronic cardiovascular disease, diabetes mellitus, chronic kidney disease, and obesity. Individuals who needed intensive care were more likely to be detected with COVID-19. In the multiple logistic regression, only five variables remained in the final model: age group, with individuals aged 40 years or above having approximately eight times the odds of being detected with COVID-19 compared to those aged below 10 years; respiratory distress (33% chance for individuals with respiratory distress); chronic cardiovascular disease and diabetes mellitus, with 10% and 20% higher odds, respectively; and ventilation support, with higher odds in individuals who required it (32% for invasive; 38% for non-invasive) (Table 2).
Using the variables defined by the multiple logistic regression, the imputation method was applied to all data classified as "missing value." Of the total records classified as "missing value," the data imputation method could classify 37,980 cases (97%). Of these, 1,994 cases (2% of all SARI deaths) were classified as other viral respiratory diseases (undetected for COVID-19), and 35,986 cases (43% of all SARI deaths) were classified as COVID-19. Therefore, of the total deaths from SARI that occurred in Brazil from January 1 to July 28, 2020, 95% were reclassified as COVID-19 and 5% as some other viral respiratory disease (not COVID-19). Hence, most Brazilian states and the Federal District had at least 90% of deaths from SARI classified as COVID-19. Only the states of Maranhão (15%), Mato Grosso (14%), and Mato Grosso do Sul (11%) had more than 10% of SARI deaths classified as other viral respiratory diseases by the data imputation method (Table 3).
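The reclassification percentages can be checked with simple arithmetic on the counts reported in this section. The sketch below assumes that the 95% and 5% figures use the deaths with a final classification (laboratory-confirmed plus imputed) as the denominator; the variable names are illustrative.

```python
total_deaths = 84_449      # SARI deaths reported
lab_classified = 45_321    # RT-PCR confirmed for a respiratory virus
lab_covid = 42_981         # of which confirmed as COVID-19
imputed_covid = 35_986     # imputed as COVID-19
imputed_other = 1_994      # imputed as other viral respiratory disease

lab_other = lab_classified - lab_covid          # 2,340
missing = total_deaths - lab_classified         # 39,128 without a laboratory class
imputed = imputed_covid + imputed_other         # 37,980
classified = lab_classified + imputed           # 83,301 with a final class

share_imputed = imputed / missing                         # about 0.97
covid_share = (lab_covid + imputed_covid) / classified    # about 0.95
other_share = (lab_other + imputed_other) / classified    # about 0.05
```

Under that assumption, the counts reproduce the reported 97% of missing cases classified and the 95% versus 5% split after imputation.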
To validate the data imputation method, simulation showed high sensitivity (99%) and positive predictive value (99%) and
Source: SIVEP-GRIPE, accessed on 06/20/2020. Notes: *included undetected results, indeterminate, not tested, in processing, ignored, and missing.
selecting some imputed cases and trying to investigate the medical records to identify more examinations (X-ray, tomography, etc.) that help confirm the cases and perform retesting for these cases using a different methodology suitable for laboratory collection. These estimations should be confirmed with empirical data as the quality of the information systems improves.
The main limitation of this method is its dependence on the underlying data structure: if the quality of the information is not reasonably good, the output of the imputation carries the same bias. Given the speed of disease spread in the country, the quality with which epidemiological antecedents are filled out during surveillance may be compromised. This can explain the difference observed in some states that showed less than 90% of detected COVID-19 cases. These states usually have poorer completion of the investigation form (in Maranhão, missing values across variables ranged from 12% to 74%, while in Mato Grosso and Mato Grosso do Sul they ranged from 3% to 67%).