How limitations in data of health surveillance impact decision making in the Covid-19 pandemic

The Covid-19 pandemic alerted all countries to the need to control the transmission of SARS-CoV-2 in order to have fewer infected individuals, reduce stress on health systems, and save lives. As a result, governments at national and local levels adopted social distancing measures of varying degrees. The decision process for relaxing social distancing measures requires evidence of decreasing incidence, available capacity in the health systems to absorb eventual epidemic waves, and serological prevalence studies designed to estimate the proportion of individuals with antibody protection. The trend criterion, usually given by the effective reproduction number, might be misleading if there are significant delays in reporting cases. For instance, the reproduction number for Niterói, in the state of Rio de Janeiro, went down from approximately 3 to little more than 1. Even with all measures, the reproduction number did not fall below 1, the threshold that would demonstrate a more controlled scenario. Finally, a prediction method permits adjusting for the notification delay and analyzing the current status of the epidemic.


Introduction
The arrival of SARS-CoV-2 in Brazil proved to be a challenge for health surveillance due to insufficient testing, varying degrees of preparedness in surveillance and health system capacity, and health inequities reflected in population vulnerabilities, among other factors. Mitigating the impact of the Covid-19 pandemic led to various social distancing measures, as well as better hygiene practices, in order to reduce the number of cases at the epidemic peak and the transmission intensity 1 . Because sustaining population isolation for a long time is a hardship, objective criteria are required for continuously evaluating the return to a new normal of collective activities.
Discussion about improving surveillance and getting back to a 'new normal' involves a careful evaluation of the situation. Indicators gathered at local levels, such as states and cities, should support any related decisions, which should be implemented in gradual steps. In early March, health surveillance in the main urban centers of São Paulo and Rio de Janeiro indicated sustained transmission of SARS-CoV-2. Subsequently, the virus spread to other urban centers in the country, and it is still spreading to small and medium cities. Various cities in the Northern region of Brazil, such as Manaus, had many cases of Covid-19 in March. Serological tests show that the proportion of people developing antibodies against SARS-CoV-2 infection is higher in that region than in other parts of the country. All these aspects show that the pandemic progresses asynchronously across cities. Therefore, evaluations should be local, considering the trend of cases, the health system capacity, and the seroprevalence in the population.
The World Health Organization (WHO) issued recommendations for monitoring trends of cases and deaths, testing, and health system capacity 2 . Regarding case trend analysis, an important number is the effective reproduction number, R, which describes the number of secondary cases of infection expected from an infected individual at any time 3 . This number corresponds to the basic reproduction number (R0) at the beginning of the epidemic. As the number of susceptible individuals decreases, R values should decrease. Social distancing and similar measures might accelerate the process. The goal is to have R<1. WHO recommends R<1 for at least two weeks as one of the criteria to consider any return to daily routines. However, there might be stochastic fluctuations, and relaxing isolation measures will put more susceptible people at risk, which can also increase R values.
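The two-week criterion above can be written as a simple check over a series of daily R estimates. The following is a minimal sketch, not an official WHO procedure: the function name, the `min_days` parameter and the example values are hypothetical, and a stricter variant could test the upper bound of the credibility interval instead of the point estimate.

```python
def sustained_below_one(r_values, min_days=14):
    """Return True if the last `min_days` daily R estimates are all below 1."""
    if len(r_values) < min_days:
        return False
    return all(r < 1.0 for r in r_values[-min_days:])


# Hypothetical daily R estimates: 10 days above 1, then 14 days below 1.
r_series = [1.3] * 10 + [0.9] * 14
print(sustained_below_one(r_series))           # criterion met
print(sustained_below_one(r_series + [1.05]))  # one fluctuation above 1 breaks it
```

The second call illustrates why stochastic fluctuations matter: a single day above 1 restarts the count.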
Here we analyze the cases reported in the Brazilian notification system using datasets from the state of Rio de Janeiro to understand these criteria in the state and its municipalities. We analyze both deaths and confirmed cases. We evaluate the reproduction number and show a method for forecasting the number of cases that accounts for the notification delay.

Data
Analytical work involved datasets of anonymized cases of Covid-19 in the state of Rio de Janeiro obtained from the RJ Covid-19 dashboard maintained by the Health Secretariat of the State of Rio de Janeiro. We used two datasets that differ by download date: the first was downloaded on May 7th and the second on June 6th, i.e., 30 days apart. Traditional surveillance teams notify cases using a standard form from Datasus, the agency responsible for database systems at the Ministry of Health. Entries, however, might enter the database system on dates different from the notification date. Therefore, each entry for an individual case contains several dates: the date of initial symptoms, the notification date, the conclusion date, and the date of death, if death is the outcome. A convenient approach to analyze trends is to aggregate notification records by any of these dates and by city. Here, the aggregation gives the number of cases by date of initial symptoms and by city, which allows studying the dynamics of the transmission of the SARS-CoV-2 virus.
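The aggregation step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a city name paired with a date of first symptoms) and the sample rows are hypothetical and do not reproduce the dashboard's actual file format.

```python
from collections import Counter
from datetime import date

# Hypothetical anonymized records: (city, date of first symptoms).
records = [
    ("Niterói", date(2020, 4, 1)),
    ("Niterói", date(2020, 4, 1)),
    ("Niterói", date(2020, 4, 3)),
    ("Rio de Janeiro", date(2020, 4, 1)),
    ("Rio de Janeiro", None),  # missing date of first symptoms
]


def daily_cases_by_city(records):
    """Count cases per (city, symptom-onset date), skipping missing dates."""
    counts = Counter()
    for city, onset in records:
        if onset is not None:
            counts[(city, onset)] += 1
    return counts


counts = daily_cases_by_city(records)
for (city, onset), n in sorted(counts.items()):
    print(city, onset.isoformat(), n)
```

Records without a date of first symptoms are left out of the time series, so the missing rate of that field bounds how much of the epidemic curve can be reconstructed.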
The state of Rio de Janeiro is in the Southeast region of Brazil. Its population is 17.2 million people, of which the capital concentrates the largest share, about 6.6 million inhabitants. Niterói, a city in the state of Rio de Janeiro with approximately 0.5 million people, is chosen as a case study, since city authorities issued several restrictive measures to mitigate the transmission of SARS-CoV-2 in the city. The city of Rio de Janeiro severely reduced commerce and services activity, as well as public transportation, and the state restricted human mobility between towns 4 . Cumulative incidence and mortality rate use the number of cases (or cases with death as outcome) and population size estimates, given by the National Statistics Bureau (IBGE, https://www.ibge.gov.br) based on census and demographic methodologies.

Epidemiological analysis
We analyze the intensity of transmission over time using the effective reproduction number R, which describes how many cases are expected to develop the disease after a case at any time 5,6 . We estimated R values from the time series of each municipality using the parametric method in the package EpiEstim for the R statistical platform, developed by Cori et al. 7 . This method takes into account the daily number of cases and knowledge about the distribution of the serial interval of the disease. The serial interval is the time between the onset of symptoms in a primary case and in a secondary case. The estimation of R for any day depends on the number of cases counted on that day, the daily number of cases on previous days, and the distribution of the serial interval. We assume a serial interval with a mean of 4.7 days and a standard deviation of 2.9 days, as estimated by Nishiura et al. 8 .
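To make the dependence on past cases and the serial interval concrete, the sketch below computes a sliding-window posterior mean of R from a renewal equation. It is a simplified illustration written in Python, not the EpiEstim implementation: the serial interval is approximated by a gamma density evaluated at integer days and normalized, and the gamma prior (shape 1, scale 5), the 7-day window and the incidence series are assumptions chosen for the example.

```python
import math


def discretized_gamma(mean, sd, max_days=30):
    """Approximate serial-interval weights w_1..w_max_days from a gamma density.

    Simplification: the density is evaluated at integer days and normalized.
    """
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    pdf = [
        k ** (shape - 1) * math.exp(-k / scale) / (math.gamma(shape) * scale ** shape)
        for k in range(1, max_days + 1)
    ]
    total = sum(pdf)
    return [p / total for p in pdf]


def estimate_r(incidence, weights, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior-mean R for each day, using cases in a trailing window."""
    # Total infectiousness on day t: past cases weighted by the serial interval.
    infectiousness = [
        sum(weights[k - 1] * incidence[t - k] for k in range(1, min(t, len(weights)) + 1))
        for t in range(len(incidence))
    ]
    estimates = {}
    for t in range(window, len(incidence)):
        cases = sum(incidence[t - window + 1 : t + 1])
        pressure = sum(infectiousness[t - window + 1 : t + 1])
        estimates[t] = (prior_shape + cases) / (1.0 / prior_scale + pressure)
    return estimates


weights = discretized_gamma(mean=4.7, sd=2.9)
# Hypothetical daily counts: growth followed by a plateau.
incidence = [1, 2, 3, 5, 8, 12, 18, 25, 33, 40, 46, 50,
             52, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53]
r_by_day = estimate_r(incidence, weights)
print(round(r_by_day[10], 2), round(r_by_day[23], 2))
```

In this example the estimate is well above 1 during the growth phase and approaches 1 once daily counts stabilize.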
We also estimate the trend over recent weeks by estimating the daily incidence over the past three weeks with the method proposed by Villela 9 . The algorithm forecasts epidemic numbers based on data collected over discrete time units such as days, weeks or months 9 . Since the method permits forecasting from an incidence time series, it also allows a prediction of the recent weeks when the most recent weeks are discarded from the time series of daily cases. This nowcasting process requires data until a chosen date to estimate the growth parameters of the epidemic curve. In this case, the goal of using this methodology is to correct for the notification delay caused by records not yet transmitted to the database. The method assumes that the epidemic curve has a shape that can be modeled by a generalized Richards model 10 .
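As an illustration of the curve shape assumed above, the sketch below generates daily incidence from one common form of the generalized Richards model, dC/dt = r C^p (1 - (C/K)^a), where C is the cumulative number of cases. It is not the estimation procedure of Villela 9: the parameter values are hypothetical, no fitting is done, and the daily Euler step is a simplification.

```python
def richards_incidence(r, p, a, K, c0, days):
    """Daily new cases from a generalized Richards growth curve.

    dC/dt = r * C**p * (1 - (C / K)**a), integrated with daily Euler steps.
    """
    cumulative = c0
    incidence = []
    for _ in range(days):
        new_cases = r * cumulative ** p * (1.0 - (cumulative / K) ** a)
        new_cases = max(new_cases, 0.0)
        cumulative = min(cumulative + new_cases, K)  # never exceed the final size K
        incidence.append(new_cases)
    return incidence


# Hypothetical parameters, for illustration only (not fitted to any city).
curve = richards_incidence(r=0.8, p=0.8, a=1.0, K=10_000, c0=5, days=120)
peak_day = max(range(len(curve)), key=curve.__getitem__)
print(peak_day, round(sum(curve)))
```

The resulting curve has a single peak, which is why the approach suits epidemics without repeated waves. In a nowcasting setting, the parameters would be estimated from data up to a chosen date and the curve projected over the most recent weeks.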
An important death-related indicator is the Case Fatality Ratio (CFR), calculated as the ratio between the cumulative number of deaths and the cumulative number of confirmed cases 11 .
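The definition translates directly into code. The sketch below uses the cumulative figures reported for the state of Rio de Janeiro in the June 6th dataset; the function name is illustrative.

```python
def case_fatality_ratio(cumulative_deaths, cumulative_confirmed):
    """Cumulative deaths divided by cumulative confirmed cases."""
    if cumulative_confirmed <= 0:
        raise ValueError("cumulative_confirmed must be positive")
    return cumulative_deaths / cumulative_confirmed


# Cumulative figures for the state of Rio de Janeiro by June 6th.
cfr = case_fatality_ratio(6639, 64533)
print(f"{cfr:.1%}")  # prints 10.3%
```

Because both the numerator and the denominator are affected by notification delay and by testing coverage, this value is provisional while the epidemic is ongoing.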

Results
The state of Rio de Janeiro had counted 14,156 and 64,533 confirmed cases of Covid-19 by May 7th and June 6th, respectively, as reported in the state dashboard. The most recent figure indicates a cumulative incidence higher than 370 cases per 100 thousand inhabitants. Furthermore, the mortality rate is higher than 38 deaths per 100 thousand inhabitants in the state. The June dataset contains the records in the previous dataset plus more notifications added to the database. When we examine the date of first symptoms in the June dataset, 44,834 cases had initial symptoms prior to May 7th, hence an outstanding figure of 30,678 late-counted cases.
Table 1 shows the total number of Covid-19 cases in the state of Rio de Janeiro by age and gender categories, along with the number of records missing age information. Of the more than 64,000 cases of Covid-19 reported by June 6th, gender frequency was 51% / 49%, i.e., close to 50% for each gender group (a little more than 30,000 cases each for the male and female groups). Regarding death counts, however, the frequency was much higher for male individuals (57%). The incidence of Covid-19 increases over age intervals. The number of fatalities also increased with age, given that more than 50% of cases were older than 60 years. The age information has a noticeable missing rate (76%) in the dataset but a minimal missing rate among fatal cases. The Case Fatality Ratio in the June dataset is a little more than 10%.
Figure 1 shows the number of confirmed cases of Covid-19 that had death as an outcome in the state of Rio de Janeiro. Until May 7th, the state of Rio de Janeiro had counted 1,394 deaths, among which 63 records did not have the date of death (4.5% missing rate). By June 6th, the state had 6,639 deaths. A total of 54 notifications did not have death dates (0.8% missing rate).
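The rates and the late-counted figure can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (state population of 17.2 million, case and death counts from the two datasets); the variable names are illustrative.

```python
POPULATION_RJ_STATE = 17_200_000  # population size given in the Data section


def per_100k(count, population):
    """Rate per 100,000 inhabitants."""
    return count / population * 100_000


cases_may7 = 14_156         # confirmed cases in the May 7th dataset
cases_june6 = 64_533        # confirmed cases in the June 6th dataset
onset_before_may7 = 44_834  # June dataset, first symptoms prior to May 7th
deaths_june6 = 6_639        # deaths in the June 6th dataset

late_counted = onset_before_may7 - cases_may7
share_known_on_may7 = cases_may7 / onset_before_may7

print(round(per_100k(cases_june6, POPULATION_RJ_STATE)))      # 375
print(round(per_100k(deaths_june6, POPULATION_RJ_STATE), 1))  # 38.6
print(late_counted)                                           # 30678
print(f"{share_known_on_may7:.0%}")                           # 32%
```

In other words, the May 7th extraction contained roughly one third of the cases that the June dataset later attributed to the same period.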
The analysis from the June 6th dataset, however, reveals that the number of deaths up to May 7th was significantly larger: 3,740 deaths had occurred by then, hence 2,346 deaths had not been reported at the time. Also, in the dataset, about 0.8% of cases did not have dates of first symptoms available (504 notifications from the total of 64,533 cases).
Figure 4 shows the results of the nowcasting procedure to evaluate the number of cases in the nine municipalities with most cases in the state of Rio de Janeiro. Observed numbers falling within the shaded areas indicate good performance. The method performs well for the cities of Niterói, São Gonçalo, Queimados, Nova Iguaçu, Macaé, Angra dos Reis and Itaboraí. The number of registered cases in the epidemic curve falls outside the credibility intervals only for Rio de Janeiro and Duque de Caxias. The credibility intervals widen as time progresses, showing increasing uncertainty. All curves point to a slowdown in the number of cases, at varying rates.

Discussion
Brazilian health surveillance relies on database systems that have national coverage and are flexible enough to be handled by various statistical software packages, including open systems. These database systems, such as the current eSUS and Sivep-gripe systems for notifying syndromes such as Influenza-like Illness and Severe Acute Respiratory Illness (SARI), proved to be helpful in the current crisis due to the Covid-19 pandemic. These databases permit epidemiological analyses to obtain results such as the ones in this work. The epidemic in the state of Rio de Janeiro is still ongoing at this moment, but numbers are already critical enough to demand public health policies. The prospect of several thousands of Covid-19 cases and deaths led to harsh but necessary restrictive measures to curb contagion of SARS-CoV-2. Social distancing and several recommendations helped to mitigate the impact of the epidemic, but our analysis still shows many cases, as revealed by more than 60,000 confirmed cases in the state of Rio de Janeiro and more than 6,000 deaths. Hence, the Case Fatality Ratio currently stands at 10%, but the risk differed among groups depending on age and gender. The higher risk applies to older adults, and the number of deaths was disproportionately higher among male individuals. Still, despite the strengths of the health surveillance databases, there are limitations to be either fixed or treated to avoid uncertainties. How to report the number of deaths has been an issue during the Covid-19 pandemic. Counting the total number of deaths by notification date might obscure many early deaths, which can be mistaken for late deaths. Counting by death date permits tracing the dynamics of the disease. However, in this case, recent numbers will hide a large number of deaths still to be counted due to delays involving confirmation, the notification system, and other factors.
The analysis of death numbers revealed that accounting only for recent records largely underestimates the latest figures. Therefore, criteria based only on the recent number of deaths should be avoided.
Reporting of cases exhibits significant delays that impact the daily evaluation of the incidence and add uncertainty to surveillance decisions. The basic reproduction number is in agreement with several evaluations from other countries 12 . The WHO recommends evaluation over two weeks with R<1. If the number of cases is decreasing over the last two weeks, one of the criteria for returning to a new normal is supposedly met. However, results here show R<1 for more than the last two weeks when analyzing the May dataset. Later, the June dataset proved that by late April/early May, many cases of Covid-19 were still emerging. As time progressed, we observed that the city of Niterói could not reach R values below 1, despite having a functional surveillance capacity and all the broad measures from the city Health Department. The current notification delays can lead to low counts in recent days, which inadvertently leads to low estimates of R. Therefore, the recommendations should take into account the delay expected in each city/state.
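The effect of the delay on trend indicators can be illustrated with a toy calculation. In the sketch below, true daily incidence is perfectly flat, but only a fraction of the cases from recent days has reached the database. The completeness fractions are hypothetical, and the week-over-week ratio is a crude stand-in for a trend indicator, not an estimator of R.

```python
def observed_counts(true_counts, reported_fraction_by_age):
    """Counts visible today when recent days are only partially reported.

    reported_fraction_by_age[d] is the hypothetical share of cases with
    onset d days ago that has already reached the database.
    """
    n = len(true_counts)
    observed = []
    for t, count in enumerate(true_counts):
        age = n - 1 - t  # days between onset day t and the extraction date
        if age < len(reported_fraction_by_age):
            fraction = reported_fraction_by_age[age]
        else:
            fraction = 1.0
        observed.append(count * fraction)
    return observed


def weekly_ratio(counts):
    """Cases in the last 7 days divided by cases in the 7 days before."""
    return sum(counts[-7:]) / sum(counts[-14:-7])


true_counts = [100] * 28  # hypothetical flat epidemic: the true ratio is 1
# Hypothetical completeness for onsets 0, 1, ..., 13 days old.
completeness = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,
                0.75, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99]

seen = observed_counts(true_counts, completeness)
print(weekly_ratio(true_counts))
print(round(weekly_ratio(seen), 2))
```

Although the true ratio is 1, the ratio computed from the partially reported series is well below 1, which mimics an apparent decline produced by the delay alone.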
The approach of using the early epidemic to estimate numbers in the current stage of the epidemic proved helpful. An analysis of the epidemic curve in the nine cities ranking highest in confirmed cases shows that later reported values for most of them were within the range of estimation. Therefore, nowcasting is a predictive tool to mitigate notification delay problems. The method by Villela 9 requires an epidemic with a single peak. The current pandemic has required social distancing. If the population resumes some aspects of daily life, such as working or attending school, new periods of an extended force of infection might appear, leading to multiple peaks. In this case, some adjustments, such as analyzing the curve by parts, should adapt the forecasting methodology. Such effects might explain the results for the municipalities whose numbers of cases fell outside the credibility intervals.
Recent works analyzed the reproduction number. Estimating the notification delay is possible from the statistical structure observed in the difference between the registration date and the date of first symptoms. The inference method proposed by Bastos et al. constructs a matrix indexed by time and amount of delay, whose unobserved elements are estimated by statistical learning of the time series 16 . Such matrices will have predictions for every element and might use covariates, for instance, seasonal variables. In some cases, such as the aggregated number of cases over time, the approach presented here, based on the generalized Richards model, applies very well to both nowcasting and forecasting.
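The matrix of counts by time and delay mentioned above can be sketched as a data structure. The code below only builds the table and marks the entries that cannot be observed yet; it does not implement the inference of Bastos et al. 16 , and the function name, the record layout and the sample dates are hypothetical.

```python
from datetime import date


def reporting_triangle(records, start, today, max_delay):
    """Matrix n[t][d]: cases with onset on day t registered d days later.

    Entries with t + d beyond the extraction date are not yet observable
    and are stored as None; these are the elements to be estimated.
    """
    n_days = (today - start).days + 1
    triangle = [
        [0 if t + d < n_days else None for d in range(max_delay + 1)]
        for t in range(n_days)
    ]
    for onset, registered in records:
        t = (onset - start).days
        d = (registered - onset).days
        if 0 <= t < n_days and 0 <= d <= max_delay and registered <= today:
            triangle[t][d] += 1
    return triangle


# Hypothetical records: (date of first symptoms, registration date).
records = [
    (date(2020, 5, 1), date(2020, 5, 1)),
    (date(2020, 5, 1), date(2020, 5, 3)),
    (date(2020, 5, 2), date(2020, 5, 3)),
    (date(2020, 5, 3), date(2020, 5, 4)),
]
triangle = reporting_triangle(
    records, start=date(2020, 5, 1), today=date(2020, 5, 4), max_delay=3
)
for row in triangle:
    print(row)
```

The most recent rows contain the most unobserved entries, which is the structural reason why recent counts are the least complete.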
The numbers presented here are not the final numbers of the epidemic, neither for the state of Rio de Janeiro nor for its cities. Indeed, our analysis shows that incoming data in subsequent weeks will complete the counts of confirmed cases, apart from subnotification issues. Also, this ongoing process impacts the cumulative incidence and mortality rates. A proper estimation of the case fatality ratio requires time to obtain accurate figures for confirmed cases and number of deaths. Also, the confirmation of cases requires testing and diagnosing individuals who demand care at health facilities. But the number of deaths by age group presented here indicates a high risk for older adults, in agreement with other studies of Covid-19 in Italy 17 . The number of deaths suffers from notification delays, and the number of confirmed cases is limited by the number of applied tests. Therefore, the numbers in Table 1 constitute an exploratory analysis, and the death rate evaluation still requires final numbers. These kinds of results are very relevant to modeling studies, including finding better estimates for the CFR, as proposed by Verity et al. 18 . Another possible death-related indicator is the Infection Fatality Rate (IFR), in which the denominator is the number of infected people. One possibility to obtain the IFR is through serological surveys such as the one done by Hallal 19 .
Analysis of notification delay does not solve issues related to undetected cases. Asymptomatic individuals might also impact transmission. Li et al. found that a substantial number of infections went undetected in Wuhan, China, but the transmission rate for those cases accounted for 55% of the regular transmission rate 20 . Also, some individuals might be pre-symptomatic, hence not considered in those datasets.
The methodology defined by Villela 9 for short-term forecasting assumes epidemics with little to no intervention, in which the shape of the epidemic curve is well approximated by the generalized Richards model. We noticed that for two cities in the state of Rio de Janeiro, the number of cases was not within the 95% credibility interval. In most cities, non-pharmaceutical interventions (NPIs) are taking place with significant impact. A distinct behavior due to interventions is therefore a possible reason for the lack of accurate estimation in the case of those cities.
Health surveillance should also rely on contact tracing strategies to identify infected close contacts of infected individuals, so that they receive treatment and subsequent transmission is mitigated. Availability of tests such as RT-PCR is necessary for this purpose. Moreover, serological surveys are essential to having an unbiased perspective of the epidemic. In this case, a study designed to randomly apply serological tests indicates how far the epidemic is from achieving herd immunity, which is the condition under which the number of cases starts to decline. The force of infection drops to the point of not having enough intensity for more outbreaks, even though some stochasticity might still happen. Hallal et al. found a seroprevalence of 7.5% for the population of the city of Rio de Janeiro, showing how underestimated the numbers from the state dashboard are 21 . Testing in this random serological design estimates, from sampling, the number of individuals infected to date, including recovered and asymptomatic ones.
The experience in analyzing surveillance datasets shows that completeness varies widely across variables in the health surveillance databases, since the missing rate can be quite high for a few variables. In the dataset analyzed in this work, the date of symptoms and the date of death were very well documented, whereas age information had a high missing rate.

Conclusions
Incidence and mortality rates for the ongoing epidemic of SARS-CoV-2 infections in the state of Rio de Janeiro in 2020 are already very high, thus requiring proper analysis in health surveillance. This work shows that notification delays introduce uncertainties in epidemic indicators, hence the need for uncertainty treatment.
Criteria for relaxing social distancing measures, introduced to mitigate the impact of the pandemic, demand knowledge of incidence over time. Indicators such as estimated values of the effective reproduction number produce a time-varying measurement of the epidemic intensity. This work, however, shows that the notification delay significantly impacts an accurate estimation. An assessment by nowcasting as described in this work, with knowledge of up to the previous three weeks, should be valid for obtaining accurate estimations, but epidemiological evaluations for decision-making processes should also include other indicators.