Estimated prevalence of COVID-19 in Brazil with probabilistic bias correction

Using data collected by the Brazilian National Household Sample Survey – COVID-19 (PNAD-COVID19) and semi-Bayesian modelling developed by Wu et al., we have estimated the effect of underreporting of COVID-19 cases in Brazil as of December 2020. The total number of infected individuals is about 3 to 8 times the number of cases reported, depending on the state. Confirmed cases are at 3.1% of the total population and our estimate of total cases is at almost 15% of the approximately 212 million Brazilians as of 2020. The method we adopted from Wu et al., with slight modifications in prior specifications, applies bias corrections to account for incomplete testing and imperfect test accuracy. Our estimates, which are comparable to results obtained by Wu et al. for the United States, indicate that projections from compartmental models (such as SEIR models) tend to overestimate the number of infections and that there is considerable regional heterogeneity (results are presented by state).


Introduction
The Brazilian National Household Sample Survey -COVID-19 (PNAD-COVID19) 1 is a nationwide, complex, survey aimed at "estimating the number of persons with symptoms associated with the flu syndrome and at following up the impact of the COVID-19 pandemic in the Brazilian labor market. The data collection of the Brazilian National Household Sample Survey -PNAD COVID-19 began on May 4, 2020, including interviews by telephone in nearly 48,000 households per week, adding up to nearly 193,000 households per month in the entire country. The sample is permanent, i.e., the households interviewed in the first month of data collection will remain in the sample along the next months, up to the end of the survey". Considering its latest release, in December 2020 (survey end date: December 5th), the survey indicates a total of 22.7% positive results for COVID-19, that is, 3.1% of the Brazilian population has tested positive. Official tallies by the Brazilian Ministry of Health are also at 3.1%. Two states are above 7% (confirmed) infection rate: states of Roraima and Amapá both in the North Region of Brazil. Most states have rates between 3% and 5%. According to our results presented by state in the section Results from Probabilistic Bias Analysis, the prevalence of COVID-19 in Brazil at the end of 2020 is estimated to be around 15% of the population, slightly more than 30 million individuals.
The number of confirmed cases is well known as being only a fraction of the actual number of infected individuals. Firstly, testing is not always available and most moderate or asymptomatic cases go undetected even when testing is accessible. Secondly, COVID-19 testing in Brazil has been carried out with a several testing kits and survey respondents may have used procedures with high rates of false negatives. Finally, and somewhat related to the first issue, there is the problem of selection bias as only those with stronger symptoms will seek medical attention and testing 2,3,4,5 . Some earlier studies have indicated that the true number of cases was about 12 times the number of confirmed cases 6,7 . The September issue of The Economist 8 publicized some estimates for the share of the population with COVID-19 antibodies obtained by serosurveys in June and the implied ratio of cases to confirmed cases: it was reported that Moscow (Russia) could -have 27 times more cases than reported ones, Stockholm (Sweden) 17 times, London (England) 14 times, Madrid (Spain) 10 times, and New York City (United States) 7 times. More recent estimates from Wu et al. 9 suggest a ratio of nine times for the United States and that the ratio would vary from three to 20 across the 50 states in the country. In late 2020, studies from the São Paulo University 10 and the Federal University of Pelotas 11 , in Brazil, indicated a much higher share of antibodies in the population, as high as 60%, close to standard herd immunity thresholds, but recent increases in cases across the country suggest that antibodies may not be as prevalent. In any event, common sense and empirical evidence strongly suggest that the actual number of infected individuals is much higher than the number of confirmed (cumulative) cases.
For estimating the prevalence of COVID-19 in Brazil at a given point in time we can resort to at least three major approaches. Firstly, there are simple analyses using data from other years on similar respiratory diseases and some demographical considerations: for instance, Ribeiro et al. 7 concluded in May 2020 that 3.8 times more hospitalizations in Brazil due to COVID-19 were identified than reported by analyzing hospitalization patterns of acute respiratory distress syndrome between 2012 and 2019 as compared to 2020.
Another approach is based on dynamic mathematical models, such as SEIR models, which have been made widely available for COVID-19. Such models yield projections that are higher than what is observed as illustrated by the compilation of four popular models in OurWorldInData.org 12 : recent estimates regarding SARS-CoV-2 from the SEIR model by Youyang Gu (YYG) indicate that Brazil has reached 16.9% infection rate, eight times the confirmed cases. The model from the London School of Hygiene & Tropical Medicine (LSHTM) estimates that between 28% and 42% of symptomatic cases are unreported. The models by the Imperial College London (ICL) and the Institute for Health Metrics and Evaluation (IHME) place estimates at much higher values, especially around June.
A third approach, adopted by Wu et al. 9 who developed a bias correction model for the estimation of COVID-19 cases, is to perform a quantitative bias analysis 13 . This approach is entirely data-based and does not aim to model transmission mechanisms or dynamics. Quantitative bias analysis aims to quantify the effects of systematic error (due to selection bias, unmeasured confounders, information bias, etc.) on estimates derived from nonrandomized epidemiologic studies.
Cad. Saúde Pública 2021; 37(9):e00290120 The next section briefly describes the approach of bias correction developed by Wu et al. 9 and reproduced in this study. Section Results from Probabilistic Bias Analysis provides the results we obtained for Brazil using the national survey mentioned above (PNAD-COVID19) and Section Discussion concludes with a summary and final remarks briefly discussing the related issue of herd immunity.

Probabilistic bias analysis by Wu et al.
Along the lines of probabilistic bias analysis 13 , Wu et al. 9 have developed a simulation-based bias correction method aimed to adjust count estimates on confirmed cases for selection bias (preferential testing of moderate/severe cases) and imperfect test accuracy. Despite additional computational cost, empirical, simulation-based methods provide the flexibility needed for the kind of multiple correction desired. In this study, the simple estimate based on confirmed cases is biased away from the true value due to preferential testing and imperfect test accuracy. The parameters affecting the distributions used to correct for bias are treated as random variables and, hence, the procedure is known as probabilistic bias analysis. Even though the only modification we have made to the method proposed by Wu et al. 9 is the selection of (hyper-)parameter values in the prior models (their specifications reflect the U.S. reality), we explain their method for completeness.
More specifically, we want to estimate N* which is the number of cases in the population (for each of the 27 Brazilian states, including the Federal District). The starting point, which is reported in surveys or official reports, is , the number of confirmed cases among tested individuals which we identified to be just a fraction of N* due to selection bias and imperfect test accuracy. These two figures are connected by an epidemiological identity which provides a correction for imperfect test accuracy 13 , where: N is the population size (known), is the number of infected individuals not tested, , and S e and S p are test sensibility and specificity, respectively. The value of is unknown. It may be obtained as the sum of the number of untested individuals who have moderate or severe symptoms and would result positive if tested, and the number of untested individuals who have mild or no symptoms and would result positive if tested, The above expressions provide correction for incomplete testing. Following Wu et al. 9 , we considered them to be binomially distributed with size and success probability equal to or 1. In simulations, these two quantities are held fixed at their mean values since their variability is negligible compared to other sources of uncertainty and the population size is large 9 . However, the probabilities involved in the above expressions are unknown. They are modelled and simulated: thus the set of parameters for which prior information must be provided contains the probability of having moderate to severe symptoms among tested individuals, and also , and the probability of having mild symptoms among positive cases, . It also contains the sensibility and the specificity of testing procedures used for COVID-19 and the ratios α and β which refer, respectively, to and divided by . A probabilistic identity is used to coherently connect these probabilities, where the ratios α and β have been defined above. Because priors are specified on both sides of the equation above, a sample for the vector is obtained by a technique known as Bayesian melding 14 . Bayesian melding is a procedure that uses the fact that one parameter can be expressed as a deterministic function of other parameters. It can be useful to simplify complex models when a deterministic relation is present. In this sense, it is enough to define the prior for one of them with the other being fully determined by the deterministic relation among them.

Simulations
Considering the above framework, seven quantities are present in the probabilistic bias analysis just described whose uncertainties must be assessed with simulations.
For each state, the empirical estimate (cumulative number of cases divided by the cumulative number of tests) from the latest release of PNAD-COVID19 is fixed. The ratio of tests performed to the state population ranged from 9% to 26% across the 27 states and the point estimates of positive rates range from 18% to 50 across states.
With those probabilities fixed the relevant quantities are simulated and a distribution of expected cases is obtained for each state as described next. Table 1 shows all parameters modelled as truncated beta random variables such that their moments and bounds, agreeing with estimated values in the survey (PNAD-COVID19) or test kit specifications obtained from the Brazilian Ministry of Health.
Finally, a decomposition of the two sources of biases, incomplete testing and imperfect test accuracy, can be obtained through , and p 2 = 1p 1 , where p 1 is attributable to the inaccuracies of testing and its complement to incomplete testing.
Initially, 10 4 values are sampled from the distributions of , , α, β, S e and S p with these six variables independent and identically distributed across states. Then, values of and are simulated based on and simulated values of α and β. Despite the theoretical possibility that some parameters may be correlated between some states, we and Wu et al. 9 do not have robust evidence to inform an appropriate model of correlation structure.

Table 1
Prior specifications for truncated beta models used in the probabilistic bias analysis. In Wu et al. 9 , and in this study, Bayesian melding is used to relate the components of . As stated by Poole & Raftery 15 the use of pooling weight to be ½ makes the combined prior distribution of and to be the geometric mean of each prior distribution. We do not have any evidence to support any of the prior distributions, therefore using ½ as pooling weight is the natural choice and it is employed in this study. Bayesian melding is performed with 10 5 iterations of sampling-importance-resampling algorithm (SIR); a simulation size greater than 10 4 , since this stage involves a more complex generator as opposed to independent univariate sampling. These simulation sizes were also used by Wu et al. 9 .

Parameter
After all samples are simulated, point estimates are obtained as sample medians. The analysis does not involve a probability and only the quantities of interest are sampled. No sampling of likelihood parameters was necessary.
Before we report our results based on the methodology described, we argue that a suitable way to estimate the number of infected individuals is to try to emulate a natural experiment based on the data provided by PNAD-COVID19. Many employers and local governments have requested mass testing for certain groups of individuals. We took a subsample from the national survey considering only working individuals, aged from 18 to 60 years old who have declared absence of previous COVID-19-like symptoms. This should partially eliminate selection bias. By calculating the percentage of those individuals who tested positive we reached 14.3%.

Results from probabilistic bias analysis
We will report our results in terms of percentage of the population infected by COVID-19 (prevalence), and the associated correction factor F, Figure 1 shows our main results, where the estimated percentage of cases decomposed into confirmed cases (blue bar) and unreported cases (orange part) can be identified. Table 2 brings the corresponding credibility intervals. Northern states have higher number of estimated cases. Note that, reported cases account for 8% of the population of the state of Roraima. After bias corrections this value increases to 28%. Amazonas was one of the states with the highest number of infected individuals. However, the local government has reported that only 5% of its population was infected with COVID-19 (official cases). A recent study 10 suggests that the prevalence of COVID-19 antibodies is much higher based on donated blood sample, as high as 60%. We estimate it to be around 21%. Figure 2 shows our results in terms of correction factors (F) including lower and upper bounds (2.5% and 97.5% quantiles). For instance, the state of Ceará has, according to these estimates, 4.3 times more cases than officially reported within an interval 3-5.3.
Finally, our results indicate a strong linear association between correction factors and testing coverage in the log-log scale (Figure 3), which can be written as: Therefore, we can estimate F for a given region, say a large city, which has not been directly targeted by the survey. For instance, if 5% of the population has been tested then we estimate that F ≈ 9.0; for 10% coverage we estimate F ≈ 4.9 and for 20% testing we should obtain F ≈ 2.7.

Discussion
Estimating the number of infected individuals during a pandemic is essential to decision-makers and researchers. National surveys such as PNAD-COVID19 have been a significant source of information to adjust official counts of reported cases for different biases. Our use of the simulation methods developed by Wu et al. 9 show that Brazilian official estimation is about five times lower than the expected number of cases. It must be understood that this number is static and represents the magnitude of underreporting in August and that this number is likely to have been higher in the peak of the pandemic. We have presented national and state figures but for practical policy use, local (municipality level) estimates may be necessary.
The city of Manaus is located in Brazil and is the largest city in the state of Amazonas which accounts for slightly more than half of the state's population. The COVID infection in this region presented two peaks (with peaks in late July/2020 and in mid-March/2021) and has thus attracted much attention as an example of a largely unmitigated epidemic. A recent study 10 has estimated that, by October of 2020, the cumulative incidence in Manaus was 76% (95%CI: 67% to 98%), whereas our estimate for the state stands at 21% (95%CI: 15% to 26%). The fact that, in March, the city experienced a stronger second peak suggests that the 76% estimate is, most likely, unreliable and too high. Considering such high infection rate, a strong second peak would have to be assigned to the new variant found in Manaus (P.1 lineage) and to antibody waning. Reinfection would have then been very common which, as far as we know, has not been well documented. Despite these possibilities we still find the 76% estimate to be too high. Buss et al. 10 have indicated that their results rely on a certain Cad. Saúde Pública 2021; 37(9):e00290120 assumption about the dynamics of seroreversion and that their sample is not a random sample but, rather, a sample of blood donors which they argue to be representative of the population of Manaus. Notably, without adjusting for seroreversion, Buss et al. 10 report an estimated prevalence, adjusted for specificity and sensitivity, of 26% (95%CI: 21% to 31%) which is close to what is reported in this study. The disparity appears after adjustment for seroconversion. Whether this adjustment is reliable is open for research it certainly suggests that the estimates around 20-30% are too conservative. Similar arguments may be applied to the case of the state of São Paulo. Buss et al. 10 report an estimated prevalence of 29% (95%CI: 26% to 37%) adjusting for seroconversion and 14% (95%CI: 11% to 17%) without adjustment, for the capital city which accounts for slightly less than 30% of the state's population. Our state estimate for the state of São Paulo is 11% (95%CI: 7.8% to 13%). We conclude that our estimates, or any estimate not adjusted for seroconversion, must be considered conservative. Since the prevalence (or a lower bound for it) has been estimated, it is natural to ask about herd (or collective) immunity, the level at which contagion becomes under control (herd immunity threshold -HIT). Despite lacking a precise definition, the concept of herd immunity is inevitable in discussions about COVID-19 and infectious diseases in general.
The possibility that collective immunity is not applicable to SARS-CoV-2 could be based on the epidemiology of the coronavirus HCoV-NL63 16 for which long term individual immunization is not achieved and reinfection is common. However, our understanding of the epidemiological literature 17 and of expert opinions available in the media is that collective immunity is applicable to COVID-19. Nonetheless, the actual value to be targeted has been a topic for discussion.   The standard formula for the threshold is 1 -(1/R 0 ), where R 0 is the number of persons infected, on average, by a given individual carrying the virus. It assumes uniform susceptibility and it results from an idealized scenario. For COVID-19 it has been argued that, approximately, R 0 = 3, and thus HIT = 67%. Such calculations have been questioned in the epidemiological literature and, regarding SARS-CoV-2, Aguas et al. 18 have argued that random immunization is far from reality and considerations about non-uniform susceptibility considering the individual coefficient of variation (CV) would lead to a more realistic formula: in an epidemic that takes its natural course, by contrast, the virus very specifically infects the people that are most susceptible first. This removes all of the strongest vectors early on, and continues to selectively remove the vectors until the herd immunity threshold is reached. The new parameter, CV, plays an opposite role than that of the reproduction number. The larger the CV the lower the HIT. It is proposed by Aguas et al. 18 that . Just like R 0 , the CV needs to be estimated. Setting aside (relevant) discussions about the validity of proposed values for de R 0 and CV for SARS-CoV-2, if we consider R 0 ≈ 2.5 and recent estimates of the CV for SARS-CoV-1 for Singapore and Beijing (China), that is CV ≈ 2.6, we obtain HIT = 11%. For comparison, CV values for malaria in the Amazon and tuberculosis in Brazil are 1.8 and 3.3 respectively. First versions of Aguas et al. 18 concluded that HIT values could be much lower than 50% in many cases but this seems to be valid without mutations and other changes in the dynamics of the epidemic as can be observed throughout the world with cases still on the rise in places where infection has passed the 10%. Nevertheless, the work of Aguas et al. provides an important discussion in the ongoing research on HIT.

Figure 3
Approximate linear relationship (log-log) between correction factors and testing coverage.
Another source of uncertainty must be addressed before answering the herd immunity question. The question of pre-existing immunity 19 . An aspect much harder to be measured. We thus expect more research to be conducted along these lines, not only regarding the best expression for HIT but also on best estimates for R 0 and CV and more insights into pre-existing immunity.
If we assume that the HIT for SARS-CoV-2 is in fact less than 67% (due to the effect of CV and pre-existing immunity), say 50%, then, considering the estimated prevalence of 8% to 28% ( Figure 2; Table 2), most of Brazil is still some time away from achieving some sort of collective immunity but not too far if vaccination efforts are successful. For HIT around 50% some states are half way towards herd immunity but for HIT around 60% to 70% most states would be less than halfway in reaching the threshold.