Estimation of live birth underreporting with a capture-recapture method, Sergipe, Northeastern Brazil

OBJECTIVE: To estimate the number of live births and, therefore, the underreporting of live births.
METHODS: The databases of the Live Birth Information System and the Civil Registry of the Brazilian Institute of Geography and Statistics, from the second and third trimesters of 2006 in Sergipe state (Northeastern Brazil), were paired by deterministic linkage based on the number of the Live Birth Declaration. The geographic disaggregation utilized was the mother’s microregion of residence. Huggins closed population models were used to estimate the capture probabilities for each database and the total live births during the period, within each geographic subdivision. MARK® software was used for the estimates.
RESULTS: Underregistration during the period studied was 19.3%. Application of the capture-recapture method to estimate underregistration of live births is possible, including for geographic disaggregations smaller than a state. The deterministic linkage was impaired in four microregions, due to non-inclusion of the Live Birth Declaration number in the database of the Brazilian Institute of Geography and Statistics. Maternal age, a heterogeneity characteristic in the population of live births, affected the probability of capture by the civil registry.
CONCLUSIONS: Capture-recapture was a viable method to estimate the underregistration of live births.
DESCRIPTORS: Live Birth. Birth Certificates. Underregistration. Registries. Records as Topic. Vital Statistics.


INTRODUCTION
According to Simões,c the lack of coverage by vital statistics is a barrier to the direct calculation of fertility and mortality rates in Brazil.
Calculating fertility and child mortality rates with direct methods, without correcting for the underreporting of births and deaths, can hide the demographic reality of a population.a To calculate these indicators, indirect estimation techniques are employed, with information sources including the demographic census and representative studies. Often, violation of the assumptions implicit in such techniques distorts the estimates. When estimates are made for geographic disaggregations smaller than federal units, the problem becomes more complex due to the small population size of many Brazilian municipalities.c
Various indicators may be calculated using statistics from the Civil Registry, such as fertility rates, mortality coefficients and life expectancy at birth. Efforts to understand the neglect of civil registration attempt to remedy this situation and adhere to the principles of efficient professional practice contained in the United Nations' Fundamental Principles of Official Statistics.d
Estimation through the capture-recapture method uses the overlap between incomplete registries to formally measure the underestimation of these sources. This allows for the correction of statistics and the production of indicators that better approximate reality. The available sources (lists) may include mandatory reportable diseases, statistics from hospitals and other health services, and death records, in addition to other sources.2,11,12
In summary, the capture-recapture method was utilized to estimate the population of France in 1793. Since the 19th century, the technique has been widely used to estimate populations of wild animals,3,8 and in various other applications in medicine, demography and epidemiology.
In 1894, Petersen developed the simplest model for estimation with the capture-recapture method, using two samples.8 In the 1940s, Sekar & Deming 13 estimated the underregistration of live births and deaths in India. Using census data, Shapiro 14 applied the technique to calculate the underregistration of live births in the USA. In 1968, Wittes & Sidel 19 introduced a generalized capture-recapture method for epidemiological applications, through the use of two or more lists. Interest in this method continued to grow, and since the 1990s its use in epidemiologic research has increased considerably.8
The objective of this study was to apply the capture-recapture method to estimate live births.

METHODS
In ecology, the most straightforward method involves sampling the population, marking the individuals, allowing them to mix with the remaining population, and then taking a new sample. The marked and recaptured individuals are counted, and the total population size is estimated based on the number of individuals exclusively contained in the first sample ($n_A$), exclusively in the second sample ($n_B$) and in both samples ($n_{A \cap B}$). To use this technique, the following assumptions are necessary:4,8
1. The population is closed, or in other words there are no births, deaths, or migration in the period between samples;
2. Marking is unique, meaning that each individual is identified by the mark and there is no possibility of losing it;
3. In each sample, every individual has the same probability of being sampled (equiprobability);
4. The two samples are independent, i.e. the event of one individual being captured in a sample is independent from the event of that individual being captured in another sample; and
5. In each sample, any individual is captured (recaptured) independently from the others.
The idea is that if the population in a given area is small, a large number of the individuals captured by the second sample will have been marked in the first sample. On the other hand, if the population is large, the second sample will contain a small number of individuals marked in the first sample.
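The intuition above can be illustrated with the classical two-sample (Lincoln-Petersen) estimator. This is a minimal sketch assumed here only to show the logic of overlap; it is not the Huggins model used later in the study, and the function name and counts are hypothetical.

```python
def lincoln_petersen(only_a: int, only_b: int, both: int) -> float:
    """Two-sample population estimate from the counts of individuals seen
    only in sample A, only in sample B, and in both samples."""
    if both <= 0:
        raise ValueError("at least one recaptured individual is required")
    n_a = only_a + both  # everyone captured in the first sample
    n_b = only_b + both  # everyone captured in the second sample
    return n_a * n_b / both


# Hypothetical counts: both samples always contain 100 individuals,
# but the overlap between them differs.
small_population = lincoln_petersen(only_a=20, only_b=20, both=80)  # high overlap
large_population = lincoln_petersen(only_a=80, only_b=80, both=20)  # low overlap
print(small_population, large_population)
```

With the same sample sizes, a high overlap yields a small estimate and a low overlap yields a large one, which is the behavior the paragraph describes.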
In epidemiology, each available list is considered a sample of the population and "being registered on the list" is equivalent to "being captured" in the sample.
For more details about the development of this method in epidemiology, refer to Coeli et al,2 Hook & Regal,4,5 the International Working Group for Disease Monitoring and Forecasting (IWGDMF),8,9 and Wittes et al.18
[...] where [...] equals [...] when $z_{ij} = 0$, and $z_{ij}$ indicates a prior capture of individual i. Alternatively: [...] Therefore, $\gamma_{ij}$ is the probability of individual i being captured in sample j, given its capture history and given that it was captured at least once during the study.
Let $x_{ij} = 1$ when individual i is captured in sample j and $x_{ij} = 0$ when it is not. If the captured individuals are relabeled 1, 2, 3, ..., n and the non-captured individuals n+1, n+2, n+3, ..., N, then the conditional likelihood is proportional to: [...] This depends only on the individuals sampled. The linear adjustment for individual and/or environmental characteristics uses a logistic function, $\ln[p_{ij}/(1 - p_{ij})]$.7 According to the author, the variables are normally distributed and their variances can be estimated from the matrix of second derivatives. Various models can be fitted based on the observed variables and the capture history.
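As a rough illustration of a likelihood that depends only on the sampled individuals, the sketch below evaluates a conditional log-likelihood of the kind commonly written for Huggins-type closed-population models. It assumes, as a simplification, that a prior capture does not change the capture probability; the function names and the data are hypothetical and are not taken from the study.

```python
import math


def logistic(eta: float) -> float:
    """Inverse of the logit link ln[p / (1 - p)]."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_log_likelihood(histories, probs):
    """Log-likelihood of the observed capture histories, conditional on
    each individual having been captured at least once.

    histories[i][j] is 1 if individual i was captured in sample j, else 0.
    probs[i][j] is the capture probability of individual i in sample j.
    Simplifying assumption: no behavioural response to a prior capture.
    """
    total = 0.0
    for x_i, p_i in zip(histories, probs):
        never_captured = 1.0
        for x_ij, p_ij in zip(x_i, p_i):
            total += math.log(p_ij if x_ij else 1.0 - p_ij)
            never_captured *= 1.0 - p_ij
        # Condition on the individual appearing in at least one sample.
        total -= math.log(1.0 - never_captured)
    return total


# Hypothetical data: three captured individuals, two samples, and a linear
# predictor of 0 for everyone (capture probability 0.5 in each sample).
histories = [(1, 0), (0, 1), (1, 1)]
probs = [(logistic(0.0), logistic(0.0))] * 3
log_lik = conditional_log_likelihood(histories, probs)
print(log_lik)
```

In practice the capture probabilities would be functions of covariates and unknown coefficients, and the coefficients would be chosen to maximize this quantity.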
To estimate the population size, the probability of individual i being captured at least once during the study is:7 [...] where β is the vector of parameters associated with the fitted model. An estimate that does not depend on the population size is: [...] and the variance is: [...] The standard error of $\hat{N}$ is the square root of its variance. The 95% confidence interval is: [...]
Data were obtained from the Ministry of Health (preliminary 2006 data from the Live Birth Information System, Sistema de Informações sobre Nascidos Vivos, SINASC) and the Brazilian Institute of Geography and Statistics (Instituto Brasileiro de Geografia e Estatística, IBGE; Civil Registry of Live Births for 2006). In 2006, the collection form used by IBGE included the number from the Live Birth Certificate, which was used to link the two databases and identify unreported live births. Information on data organization, standardization and linkage between the two databases was previously described.e
The total number of live births was estimated with Huggins models, fitted for the second and third trimesters of 2006, in Sergipe state (Northeastern Brazil). Underreporting was calculated using the estimates of total live births. Each data source was considered a sample, or occurrence: SINASC was the first occurrence (first capture) and the Civil Registry was the second occurrence (recapture). The geographic analysis considered the microregion of the mother's residence. Each microregion was treated as a group of individuals.
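The population-size estimate mentioned in the Methods relies on the probability of each individual being captured at least once. A minimal sketch of one common form of such an estimator, the sum over captured individuals of the inverse of that probability, is given below. This form is an assumption about the general technique, not a transcription of the study's formulas, and the names and values are hypothetical.

```python
import math


def prob_captured_at_least_once(capture_probs):
    """One minus the probability of being missed in every sample."""
    missed_everywhere = math.prod(1.0 - p for p in capture_probs)
    return 1.0 - missed_everywhere


def estimate_population_size(capture_probs_by_individual):
    """Sum, over captured individuals, of 1 / P(captured at least once)."""
    return sum(
        1.0 / prob_captured_at_least_once(p_i)
        for p_i in capture_probs_by_individual
    )


# Hypothetical example: 3 captured individuals, two samples, and a capture
# probability of 0.5 in each sample for every individual.
n_hat = estimate_population_size([(0.5, 0.5)] * 3)
print(n_hat)  # each individual counts for 1 / 0.75, so about 4 in total
```

Individuals who were unlikely to be captured receive a larger weight, since each of them stands for more individuals who were missed by both lists.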
Various factors can influence underreporting of live births, and some characteristics can be incorporated in models, including mother's education, race, number of previous children and the existence of piped water and sewage connection in the residence. Nonetheless, to use this technique the individual variables included in the linear model must be available at all captures,7 in this case, the two data sources. Only the sex of the child and maternal age were available in the two databases and were considered in the estimation models. The lack of an official document center (cartório) for the registration of people in the municipality is an institutional factor that can hinder civil registration of live births; in Sergipe, only two municipalities did not have this type of official document center in 2006.f Therefore, this factor was not considered, since microregions were used in the geographic disaggregation and official document centers were located in all subdivisions.
Between April 1 and September 30, 2006, SINASC captured 19,502 live births and the Civil Registry captured 17,254. Pairing the databases by birth certificate number generated 15,532 pairs. Based on this pairing, the two databases included 21,224 registrations of live births from mothers residing in Sergipe. During the study period, the Civil Registry had 808 registrations with a missing birth certificate number, approximately 4.7% of the database. When the registrations with a birth certificate number are compared to the ones without a number, there is no statistical difference in the average age of the mother (p = 0.992) or in the proportion of the sex of the child (p = 0.510). The distribution of registrations with a missing birth certificate number in the Civil Registry revealed that some microregions have more than 5% of registrations with a missing birth certificate number: Agreste de Lagarto (31.9%), Tobias Barreto (9.2%), Boquim (8.2%) and Japaratuba (7.9%). In the other microregions, the percentage missing varied from 0.4% in Nossa Senhora das Dores to 4.0% in Carira.
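The pairing step and the counts reported above can be sketched as follows. The record layout and the field name `dn` (standing for the Live Birth Declaration number) are hypothetical; only the totals in the last lines come from the text.

```python
def link_by_certificate_number(sinasc, civil_registry):
    """Deterministic linkage on the 'dn' field (hypothetical field name).

    Returns the set of matched numbers and the count of Civil Registry
    records that cannot be linked because the number is missing.
    """
    sinasc_numbers = {rec["dn"] for rec in sinasc if rec.get("dn")}
    registry_numbers = {rec["dn"] for rec in civil_registry if rec.get("dn")}
    pairs = sinasc_numbers & registry_numbers
    missing = sum(1 for rec in civil_registry if not rec.get("dn"))
    return pairs, missing


def distinct_records(n_sinasc, n_registry, n_pairs):
    """Distinct live births identified by at least one of the two lists."""
    return n_sinasc + n_registry - n_pairs


# Toy records, for illustration only.
sinasc = [{"dn": "001"}, {"dn": "002"}, {"dn": "003"}]
civil = [{"dn": "002"}, {"dn": "003"}, {"dn": None}, {"dn": "004"}]
pairs, missing = link_by_certificate_number(sinasc, civil)

# Totals reported in the text for April to September 2006.
r = distinct_records(19502, 17254, 15532)
missing_share = 808 / 17254
print(sorted(pairs), missing, r, round(100 * missing_share, 1))
```

A record with a missing number can never be paired, so it is counted as a distinct live birth even when the same child also appears in the other list; this is why missing numbers matter for the estimates.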
For the sex of the child, the null hypothesis was that the proportion of girls in each health microregion was the same as the proportion for the state as a whole. For maternal age, the null hypothesis was that the mean age in the health microregion was the same as in Sergipe. The proportion of female children was not statistically different from the state proportion, whereas mean maternal age was statistically different. Therefore, only maternal age was considered for inclusion in the estimation models.
Considering i = 1, 2, 3, ..., N as live births and b as the database (SINASC or Civil Registry), the full linear model for the capture probabilities of SINASC and the Civil Registry was:
$$\ln\left[\frac{p_{ib}}{1 - p_{ib}}\right] = \beta_0 + \sum_{k=1}^{12} \beta_k g_k + \beta_{13}\,\mathrm{idmae}_i + \sum_{k=1}^{12} \beta_{k+13}\,(g_k \times \mathrm{idmae}_i) \qquad (1)$$
where $p_{ib}$ is the probability of individual i being in database b (SINASC or Civil Registry); $\beta_0$ is the intercept; $\beta_k$ is the parameter estimate for group k (k is the microregion); $g_k$ indicates the individuals that belong to group k; $\beta_{13}$ is the parameter estimate for maternal age; $\mathrm{idmae}_i$ is the age of the mother of individual i, in years; and $\beta_{k+13}$ is the parameter estimate for the interaction between group k and maternal age.
In this model, the probability of capture varies according to individual characteristics. The notation adopted for the full model was [p(g + idmae + g*idmae) c(g + idmae + g*idmae)].
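To make the structure of the full model concrete, the sketch below computes a capture probability from an intercept, a group (microregion) effect, a maternal-age effect and their interaction, using the logit link. All coefficients and group labels are hypothetical and serve only as an illustration; they are not estimates from the study.

```python
import math


def capture_probability(beta0, beta_group, beta_age, beta_interaction,
                        group, maternal_age):
    """Capture probability under a logit-linear model with a group effect,
    a maternal-age effect and a group-by-age interaction.

    beta_group and beta_interaction map a group label to its coefficient;
    the reference group has no entry and contributes zero.
    """
    eta = (beta0
           + beta_group.get(group, 0.0)
           + beta_age * maternal_age
           + beta_interaction.get(group, 0.0) * maternal_age)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, for illustration only.
beta_group = {"A": 0.4}
beta_interaction = {"A": -0.01}
p_reference = capture_probability(1.0, beta_group, 0.02, beta_interaction,
                                  "reference", 25)
p_group_a = capture_probability(1.0, beta_group, 0.02, beta_interaction,
                                "A", 25)
print(round(p_reference, 3), round(p_group_a, 3))
```

Dropping terms from the linear predictor gives the sub-models: without the interaction the age effect is the same in every group, and with only the intercept the probability is constant, as in p(.).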
Each sub-model generated specific parameter estimates in accordance with the terms specified. For example, a model in which the probability of capture by SINASC depends on the group (geographic disaggregation) and maternal age, and the probability of capture by the Civil Registry depends only on maternal age, [p(g + idmae) c(idmae)], has different parameter estimates from a model in which the probabilities of capture by SINASC and the Civil Registry do not vary across microregions and are independent of maternal age, [p(.) c(.)]. Models with more parameters fit the data better, but the precision of the parameter estimates decreases.
One way to evaluate fit and precision is to compare the models according to information criteria. One of these is the Akaike Information Criterion (AIC), which relates the conditional likelihood of the model to the number of parameters estimated:
$$\mathrm{AIC} = -2\ln(L) + 2k$$
where L is the conditional likelihood of the model and k is the number of parameters. Models that fit better have higher conditional likelihoods, which decreases the value of [-2ln(L)]. Additional parameters therefore reduce this term, but the additive term [+2k] penalizes the AIC value and balances it. The model with the smallest AIC is thus the most parsimonious in relation to its likelihood and number of parameters.g
Although the AIC is easy to interpret and makes it easy to select the model that best fits the data, some models can have very similar AIC values, which makes selection difficult. The models can then be calibrated to provide a relative plausibility index, using the normalized Akaike weights. The weights, $w_i$, are calculated for each model in the set of I candidate models, according to the formula:
$$w_i = \frac{\exp(-\Delta\mathrm{AIC}_i/2)}{\sum_{r=1}^{I} \exp(-\Delta\mathrm{AIC}_r/2)}$$
where $\Delta\mathrm{AIC}_i$ is the difference between the AIC value of model i and that of the model with the smallest AIC. The weight $w_i$ is considered as evidence that model i is the best of all the candidate models. A greater model weight can be interpreted as stronger support from the data.f
The models were fitted with MARK®, which estimates the capture and recapture probabilities in accordance with the linear model specified for each probability. Various models were fitted in which the probability of capture by SINASC ranged from a constant through the full model described in equation (1). The same procedure was adopted for the capture probability of the Civil Registry, totaling 30 sub-models for the microregions. In addition, four models were fitted for the entire state of Sergipe, in which no group was considered, to investigate the influence of maternal age on capture by the databases.
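The model comparison described above can be sketched in a few lines. The code assumes the usual definitions, AIC as minus twice the log-likelihood plus twice the number of parameters, and Akaike weights as normalized values of exp(-ΔAIC/2). The log-likelihoods and parameter counts are hypothetical, chosen only to show the trade-off.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion for one fitted model."""
    return -2.0 * log_likelihood + 2.0 * n_params


def akaike_weights(aic_values):
    """Normalized Akaike weights for a set of candidate models."""
    best = min(aic_values)
    relative = [math.exp(-(a - best) / 2.0) for a in aic_values]
    total = sum(relative)
    return [value / total for value in relative]


# Hypothetical candidates: (log-likelihood, number of parameters).
candidates = {
    "p(.) c(.)": (-1050.0, 2),
    "p(g) c(.)": (-1040.0, 14),
    "p(g) c(g)": (-1039.0, 26),
}
aic_values = [aic(ll, k) for ll, k in candidates.values()]
weights = akaike_weights(aic_values)
for name, a, w in zip(candidates, aic_values, weights):
    print(f"{name}: AIC = {a:.1f}, weight = {w:.3f}")
```

In this toy set the richer models have higher likelihoods, yet the simplest model has the smallest AIC because the gain in likelihood does not offset the penalty for the extra parameters.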
The MARK® program calculated the number of parameters for each model, as well as the conditional likelihood, the AIC value, $\Delta\mathrm{AIC}$, $w_i$, the probabilities of capture (SINASC) and recapture (Civil Registry), and the derived estimate of total live births ($\hat{N}$).
After obtaining the estimates of total live births, civil underreporting was calculated as a percentage:
$$l_s = \frac{\hat{N}_s - n_{RC_s}}{\hat{N}_s} \times 100 \qquad (2)$$
where $l_s$ is the percentage of underreporting in subdivision s (microregion), $\hat{N}_s$ is the estimate of total live births for subdivision s, and $n_{RC_s}$ is the number of live births captured by the Civil Registry in subdivision s.
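As a worked check of this calculation, the sketch below reads civil underreporting as the share of estimated live births that are absent from the Civil Registry. That reading is an assumption, although it reproduces the state-level figures reported in the Results. The inputs are the estimate of 21,391 live births, its 95% confidence limits and the 17,254 Civil Registry records given in the text; the function name is hypothetical.

```python
def underreporting_percent(n_hat: float, n_civil_registry: int) -> float:
    """Share of estimated live births not captured by the Civil Registry."""
    return (n_hat - n_civil_registry) / n_hat * 100.0


n_civil_registry = 17254
point = underreporting_percent(21391, n_civil_registry)
lower = underreporting_percent(21363, n_civil_registry)
upper = underreporting_percent(21423, n_civil_registry)
print(round(point, 1), round(lower, 1), round(upper, 1))  # 19.3 19.2 19.5
```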
The cartogram was created with the Tabwin® application.

RESULTS
Among the four models fitted for Sergipe state as a whole, the model with the greatest weight, {p(idmae) c(.)}, was the model in which maternal age interferes in the capture of live births by the Civil Registry, with approximately 66% of the total weight among the models. This model estimated 21,391 (95% CI 21,363; 21,423) live births in the second and third trimesters of 2006 in Sergipe, with a probability of capture estimated at 0.912 for SINASC (p) and 0.804 for the Civil Registry. From the estimated number of live births, civil underreporting was calculated at 19.3%. Applying the estimate given in (2) to the 95% CI for total live births, underreporting was estimated to vary between 19.2% and 19.5%.
When including the microregions of maternal place of residence as groups of live births, only 5 of the 30 models fitted to the data demonstrated any relative weight when evaluating the AIC and conditional likelihood criteria. The model with the greatest weight, {p(g) c(g + idmae + g*idmae)}, with 67%, was selected (Table 1).
The probability of capture by SINASC was high in all microregions, varying from 0.69 in Agreste de Lagarto to 0.95 in Estância and Nossa Senhora das Dores. In Agreste de Lagarto, a large number of records in the Civil Registry have a missing birth certificate number (more than 31%), which limited the pairing of the databases. Also, in Tobias Barreto, Japaratuba and Boquim, the percentage of records with a missing birth certificate number was greater than 5%. Besides these microregions, where matching was affected by the lack of a birth certificate number, only the Propriá microregion had a SINASC capture probability below 0.90 (Table 2).
The capture probabilities for the Civil Registry were noticeably smaller than for SINASC. In the Civil Registry, the greatest estimated capture probability was in the Aracaju microregion (0.85) and the smallest was in Sergipana do Sertão do São Francisco (0.71), excluding the microregions with problematic matching (Table 2).
The total estimated live births ($\hat{N}$) was very close to the total measured in all the microregions, due to the high overlap of the lists. Again in Agreste de Lagarto, the high percentage of Civil Registry records with a missing birth certificate number created low overlap (relatively low $n_{S \cap RC}$) and therefore inflated the total estimate of live births. The absolute difference between the estimated live births and captured live births was less than 20 in almost all the microregions where the percentage of missing birth certificate numbers in the Civil Registry was less than 5%, and reached only 2 live births in Carira and Nossa Senhora das Dores (Table 3).
Underreporting across microregions varied from slightly more than 12% in Baixo Cotinguiba to almost 27% in Sergipana do Sertão do São Francisco. Estimated underreporting in Agreste de Lagarto exceeded 40%, although this finding should be interpreted with caution. Civil underreporting was lower in the microregions located in the central part of the state: Aracaju, Baixo Cotinguiba and Agreste de Itabaiana (< 15% of live births). As the microregions of maternal residence increase in distance from the central area, civil underreporting of live births increases (Figure 1).
When considering maternal age from the Civil Registry, there was a subtle decreasing trend in underreporting as maternal age increased (Figure 2). The Agreste de Lagarto microregion was not included in the analysis due to the large number of missing birth certificate numbers in the Civil Registry.

DISCUSSION
It is important to discuss the assumptions of capture-recapture estimation. The method assumes a closed population, with no migration, births or deaths during the study period. In this study, the population to be estimated was the total number of live births. A birth happens only once, so the number of live births is fixed for the period and geographic area used.
Clearly, neonatal and/or child deaths can occur during the period analyzed, and families may have moved to another municipality or federative unit after the birth. Nonetheless, these factors do not alter the size of the "population of live births of mothers residing in Sergipe", since a death or a change of address does not change the fact that the baby was born alive and the mother resided at the given location.
With regard to unique marking, failures to fill in the birth certificate number impaired the deterministic linkage utilized. Of the 17,254 live births present in the Civil Registry, 808 (4.7%) had a missing birth certificate number, which raises questions about the extent to which these 808 records could be matched with the SINASC records.
When using the results of the model for the entire state of Sergipe, the estimated capture probability was 0.912 for SINASC and 0.804 for the Civil Registry. The probability of one live birth being included in the two databases would therefore be 0.912 × 0.804 = 0.733. Of the 808 records with a missing birth certificate number in the Civil Registry, 0.733 × 808 ≈ 592 would be expected to appear in both databases.
The assumption that the capture or recapture of one individual does not affect the probabilities for others was not violated, because the birth of a baby does not interfere with the identification of another baby by SINASC or by the Civil Registry. Multiple births, in which the live births may or may not be registered together, do not meet this assumption. Nonetheless, multiple gestations are very rare, and the number of twins born alive does not harm this assumption.
Regarding the independence of the samples, here databases, the large overlap between them (large $n_{A \cap B}$) suggests a positive dependence. This indicates that the number of estimated live births would not be much larger than the number of distinct individuals identified in the two databases. In order to quantify the dependence between two epidemiologic sources (lists), Brenner 1 included a correction factor for the probability of an individual being captured in the two lists. The author simulates situations where both the capture probabilities for each source ($n_A$ and $n_B$) and the correction factors that modify the probability of inclusion in the two sources ($n_{A \cap B}$) vary, creating positive and negative dependence in order to observe the behavior of the under- and overestimation factor. In the case of negative dependence, the investigator concluded that overestimation of the total population size would be more serious when the lists have low coverage of individuals and a small probability of including the individuals. In cases of positive dependence, the author affirms that lists with a high inclusion probability have smaller underestimation factors. When considering this type of dependence, estimates of population size will still be closer to reality than simple aggregation of the sources.1
With regard to the model selected, Tilling & Sterne 16 and Tilling et al 19 applied the Huggins model to epidemiologic data, demonstrating the viability of the model for these types of data. Also, use of the conditional likelihood for the observed individuals allows flexibility to include covariates and model the capture probabilities with fitted linear models. We believe the Huggins model can continue to be applied to estimate total live births through linkage of SINASC and the Civil Registry, as long as future studies resolve the problem encountered with the IBGE database regarding the equal probability of capture during the study period used to identify records.
In conclusion, the results of the present study suggest minimum values for civil underreporting and SINASC coverage, and show that it is possible to apply the capture-recapture methodology to estimate underreporting of live births. In the case of large overlap between two databases, the International Working Group for Disease Monitoring and Forecasting 9 recommends aggregating the sources into a single source. An alternative may be the deterministic linkage of SINASC and the Civil Registry, followed by probabilistic association between this new database and other sources that can be used when applying capture-recapture, such as enrollment in the Family Health Programs and the Hospital Information System.

Table 1. Results of the Huggins models for closed populations of live births, which show some weight, according to microregion of maternal residence. Sergipe state (Northeastern Brazil), second and third trimesters of 2006.

Table 3. Distribution of capture by the two databases [$n_S$, $n_{RC}$, $n_{(S \cap RC)}$ and r] and derived estimates for the model {p(g) c(g + idmae + g*idmae)}, according to the microregion of maternal residence. Sergipe state (Northeastern Brazil), second and third trimesters of 2006.
$n_S$ = total live births captured by SINASC. $n_{RC}$ = total live births captured by the Civil Registry. r = total distinct live births identified (r = $n_S$ + $n_{RC}$ - $n_{(S \cap RC)}$).
a These microregions had more than 5% of Civil Registry records with missing birth certificate numbers, which impaired matching.

Figure 1. Civil underregistration, according to the microregion of maternal residence. Sergipe state (Northeastern Brazil), second and third trimesters of 2006.