Sensitivity of Probabilistic Record Linkage for Reported Birth Identification: Pró-Saúde Study

The objective of the study was to evaluate the sensitivity of probabilistic record linkage for identifying reported births. Data were drawn from the Pró-Saúde Study cohort, which comprised technical-administrative staff at a university in Rio de Janeiro, Brazil, in 1999. A total of 92 subject records were linked to the database of the Brazilian Information System on Live Births (SINASC) using the RecLink II program. Both reduced and amplified strategies of clerical review were used. Sensitivity for birth identification was 60.9% with the reduced strategy and 72.8% with the amplified strategy. The limited number of fields available and the high proportion of homonymous names were major obstacles to more accurate results.


INTRODUCTION
Database linkage has been used to monitor outcomes in cohort studies. It allows data sets from different sources to be joined even when there is no unique identifier field. Fields that are common to the databases (e.g., name, date of birth) are used jointly to estimate the probability that a given pair of records refers to the same individual. 1 The accuracy of this probabilistic technique is affected by the number of fields available for comparison and by how completely they are filled in. When few fields are available and their discriminatory power is low, the likelihood of false-positive pairs increases, i.e., pairs classified as true that actually refer to different individuals. False-negative pairs often originate from failures in either data collection or data entry. 1 The accuracy of probabilistic record linkage is assessed by comparing the results of the linkage process with an independent source of information on the outcomes of interest (gold standard). Since such sources are not easily available, few accuracy studies have been conducted. 2,5,6 The objective of the present study was to assess the sensitivity of probabilistic record linkage in identifying births reported by female subjects in a longitudinal study.
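The way field-level agreement is combined into a single pair score can be sketched as follows. This is a minimal illustration of the standard Fellegi-Sunter weighting, which the text does not spell out; the m and u probabilities, the field names, and the function names are hypothetical and are not parameters from this study.

```python
import math

# m = P(field agrees | the two records belong to the same person)
# u = P(field agrees | the two records belong to different people)
# Hypothetical values for illustration only; not taken from the study.
PARAMS = {
    "mother_name": (0.92, 0.01),
    "mother_birth_year": (0.90, 0.05),
}


def field_weights(m, u):
    """Return (agreement weight, disagreement weight) in log2 units."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_score(agreement, params=PARAMS):
    """Sum the field weights for one candidate pair.

    `agreement` maps each field name to True (values agree) or False.
    """
    score = 0.0
    for field, agrees in agreement.items():
        w_agree, w_disagree = field_weights(*params[field])
        score += w_agree if agrees else w_disagree
    return score


both = pair_score({"mother_name": True, "mother_birth_year": True})
name_only = pair_score({"mother_name": True, "mother_birth_year": False})
print(round(both, 2), round(name_only, 2))
```

Under this formulation, a field with low discriminatory power has u close to m, so its agreement weight is near zero. With only a few such fields, records of different people can reach the same high score as a true pair.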

METHODS
A cross-sectional study was conducted using probabilistic record linkage for identifying births reported by female subjects participating in the Pró-Saúde Study. The Brazilian Information System on Live Births (SINASC) database for the State of Rio de Janeiro was studied. Information on the date of birth of the first child of all subjects was the gold standard.
The Pró-Saúde Study is a longitudinal study including a sample of technical-administrative staff of a university in Rio de Janeiro. 3 For the present study, we selected female subjects from phase 1 of data collection, carried out in 1999 (n=2,238), who reported having their first liveborn child between 1996 and 1998 (n=92). The SINASC database for the State of Rio de Janeiro, obtained from the State Health Department Office of Vital Statistics (1996 to 1998; N=798,478), provided identification information.
Linkage was performed using the RecLink II software program. 1 A three-step blocking approach was applied based on a combination of phonetic codes of the fields "mother's last name" and "mother's first name". The fields used for pairing were "mother's name" and "mother's year of birth" (calculated from the mother's age and the date of birth).
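Blocking on phonetic codes can be sketched as below. The sketch assumes a Soundex-style code and assumes that the three steps block on first and last name together, then on first name only, then on last name only; the source states neither detail, and the records, field names, and function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Standard Soundex consonant groups.
_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Four-character Soundex code; accents are stripped first."""
    plain = unicodedata.normalize("NFKD", name)
    plain = plain.encode("ascii", "ignore").decode("ascii")
    letters = [c for c in plain.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":  # H and W do not separate equal codes
            continue
        code = _CODES.get(c, "")  # vowels map to "" and reset `prev`
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]


def block_key(record, step):
    """Blocking key for a step (assumed layout: 1=both, 2=first, 3=last)."""
    first, last = soundex(record["first"]), soundex(record["last"])
    return {1: (first, last), 2: (first,), 3: (last,)}[step]


def candidate_pairs(cohort, registry, step):
    """Pair each cohort record only with registry records in the same block."""
    index = defaultdict(list)
    for rec in registry:
        index[block_key(rec, step)].append(rec)
    return [(a, b) for a in cohort for b in index.get(block_key(a, step), [])]


# Hypothetical records for illustration only.
cohort = [{"first": "Maria", "last": "Souza"}]
registry = [
    {"first": "Maria", "last": "Sousa"},
    {"first": "Marta", "last": "Souza"},
    {"first": "Ana", "last": "Lima"},
]
for step in (1, 2, 3):
    print(step, len(candidate_pairs(cohort, registry, step)))
```

Spelling variants such as "Souza" and "Sousa" share a code and therefore fall in the same block. Looser keys in later steps recover pairs missed by the first step, at the cost of many more candidate links to score and review.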
All links with scores ≥0 were checked manually in the first step, and only links with scores higher than six were manually reviewed in the subsequent steps (short review). To improve the capture of true pairs, this strategy was expanded to manual review of all links with scores ≥0 in all steps. During manual review, the fields "mother's name," "mother's year of birth," and "district of residence" were checked.
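The two review strategies amount to a simple selection rule, sketched here. The thresholds come from the description above; the function name, the strategy labels, and the example links are hypothetical.

```python
def needs_manual_review(score, step, strategy):
    """Decide whether a link goes to manual review.

    short:    step 1 -> score >= 0; later steps -> score > 6
    expanded: every step -> score >= 0
    """
    if strategy not in ("short", "expanded"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if score < 0:
        return False
    if strategy == "expanded" or step == 1:
        return True
    return score > 6


# Hypothetical (score, blocking step) links for illustration only.
links = [(11.0, 1), (3.2, 1), (6.0, 2), (7.5, 3), (-1.4, 2)]
for strategy in ("short", "expanded"):
    selected = [l for l in links if needs_manual_review(*l, strategy)]
    print(strategy, len(selected))
```

The expanded strategy differs from the short one only in steps 2 and 3, where it also reviews links scoring between 0 and 6.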
Databases were evaluated for completeness of the fields used in the automatic ("name" and "year of birth") and manual ("district of residence") processes. The sensitivity of probabilistic record linkage for identifying records of births reported by the mothers was calculated for both manual review strategies. These estimates were repeated excluding births in 1996.
The study was approved by the Research Ethics Committee of Universidade do Estado do Rio de Janeiro Institute of Social Medicine.

RESULTS
The Pró-Saúde Study database showed 100% completion for subject's name and year of birth. In the SINASC database, completion of the field "name" improved over the years studied: 73.6% of records in 1996, 90.5% in 1997, and 97.5% in 1998. For mother's year of birth, obtained from the field "age," completion rates were higher than 98% in all years studied. The short manual review strategy identified 56 women (60.9% sensitivity; 95% CI: 50.7;70.2) out of the 92 who reported having their first child between 1996 and 1998. The expanded strategy identified an additional 11 women, for a total of 67 (72.8% sensitivity; 95% CI: 63.0;80.9).
Due to inadequate completion of the fields required for linkage in the 1996 SINASC database, a sensitivity analysis was carried out excluding women who had their first child in that year. The sample then totaled 63 women, of whom 44 (69.8% sensitivity; 95% CI: 57.6;79.8) were identified through the short strategy and 55 (87.3% sensitivity; 95% CI: 76.9;93.4) through the expanded one.
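The sensitivity estimates above can be recomputed from the reported counts, taking sensitivity as the number of reported births identified divided by the number of reported births. The text does not state how the confidence intervals were obtained; the sketch below assumes a Wilson score interval, which appears consistent with the reported values to one decimal place. The function name and labels are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (births identified, births reported), as given in the Results.
RESULTS = {
    "short, 1996-1998": (56, 92),
    "expanded, 1996-1998": (67, 92),
    "short, 1997-1998": (44, 63),
    "expanded, 1997-1998": (55, 63),
}

for label, (found, total) in RESULTS.items():
    low, high = wilson_interval(found, total)
    print(f"{label}: {100 * found / total:.1f}% "
          f"(95% CI: {100 * low:.1f};{100 * high:.1f})")
```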

DISCUSSION
The present study showed low sensitivity for the short manual review strategy and moderate sensitivity for the expanded one. These results are less favorable than those reported in another Brazilian study of probabilistic linkage between a primary database (a cohort of elderly patients admitted to hospital due to fracture) and the Mortality Information System (SIM) for death identification, in which 85% sensitivity was found. 2 The high proportion of records with missing information on mother's name in the SINASC database for 1996 can partly explain our results, since excluding that year from the analysis increased sensitivity. However, a sensitivity similar to that reported by Coutinho & Coeli 2 was achieved only after applying the expanded manual review of links.
We noted that when several high-scoring links were created for the same subject, most were false pairs. The limited number of fields available for joining the databases reduced the procedure's discriminatory power. Because childbearing women belong to close birth cohorts, certain first names that were "in fashion" are shared by a high proportion of them. In addition, since a great number of Brazilians share common last names, a high rate of homonymous names with similar year of birth was found. An upper threshold score could not be determined, and false-positive pairs were seen even among links with the highest score (score=11). Thus, thorough manual review of links was required and strict criteria were established for final classification of pairs as true or false, which resulted in low sensitivity for birth identification in SINASC databases.
Linkage of the SINASC and SIM databases to assess mortality in children under one year of age is an innovative application of this tool in Brazil. 4 This approach can draw on a varied set of fields with information on delivery and the newborn that are available in both the SINASC and SIM databases, which facilitates the linkage procedure. However, when the SINASC database is joined with other databases or for other purposes, as in the present study, the procedure is hindered by the limited number of fields and the high proportion of homonymous names.
Although the expanded strategy provided adequate results for birth identification, it is laborious. In the present study, one of the databases (Pró-Saúde Study) had a small number of records (n=92), but most applications involve large databases. For example, joining all childbearing subjects in the Pró-Saúde Study (N=2,449) with the SINASC database for a single year (≅ 270,000 records) would require manual review of an extremely large number of links under the expanded strategy: approximately 9,000 links in the first step and more than 200,000 in the following steps.
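The scale of that workload can be put in perspective with some quick arithmetic on the figures quoted above. The counts are the approximate values given in the text, and the figure for the later steps is only a lower bound ("more than 200,000"), so the results are rough orders of magnitude; the variable names are illustrative.

```python
cohort_subjects = 2_449            # childbearing subjects in the cohort
sinasc_records_one_year = 270_000  # approximate, single year of SINASC
links_step_1 = 9_000               # approximate
links_later_steps = 200_000        # lower bound ("more than 200,000")

# Every possible cohort-registry pair, i.e. linkage without blocking.
all_pairs = cohort_subjects * sinasc_records_one_year

# Links sent to manual review under the expanded strategy (lower bound).
links_to_review = links_step_1 + links_later_steps

print(f"possible pairs: {all_pairs:,}")
print(f"links to review: at least {links_to_review:,}")
print(f"links per subject: at least {links_to_review / cohort_subjects:.0f}")
print(f"share of possible pairs: {links_to_review / all_pairs:.4%}")
```

Blocking and scoring discard the vast majority of the roughly 661 million possible pairs, yet reviewers would still face at least about 85 links per subject.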
In conclusion, the limited number of fields available and the high proportion of homonymous names increased the probability of false-positive links, requiring manual review of a larger number of links and the establishment of strict criteria for final link classification. Our results suggest that the probabilistic linkage procedure, when applied to SINASC databases for purposes other than infant mortality assessment, will have lower sensitivity than expected for joining databases from different sources in Brazil. A prior assessment of database completeness, with exclusion of years with high rates of missing information, can contribute to more accurate results.