Recovering records on cancer of the larynx from anonymous health information databases

REV BRAS EPIDEMIOL 2021; 24: E210011 ABSTRACT: Objective: To develop a linkage algorithm to match anonymous death records of cancer of the larynx (ICD-10 C32X), retrieved from the Mortality Information System (SIM) and the Hospital Information System of the Brazilian Unified National Health System (SIH-SUS) in Brazil. Methodology: Death records containing ICD-10 C32X codes were retrieved from SIM and SIH-SUS, limited to individuals aged 30 years and over, between 2002 and 2012, in the state of São Paulo. The databases were linked using a unique key identifier developed with sociodemographic data shared by both systems. Linkage performance was ascertained by applying the same procedure to similar non-anonymous databases. True pairs were those having the same identification variables. Results: A total of 14,311 eligible death records were found. Most records, 10,674 (74.6%), were exclusive to SIM. Only 1,853 (12.9%) deaths were registered in both systems, representing true pairs. A total of 1,784 (12.5%) cases of laryngeal cancer in the SIH-SUS database were tracked in SIM with different causes of death. The linkage failed to match 167 (9.4%) records due to inconsistencies in the key identifier. Conclusion: The authors found that linking anonymous data from mortality and hospital records is a feasible measure to track missing records and may improve cancer statistics.


INTRODUCTION
Cancer of the larynx (CL) accounts for 1% of all new cases and deaths due to cancer worldwide 1 . In 2018, 177,422 new cases and 94,771 deaths caused by CL were reported. In Brazil, this is the 8th most common type of cancer among men, with 6,390 new cases in 2018 2 . Although smoking and alcohol consumption are the most important risk factors 3 , work-related carcinogens, such as polycyclic aromatic hydrocarbons 4 , inorganic acids 5 , or asbestos 6 , have been linked to an excess of CL in exposed groups. Increased mortality from CL has been reported among miners, tailors, blacksmiths, toolmakers and painters 7 .
In high income countries, work-related cancers are commonly reported occupational diseases, but they remain underreported, particularly in low and middle income countries 8 . An important step toward obtaining a more complete count of records is to recover registered cases from distinct sources. Considering that larynx malignancies usually require hospital treatment, administrative information systems could be taken as an additional data source to improve case assessment.
In Brazil, in the last 50 years, asbestos has been extensively used 9 but little is known about its impact on CL. Mapping health information systems (HIS) to recover all registered cases of asbestos-related diseases is one of the aims of the "Asbestos and Health Effects - Brazil" Project 9 . However, privacy and professional or research ethics limit access to non-anonymous health databases 10 . Data linkage may be used to improve the reporting of a given disease by capturing cases and/or correcting the completeness of a database or a surveillance system 11 . Linking anonymous databases is feasible using computer-based routines based on common shared data. In a previous publication, records on deaths from mesothelioma and cancer of the pleura found in the Mortality Information System (SIM) and in anonymous databases of the Hospital Information System of the Brazilian Unified National Health System (SIH-SUS) were successfully combined using a unique key algorithm based on date of birth, sex, municipality of death occurrence, and date of death 12 , disclosing 32.2% additional records from SIH-SUS with a small linkage failure (1.7%).
Based on death certificates, SIM follows the recommendations of the World Health Organization for underlying and contributing causes of death. SIH-SUS is used for administrative purposes and does not cover hospital admissions in the private system, but it registers all comorbidities requiring specific clinical protocols, with up to nine distinct diagnoses. In addition, considering the conditional relative survival of less than 95% of CL cases for 25 years or more 13 , patients may die of other causes, contributing to underreporting. Therefore, SIH-SUS can be used as an alternative source for capturing non-registered cases in SIM.
To date, no studies on underreporting of CL or on recovering CL in death records from multiple databases were found. In contrast with a rare tumor, such as mesothelioma, for which anonymous linkage proved feasible, CL is a more prevalent cancer. As part of an effort to track cases of typical or associated asbestos-related diseases to obtain more accurate estimates of their burden in Brazil, this study aims to assess the feasibility and performance of anonymous linkage of CL records from two health information systems: SIM and SIH-SUS.

METHODS
All death records having an ICD-10 code C32X (cancer of the larynx, any subsite) of adults aged 30 years or older were investigated in the period from January 1, 2002 to December 31, 2012 in the state of São Paulo, Brazil.

DATA SOURCES
Death records were retrieved from SIM, a universal vital statistics database, and from SIH-SUS, an administrative hospital information system of the Brazilian Unified National Health System, which only covers state-owned or publicly funded hospitals. Both anonymous databases are freely available.
To assess the linkage performance, corresponding non-anonymous SIM and SIH-SUS datasets were obtained. Each database has multiple ICD codes: SIM records the underlying cause of death and multiple contributing causes; SIH-SUS has ICD codes for one principal and a maximum of eight secondary diagnoses, including the death-related cause when applicable. CL deaths were defined as records having at least one assigned C32X code, of any coded subtype. When the same individual had multiple C32X codes, the most specific one was used in the analysis.
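The case definition above can be sketched in a few lines of Python. This is only an illustration: the source does not state how "most specific" was decided, so the ranking used here (a named subsite such as C32.0 to C32.3 before overlapping C32.8, before unspecified C32.9 or a bare C32) is an assumption, and the function names `c32_codes` and `most_specific` are hypothetical.

```python
import re

# Matches ICD-10 C32 codes written as "C32", "C320", "C32.0", "C32X", etc.
C32_PATTERN = re.compile(r"^C32\.?([0-9X]?)$")

def c32_codes(diagnoses):
    """Return the normalized C32 codes found among a record's diagnosis fields."""
    found = []
    for raw in diagnoses:
        if not raw:
            continue  # skip empty diagnosis fields
        match = C32_PATTERN.match(raw.strip().upper())
        if match:
            found.append("C32." + (match.group(1) or "X"))
    return found

def most_specific(codes):
    """Pick one code per individual.

    ASSUMPTION (not from the source): named subsites (C32.0-C32.3) outrank
    overlapping lesions (C32.8), which outrank unspecified codes.
    """
    def rank(code):
        subsite = code[-1]
        if subsite in "0123":
            return 0
        if subsite == "8":
            return 1
        return 2
    return min(codes, key=rank) if codes else None

# Hypothetical record with four diagnosis fields, one of them empty.
record = ["J18.9", "C329", "C32.0", None]
codes = c32_codes(record)
selected = most_specific(codes)
print(codes, selected)
```

A record qualifies as a CL death whenever `c32_codes` returns a non-empty list, regardless of which diagnosis field carried the code.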

LINKAGE PROCEDURES
SIM and SIH-SUS databases were checked for duplicates, i.e., cases in which more than one record shared the same unique key identifier and the same hospital admission form (in Portuguese, Autorização de Internação Hospitalar, AIH) or death certificate (DC). Duplicates were eliminated from both the anonymous and non-anonymous versions prior to linkage. Only SIH-SUS presented duplicates; these records were manually verified, and the record with the most remaining columns filled was kept. After the linkage was performed with anonymous records using the unique key identifier, matched cases were checked for correctness against the corresponding non-anonymous databases, which enabled the authors to verify the full names and mothers' names of the patients. Records with missing data in variables required for linkage or for its performance assessment were excluded. Records with missing names in the non-anonymous database were also excluded. Similar to the linkage strategy formerly used 12 , a unique identification key variable corresponding to a sequence of coded data (sex, municipality of death occurrence, date of birth, and date of death) was created. The key was then used to merge both databases, allowing for the identification of three groups: paired records (the same case was recorded in both databases and the algorithm successfully matched the records); unpaired cases only recorded in SIM; and unpaired cases only recorded in SIH-SUS.
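As a rough illustration of the key-based merge described above, the following Python sketch builds a key from the four shared variables and sorts records into the three groups. It is not the authors' implementation: the field names, the `|` separator, the date format, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict

KEY_FIELDS = ("sex", "municipality", "birth_date", "death_date")

def build_key(record):
    """Concatenate the four shared fields into one key string.

    Returns None when any field is missing, mirroring the rule that records
    lacking linkage variables are excluded.
    """
    values = [record.get(field) for field in KEY_FIELDS]
    if any(value in (None, "") for value in values):
        return None
    return "|".join(str(value) for value in values)

def link(sim_records, sih_records):
    """Classify records as paired, SIM-only, or SIH-SUS-only by exact key match."""
    sim_by_key = defaultdict(list)
    for rec in sim_records:
        key = build_key(rec)
        if key is not None:
            sim_by_key[key].append(rec)

    paired, sih_only, matched_keys = [], [], set()
    for rec in sih_records:
        key = build_key(rec)
        if key is None:
            continue  # excluded: key cannot be built
        if key in sim_by_key:
            paired.append((sim_by_key[key][0], rec))
            matched_keys.add(key)
        else:
            sih_only.append(rec)

    sim_only = [rec for key, recs in sim_by_key.items()
                if key not in matched_keys for rec in recs]
    return paired, sim_only, sih_only

# Hypothetical records; identifiers, codes, and dates are made up.
sim = [
    {"id": "DC1", "sex": "M", "municipality": "355030",
     "birth_date": "1950-03-14", "death_date": "2010-07-02"},
    {"id": "DC2", "sex": "F", "municipality": "350950",
     "birth_date": "1944-11-30", "death_date": "2008-01-19"},
]
sih = [
    {"id": "AIH1", "sex": "M", "municipality": "355030",
     "birth_date": "1950-03-14", "death_date": "2010-07-02"},
    # One-day typing error in the birth date prevents an exact match.
    {"id": "AIH2", "sex": "M", "municipality": "355030",
     "birth_date": "1950-03-15", "death_date": "2010-07-02"},
]
paired, sim_only, sih_only = link(sim, sih)
```

The second hospital record shows why exact-key linkage is sensitive to data entry: a single wrong digit in any of the four variables leaves a true pair unmatched.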
To assess linkage accuracy, the same procedures were applied to the corresponding non-anonymous SIM and SIH-SUS databases. Both the deceased's and the mother's names were checked for similarity in each matched pair. For a final cross-check, unmatched cases from SIH-SUS were searched in the complete non-anonymous SIM database to find their pair, regardless of diagnosis.
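The source does not say how name similarity was judged, so the sketch below is only one plausible way to automate such a check. It uses `difflib.SequenceMatcher` from the Python standard library; the normalization steps, the 0.85 threshold, and the field names `name` and `mother` are assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    """Uppercase, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed
                              if not unicodedata.combining(ch))
    return " ".join(without_accents.upper().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_true_pair(sim_rec, sih_rec, threshold=0.85):
    """Treat a matched pair as true when both names are similar enough.

    The threshold is an arbitrary illustrative value, not one from the study.
    """
    return (similarity(sim_rec["name"], sih_rec["name"]) >= threshold
            and similarity(sim_rec["mother"], sih_rec["mother"]) >= threshold)

# Hypothetical names, invented for the example.
death_cert = {"name": "João  Souza", "mother": "Ana Souza"}
admission = {"name": "JOAO SOUZA", "mother": "ANA SOUZA"}
other = {"name": "Pedro Santos", "mother": "Maria Oliveira"}
```

In practice, pairs falling near the threshold would still need manual review, as was done for duplicates in this study.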

DATA MANAGEMENT AND ANALYSIS
A relational database management system (RDBMS), MS SQL Server, was chosen to write each step of the linkage algorithm and to repeat the process when necessary. Features such as data transformation tools and index creation were used to optimize the linkage procedures. Excel and SAS 9.4 (SAS Institute Inc., Cary, NC, USA) were used for quantitative analyses.
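To give a sense of how an indexed join on the key might look inside a relational database, here is a small sketch using SQLite through Python's `sqlite3` module as a stand-in for MS SQL Server. The table names, column names, and sample rows are hypothetical, and the actual T-SQL used in the study is not given in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sim (dc_id TEXT PRIMARY KEY, link_key TEXT NOT NULL);
    CREATE TABLE sih (aih_id TEXT PRIMARY KEY, link_key TEXT NOT NULL);
    -- Indexes on the key column speed up the equality join on large tables.
    CREATE INDEX idx_sim_key ON sim (link_key);
    CREATE INDEX idx_sih_key ON sih (link_key);
""")
# Made-up rows: the key is sex|municipality|birth date|death date.
conn.executemany("INSERT INTO sim VALUES (?, ?)", [
    ("DC1", "M|355030|1950-03-14|2010-07-02"),
    ("DC2", "F|350950|1944-11-30|2008-01-19"),
])
conn.executemany("INSERT INTO sih VALUES (?, ?)", [
    ("AIH1", "M|355030|1950-03-14|2010-07-02"),
    ("AIH2", "M|355030|1950-03-15|2010-07-02"),
])

# Paired records: the key exists in both tables.
paired = conn.execute("""
    SELECT sim.dc_id, sih.aih_id
    FROM sim JOIN sih ON sim.link_key = sih.link_key
    ORDER BY sim.dc_id
""").fetchall()

# Unpaired records: no partner found on the other side of a LEFT JOIN.
sim_only = conn.execute("""
    SELECT sim.dc_id FROM sim
    LEFT JOIN sih ON sim.link_key = sih.link_key
    WHERE sih.aih_id IS NULL
""").fetchall()
sih_only = conn.execute("""
    SELECT sih.aih_id FROM sih
    LEFT JOIN sim ON sim.link_key = sih.link_key
    WHERE sim.dc_id IS NULL
""").fetchall()
conn.close()
```

Keeping each step as a stored query makes the process repeatable, which is the property the authors cite as a reason for using an RDBMS.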
The study protocol was registered at the Brazilian National Research Ethics Committee (CONEP) and approved by the Ethics Committee through CAAE 36547514900005030, reports No. 962145 and 1761856.

STUDY POPULATION
In the state of São Paulo, from 2002 to 2012, there were 12,530 CL death records in the SIM anonymous database. In the SIH-SUS database, a total of 4,020 records were found. Records with missing data that prevented the unique key variable creation were excluded, specifically three from SIM, totaling 12,527 death certificates, and 383 from SIH-SUS, resulting in 3,637 hospital records for analysis (Figure 1).
Results from the linkage performance are summarized in Table 1. All paired cases were correctly matched by the linkage strategy. Of the unpaired SIH-SUS records (n = 1,784), 167 (9.4%) had typing errors in the variables used to compose the unique key, which precluded matching. The remaining unpaired records were tracked in the complete SIM database, which contains CL and non-CL cases. Fifty-seven (3.2%) hospital C32X death records were not found in SIM and the remaining 1,560 could be identified with a non-CL ICD code as the underlying cause of death.
Table 2 shows the specific four-digit codes of C32X according to paired status and HIS. Cancer of the larynx, unspecified (C32.9) was the most common regardless of pairing status or HIS, followed by overlapping lesions of larynx (C32.8), which prevails in SIH-SUS among paired (n = 499; 26.9%) or unpaired groups (n = 245; 13.7%).
The distribution of sex and age groups according to pairing status (Table 3) shows that most CL records occurred among men: 90% in the paired group and 89.2% among records found only in SIM. The age distribution of men's and women's C32X cases differs, with men's deaths occurring earlier in life compared to women's.
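The reported counts can be reconciled with simple arithmetic. The short Python check below uses only figures stated in the text; the variable names and the `pct` helper are introduced here for illustration.

```python
# Counts reported in the text.
sim_total, sim_excluded = 12_530, 3
sih_total, sih_excluded = 4_020, 383
paired = 1_853

sim_eligible = sim_total - sim_excluded   # SIM records kept for linkage
sih_eligible = sih_total - sih_excluded   # SIH-SUS records kept for linkage
sim_only = sim_eligible - paired          # registered only in SIM
sih_only = sih_eligible - paired          # registered only in SIH-SUS
all_records = sim_only + paired + sih_only

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Breakdown of the unpaired SIH-SUS records.
key_errors, not_in_sim = 167, 57
other_cause = sih_only - key_errors - not_in_sim

print(sim_eligible, sih_eligible, sim_only, sih_only, all_records)
print(pct(sim_only, all_records), pct(paired, all_records), pct(sih_only, all_records))
print(pct(key_errors, sih_only), pct(not_in_sim, sih_only), other_cause)
```

The three groups sum to the 14,311 eligible records, and the percentages of the unpaired SIH-SUS breakdown use 1,784 as the denominator.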

DISCUSSION
The findings of this study support the feasibility of linking anonymous databases based on death records containing an annotated CL code. All matched records were confirmed as true pairs after checking the deceased's and their mothers' names. Most death records only registered in SIH-SUS could be tracked in SIM, in which the underlying and contributing causes of death had been registered with another ICD-10 code. Misclassification of death records in SIH-SUS represented 3.2% of unpaired cases in this database. The use of hospital data enabled the recovery of 12.5% of CL records unreported in death certificates from SIM. Therefore, hospital records can disclose a significant number of deaths displaying a CL coding, which is a quite common disease. The most commonly recorded ICD-10 codes were cancer of the larynx, unspecified (ICD-10 C32.9) and overlapping lesions of larynx (ICD-10 C32.8). Cases among men prevailed, regardless of HIS or pairing status. Deaths in men occurred at an earlier age compared to women.
Overall, the linkage strategy performed well, with 9.4% failure in matching true pairs. All failures were caused by inconsistencies in sociodemographic data. Not all cancer of the larynx deaths registered in SIH-SUS are true death cases reported in SIM, even considering any other diagnoses.
The feasibility of anonymous database linkage for CL is compelling, considering the simple strategy required and the availability of public health information systems, whether of vital statistics, surveillance or administrative data. This strategy has been widely recognized as a useful tool to generate the knowledge necessary to outline public health policies and programs for disease prevention and health promotion 14 . It can be of particular importance for epidemiological surveillance concerning Workers' Health, considering the existence of large demographic databases, especially of labor forces and employment. Linkage is also crucial to develop complex study designs and long-term follow-ups of large populations using secondary data. Similar procedures have been successfully used for deaths caused by mesothelioma and cancer of the pleura, both considered rare diseases 12 . The present results show that using combined sociodemographic data as a unique key identifier can accurately match records of a quite common malignant neoplasm, cancer of the larynx. Inconsistencies in demographic data were the only reason for linkage failure. This highlights the importance of ensuring the quality of simple data, because they can be used for purposes other than sociodemographic description. Poor-quality records may not only bias the estimates, but also compromise data management tasks such as the linkage itself. The potential use of multiple data sources in epidemiologic research or surveillance could be a strong reason to develop programs focused on data quality improvement in HIS. Most HIS are based on electronic forms, and computational solutions may be introduced to block inconsistent data entry.
Inconsistencies were also flagged in death records from SIH-SUS, as a small proportion of reported cases could not be found in SIM. At present, SIM coverage is evaluated as "good" or "poor," being considered good in Brazil, particularly in the South and Southeast regions of the country 15 , where access to and the quality of healthcare and other basic social services are better compared with other Brazilian regions.
The proportion of 12.9% matched pairs for CL from SIM and SIH-SUS was higher compared with 5.7% obtained from the same linkage procedure for mesothelioma and cancer of the pleura 12 . This was expected, as CL is more common, more easily recognized, and has a longer survival time. Consequently, patients have an increased chance of dying from competing causes.
The SIH-SUS database only covers hospitalizations in publicly funded hospitals. In the Southeast region, where the state of São Paulo is located, the public system accounted for 63.7% of these events 16 . In 2012, 42.6% of the population of São Paulo had private health insurance. The SIH-SUS database is fed by hospital admission forms (AIH). These forms are administrative documents whose main function is the reimbursement of hospital expenses that, in turn, are tied to the requested procedures 17 . Employees receive no specific training on entering ICD codes in the hospital admission forms, unlike what happens in SIM 18,19 . In the 2008-2010 period, 92.1% of hospitalizations in the public system were notified, which demonstrates good coverage 20 . However, the nosological information contained in the AIH reflects the reason for hospitalization, and may omit other diseases or comorbidities of importance. In contrast to the rules for filling out death certificates, the field of secondary diagnoses is not used in most SIH-SUS records, which results in loss of information 20 .
The proposed unique key solely based on sociodemographic data proved to be effective for linking rare diagnoses 12 and for cancer of the larynx. The slight increase in the failure rate can be considered negligible given the proportion of new records that were recovered. It paves the way for the enhanced use of multiple health information systems to capture unreported cases of both rare and common diseases. However, the performance of the proposed linkage still needs to be tested for cancers with higher incidence, for which more records will share the same sociodemographic data.
Despite being a successful method for recovering records of interest, the validity of the ICD codes should be pursued by checking records, whenever possible, using other data sources, such as cancer registries, pathology reports, and/or clinical notes, in order to strengthen the linkage procedure and allow estimates of the yield of true disease cases. Confirmation rates were slightly higher than detection rates for cancer of the larynx in an accuracy study of cancer mortality statistics comparing the underlying cause of death with population-based cancer registries in three US states, suggesting a tendency for underreporting 21 .
Therefore, other data sources in addition to SIH-SUS can be used for capturing non-registered cases in SIM.
In conclusion, SIH-SUS is valuable to identify records of diseases of interest unreported in SIM. CL is often a comorbidity when death is the endpoint; therefore, it is precluded from being recorded as the underlying or even contributing cause of death 22 . The use of SIH-SUS for this purpose is still limited due to its partial coverage and data inconsistencies as far as sociodemographic variables are concerned. Efforts to improve the quality of data and to standardize the SIH-SUS database can boost its use in epidemiological and demographic studies and may be implemented as effective tools for management purposes and health research in the future.