Quality of the Cervical Cancer Data System in the State of Rio de Janeiro, Southeastern Brazil

METHODS: Descriptive study of the completeness, validity, and sensitivity of SISCOLO data in the state of Rio de Janeiro, based on the follow-up of a cohort of 2,024 women between 2002 and 2006. Participants lived in communities served by the Estratégia Saúde da Família (Family Health Strategy) in the municipalities of Duque de Caxias and Nova Iguaçu (RJ). The two SISCOLO databases, covering cytopathology tests and confirmatory tests (colposcopy and histopathology), were compared with data from a research reference database and from medical records. A Bland-Altman plot was used to analyze continuous variables. The Reclink software program was used for linkage between databases.


INTRODUCTION
The health information system in Brazil consists of several subsystems with data from sectoral activities. The Sistema de Informação sobre Mortalidade (SIM, Mortality Data System) was the first to be created in Brazil, in 1976. 8 During the 1990s, several data systems were established to provide information for planning and evaluating health services as part of the Sistema Único de Saúde (SUS, Brazilian National Health System).
These systems are a valuable source of information for epidemiological studies, reducing research-related costs and time. Yet the major barrier to their use is inadequate information quality, with a great deal of missing and incorrect data. Access to national health data systems and opportunities to evaluate them have increased in recent years. A recent systematic literature review 11 identified 71 publications on Brazilian database integration, related to 40 studies. Of these, 70% were epidemiological studies and, while most made some reference to data quality, only 15% actually intended to assess it. Of the articles reviewed, 75% referred to SIM and 57.5% referred to the Sistema de Informação sobre Nascidos Vivos (Sinasc, Information System on Live Births). The Reclink program was used in 27.5% of studies for probabilistic linkage between databases.
The Sistema de Informação do Câncer do Colo do Útero (SISCOLO, Cervical Cancer Data System) is the most recently created information system, developed by the Departamento de Informática do SUS (DATASUS, SUS Department of Information Technology) together with the Instituto Nacional do Câncer (INCA, Brazilian National Institute of Cancer). Established in January 2000, SISCOLO is intended to gather identification data on women who are SUS users, their demographic and epidemiological information, and information on the cytopathology and histopathology tests performed. SISCOLO has been continuously developed, enabling care programs at the local and state levels to follow up women with abnormal test results. Besides being a potential source of information for studies, SISCOLO also helps to subsidize SUS coverage for these tests and allows the evaluation of cervical cancer care programs and services. a A search of Biblioteca Virtual em Saúde (BIREME, Virtual Health Library) databases in November 2007 revealed three studies based on SISCOLO, all aimed at assessing the quality of tests at cytopathology laboratories. 5,6,13 The objective of the present study was to evaluate the data completeness, validity, and sensitivity of the SISCOLO database.

METHODS
The study was based on the following sources of data:
The fields analyzed included only demographic data, test results, and women's identification. Databases were checked for inconsistencies, which were then corrected to avoid potential interference in the linkage process. Records of the fields "mother's name" and "district of residence" were left blank when they included words indicating missing information or numbers only. In the field "patient name," the physician's or nurse's name had been mistakenly typed in many records, and these entries were manually deleted.
Reclink software program version 3 was used for linkage between databases to identify records of women in the reference cohort. Reclink is an application that uses a probabilistic record linkage method to estimate the likelihood that a pair of records belongs to the same individual. Record linkage involves identifying fields common to both databases, which are scored based on their probability of matching or differing. This procedure is carried out in three steps: 1,3
a) Standardization: database fields are prepared for linkage to minimize errors. Fields can be subdivided and adjusted to have the same structure. The program also allows the exclusion of prepositions, punctuation signs, accents, and other symbols.
b) Blockage: logic blocks consisting of one or more fields are created to restrict linkage to records that have the same content in the selected fields. Seven blockage strategies were applied sequentially to minimize the loss of true pairs (Figure 1).
c) Pairing: scores are constructed for the pairs obtained with a given blockage strategy, based on specific criteria for the fields selected as identifiers because of their greater discrimination power.
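To make the blockage step concrete, the following Python sketch groups records by a phonetic key so that only records sharing a key are compared. It is an illustration under stated assumptions, not Reclink's implementation: the `soundex` function follows the standard American Soundex rules, whereas Reclink's own phonetic coding may differ, and the record layout, function names, and example names are hypothetical.

```python
from collections import defaultdict

# Standard American Soundex digit groups (assumption: Reclink's variant may differ).
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        _CODES[letter] = digit


def soundex(word):
    """Return a 4-character code: the first letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":               # H and W do not separate equal codes
            continue
        code = _CODES.get(letter, "")    # vowels and unknown letters carry no digit
        if code and code != previous:
            digits.append(code)
        previous = code
    return (letters[0] + "".join(digits) + "000")[:4]


def block_key(record):
    """Hypothetical blocking key: Soundex of the first and of the last name."""
    names = record["name"].split()
    return soundex(names[0]), soundex(names[-1])


def candidate_pairs(reference, target):
    """Yield only the pairs of records whose blocking keys are identical."""
    blocks = defaultdict(list)
    for record in target:
        blocks[block_key(record)].append(record)
    for record in reference:
        for other in blocks.get(block_key(record), []):
            yield record, other


# Fictitious records, for illustration only.
reference = [{"id": 1, "name": "Maria Souza"}]
target = [{"id": "A", "name": "Mariah Sousa"}, {"id": "B", "name": "Ana Lima"}]
pairs = [(r["id"], t["id"]) for r, t in candidate_pairs(reference, target)]
```

A stricter key produces fewer candidate pairs but risks missing true pairs with spelling variants, which is why several blockage strategies are usually applied in sequence.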
For each identifier, the program calculates scores from the probability that two records match on that identifier when they are a true pair (sensitivity, m) and when they are a false pair (false-positive, u). The complementary probabilities of non-matching on an identifier are 1 - m for a true pair (false-negative) and 1 - u for a false pair (specificity). Based on these probabilities, two weighting factors are generated, one for matching and one for non-matching. The weighting factor for matching is the base 2 logarithm of the likelihood ratio m/u, and the weighting factor for non-matching is the base 2 logarithm of the ratio of the remaining probabilities, (1 - m)/(1 - u). The total score of a pair is the sum of the weighting factors attributed after comparing each identifier. The fields selected as identifiers, the criteria used, and the calculated weighting factors are presented in Table 1.
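The weighting scheme just described can be sketched in a few lines of Python. This is a minimal illustration of the base 2 log-likelihood-ratio weights, not Reclink's code; the function names and the m and u values below are hypothetical and are not the parameters reported in Table 1.

```python
import math


def agreement_weight(m, u):
    """Weight added when two records match on an identifier: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when they do not match: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def pair_score(agreements, parameters):
    """Sum the weights over all identifiers for one candidate pair.

    agreements maps identifier name -> True (match) or False (non-match);
    parameters maps identifier name -> (m, u).
    """
    total = 0.0
    for field, (m, u) in parameters.items():
        if agreements[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


# Hypothetical m and u values, chosen only to illustrate the arithmetic.
parameters = {"name": (0.8, 0.1), "mother_name": (0.9, 0.05)}

max_score = pair_score({"name": True, "mother_name": True}, parameters)
min_score = pair_score({"name": False, "mother_name": False}, parameters)
mixed = pair_score({"name": True, "mother_name": False}, parameters)
```

With m greater than u, matching weights are positive and non-matching weights are negative, so each added identifier pushes the maximum score up and the minimum score down.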
The pairing strategies applied in record linkage and the related maximum scores (full matching on all identifiers) and minimum scores (non-matching on all identifiers) are shown in Figure 1. The greater the number of identifiers, the wider the score range. Date of birth was not included among the identifiers in steps 1 to 4 because it was used for blockage.
Steps 1 and 2 were more restrictive, so few pairs were created and these were more likely to be true pairs, which allowed all of them to be checked. In steps 3 and 4, more pairs were generated during linkage and only those with scores greater than -4.0 were checked. In the following steps, there was a dramatic increase in the number of pairs generated and only those with positive scores were checked.
Thorough manual checking was aimed at avoiding the misclassification as true pairs of records that did not belong to the same individual. Classification criteria were as follows: a) same date of birth with identical name and mother's name, or with an abbreviated middle name or one of the middle names missing; b) same name and mother's name with date of birth differing by no more than two digits, or with day and month interchanged; c) similar uncommon name or mother's name with same date of birth or address; d) different name, mother's name or date of birth, or missing information in one database, but all remaining fields containing identical or very similar information, matching on at least three fields.
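The checking in the study was manual, but the date-of-birth part of criterion b can be written down as a small rule. The Python sketch below is only an illustration of that rule; it assumes dates written as day, month, and four-digit year (for example `DD/MM/YYYY`), and the function names and example dates are hypothetical.

```python
def _digits(date_text):
    """Keep only the digits of a date, ignoring separators."""
    return "".join(ch for ch in date_text if ch.isdigit())


def differing_digits(a, b):
    """Count the positions at which two equally long digit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def day_month_swapped(a, b):
    """True when day and month are interchanged and the year is identical."""
    return a[:2] == b[2:4] and a[2:4] == b[:2] and a[4:] == b[4:]


def birth_dates_compatible(date_a, date_b, max_diff=2):
    """Apply the date part of criterion b to two DD/MM/YYYY dates."""
    a, b = _digits(date_a), _digits(date_b)
    if len(a) != 8 or len(b) != 8:
        return False  # incomplete dates are left for human review
    return differing_digits(a, b) <= max_diff or day_month_swapped(a, b)


# Fictitious dates, for illustration only.
one_digit_off = birth_dates_compatible("12/03/1975", "12/03/1976")
swapped = birth_dates_compatible("05/11/1980", "11/05/1980")
unrelated = birth_dates_compatible("01/01/1970", "23/09/1985")
```

A rule like this could only pre-screen pairs; criterion b also requires the same name and mother's name, which the sketch does not check.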
Records paired in one step were not included in the following steps, except for records from the reference database when linked to SISCOLO files, because a woman may have undergone more than one test. Although Reclink version 3 has a feature for identifying duplicates, it was not used.
At each step, paired records were saved as files and then put together as a single file using Microsoft Office Access (2003). Each file was associated with the corresponding original SISCOLO database using Reclink, by joining fields related to test results.
To assess SISCOLO data quality, indicators of completeness, validity, and sensitivity were calculated as proposed by the US Centers for Disease Control and Prevention (CDC). 3 Field completeness was assessed as the proportion of records with no missing information in a given field. Based on criteria described by Mello Jorge et al, 7 this indicator was considered excellent when completeness was higher than 90%; good when between 70.1% and 90%; and poor when equal to or lower than 70%.
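As an illustration of the completeness indicator and the cutoffs above, a minimal Python sketch follows. The function names and the example values are hypothetical; treating empty strings and `None` as missing is an assumption about how blank fields are stored.

```python
def completeness(values):
    """Percentage of records in which the field is filled in."""
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return 100.0 * filled / len(values)


def classify_completeness(percent):
    """Apply the cutoffs: above 90% excellent, above 70% up to 90% good, else poor."""
    if percent > 90:
        return "excellent"
    if percent > 70:
        return "good"
    return "poor"


# Fictitious field values, for illustration only.
zip_codes = ["20000-000", "", None, "25000-000"]
zip_completeness = completeness(zip_codes)
zip_rating = classify_completeness(zip_completeness)
```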
The validity of SISCOLO fields was assessed based on sensitivity, where the gold standard was data from medical records (including tests) or from the cross-sectional study (demographic and identification information). In addition, a Bland-Altman plot was constructed to analyze the field "date of collection", as this is the most appropriate method for assessing the validity of continuous variables, as proposed by Szklo & Nieto. 12
Sensitivity was estimated as the proportion of tests recorded in the medical records of women in the reference cohort that were identified in SISCOLO. This indicator was interpreted based on the criterion proposed by Piper et al: 10 high when greater than 90%; moderate when between 70% and 90%; and low when below 70%.
The corresponding 95% confidence intervals were also estimated for the indicators of field validity and sensitivity.
The study was approved by the Research Ethics Committee of the National Cancer Institute (Protocol No. 074/06).

RESULTS
Completeness of the SISCOLO databases containing cytopathology and confirmatory tests (colposcopy and histopathology) was found to be excellent for the fields "mother's name" (98.4% and 98.2%, respectively) and "place of residence" (98.0% and 98.3%, respectively), and good for "district of residence" (84.5% and 89.8%, respectively). Completeness of the fields "zip code" and "individual taxpayer number" was poor in both databases (Table 2). Since completion of the field "date of birth" is not required when age is reported, the system assigns a year of birth based on age plus "01/01" for day and month. This set-up was seen in 3.5% of the records analyzed. The field "age" is estimated by the system based on the date of birth. Records with age below ten years or above 89 years accounted for 0.2%. The other fields with demographic and identification information and those related to test data are all required fields, and thus there were no records with missing information.
A total of 19,801 pairs were checked in the record linkage between the reference database and the SISCOLO database with cytopathology tests, and 556 pairs in the linkage between the reference database and the database with confirmatory tests; 10.6% and 0.9%, respectively, were classified as true pairs. Most true pairs were identified in step 1, accounting for 64.4% for cytopathology tests and 60.0% for confirmatory tests. In step 2, a discrepancy in the field "city code" may reflect a completion error, but it may also indicate migration or misinformation reported by women. In this step, the proportion of pairs created was 7.6% and 20.0%, respectively (Table 3).
Because steps 1 and 2 were more restrictive, pairs with very low scores were still classified as true pairs when they showed abbreviation or omission of the woman's or her mother's middle name, or omission of the mother's name.
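In the following less restrictive steps, pairs with similar characteristics, if any, could not be identified, as their score was below the cutoff or they did not show other [...]
Of 2,103 cytopathology tests identified in SISCOLO, six were duplicated; 31 were performed after diagnostic confirmation of HSIL+, and thus excluded from the analysis; and 831 tests were performed at other health units and not recorded in the medical records of the units studied. Of the remaining 1,235 tests, 862 (69.8%) were recorded in the medical records, including 197 tests performed at other health units. Of 373 tests identified only in SISCOLO, 157 tests (42.1%) were of women whose medical records were not found at the health units in the study. The other tests were reassessed and their classification was confirmed. Additionally, the medical records showed 251 tests not identified in SISCOLO.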
The system sensitivity to identify cytopathology tests was 77.4% (95% CI: 75.0;80.0). Of 2,317 cytopathology tests found in the data sources searched, 89.2% were identified in SISCOLO, 48.0% in the medical records and 37.2% in both sources.
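These figures can be retraced from the test counts reported in this article: 862 tests found in both sources, 251 found only in medical records, and 2,103 SISCOLO tests minus six duplicates and 31 exclusions. The Python sketch below redoes the arithmetic. The confidence interval uses the normal (Wald) approximation, which is an assumption, since the article does not state which interval method was used; the function and variable names are hypothetical.

```python
import math


def proportion_with_ci(successes, total, z=1.96):
    """Proportion with a Wald (normal approximation) confidence interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts as reported in the article.
in_both_sources = 862
only_in_medical_records = 251
in_siscolo = 2103 - 6 - 31  # after removing duplicates and post-HSIL+ tests

in_medical_records = in_both_sources + only_in_medical_records
all_tests = in_siscolo + only_in_medical_records

# Sensitivity: tests in medical records that were also identified in SISCOLO.
sensitivity, ci_low, ci_high = proportion_with_ci(in_both_sources, in_medical_records)

share_in_siscolo = in_siscolo / all_tests
share_in_medical_records = in_medical_records / all_tests
share_in_both = in_both_sources / all_tests
```

Under these assumptions the sensitivity rounds to 77.4% and the three shares to 89.2%, 48.0%, and 37.2%, in agreement with the text.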
At the three reference units for diagnostic confirmation, no records of colposcopy or histopathology tests of women in the reference cohort were found.
In INCA electronic medical records, 172 patients were identified, of whom 80 had records of colposcopy and histopathology tests, corresponding to 173 tests performed during the period studied. Of these 80 patients, only five were identified in SISCOLO, and two underwent tests that were included in the INCA medical records, while all other patients underwent only colposcopy tests at health units not included in the present study. The sensitivity of this database was very low (4.0%, 95% CI: 0.0;21.3).
Of the women in the reference cohort, 1,251 (61.8%) had cytopathology, colposcopy or histopathology tests identified in the sources of data studied.
As for validity, sensitivity of the field "date of collection" was 100% in the database with confirmatory tests and 70.3% in the database with cytopathology tests. Sensitivity of the field "test results" was 100% in both databases.
The time interval between the date of collection recorded in the database with cytopathology tests and that recorded in medical records was up to 30 days in 93.5% of cases. Time intervals greater than 60 days were seen especially in the second quarters of 2004 and 2006, corresponding to the periods when SISCOLO was updated (Figure 2).
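The quantities behind a Bland-Altman comparison of two date fields can be sketched as follows. This Python example computes the mean difference (bias), the conventional limits of agreement (bias plus or minus 1.96 standard deviations), and the share of records within a given number of days; plotting is omitted. The dates are fictitious and the function names are hypothetical, so the output says nothing about the study data.

```python
from datetime import date
from statistics import mean, stdev


def bland_altman_days(pairs):
    """Return bias and limits of agreement for differences in days.

    pairs is a list of (date_in_system, date_in_medical_record) tuples.
    """
    differences = [(a - b).days for a, b in pairs]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)
    return bias, bias - spread, bias + spread


def share_within(pairs, max_days=30):
    """Fraction of pairs whose dates differ by at most max_days."""
    close = sum(1 for a, b in pairs if abs((a - b).days) <= max_days)
    return close / len(pairs)


# Fictitious collection dates, for illustration only.
pairs = [
    (date(2004, 5, 10), date(2004, 5, 10)),
    (date(2004, 6, 3), date(2004, 6, 1)),
    (date(2005, 2, 14), date(2005, 2, 15)),
    (date(2006, 6, 20), date(2006, 4, 16)),
]
bias, lower_limit, upper_limit = bland_altman_days(pairs)
within_30_days = share_within(pairs)
```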

DISCUSSION
In the present study, it was found that SISCOLO contained information on 89.2% of the 2,317 cytopathology tests performed in women in the reference cohort. Reclink was essential, as no single identifier was available for health information recording. Linkage between records is expected to become more accurate and less complicated with the implementation of the SUS card, currently under way in Brazil.
Abbreviations used in the linkage steps: PBLOCK, Soundex code of the first name; UBLOCK, Soundex code of the last name; Soundex code, phonetic 4-character code consisting of one letter (the word's first letter) and three numbers, in which vowels are coded 0 and consonants with similar sounds have the same code (Reclink standardization strategy).
No major difficulties were encountered in applying Reclink; however, manual selection of records was an extremely painstaking and time-consuming task, especially for the less restrictive strategies that were required because of missing data and field completion or entry errors. Despite that, more than 70% of pairs were identified in steps 1 and 2, which were the least time-consuming.
As for the quality of SISCOLO data, completeness, as well as validity of test results, was found to be excellent for most fields analyzed.
The system sensitivity for the database with cytopathology tests was moderate (77.4%). However, the actual sensitivity is likely to be higher than that found, since only information available in medical records was analyzed. Moreover, SISCOLO sensitivity may differ in other population groups or other Brazilian regions, as seen for SIM and Sinasc. Despite being mandatory and having been in operation longer, these databases are still affected by regional differences in coverage. 8,9
Sensitivity of the database with confirmatory tests (colposcopy and histopathology) was found to be very low (4.0%), and this database is not yet a valuable tool since most data were not entered into the system. A possible explanation is that these tests are often performed together with other procedures during hospital admissions and are covered only through hospital admission authorization. There is a need to make this information mandatory in SISCOLO to allow tracking of cases requiring follow-up and treatment. The follow-up module of SISCOLO has undergone improvements, which are expected to make it possible to generate reports of women with abnormal cytopathology tests requiring follow-up and to allow state and local health managers to feed back into the system information obtained during active search at the local level.
Information on cervical cancer and precursor lesion screening tests is not readily available at the health units where test specimens are collected, which to some extent makes investigations more difficult. In this sense, SISCOLO is a promising tool for epidemiological studies, as it could help significantly reduce operational costs and time. It can also be used as an additional instrument to minimize loss to follow-up in cohort studies.
A limitation of SISCOLO is that the data available are restricted to SUS users and do not include women undergoing tests in complementary health services. The Brazilian Ministry of Health household survey, a conducted in 16 cities during 2002 and 2003, showed that the proportion of cytopathology tests performed in SUS ranged from 32.0% in Rio de Janeiro (southeastern) to 54.0% in Aracaju (northeastern). However, because the women in the present study cohort were enrolled in the ESF, they are likely to use SUS services more than the population studied in the survey. Also, as histopathology tests are more costly than cytopathology tests, women will likely turn to SUS services to get them. Unfortunately, the actual number of specialized diagnostic tests performed in Brazil is not known because of the deficient information network in reference centers.
In conclusion, the quality of SISCOLO data in the cohort of women studied was good. SISCOLO is an essential instrument for planning and monitoring cervical cancer screening actions. SUS health services at different levels of reference should be encouraged to make better use of SISCOLO, and the dissemination of evaluation results can contribute to improving this information system. Further studies are needed to advance SISCOLO development, especially those including a representative sample of laboratories or health units that can help identify sources of errors and missing information.

a Brazilian Ministry of Health. Department of Health Care. National Cancer Institute. Viva mulher. Câncer do colo do útero: informações técnico-gerenciais e ações desenvolvidas. Rio de Janeiro; 2002.
b Brazilian Ministry of Health. Department of Health Care. National Cancer Institute. Center for Prevention and Surveillance. Nomenclatura brasileira para laudos cervicais e condutas preconizadas: recomendações para profissionais de saúde. Rio de Janeiro; 2006.

Figure 2. Bland-Altman plot of the time interval between the date of collection recorded in the database with cytopathology tests in the Cervical Cancer Data System (SISCOLO) and that in the medical records analyzed. State of Rio de Janeiro, Southeastern Brazil, 2002-2006.

Table 3. Pairs created, pairs classified as true and score ranges for each step of probabilistic record linkage between Cervical Cancer Data System (SISCOLO) databases and a reference cohort. State of Rio de Janeiro, Southeastern Brazil, 2002-2006.

Table 1. Parameters used and results of weighting factors by identifier field. State of Rio de Janeiro, Southeastern Brazil, 2002-2006.
Compares sequences of digits, ignoring separators, for pairs of digits in the same position; indicated for fields with complete dates.

Table 2. Completeness of fields in the Cervical Cancer Data System (SISCOLO) databases. State of Rio de Janeiro, Southeastern Brazil, 2002-2006.

Table 4. Sensitivity of fields in the databases with cytopathology and confirmatory tests of the Cervical Cancer Data System (SISCOLO). State of Rio de Janeiro, Southeastern Brazil, 2002-2006.