SciELO - Scientific Electronic Library Online



Revista Brasileira de Epidemiologia

Print version ISSN 1415-790X; On-line version ISSN 1980-5497

Rev. bras. epidemiol. vol.18 no.4 São Paulo Oct./Dec. 2015

Original Articles

Automatic coding and selection of causes of death: an adaptation of Iris software for using in Brazil

Renata Cristófani Martins (I)

Cassia Maria Buchalla (II)

(I) Graduate Program in Public Health, Universidade de São Paulo - São Paulo (SP), Brazil.

(II) Department of Epidemiology, School of Public Health, Universidade de São Paulo - São Paulo (SP), Brazil.



Objective: To prepare a dictionary in Portuguese for use in Iris and to evaluate its completeness for coding causes of death.


Methods: Initially, a dictionary with all illnesses and injuries was created based on the International Classification of Diseases - tenth revision (ICD-10) codes. This dictionary was built from two sources: the electronic file of ICD-10 volume 1 and the data from the Thesaurus of the International Classification of Primary Care (ICPC-2). Then, a sample of death certificates from the Program of Improvement of Mortality Information in São Paulo (PRO-AIM) was coded manually and by Iris version V4.0.34, and the causes of death were compared. Whenever Iris was not able to code the causes of death, adjustments were made to the dictionary.


Results: Iris was able to code all causes of death in 94.4% of death certificates, but only 50.6% were coded directly, without adjustments. Among the death certificates that the software was unable to fully code, 89.2% had a diagnosis of external causes (Chapter XX of ICD-10). This group of causes of death showed less agreement when the coding by Iris was compared with the manual coding.


Conclusion: The software performed well, but its dictionary needs adjustments and improvement. Its developers are trying to solve the problem with external causes of death in upcoming versions.

Keywords: Cause of death; Vital statistics; Mortality; Information systems; Death certificates; Automation; Mortality registries.


Mortality statistics are used to define health conditions and socioeconomic parameters [1]. Because they serve as measures for international comparison, concepts, data collection methods, and analysis must be standardized. As a milestone for mortality statistics, in 1948 the World Health Organization (WHO) defined the international model of the death certificate, which is now used in many countries, including Brazil. In addition, the WHO defined rules for selecting the underlying cause of death (UCD), which should be the origin of the chain of events resulting in death [2] and is the cause used in health statistics.

In Brazil, the death certificate (DC) is filled by a doctor and analyzed by a coder. On the basis of the International Statistical Classification of Diseases and Related Health Problems - 10th revision (ICD-10), this professional codes all causes of death mentioned in the DC, applies the WHO selection rules, and selects the UCD. To do so, coders undergo training and have access to the WHO manuals.

Failures in coding or selecting causes of death can compromise the quality of mortality data. Mistakes made by the physician on the DC, such as reporting causes that lack specificity, can lead to coding errors [3]. Another possible failure occurs in the selection of the UCD, because the rules are complex and comprehensive. The possibility of interpreting the rules in different ways [3] and the need to consider numerous exceptions make determining the UCD difficult [4], even for trained and experienced coders.

To reduce failures and improve data quality, several strategies are used, including making health professionals aware of the errors they make [4]. Another possible solution is the use of software that simulates the role of a coder. Such software can automatically select the UCD, or can both code the causes of death mentioned in the DC and select the UCD.

The first type of software selects the UCD from the ICD-10 codes that the coder has assigned to the causes of death in the DC. Such software contains decision tables that simulate the WHO selection rules. The Automated Classification of Medical Entities (ACME) program, developed in the United States, is the best known and is used worldwide. The Underlying Cause of Death Selection System (SCB, in Portuguese), created and used in Brazil, also falls into this category. It was developed from the ACME decision tables, which were changed to meet the Brazilian reality [5].

The second type of software codes the causes of death listed in the death certificate and then selects the UCD. It requires a dictionary of medical terms and their respective ICD-10 codes. This group comprises the Mortality Medical Data System (MMDS) [6], developed in the United States, and Iris, developed in partnership by institutions from France, Germany, Hungary, Italy, Spain, Sweden, and the United States [7]. Because this software codes and selects automatically, it allows a great improvement in the quality of mortality data [8].

Both Iris and the MMDS use the ACME decision tables to select the UCD. The difference between them is that in Iris the language-dependent aspects are separated from the software itself: Iris has an independent dictionary that can be easily configured in various languages and therefore used in various countries. This makes Iris the more adaptable software for worldwide use. In addition, because it uses the WHO selection rules and the international medical certificate model, it makes mortality data easier to compare among countries.

Considering that Iris can help to obtain more reliable statistical data and to improve the daily routine of coders, this study aimed to describe the adaptation of Iris software for use in Brazil.


The preparation of Iris software to be used in Brazil was done in phases: first, the dictionary was prepared in Portuguese; then, its completeness was assessed and it was adapted to terms used in the country; finally, its application was compared to the causes of death coded manually.


The Iris version used was V4.0.34 [9], with the English interface. The software's dictionary is an electronic file containing two tables: one is the dictionary itself and the other is a set of standardization rules. The dictionary lists each category (possible causes of death, medical terms, and diagnoses) and its respective ICD-10 code. The standardization table contains the rules that normalize the medical terms written in the DC so that they can be found in the dictionary [10].

To create the Portuguese dictionary, two sources were used. The first was the electronic file from PESQCID [11], containing the medical term categories and their respective codes as listed in ICD-10 volume 1. The second was the Thesaurus of the International Classification of Primary Care (ICPC-2), developed by the World Organization of Family Doctors (formerly WONCA) [12], which contains a list of diagnoses with their equivalent ICD-10 codes.

The initial dictionary, containing 58,546 categories, was built by combining 12,211 categories (20.9%) from PESQCID and 46,335 categories (79.1%) from the Thesaurus. Repeated terms (7,500 in total) were then excluded, leaving 51,046 diagnosis terms with their respective ICD-10 codes.
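The merging step can be illustrated with a minimal Python sketch. It is not the procedure actually used by the authors: the function name `merge_sources`, the toy entries, and the definition of a "repeated term" (same text after trimming and lower-casing) are assumptions made only for illustration. The last lines check the counts reported above.

```python
def merge_sources(*sources):
    """Merge lists of (term, icd10_code) pairs, keeping the first occurrence of each term.

    Assumption: two terms are 'repeated' when they are equal after
    trimming whitespace and lower-casing.
    """
    merged = {}
    duplicates = 0
    for source in sources:
        for term, code in source:
            key = term.strip().lower()
            if key in merged:
                duplicates += 1
            else:
                merged[key] = code
    return merged, duplicates


# Toy entries (hypothetical, not taken from PESQCID or the ICPC-2 Thesaurus).
pesqcid = [("Sepsis", "A41.9"), ("Migraine", "G43.9")]
thesaurus = [("sepsis", "A41.9"), ("Circulatory shock", "R57.9")]

dictionary, removed = merge_sources(pesqcid, thesaurus)
print(len(dictionary), removed)

# Counts reported in the text: categories from each source, then duplicates removed.
total = 12_211 + 46_335
print(total, total - 7_500)
```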

Following this step, the dictionary was adapted and standardized so that its terms would match those used by physicians when filling out the DC. Many terms were modified, added, or excluded in both the dictionary and the standardization table. Standardization tools reduce the size of the dictionary, which then needs to contain only the key terms; this is especially important in countries with rich linguistic variety such as Brazil. One example of this step is the removal of diacritical marks (accents) from all terms, so that the software can recognize a word written in the DC with or without them. Many synonyms or different ways of identifying the same medical condition were standardized to a single term contained in the dictionary.

At the end of this process, the dictionary had 46,801 categories with their respective ICD-10 codes, whereas the standardization table held 621 rules.
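As an illustration of the removal of diacritical marks, the sketch below uses Python's standard `unicodedata` module. It is not Iris code; the helper names and the additional upper-casing and whitespace collapsing are assumptions chosen for the example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(text: str) -> str:
    """Strip accents, upper-case, and collapse repeated whitespace (illustrative choices)."""
    return " ".join(strip_diacritics(text).upper().split())


# The same condition written with and without accents maps to one key.
print(normalize_term("Insuficiência  cardíaca"))
print(normalize_term("insuficiencia cardiaca"))
```

With this kind of normalization, the dictionary needs only one entry per term, regardless of how the physician accented it on the DC.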


We used a sample of death occurrences among residents in the city of São Paulo, from December 1 to 4, 2010. Data were collected from the DCs archived in the Program of Improvement of Mortality Information in São Paulo (PRO-AIM), from the São Paulo city Health Department, responsible for processing, analyzing, and disseminating the city's mortality-related information.

The following information was collected from the DCs: DC register number; date of birth; date of death; gender and medical certificate data (block V). No other identification was present except the DC register number, so we were able to maintain the privacy of the deceased.

The PRO-AIM coders wrote on the DCs the ICD-10 code for each condition and indicated the selected UCD. This manual coding was the basis for comparison with the automatic method.

Additional information not found in the DC was not used. More specifically, medical or autopsy reports and investigations carried out by mortality committees were not used to change the cause of death.


One of the steps required before using Iris is the preparation of DC batches, stored in an electronic file generated from a pre-formatted table. In this table, cells are set for the information used, required, and produced by the software for each DC processed. The electronic file can hold several batches, and each batch can hold multiple DCs; it is up to the user to organize and name each batch. In this test, batches were named after the date of death and averaged 167 DCs each.

Preparing a batch means filling in the table for each DC with gender, date of birth, date of death, and reference number. This is the minimum information required for Iris to operate.
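The minimum information per DC can be pictured as a small record. The sketch below is only illustrative: the class `BatchRow`, its field names, and the helper `missing_fields` are hypothetical and do not reproduce the actual layout of the Iris batch table.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class BatchRow:
    """One DC in a batch, with the four minimum fields named in the text (hypothetical names)."""
    reference_number: Optional[str]
    sex: Optional[str]
    date_of_birth: Optional[date]
    date_of_death: Optional[date]


def missing_fields(row: BatchRow) -> List[str]:
    """List the minimum fields that are empty for this DC."""
    return [name for name, value in vars(row).items() if value is None or value == ""]


# A DC without a date of birth cannot be submitted until the gap is handled.
row = BatchRow("DC-001", "M", None, date(2010, 12, 1))
print(missing_fields(row))
```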

For information not available in the DCs, the following rules were created. Whenever the DC was that of an unidentified person, the date of birth was assumed to be January 1, 1950, so that data were not lost. If the age field was filled in, the birth date was calculated as if the deceased's birthday fell on the date of death. When the gender field was not filled in, the information was recorded as ignored.
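These rules for the date of birth can be sketched as follows. The function name, its parameters, the order in which the rules are tried, and the handling of a death on February 29 are assumptions for illustration; only the placeholder date and the "birthday on the date of death" rule come from the text.

```python
from datetime import date
from typing import Optional

# Placeholder used in the study for unidentified persons.
PLACEHOLDER_BIRTH = date(1950, 1, 1)


def impute_birth_date(date_of_death: date,
                      date_of_birth: Optional[date] = None,
                      age_years: Optional[int] = None) -> date:
    """Return a usable date of birth for batch preparation."""
    if date_of_birth is not None:
        return date_of_birth
    if age_years is not None:
        # Treat the date of death as the deceased's birthday.
        year = date_of_death.year - age_years
        try:
            return date_of_death.replace(year=year)
        except ValueError:
            # Death on 29 February and a non-leap birth year:
            # fall back to 28 February (an assumption, not stated in the text).
            return date_of_death.replace(year=year, day=28)
    return PLACEHOLDER_BIRTH


print(impute_birth_date(date(2010, 12, 3), age_years=70))
print(impute_birth_date(date(2010, 12, 1)))
```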


Once the first version of the Portuguese dictionary was completed and batches were prepared, all DCs were added to Iris to begin the process of coding and UCD selection.

In situations where Iris was unable to identify and code the cause of death, the DC was analyzed and one of the following decisions was adopted:

  • adding the term referring to the cause of death to the dictionary with, wherever possible, a 4-digit ICD-10 code. For example, circulatory shock, code R57.9;

  • changing the text of a category in the dictionary table. For example, deleting "unspecified" from all categories; thus "Migraine, unspecified" becomes "migraine";

  • adding a standardization rule. For example, establishing that "severe sepsis is sepsis": if "severe sepsis" is indicated as a cause of death, Iris will search for "sepsis" in the dictionary;

  • changing a standardization rule. For example, adding "irreversible sepsis" to the standardization rule "severe sepsis is sepsis", so that both "severe sepsis" and "irreversible sepsis" are referred to "sepsis".
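The interplay between the two tables can be sketched in Python as a rewrite step followed by a dictionary lookup. This is a simplified illustration, not the format of the Iris tables: the names `STANDARDIZATION_RULES`, `DICTIONARY`, and `code_term` are hypothetical, exact-match rules are an assumption, and the sepsis code A41.9 is added only to make the example complete (the shock code comes from the text).

```python
# Standardization rules: variant wording -> key term (examples from the text).
STANDARDIZATION_RULES = {
    "SEVERE SEPSIS": "SEPSIS",
    "IRREVERSIBLE SEPSIS": "SEPSIS",
}

# Dictionary: key term -> ICD-10 code (toy subset).
DICTIONARY = {
    "SEPSIS": "A41.9",             # illustrative code, not stated in the text
    "CIRCULATORY SHOCK": "R57.9",  # example given in the text
    "MIGRAINE": "G43.9",           # "Migraine, unspecified" shortened to "migraine"
}


def code_term(raw_text):
    """Normalize the text, apply a standardization rule if one matches, then look it up."""
    term = " ".join(raw_text.upper().split())
    term = STANDARDIZATION_RULES.get(term, term)
    return DICTIONARY.get(term)  # None means the term could not be coded


for text in ["severe sepsis", "Irreversible  sepsis", "circulatory shock", "unknown term"]:
    print(text, "->", code_term(text))
```

A term that returns `None` corresponds to the situation described above, in which the DC is reviewed and either the dictionary or the rule table is adjusted.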

The information collected formed a database with the DC as the unit of analysis. The database contained: the ICD-10 codes for each cause mentioned in Parts I and II, coded manually and by Iris; the ICD-10 code for the UCD; the selection rules used; and the need to add and/or change a dictionary category or standardization rule.

The data were analyzed using Microsoft® Excel and Epi Info™ 3.5.3. To compare the two types of coding, manual and automatic, an agreement ratio (AR) was created. For a given ICD-10 chapter, the AR is the ratio between the proportion of causes of death coded in that chapter by Iris and the proportion coded in that chapter manually. The DCs were not paired for this analysis. The AR was considered good, moderate, or low, as described in Table 1.
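A minimal sketch of the AR calculation is given below. The function names and the toy data are hypothetical, and placing the Iris proportion in the numerator is an assumption, since the text does not state the direction of the division. The thresholds of Table 1 are not reproduced here.

```python
from collections import Counter


def chapter_proportions(chapters):
    """Proportion of coded causes falling in each ICD-10 chapter."""
    counts = Counter(chapters)
    total = sum(counts.values())
    return {chapter: n / total for chapter, n in counts.items()}


def agreement_ratio(iris_chapters, manual_chapters):
    """AR per chapter: Iris proportion divided by manual proportion (assumed direction)."""
    p_iris = chapter_proportions(iris_chapters)
    p_manual = chapter_proportions(manual_chapters)
    return {ch: p_iris.get(ch, 0.0) / p_manual[ch] for ch in p_manual}


# Toy, unpaired lists of chapters assigned to causes of death (not study data).
iris = ["IX", "IX", "X", "XX"]
manual = ["IX", "IX", "X", "XX", "XX"]

for chapter, ar in sorted(agreement_ratio(iris, manual).items()):
    print(chapter, round(ar, 3))
```

An AR close to 1 means that both methods assign a similar share of causes to the chapter; because the DCs are not paired, it does not show whether the same certificates received the same codes.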

Table 1: Range of values of the agreement ratio considered good, moderate, or low. 

The study was approved by the Ethics Committee of the Public Health School of Universidade de São Paulo, under protocol number 2256. There was no conflict of interest.


The test used 666 DCs of residents in São Paulo. The mean age was 65.4 years and 52.2% were males. In two cases (0.3%), the gender field was not filled in or was registered as undefined. In 10 other cases (1.5%), the date of birth was not filled in, and in 2 DCs the information given was an approximate age.

Iris was able to code all lines reported on the death certificate in 629 (94.4%) DCs, which could therefore have their UCDs selected. Among the DCs in which the software could not code all causes of death, 89.2% had a diagnosis of external cause (Chapter XX of ICD-10).

The number of 4-digit ICD-10 codes (complete codes) used in this test by Iris was 362 (2.9% of the 12,451 existing subcategories) and by manual coding was 388 (3.1%). The proportion of codes per DC was almost the same in both coding systems: Iris 3.4% and manual coding 3.3%.

When the causes of death coded by the two methods were compared without pairing DCs, and codes were evaluated by ICD-10 chapter, the AR was considered good (Table 2).

Table 2: Distribution of causes of death coded by Iris and manually according to ICD-10 chapters, the agreement ratio, and the qualification of the agreement ratio in the deaths of residents in São Paulo, December 1-4, 2010.

ICD-10: International Statistical Classification of Diseases and Related Health Problems - tenth revision; AR: agreement ratio; QAR: qualification of agreement ratio; G: good; M: moderate; L: low.

When DCs were paired and the coding of causes of death by the two methods, Iris and manual, was compared, full agreement was observed in 420 DCs (63.1%). In these cases, the complete ICD-10 codes for all causes of death, in Parts I and II of the DC, agreed in both systems. Twenty DCs (3%) showed disagreement for all causes of death, and 226 (33.9%) for some causes of death. The average disagreement rate with paired DCs was 35.4%, considering complete ICD-10 codes.
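The three categories of paired comparison can be sketched as follows. The function `classify_agreement` is hypothetical and compares the codes of each DC as sets, ignoring the line on which each code appears; the study may have compared codes line by line. The last lines recompute the percentages reported above from the counts.

```python
def classify_agreement(iris_codes, manual_codes):
    """Classify one paired DC as 'full', 'partial', or 'none' agreement."""
    iris, manual = set(iris_codes), set(manual_codes)
    if iris == manual:
        return "full"
    if iris & manual:
        return "partial"
    return "none"


# Toy paired DCs (hypothetical codes, not study data).
print(classify_agreement(["I21.9", "E11.9"], ["E11.9", "I21.9"]))
print(classify_agreement(["I21.9", "E11.9"], ["I21.9", "E14.9"]))
print(classify_agreement(["X59.9"], ["W19.9"]))

# Counts reported in the text, out of 666 DCs.
counts = {"full": 420, "none": 20, "partial": 226}
total = sum(counts.values())
for label, n in counts.items():
    print(label, n, round(100 * n / total, 1))
```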

Comparing the coding of the terms on each line of the DCs, we found a difference of 14.1% when considering complete ICD-10 codes, with 4 characters. When agreement is considered only at the chapter level, the difference decreases to 9.0%.

Iris was able to code 337 (50.6%) DCs directly and completely, with no need for adjustments. For the remainder, some kind of change or term addition to the dictionary or to the standardization table was needed. In total, 582 adjustments were required, which means that, on average, nearly one adjustment was made per DC (Table 3).

Table 3: Number of adjustments required for Iris to code, according to type and percentage of adjustment per death certificate in deaths of the residents in São Paulo, December 1-4, 2010. 

In total, 433 terms were added to the dictionary table. The chapters with the most categories added were: circulatory diseases, with 97 additions (22.4%); endocrine, nutritional and metabolic diseases, with 56 additions (12.9%); neoplasms, with 54 additions (12.5%); and external causes of morbidity and mortality, with 30 additions (6.9%).

Changes made after this first test resulted in a dictionary with 47,020 categories and 859 rules in the standardization table.


Iris was able to code and select the UCD in 94.4% of the sampled DCs. Agreement with manual coding was 63.1%, indicating the potential of the software to code and correctly select the UCD. The Portuguese dictionary will improve and become more suitable for Brazil as its use increases.

The low AR for external causes (Chapter XX) is due to the software's difficulty in coding this group of causes. This limitation is a worldwide problem, and corrections by the team that developed the program are ongoing. In Brazil in particular, variations in the wording and specification of external causes of death influence their coding. Another difficulty is that in most cases the data entered on the death certificate are insufficient for correct coding. Coders had access to various information sources that Iris did not.

One of the main difficulties with this group of causes is that the same cause of death can be written in different ways. The following expressions, for example, can be synonymous: electric shock; electrocution; physical-chemical energy; physical electricity agent; and burning or other injury due to electrical current. Furthermore, the same term may have different codes depending on the circumstances of death. A blunt object can be coded as Y29 if the intent was undetermined, X59 if an accident, X70 if a suicide, and Y00 if a homicide.

The initial dictionary did not cover this diversity of terms, and hence standardization rules or additions/changes were used. Because the range of synonyms is broad, it is difficult to create standardization rules around a single keyword. Furthermore, owing to programming problems, Iris at that time did not accept the inclusion of certain terms in the dictionary. This was the case for codes starting with "W", "X", or "Y", which in the ICD-10 make up part of the chapter on external causes of morbidity and mortality (Chapter XX). The terms added to the dictionary that belonged to Chapter XX were mostly related to medical and/or surgical procedures.

Another factor that makes it difficult to code external causes of death is the absence of information in DCs concerning the circumstances and the real cause of death [13]. This gap hinders both manual and automatic coding. When these data are not filled in, the coder has to use less specific categories [13].

To improve the quality of mortality data on external causes of death, PRO-AIM has, since 1996, carried out investigations together with the Institute of Forensic Medicine (IML) of the city of São Paulo, aiming to clarify the circumstances of these deaths. The documents consulted for additional information are police reports, autopsy reports, or reports from the hospital where the death occurred [14]. This new information allows the DC to be recoded.

In this study, 63.6% of the DCs in which an external cause was mentioned had an autopsy. These deaths were coded manually only after access to the information collected at the IML, whereas Iris used only the information contained in the original DC. This affected the analysis and explains the low AR in this group of causes when the two coding systems are compared.

As for discrepancy within the same ICD-10 chapter, a Brazilian study compared the manual coding in the Mortality Information System (SIM, in Portuguese) with that of a manual coder and found a discrepancy of 3.7% [3], which is lower than that found in this study (9%). This difference can be explained by the fact that Iris is recent software that is still being improved.

A Dutch study [8] compared manual coding of all causes of death mentioned in DCs by four coders and obtained a mismatch of 24.2% at the 4-digit code level and 11.2% at the 1-digit code level. One explanation for the discrepancy between the present study and that one is the difference in method, because it is known that as the number of coders increases, the agreement between them decreases [8].

Some sequences of causes of death are often interpreted differently by different coders, and even by the same coder at different moments. Iris always interprets a given sequence in the same way, which improves comparability.

The same study [8] also reported 15-20% rejection or failure in the selection of UCDs when DCs were coded automatically. The explanation given was that the software finds ambiguous causal relationships and is unable to resolve them. In such cases, manual coding is mandatory.

Besides the difficulties mentioned above, the Iris version used in this study was unable to process external causes of death or medical complications. This is because Iris is recent software that is being improved gradually as it is used. A group of experts from various countries is responsible for developing and updating the software. This group also oversees country adaptations and adjustments, which keeps the information produced comparable. The group's functions include incorporating ICD-10 updates into the software, requesting feedback from users, and improving its functions, including adapting it for use in verbal autopsy [15]. It is up to the mortality teams in the countries that use Iris to update and adapt the dictionary and its functions to the local reality, without interfering with the selection tables.

Several changes have been made since the first version of Iris, and continued progress is expected in the coming years. The group responsible for enhancing the software has been working to solve the difficulties of coding external causes of death. The dictionary tables are being adapted so that a cause of death listed on the DC can be recoded based on information contained in other fields or on another cause of death. Another important point is the mandatory use of paid supporting software (Microsoft® Access) to form batches. New alternatives, including an Internet platform, have been considered as a replacement [10].

The Brazilian SIM accepts an incomplete DC. Information such as date of birth or age may be missing, and the system is still able to process the certificate. The version of Iris used in this study did not allow batch preparation without the date of birth or identification of sex. However, a recently launched version, which requires no batch preparation, is much easier to use. If, for any reason, age or sex is incompatible with a cause reported in the DC, the program displays a screen with questions about the accuracy of the information.

Another feature that makes the program attractive is the possibility of screen translation, adapting it to the features of the DC used in each country.


Considering that this was the first test of Iris in Brazil, the fact that it was able to code 50.6% of DCs directly is a good indicator, especially given the room for improvement through adjustments and additions to the dictionary and the standardization table. In addition, the software showed 63.1% agreement in paired DCs, considering ICD-10 codes with 4 characters.

As the dictionary is developed, further additions of terms used in medical routine and other language adjustments are expected. In subsequent tests, therefore, fewer adjustments should be needed as the dictionary becomes more complete.

In addition, the use of a new, updated version of the program, especially with improved coding of external causes, should contribute to better results.


1. Laurenti R, Mello Jorge MHP, Lebrão ML, Gotlieb SLD. Estatísticas de saúde. 2nd ed. São Paulo: EPU; 2005. p. 54-7.

2. Organização Mundial da Saúde. Classificação Estatística Internacional de Doenças e Problemas Relacionados à Saúde: CID-10. 3rd ed. São Paulo: Editora da Universidade de São Paulo; 1996. 2 v.

3. Fajardo S, Aerts DRG de C, Bassanesi SL. Acurácia da equipe do Sistema de Informações sobre Mortalidade na seleção da causa básica do óbito em capital no Sul do Brasil. Cad Saúde Pública 2009; 25(10): 2218-28.

4. Soares JAS, Horta FMB, Caldeira AP. Avaliação da qualidade das informações em declarações de óbito infantis. Rev Bras Saúde Matern Infant 2007; 7(3): 289-95.

5. Santo AH, Pinheiro CE. Uso de microcomputador na seleção da causa básica de morte. Bol Oficina Sanit Panam 1995; 119(4): 319-27.

6. U.S. Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Proceedings of the International Collaborative Effort on Automating Mortality Statistics, Volume III. Hyattsville: U.S. Department of Health and Human Services; 2006.

7. Lefeuvre D, Pavillon G, Aouba A, Lamarche-Vadel A, Fouillet A, Jougla E, et al. Quality comparison of electronic versus paper death certificates in France, 2010. Popul Health Metr 2014; 12: 3.

8. Harteloh P, Bruin K de, Kardaun J. The reliability of cause-of-death coding in the Netherlands. Eur J Epidemiol 2010; 25: 531-8.

9. International Coding System for Causes of Death - IRIS. [software on internet]. Version 4.0.23. France: Institut National de la Santé et de la Recherche Médicale; 2010. Available from: (Updated December 19, 2010; accessed January 30, 2014).

10. Johansson LA, Pavillon G, Pelikán L, Weber S, Witting B. Iris User Reference Manual V4.1.3. IRIS - Automated coding system for causes of death. Available from: (Accessed January 28, 2014).

11. Centro Colaborador da OMS para a Classificação de Doenças em Português - CBCD. Classificação Estatística Internacional de Doenças e Problemas Relacionados à Saúde - CID-10. 2008 electronic version. Available from: (Accessed February 25, 2012).

12. Classificação Internacional de Atenção Primária - CIAP 2. [software on CD-ROM]. 2nd ed. Florianópolis: Comitê Internacional da Classificação da WONCA; 2009.

13. Matos SG, Proietti FA, Barata R de CB. Confiabilidade da informação sobre mortalidade por violência em Belo Horizonte, MG. Rev Saúde Pública 2007; 41(1): 76-84.

14. Drumond Jr M, Lira MMTA, Nitrini TMV, Shibao K, Taniguchi M, Bourroul MLM. O novo modelo da declaração de óbito e a qualidade das informações sobre causas externas. In: VI Congresso Brasileiro de Saúde Coletiva; 2000 Jul; Salvador (BR). 2000. Available from: (Accessed January 30, 2014).

15. Pavillon G. IRIS. Newsletter on the WHO-FIC, Volume 10, number 1, 2012. Available from: (Accessed January 30, 2014).

Financial support: none.

Received: May 23, 2014; Accepted: March 16, 2015

This is an open-access article distributed under the terms of the Creative Commons Attribution License