Improving geocoding matching rates of structured addresses in Rio de Janeiro, Brazil

Strategies for improving geocoded data often rely on interactive manual processes that can be time-consuming and impractical for large-scale projects. In this study, we evaluated different automated strategies for improving address quality and geocoding matching rates using a large dataset of addresses from death records in Rio de Janeiro, Brazil. Mortality data included 132,863 records with address information in a structured format. We applied regular expressions and dictionary-based methods for address standardization and enrichment. All records were linked by their postal code or street name to the Brazilian National Address Directory (DNE) obtained from Brazil’s Postal Service. Residential addresses were geocoded using Google Maps. Records with address data validated down to the street level and location type returned as rooftop, range interpolated, or geometric center were considered a geocoding match. The overall performance was assessed by manually reviewing a sample of addresses. Out of the original 132,863 records, 85.7% (n = 113,876) were geocoded and validated, out of which 83.8% were matched as rooftop (high accuracy). Overall sensitivity and specificity were 87% (95%CI: 86-88) and 98% (95%CI: 96-99), respectively. Our results indicate that address quality and geocoding completeness can be reliably improved with an automated geocoding process. R scripts and instructions to reproduce all the analyses are available at https://github.com/reprotc/geocoding.

Keywords: Geographic Mapping; Geographic Information Systems; Mortality; Data Accuracy

Correspondence: T. R. Cortes, Instituto de Medicina Social, Universidade do Estado do Rio de Janeiro. Rua São Francisco Xavier 524, sala 7013-D, Rio de Janeiro, RJ 20550-013, Brasil. taisacortes@gmail.com

1 Instituto de Medicina Social, Universidade do Estado do Rio de Janeiro, Rio de Janeiro, Brasil.
2 Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Brasil.

doi: 10.1590/0102-311X00039321
Cad. Saúde Pública 2021; 37(7):e00039321
Brief communication

This article is published in Open Access under the Creative Commons Attribution license, which allows use, distribution, and reproduction in any medium, without restrictions, as long as the original work is correctly cited.


Introduction
Geocoding is the process of converting address information into an absolute geographic reference, such as latitude and longitude 1. Previous studies have shown that the use of low-quality geocoded data can introduce substantial bias in spatial and epidemiological analyses 2,3.
The quality of geocoding results can be influenced by several factors, including quality of the input address, underlying reference data, geocoding algorithms, and matching criteria 1,4.
Strategies for improving geocoded data often rely on interactive manual processes that can be time-consuming and impractical for large-scale projects. On the other hand, some automated approaches may require large training samples that may not be available in the same language or format as the study addresses 5.
In this study, we evaluated different automated strategies for improving input address quality and geocoding matching rates using a large dataset of addresses from death records in Rio de Janeiro, Brazil.

Study data
Mortality data were obtained from the Municipal Health Department of Rio de Janeiro. The dataset included 90,897 deaths caused by cardiovascular diseases and 41,966 deaths due to respiratory diseases (coded in Chapters IX and X of the 10th revision of the International Classification of Diseases) that occurred among residents of the municipality of Rio de Janeiro between 2012 and 2017.
Each record had a structured format with six address fields: full street name (street type and name), house number, address complement, neighborhood of residence, postal code, and city.

Address standardization
Address standardization was performed by removing punctuation and double spaces and converting numbers and abbreviations to a uniform representation. The full street name was split into street type and name.
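The standardization steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the abbreviation table, the uppercasing, and the removal of accents are assumptions made for illustration, and the function names `standardize` and `split_street` are hypothetical.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the dictionary used in the study is not reproduced here.
STREET_TYPE_ABBREVIATIONS = {
    "R": "RUA",
    "AV": "AVENIDA",
    "TV": "TRAVESSA",
    "EST": "ESTRADA",
    "PCA": "PRACA",
}
STREET_TYPES = set(STREET_TYPE_ABBREVIATIONS.values())


def standardize(text):
    """Uppercase, strip accents (an assumption), remove punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKD", text.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def split_street(full_name):
    """Split a full street name into (street type, street name)."""
    tokens = standardize(full_name).split(" ")
    first = STREET_TYPE_ABBREVIATIONS.get(tokens[0], tokens[0])
    if first in STREET_TYPES:
        return first, " ".join(tokens[1:])
    return "", " ".join(tokens)                # no recognizable street type


print(split_street("R.  São  Francisco Xavier"))
```

Keeping the street type separate from the name makes it possible to compare the two components independently in later validation steps.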
We used two types of dictionaries for error correction. One was manually created and was composed of the most frequent misspellings in the dataset, and the other was based on common spelling variants in Portuguese 6. We applied these spelling variant rules to the Brazilian National Address Directory (DNE) obtained from Brazil's Postal Service (Correios S.A.). Each spelling substitution could only match a single street name (e.g., the missing word "da" in "Rua da União" would not be considered an error and would not be corrected if there were other official street names without that word; for instance, "Rua União").
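The single-match constraint can be sketched as follows. This is an illustrative Python example under stated assumptions, not the study's implementation: the spelling-variant rules in `VARIANT_RULES` and the street names in the usage lines are hypothetical. A name is corrected only when it is not already an official name and exactly one official name shares its variant-normalized form.

```python
# Hypothetical spelling-variant rules (pattern, replacement); the study's rules are not reproduced.
VARIANT_RULES = [("PH", "F"), ("TH", "T"), ("Y", "I"), ("LL", "L"), ("Z", "S")]


def variant_key(name):
    """Canonical form of a name after applying every spelling-variant rule."""
    for old, new in VARIANT_RULES:
        name = name.replace(old, new)
    return name


def correct_name(name, official_names):
    """Return the corrected name, or the input unchanged when the correction is unsafe."""
    if name in official_names:
        return name                      # already official: not treated as an error
    key = variant_key(name)
    candidates = {o for o in official_names if variant_key(o) == key}
    if len(candidates) == 1:             # substitution points to a single street name
        return candidates.pop()
    return name                          # no candidate, or ambiguous: leave as is


print(correct_name("TEOFILO OTONI", {"THEOPHILO OTONI", "UNIAO"}))  # corrected
print(correct_name("LUYZ", {"LUIZ", "LUIS"}))                       # ambiguous, unchanged
```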

Address enrichment
We used three approaches to enrich the address records and retrieve missing information. First, using regular expressions, we extracted strings related to the residence number, such as lot and block, from the address complement. Second, we retrieved neighborhood data by extracting strings from other fields that fully matched the official neighborhood names in Rio de Janeiro. Third, all records with a valid (8-digit) postal code were linked to the DNE. The remaining records were linked to the DNE database by their street name, and they were considered a match if:
(1) There was a single pair of records with the lowest Levenshtein distance (up to 2) for the street name field;
(2) They had the same street type, or the street name did not occur with a different type within the neighborhood;
(3) They had the same neighborhood name, or their neighborhoods shared a land border;
(4) The house number fell within the street segment (side, range) of the postal code address.
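Two of these enrichment steps lend themselves to a short illustration: extracting lot and block from the address complement, and criterion (1) of the street-name linkage. The Python sketch below is an assumption-laden example rather than the study's R code. The regular expression, the abbreviations "LT" and "QD", and the function names are hypothetical, and criteria (2) to (4) are not implemented.

```python
import re

# Hypothetical pattern for lot ("LOTE", "LT") and block ("QUADRA", "QD") in a standardized complement.
LOT_BLOCK = re.compile(r"\b(LOTE|LT|QUADRA|QD)\s*(\d+)\b")
LABELS = {"LOTE": "lot", "LT": "lot", "QUADRA": "block", "QD": "block"}


def extract_lot_block(complement):
    """Return a dict with any lot and block numbers found in the address complement."""
    return {LABELS[kind]: number for kind, number in LOT_BLOCK.findall(complement)}


def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def unique_closest(name, reference_names, max_distance=2):
    """Return the single reference name with the lowest distance, or None if absent or tied."""
    scored = [(levenshtein(name, ref), ref) for ref in reference_names]
    scored = [(d, ref) for d, ref in scored if d <= max_distance]
    if not scored:
        return None
    best = min(d for d, _ in scored)
    winners = [ref for d, ref in scored if d == best]
    return winners[0] if len(winners) == 1 else None


print(extract_lot_block("LT 12 QD 5 FUNDOS"))
print(unique_closest("SAO FRANCISCO XAVIR",
                     ["SAO FRANCISCO XAVIER", "SAO FRANCISCO DE ASSIS"]))
```

Returning `None` on ties reflects the requirement that only a single pair may have the lowest distance; a tie is treated as no match rather than resolved arbitrarily.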

Geocoding process and performance assessment
Residential addresses were geocoded using the Google Maps Geocoding API (https://developers.google.com/maps/documentation/geocoding/overview). Most addresses were specified by following the Brazilian postal service format (i.e., full street name, number, neighborhood, and municipality). For some addresses, other formats were used that included block, lot, and house number (e.g., full street name, lot and block, neighborhood, and municipality).
The output address was also standardized by performing the same steps for data correction and enrichment. We compared the returned address to the original data and the address components retrieved from the DNE database. All records with address data validated down to the (complete) street level and location type returned as rooftop, range interpolated, or geometric center (https://developers.google.com/maps/documentation/geocoding/overview) were considered a geocoding match.
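As a rough illustration of this matching rule, the Python sketch below checks a single, already parsed result from the Geocoding API. The fields `geometry.location_type` and `address_components` (with the `route` type) belong to the API's documented response format. The function names and the exact-equality comparison of street names are simplifying assumptions; the study compared standardized addresses against both the original data and the DNE components.

```python
ACCEPTED_LOCATION_TYPES = {"ROOFTOP", "RANGE_INTERPOLATED", "GEOMETRIC_CENTER"}


def route_of(result):
    """Return the street ("route") component of a geocoding result, if any."""
    for component in result.get("address_components", []):
        if "route" in component.get("types", []):
            return component.get("long_name")
    return None


def is_geocoding_match(result, expected_street, normalize=str.upper):
    """Accept a result only if its location type is allowed and its street name agrees.

    `normalize` stands in for the full standardization pipeline (an assumption).
    """
    location_type = result.get("geometry", {}).get("location_type")
    if location_type not in ACCEPTED_LOCATION_TYPES:
        return False
    route = route_of(result)
    return route is not None and normalize(route) == normalize(expected_street)


# Hand-written example result, shaped like one element of the API's "results" array.
example = {
    "address_components": [
        {"long_name": "524", "short_name": "524", "types": ["street_number"]},
        {"long_name": "Rua São Francisco Xavier", "short_name": "R. São Francisco Xavier",
         "types": ["route"]},
    ],
    "geometry": {"location_type": "ROOFTOP"},
}
print(is_geocoding_match(example, "rua são francisco xavier"))
```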
Geocoding completeness was determined by the overall matching rate 2. Geocoding performance was assessed by manually reviewing a random sample of 3,400 addresses. With manual review as the gold standard, we calculated the percentage of false-positive matches, false-negative non-matches, and overall sensitivity and specificity.
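The performance measures can be made concrete with a short Python sketch. Sensitivity is the share of true matches (according to manual review) that the automated process labeled as matches, and specificity is the share of true non-matches labeled as non-matches. The counts in the usage lines are hypothetical, and the Wilson score interval is an assumption, since the text does not state which confidence interval method was used.

```python
from math import sqrt


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Hypothetical confusion counts, for illustration only.
tp, fp, fn, tn = 90, 5, 10, 95
sens, spec = sensitivity_specificity(tp, fp, fn, tn)
print(sens, wilson_interval(tp, tp + fn))
print(spec, wilson_interval(tn, tn + fp))
```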
All analyses were performed in R. Files that are not subject to copyright or data privacy restrictions, including the R code, are available at https://github.com/reprotc/geocoding.
Ethical approval for this study was obtained from the Research Ethics Committee of the Municipal Health Department of Rio de Janeiro.

Results
Out of the original 132,863 records, 5.2% had incomplete addresses, and 54% had a valid (8-digit) postal code (Table 1). The overall matching rate was 85.7% (n = 113,876, with 83.8% matched as rooftop, 15.1% as range interpolated, and 1.1% as geometric center). Half of the addresses with incomplete information were geocoded and validated.
An example of a false negative (i.e., a true match that was incorrectly labeled as incompatible) is given by the input address "Rua Comandante Itapicuru, Nº - Tomás Coelho, Rio de Janeiro" and the corresponding pair "Rua Comandante Itapicuru Coelho, Nº - Tomás Coelho, Rio de Janeiro". In this case, the input street name is incomplete, but both addresses refer to the same location. However, our automatic strategy failed to validate the addresses using the DNE because the missing word "Coelho" entails a Levenshtein distance greater than two.
On the other hand, false positives included any erroneous or inconsistent matches labeled as compatible. For example, the match between the input address "Rua Sauna, Nº - Santíssimo, Rio de Janeiro" and the address "Rua Sauna, Nº - Senador Camará, Rio de Janeiro" was a false positive. Although there is only one street named "Sauna" ("Rua Sauna"), which is in the neighborhood of Senador Camará, another possible link includes a lane with the same name ("Travessa Sauna") in the adjacent neighborhood of Santíssimo.

Discussion
In this study, we evaluated different automated strategies for improving address quality and geocoding completeness using a large dataset of addresses in Rio de Janeiro. We obtained a geocoding matching rate of 85.7%, out of which 83.8% were matched as rooftop (high accuracy).
Although we obtained higher rates of automatic geocoding compared to previous studies in Brazil 8,9, further improvements could be achieved by combining multiple geocoding services and applying advanced address normalization methods 10.
One limitation of our study is that important dimensions of geocoding quality were not investigated, such as positional accuracy and repeatability 2. Previous studies have reported median positional errors ranging from 17 to 200 meters 2,4. However, few studies in Brazil have investigated the accuracy of the main geocoding services. A study using Google Maps (https://www.google.com/maps/) in the region of Belo Horizonte (Southeastern Brazil) reported a median error of approximately 55 meters for street and premise level accuracy 10.
Another limitation was the use of proprietary data (DNE database), which increased the cost of the geocoding process by 85%. Some alternatives include the National Registry of Addresses from the Brazilian Institute of Geography and Statistics (IBGE) 11 and collaborative postal code databases.
We emphasize that some precautions are necessary regarding the use of dictionaries and similarity metrics for address standardization and validation. In Rio de Janeiro, 2,183 street names appear in multiple neighborhoods, and 668 names occur with different types within the same neighborhood. In addition, some street type pairs (e.g., "Via" and "Vila") can have identical or very close similarity measures (e.g., Levenshtein distance or Soundex). Consequently, without reference data, some matching criteria could lead to errors and reduced address quality.
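The two kinds of ambiguity mentioned above can be counted from a reference table of streets. The Python sketch below assumes a table of (street type, street name, neighborhood) rows; the function name and the toy rows are illustrative, with the "União" rows being hypothetical, and the sketch is not the procedure used to obtain the figures reported in the text.

```python
from collections import defaultdict


def ambiguity_counts(streets):
    """Count names found in several neighborhoods and names with several types in one neighborhood.

    `streets` is an iterable of (street_type, street_name, neighborhood) tuples.
    """
    neighborhoods_by_name = defaultdict(set)
    types_by_name_and_place = defaultdict(set)
    for street_type, name, neighborhood in streets:
        neighborhoods_by_name[name].add(neighborhood)
        types_by_name_and_place[(name, neighborhood)].add(street_type)
    multi_neighborhood = {n for n, places in neighborhoods_by_name.items() if len(places) > 1}
    multi_type = {n for (n, _), types in types_by_name_and_place.items() if len(types) > 1}
    return len(multi_neighborhood), len(multi_type)


# Toy reference rows, for illustration only.
rows = [
    ("RUA", "SAUNA", "SENADOR CAMARA"),
    ("TRAVESSA", "SAUNA", "SANTISSIMO"),
    ("RUA", "UNIAO", "CENTRO"),
    ("TRAVESSA", "UNIAO", "CENTRO"),
]
print(ambiguity_counts(rows))
```

Names flagged by either count are the ones for which a similarity metric alone cannot decide the correct link, which is why a reference table is needed before applying such criteria.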
Our results indicate that the quality of input data and geocoding completeness can be reliably improved with an automated process. Further work is necessary to investigate other aspects of geocoding quality and the performance of the main geocoding services available in Brazil.