Spatial analysis of Tuberculosis in Rio de Janeiro in the period from 2005 to 2008 and associated socioeconomic factors using micro data and global spatial regression models

The present study analyses the spatial pattern of tuberculosis (TB) from 2005 to 2008 by identifying relevant socioeconomic variables for the occurrence of the disease through spatial statistical models. This ecological study was performed in Rio de Janeiro using new cases. The census sector was used as the unit of analysis. Incidence rates were calculated, and the Local Empirical Bayesian method was used. The spatial autocorrelation was verified with Moran’s Index and local indicators of spatial association (LISA). Using Spearman’s test, variables with significant correlation at 5% were used in the models. In the classic multivariate regression model, the variables that best fitted the model were proportion of heads of family with an income between 1 and 2 minimum wages, proportion of illiterate people, proportion of households with people living alone and mean income of the head of family. These variables were inserted in the Spatial Lag and Spatial Error models, and the results were compared. The former exhibited the best parameters: R² = 0.3215, Log-Likelihood = -9228, Akaike Information Criterion (AIC) = 18,468 and Schwarz Bayesian Criterion (SBC) = 18,512. The statistical methods were effective in identifying spatial patterns and in defining determinants of the disease, providing a view of the heterogeneity in space and allowing actions aimed more at specific populations.


Introduction
Tuberculosis (TB) is caused by Mycobacterium tuberculosis (Koch's bacillus). The disease is a serious public health problem worldwide; it is endemic in many countries and kills approximately 1.5 million people annually [1].
TB is directly linked to poor living conditions. The probability of an individual being infected and developing the disease depends on several factors, including the socioeconomic and health conditions to which this individual is subjected [2]. In urban centers of developing countries, social determinants such as poverty, low education level, increased population density, unhealthy housing and drug abuse constitute factors characterizing individuals who are vulnerable to the disease [3,4].
Social inequality, the appearance of AIDS, an aging population and large migratory movements are some of the factors highlighted by Ruffino-Neto [5] as the main causes for the severity of the current TB situation in the world. In 2010, there were an estimated 8.5-9.2 million cases and 1.2-1.5 million deaths (including deaths from TB among HIV-positive patients). TB is the second leading cause of death from infectious diseases in the world; in 2008, it was surpassed only by HIV [1].
TB remains a serious public health problem in Brazil. The highest concentration of cases is in the Southeast region, with Rio de Janeiro exhibiting the highest incidence rate. In 2010, the state of Rio de Janeiro identified 14,206 cases of TB, and the municipality of Rio de Janeiro reported the largest number of notifications: 7,664 cases. Although the incidence rate has decreased by 13% in the municipality of Rio de Janeiro in the last 9 years, in 2010, the rate was 95/100,000 inhabitants. Pulmonary TB is the predominant form of disease in this municipality [4,6].
The main data source for TB is the Notifiable Diseases Information System (Sistema de Informação de Agravos de Notificação, Sinan), and the notification is based on the definition of a confirmed case in the investigation and in the follow-up of cases [7].
In the municipality of Rio de Janeiro, TB is not equally distributed in geographic space. Identifying the spatial distribution of TB and its social determinants in different areas of the municipality makes it possible to identify more vulnerable populations and to plan governmental measures better aimed at the needs of different territories. Thus, incorporating the spatial dimension in the analysis of the disease may reveal additional meaning beyond that of conventional analyses alone, contributing to understanding the dynamics of this disease. Therefore, the use of geoprocessing techniques, particularly Geographic Information Systems (GIS) together with Spatial Statistics, allows several variables such as location, time, socioeconomic characteristics and environmental characteristics to be incorporated into health studies [8]. These techniques also support the development of etiological hypotheses regarding the origin of diseases in different populations [9]. Thus, these methods seek an inferential model that includes the spatial relationships that constitute the studied phenomenon [10].
A fundamental aspect in the application of these techniques is the characterization of the spatial dependency, showing how the values of the variables of interest are correlated in space. Global spatial regression models were used in some studies that associated TB and other communicable diseases with socioeconomic determinants, but those studies used larger spatial units, such as neighborhoods and administrative regions [11][12][13], rather than the census sector, for the analysis.
The present study aimed to analyze the spatial pattern of TB by identifying the relevant socioeconomic variables for the occurrence of TB and by comparing the classic statistical linear regression model with models with global spatial effects.

Methodology
The present study is an ecological study conducted in the municipality of Rio de Janeiro. With a population of 6,323,037, this municipality is composed of 160 neighborhoods and 763 slums (favelas), in which 22% of the population live [14].
The spatial analysis used area data. The census sector was chosen as the spatial unit. The TB data studied were the new cases reported to the Epidemiological Surveillance of the municipality of Rio de Janeiro through the Notifiable Diseases Information System (Sistema de Informação de Agravos de Notificação, Sinan) between the years of 2005 and 2008, provided by the Municipal Secretariat of Health and Civil Defense of Rio de Janeiro (Secretaria Municipal de Saúde e Defesa Civil do Rio de Janeiro, SMS-DC-RJ). The records used were those in which both the municipality of residence and the health service were in Rio de Janeiro. Socioeconomic data were extracted from the Demographic Census of 2010 and served as a basis to build the indicators used in the data analyses by census sector: average number of people per household (P_ANPH); average family income (AFI); proportion of heads of family with a monthly income greater than 1 minimum wage and less than 2 minimum wages (P_H_2MW); proportion of heads of family with a monthly income less than 1 minimum wage (P_H_1MW); proportion of illiterate people (P_ILLIT); proportion of households with a water supply from the general system (P_H_WS); proportion of households with a bathroom or sanitation for the exclusive use of residents and sewage system via the general sewage system or pluvial system (P_H_SS); proportion of households with garbage collection by a cleaning service (P_H_GC); proportion of households with a bathroom or sanitation for the exclusive use of residents (P_B_BS); proportion of declared white skin color (P_WSC); proportion of declared black skin color (P_BSC); proportion of declared yellow skin color (P_YSC); proportion of declared brown skin color (P_BRSC); proportion of declared indigenous skin color (P_ISC); proportion of households with people living alone (P_H_LA); and average income of the head of family (AI_H). The indicators were developed to cover the income, education level and housing condition dimensions to establish a proxy of the social condition of the patient.
TB cases were georeferenced by residence address using the methodology described in Magalhães et al. [15]. Cases in which the residence was a prison or a hospital were excluded from the study. Initially, an exploratory analysis was performed based on the incidence rate for the period per 1,000 inhabitants, calculated by census sector.
A total of 314 census sectors were excluded from the study: those that had no population (mostly areas occupied by massifs, ponds, green areas, etc.) and those whose data were omitted by the Brazilian Institute of Geography and Statistics (Instituto Brasileiro de Geografia e Estatística, IBGE) to maintain confidentiality because they had few private households.
To minimize the instability of crude rates and reduce random fluctuation, the incidence rates were smoothed using the Local Empirical Bayesian method [16][17][18]. A histogram showed that the data distribution was not normal, and to approximate it to a normal distribution, the natural (Napierian) logarithm (Ln) transformation was applied to the dependent variable.
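The smoothing step can be illustrated with a minimal sketch of a local empirical Bayes estimator in the moment-based form usually attributed to Marshall. This is an assumption about the general technique, not a reproduction of the software routine used in the study; the function name `local_empirical_bayes` and the toy data are hypothetical.

```python
import numpy as np

def local_empirical_bayes(cases, pop, neighbors):
    """Shrink each area's raw rate toward the rate of its own neighbourhood."""
    cases = np.asarray(cases, dtype=float)
    pop = np.asarray(pop, dtype=float)
    raw = cases / pop
    smoothed = np.empty_like(raw)
    for i, nbrs in enumerate(neighbors):
        idx = np.array([i, *nbrs])               # area i plus its first-order neighbours
        n, r = pop[idx], raw[idx]
        m = cases[idx].sum() / n.sum()           # local mean rate
        s2 = np.sum(n * (r - m) ** 2) / n.sum()  # population-weighted variance of rates
        a = max(s2 - m / n.mean(), 0.0)          # estimated between-area variance
        denom = a + m / pop[i]
        c = a / denom if denom > 0 else 0.0      # shrinkage weight, between 0 and 1
        smoothed[i] = m + c * (raw[i] - m)
    return smoothed

# Hypothetical toy data: four sectors, the first with zero cases.
cases = [0, 5, 4, 6]
pop = [100, 120, 90, 110]
neighbors = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
rates_per_1000 = 1000 * local_empirical_bayes(cases, pop, neighbors)
```

In this sketch, a sector with zero cases but affected neighbors receives a small positive rate, which is the behavior the smoothing is meant to provide for sparsely populated sectors.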
To verify the presence of clusters, i.e., areas with their own spatial dynamics that deserve a detailed analysis, Moran's Global Index was used.
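As a rough illustration of what the global index measures, the sketch below computes Moran's I with row-standardized first-order weights on a hypothetical chain of six areas. The helper names and the data are illustrative assumptions; the permutation test normally used to obtain a p-value is not shown.

```python
import numpy as np

def row_standardized_weights(neighbors):
    """Build a weights matrix in which each row sums to 1 (every area needs a neighbour)."""
    n = len(neighbors)
    w = np.zeros((n, n))
    for i, nbrs in enumerate(neighbors):
        w[i, nbrs] = 1.0 / len(nbrs)
    return w

def morans_i(values, w):
    """Global Moran's I: (n / S0) * z'Wz / z'z, with z the deviations from the mean."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

# Hypothetical chain of six areas, each adjacent to the previous and next one.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
w = row_standardized_weights(chain)
clustered = morans_i([1, 1, 1, 5, 5, 5], w)    # similar values side by side
alternating = morans_i([1, 5, 1, 5, 1, 5], w)  # neighbours always differ
```

Positive values indicate that similar rates tend to be adjacent, and negative values indicate that adjacent areas tend to differ.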
Because this study worked with several areas, local indicators of spatial association (LISA) were used based on the neighborhood matrix generated with first order neighbors. This indicator identifies significant spatial distribution patterns and represents a decomposition of the global index [19].
LISA classified the census sectors as a function of the significance level of their local index values as high/high and low/low, which indicate that they have a positive association, i.e., the location has neighbors with close values; and high/low and low/high, which indicate a negative association, i.e., the location has neighbors with different values.
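The quadrant classification can be sketched as follows, assuming row-standardized first-order weights and hypothetical data. Each area is labeled by the sign of its own deviation from the mean and the sign of the average deviation of its neighbors; the significance filtering applied in the study is omitted here.

```python
import numpy as np

def lisa_quadrants(values, w):
    """Return the local Moran statistic and the quadrant label of every area."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    lag = w @ z                                # mean deviation among neighbours
    local_i = z * lag / (z @ z / len(z))       # local Moran statistic per area
    own = np.where(z >= 0, "high", "low")
    nbr = np.where(lag >= 0, "high", "low")
    quadrants = [f"{a}/{b}" for a, b in zip(own, nbr)]
    return local_i, quadrants

# Hypothetical chain of six areas with row-standardized first-order weights.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
w = np.zeros((6, 6))
for i, nbrs in enumerate(chain):
    w[i, nbrs] = 1.0 / len(nbrs)

local_i, quadrants = lisa_quadrants([1, 2, 1, 6, 7, 6], w)
```

With row-standardized weights, the mean of the local statistics equals the global Moran's I, which is the sense in which LISA decomposes the global index.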
After the spatial autocorrelation was confirmed, the Spearman correlation matrix was constructed. From a statistical and epidemiological standpoint, independent variables that were significantly correlated with the dependent variable at the 5% level and were not collinear with one another, i.e., with a correlation < 0.7, were used in the classical and spatial regression analyses.
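A minimal sketch of this screening step is given below. It assumes a large-sample normal approximation for the Spearman p-value and a greedy rule that drops a candidate when it is collinear with a variable already kept; neither detail is stated in the text, and the function names and data are hypothetical.

```python
import math
import numpy as np

def average_ranks(a):
    """Rank values from 1..n, giving tied values their average rank."""
    _, inv, counts = np.unique(a, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)                   # highest ordinal rank in each tie group
    return (last - (counts - 1) / 2.0)[inv]

def spearman(x, y):
    """Spearman's rho with an approximate two-sided p-value (normal approximation)."""
    rho = float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
    n = len(x)
    t = rho * math.sqrt((n - 2) / max(1.0 - rho ** 2, 1e-12))
    p = math.erfc(abs(t) / math.sqrt(2))
    return rho, p

def screen(y, candidates, alpha=0.05, max_rho=0.7):
    """Keep variables related to y that are not collinear with ones already kept."""
    kept = []
    for name, x in candidates.items():
        if spearman(x, y)[1] >= alpha:
            continue                           # not significantly related to y
        if any(abs(spearman(x, candidates[k])[0]) >= max_rho for k in kept):
            continue                           # collinear with a kept variable
        kept.append(name)
    return kept

# Hypothetical data: "a" and "b" both track y (and each other); "c" does not.
y = np.arange(50.0)
candidates = {"a": y ** 3, "b": np.exp(y / 10), "c": (-1.0) ** np.arange(50)}
selected = screen(y, candidates)
```

The order in which candidates are examined decides which member of a collinear pair survives, so in practice that choice would rest on the epidemiological criterion mentioned above rather than on dictionary order.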
Multivariate linear regression was applied using Ordinary Least Squares (OLS) estimation. Using the backward method and an epidemiological criterion, variables that better described the occurrence of the disease and were significant at the 5% level were sought.
To incorporate spatial effects, models with global spatial effects were applied. These models treat the spatial structure globally, i.e., they suppose that the spatial correlation structure can be captured in a single parameter, which is added to the traditional regression model [20]. To determine which model would best fit the studied variables, two alternatives were applied.
The first model used was the mixed autoregressive spatial model (Spatial Lag model), which attributes the ignored spatial autocorrelation to the response variable $Y$. Spatial dependence is considered by adding to the regression model a new term in the form of a spatial relationship for the dependent variable: $Y = \rho W Y + X\beta + \varepsilon$, in which $W$ is the spatial proximity matrix; $WY$ expresses the spatial dependence in $Y$; and $\rho$ is the autoregressive spatial coefficient [16]. In this model, spatial autocorrelation is incorporated as a component of the model.
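To make the structure of the Spatial Lag model concrete, the sketch below simulates data from its standard form, Y = ρWY + Xβ + ε, by solving the reduced form (I − ρW)Y = Xβ + ε. The weights matrix, the coefficient values and the covariate are illustrative assumptions, not estimates from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain of six areas with row-standardized first-order weights.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
n = len(chain)
W = np.zeros((n, n))
for i, nbrs in enumerate(chain):
    W[i, nbrs] = 1.0 / len(nbrs)

rho = 0.5                                  # illustrative autoregressive coefficient
beta = np.array([1.0, -0.3])               # illustrative intercept and slope
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=0.1, size=n)

# Reduced form: (I - rho W) Y = X beta + eps, solved for Y.
Y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# Y now satisfies the structural equation Y = rho W Y + X beta + eps.
residual = Y - (rho * W @ Y + X @ beta + eps)
```

Because WY appears on the right-hand side, each area's outcome depends on its neighbors' outcomes, which is why OLS applied directly to this equation is generally not appropriate and maximum likelihood estimation is commonly used instead.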
The other model was the Spatial Error Model, which considers spatial effects as noise, i.e., as a factor to be removed. This model assumes that it is not possible to model all of the characteristics of a geographic unit that may influence neighbor areas. The effects of spatial autocorrelation are associated with the error term $\varepsilon$, and the model is described by $Y = X\beta + \varepsilon$, $\varepsilon = \lambda W \varepsilon + \xi$, in which $W\varepsilon$ is the error component with spatial effects; $\lambda$ is the autoregressive coefficient; and $\xi$ is the error component with constant variance and no correlation [19].
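The following short derivation, a sketch that assumes the matrix I − λW is invertible, shows how the Spatial Error model in its standard form can be rewritten so that only the uncorrelated component ξ remains as the error term:

```latex
\begin{aligned}
Y &= X\beta + \varepsilon, \qquad \varepsilon = \lambda W \varepsilon + \xi \\
(I - \lambda W)\,\varepsilon &= \xi \\
Y &= X\beta + (I - \lambda W)^{-1}\xi \\
(I - \lambda W)\,Y &= (I - \lambda W)\,X\beta + \xi
\end{aligned}
```

The last line is the spatially filtered regression: once both sides are filtered by I − λW, the remaining error ξ has constant variance and no spatial correlation.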
When evaluating which model would better fit the available variables, the model with the highest Log-Likelihood and the lowest Akaike Information Criterion (AIC) and Schwarz Bayesian Criterion (SBC) [10,20,21] was chosen.
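For reference, the two information criteria can be computed from the Log-Likelihood with their usual definitions, AIC = −2·LL + 2k and SBC = −2·LL + k·ln(n). In the sketch below the Log-Likelihood is the value reported for the Spatial Lag model, while the number of parameters k and the number of observations n are assumptions made only for illustration.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion for a model with k estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def sbc(log_likelihood, k, n):
    """Schwarz Bayesian Criterion for k parameters and n observations."""
    return -2.0 * log_likelihood + k * math.log(n)

# Log-Likelihood reported for the Spatial Lag model; k and n are hypothetical.
ll_lag = -9228.39
k = 6          # assumed: intercept, four covariates and the spatial coefficient
n = 10_000     # assumed number of census sectors, for illustration only
aic_lag = aic(ll_lag, k)
sbc_lag = sbc(ll_lag, k, n)
```

Under the assumed k = 6, the AIC comes to about 18468.8, which is in line with the reported value of 18468; how the software counts parameters is not stated in the text, so this agreement should be read as indicative only.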
Residuals in the Spatial Lag and Spatial Error models were analyzed using Moran's index to quantitatively verify whether the spatial autocorrelation was eliminated with the model application.
GeoDa (Arizona State University) was used to generate the models and calculate Moran's Index and LISA.

Results
A total of 11% of cases could not be located during georeferencing.
In the analysis by area (Figure 1), the variation of the incidence rate after the Bayesian smoothing can be analyzed throughout the municipality of Rio de Janeiro. The highest rates appear in areas of Downtown and the surrounding neighborhoods, extending through the neighborhoods of Benfica, Manguinhos, Maré, Penha, Vila da Penha and part of the West zone.
The existence of spatial autocorrelation of the dependent variable at the level of census sectors could be observed through Moran's Global Index (I = 0.402, p = 0.001). The calculation of LISA classified the census sectors as a function of the significance level of their local index values. Figure 2 shows that for every variable studied, there are areas with significant indices. Figure 2A shows at least four large agglomerates of sectors. The first agglomerate (agglomerate 1) is formed by the neighborhoods of Downtown, São Cristóvão, Tijuca, Catumbi and Glória; the second agglomerate (agglomerate 2) is formed by part of the neighborhoods of Penha and Penha Circular, which is almost an extension of the first. The third agglomerate (agglomerate 3) is formed by Barra da Tijuca and Recreio, and the last agglomerate (agglomerate 4) is formed by portions of the neighborhoods of Pedra de Guaratiba and Barra de Guaratiba. Figure 2B shows that agglomerates 1, 2 and 4 are composed of sectors with a high incidence of TB and that agglomerate 3 comprises a group of sectors with a low incidence rate.
The Spearman correlation matrix results showed that among all independent variables studied, only the average number of people per household (P_ANPH) did not have a significant correlation with the dependent variable. However, some variables had a very strong correlation between them: AFI, P_WSC, P_BSC and P_BRSC. Evaluating the correlations from an epidemiological perspective shows that the variables proportion of illiterate people (P_ILLIT) and proportion of households with garbage collection by a cleaning service (P_H_GC) had a positive correlation and therefore a direction opposite to that expected.
Based on this evaluation, the variables that did not exhibit a significant association with the dependent variable and the variables that exhibited collinearity were excluded from the OLS model. From this first result, models that better described the relationship between these variables were sought, and the chosen final model is shown in Table 1.
With this model, the R² determination coefficient was 0.044, the Log-Likelihood value was -10598.8, the AIC was 21207.5, and the SBC was 21243.7.
The existence of spatial autocorrelation can be observed from the residuals of the classic regression. Moran's index of the residuals was 0.3609 (p < 0.01). Although the residuals have a normal distribution, they are not randomly distributed across the municipality, as shown in Figure 3.
Because of the presence of spatial autocorrelation, the Spatial Lag model was fitted with the same variables as the OLS model. In this model, the R² determination coefficient was 0.3215, the Log-Likelihood (value of the likelihood function logarithm calculated for the estimated values of the coefficients) was -9228.39, the AIC was 18468, and the SBC was 18512.2.
The residuals of this model had a normal distribution, and Moran's global index was -0.0183 (p < 0.001). This low value of Moran's index indicates that the inclusion of the spatial component in the model virtually eliminated the spatial autocorrelation.
For comparison purposes, the Spatial Error model was then fitted with the same variables as the OLS model. The R² determination coefficient was 0.319, the Log-Likelihood was -9270.72, the AIC was 18551.4, and the SBC was 18587.6.
The residuals of this model had a normal distribution, and Moran's global index was -0.0195 (p < 0.001). This low value of Moran's index indicates that the inclusion of the spatial component in this model also eliminated spatial autocorrelation.
Table 2 shows a summary of the indices that evaluate the quality of the models. When spatial autocorrelation was introduced in the models, the results improved for both the Spatial Lag and Spatial Error models. However, among the considered spatial regression methods, the Spatial Lag model provided the best fit to the studied variables, with the highest Log-Likelihood value and the lowest AIC and SBC values.
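The comparison can be restated as a small sketch that applies the selection rule (highest Log-Likelihood, lowest AIC and SBC) to the fit statistics reported in the text; the dictionary layout and variable names are illustrative only.

```python
# Fit statistics as reported in the text for the three models.
models = {
    "OLS":           {"r2": 0.044,  "loglik": -10598.8, "aic": 21207.5, "sbc": 21243.7},
    "Spatial Lag":   {"r2": 0.3215, "loglik": -9228.39, "aic": 18468.0, "sbc": 18512.2},
    "Spatial Error": {"r2": 0.319,  "loglik": -9270.72, "aic": 18551.4, "sbc": 18587.6},
}

# Selection rule: highest Log-Likelihood, lowest AIC and lowest SBC.
best_loglik = max(models, key=lambda m: models[m]["loglik"])
best_aic = min(models, key=lambda m: models[m]["aic"])
best_sbc = min(models, key=lambda m: models[m]["sbc"])
```

All three criteria point to the same model here, so the ranking does not depend on which criterion is given priority.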

Considerations
The option to smooth TB incidence rates using the local Bayesian method was an attempt to minimize possible distortions caused by the variability of crude rates, which depends on the size of the population at risk in the census sectors. However, a disadvantage of this method is its potential to induce spatial autocorrelation. According to Morais Neto et al. [22], this effect may overestimate global and local autocorrelation coefficients in areas with few TB cases, in which the smoothing of census sector rates toward the averages of their neighbors is more pronounced. Although recognizing this possible bias, we chose to use this resource to correct incidence rates for two reasons. First, TB is an airborne disease with a marked influence of socioeconomic conditions. Thus, in a microarea study, when a census sector with a low incidence rate has neighbors with high rates, adjusting the rate of this sector is believed to be more coherent from the epidemiological point of view. The other reason was the large number of zeros when the crude incidence rate of tuberculosis was used. Almost half of the census sectors had no cases; therefore, they had a crude incidence rate of zero. Notably, Moran's index was also calculated with the crude rate, and the value did not differ much from the value calculated using the rate after smoothing by the Local Bayesian method.
The variables P_ILLIT and P_H_GC were correlated in the direction opposite to that expected, but they were kept in the analyses to evaluate how they would behave in the regression models because, in the first exploratory analyses, they were important variables for the explanatory power of the model. In the final model, P_H_GC was excluded because it lost significance, and P_ILLIT was maintained. Information regarding years of study would be very important in the present study, but this information was not collected in the 2010 Census. The information regarding education level was restricted to literate/illiterate people; therefore, the quality of the analyses from census data was reduced because, together with income, this variable is a good estimator of the living conditions of the population. The low value of the OLS determination coefficient may indicate that other variables are related to the incidence of the disease. When comparing the three models, we can notice that those that considered spatial dependence (Spatial Lag and Spatial Error) performed better than the classical model (OLS). This can be explained by the spatial dependence of the dependent variable, demonstrated by the observed value of Moran's Global Index and by the scatter maps of the local Moran index. Another important indication of the improvement of the spatial models compared with the classical model is the marked gain in explanatory power shown by the value of R².
Between the two spatial regression models, the Spatial Lag model exhibited the best values for Log-Likelihood, AIC and SBC.
The analysis of Moran's index values of the spatial models' residuals showed that the models captured the existing spatial dependence, given that the values went from 0.3609 for the classical model to -0.0183 and -0.0195 for the Spatial Lag and Spatial Error models, respectively.
The best fitted model had two variables that were associated with income, P_H_2MW and AI_H, but P_H_2MW was considered to have an indirect relationship because it is a proportion and represents a well-defined part of the population. Considering that the variable P_H_1MW was also inserted in the initial analyses but did not remain in the final model because it was not significant in that model, one can suggest that the variable P_H_2MW is an important cut-off point in the study of TB. Some studies [11,23] indicated a possible association between TB infection and being part of the middle/lower class, particularly when we observe the TB infection associated with HIV and the resistant forms [11]. A study developed in Olinda [23] showed that the incidence rate of TB was not directly associated with the low-income population, particularly when there are conditions that favor the propagation of the disease even in places in which the population has a higher income. This idea is corroborated by the results of the present study, which showed that the neighborhoods of Downtown and its surroundings, extending through the neighborhoods of Benfica, Manguinhos, Maré, Penha, Vila da Penha and part of the West zone (Figure 1), exhibited the highest incidence rates after the Bayesian smoothing and are areas in which the average head of family income is approximately 1,600 BRL.
The three agglomerates formed by 1) part of Downtown, São Cristóvão, Tijuca, Catumbi and Glória neighborhoods; 2) part of the Penha and Penha Circular neighborhoods; and 3) sectors of the Pedra de Guaratiba and Barra de Guaratiba neighborhoods (Figure 2) are composed of sectors with high incidence rates of TB and can be considered risk areas for the transmission of the disease.
The variable P_H_LA remained in the final model. According to the literature, this variable appears inconsistent regarding TB, but a person who lives alone may have a higher risk due to non-adherence to the treatment [11]. In a study conducted in Pelotas, Gonçalves et al. [24] identified that family participation is important for the adherence and permanence of patients in the treatment of TB.
The use of census sectors as a unit of analysis causes the problem of statistical instability of incidence rates because of small populations; however, it also enables working with more homogeneous populations than neighborhoods. In the study of TB in the municipality of Rio de Janeiro, the use of census sectors as a unit of analysis was very important. The analyses of the agglomerates formed by the log variable of the incidence rate after Bayesian smoothing allow the disease occurrence to be studied in greater detail. However, a limitation of the present study is related to the loss of georeferencing from addresses, described in Magalhães et al. [15]. The percentage of non-located cases that were in poor areas could not be determined, but experience has shown that the vast majority of non-located cases occur in favelas. Often, the address provided by the patient is the address of the favela entrance or some reference point within the community. This causes some census sectors to receive more or fewer points than expected.
The loss in the georeferencing of cases described in Magalhães et al. [15] may have removed cases from the most deprived population from the study, and for TB in particular, the effect of this loss is expected to be greater in the most deprived population. In addition, a study on deaths attributed to TB in the state of Rio de Janeiro performed in some hospitals by Selig et al. [25] showed that hospitalization at the moment of notification occurred in 3,495 cases (21%), which shows problems of access, and that only 41.4% of deaths occurred in notified patients, which shows that TB in Rio de Janeiro is underreported and poorly detected. In a different study performed in Rio de Janeiro, Piller [4] evaluated the notifying sources and determined that 26% of cases are still notified in hospitals, when these cases should have been detected and treated early by primary care. This finding reinforces the evidence that the population most affected by the disease is composed of individuals who have difficulties in accessing health services and are therefore more deprived. This population is precisely the one that lives in places that are difficult to locate and thus has a greater probability of being lost in the present study. Therefore, it is necessary to consider the results of the model with the caveat that some variables may have had no association with the TB rate simply because the population with the worst living conditions may have been involuntarily excluded from the analyses. However, in general, the statistical methods applied in the present study proved effective in identifying the spatial patterns of TB and in defining some determinants for the occurrence of the disease.

Collaborations
MAFM Magalhães worked on conception, design, analysis, data interpretation, article writing and critical review. RA Medronho worked on conception, design, analysis and critical review.

Figure 1. Map of the incidence rate of tuberculosis after the Bayesian smoothing by census sector.

Figure 2. Map of the local indicator of spatial autocorrelation (LISA) for the dependent variable. (A) Areas with significant values; (B) LISA scattering.

Figure 3. Map of the local indicator of spatial autocorrelation (LISA) for the residuals of the OLS model. (A) Areas with significant values; (B) LISA scattering.

Table 1. OLS model for the log of the incidence rate after Bayesian smoothing.
1 Proportion of heads of family with an income greater than 1 minimum wage and less than 2 minimum wages; 2 Proportion of illiterate people; 3 Proportion of households with people who live alone; 4 Average income of the head of family.

Table 2. R², Log-Likelihood, Akaike Information Criterion (AIC) and Schwarz Bayesian Criterion (SBC) indices for the three studied models.