Geographic weighted regression: applicability to epidemiological studies of leprosy

Introduction: Geographic information systems (GIS) enable public health data to be analyzed in terms of geographical variability and the relationship between risk factors and diseases. This study discusses the application of the geographic weighted regression (GWR) model to health data to improve the understanding of spatially varying social and clinical factors that potentially impact leprosy prevalence. Methods: This ecological study used data from leprosy case records from 1998-2006, aggregated by neighborhood in the Duque de Caxias municipality in the State of Rio de Janeiro, Brazil. In the GWR model, the associations between the log of the leprosy detection rate and social and clinical factors were analyzed. Results: Maps of the estimated coefficients by neighborhood confirmed the heterogeneous spatial relationships between the leprosy detection rates and the predictors. The proportion of households with piped water was associated with higher detection rates, mainly in the northeast of the municipality. Indeterminate forms were strongly associated with higher detection rates in the south, where access to health services was more established. Conclusions: GWR proved a useful tool for epidemiological analysis of leprosy in a local area, such as Duque de Caxias. Epidemiological analysis using the maps of the GWR model offered the advantage of visualizing the problem in sub-regions and identifying any spatial dependence in the local study area.

Geographic information systems (GIS) enable public health data to be analyzed in terms of their geographical variability and the relationship between risk factors and diseases. As a result, GIS have proven to be powerful tools for analyzing the prevention and control of infectious diseases, such as malaria, tuberculosis, and human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) (1).
More recently, GIS have been used in the spatial analysis of leprosy, although not widely. Applications range from the use of cluster identification to represent local interventions to program planning and control based on the spatial distribution of associated risk factors (2) (3) (4) (5) (6). Because leprosy is a chronic infectious disease with a transmission cycle that is not yet fully understood, implementation of efficient control is complex; maintenance of transmission might be related to regional differences not only at national levels but also at subnational levels (7). Analysis using spatial data considers spatial autocorrelation, and inclusion of spatial structure in models changes their explanatory power. The relationship between variables can be better explored when the analysis is local, yielding more detailed results and leading to a better understanding of the process. The importance of a heterogeneous data distribution resulting from differences in culture, habits, social dynamics, socioeconomic conditions, and other risk factors reinforces the need for more regionalized spatial analyses (8).
Incorporating the spatial structure of data into statistical models is expected to afford more reliable estimations of the effects of the covariates and better predictive models. The inclusion of a spatial component in a statistical model ensures that the outcomes observed at closely neighboring points/areas will be adjusted more by the predicted values at close proximity than by those of more distant points/areas, because closer points/areas tend to have similar socio-demographic and environmental characteristics and therefore similar responses (9).
In traditional geostatistical models, a spatial structure is considered in the model on the basis of the error component, assuming a correlation function that decays with the distance between two points. The final goal of this kind of analysis, or spatial interpolation (e.g., Kriging), is to predict the response at any point in the spatial domain.

METHODS
By contrast, GWR assumes that the spatial structure enters through the mean component. Specifically, it assumes that the coefficients β0, β1, …, βp associated with the explanatory variables vary smoothly in space. Pitombo et al. compared GWR models with traditional geostatistical models, concluding that one advantage of the GWR method is that results can be viewed and spatial patterns identified for the different influences of each covariate along the studied surface (9) (10). Therefore, unlike a linear regression model, which assigns statistical significance to each coefficient in the model, a GWR model is analyzed from thematic maps that describe the spatial variability of each of the coefficients.
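For reference, the model described in this paragraph is conventionally written as below. The notation is an assumption taken from the general GWR literature, since the text does not state the equation: $(u_i, v_i)$ are the coordinates of location $i$ (here, a neighborhood centroid), $y_i$ is the response, $x_{ik}$ is the $k$-th covariate, and $W_i$ is a diagonal matrix of kernel weights that decrease with distance from location $i$. The Gaussian kernel with bandwidth $h$ is shown only as one common choice; the kernel actually used in this study is not reported.

```latex
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i ,
\qquad
\hat{\boldsymbol{\beta}}(u_i, v_i) = \left( X^{\top} W_i X \right)^{-1} X^{\top} W_i \, \mathbf{y} ,
\qquad
W_i = \operatorname{diag}\!\left( w_{i1}, \dots, w_{in} \right), \quad
w_{ij} = \exp\!\left( -\tfrac{1}{2} \left( d_{ij} / h \right)^{2} \right)
```

When every weight equals one, the estimator reduces to ordinary least squares, which is why the global linear model can be seen as a special case of GWR.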
These spatial characteristics have implications for inference from the fitted model; if not considered, they can lead to inefficient parameter estimates and spatially dependent residuals. Moreover, this technique allows more complex evaluation of spatial dependence, considering multivariate relationships through mapping of estimated parameters by creating a surface over the study area (11).
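How a surface of local parameters arises can be illustrated with a minimal sketch. It assumes a single covariate, a Gaussian kernel, and a fixed bandwidth; none of these choices comes from the study, and the function names and toy data are hypothetical. At each location the regression is refitted with weights that favor nearby observations, giving one intercept and one slope per location.

```python
import math

def gaussian_weights(coords, i, bandwidth):
    """Gaussian kernel weights of every location relative to location i."""
    xi, yi = coords[i]
    return [math.exp(-0.5 * (math.hypot(x - xi, y - yi) / bandwidth) ** 2)
            for x, y in coords]

def local_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def gwr_single_covariate(coords, x, y, bandwidth):
    """One (intercept, slope) pair per location: the values one would map."""
    return [local_fit(x, y, gaussian_weights(coords, i, bandwidth))
            for i in range(len(coords))]

# Toy data invented for illustration: the slope is 1 in the west, 3 in the east.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]
local_betas = gwr_single_covariate(coords, x, y, bandwidth=1.5)
```

In this toy case a single global regression would report one compromise slope, whereas the local fits recover a slope near 1 in the western cluster and near 3 in the eastern one.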
Although this technique has been developed for processes observed at fixed locations or points, approximations can be made by considering the centroids of polygons as fixed points and the observed process as some measurement taken within these polygons (12).
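As an illustration of the centroid approximation, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula. It assumes planar (projected) coordinates and a non-self-intersecting boundary; the function name is hypothetical, and a GIS would normally provide this operation directly.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(vertices)
    for k in range(n):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)

# A 4 x 2 rectangle standing in for a neighborhood boundary.
rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
center = polygon_centroid(rectangle)
```

For strongly concave neighborhoods the centroid can fall outside the polygon, which is one reason the point approximation loses precision.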
Whilst the first article regarding GWR was published in 1998, it only recently began to be used in epidemiology. However, there have been few discussions of the advantages and disadvantages of GWR in the study of specific diseases, such as leprosy (12) (13) (14) (15) (16) (17).
This study discusses the application of GWR to health data, specifically for the epidemiology of leprosy, by evaluating the heterogeneity of the data distribution in a small municipality of Rio de Janeiro, Brazil, as well as the contributions and limitations of GWR. We hypothesized that GWR can be a powerful tool for understanding the dynamics of leprosy, showing the spatial heterogeneity of the different predictors of this disease.

Study area
This ecological study used as its unit of analysis the neighborhoods of Duque de Caxias, a municipality located in the metropolitan region of Rio de Janeiro State, southeast Brazil. Its territory is divided into 40 neighborhoods grouped into four districts. The 1st and 2nd districts are more urbanized, the 3rd district displays features of rural-urban transition, and the 4th district is predominantly rural. The 2010 population was estimated at 855,042 (18). In the analysis, the Sarapui and Gramacho neighborhoods were merged, because Sarapui was considered an outlier, and municipal disease control measures were implemented only in Gramacho, which influenced the data from Sarapui. Therefore, the analysis consisted of 39 neighborhoods.
Until 2004, the Duque de Caxias municipality ranked second in the number of cases throughout the State of Rio de Janeiro. The case detection rates of the municipality and of the Southeast region of Brazil were 3.90 in 2004-2006 and 9.76 in 2007, respectively. Although the prevalence of leprosy has declined, it remains endemic in the municipality, and Brazil is among the four countries in the world where leprosy remains a public health problem (6) (19).

Data and analysis
Data for new cases of leprosy reported in 1998-2006 among residents in the municipality were extracted from the database of the National System for Notifiable Diseases (SINAN). In addition, socioeconomic data, population, and the digital set of boundaries in the municipality were drawn from the 2000 census and health data from the Duque de Caxias Municipal Health Secretariat. Data were aggregated by neighborhood in a GIS (18).
To support the assumption of normality for the data, the transformed rate, $\log(y_i / n_i)$, was used, where $y_i$ represents the number of new cases detected and $n_i$ the at-risk population assumed for the centroid of neighborhood $i$.
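A minimal sketch of this transformation is given below. The function name and the optional `per` scaling are assumptions added for illustration; rescaling the rate (for example per 10,000 inhabitants) only shifts the regression intercept. The text does not say how neighborhoods with zero cases were handled, so the sketch simply rejects them.

```python
import math

def log_detection_rate(new_cases, population, per=1.0):
    """Log of the detection rate for one neighborhood.

    new_cases  -- number of new cases detected in the period
    population -- at-risk population of the neighborhood
    per        -- optional scale, e.g. 10_000 for a rate per 10,000 inhabitants
    """
    if new_cases <= 0 or population <= 0:
        raise ValueError("log rate is undefined for zero cases or population")
    return math.log(new_cases / population * per)

# Hypothetical neighborhood: 50 new cases among 20,000 residents.
raw = log_detection_rate(50, 20_000)
scaled = log_detection_rate(50, 20_000, per=10_000)
```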
The covariates used in the analysis were divided into two subgroups. 1) Covariates relating to clinical-epidemiological factors: sex ratio, ratio of cases with multibacillary to cases with paucibacillary operational classifications, ratio of cases with grade II disability to the sum of cases with grades 0 and I, and ratio of cases with an indeterminate clinical form to the sum of cases with tuberculoid, dimorphic, and lepromatous clinical forms. 2) Covariates relating to socioeconomic and service factors: number of referral health facilities with a dermatologist offering care under the Leprosy Programme, number of primary health care facilities with the Family Health Programme offering care under the Leprosy Programme, number of local case-tracking campaigns, proportion of uneducated heads of household, proportion of heads of household with income <1 minimum wage (84 USD), proportion of heads of household with no income; population density; proportion of households with running water, proportion of households with running water in at least one room, proportion of households with piped sewerage; proportion of households with no toilet, proportion of households with ≥7 residents, proportion of households with waste disposal in a vacant lot; proportion of households with a septic tank, and proportion of households with an open ditch sewer.
First, linear regression models were analyzed to identify the best model; then, that model was analyzed using GWR. The steps were as follows: 1) selection of covariates in the univariate linear regression models with p-values <0.20; 2) backward elimination based on the lowest Akaike information criterion (AIC) from multivariate linear modeling using all variables from step 1; 3) GWR analysis of the final fitted multivariate linear model; and 4) comparison of the linear and GWR models using the criteria of smallest AIC and largest R². In addition, the Moran I index was calculated for the residuals from the linear models and the final GWR model to test whether the residuals were spatially clustered (significance set at p < 0.05).
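The residual check with the Moran I index can be sketched as follows. The spatial weights matrix is an assumption: the text does not state how neighbors were defined, so the example uses simple binary contiguity between four hypothetical areas arranged in a row. The sketch returns only the index; its p-value would additionally require a permutation or normal-approximation test, which is omitted here.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Binary contiguity for four areas in a row: 0-1, 1-2, 2-3 are neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 1.0, -1.0, -1.0]    # similar residuals sit side by side
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbors always differ

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)
```

Values well above the null expectation of -1/(n-1) indicate clustered residuals, which would suggest that the model has left spatial structure unexplained.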

DISCUSSION
Rev Soc Bras Med Trop 49(1):74-82, Jan-Feb, 2016

From the GWR model, maps were constructed using the estimated local βs for each explanatory variable in the model, the residuals of the model, the intercept, and the predicted values. Only neighborhoods with statistically significant β coefficients were included in the maps. p-values were calculated for each estimated local β, and maps were constructed for each covariate in the model, revealing the significant areas (20).
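One way to obtain a p-value for each local β is sketched below. This is an assumption, not the procedure documented in the text: it divides each local estimate by its local standard error and uses a two-sided normal approximation, whereas a t-distribution with the effective degrees of freedom of the GWR fit may be more appropriate for only 39 neighborhoods. The function names and example values are hypothetical.

```python
import math

def local_p_value(beta, std_error):
    """Two-sided p-value for one local coefficient (normal approximation)."""
    z = beta / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def significant_areas(betas, std_errors, alpha=0.05):
    """Indices of areas whose local coefficient has p < alpha."""
    return [i for i, (b, se) in enumerate(zip(betas, std_errors))
            if local_p_value(b, se) < alpha]

# Hypothetical local estimates for three neighborhoods.
betas = [2.0, 0.5, -3.0]
std_errors = [1.0, 1.0, 1.0]
to_map = significant_areas(betas, std_errors)
```

Only the areas returned by `significant_areas` would then be shaded on the coefficient map for that covariate.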
The maps were constructed and analyses were performed using ArcGIS version 9.3 software (Esri, Redlands, CA, USA), and the p-values for the estimated local βs were calculated using Excel version 2010 (Microsoft, Redmond, WA, USA).
The Duque de Caxias Municipal Health Department authorized the use of both the secondary data from the municipal SINAN database and the data of the strategic actions to combat leprosy in the municipality.

Ethical considerations
The study was approved by the Ethics Committee of the National School of Public Health (Number 237/10).
During 1998-2006, there were 2,572 new cases of leprosy, for a detection rate of 3.61/10,000 inhabitants in Duque de Caxias Municipality.
Seven covariates had p-values <0.20 in the simple linear models (Table 1): proportion of households with running water; number of primary health care facilities with the Family Health Programme offering care under the Leprosy Programme; proportion of uneducated heads of household; proportion of households with ≥7 residents; proportion of heads of household with no income; ratio of cases with an indeterminate clinical form to the sum of cases with tuberculoid, dimorphic, and lepromatous clinical forms; and number of local case-tracking campaigns. Four of these had p-values <0.05.
The best-fitting multivariate linear model included four covariates: proportion of households with running water, number of primary health care facilities with the Family Health Programme offering care under the Leprosy Programme, proportion of households with ≥7 residents, and the ratio of cases with an indeterminate clinical form to the sum of cases with tuberculoid, dimorphic, and lepromatous clinical forms (Table 1). The spatial autocorrelation results for the residuals from the final linear model were not significant, based on Moran's I (p = 0.785163).
The GWR analysis, using the same variables as in the final linear model, resulted in a larger AIC (9.507029) than the linear model but a very similar adjusted R² (0.362976). Spatial autocorrelation of the GWR residuals was also not statistically significant, based on Moran's I (p = 0.431453).
The mapping of the estimated parameters for each neighborhood and for all variables in the model can be seen in Figure 1 and Figure 2. For each covariate, there were 39 local values of β, standard error, and estimated p-value, one for each neighborhood. The proportion of households with running water was a protective factor for the new leprosy case detection rate, an effect that was strongest in the northeast of the municipality and decreased towards the south (Figure 1A). The ratio of cases with an indeterminate clinical form to the sum of cases with tuberculoid, dimorphic, and lepromatous clinical forms had the largest positive association, with higher values in the south of the municipality and decreasing values towards the northeast (Figure 1B). The contribution of the number of referral health facilities with a dermatologist offering care under the Leprosy Programme to the explanation of new cases was greater in the north (Figure 1C). The proportion of households with ≥7 residents (Figure 1D) explained new case detection more in the northwest, decreasing towards the south.
Figures 2A, 2B, and 2C show the estimated parameters with significant p-values. The proportion of households with ≥7 residents was not illustrated, because the estimated parameters were not statistically significant in any of the 39 neighborhoods.
Visual comparison of the observed and predicted value maps revealed slight smoothing of the data that highlighted two areas with higher disease detection rates, in the north and south of the municipality (Figure 3A and 3B). The smallest predicted values were in the central region. There was a small concentration of residuals in the south (Figure 3C).
According to our findings, GWR was an efficient tool for an epidemiological study of leprosy in local areas, showing the spatial heterogeneity of disease dynamics. Regarding the epidemiological discussion of the results, the maps described the estimated parameters included in the GWR model. Interesting interpretations can be drawn in view of the social, demographic, epidemiological, and health service characteristics specific to the region and the period.
The results showed that having a higher proportion of households with running water protected against new cases of leprosy to different degrees across the municipality, with the greatest protection in the northeast of the municipality.
In addition, the ratio of cases with an indeterminate clinical form to the sum of cases with tuberculoid, dimorphic, and lepromatous clinical forms was positively associated with a higher detection rate, particularly in the south of the municipality. Because the indeterminate form of leprosy is an early form, the explanatory power of this covariate is related indirectly to the ability of the municipal health service to track new cases (21) (22) (23). The municipal service took actions to combat the disease, through local case-tracking campaigns and decentralization of patient care, and intensified these actions during the study period (1998-2006), as recommended by the Ministry of Health. Prior to this period, the service was centralized in a single unit in the 1st district in the south of the municipality (24) (25), and patients living in other districts, mainly in the north and northeast regions, had difficulty accessing treatment because of the large distances. Since the services were decentralized to new health facilities in the four districts, expanded access to health care has improved new case detection.

In the south region, where patient care has been available for a longer period, hidden prevalence is less likely. Accordingly, early case detection has contributed more to the overall detection rates. However, in the north and northeast regions of the municipality, where health services were only recently decentralized, hidden prevalence remains high because of the long period with limited access, and the bulk of detection was restricted to older cases and later clinical forms. Similar to the relationship between the indeterminate clinical form and municipal efforts to control the disease, the results for the number of referral health facilities with a dermatologist offering care under the Leprosy Programme also support the former interpretation. There was a greater contribution of this variable to the number of new cases in the north and northeast of the municipality. Decentralization of patient care positively contributed to higher detection rates, particularly in areas where decentralization had occurred more recently (24). This reflects the impact of actions to identify hidden cases and reduce their effect on sustaining the disease transmission cycle.
Interestingly, the model included a covariate relating to health facilities with a specialist in dermatology. The covariate relating to primary health care facilities without dermatologists did not have the same impact on increasing the detection rates of leprosy. Dermatologists more accurately detect the cases that are most atypical or have fewer symptoms and that may often go unnoticed. They also conduct regular screening in the outpatient dermatology clinic setting, where there is passive demand from new cases, often when the patient requires care for other skin problems. The dermatologist's main role in leprosy control, especially post-elimination, is to provide technical support for generalists at primary health care facilities (21).
The fourth covariate, households with ≥7 residents, followed a north-to-south track across the municipality, where it contributed most strongly to greater detection of the disease. A recent study using the same sample identified this north-south zone, which showed greater local autocorrelation in leprosy detection rates using LISA map and box map methods (6). Owing to closer contact, greater household population density is related to a higher risk of transmission (26) (27). Importantly, the parameters estimated for the number of household members were not statistically significant. However, these results might simply be pointing to a larger endemic area within the municipality, a spatial area with a specific context of disease dynamics.
Advantages and limitations were observed regarding the applicability of the GWR model for our leprosy data. We used aggregate data for areas as if they were observed point values; the loss of precision might have reduced the model's explanatory power. Nonetheless, our results identified the most critical areas for which more detailed studies of the disease dynamics are needed. In addition, the findings may guide efforts to combat the disease in smaller geographic areas, thus decreasing costs and enhancing effects, as well as indicating the diversity of strategic actions appropriate to each region.
Furthermore, the use of a Poisson model in the GWR was difficult for the present data, despite the appropriateness of this model for the count data (number of cases). Although this tool is implemented in the original GWR software, there are technical barriers to using it, and the functions implemented in the GWR package of the R software were unclear. The ArcGIS software has a user-friendly tool for analyzing linear GWR models only. As the intent was to test the feasibility and applicability of GWR to health services, we decided to use the log of the detection rate to approximate the data for use in linear models.
Although the results of the AIC indicated that the ordinary least squares model was better than the GWR, the epidemiological analysis using the maps of the GWR model offered the advantage of visualizing the data in sub-regions and identifying any spatial dependence. This technique enables the use of local spatial statistics to assess spatial patterns of data association. It also allows more complex analysis of spatial correlations with not just one, but multiple variables.
GWR was also relevant in the exploratory analysis of the dynamics of leprosy in Duque de Caxias. Our findings may yield early indications of the beginning of changes in the dynamics of the health-disease process, which might be influenced by recent strategic actions to combat leprosy.
Future analysis should be more detailed and consider cases in less aggregated spatial units, for example census tracts, or even at the exact location of the event, to better capture possible spatial associations with factors associated with the endemic. Studies should also use data independent of municipal boundaries, perhaps taking into account neighboring regions with similar socioeconomic and demographic characteristics.
It should be remembered that leprosy is a chronic disease with a long incubation period, which makes it harder to detect spatial associations, given also the possibility of asymptomatic infected individuals participating in the disease transmission chain. Nonetheless, important additional information can result from considering temporal and spatial variability. For such analysis, considerably longer study periods may be necessary to surmount the disease complexity.

FIGURE 3 - Parameters of the geographic weighted regression model for leprosy and the covariates in Duque de Caxias, Rio de Janeiro, Brazil. A) Observed values (OSERVED); B) predicted values (PREDICTOR); C) residuals (RESIDUE).

TABLE 1 - Univariate and multivariate ordinary least squares (OLS) regression models and geographic weighted regression (GWR) model, using new cases of leprosy among residents in Duque de Caxias, 1998-2006.
*For each beta, standard error, and estimated p-value, there are 39 local values for each neighborhood. AIC: Akaike information criterion. Univariate OLS model (p-value <0.