Using epidemiological survey data to infer geographic distributions of leishmaniasis vector species

An important aspect of tropical medicine is analysis of geographic aspects of risk of disease transmission, which for lack of detailed public health data must often be reduced to an understanding of the distributions of critical species such as vectors and reservoirs. We examine the applicability of a new technique, ecological niche modeling, to the challenge of understanding distributions of such species based on municipalities in the State of São Paulo in which a group of 5 Lutzomyia sandfly species have been recorded. The technique, when tested based on independent occurrence data, yielded highly significant predictions of species’ distributions; minimum sample sizes for effective predictions were around 40 municipalities.


A. Townsend Peterson 1, Ricardo Scachetti Pereira 2 and Vera Fonseca de Camargo Neves 3

Great efforts are expended to understand the distributions of animal species relevant to disease systems, particularly vectors and reservoirs. For cutaneous leishmaniasis, although reservoirs remain poorly known 24, much energy has been dedicated to documenting the distributions of vector species, sandflies in the genus Lutzomyia 4 6 9, which are related to risk of disease transmission 5. These efforts accumulate lists of sites or municipalities from which vector species are known, in order to provide an idea of the geographic distributions of important species.
These maps of known occurrences of species, however, present a biased picture of species' geographic distributions, mixing the ecological needs and biogeography of a species with the geography of sampling of the species 10 16. In this sense, the known distribution of a species provides a view of its distribution that is at best incomplete, if not actually misleading. A critical step towards improving this picture is one of inference into unsampled and undersampled areas; this inference can be achieved via models of the ecological niche of species of interest 22.
In this contribution, we explore the potential of ecological niche modeling techniques for interpolating into unsampled areas to understand vector species' geographic distributions. We use multiple subsamples of available distributional points to approach the question of how much sampling is needed to assemble a good distributional understanding for a vector species 29. In broader terms, we present the application of a method that can be generally useful in characterizing geographic distributions of vector and reservoir species based on incomplete or imprecise existing data.

MATERIAL AND METHODS
Input occurrence data. Ecological niche models were based on 366 unique occurrence records for the 5 most dominant Lutzomyia species in São Paulo state, Brazil, with overall sample sizes ranging from 40 to 112 points per species. Distributional data for these species (Lutzomyia fischeri, L. intermedia, L. migonei, L. pessoai, and L. whitmani) were drawn from previous, intensive sampling in municipalities across the state 6. All occurrence points were georeferenced to the centroids of the municipalities, both perforce, given how the data were collected, and to mimic the many similar data sets available in comparable situations. To provide independent data sets for model building (input data) and model testing (extrinsic test data), and to assess sample size needs for modeling these species in this region, we randomly selected municipalities to create input training data sets representing 10%, 30%, 50%, 70%, and 90% of available points; the remaining points were used for testing model quality.
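The random partition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function name `split_occurrences`, the seed, and the placeholder municipality names are assumptions; only the five training fractions and the sample size of 112 (the largest per-species sample mentioned) come from the text.

```python
import random

def split_occurrences(municipalities, train_fraction, seed=0):
    """Randomly split unique municipality records into training (input)
    and extrinsic test sets. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    shuffled = list(municipalities)
    rng.shuffle(shuffled)
    n_train = max(1, round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder records for a species known from 112 municipalities.
records = [f"municipality_{i:03d}" for i in range(112)]

# One split per training data density used in the text.
splits = {f: split_occurrences(records, f, seed=42)
          for f in (0.1, 0.3, 0.5, 0.7, 0.9)}

for fraction, (train, test) in splits.items():
    print(f"{fraction:.0%} training: {len(train)} train, {len(test)} test")
```

Because every point not drawn for training goes to the test set, the smallest training sets are paired with the largest test sets and vice versa, which matters when interpreting significance tests at the extremes.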
Ecological niche modeling. Ecological niches were modeled using the Genetic Algorithm for Rule-set Prediction (GARP) 25 26 27, a machine-learning software package now available for public download (http://www.beta.lifemapper.org/desktopgarp/). In general, the procedure focuses on modeling ecological niches (the conjunction of ecological conditions within which a species is able to maintain populations without immigration) 8. Specifically, GARP relates ecological characteristics of known occurrence points to those of points randomly sampled from the rest of the study region, seeking to develop a series of decision rules that best summarize those factors associated with the species' presence 25 26 27.
Within GARP, input data are further divided randomly and evenly into training and intrinsic testing data sets. GARP works in an iterative process of rule selection, evaluation, testing, and incorporation or rejection: a method is chosen from a set of possibilities (e.g., logistic regression, bioclimatic rules), applied to the training data, and a rule is developed or evolved 25 26 27. Predictive accuracy is then evaluated based on 1250 points resampled from the test data and 1250 points sampled randomly from the study region as a whole. Rules may evolve by a number of means that mimic DNA evolution: point mutations, deletions, crossing over, etc. The change in predictive accuracy from one iteration to the next is used to evaluate whether a particular rule should be incorporated into the model, and the algorithm runs for either 1000 iterations or until convergence.
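The evaluation step just described can be illustrated with a toy example. The sketch below is not GARP's implementation: the single temperature variable, the envelope rule, the treatment of background points as pseudo-absences, and all function names are assumptions made for illustration. Only the 1250 + 1250 evaluation sample and the accept-if-accuracy-improves logic follow the text.

```python
import random

def rule_accuracy(rule, test_presences, background, n=1250, seed=0):
    """Score a rule on n presence points resampled (with replacement) from
    the intrinsic test data plus n points drawn at random from the study
    region. Background points are treated as pseudo-absences (assumption)."""
    rng = random.Random(seed)
    presence_sample = rng.choices(test_presences, k=n)
    background_sample = rng.choices(background, k=n)
    correct = sum(rule(p) for p in presence_sample)
    correct += sum(not rule(b) for b in background_sample)
    return correct / (2 * n)

def envelope_rule(low, high):
    """Hypothetical bioclimatic-style rule: present if temperature is in range."""
    return lambda point: low <= point["temp"] <= high

# Toy data: presences cluster at 20-24 degrees, background spans 10-30 degrees.
rng = random.Random(1)
test_presences = [{"temp": rng.uniform(20, 24)} for _ in range(30)]
background = [{"temp": rng.uniform(10, 30)} for _ in range(500)]

current = envelope_rule(10, 30)    # predicts presence everywhere
candidate = envelope_rule(19, 25)  # a "mutated", narrower envelope

acc_current = rule_accuracy(current, test_presences, background)
acc_candidate = rule_accuracy(candidate, test_presences, background)

# Incorporate the candidate only if predictive accuracy improves.
accepted = candidate if acc_candidate > acc_current else current
print(acc_current, acc_candidate)
```

A rule that predicts presence everywhere gets every presence point right and every background point wrong, so it scores exactly 0.5; any rule that excludes unsuitable conditions without losing presences scores higher and would be retained.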
All modeling in this study was carried out on a desktop implementation of GARP that offers much-improved flexibility in choice of predictive environmental/ecological GIS data coverages. In this case, we used 15 data layers for an area consisting of all of São Paulo state, Brazil. These layers summarized aspects of topography [elevation, slope, aspect, flow accumulation, flow direction, and topographic index (tendency to pool water), from the U.S. Geological Survey's Hydro-1K data set (http://edcdaac.usgs.gov/gtopo30/hydro/)]; aspects of climate, including daily temperature range, mean annual precipitation, maximum, minimum, and mean annual temperatures, vapor pressure, and wet days [annual means over the period 1960-1990, from the Intergovernmental Panel on Climate Change (http://www.ipcc.ch/)]; and aspects of land use and land cover, including an overall land cover classification and a tree cover map [based on AVHRR satellite imagery for 1992-1993, University of Maryland Global Land Cover Facility (http://glcf.umiacs.umd.edu/index.shtml)]. GARP's predictive abilities have been tested and proven under diverse circumstances 1 2 7 12 13 14 15 17 18 19 21 22 23 28 29.
We developed multiple replicate models of each species' ecological niche. Unlike previous applications, which either used single models to predict species' distributions 12 13 or summed multiple models to incorporate model-to-model variation 23, we used a new procedure 3 for choosing best subsets of models. The procedure is based on the observations that 1) models vary in quality, 2) variation among models involves an inverse relationship between error of omission (leaving out true distributional area) and commission (including areas not actually inhabited), and 3) best models (as judged by experts blind to error statistics) are clustered in a region of minimum omission of independent test points and moderate area predicted (an axis related directly to commission error). The position of each model in the cloud of points defined by the two error axes provides an assessment of its relative accuracy. To choose best subsets of models, we 1) produced replicate models until we had 20 models with omission error of <5% based on independent intrinsic test points, 2) calculated the median area predicted present among these minimum-omission models, 3) identified the 10 models closest to the overall median area predicted, and 4) summed these 'best subsets' models.
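The four selection steps can be written out as a short Python sketch. The thresholds (omission <5%, 20 low-omission models, 10 best models) are taken from the text; everything else, including the dictionary layout of a model, the `replicate_models` generator that stands in for GARP runs, and the toy grid of 50 cells, is hypothetical and intended only to make the logic concrete.

```python
import random
from statistics import median

def best_subset(models, max_omission=0.05, n_low_omission=20, n_best=10):
    """Select a 'best subset' from a stream of replicate models.

    Each model is a dict with 'omission' (proportion of intrinsic test
    points missed), 'area' (proportion of cells predicted present), and
    'grid' (list of 0/1 predictions per cell). Layout is an assumption.
    """
    # Step 1: keep producing models until enough low-omission ones exist.
    low_omission = []
    for model in models:
        if model["omission"] < max_omission:
            low_omission.append(model)
        if len(low_omission) == n_low_omission:
            break
    if len(low_omission) < n_low_omission:
        raise ValueError("not enough low-omission models were produced")
    # Step 2: median area predicted present among those models.
    median_area = median(m["area"] for m in low_omission)
    # Step 3: the models closest to that median.
    best = sorted(low_omission, key=lambda m: abs(m["area"] - median_area))[:n_best]
    # Step 4: sum the best models cell by cell (values range 0..n_best).
    summed = [sum(cells) for cells in zip(*(m["grid"] for m in best))]
    return best, summed

def replicate_models(rng, n_cells=50):
    """Hypothetical stand-in for repeated GARP runs: random toy models."""
    while True:
        grid = [int(rng.random() < 0.4) for _ in range(n_cells)]
        yield {"omission": rng.uniform(0.0, 0.15),
               "area": sum(grid) / n_cells,
               "grid": grid}

best, summed = best_subset(replicate_models(random.Random(7)))
print(len(best), max(summed))
```

In the summed grid, a cell's value is the number of best-subset models predicting presence there, so areas of full agreement are the cells whose value equals the number of models summed.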
Projection of the rule-sets for these models back onto geography provided distributional predictions for each species. We tested model quality via the independent extrinsic test sets of occurrence points in two ways: one using all available test data points, to permit best estimation of levels of omission error, and the other using a further random subsample reduced to 10% of the total occurrence points available for the species, to avoid differences in statistical power owing to different sampling densities. χ² tests were used to compare observed success in predicting distributions of test points with that expected under random models (the proportional area predicted present provides an estimate of the proportion of occurrence points that would be correctly predicted were the prediction random with respect to the distribution of the test points). Predicted presence was conservatively defined as the area in which all best-subsets models agreed in predicting presence.
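As a worked illustration of this test, the sketch below compares observed and expected numbers of correctly predicted test points using a one-degree-of-freedom χ² statistic. The formulation as a two-category goodness-of-fit test is our reading of the description, and the numbers in the example (50 test points, 35 predicted correctly, 40% of the area predicted present) are invented for illustration, not results from the study.

```python
import math

def chi_square_vs_random(n_test, n_correct, proportion_area_present):
    """Chi-square test of observed prediction success against a random model.

    Under a random prediction, the expected number of correctly predicted
    test points is n_test times the proportion of area predicted present.
    """
    expected_correct = n_test * proportion_area_present
    expected_missed = n_test - expected_correct
    observed_missed = n_test - n_correct
    chi2 = ((n_correct - expected_correct) ** 2 / expected_correct
            + (observed_missed - expected_missed) ** 2 / expected_missed)
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical example values (not from the study).
chi2, p_value = chi_square_vs_random(n_test=50, n_correct=35,
                                     proportion_area_present=0.4)
print(f"chi-square = {chi2:.2f}, P = {p_value:.2g}")
```

With these invented values, 20 correct predictions are expected by chance and 35 are observed, giving χ² = 15²/20 + 15²/30 = 18.75. The same proportional success with only five test points would be far from significant, which is why small test sets have low power to detect good models.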

RESULTS AND DISCUSSION
Ecological niche models and predictions of geographic distributions of species predictably improved in their performance as training sample size increased. For example, in models for Lutzomyia fischeri, 1) omission error was quite high at 10% training data density, moderate at 30% training data density, and lower thereafter; and 2) statistical significance was unevenly related to sample size and omission error (Figure 1). Results were similar across all species, with low model significance at the smallest training sample sizes; interestingly, model significance was also lower at the largest training sample sizes (Table 1 and Figure 1).

Revista da Sociedade Brasileira de Medicina Tropical 37: 10-14, jan-fev, 2004

Inspecting levels of omission error across different data densities for each species, trends are remarkably coincident (Figures 2 and 3). At the smallest training sample sizes (<10 points), all models for all species showed high omission error rates (80-100%). In each case, at intermediate sample sizes (40-50 points), omission rates reached minima, and were relatively constant or slightly higher thereafter (Figure 3). Overall, model predictivity was maximal at intermediate sample sizes, when both training and test data sample sizes are substantial, and neither is small. When training sample sizes are too small, model parameters are not estimated accurately. On the other hand, when testing sample sizes are small, statistical power is too low to detect good models. These results coincide with the results of previous tests of the effects of sample size on model predictivity 13 29; in the present case, sample sizes considered adequate (40-50 unique points) were somewhat higher than in previous studies, probably owing to the imprecise georeferencing involved. More generally, the unreliability and unpredictable behavior of significance tests regarding model predictions coincides with the results of previous comparative tests 17.
The asymptotes of the curves of omission error versus training sample size are relatively high; that is, even the best models are still plagued by error rates of 30-40%. This seemingly poor performance results from the use of municipality centroids for georeferencing occurrence records: a sizeable proportion of centroids may fall in areas of predicted absence, even though some portion of the municipality is predicted present. Hence, given this imprecise level of georeferencing, a certain base level of omission error is to be expected.
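The centroid effect can be made concrete with a toy grid. The sketch below is purely illustrative: the 5 x 6 grid, the shapes of the predicted area and of the municipality, and the helper `centroid` are all assumptions, chosen so that the municipality overlaps the predicted area while its centroid does not.

```python
from statistics import mean

# Toy raster: cells are (row, column) pairs. Presence is predicted in the
# three westernmost columns; the hypothetical municipality spans columns 1-5.
predicted_present = {(r, c) for r in range(5) for c in range(3)}
municipality = {(r, c) for r in range(5) for c in range(1, 6)}

def centroid(cells):
    """Cell nearest the mean row and column of a set of cells."""
    return (round(mean(r for r, _ in cells)),
            round(mean(c for _, c in cells)))

centroid_cell = centroid(municipality)
centroid_predicted = centroid_cell in predicted_present
overlap_fraction = len(municipality & predicted_present) / len(municipality)

print(centroid_cell, centroid_predicted, overlap_fraction)
```

Here 40% of the municipality lies within the predicted area, yet the record georeferenced to the centroid would be scored as an omission. Any test based on centroids therefore carries some omission error that does not reflect model failure.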

On a much more positive note, this study demonstrates that moderate sampling densities, at sample sizes that likely characterize many epidemiological surveys of vector or reservoir distributions, are sufficient to produce excellent summaries of species' geographic distributions. That is, even with moderate sample sizes, it is possible to interpolate into unsampled or poorly sampled areas, and produce reliable and predictive maps of species' geographic distributions. This capacity permits development of geographic predictions for poorly known species important in understanding the geography of disease systems, which have important implications for human health issues 11.

Figure 2 - Comparison of models for Lutzomyia species in São Paulo State based on 10% of points for training models versus 90% of points for training models.

Figure 3 - Summary of model quality as a function of training point sample size for Lutzomyia species in São Paulo State. Model quality is measured as proportional omission of independent test points.

Table 1 - Summary of results of predictions for five Lutzomyia species across five data densities (10-90% of available occurrence data for training).