Combining multiple models to predict the geographical distribution of the Baru tree ( Dipteryx alata Vogel ) in the Brazilian Cerrado

The Brazilian Cerrado is a biome of great biodiversity, but detailed information about the diversity and distribution of species in this region is still insufficient for both testing ecological hypotheses and for conservation purposes. Among native plants in the Cerrado, Dipteryx alata Vogel (commonly known as the “Baru” tree), has a high potential for exploitation. The aims of this paper were to predict the potential spatial distribution of D. alata in the Brazilian Cerrado utilising five different niche modelling techniques. These techniques usually provide distinct results, so it may be difficult to choose amongst them. To adjust for this uncertainty, we employ an ensemble forecasting approach to predict the spatial distribution of the Baru tree. We accumulated a total of 448 occurrence points and modelled the subsequent predicted occurrences using seven climatic variables. Five different presence-only ecological niche modelling techniques (GARP, Maxent, BIOCLIM, Mahalanobis Distance and Euclidean Distance) were used and the performance of these models was compared using Receiver Operating Characteristics (ROC) and the Area Under the Curve (AUC). All models presented AUC values higher than 0.68, and GARP presented the highest AUC value, whereas Euclidean Distance presented the lowest. The ensemble forecasting approach suggested a high suitability for the occurrence of the Baru tree in the Central-Western region of the Brazilian Cerrado. Our study demonstrated that modelling species distribution using ensemble forecasting can be an important computational tool for better establishing sampling strategies and for improving our biodiversity knowledge to better identify priority areas for conservation. For the Baru tree, we recommend priority actions for conservation in the central region of the Cerrado Biome.


Introduction
Most tropical plants species are still poorly characterised with respect to their geographical distribution, a problem referred to as "Wallace shortfall" (Whittaker et al., 2005;Bini et al., 2006).Geographical distribution modelling has become an important tool in filling this gap and in furthering ecology, conservation and evolution studies.Specific applications include the study of the expansion of invasive species (Carroll et al., 2001;Peterson et al., 2007;Zhu et al., 2007;Reino et al., 2009), the effects of climatic changes on biodiversity (Heikkinen et al., 2006) and the establishment of conservation plans for threatened species (Solano and Feria, 2007;Marage et al., 2008).
Modelling the geographic distribution of a species is usually done by the spatial projection of functions describing the ecological niche of that species.These processes relate the known occurrences of a species and relevant environmental data to produce predicted geographic occurrences (Guisan and Zimmermann, 2000).There are several techniques that can be employed for niche modelling, such as Generalised Linear Models (McCullagh and Nelder, 1989), Generalised Additive Models (Hastie and Tibshirani, 1990), Principal Component Analysis (Robertson et al., 2003), Artificial Neural Networks (Mastrorillo et al., 1997), Maximum Entropy (Phillips et al., 2006) and others (see Elith et al., 2006;Reino et al., 2009).Many recent papers discuss these statistical techniques, how to evaluate the models they generate, and how to detect which model is more robust or accurate (Segurado and Araújo, 2004;Elith et al., 2006).However, Araújo and New, 2007, suggested that using ensemble forecasts -combining different models and parameters and using techniques to explore the resultant variety of potential distributions of a species -can be more informative than interpreting each method independently (see Marmion et al., 2009 for a recent statistical evaluation of ensembling techniques).
The Brazilian Cerrado is a region that has been highlighted as one of the 25 large biogeographical regions that requires priority conservation approaches, known as biodiversity "hotspots" (Myers et al., 2000).Nevertheless, information on the diversity and distribution of individual species within this biome is still insufficient for both understanding the ecological processes underlying diversity patterns and for establishing conservation policies.Among native plants in the Cerrado, Dipteryx alata Vogel (commonly known as the "Baru" tree) has a high potential for exploitation.This native fruit plant of the family Fabaceae is widely distributed but endemic to the Cerrado biome.It is found primarily in dense savanna formations and in dry forests (Ribeiro et al., 2004;Sano et al., 2004).This species has many potential human uses including food, medicinal applications and as wood for building (Almeida et al., 1998).Moreover, it is considered a key species for sustaining the natural fauna in the Cerrado as its fruit feeds birds, mammals (bats), rodents and monkeys, especially during the dry season when other food sources are scarce (Sano et al., 2004).
The goals of this paper were to predict the potential spatial distribution of Dipteryx alata in the Brazilian Cerrado biome using five different presence-only ecological niche modelling techniques: Genetic Algorithm for Rule-set Prediction -GARP, Maximum Entropy -Maxent, Bioclimatic envelope -BIOCLIM, Mahalanobis Distance and Euclidean Distance, and to compare their performances using ROC curve verification.We employed ensemble forecasting to produce a map using all of the models simultaneously as the results of this method have been proposed to generate a better and more conservative overall solution (Araújo and New, 2007).We also evaluated the similarity of the maps generated by the five different techniques using cluster analysis and produced a unique prediction of species occurrence based on an ensemble map.

Data
The geographical coordinates available for the locations of Dipteryx alata were obtained from the literature (e.g., Soares et al., 2008), field sampling and online databases at the Centro de Referência em Informação Ambiental (CRIA: http://www.cria.org.br/).We accumulated a total of 448 geographically unique occurrence points, representing the most complete database for the geographic distribution of this species (Figure 1).The Baru tree is endemic to the Cerrado biome (Ribeiro et al., 2004), which occupies approximately 2.5 million km 2 in the Central Region of Brazil.Moreover, this area includes a high heterogeneity of vegetation types, soil, geology, geomorphology and climate (Silva et al., 2006).Some climatic variables are strongly correlated.To account for this collinearity, we obtained values for the environmental variables at the occurrence points of the Baru tree and performed a Principal Component Analysis (PCA) on this matrix (Legendre, P. and Legendre, L., 1998).In the PCA, the first two components explained 72.23% of the total variability.A subset of climatic variables were then selected based on their loadings in these two principal components, allowing to define a set of six climatic variables for niche modelling that were not strongly correlated across the distribution of the Baru tree: mean annual temperature, temperature seasonality (coefficient of variation), mean temperature in the wettest quarter, mean temperature in the driest quarter, precipitation in the coldest quarter, precipitation in the warmest quarter, as well as one topographic variable (altitude) derived from the WORDCLIM database (http://www.worldclim.org/).All of the selected variables were converted to a grid resolution of 0.0417 degrees (nearly 4 km).

Species distribution modeling methods
We used five different techniques to model the potential geographic distribution of the Baru tree, based on presence-only data (i.e., only occurrences are known, absences are unknown).The specific models tested included Bioclimatic Envelope (BIOCLIM), Euclidean Distance, Mahalanobis Distance, GARP and Maxent.All of these techniques are frequently used to model species distributions, and especially Maxent and GARP have been used in many recent studies (see Peterson et al., 2007).All of the models were projected onto the contour map of the Brazilian Cerrado (Almeida et al., 1998).We analysed only the continuous regions of the Cerrado found in central Brazil and did not consider isolated patches that are found in the Northern and Northwestern parts of the Amazon.A brief description of each modelling technique and the related key references are provided below.
Bioclimatic Envelope (BIOCLIM) -This technique is based on the production of envelopes that contain presence points in a multidimensional environmental space.For this model we used DIVA-GIS software (Hijmans et al., 2002).The "Bioclim" option was used for identifying the suitability of areas for the Baru tree.Cells for which one or more climate variables were outside of the 0-100 percentile envelope obtained a code of "0".The cells within the 5-95 percentile were coded as "3"; those external to this range but within the 2.5-97.5 percentile were coded as "2"; and the cells external to this range, but within the 0-100 percentile for all climate variables, were coded as "1".The program then generated categories of species suitability, varying from "not suitable" to "most favourable".
Euclidean Distance -This technique is structured on the existence of an optimum ecological point, defined by the generation of a centroid for all occurrence points in the total ecological space.The distance between this "optimum" and the observed values for each cell in the environmental grid within the geographical study area is inverse to the suitability of the environment at that site.Smaller distances indicate more similarity among regions, and a higher probability of a species being present.The Euclidean Distance produces a circular envelope around the optimum within the ecological space.
Mahalanobis Distance (Typicality probabilities) -This method is similar to Euclidean Distance, but the Mahalanobis distance produces an ellipsoidal envelope around the optimum ecological space by taking into account the covariances among environmental variables.The distance of a given locality to the centroid in the environmental space is given by (Equation 1): where D 2 is the Mahalanobis distance, x is the vector of environmental measures at a location, m is the vector of the mean environmental measures for all known instances of the Baru tree, and C -1 is the inverse of the variance/ covariance matrix among environmental measures for all known occurrences of the species.
Maxent (Maximum Entropy Method) -This recent technique provides better results than other programs, especially for small datasets (Pearson et al., 2006;Elith et al., 2006).The Maxent technique produces predictions and inferences from incomplete information (see Phillips et al., 2006), and is based on the principle of estimating a "target" probability distribution by finding the probability distribution of maximum entropy (i.e., the most spread out or the closest to uniform distribution), subjected to a set of constraints that represent the incomplete information about the "target" distribution.The parameters used in the Maxent model in this paper were 1000 iterations and a convergence threshold value equal to 10 -5 (the program default), as well as removing all duplicates.
GARP (Genetic Algorithm for Rule Set Production) -This is a widely used algorithm based on artificial intelligence that works by combining rule sets for the purpose of generating the most accurate prediction in the considered region (Stockwell and Noble, 1992).The rules represent a group of possible multivariate relationships between the species occurrence points and environmental variables.These include bioclimatic rules, atomic rules and logistic regression (Stockwell and Peters, 1999).We adopted the following optimization parameters: 200 runs, a 0.001 convergence limit and a maximum of 2000 iterations.To generate the best subset output we indicated 0% for the omission threshold and 10% of the commission threshold.We evaluated 20 models using this subset.These models were then overlaid producing a number of models from the best subset at a given place or region.In both GARP and Maxent, absence data is need for some of the analyses (GARP) or for model testing (Maxent).In these cases, both software packages simulate pseudo-absence data as required (see below).

Evaluation and comparison of models
Model evaluation focused on their predictive performance and included the determination of a minimum threshold of quantitative output for the potential presence of the species.The model sensitivity was defined as the proportion of true presences in relation to total presences predicted by the model.Model specificity was defined as the proportion of true absences in relation to absences predicted by the model.In this manner, a Receiver Operating Characteristics (ROC) curve was obtained by plotting the sensitivity against 1 -specificity for the varying probability threshold values, generating a threshold independent model evaluation method (Manel et al., 2001).Moreover, the area under the ROC curve (AUC) has been used extensively in species distribution modelling, characterising the performance of the model, at all possible thresholds, based on a single value that could be used as an objective approach to comparing different models (Elith et al., 2006;Phillips et al., 2006).The AUC ranges from 0 to 1, where 1 indicates high performance while values < 0.5 indicate low performance (Luoto et al., 2005;Elith et al., 2006).Despite recent criticisms (e.g., Lobo et al., 2008), AUC can still be useful when comparing models from a single species in a similar geographical domain.
To use this analytical machinery without the existence of a dataset that includes true absences, Phillips et al. (2006) generated a sample of 10000 pseudo-absence points as a training sample and to estimate the ROC curve and the AUC values for the Maxent procedure.We repeated the same process with the other model predictions to allow for a proper comparison of these methods.
For the analysis of similarity among the models, quantitative measures from each model output were used.These usually included continuous measures of the species occurrence, e.g., suitability in Maxent and Bioclim, distances in Mahalanobis and Euclidian Distance, and the number of models in GARP.These quantitative measures were truncated (corresponding to the point on the ROC curve that has the shortest distance to the top-left corner) and the minimum threshold was used to generate presenceabsence data (Cantor et al., 1999).Values smaller than the threshold were considered zero, whereas values higher than threshold were considered one.A map of the number of models predicting the occurrence of the Baru tree for each location can then be obtained, providing a better idea of the regions for which predictions are independent of the model characteristics (an unweighted average -see Marmion et al., 2009).Finally, we used Simple Matching coefficients to produce a pairwise similarity matrix between models.This similarity matrix was clustered using an UPGMA algorithm (Legendre, P. and Legendre, L., 1998), allowing a visual evaluation of the similarity among the five different models.

Results
The potential geographical distribution of the Baru tree generated by all of the models suggested that the incidence of this species was higher in the Central-Western region of the Brazilian Cerrado (Figure 2).Among the predictions generated, the GARP model showed the highest occupational area, with approximately 63.99% of the Brazilian Cerrado biome occupied by this species.On the other hand, the model obtained using the Maxent method presented the smallest occupational area, of only 23.15% of the biome (Table 1).
All of the models tested presented AUC values higher than 0.68 (Table 1), with the highest value obtained for the GARP model and the lowest for Euclidean Distance model.Despite the great variation in the AUC values among all models, the analysis of similarity suggested the formation of two different groups of predictions: i) those formed only by GARP and ii) those formed by Mahalanobis Distance, Bioclim, Euclidean Distance and Maxent (within which Maxent and Bioclim were the most similar) (Figure 3).
The analysis of the sum of these models (the unweighted ensemble forecast, Figure 4) produced a wide-ranging model in which 72.25% of the Cerrado is occupied by the Baru tree.Furthermore, the ensemble distribution indicated a higher model overlap (i.e., more accumulated evidence for the occurrence of the species, or more environmental suitability) for the Baru tree in Goiás State in the Central-Western region of the Cerrado.On the other hand, in extreme Northern region of the Brazilian Cerrado, no model suggested the occurrence of the Baru tree.

Ensemble forecasting and distribution modeling
Tools for modelling ecological niches have been widely used to predict the geographic distribution of species (Filipe et al., 2004;Pearson et al., 2007) and can be used to define areas where environmental conditions are appropriate for protection of a species whose habitat is threatened (Araújo et al., 2004;Chefaoui et al., 2005;Martinez et al., 2006).Knowledge of the potential distribution areas of the Baru tree is of extreme importance, even more so when the highly diverse nature of the Cerrado biome is considered (see Myers et al., 2000).
The efficiency of these various niche models has been frequently tested using the AUC, (see Peterson et al., 2007;Lobo et al., 2008).In this work, the GARP and Maxent models provided better results than the other models, according to their AUC values.Despite the high frequency of papers using GARP, recent comparative work has suggested that the Maxent models provide better results (Phillips et al., 2004;Elith et al., 2006;Phillips et al., 2006) and that they are more computationally efficient.However, Nabout et al., (2009) recorded higher AUC values for GARP when compared to Maxent when modelling the potential distributions of American fiddler crabs.Moreover, other papers support the importance of employing different modelling techniques.For Farber and Kadmon (2003), the Mahalanobis Distance method produced more accurate species distribution models than the Bioclimatic Envelope technique.More recently, Tsoar et al. (2007) modelled the distribution of 42 species of land snails, nesting birds, and insectivorous bats using six different modelling techniques and reported that GARP and Mahalanobis Distance obtained the highest accuracy.Johnson and Gillingham (2005) observed different spatial distribution models and showed that models produced by Mahalanobis Distance and GARP techniques were not similar, demonstrating the  In our analyses, the ensemble forecast map showed smaller uncertainty in the predicted distribution of the Baru tree in the central-western region of the Brazilian Cerrado and not in the peripheries.Many ecological and physical conditions can influence the distribution, abundance and occurrence probability of trees.In many cases, higher abundance and occurrence probability occurs at the centre of a species' geographical range and declines toward the peripheries (called the "abundant centre" distribution) (Brown, 1984;Gaston and Blackburn, 2000).The bioclimatic conditions of this biome edge do not favour the incidence of this species (see also Diniz-Filho et al., 2009).The Brazilian Cerrado is also surrounded by other biomes with different environmental characteristics, such as the Pantanal, the Amazonian basin and the Caatinga.Moreover, many studies have used the "abundant centre" distribution as the basis for hypotheses about ecological and evolutionary processes (see revisions from Sagarin and Gaines, 2002), mainly because central and peripheral populations respond in different ways to ecological, genetic and evolutionary processes.
The "ecological niche" is a fundamental principle used in producing models that are designed to predict the occurrence of a species (Elith et al., 2006;Stockwell, 2006).These models reflect the potential niche; however, the absence of data relating to dispersal limitation, species interactions (i.e., interspecific competition) or historical factors (Araújo and Guisan, 2006) can generate overestimated geographical distributions.Based on this, Soberón (2007) suggested separating niches into two principal classes: i) the Grinnellian niche, which considers the environmental conditions at which a species can live, defined fundamentally by non-interactive variables (scenopoetic), and ii) the Eltonian niche, obtained by considering the biotic interactions and resource-consumer dynamics.The distribution models adopted in this study are based on the Grinnellian niche concept and can produce wider geographic ranges than are observed in nature.The Eltonian niche, although clearly important, is much more difficult to measure and model.Nevertheless, it is important to highlight that including biological interactions into species distribution modelling is an important research frontier (Kearney, 2006;Morin et al., 2007;Araújo and Luoto, 2007).

Implications for ecology and conservation
Populations of the Baru tree can form large clusters in any continuous area within the Brazilian Cerrado.In a survey of woody species within this biome, the Baru tree was found in 86 of 316 sampled regions (Ribeiro et al., 2004), demonstrating the wide incidence of this species.This range of incidence can be associated with conditions of higher soil fertility (Almeida et al., 1998).The Baru tree is hermaphroditic and is pollinated by insects (Thum and Costa, 1999).The dispersal of its fruits can occur by gravity (short distance dispersal) and by animals, mainly by bats that are long distance dispersers (Ribeiro et al., 2004;Sano et al., 2004).
Our ensemble forecasting model produced more correspondence between the expected distribution of this species and the observed sampled distribution than when each method was viewed separately, in which case wider distributions were generated.Our models can be utilised to suggest sampling areas for further study of the botanical characteristics and the genetics of the Baru tree.More importantly, they can be used to indicate conservation regions -an urgent issue considering the current rapid increase of human land use in the Brazilian Cerrado.
The Brazilian Cerrado has been under pressure from agriculture (mainly soybean) and pastureland since the 1960's (Ratter et al., 1997), and recently by the advance of sugarcane cultivation (Sawyer, 2008).In general, the Central-Western and Central-Southern regions of this biome have been encroached by agricultural advancement and pastureland development (Rangel et al., 2007).These regions have also been transformed into sugarcane culture for the production of biofuels, which have had significant impacts on the ecology of the Cerrado.It is important to note that, according to our modelling, these regions present the highest suitability for the Baru tree.Several environmental impacts in these regions due to habitat fragmentation and losses are expected to affect these populations by disturbing dispersal and gene flow, influencing population density and genetic diversity and occasionally causing local extinctions.In addition to the local and regional effects of agricultural activities, global change is the other leading factor that can impact the diversity and distribution of this species (Sawyer, 2008;Tundisi and Matsumura-Tundisi, 2008).In light of this, it is very important to better design the conservation strategies for the Baru tree and other endemic plants within the Cerrado, especially at broad geographical scales.
The "Workshop on Priority Areas for Conservation in the Cerrado and Pantanal Region" identified 87 areas for conservation (see Brasil, 1999;Cavalcanti and Joly, 2002).The ensemble model for the Baru tree predicted its occurrence in many of these areas (mainly within the Araguaia valley and the Rio das Mortes wetland).However, in practice, the Federal and State reserve systems in the Cerrado region are insufficient both in extent and representation -Only 1.6% of this area is under protection (Cavalcanti and Joly, 2002).Despite the small number of protected regions (reserves), the Baru tree has been maintained locally by small landowners and communities (located mainly in the interior of the Goiás State), which derive their livelihood from commercialised almond production.Adopting better  conservation plans for this biome will be important to the preservation of ecological and evolutionary processes that are maintaining Baru tree populations.
Finally, our study demonstrated that ecological niche modelling can be an important computational tool, and that using an ensemble forecasting approach produced a more robust and more theoretically consistent model.Ensemble forecasting, beyond producing a more coherent model when compared with the observed species occurrences, revealed that the Baru tree presumably exhibits higher environmental suitability in Central-Western region of the Cerrado.Since this is the region with the most intense human occupation, we recommend priority action for successful Baru tree conservation, especially in areas such as the Araguaia valley.Moreover, we encourage others scientists to use these models for defining sampling strategies for different types of ecological, botanical and genetic studies.We also recommend that a similar strategy of producing ensemble forecasting models be adopted for other plants within the Cerrado Biome.

Figure 1 .
Figure 1.Location of 448 occurrence data-points for Dipteyx alata used in the analyses.

Figure 2 .
Figure 2. Spatial distribution modelling of Baru tree in Brazilian savanna ("Cerrado"): a) Maxent model; b) GARP Model; c) Mahalanobis distance; d) Euclidean distance and e) Bioclimatic envelope.Increasing darkness of shading indicates higher environmental suitability.

Figure 3 .
Figure 3. UPGMA dendrogram demonstrated the relationship among models of potential spatial distribution of Baru tree in Brazilian savanna.

Figure 4 .
Figure 4. Combination of the five models (ensemble forecasting of maps from Figure 2) of Baru tree in Brazilian savanna.

Table 1 .
Evaluation of five models based on AUC values, and predicted area of each model (in %) in relation to Brazilian Cerrado.