Modeling habitat suitability of the invasive clam Corbicula fluminea in a Neotropical shallow lagoon , southern Brazil

This study aimed to model the habitat suitability for an invasive clam Corbicula fluminea in a coastal shallow lagoon in the southern Neotropical region (–30.22, –50.55). The lagoon (19km2, maximum deep 2.5m) was sampled with an Ekman dredge in an orthogonal matrix comprising 84 points. At each sampling point, were obtained environmental descriptors as depth, organic matter content (OMC), average granulometry (Avgran), and the percentage of sand (Pcsand). Prediction performance of Generalized Linear Models (GLM), Generalized Additive Models (GAM) and Boosted Regression Tree (BRT) were compared. Also, niche overlapping with other native clam species (Castalia martensi, Neocorbicula limosa and Anodontites trapesialis) was examined. A BRT model with 1400 trees was selected as the best model, with cross-validated correlation of 0.82. The relative contributions of predictors were Pcsand-42.6%, OMC-35.8%, Avgran-10.9% and Depth-10.8%. Were identified that C. fluminea occur mainly in sandy sediments with few organic matter, in shallow areas nor by the shore. The PCA showed a wide niche overlap with the native clam species C. martensi, N. limosa and A. trapesialis.


Introduction
Habitat suitability models (HSMs) use environmental variables to predict presence or abundance of a given species in any area of interest, acting as a mathematical tool to depict the multidimensional niche of a species sensu Hutchinson (1957).HSMs are useful in conservation, wildlife management, and environmental impacts evaluation or to predict scenarios of exotic species invasions (Guisan and Thuiller, 2005;Frankling and Miller, 2009).
Introduction of exotic species in freshwater ecosystems threatens biodiversity, change ecosystem natural cycles and cause the extinction on native biota (Lodge et al., 1998).
In the past 30 years the Neotropical region suffered the introduction of at least two mussel species, causing negative environmental and economic impacts (Darrigran, 2002).One of these species is Corbicula fluminea (Müller, 1774).
Corbicula fluminea is an Asiatic edible clam species well known for the invasive success (Cohen et al., 1984;Cataldo and Boltovskoy, 1998).The introduction of C. fluminea in Brazil is reported since 1970 (Veitenheimer-Mendes, 1981), and the species is now widespread in several Brazilian freshwaters basins (Rodrigues et al., 2007).The species has an aggressive invasive behavior, presenting physiological, environmental and behavioral adaptations to live both in lotic (Britton and Morton, 1982;Way et al., 1990) and lentic environments (Cenzano and Würdig, 2006), competing with native clam species (Gardner Junior et al., 1976;Phelps, 1994).
Large colonies of C. fluminea could increase the water transparency by filtration, changing algae and macrophyte production and influencing all the ecosystem dynamics (Phelps, 1994;Sousa et al., 2008).Also, the invasive feature of C. fluminea is enhanced by flotation strategies to disperse, a behavior triggered by the water flowing stimuli (Prezant and Chalermwat, 1984).
According to McMahon (1981), in environments with lentic dynamics, C. fluminea is restricted to shallow and oxygenated margins.Nevertheless, although several environmental drivers for C. fluminea presence or abundance have been described in the literature, studies in order to fit HSMs for the species are still scarce.In this study, we aimed to select a spatial modeling approach capable to identify the ecological relationships of C. fluminea with environmental predictors and to produce a map of habitat suitability.In addition, the niche overlapping with native clam species was also investigated.

Study area
The study was carried out in Araçá Lagoon (southern Brazil;.The lagoon has a surface area of 19 km 2 , maximum deeps of 2.5m, and receives the drainages of the Capivari River in the northern boundary.In the southern limit, the lagoon is drained to the Patos Lagoon, a large coastal system 250km long and 60km wide (Figure 1).When south-southeast winds are blowing the water flow can be inverted, and strong currents flow from the Patos Lagoon.These winds, that modulate current systems, leads to a deltaic formation in both lagoon extremes: the northern delta build from sediments (mud) from the Capivari River, and the southern delta build from sand deposition carried out with strong currents from the Patos Lagoon.
The region has a hot summer with mean temperatures of 29.3 °C in January and winter with temperatures reaching an average of 10.9 °C in June/July.Precipitation is well distributed during the year, ranging from a maximum of 156mm in September to a minimum of 86mm in November.The predominant wind regime is northeast with an average speed of 5m/s, followed by Southwest, with average speeds of 8m/s during the passage of cold fronts (Schwarzbold and Schäfer, 1984).

Sampling methods
The sampling design comprised a set of 84 points distributed all over the Araçá Lagoon in a 500m × 500m Cartesian matrix (Figure 1).Typical position error was around 30m (GPS error plus boat displacement).Samplings were performed from February 2002 to April 2003.Each sampling site was sampled five times with an Ekman dredge (sampling area: 225 cm 2 ), four subsamples for invertebrates and one for sediment analysis.Sampled individuals were washed in sieve with mesh size 0.80 mm, then transferred to labeled plastic bags, kept chilled and transported to the laboratory where they remained frozen (-18 °C) until processing.Other clam species were also sorted and counted.Species' abundance comprised the sum of four subsamples (900 cm 2 ) at each point.
Depth (m) was measured with a manual probe corrected by the annual mean (2001)(2002)(2003) of the nearby Palmares River scale.Sediment sample was transported to the laboratory chilled on ice and stored at -18 °C until analysis.Sediment subsamples were dried and classified through sieves with mesh size of 2000 μm; 1000 μm; 500 μm; 250 μm; 125 μm and 63 μm.The average granulometry (Avgran; μm) was calculated as a weighted average concerning the amount of sediment retained on each sieve.Percentage of sand (Pcsand; %) at each sampling point was estimated by dividing the sediment retained above the 63μm sieve by the total dry sample weight (×100).To determine organic matter content (OMC; %) 50 gr of dried sediment was burned through 6 hours in oven furnace at 550 °C, and the OMC determined by weight difference after the carbon oxidation.

Statistical analysis
In order to select the best modeling method we evaluated the abundance pattern of C. fluminea with three modeling approaches: Generalized Linear Models (GLM), Generalized Additive Models (GAM) and Boosted Regression Trees (BRT).Outliers were not excluded in order to compare the techniques.
We fitted the GLM in R software (R Development Core Team, 2014) with the package "stats" assuming the Poisson distribution.The procedure fitted the GLM models through optimization of maximum likelihood estimated by an iteratively reweighted least-squares mechanism.The GAM fitting was carried out with the functions "mgcv" library (Wood, 2001).As the first trials to fit the GAM model, we explored the predictors' behavior by running the model with a plate regression splines smoothing to check residuals.After analysis, we identified two distinct responses of C. fluminea abundance in relation to Depth.To handle with these responses, we choose the cubic regression splines smoother.For the remaining predictors (OMC, Avgran and Pcsand) the thin plate regression splines smoothing were applied (Wood, 2003).
The BRT method consists of a combination of the two algorithms: regression trees (or decision trees) and boosting.Regression trees were first described by Breiman et al. (1984), followed by De'ath and Fabricius (2000) and Hastie et al. (2005).The regression tree is created by several data splitting, aiming the partition of response into homogeneous groups (De'ath and Fabricius, 2000).
The Boosting process consists in merging results from multiple models or regression trees based on the general principle that finding many rough rules of thumb can be easier than finding a single highly accurate prediction rule (Schapire, 2003).Boosting is a numerical optimization method that aims to minimize the loss function by adding at each step a new tree that best reduces the loss criteria (Elith et al., 2008).The loss function in BRT modeling is a measure that represents the loss in predictive performance due to a suboptimal model.In this way, thousands of tree models are created and the BRT model works as a linear combination of many trees that can be thought as a regression model where each term is a tree (Elith et al., 2008).
The BRT models were evaluated by using a ten-fold cross-validation to detect the optimal number of trees to use in the model and to subsequently assess the predictive performance (Hastie et al., 2005) and choose the model with the best cross-validation result.We compared BRT models by setting a fixed learning rate of 0.01 and exploring two different tree complexities, 1 (BRT1) and 2 (BRT2).After running all models, we used the lower Root Mean Squared Error (RMSE) as criteria to select the best modeling technique among GLM, GAM and BRT.The selected model was used to predict the habitat suitability to the whole lake area.To generate a prediction map we interpolated the predictors obtained at each sampling point: Depth, OMC, Avgran and Pcsand by ordinary Kriging (Figure 2).Since the interpolation introduce new bias in the prediction map, a correcting empirical model (GAM) was fitted with C. fluminea observed densities as the response variable and the estimated densities as predictors.
Aiming to visualize the potential niche overlap among clam species we ran a Principal Components Analysis (PCA) using the abundance of the invasive C. fluminea and the native clam species Castalia martensi Thewing, 1891, Neocorbicula limosa (Maton, 1981) and Anodontites trapesialis (Lamarck, 1819) in relation to the environmental predictors Depth, OMC, Avgran and Pcsand.
We carried out GLM, GAM, BRT and PCA under the RStudio 0.98.501software (RStudio, 2012), an integrated development environment for R software 3.0.3(R Development Core Team, 2014).We ran the GLM with the basic package included in R software.We ran the GAM models with the 'mgcv' library (Wood, 2001).To fit the BRT models we used the 'gbm' library from Ridgeway (2012) and functions proposed by Elith et al. (2008).We ran the PCA using the package "ade4" (Dray and Dufour, 2007).The interpolations were performed with the package GSTAT (Pebesma, 2004) included in the GIS software IDRISI Andes 15.0 (Clark Labs  ).The prediction map and the map processing were performed in RStudio with the functions of package "raster" (Hijmans and Etten, 2012).

Results
The abundance of C. fluminea averaged 2.96 individuals by sampling site (33 ind.m -2 ), ranging from a minimum of 0 to a maximum of 63 (700 ind.m -2 ) and was absent roughly half (38) of the sampling points (84).The calculated RMSE for each method was 6.87 for GLM, 2.99 for GAM, BRT1 for 2.17, and BRT2 for 1.73.Figure 3 shows the dissimilarities between models concerning the residual variances.The GLM model had residual dispersion and was discarded as possible candidate model.GAM achieved better results concerning residual variances when compared with GLM; however, the BRT models with tree complexities of 1 and 2 steps presented the best overall performance.Among them, we choose the boosted regression tree model with tree complexity of 2 (BRT2), which achieved 0.82 in the cross-validation correlation, being chosen to model C. fluminea distribution.
Figure 4 shows the fitted functions to C. fluminea in relation to each environmental predictor of the BRT model with a tree size of 2. Pcsand had an average relative contribution of 42.6%, followed by OMC (35.8%),Avgran (10.9%) and Depth (10.8%).As a general pattern, C. fluminea occurs in sites where the sediment has high values of Pcsand and with low OMC, with increased prevalence in deeps in the range from 1.0 m to ~1.5m. Figure 5 shows the predicted abundance map to the whole lagoon area.According to our results, C. fluminea presented low abundance in the northeast-southwest lagoon axis.Increased abundance match with sandy shores by the eastern and northwestern lake limits (Figures 2, 5).
Figure 6A shows scatterplots of observed abundances of C. fluminea against the fitted values adjusted by the BRT models (R 2 =0.957).Figure 6B shows the observed abundance of C. fluminea against the data values of the prediction map after the correction of a GAM empirical filter (R 2 =0.940) suggesting good model performance even after the bias induced from predictors interpolation to the whole lagoon area.

Discussion
Modeling habitat suitability and distributional patterns are increasing goals in the ecological literature (Frankling and Miller, 2009).Despite the intense effort to map and organize the occurrences of aquatic biota in databases, the challenge is related to the unavailability of detailed environmental layers, especially for aquatic environments in small scales.However, successful examples of inference concerning distributional patterns with relative few predictors are already described for freshwater fish (Alves and Fontoura, 2009;Barradas et al., 2012) by using LOGIT functions, showing that satisfactory predictions could be achieved with relatively few key environmental predictors.
In the present work, the BRT method proved to be efficient to predict the habitat suitability of C. fluminea in the Araçá Lagoon with few environmental variables.The efficiency came from the ability to handle with different types of predictor and extreme outliers.According Elith et al. (2008), the advantage mainly comes from fitting multiple trees that overcome the drawback of one single solution that could have a relatively poor predictive performance.The BRT produces hundreds or thousands of decision trees, and the final solution is not a consensus, but the sum of solutions from each partial tree developed from a subset of data.The BRT method is also flexible because it could use different types of predictors and work with missing values, can handle with outliers and non-linear responses, and do not need data transformation (Moisen and Frescino, 2002;Friedman and Meulman, 2003;Leathwick et al., 2006;De'ath, 2007;Elith et al., 2008;Frankling and Miller, 2009).These attributes make the methodology very robust for prediction based in a specific data set, as for weather  forecast, but imply additional obstacles for comparison purpose as no one can handle, or print, the enormous set of decision trees that comprise the final solution.Also, the BRT models are Machine Learning methods and lack p values, coefficients and degrees of freedom, becoming hard to compare with traditional modeling approaches derived from regression.In our case, we overcame this using a general measure of accuracy, RMSE, to compare the models.
We found differences between the predicted values from the BRT model and the predicted values from the interpolated predictors.The majority of HSM studies use predictors' layers ready to use, extracting the information to fit models from those layers.Here we fitted models with real observations, generating interpolated layers data in order to feed the prediction model.As we detected differences between observed data and the interpolated predictors, resulting in bias concerning the descriptor maps, this was corrected trough a GAM empirical function.In this case, two nested methodologies gave rise to more accurate predictions, resulting in sharp distributional patterns as shown in Figure 6.
Considering the selected environmental predictors in the present study, the percentage of sand (Pcsand) was the predictor with greater influence on C. fluminea distribution (Figure 4) followed by the sedimentary organic matter content (OMC).The OMC had a negative effect on the habitat suitability for C. fluminea, a pattern already described by Britton and Morton (1982), Cohen et al. (1984), Mansur et al. (1994) and Cataldo et al. (2001).Areas with low organic matter content, considering lentic conditions, could be the product of increased hydrological dynamics, with low deposition of fine particles, limnological features that emulate lotic environments.The occurrences of C. fluminea at those locations can be understood as a consequence of their primary life history, associated with lotic environments in Asia.
According to our findings, C. fluminea prefers habitats where sand is predominant and average grain size range between 80-140 μ, in agreement with Cataldo et al. (2001).Shirmer (1996) detected high mortality in muddy habitats, and this information could explain the preference of C. fluminea for sandy environments with low amounts of organic matter.The preference of C. fluminea for sandy habitats was described by Cenzano and Würdig (2006) and Duarte and Diefenbach (1994).The sediment characteristics describe the amount of energy transferred to the bottom: larger sand grains indicate habitats with increased water dynamics, and smaller grains indicate depositional habitats.As described for Anodonta anatina (Englund and Heino, 1996), the species could be favored by flowing water currents due to energy savings related to filtering movements.
The wind is a key factor that determines the shape and the hydrodynamics in coastal laggons (Schwarzbold and Schäfer, 1984), by disturbing the benthic habitats, this influence loses its power as increases the depth.In the central area of the lagoon, with larger deeps and deposition of fine  sediments, the species showed low densities.According to our model, C. fluminea occurrences tend to be higher in depths ranging from 1.0m to ~1.5m, avoiding deeper areas with depositional patterns, where wave energy is not strong enough to suspend clay, silt and organic matter.
We observed a niche overlapping of C. fluminea with native clam species: C. martensi, N. limosa and A. trapesialis.The PCA (Figure 7) showed occurrences in the same parameters range, once C. fluminea occurs in all environmental range conditions occupied by native species, a pattern already described by Lercari and Bergamino (2011).As known as an aggressive invasive species, occurring in high abundances and with efficient dispersion capabilities (Cataldo and Boltovskoy, 1998), this species has dispersion advantages over N. limosa, with no free larval stage and juveniles immediately adopting the benthic phase, what limits dispersion and determines an aggregate spatial distribution (Parodiz and Hennings, 1965).
According to our results, we selected an adequate model to describe the habitat suitability of C. fluminea in a shallow coastal lagoon in southern Brazil.The clam C. fluminea preferred sandy habitats, showing coarser granulometry, and reduced amounts of organic matter in the sediment.Also we described that C. fluminea has a wide range of habitat suitability when compared to the native clam species.In a broad sense, the invasive C. fluminea represents a threat to native clam species in a dimension still not measured in southern coastal lagoons.Unfortunately, historical demographic data of native clams in coastal lagoons of southern Brazil are not widely available, and a feasible abundance reduction of native clams, in the same way that C. fluminea became the dominant clam species, is not documented.

Figure 1 .
Figure 1.Araçá Lagoon, southern Brazil.The black dots indicate the sampling sites.

Figure 3 .
Figure 3. Comparative performance of the GLM, GAM, BRT1 and BRT2 methods to model Corbicula fluminea abundance in Araçá Lagoon, southern Brazil.Box Plot of residuals to each modeling technique (Median; 25-75 percentiles and min-max accepted values).Dots indicate the outliers.

Figure 4 .
Figure 4. Fitted functions to each predictor in the BRT2 model to estimate Corbicula fluminea abundance in Araçá Lagoon, southern Brazil in relation to depth, organic matter content (OMC), percentage of sand (Pcsand) and average granulometry (Avgran).The number in parentheses shows the relative contribution of the predictor.

Figure 6 .
Figure 6.Comparative relationship between observed abundance of Corbicula fluminea in Araçá Lagoon, southern Brazil, and predicted values estimated by (A) BRT model adjusted from measured environmental descriptors and (B) interpolated predictors from ordinary Kriging after empirical GAM correction.

Figure 7 .
Figure 7. Principal components analysis (PCA) considering the most abundant clam species occurrences in relation to environmental predictors at each sampling point.The convex polygons show the range of clam occurrences in relation to predictors used to fit the Boosted Regression Trees (BRT) model.