COMPARISON OF PREDICTOR SELECTION PROCEDURES IN SPECIES DISTRIBUTION MODELING: A CASE STUDY OF Fagus hayatae in a

Selecting predictors for species distribution models (SDMs) is a major challenge. In this study, we evaluated a comprehensive set of 62 environmental predictors that may be related to the occurrence of Fagus hayatae. We modeled F. hayatae as a case study to compare model performance through different environmental predictor subsets according to three selection procedures, namely correlation coefficients between predictors, contribution level of predictors, and expert choice of biologically relevant predictors. The three selection procedures provided satisfactory results with high performance using about 6-10 valid predictors but had their respective limitations. Consequently, we suggest a synthetic strtegy of predictor selection. Accordingly, the first step was identifying and eliminating ineffective variables with nonidentifiability by using bivariate scatterplots. Next, calculate the correlation coefficients between other candidate predictors. Finally, comprehensively select the applicable environmental predictors


INTRODUCTION
In the past three decades, species distribution models (SDMs) have been increasingly used to solve many scientific and managerial problems related to climate change, conservation planning, ecological theory, and invasive species (Guisan and Thuiller, 2005;Austin and Van Niel, 2011;Petitpierre et al., 2017;Zhang et al., 2019). Generally, SDMs are composed of three elements (Sangermano and Eastman, 2012), namely a dependent variable (species occurrence data), explanatory variables (environmental predictors), and an algorithm or function for representing species-environment relationships (modeling methods). Numerous studies have explored the performance of different modeling methods (Guisan et al., 2007) and effects of species sampling sizes and spatial bias (Wisz et al., 2008;Syfert et al., 2013). By contrast, environmental predictors have been less commonly discussed (Franklin, 2010), although the selection of environmental predictors is relevant to SDM performance and its subsequent applications (Williams et al., 2012). Thus, the appropriate selection of predictors for SDMs is a major challenge (Araújo and Guisan, 2006;Franklin, 2010;Watling et al., 2012;Petitpierre et al., 2017).
To the greatest possible extent, data of species occurrence and environmental predictor layers should be collected before modeling species distribution. Environmental predictors can be divided into indirect, direct, and resource gradients (Guisan and Thuiller, 2005). According to the conceptual model of using environmental predictors (Guisan and Zimmermann, 2000;Franklin, 2010), direct and resource variables are preferred for predicting plant distribution. For example, Austin and Van Niel (2011) suggested that light, temperature, nutrients, water, CO 2 levels, disturbance, and biota are the seven groups of variables that control plant distribution. Although attention should be mainly focused on explanatory power and the ecological basis of choosing predictors (Araújo and Guisan, 2006), the use of candidate variables of an SDM depends on the availability of environmental layers. In practice, many environmental predictors used in SDMs are indirect variables or surrogates of direct and resource variables (Franklin, 2010;Austin and Van Niel, 2011); for example, elevation is used as an agency of mountain temperature based on lapse rate .
Problems related to over-parameterization and overfitting of a model may arise with the use of and excessive number of variables in SDMs (Guisan and Zimmermann, 2000;Tyberghein et al., 2012), particularly when highly correlated variables exist. Collinearity or high correlation between variables is a common feature of ecological data (Mac Nally, 2002). Adopting a predictor selection procedure to reduce the number of variables to as few as possible is necessary for two reasons (Dormann, 2011). First, a higher number of variables corresponds to greater correlation among them. Second, a higher number of variables corresponds to a greater likelihood of one of them spuriously contributing to the model (type I error). Although principal component analysis is a method used to deal with collinearity of variables (Dormann et al., 2013), it is difficult to interpret a modeled result based on it. Alternatively, some studies have reported retaining only those predictors that contribute more than 0.5% or 5% to the SDM and discarding the others (Young, 2010;Yang et al., 2013). However, to avoid the collinearity or nonidentifiability of predictors, a common method used is identifying and eliminating redundant variables by using the pairwise Pearson correlation coefficients among all candidate variables. The absolute value of the threshold of the correlation coefficient (|r|) is usually 0.7 (Dormann et al., 2013); however, different researchers may use different threshold values, such as |r| > 0.9 (Randin et al., 2009), |r| > 0.85 (Syfert et al., 2013), |r| > 0.8 (Young, 2010), and |r| > 0.6 (Andreo et al., 2011). In addition to noncorrelated predictors, Watling et al. (2012) used a subset of user-defined biologically relevant environmental predictors in SDMs. Consequently, Williams et al. (2012) suggested that the identification of redundant variables relies on a combination of a priori ecological considerations, knowledge of derivation and accuracy of each variable, awareness of relationships among variables, and a rigorous process of testing the utility of alternative sets of predictors in a statistical model.
The appropriate selection of environmental predictors is a critical step in modeling species distribution (Araújo and Guisan, 2006;Austin and Van Niel, 2011;Watling et al., 2012;Williams et al., 2012). Three common approaches of variable selection are as follows: using all available bioclimatic variables without justification, reducing the number of bioclimatic and biophysical covariates to account for collinearity, and selecting variables on the basis of ecological knowledge (Porfirio et al., 2014). Each SDM has an appropriate predictor selection method associated with it, namely expert knowledge, statistical significance, and iterative selection based on training accuracy (Lippitt et al., 2008). In this study, we compared the performance of models by using different environmental predictor subsets corresponding to three selection procedures, namely correlation coefficients between predictors, contribution levels of predictors, and expert selection of biologically relevant predictors.

Study area and target species occurrence data
Taiwan is a subtropical mountainous island at the periphery of East Asia ( Figure 1A). The study area covered an area in northern Taiwan ( Figure 1B), extending from 121.1925-121.8913°E and 24.4210-24.9024°N. The target species in this modeling study was Fagus hayatae Palib., a relic and stenotopic tree. The distribution area of F. hayatae was mapped in the National Vegetation Diversity Inventory and Mapping Project of Taiwan (Chiou et al., 2009), extending over 1,282 ha and ranging from 1,100 to 2,100 m above sea level and limited to a few ridges nearby ( Figure 1B). We generated regular points at 200-m intervals by using ArcGIS version 10.0 (ESRI; Redlands, USA) with the Geospatial Modeling Environment software version 0.7.2.0 (Beyer, 2012). According to our species occurrence data, 319 F. hayatae points were extracted (presence-only; Figure 1C) and used in all of the SDMs with different environmental predictor subsets. The regular 319 occurrence data prevented the effects of sample size and spatial bias for SDMs (Wisz et al., 2008;Syfert et al., 2013).

Environmental predictors
We evaluated a comprehensive set of 62 environmental predictors that may be related to the occurrence of F. hayatae. The 62 predictors and their abbreviations are listed in Supplementary Table 1. These include monthly mean temperature and precipitation (Chiu et al., 2009), bioclimatic variables (version 1.4;Hijmans et al., 2005), warmth index, coldness index (Kira, 1991), biotemperature, potential evapotranspiration ratio (Holdridge, 1967), summer and winter half-yearly precipitation (Su, 1985), humidity index (Xu, 1985), effective warmth index (Chiu et al., 2012), temperature annual range (Wolfe, 1979), whole light sky space (Lai et al., 2010), elevation, latitude, longitude, and slope, which were produced using a digital elevation model (DEM) provided by the Taiwan Forestry Bureau Aerial Survey Office. Furthermore, by using the Geomorphometry and Gradient Metrics Toolbox (Evans, 2011), data concerning dissection, roughness, compound topographic index, heat load index, topographic radiation aspect index, surface relief ratio, surface curvature index, slope position, and surface/area ratio were calculated from DEM. All environmental predictor layers were generated in ArcGIS at a 40-m spatial resolution in the same geographic extent.
Selection procedure for environmental predictors Before selecting environmental predictors, we eliminated the variables without discrimination power, which were identified through detection using bivariate scatterplots in the preliminary model analysis and through decision-making (Morisette et al., 2013).
In this study, we used the following three procedures with backward elimination to select environmental predictors on the basis of comprehensive recommendations made in different studies, including studies by Lippitt et al. (2008) 1. Selection based on the Pearson correlation coefficients (r) between predictors: Highly correlated variables (following the sequence |r| > 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, and 0.55) were removed, and others were retained as environmental predictors in the SDM. In this procedure, if 61 predictors were retained and used in the SDM, we labeled the predictor subset as R61, and the subset with 32 predictors was labeled as R32; a similar labeling method was adopted for all subsets.
2. Selection based on the training contribution level of predictors: Variables with low contribution percentages (CP) of <0.1%, 0.2%, 0.3%, 1%, 3%, and 9% were sequentially removed, and others were retained as environmental predictors in the SDM. In this procedure, if 61 predictors were retained and used in the SDM, we labeled the predictor subsets as C61, and the subset with 30 predictors was labeled as C30; a similar labeling method was employed for all subsets. 3. Selection based on expert selection of biologically relevant predictors: Five experienced ecologists studied F. hayatae and removed the biologically irrelevant variables; the other variables were retained as environmental predictors in the SDM. In this procedure, if 61 predictors were retained and used in the SDM, we labeled the predictor subsets as E61, and the subset with 31 predictors was labeled as E31; a similar labeling method was adopted for all subsets.

Model creation and evaluation of accuracy
To predict the distribution of F. hayatae, we used a general-purpose machine learning SDM method, Maximum Entropy Modeling of Species Geographic Distributions (MaxEnt; Elith et al., 2011). MaxEnt has been shown to be a robust SDM method for presence-only species data (Guisan et al., 2007), and it is becoming one of the most widely used methods for SDMs. We used MaxEnt 3.3.3 k (http:// www.cs.princeton.edu/~schapire/maxent/) for 20 crossvalidated replicates, 5000 maximum iterations, and logistic output format with other default settings. The logistic output was used as a species-suitable index value or predicted occurrence probability (Elith et al., 2011).
To assess the performance of models using different environmental predictor subsets, we used receiver operating characteristic (ROC) curve analysis and the true skill statistic (TSS) (Allouche et al., 2006). Use of the ROC is a common approach for threshold independent evaluation, where "sensitivity" is plotted against "1 − specificity" for all possible thresholds, thus avoiding the subjective selection of one or several thresholds for the evaluation (Gontier et al., 2010). Following the guidelines from Swets (1988), standard values for the area under the curve (AUC) of the ROC plots were graded as follows for assessing model performance: failed (AUC = 0.5-0.6), poor (AUC = 0.6-0.7), fair (AUC = 0.7-0.8), good (AUC = 0.8-0.9), and excellent (AUC = 0.9-1.0). In addition, TSS is a threshold-dependent evaluation index. We adapted the suggestion provided in a study by Liu et al. (2013) and used the maximized sum of sensitivity and specificity as the threshold to transform the predictive occurrence probability into species presence or absence (1 or 0). For a 2 × 2 confusion matrix, TSS was defined (Allouche et al., 2006) as TSS = (ad -bc)/[(a + c) (b + d)] = sensitivity + specificity -1.

Bivariate scatterplot between all predictors
A bivariate scatterplot is a preliminary inspection that reveals the relationship between paired environmental predictors. Figure 2A shows all scatterplots created in ArcGIS and some examples of the paired predictors. The random relationship between TAR (for abbreviations, refer to Supplementary Table 1) and DIS predictors, where the correlation coefficient was <0.001, is presented in Figure 2B. High correlation, or collinearity, was observed between ELE and Bio1 ( Figure  2C); the correlation coefficient was 0.998. If a model contains collinear predictors, the variance estimates are typically inflated, resulting in biased prediction (Quinn and Keough, 2002;Solomon et al., 2002). The scatterplot of CI vs. TRAI ( Figure 2D) clearly showed no discrimination of CI variable with constant zero values. Although CI variables have often been used to indicate the upper limit of the distribution of evergreen broadleaved forests, their use is inappropriate for a subtropical island such as Taiwan . After examining each scatterplot of all paired environmental predictors, we discarded the CI variable due to its nonidentifiability and retained the other 61 variables as candidate predictors in subsequent processing procedures. Figure  2 also reveals some collinearity among environmental predictors (see Supplementary Table 2). Consequently, creating the scatterplot for each pair of environmental predictors before running the SDM was correct and necessary (Williams et al., 2012;Morisette et al., 2013).
Predictor subsets through backward elimination Table 1 presents the subsets of selected predictors used in different SDMs through backward elimination. These predictor subsets were labeled using three selection procedures; subsets were labeled with "R" for Pearson correlation coefficient (r), "C" for training contribution level, and "E" for expert selection, followed by the number of predictors used in different subsets. Supplementary Table 2 lists all pairwise Pearson correlation coefficients between predictors. Ten predictor subsets were based on Pearson correlation coefficients of predictors used in SDMs; these included R61, R32, R27, R23, R21, R18, R12, R9, R8, and R6. Furthermore, seven predictor subsets based on the training contribution level of predictors used in SDM were available, namely C61, C30, C28, C23, C15, C9, and C6. In addition, seven predictor subsets based on expert selection of biologically relevant predictors were used in SDM, namely E61, E31, E26, E21, E16, E11, and E6. The ratio of the maximum (R61, C61, and E61) to minimum (R6, C6, and E6) number of predictors in different subsets was > 10. Table 2 presents the r between predictors and the CP (%) of individual predictors used in the R6, C6, and E6 models. Only PER and CTI were selected twice, whereas the other predictors were selected only once in the R6, C6, and E6 models. The results revealed that different selection procedures resulted in different predictor subsets used in SDM. Italicization indicates high correlation between T6 and Bio5 (|r| = 0.90) and between Bio16 and longitude of raster (LON) (|r| = 0.93) in the C6 model and between Bio4 and WI (|r| = 0.88) in the E6 model. All predictors in the R6, C6, and E6 models could be divided into following two categories: thermal/moisture-related or topography/geographyrelated variables. The CP of individual predictors from 46.7 to 0.0 exhibited a high level of inconsistency among models such as R6, C6, and E6 (Table 2).
Performance comparison of models on the basis of different predictor subsets Figure 3 presents a comparison of model performance measured using AUC and TSS criteria within 10-R-subsets, 7-C-subsets, and 7-E-subsets. The AUC was excellent (> 0.9) across all 22 models with different predictor subsets (Supplementary Table 3).
Overall, the AUC values (overall mean 0.976) were higher than the TSS (overall mean: 0.939), as shown in Figure 3. Analysis of variance indicated significant differences (P < 0.05) within the 10-R-subsets, 7-C-subsets, and 7-E-subsets for the ROC and TSS criteria. Only minor differences were observed among the models that used 12 or more predictors with respect to both AUC and TSS, namely R61-R12 ( Figure 3A and B), C61-C15 ( Figure 3C and D), and E61-E16 ( Figure  3E and F). The five models, namely R9, R8, R6, E11, and E6, used as predictor subsets ( Figure 3A, B, E and F) had significantly lower performance than the other subsets. Figure 3 also presents a significant reduction in SDM accuracy between R12 and R9, C15 and C9, and E11 and E6. The results suggest that in our case, the appropriate number of predictors was approximately 10. Moreover, satisfactory performance could be achieved using as few as six predictors ( Figure 3C and D). In a review by Porfirio et al. (2014), 119 variables were once used in different SDM studies. The mean annual precipitation and mean annual temperature were the most commonly used variables, observed in 43% and 37% of related studies, respectively. In this study, variables related to temperature and moisture or their integrated index were used in the R6, C6, and E6 models, which again indicated the crucial roles of temperature and water in plant distribution (Austin and Van Niel, 2011).  As indicated in Figures 3A and B, setting the correlation coefficient threshold between variables at 0.7 was appropriate (|r| > 0.7 variables eliminated in R12 with higher accuracy, |r| > 0.65 variables eliminated in R9 with lower accuracy). The |r| = 0.7 threshold was recommended by Dormann et al. (2013) and is often used, although other authors have suggested that |r| can be set at a value between 0.6 and 0.9 (Randin et al., 2009;Young, 2010;Andreo et al., 2011;Syfert et al., 2013).

Predictive maps based on different predictor subsets
The comparison of model performance was conducted with a total of 440 MaxEnt models by using 22 predictor subsets with 20 replicates. The predictive maps of mean occurrence probability using 22 predictor subsets are presented in Supplementary Figure 1. Furthermore, Supplementary Figure 1 reveals that the predicted distribution area expanded when the prediction variables were R6, C6, or E6, which means that some unused variables may have still had predictive power. The overall patterns of 22 predictive maps were consistent, although some slight differences were observed among the maps. Supplementary Table 4 lists the spatial correlation coefficient (r) across 22 predictive maps calculated by pairing grid-based probabilistic values. The correlation coefficients of most pairs were more than 0.80, which indicated a high consistency among the 22 maps. Lower correlation coefficients (<0.79) were noted in the predictive maps that used R6 and E6 subsets, but all of these were greater than 0.67. Although these results mean that the three selection strategies are all effective, we still cannot know which strategy or environmental factor is the most ecologically significant. Therefore, the comprehensive application of these three strategies may be a better and more practical approach. For example, when the R strategy is used alone, LON will be selected The total 22 subsets of predictors selected through backward elimination used in different SDM models, labeled with three selection procedures (Pearson correlation coefficient as "R," training contribution level as "C," and expert selection as "E") and appended with the number of predictors. R61  = C61  = E61   T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, Bio1, Bio4, Bio5, Bio6, Bio8, Bio9,  Bio10, Bio12, Bio15, Bio16, Bio17, Bio18, BT, PER, WI, EWI, HI, PS, PW, PSR, PWR, TAR, ELE, SLO, LAT, LON, SR, DIS, ROU, SRR,  CUR, SP, SAR, CTI, HLI, TRAI, WLS   R32  P2, P3, P4, P5, P6, P7, P8  because of its low correlation coefficient with other factors, but if we use the comprehensive application of the three strategies, LON can be screened out in E strategy by experts who judge that its ecological significance is low.

Label Subsets of selected predictors
For the visual comparison of predictive maps, the average and difference in predicted occurrence probability according to the three backward elimination procedures are illustrated in Figure 4. The average occurrence probability (mean ± standard deviation) of maps (left of Figure 4) modeled using 10-R subsets, 7-C subsets, and 7-E subsets was 0-0.8710 (0.0243 ± 0.133), 0-0.7288 (0.0187 ± 0.0038), and 0-0.8776 (0.0244 ± 0.0125), respectively. Therefore, the average probability modeled by predictor subsets from three procedures was highly similar (also see Supplementary Figure 1). The results implied that using backward elimination of redundant variables that considered the correlation coefficients between predictors (Dormann et al., 2013;Syfert et al., 2013), contribution level of predictors (Young, 2010;Yang et al., 2013), or expert selection of biologically relevant predictors (Watling et al., 2012;Harris et al., 2013) are all reasonable methods.
Differences in occurrence probability (right of Figure 4) of R61−R6, C61−C6, and E61−E6 were −0.8140 to 0.7651 (mean: −0.0263), −0.5297 to 0.6467 (mean: −0.0049), and −0.7619 to 0.8294 (mean: −0.0253), respectively. A detailed comparison of predictive maps is also presented in Supplementary  Figure 1. Overall, the predictive probability modeled using a large number of predictors (R61, C61, and E61) was less than that of small number of predictors (R6, C6, and E6). This is possibly because the use of too many predictors in SDM results in over-parameterization and overfitting of the model (Guisan and Zimmermann, 2000;Tyberghein et al., 2012). The situation was true in our case with high collinearity or highly correlated predictors (see Supplementary Table 1), such as |r| > 0.9 between T1-T12, Bio1, Bio8, Bio9, Bio10, BT, WI, EWI, and ELE. By the images in Figure 4, we can see that the difference between the full 61-variable model and the simplest 6-variable model is not unacceptable, it means that there is no need to use too many variables in SDM (Guisan and Zimmermann, 2000;Tyberghein et al., 2012).
Proposed approach to selecting predictors Overall, the three procedures (high correlation, contribution level, and expert knowledge) for selecting predictors provided satisfactory results with high performance (Figures 3 and 4). However, the three procedures had their respective limitations. For example, biologically relevant predictors with strong correlations were vague in the selection procedure. Furthermore, regarding the contribution level of the procedures, collinearity was observed between predictors, such as Bio5 vs. T6 (|r| = 0.9), as presented in Table 3. Finally, the shortcomings of the expert knowledge procedure were not limited to the collinearity between predictors (Table 3) but also included artificial subjective decisions. Although numerous studies have been conducted on F. hayatae (Shen et al., 2015;Ying et al., 2016), identifying the most suitable set of environmental predictors for SDM for F. hayatae was difficult. Consequently, we propose a synthesis approach by combining the aforementioned three procedures to select predictors.
According to the aforementioned results, we suggest the following simple approach of combining different procedures of predictor selection.

FIGURE 4
Average of and difference in probability maps. The average of (left) and difference in (right) probability maps based on different MaxEnt predictor subsets from three backward elimination procedures.
1. Identify and eliminate ineffective variables with nonidentifiability, such as CI in our case, through bivariate scatterplot analysis.
2. Calculate the correlation coefficients between other candidate predictors 3. Gradually select predictors within some highly correlated candidate subset on the basis of expert knowledge concerning biologically relevant predictors for target species and a low contribution level, which facilitates rejection of redundant predictors.
In order to understand whether the synthetic (abbreviated as S) strategy is superior to the other three (R, C, E) strategies, we choose 7 environmental factors for these four strategies to simulate the distribution of F. hayatae. Figure 5 compares SDM performance using R, C, E, and S strategies to select 7 environmental factors. The results show that S strategy is closer to the true distribution of F. hayatae than the other three strategies, and the S strategy is more clearly to reveal that the distribution of F. hayatae is mainly controlled by thermalmoisture regime and topographic location. Consequently, the synthetic selection strategy of environmental factors proposed in this paper can help to select the predictors suitable for SDM.