Selection of Environmental Covariates for Classifier Training Applied in Digital Soil Mapping

A large number of predictor variables can be used in digital soil mapping; however, the presence of irrelevant covariables may compromise the prediction of soil types. Thus, algorithms can be applied to select the most relevant predictors. This study aimed to compare three covariable selection systems (two filter algorithms and one wrapper algorithm) and assess their impacts on the predictive model. The study area was the Lajeado River Watershed in the state of Rio Grande do Sul, Brazil. We used forty predictor covariables derived from a digital elevation model with 30 m resolution; the three selection methods were applied to this set, and the selected covariables were separated into subsets. These subsets were used to assess performance by applying four prediction algorithms. The wrapper method obtained the best performance values for the predictive model in all the algorithms evaluated. The three selection methods reduced the number of covariables in the predictive models by approximately 70 % and enabled prediction of the 14 soil mapping units.


INTRODUCTION
The primary predictor covariables used in digital soil mapping (DSM) are derived from digital elevation models (DEMs), which can be used to characterize the environment on a number of detailed scales (ten Caten et al., 2012). Digital elevation models can be used to obtain covariables that characterize the relief and have a direct or indirect relationship with other pedogenetic factors from the Scorpan model (McBratney et al., 2003).
The level of agreement obtained on the maps predicted by DSM remains low even when a large number of predictor variables are used. Average overall accuracy values are around 65 %, and it is not always possible to predict all the soil classes that make up the reference maps (Behrens et al., 2010; Coelho and Giasson, 2010; ten Caten et al., 2012). These accuracy values may be attributed to the low correlation between predictor covariables and soil classes, but there is still no consensus on which covariables should be used, as observed in the range of combinations used in DSM studies (Campos et al., 2010; ten Caten et al., 2012; Höfig et al., 2014; Teske et al., 2015; Dias et al., 2016).
A lack of awareness of the relevance of predictor covariables in discriminating soil classes has led to the use of sets with too few covariables to make a prediction or of excessively large sets with redundant covariables. Large sets of covariables, together with the presence of redundant variables, increase the complexity of the predictive model and hinder prediction for some soil classes, reducing map accuracy and precluding understanding of the relationships between predictors and soil types (Guyon and Elisseeff, 2003; Behrens et al., 2010; ten Caten et al., 2012; Brungard et al., 2015).
Selection algorithms can be applied to attenuate the complexity and low performance of predictive models and reduce the number of predictor covariables (Guyon and Elisseeff, 2003; Hall and Holmes, 2003; Coelho and Giasson, 2010; Giasson et al., 2013). The main selection algorithms applied to data mining can be grouped into wrapper (envelope), filter, and embedded (integrated) methods (Guyon and Elisseeff, 2003; Hall and Holmes, 2003).
Wrapper algorithms select the predictor variables by assessing their relevance via induction of a predictive model. These algorithms select by subtracting covariables from or adding covariables to the set and estimating performance indices of the respective predictive model, until achieving the smallest subset of predictors with performance greater than or equal to that of the set composed of all the predictor covariables under study (Guyon and Elisseeff, 2003; Hall and Holmes, 2003). This type of selection exhibits maximum performance in the classifier used to select the predictor covariables.
Filter-type algorithms are applied independently of the predictive model, and the selection criteria are parameters such as correlation, distance, information gain, and covariable consistency (Hall, 1999; Dash et al., 2000; Guyon and Elisseeff, 2003). This type of selection is commonly used to select variables (Giasson et al., 2013; Paes et al., 2013; Subburayalu and Slater, 2013; Subburayalu et al., 2014; Taghizadeh-mehrjardi et al., 2016; Vasu and Lee, 2016), and the results can be applied to any prediction algorithm (Dash et al., 2000; Guyon and Elisseeff, 2003; Hall and Holmes, 2003). Embedded methods are integrated with learning models and are specific to these classifiers (Guyon and Elisseeff, 2003; Paes et al., 2013).
Applying each selection method results in different subsets of covariables and, consequently, affects the predictive ability of the model. Studies have demonstrated that the wrapper algorithm obtains the best result when the predictive model is hierarchical (Hall and Holmes, 2003; Brungard et al., 2015). Although hierarchical methods are the main classification method used in DSM, the literature reports a predominance of filter-type algorithms for selecting predictor covariables, since the procedures are faster and do not depend on the predictive model (Coelho and Giasson, 2010; Giasson et al., 2013; Subburayalu and Slater, 2013; Subburayalu et al., 2014; Taghizadeh-mehrjardi et al., 2016). Applying filter-type models may result in sets of covariables weakly correlated with soil types, which may explain the low accuracy values observed in the literature. Thus, the wrapper algorithm is an alternative that could result in the selection of sets of covariables more closely associated with soil types and, in turn, in more accurate predictive models (Brungard et al., 2015). The present study aimed to compare three systems for selection of predictor covariables (two filter-type algorithms and one wrapper selection system) and assess their impacts on the predictive model.
Rev Bras Cienc Solo 2018;42:e0170414

MATERIALS AND METHODS
The study area was the Lajeado Grande River Watershed (Figure 1). The watershed encompasses an area of approximately 532 km² in the extreme northwest of the state of Rio Grande do Sul, Brazil, in the Alto Uruguai Hydrographic Region (Freitas et al., 2012).
The climate of the region is humid subtropical (Cfa according to the Köppen climate classification system), with average annual rainfall of 1,778 mm and temperature of 18.5 °C. The geology corresponds to the Paraná Province, characterized primarily by basalt flows from the Serra Geral formation (Bagatini et al., 2015). The soil map of the area (Figure 1) has a scale of 1:50,000 (Kämpf et al., 2004) and is composed of 14 soil map units (MUs) (Table 1).
The predictor covariables (Table 2) were derived from a digital elevation model (DEM) produced from ASTER GDEM v2 (Global Digital Elevation Model) sensor data, with spatial resolution of 30 m, dated October 17, 2011, and obtained from the United States Geological Survey (Tachikawa et al., 2011). The geoform covariable was derived from the LandMapR tool package (MacMillan, 2003). The drainage density index (km km⁻²) was obtained with the Raster Calculator extension of ArcGIS 9.2, following the methodology proposed by Cardoso et al. (2006). The other covariables were derived using RSAGA v 2.2.2 (Brenning et al., 2018), integrated into R 3.3.1 software (R Development Core Team, 2018).
Next, all the predictor covariables were sampled together with the response variable (soil map units), following a stratified sampling scheme, with approximately 30,000 points (ten Caten et al., 2013; Bagatini et al., 2015). The samples were stratified based on the number and size of the polygons of each soil map unit. To that end, simulations were made with different numbers of points in each MU, and a minimum of 300 samples was established for the smaller MUs. Thus, 3,000 points were randomly distributed in the nine units with an area of less than 10 km² (Table 1), and the remaining points were randomly stratified, obtaining approximately 300 samples in the smaller area and 6,000 in the larger.
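The allocation logic described above can be illustrated with a short Python sketch that distributes points among map units in proportion to area while enforcing a minimum per unit. This is only a sketch under stated assumptions: the unit names and areas are hypothetical, the function `allocate_samples` is not part of the original workflow, and the actual stratification also considered the number of polygons, which is omitted here.

```python
def allocate_samples(areas_km2, total_points, min_per_unit):
    """Allocate sample points per map unit, proportional to area.

    Units whose proportional share falls below `min_per_unit` are
    raised to that minimum, so the final total may slightly exceed
    `total_points`.
    """
    total_area = sum(areas_km2.values())
    allocation = {}
    for unit, area in areas_km2.items():
        proportional = round(total_points * area / total_area)
        allocation[unit] = max(min_per_unit, proportional)
    return allocation


# Hypothetical map units and areas (km2), for illustration only.
areas = {"MU_a": 4.0, "MU_b": 8.0, "MU_c": 120.0, "MU_d": 400.0}
allocation = allocate_samples(areas, total_points=30000, min_per_unit=300)
print(allocation)
```

Raising small units to a floor keeps rare map units represented in the training data, at the cost of slightly over-representing them relative to their area.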
Three selection methods were applied to the dataset containing the 40 predictor covariables (CJ40) and the soil map units, and the following four subsets were separated:
Subset 1 = selected by applying the Correlation-based Feature Selection (CFS) algorithm (Hall and Smith, 1999); this algorithm performs a heuristic assessment based on correlation in order to find subsets containing covariables highly correlated with the class and not correlated with one another, while covariables with strong intercorrelation are considered redundant and excluded;
Subset 2 = selected by applying the Consistency Subset Eval (CSE) algorithm (Liu and Setiono, 1996). The CSE uses the class consistency rate to select the properties that divide the original dataset into subsets containing most of the classes, and uses the consistency assessor proposed by Liu and Setiono (1996). The CFS and CSE algorithms were chosen to compare filter selection based on correlation and on consistency of the data. To that end, the Best First search method (-D 1 -N 5) was applied in the respective algorithms, using Weka 3.8.0 software (Hall et al., 2009);
Subset 3 = selected following the wrapper selection principles (Hall and Smith, 1999). A script in the R programming language was written to build predictive models with the J48 algorithm (J48 -C 0.25 -M 2), assessed by five-fold cross-validation. A "while" loop was used, whereby the predictive model was refitted and covariables were eliminated one by one until reaching the minimum number of covariables for which the model exhibited overall accuracy greater than or equal to that of the set with all the covariables (CJ40);
Subset 4 = composed of the covariables selected simultaneously in subsets 1, 2, and 3.
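The wrapper procedure used for subset 3 can be sketched as a backward elimination loop. The Python code below is a minimal illustration and not the original R script: `backward_wrapper` and `toy_accuracy` are hypothetical names, and the toy scoring function stands in for the cross-validated overall accuracy of a J48 model, which would be supplied in a real application. The covariable names are used only as labels; the scores are invented.

```python
def backward_wrapper(covariables, evaluate):
    """Backward elimination wrapper.

    `evaluate(subset)` is assumed to return the cross-validated overall
    accuracy of a classifier trained on `subset`. Covariables are removed
    one at a time while accuracy stays >= that of the full set.
    """
    baseline = evaluate(covariables)
    selected = list(covariables)
    removed = True
    while removed and len(selected) > 1:
        removed = False
        best_score, best_subset = None, None
        # Try dropping each remaining covariable; keep the best candidate.
        for name in selected:
            candidate = [c for c in selected if c != name]
            score = evaluate(candidate)
            if best_score is None or score > best_score:
                best_score, best_subset = score, candidate
        if best_score >= baseline:
            selected = best_subset
            removed = True
    return selected


# Toy stand-in for model accuracy: three informative covariables,
# each uninformative one costs a small penalty (invented numbers).
INFORMATIVE = {"ASPECT", "CNBL", "DDI"}


def toy_accuracy(subset):
    hits = len(INFORMATIVE & set(subset))
    noise = len(set(subset) - INFORMATIVE)
    return 0.2 * hits - 0.01 * noise


selected = backward_wrapper(
    ["ASPECT", "CNBL", "DDI", "NOISE1", "NOISE2"], toy_accuracy
)
print(selected)
```

Because every candidate subset requires refitting and cross-validating the classifier, the cost grows quickly with the number of covariables, which is consistent with the observation that filter methods are faster.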
The sets of predictor covariables (subsets 1, 2, 3, 4, and CJ40) were used to assess performance by applying four prediction algorithms (J48, REPTree, BFTree, and the Multilayer Perceptron) used in earlier studies (Coelho and Giasson, 2010; Giasson et al., 2011; Sarmento et al., 2012; Arruda et al., 2013; Giasson et al., 2013; ten Caten et al., 2013; Calderano Filho et al., 2014; Dias et al., 2016). These algorithms were selected to compare covariable performance in classifiers with different architectures, the first three being decision trees and the last one an artificial neural network (ANN). For the decision trees, a minimum of five instances per leaf was used. All the algorithms were assessed by five-fold cross-validation, applying Weka 3.8.0 software (Hall et al., 2009).
The results were assessed by an error matrix using the Kappa coefficient, mapping accuracy (MA), overall accuracy (OA) (Congalton, 1991), and the following evaluators: area under the precision-recall curve (PRC), Matthews correlation coefficient (MCC) (Saito and Rehmsmeier, 2015), mean absolute error (MAE), and root mean square error (RMSE), the last two calculated based on the likelihood of the occurrence of the data observed and estimated for each class (Shi, 2007).
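For reference, overall accuracy and the Kappa coefficient can be computed from an error matrix as in the Python sketch below. The function names and the 2 x 2 matrix are illustrative assumptions, not data from this study; rows are taken as reference classes and columns as predicted classes.

```python
def overall_accuracy(matrix):
    """Proportion of samples on the diagonal of the error matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's Kappa: agreement corrected for chance agreement."""
    n = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


# Invented error matrix for two classes (rows: reference, columns: predicted).
error_matrix = [[40, 10],
                [5, 45]]
print(overall_accuracy(error_matrix))
print(kappa(error_matrix))
```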

RESULTS AND DISCUSSION
Of the 40 covariables extracted from the DEM, only 20 were able to predict soil type, having been selected to make up some of the subsets (Table 3). This means that half the covariables under study exhibited a weak correlation with the spatial distribution of soils or were redundant, and were therefore discarded.
Subset 3, selected by applying the wrapper algorithm, resulted in a predictive model with higher overall accuracy and Kappa coefficient than the values obtained in subsets 1 and 2, selected by the CFS and CSE filters (Figure 2). This behavior was observed for the four algorithms, indicating no interaction between the selection methods and the respective prediction algorithms.
The smallest predictive model, containing 21 layers, was obtained by the neural network algorithm and the largest (2,142 leaves) by J48 (Table 4). This difference is largely due to the architecture adopted by these classifiers. Neural networks are composed of neurons stacked in layers that make the connection between the layers and the neurons themselves, whereas the other algorithms display decision tree architecture consisting of nodes that, when combined, give rise to the leaves, which leads to a larger model than that obtained by the neural networks (Witten et al., 2011). The architecture and size of the predictive models are directly linked to their complexity; ideal models exhibit high predictive power and a level of complexity that still allows them to be interpreted (Ruiz et al., 2014). The apparently smaller ANNs show greater complexity and prevent complete understanding of the nature of the data under analysis (ten Caten et al., 2012).
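None of the selection methods assessed showed a significant decrease in the size of the predictive model compared to the model obtained with all the covariables. For the BFTree and J48 algorithms, the smallest predictive models were obtained from subset 3, selected by the wrapper method, whereas for the REPTree, the smallest model was obtained from subset 1, selected by application of the CFS filter. However, it is important to underscore that only in J48 did a decline in the size of the predictive model not result in lower overall accuracy and Kappa coefficient values.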
Of the predictor covariables tested, only slope orientation (ASPECT), channel network base level (CNBL), and drainage density index (DDI) were selected simultaneously by the three selection algorithms (subset 4). The simultaneous selection of these covariables is due to their strong association with the spatial distribution of soils. Slope orientation (Figure 3a) has a direct effect on the microclimate, changing water availability and biological activity in pedogenesis, thereby correlating with the spatial distribution of soils in the landscape (Schaetzl and Anderson, 2005).
The covariable channel network base level (Figure 3b) is an intermediate variable used to calculate the vertical distance from the drainage network. This covariable is obtained by relating the vertical distance to the channel network base level, and is used to indicate the depth of soils influenced by groundwater (Bock and Köthe, 2008). The CNBL exhibits the lowest values in enclosed valleys, indicating that in these areas the soil surface occurs closer to the base level of the drainage network, which helps to separate soils that occur in the enclosed valleys from the soils that occur in the upland areas (Bock and Köthe, 2008). The study area contains MUs formed by associations with predominant Latossolos (Hapludox) in upland areas and Neossolos Regolíticos (Udorthents) on valley slopes. These two groups of soils account for approximately 70 % of the total study area, making the CNBL covariable important in differentiating these soils.
The drainage density index (Figure 3c) is related to the properties that influence water infiltration into the soil, such as depth, texture, and drainage, and is used to differentiate soils in terms of drainage. In areas with a high DDI, there is theoretically less infiltration and, consequently, shallower soils or soils with limited percolation in the B horizon (Demattê and Demétrio, 1998; Dobos et al., 2000).
The three selection methods resulted in different combinations of predictor covariables. This occurred due to the criteria used by each method for selecting the covariables most important for predicting soil type.
Application of the CFS filter eliminated the largest number of covariables. Of the 40 covariables under study, only seven were selected to form subset 1; thus, CFS led to a decline of approximately 80 % in the number of variables. The correlation matrix of subset 1 (Figure 4) shows that fewer than 30 % of the correlations exhibit a magnitude greater than 0.4, the threshold above which they are classified as moderate or strong (Dancey and Reidy, 2006). The selection criterion of the CFS filter is weak correlation between covariables and high correlation with the response variable, which contributes to the selection of poorly intercorrelated covariables (Hall and Holmes, 2003). The CFS was highly efficient in reducing the number of predictors; however, it decreased the performance of the predictive model. The highest correlation (0.96) occurred between the CNBL and MDEGENER covariables, due to the way the former is calculated, using the horizontal distance from the drainage network and the vertical distance (elevation) from the base of the drainage network, resulting in a strong correlation between them (Figure 4).
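The CFS criterion described above can be made concrete with the merit heuristic given by Hall (1999), in which a subset of k covariables scores k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean covariable-class correlation and r_ff the mean covariable-covariable correlation. The Python sketch below is only an illustration: the function name `cfs_merit` and the correlation values are assumptions, and the Weka implementation used in the study additionally handles the search strategy and the choice of correlation measure.

```python
from math import sqrt


def cfs_merit(class_correlations, feature_correlations):
    """CFS merit of a subset (Hall, 1999).

    class_correlations: correlation of each covariable with the class.
    feature_correlations: pairwise correlations among the covariables.
    """
    k = len(class_correlations)
    mean_rcf = sum(abs(r) for r in class_correlations) / k
    if feature_correlations:
        mean_rff = sum(abs(r) for r in feature_correlations) / len(feature_correlations)
    else:
        mean_rff = 0.0  # a single covariable has no intercorrelation
    return k * mean_rcf / sqrt(k + k * (k - 1) * mean_rff)


# Invented values: two covariables equally correlated with the class.
independent = cfs_merit([0.6, 0.6], [0.0])  # uncorrelated with each other
redundant = cfs_merit([0.6, 0.6], [1.0])    # perfectly intercorrelated
print(independent, redundant)
```

With identical class correlations, the intercorrelated pair receives a lower merit, which is why the filter tends to discard redundant covariables.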
Applying the CSE filter resulted in subset 2, with 12 covariables selected, which represented a 70 % decline in the number of covariables. In this subset, fewer than 20 % of the correlations are greater than 0.4 (Figure 5a). The strongest correlations observed in subset 2 were between the VALLEYDEP and MDEPAD covariables (0.63) and between the CND and MDEPAD covariables (-0.71). The selection criterion of the CSE algorithm is the consistency of the subsets in relation to the response variable, which allowed higher correlation values between the selected covariables. Nevertheless, applying these filters eliminated the highest number of covariables with strong intercorrelation (Hall and Holmes, 2003). However, like the CFS, the CSE also reduced the performance of the predictive model.
Subset 3, selected by applying the wrapper approach, consisted of 11 covariables. In this subset, 35 % of the correlations are greater than 0.4 (Figure 5b), this method being the least efficient in eliminating strongly correlated covariables. The highest collinearities occur between covariables CNBL and MDEGENER (0.95) and between PRDECL and CONVINDEX (0.85). Only the wrapper selection did not decrease the performance of the predictive model in relation to the set with all 40 covariables. Moreover, a slight decline in the size of the predictive model was also observed in algorithms BFTree, REPTree, and J48. This result corroborates the study by Hall and Holmes (2003), who also obtained a better prediction result when the wrapper method was applied to select the predictor variables.
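The share of correlations above 0.4 reported for each subset can be obtained by counting covariable pairs whose absolute Pearson correlation exceeds the threshold. The Python sketch below assumes that only distinct pairs are counted, which the text does not state explicitly; the function names and the toy columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def share_above(columns, threshold=0.4):
    """Fraction of covariable pairs with |r| greater than `threshold`."""
    pairs = list(combinations(columns, 2))
    strong = sum(
        1 for a, b in pairs
        if abs(pearson(columns[a], columns[b])) > threshold
    )
    return strong / len(pairs)


# Toy columns: "a" and "b" are perfectly correlated, "c" is unrelated to both.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, -1, 1],
}
share = share_above(columns)
print(share)
```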
The presence of strongly intercorrelated covariables in all the subsets indicates that they may exhibit different degrees of importance for the soil classes, justifying maintaining them in the subsets. This result shows that the correlation between covariables alone is insufficient to select the predictors most relevant for predicting soil types.
In all the subsets of variables and algorithms tested, it was possible to predict the 14 map units. Mapping accuracy (MA) behaved similarly to overall accuracy (OA), with the highest values observed in subset 3 associated with algorithm J48 (Table 5). The lowest MA values (<0.3) occurred in the BFTree and ANN algorithms associated with subsets 1 and 2. The greatest variation between the minimum and maximum MA values was also observed in these two subsets, indicating that the prediction error was concentrated in only a few map units.
The opposite behavior can be observed for algorithm J48, which displayed the lowest variation between minimum (0.50) and maximum (0.90) MA values in CJ40 and subset 3, suggesting a more even distribution of error among the MUs and, in turn, better prediction of the 14 soil map units. Only map units LV2, NV2, and RL did not reach their maximum MA in J48 associated with subset 3; however, the MA values observed for this combination of prediction algorithm and subset were higher than 0.67, indicating good prediction for these three soil map units as well.
The map units most difficult to predict were NV2, LV2, RR1, RR2, RR3, and RR6, which showed the lowest MA values. These map units consist of Typic Udorthents (Neossolos Regolíticos) or of this class in association with other classes, such as Lithic Udorthents (Neossolos Litólicos) and Typic Haplustox (Latossolos Vermelhos). Thus, these MUs may occur adjacently or in similar environments, hindering their prediction by classifiers. Map units occupying very similar positions in the landscape may cause more discrimination difficulties for predictive models (Höfig et al., 2014).
The performance of the subsets and algorithms is also supported by the values obtained in the other indices evaluated (Table 6). Except for the ANN algorithm, the values of the area under the PRC curve, Kappa coefficient, and OA exhibit a variation of less than 5 % in the predictive models, indicating low randomness in the classifications obtained by these models and good consistency in the covariables used to predict the occurrence of the MUs. The results for subset 4 demonstrate that the three covariables selected simultaneously in the three methods tested alone account for more than 56 % of the agreement [...] other studies (Giasson et al., 2013; Paes et al., 2013; Subburayalu and Slater, 2013; Subburayalu et al., 2014; Taghizadeh-mehrjardi et al., 2016; Vasu and Lee, 2016). Wrapper selection uses a predictive model to assess the relevance of each covariable independently, which made it possible to select variables highly correlated with the response variable and achieve maximum performance in the algorithms used for classification.

CONCLUSIONS
The wrapper selection method produced the best performance for the predictive model in all the algorithms.
Applying the three selection methods reduced the number of covariables by approximately 70 % in the predictive models and made it possible to predict the 14 map units.
Applying the Correlation-based Feature Selection (CFS) and Consistency Subset Eval (CSE) filter algorithms decreased accuracy and Kappa coefficient values in relation to the set of all the covariables.
Only the covariables of slope orientation, channel network base level, and drainage density index were selected simultaneously by the three methods tested.

Figure 3. Covariables selected in the three methods tested. Slope orientation (a); channel network base level (b); drainage density index (c).

Table 2. Covariables extracted from the digital elevation model

Table 3. Predictor covariables selected to make up the respective subsets
[...] (TWI); Valley Depth (VALLEYDEP); Direct to Diffuse Ratio (DDR)
Subset 3: Drainage Density Index (DDI); Channel Network Base Level (CNBL); Aspect (ASPECT); Diffuse Insolation (DIFINSOL); Generalized Surface (MDEGENER); Positive Openness (POSOP); Convexity (CONVEXITY); Slope (Slope); Euclidean Distance of the Rivers (DISTRIVERS); Convergence Index (INDCONER); Mid-Slope Position (MSP)
Subset 4: Drainage Density Index (DDI); Channel Network Base Level (CNBL); Aspect (ASPECT)

[...] obtained from subset 1, selected by application of the CFS filter. However, it is important to underscore that only in J48 did a decline in the size of the predictive model not result in lower overall accuracy and Kappa coefficient values.

Table 4. Size of predictive models for the sets of predictor variables