Mapping Soil Cation Exchange Capacity in a Semiarid Region through Predictive Models and Covariates from Remote Sensing Data

Planning sustainable use of land resources requires reliable information about spatial distribution of soil physical and chemical properties related to environmental processes and ecosystemic functions. In this context, cation exchange capacity (CEC) is a fundamental soil quality indicator; however, it takes money and time to obtain this data. Although many studies have been conducted to spatially quantify soil properties on various scales and in different environments, not much is known about interactions between soil properties and environmental covariates in the Brazilian semiarid region. The goal of this study was to evaluate the efficiency of random forest and cokriging models applied to predict CEC in the Brazilian semiarid region. The covariates used to predict CEC consist of images from Landsat 5 TM and a legacy soil map (scale 1:10,000). The sample set comprises 499 samples from the topsoil layer (0.00-0.20 m), where 375 samples were used in training processes and 124 as validation samples. The cokriging model (R = 0.57 and RMSE = 7.22 cmolc kg) performed better in predicting CEC than the random forest model (R = 0.47 and RMSE = 7.89 cmolc kg). The approach used showed potential for estimating CEC content in the Brazilian semiarid region by using covariates obtained from orbital remote sensing and the legacy soil map.


INTRODUCTION
Sustainable agriculture and environmental management implies understanding the variability of soil properties on an appropriate scale to support decision-making.Cation exchange capacity (CEC) is an important soil property and a fundamental indicator of soil quality and environmental decontamination potential (Tang et al., 2009).The CEC affects ionic adsorption and desorption in elements such as copper, zinc, and lead, as well as in organic pollutants such as atrazine, phenanthrene, diquat, and paraquat (Liao et al., 2011).Understanding CEC spatial distribution is important for supporting decisions regarding crops and soil management; however, obtaining a large soil dataset is expensive and time consuming.
The use of covariates in soil surveys, such as information related to soil-landscape relationships, makes it possible to achieve a good cost-benefit ratio and quality of results.Reflectance and emission data can be analyzed to extract information about the Earth and its resources, as the physical and chemical properties of different surfaces vary across the electromagnetic spectrum.In this context, remote sensing data can be used as environmental covariates in digital mapping of soil properties (Boettinger et al., 2008), especially in arid and semi-arid regions, due to the absence of interference from vegetation.Images from orbital remote sensing are an important source of environmental data used in digital soil mapping, due to the relation of soil images with soil forming factors, such as organisms (vegetation) and parent material (McBratney et al., 2003).
Different methods have been used to predict soil CEC, including geostatistics, e.g., cokriging models (Tang et al., 2009;Bilgili et al., 2011;Ciampalini et al., 2012), and data mining methods, e.g., random forest models (Lagacherie et al., 2013;Hengl et al., 2015).A study conducted by Liao et al. (2011) used kriging models and principal component analysis to predict soil properties; they observed that cokriging had better performance than ordinary kriging in predicting CEC spatial distribution.
A combination of hyperspectral spectroscopy data from the visible and near-infrared spectrum and cokriging models to predict soil properties (including CEC) was exemplified by Bilgili et al. (2011) in a precision agriculture area in northern Turkey.The authors indicated that this approach better predicted soil properties compared to partial least squares regression and ordinary kriging.Moreover, it allowed reduction in the sample set and, consequentially, in laboratory analysis expenses.
Cokriging and hyperspectral data from the visible and near-infrared spectrum was used by Ciampalini et al. (2012) to predict CEC (among other properties) of the topsoil layer in Cap Bon, Tunisia.The results of CEC prediction are considered relatively good performance due to the low density of the dataset, which was not sufficient to capture short distance variations related to the lithology of that area.Furthermore, the results explain great part of variability considering the intrinsic dynamic of the measured property (CEC).
Random forest (RF) is a data mining method that has shown some advantages over most modeling methods, as highlighted by Breiman (2001) and Liaw and Wiener (2002).The RF model was also used by Lagacherie et al. (2013) to predict soil CEC in the soil subsurface using legacy data as input, and as covariate hyperspectral data from the visible and near-infrared spectrum and terrain attributes from a digital elevation model (DEM) with a resolution of 30 m in the Cap Bon region (Tunisia).The results obtained in the present study were considered satisfactory for the 0.15-0.30and 0.30-0.60m soil layers; lower accuracy was found in the 0.60-1.00and 0.30-1.00m layers.The resulting maps improved the existing soil maps for that region and were consistent with the tacit pedological knowledge.In contrast, Vaysse and Lagacherie (2015) did not obtain satisfactory results applying RF models to predict CEC through the use of soil legacy data and covariates derived from SRTM, images from Landsat 7, and geology maps as inputs.The poor performance was attributed to the small-scale variation of the parent material, Rev Bras Cienc Solo 2018;42:e0170183 and the erosion/deposition rates along the catena system were not well captured by the spatial resolution of the covariates used as input data (100 m).
A study performed by Hengl et al. (2015) compared RF and multiple linear regression (MLR) to predict several soil properties (including CEC) for the African continent using a large number of covariates [MODIS, SRTM, land cover map, and soil maps (SoilGrids 1 km)] and 28,000 soil samples as input.The RF models performed better than MLR for all soil properties and depths predicted.
Although mapping spatial variation of soil properties in flat areas remains a challenge since the terrain covariates related to current topography have low predictive power, remote sensing data can be useful to represent surface variability that may better correlate with soil properties predicted by using digital mapping techniques.Given the context of the issues presented above, the goal of this study was to compare the accuracy of cokriging and RF models in predicting CEC in the topsoil layer of a flat area in a Brazilian semiarid region under Caatinga (xeric shrubland) vegetation by using Landsat 5 TM data as a covariate.
The Köppen climate classification of the area is BSwh' (semi-arid with dry winter and rainy summer, and the coldest mean monthly temperature is above 18 °C).The average annual rainfall is 400 mm, with the rainy season extending from November to April (highest rainfall in March); and the average annual temperature is around 26 °C.Xerothermic indexes are between 200 and 150, comprising seven to eight dry months.The area has a unique vegetation formation, known as hyperxerophilic Caatinga, characterized by shrubs with a high degree of xerophytism.However, part of the original vegetation Rev Bras Cienc Solo 2018;42:e0170183 has been removed, for various purposes, and currently shows signs of degradation, sometimes exposing the soil surface.The landscape is composed of flat surfaces, with the lithology mainly composed of limestone rocks from the Caatinga Formation (old Tertiary -Quaternary) and gneiss-granite rocks of the Caraiba-Paramirim Complex (Souza et al., 2003).The most representative soil types in this area are Vertissolos, Cambissolos, and Planossolos (Santos et al., 2013), which corresponding to Vertisols, Cambisols, and Planosols, respectively (IUSS, 2015).

Covariates and soil input data
The values of CEC were determined according to Oliveira (1979) and Embrapa (1979).The input dataset comprised the topsoil layer (0.00-0.20 m) for 499 soil profiles from a soil survey (legacy data), performed by the Companhia de Desenvolvimento do Vale do São Francisco (Codevasf) in 1989, which were subdivided for model training and validation.
Sampling was performed according to the requirements of the National Soil Survey Service, regarding level of detail, as recommended by IBGE (2015) and Carvalho (1988).
Additionally, a categorical covariate, represented by the detailed soil map (legacy data) on a 1:10,000 scale (Codevasf, 1989), was used on the first taxonomic level according to the Brazilian soil taxonomic system (Santos et al., 2013).Most of the soils are Vertisols.This categorical covariate (soil map) was used only in the RF model.

Predictive models
The modelling procedures to execute the random forest (RF) and cokriging (CK) prediction were performed on R software (R Development Core Team, 2007) through the randomForest (RF) and gstat (CK) packages.
Random forest is a non-parametric technique developed by Breiman (2001) as an extension of CART (Classification and Regression Tree) systems to improve the performance of predictors.
The random forests are a combination of many predictive trees (forest), where each tree is derived from a random vector sampled independently and with the same distribution for all trees in the forest.The subdivisions within each tree are determined based on a subset of predictor variables randomly chosen from the total of predictors provided.The results of RF, when applied to predict continuous data by regression, consist of the average of the results from all the trees in the forest (Breiman 2001;Cutler et al., 2007).
To implement the RF models, three parameters are necessary: ntree -number of trees in the forest; nodesize -minimum amount of data in each terminal node; and mtry -number of covariates used in each tree (Liaw and Wiener, 2002).The ntree value was set to system default (500), although more stable results can be achieved with a larger number, according to Grimm et al. (2008).Preliminary tests showed that increasing the ntree did not improve model performance, supporting use of the default value.The nodesize value was set to five for each terminal node, as usually selected in regression studies.The mtry value chosen in this study was according to Liaw and Wiener (2002), who propose an amount corresponding to one third the total number of predictor variables for regression problems.
The RF model provides error estimation known as Out-Of-Bag (OOB), which is based on observed data not used by the algorithm to create the trees.Based on the OOB predictions from all the trees in the forest, the mean square error (MSE OOB ) is calculated, according to equation 1 (Liaw and Wiener, 2002).
Rev Bras Cienc Solo 2018;42:e0170183 Eq. 1 in which zi represents the average value of the variable, and ẑi oob is the average of all OOB predictions.However, as the MSE OOB is dependent on the measurement scale of the target variable (CEC), this index is not appropriate to compare the performance of different models.Addressing this issue, the percent of variance explained by the model (VAR ex ) is calculated according to equation 2 (Liaw and Wiener, 2002), where Var z is the total variance of the variable.
Cokriging is a geostatistical procedure where a regionalized variable can be estimated based on correlation with one or more secondary variables (co-regionalized) with spatial co-dependence (McBratney and Webster, 1983).It can be understood as an extended kriging method, which has a vector of values rather than a single value for each spot sampled.Therefore, the CK model allows prediction of a variable (soil CEC, in this case), using a known dataset and based on correlation with other covariates.

Validation
Performance of the models was evaluated based on an independent validation set that was not used in the training procedure.To do so, the 499 soil samples were randomly divided into two independent datasets in the R software; one of these was used in the training process (375 soil samples) and another in the validation process (124 soil samples).The analysis of model performance was based on the correlation between the observed values (validation samples) and the estimated values, calculated by the coefficient of determination (R 2 ) and the root mean square error (RMSE), using equation 3.
in which "d" is the difference between the observed and estimated values, and "n" is the number of samples used in the validation process.The RMSE is commonly used to estimate the error or uncertainty in places where the error was not measured directly (Holmes et al., 2000).The higher the values of RMSE, the greater the differences between the datasets.

Descriptive statistics and covariate selection
The descriptive statistics of the CEC from the topsoil layer (0.00-0.20 m) for the training and validation datasets, and from the environmental covariates are presented in table 1.
Analysis of variance (ANOVA) showed similarity between the training and validation samples, and no significant difference was detected for the CEC and the covariates at 95 % probability.The results of this analysis indicate that the validation samples adequately represent the training samples.
High values for the coefficient of variation (CV) for the soil CEC indicate heterogeneity in training and validation datasets, with values corresponding to 34 and 32 %, respectively.
In contrast, all other covariates had CV values lower than 18 %, representing more homogeneity, except for NDVI.

Random forest model
The RF model also has the ability to estimate the relative importance of the covariates (Figure 2).In the training process, the RF model maintains all input covariates, preventing even those weakly correlated but with important pedological meaning from being Rev Bras Cienc Solo 2018;42:e0170183 discarded from the model (Akpa et al., 2014).In contrast, predictive methods, such as stepwise linear regression, maintain only highly correlated covariates in the model (Cutler et al., 2007).
The importance ranking of the covariates for CEC prediction was as follows: Soil Map > b3/b7 > NDVI > b5/b7 > b1 > b7 > b3 > b4 > b2 > b5 > b3/b2.Removing these covariates from the model tended to increase the MSE OOB , which ranged from 12.45 (b3/b2) to 25.29 % (Soil Map).The importance of soil legacy data for mapping soil properties (including CEC) was highlighted by Hengl et al. (2015), based on the application of random forest models in Africa.
The importance of those variables may be related to types of land use, which affected soil CEC values through the contribution of organic matter.The VAR ex obtained from the Out-of-Bag (MSE OOB ) data using the training samples was 51.07 %, and the goodness of fit estimated by RMSE was 8.07 cmol c kg -1 (Table 2).The results can be considered satisfactory for performance of the models.Few studies in the literature have used RF to predict soil CEC, and no one has used only remote sensing data as main covariates alongside legacy soil maps (Table 2).
The results from this study were lower than obtained by Lagacherie et al. (2013).These authors obtained 79 % for VAR ex for the layer between 0.15 and 0.30 m and 3.4 cmol c kg -1 for RMSE (Table 2) using terrain attributes and hyperspectral data in the visible and near-infrared spectrum (AISA-Dual) with 5 m of spatial resolution as input data.The lower results obtained in this study may be correlated with the coarser spatial resolution from Landsat 5 (30 m) in comparison with Lagacherie et al. (2013), who used hyperspectral data (5 m).The effect of spatial resolution on predicting soil properties was also reported by Smith et al. (2006) and Ruiz Navarro et al. (2012).
The results obtained by Hengl et al. (2015) were higher than those obtained in the present study (Table 2).Those authors used RF in combination with a large number of covariates and 28,000 soil profiles as input data.
However, the performance metrics for CEC prediction in this study were better than those achieved by Vaysse and Lagacherie (2015), who did not have satisfactory results predicting this soil property (Table 2).The poor performance in that case may be associated with the small variation from the parent material and the erosion/deposition rates, which were not captured by the spatial resolution of the covariates (100 m).Moreover, the authors highlighted the importance of the input dataset to improve performance of the models.

Cokriging
The semivariogram obtained for soil CEC (Figure 3) provides a description of spatial dependence and indicates processes related to spatial distribution (Liao et al., 2013).
As shown in figure 3, the exponential model satisfactorily fits the semivariogram, as in the study performed by Liao et al. (2011).However, Bilgili et al. (2011) and Ciampalini et al. (2012) obtained a better fit of the semivariogram with a spherical model.The nugget effect is an important parameter that indicates unexplained variation based on sampling distance (McBratney and Webster, 1983).In this sense, the value of the nugget effect greater than zero (12.53) may be related to measurement error or to unexplained spatial variation of the soil property (Liao et al., 2013).The strength of spatial autocorrelation for a soil property can be determined by the ratio between the nugget effect and sill (Cambardella et al., 1994;Rossi et al., 2009;Bilgili et al., 2011;Liao et al., 2011).For this ratio, values below 25 % are considered a strong spatial dependence; whereas, this dependence is considered moderate for values between 25 and 75 %.Values greater than 75 % are considered poorly dependent.Therefore, the nugget/sill ratios of 15 % observed in this study (Figure 3) indicate a strong spatial dependence of the CEC in this area.In this sense, it is possible that the spatial variability of this property may be driven by intrinsic soil forming factors, such as the parent material (Cambardella et al., 1994).In the case of this study, the area is composed of limestone and gneiss-granite rocks and has a flat surface, showing low relevance of relief as a formation factor.Strong spatial dependence of the soil CEC was also observed by Liao et al. (2011), Bilgili et al. (2011), and Ciampalini et al. (2012) in their respective studies.
The sill parameter expresses the inflection point where the distance between samples (1,129 m in that case) shows no spatial autocorrelation.In a study conducted in Tunisia, Ciampalini et al. (2012) observed values for this parameter ranging from 250 to 2,000 m, which can be considered similar to the values obtained in this study.Liao et al. (2011) observed higher values (37,500 to 40,100 m), whereas a lower range of values (379-427 m) was reported by Bilgili et al. (2011).Furthermore, these results reflect the characteristics of different case studies due to sampling density (Trangmar et al., 1986).
The semivariogram used to estimate the CEC values has an R 2 value of 0.57 and an RMSE of 7.22 cmol c kg -1 for the validation samples.These performance metrics were better than those found by Liao et al. (2013), who used cokriging and principal components derived from soil properties to predict soil CEC in China.As observed by Ciampalini et al. (2012), the values for the coefficient of determination (R 2 = 0.50) and the RMSE (5.16 cmol c kg -1 ) were slightly lower compared to those obtained in this study.In contrast, Bilgili et al. (2011) observed lower RMSE values for all sample sets (1.41 to 1.83 cmol c kg -1 ) when predicting CEC values by cokriging using hyperspectral data in the visible and near-infrared spectrum as covariates.

Comparison of predictive models
The descriptive statistics of soil CEC prediction for the RF and CK models are presented in table 3. The RF model produced predicted values for CEC within the range of the original values, with a smaller standard deviation and coefficient of variation, as expected for this model (Table 3).In contrast, the CK model produced a map with a greater range of values than RF, with more similarity to the descriptive statistics for CEC from the soil samples, which, according to Liao et al. (2011), means that this model had greater accuracy in describing CEC spatial variation than the RF model.The mean values (34.11 cmol c kg -1 ) and standard deviation (9.58 cmol c kg -1 ) are also closer to the values for soil samples (34.22 and 11.37 cmol c kg -1 , respectively).The CV for both models were similar and smaller than the CV from the soil samples, which was also observed by Liao et al. (2011).
Performance of the models based on the independent validation dataset (124 samples) is presented in figura 4. The CK model had better performance (R 2 of 0.57 and 7.22 cmol c kg -1 for RMSE) than the RF model (R 2 of 0.47 and 7.89 cmol c kg -1 for RMSE) in predicting soil CEC.
Figure 4 corroborates this since the dispersion of points in the RF model is greater than in the CK model.A notable point of this study regards the comparison between the random forest and cokriging models in predicting soil CEC; this comparison was not reported in the literature.
Quantification of soil properties using an orbital sensor is not an easy task, due to the complexity of soil dynamics and properties, as pointed out by Demattê et al. (2007).
In this sense, the results obtained in this study can be considered satisfactory and are probably related to physical interference of the soil in incident and reflected energy.
The spatial distribution of CEC predicted by the RF and CK models, as well the statistics regarding modeling by RF and CK, are presented in figure 5 and table 4, respectively.Higher CEC can be observed in the central portion of the area for both models, corresponding to Vertisols developed from limestone, which represent 94.5 % of the study area.These soils have a surface horizon with average thickness of 0.138 m and topsoil texture ranging from clay loam to clay, with high cation saturation in the exchange complex.The lowest levels for CEC were observed in the eastern, southwestern, and northwestern portions of the area and are related to Cambisols and Planosols.Cambisols have a topsoil horizon (A) with an average thickness of 0.149 m.Most of these soils have high activity clay and are eutrophics, with soil texture of clay loam and, infrequently, clay.Planosols had a surface horizon with an average thickness of 0.166 m, clay loam as topsoil texture, and very high cation saturation.

CONCLUSIONS
The approach based on the combination of orbital remote sensing data, with Random Forest (RF) and Cokriging (CK) methods to predict soil Cation Exchange Capacity (CEC), in Brazilian semi-arid soils provides satisfactory results.
The semivariogram showed that soil CEC has strong spatial dependence in the study area.Although the cokriging results were satisfactory, improvement in the spatial resolution of covariates may achieve better performance of models in predicting soil CEC in the area.
More research is needed to improve the quality of the covariates used as input data in digital soil mapping.

ACKNOWLEDGMENTS
This study was supported by Embrapa Solos (National Center of Soil Research) and the Federal Rural University of Rio de Janeiro (Soil Department -Agronomy Institute).

Figure 1 .
Figure 1.Study area and location of soil profiles, over Landsat image (band3).

Figure 2 .
Figure 2. Importance of the predictors in models derived from the Random Forest.MSE = Mean Square Error.

Figure 4 .
Figure 4. Plot of observed and estimated values, coefficient of determination (R 2 ), and the root mean square error (RMSE) obtained from models and validation samples.

Figure 5 .
Figure 5. Spatial distribution of soil cation exchange capacity (CEC) estimated by the Random Forest and Cokriging models.

Table 4 .
Descriptive statistics from soil samples, Random Forest, and Cokriging models SD = Standard Deviation; CV = Coefficient of Variation.

Table 3 .
Results from previous studies that used the Random Forest model to predict soil cation exchange capacity Rev Bras Cienc Solo 2018;42:e0170183