A Regional Legacy Soil Dataset for Prediction of Sand and Clay Content with Vis-Nir-Swir, in Southern Brazil

The success of soil prediction by VIS-NIR-SWIR spectroscopy has led to considerable investment in large soil spectral libraries. The aims of this study were 1) to develop a soil VIS-NIR-SWIR spectroscopy approach using legacy soil samples to improve spectral soil information in a regional scale; (2) to compare six spectral preprocessing techniques; and (3) to compare the performance of linear and non-linear multivariate models for prediction of sand and clay content. A total of 1,534 legacy soil samples, stored by Epagri, were collected from agricultural areas in 2009 on a regional scale, covering 260 municipalities of Santa Catarina. Six spectral preprocessing techniques were applied and compared with reflectance spectra (control treatment) in the development of sand and clay prediction models. Five multivariate regression models, Support Vector Machines, Gaussian Process Regression, Cubist, Random Forest, and Partial Least Square Regression were compared. The scatter-corrective preprocessing groups produced similar or better performance than spectral-derivatives. In addition, preprocessing spectra prior to regression analysis does not improve sand prediction, since reflectance spectra achieved the best performance using Cubist, SVM, and PLS models. In general, clay content presented better prediction accuracy than sand content. The best multivariate model to predict sand and clay content from soil VIS-NIR-SWIR spectra was Cubist. The best Cubist performance was achieved combined with reflectance spectra (R = 0.73; root mean square error = 10.60 %; ratio of the performance to the interquartile range = 2.36) and MSC (R = 0.83; root mean square error = 7.29 %; ratio of the performance to the interquartile range = 3.70) for sand and clay content, respectively. Considering the mean RMSE values of the validation set, the predictive ability of the multivariate models decreased in the following order: Cubist>PLS>RF>GPR>SVM for both properties. The predictive ability of VIS-NIR-SWIR reflectance spectroscopy achieved in this study for sand and clay content using legacy soil data and heterogeneous samples confirmed the potential of the spectroscopy approach.


INTRODUCTION
Reflectance spectroscopy in the visible, near and shortwave infrared (VIS-NIR-SWIR) regions has been proposed as a rapid, accurate, and cost-effective model to predict chemical, physical, and mineralogical properties using laboratory, field, and airborne hyperspectral sensors (Vasques et al., 2008;Demattê et al., 2016a;Nouri et al., 2017;Poggio et al., 2017;Viscarra Rossel et al., 2017;Dotto et al., 2018). Soil VIS-NIR-SWIR spectra are non-specific and include weak, wide, and overlapping absorption bands directly linked to soil composition, whereby moisture, particle size, organic matter, and mineralogy of the clay fraction and iron oxides influence spectral behavior (Stenberg et al., 2010).
Different preprocessing techniques have been applied to transform soil spectra, removing noise from the multiple scattering effect, highlighting specific features of the spectra, eliminating redundant information, and preparing the soil spectra for spectral modeling (Rinnan et al., 2009). As reported by Buddenbaum and Steffens (2012), these techniques represent an important step in the multivariate approach and include several algorithms such as smoothing, normalization, scatter-correction, continuum removal, and derivatives. Some studies have reported improvements in the performance of prediction models (Vasques et al., 2008;Nawar et al., 2016;Dotto et al., 2017), while others found similar or better results with no spectral preprocessing (Sawut et al., 2014;Viscarra Rossel and Webster, 2012). The type and amount of required preprocessing techniques are sitespecific (Stenberg et al., 2010) and, with large datasets, the effects of preprocessing steps are not clear (Engel et al., 2013).
Several multivariate models based on VIS-NIR-SWIR have been applied to processing soil spectra in order to mathematically extract meaningful information from individual spectrum to accurately predict chemical and physical soil properties, such as organic carbon/matter, pH, total nitrogen, soil moisture, and cation exchange capacity, among others (Morellos et al., 2016;Demattê et al., 2017;Dotto et al., 2018;Xu et al., 2018). The capacity to predict sand, silt, and clay has also been demonstrated in previous studies (Vendrame et al., 2012;Demattê et al., 2016b;Lacerda et al., 2016;Nawar et al., 2016;Dotto et al., 2017;Santana et al., 2018), but none of them in a regional soil legacy spectral library of subtropical soils in Brazil. Among the multivariate model, the partial least square regression (PLS) is the most common multivariate model used (Dotto et al., 2018), given its simplicity and robustness (Viscarra Rossel et al., 2006;Vasques et al., 2008;Lacerda et al., 2016). However, other studies have established that nonlinear data-mining models such as Support Vector Machines (SVM), Gaussian Process Regression (GPR), and Random Forest (RF) can outperform PLS when used to build predictive models from reflectance spectra (Terra et al., 2015;Nawar et al., 2016;Dotto et al., 2017;Santana et al., 2018). In addition to these models, another data-mining tool based on Cubist regression-rules has been introduced into the spectroscopy approach to predicting soil properties (Minasny and Mcbratney, 2008;Viscarra Rossel and Webster, 2012;Morellos et al., 2016;Viscarra Rossel et al., 2016;Zeng et al., 2017;Sorenson et al., 2018). Minasny and McBratney (2008), Morellos et al. (2016), and Sorenson et al. (2018) used Cubist to build predictive models of soil properties, including clay content, total carbon, total nitrogen, moisture content, and cation exchange capacity, and reported that Cubist provided better results than those provided by PLS.
The success of soil prediction by VIS-NIR-SWIR has led to considerable investment in large soil spectral libraries (Shepherd and Walsh, 2002;Brown et al., 2006;Viscarra Rossel and Webster, 2012). Soil information stored by universities, research centers, and government agencies, among others, could provide an opportunity to enlarge spectral libraries for data-poor regions, new challenges with regard to the reliability of such data and brings understanding to improve future site sampling (Nocita et al., Rev Bras Cienc Solo 2019;43:e0180174 2015;Viscarra Rossel et al., 2016). While several studies have been done to build predictive models for soil properties based on local, regional, and national spectral libraries in Brazil (Bellinaso et al., 2010;Ramirez-Lopes et al., 2013;Araújo et al., 2014;Demattê et al., 2016b;Lacerda et al., 2016), the application of soil reflectance spectroscopy has not been reported on a regional scale in South Brazil, especially in the state of Santa Catarina (SC). In this state, existing studies using soil spectral libraries are limited to local scale with low variability (Dotto et al., 2017(Dotto et al., , 2018. This research aims to fill this gap and to further the use of reflectance spectroscopy for assessing sand and clay content using a legacy soil dataset in subtropical soils based on a regional spectral library. Given the high variability of our regional scale legacy soil dataset, the hypothesis stated is that the performance of the prediction models will rely on a combination of preprocessing, multivariate models, and soil property being predicted. The main objectives of this study were: (1) to explore the potential of a legacy soil dataset with large range variability of sand and clay content on a regional scale, to predict soil properties by a systematic VIS-NIR-SWIR spectroscopy approach; (2) to compare six spectral preprocessing techniques in the development of sand and clay content models; and (3) to compare the performance of linear and non-linear multivariate models for prediction of sand and clay content.

Study area and soil spectral library
The study area covers 260 municipalities (about 90 %) of the state of Santa Catarina ( Figure 1). The state is characterized by its diversity in climate, vegetation, geology, relief, and soil. Santa Catarina (SC) has two climate types according to the Köppen classification system ( Figure 1); super humid and mesothermal (Cfa) and quite humid and mesothermal (Cfb). The remaining original vegetation includes areas of Rain Forest type and five major subtypes, these being Dense Rain Forest, Araucaria Forest, Alpine Grassland, Deciduous Forest, and Coastal Vegetation (Klein, 1978). The geology consists of granitoids, charnockitics, gneisses, and granites in Eastern SC, the Gondwana Plateau  and a basalt plateau in Western SC, with a predominance of basic volcanic rocks and quartz sandstones, with siliceous and argillaceous intercalations (Silva and Bortoluzzi, 1987 Catarina) was used to develop the soil spectral library. A total of 1,534 samples were collected from agricultural areas in 2009 on a regional scale, from within the 0.00-0.50 m soil layer and described in Veiga et al. (2012). Samples were air-dried, ground, sieved to 2 mm, and stored. The sand (0.05-2 mm) and clay (<0.002 mm) content were determined in 2009 according to the pipette standard method (Teixeira et al., 2017). The silt fraction was not considered in this study.

Spectral measurements
The spectral reflectance of soil samples was obtained using a FieldSpec 3 spectroradiometer (Analytical Spectral Devices Inc., Boulder, CO, USA) in the VIS-NIR-SWIR range (350-2500 nm), following standard laboratory procedure of the Brazilian Soil Spectral Library (Romero et al., 2018). The samples were placed in a petri dish and shaken to ensure a smooth surface for spectrum acquisition. The light source was two halogen (50 W) bulbs with the beam non-collimated to the target plane, positioned at a distance of 35 cm from the sample with a zenithal angle of 30°.
The spectral sensor captured the light through a fiber-optic cable connected to the sensor, placed vertically within 8 cm of the sample, where the reflected light in an area of approximately 2 cm 2 at the center of the sample was measured. As a reference standard, a white Spectralon ® was used at the beginning of the measurements and after every 20 readings. Each spectrum measurement was the result of the average of 50 sensor readings. A total of three scans were collected from each sample, rotating the petri dish by 90° for each scan, and these were then averaged to obtain a representative spectrum.

Spectral preprocessing
In this study, six spectral preprocessing techniques were applied and compared in the development of sand and clay prediction models. The techniques were divided into groups of scatter-correction and spectral-derivatives. The first group included: (i) multiplicative scatter correction (MSC), which removes additive and/or multiplicative signal effects (Martens and Naes, 1992); (ii) detrending (DET), to reduce the effect of particle size and additionally remove the linear, or curvilinear, trend of each spectrum (Barnes et al., 1989); and (iii) normalizations by range (NBR), to get all data to approximately the same scale; (iv) continuum removed reflectance (CRR), which removes the continuous features of the spectra and isolates specific absorption features present in the spectrum to minimize noise (Clark and Roush, 1984). The second group is spectral derivatives, represented by (v) Savitzky-Golay first derivative using a first-order polynomial with a search window of 11 nm (FDSG) and (vi) Savitzky-Golay second derivative using a second order polynomial with a search window of 11 nm (SDSG).
The preprocessing techniques ( Figure 2) were selected because they have been shown to improve the performance of spectroscopy models and have different effects on model prediction. They were applied to soil reflectance curves in the range of 350-2,500 nm. Reflectance spectra (RAW) with no spectral preprocessing applied was used as "control treatment". All preprocessing techniques were carried out using R (R Development Core Team, 2017) by applying prospectr ( To better compare the preprocessing techniques, including RAW spectra, the SK test of R 2 and RMSE mean values of the validation set was carried out. The Scott-Knott (SK) test (Scott and Knott, 1974) was used to compare the average values of R 2 and RMSE between different models and preprocessing techniques to verify significant differences between them. The SK performs a hierarchical cluster analysis approach used to partition treatment into distinct homogeneous groups by minimizing variation within groups and maximizing variation between groups (Scott and Knott, 1974). The cluster procedure begins with the whole group of observed mean effects and then divides and keeps dividing subgroups in such a way that the intersection of any of the two formed groups remains empty (Jelihovschi et al., 2014). It was applied using the ScottKnott package (Jelihovschi et al., 2014).

Multivariate models
To compare the performance of the proposed multivariate regression models using a regional spectral library, we compared to two supervised learning algorithms with linear kernel function, Support Vector Machines (SVM) and Gaussian Process Regression (GPR), two tree-based models, Cubist and Random Forest (RF), and one of the most common linear model used in the spectroscopy approach, Partial Least Square Regression (PLS).
The regression process was implemented based on the measured reflectance spectra (RAW and six spectral preprocessing techniques) and the measured values of sand and clay content using the training set. The predictive models were assessed for each soil property using the independent validation set. Only the best predictive model, laboratory-measured versus VIS-NIR-SWIR-predicted values of sand and clay content, will be plotted. The modeling was performed using several packages in R (R Development Core Team, 2017) and the parameters of each model were manually optimized to generate the best possible fit between the variables and outputs.
The multivariate model used were Cubist, regression-rules model (Holmes et al., 1999), Random Forest (RF) as an ensemble learning model (Breiman, 2001), and Support Vector Machine (SVM) as a machine technique based on statistical learning theory (Vapnik, 1995).
Gaussian Process Regression (GPR) as a probabilistic, non-parametric Bayesian approach, and PLSR (Wold et al., 2001). The RMSE was used in this study to identify the number of latent factors and leave-one-out cross-validation was used for the model training set to verify prediction performance (Boos, 2003) and to prevent over or under-fitting the data in the training step. From the total dataset (n = 1,534), 75 % were separated at random into the training set (n = 1,151), to create the regression models, while the remaining 25 % (n = 383, validation set) were used to independently validate the models. To check the reliability of splitting of each subset, the Levene's test and Student's t-test were applied to verify the equality of variances and means, respectively. The coefficient of determination (R 2 ), root mean square error (RMSE), and the ratio of performance to the interquartile range (RPIQ) were used to assess the performance of sand and clay content prediction models, using equations (1), (2), and (3), respectively. The R 2 provides the proportion of the variance explained by the model. The RMSE provides the accuracy of predictions, giving the standard deviation of the model prediction error in the same units as the attributes. The RPIQ was used instead of the ratio of performance deviation (RPD) because it is based on quartile range and better represents the spread of the population for skewed distributions ( Bellon-Maurel et al., 2010). The highest R 2 (Equation 1) and RPIQ (Equation 2) values and the lowest RMSE (Equation 3) value of the validation set were used to determine the selection of the best spectral preprocessing and multivariate regression models.
where ŷ i is the value predicted by the model; y i is the measured value; ȳ i is the average value; and N is the number of samples. where IQ is the interquartile distance (IQ = Q3-Q1) of the observed values, which accounts for 50 % of the population around the median.

RPIQ
where N is the number of samples used in the prediction; and ŷ i and y i are the values of predicted and measured soil properties, respectively.

RESULTS
The samples under study presented a wide variation with sand and clay contents (Table 1), indicating great variability in terms of particle size distribution. The independent training and validation sets showed Levene's test p-value of 0.609 and of 0.175 for sand and clay content values in the training and validation sets, respectively. According to Student's t test sand (p-value = 0.179) and clay (p-value = 0.435), did not show a significant difference at a 5 % significance level, for the training and validation sets, respectively.
The effect of six preprocessing techniques fluctuated between models (Tables 2 and  3), so it is difficult to reach a clear conclusion as to whether the differences between the average values of R 2 and RMSE among preprocessing techniques are significant. It was observed that there was no statistical difference between the mean values of R 2 and RMSE for RAW spectra, NBR, MSC, DET, CRR, and FDSG for sand and clay prediction models (Figure 3a). In other words, on average, RAW spectra, NBR, MSC, DET, CRR, and FDSG preprocessing techniques had an equal effect on model performance to quantify sand and clay content, and they clearly perform better than SDSG preprocessing. Thus, RAW spectra and MSC preprocessing are the best strategies for sand and clay content, as they consistently simplify the models. The worst results for both R 2 and RMSE were found with SDSG preprocessing (Figure 3b). There is significant variability across SDSG results. The poorest result was achieved by SDSG-SVM, which presented the lowest R 2 and the highest RMSE (0.19 and 27.94 % for sand and 0.28 and 23.13 % for clay).
The predictive model performance of sand content presented R 2 , RMSE, and RPIQ values ranging from 0.19 to 0.73, 10.60 to 27.9 %, and 0.9 to 2.4, respectively ( Table 2). The predictive model performance for clay content presented R 2 , RMSE, and RPIQ values ranging from 0.28 to 0.83, 7.3 to 23.1 %, and 1.2 to 3.7, respectively. For sand, the SK test showed that there was no statistical difference between R 2 and RMSE mean values of the five models applied, using a 5 % significant level. For clay, the SK test showed  (Figure 4).

Effects of the preprocessing techniques on modeling
In general, model performances (Tables 2 and 3) decreased as the noise resulting from preprocessing increased (from RAW to SDSG). The derivatives emphasize noise in the data more distinctly than the other methods (Buddenbaum and Steffens, 2012). Even though extremely flat structures can be evaluated with spectral derivatives (FDSG and SDSG), this also tends to increase spectral noise (García-Sánchez et al., 2017), especially by second spectral derivatives (Buddenbaum and Steffens, 2012). For sand, RF (with FDSG and SDSG) did not appear to be sensitive to enhanced noise in the spectrum (Breiman, 2001), revealing its capability to better handle derivative transformation (Dotto et al., 2018). However, SDSG decreased model performance of the remaining multivariate models for the two soil properties studied. These results with spectral derivatives and ensemble-learning algorithms (RF) are in agreement with Vasques et al. (2008), Pinheiro et al. (2017), Dotto et al. (2018), and Santana et al. (2018).
Considering the performance of RAW, preprocessing the spectra before regression analysis did not improve sand prediction, with the exception when used with RF models. Therefore, spectral reflectance values without preprocessing (RAW spectra) were sufficient to obtain highly accurate models, and our results show that there is no need to perform any preprocessing technique on the spectra to generate better prediction models for sand in the VIS-NIR-SWIR region. The results achieved in the current study are divergent from Franceschini et al. (2013), which found high performance of the sand model (R 2 = 0.87) with applying spectral preprocessing. Considering the GPR model, only a slight benefit was found using preprocessing (NBR) on the spectra. This finding is in agreement with Duda et al. (2017), who reported no significant improvement in results with first derivative preprocessing compared with RAW spectra using the SVM model. They combined two proximal sensor approaches relative to a single sensor, to compare the efficacy in determining soil properties, sand and clay content, among others, in a catena scale in Eastern Europe. Sawut et al. (2014) reported a small influence of preprocessing on spectral analysis for the prediction of sand content in a thermal infrared region. All these studies worked with a smaller number of samples and range variability of the sand content than ours.
Preprocessing results for FDSG derivatives are in agreement with those found by Bilgili et al. (2010), Pinheiro et al. (2017), and Duda et al. (2017), who reported different R 2 values (0.84, 0.62, and 0.25, respectively). Spectral preprocessing may emphasize the feature sought in the spectra and several authors have noted its benefit (Vasques et al., 2008;Nawar et al., 2016;Dotto et al., 2018). Sand fractions, tends to have quartz as the dominant mineral (Demattê et al., 2007) which has no diagnostic spectral features in the VIS-NIR-SWIR ranges, high values of reflectance intensity (albedo), and its reflectance   spectrum is largely unvarying (Hunt and Salisbury, 1970;Clark, 1999;Ramirez-Lopes et al., 2013;Wight et al., 2016). Preprocessing techniques are designed for baseline corrections, so their effect is minimal when the baseline variance is small (Buddenbaum and Steffens, 2012). In general, sandy soils exhibit similar spectral behavior of quartz, and this may explain why no preprocessing technique was very helpful in improving sand model performance in the present study.
For clay, MSC and NBR achieved the best performance with the SVM, GPR, and PLS models. These preprocessing techniques are normalization procedures commonly used to compensate for baseline shift and multiplicative effects in the spectral data, which are induced by physical effects such as particle size (Martens and Naes, 1992;Rinnan et al., 2009;García-Sánchez et al., 2017). The MSC, which attempts to eliminate the effects of the spectrum by linearizing each spectrum by the average spectrum of the sample, is the most popular normalization technique (Martens and Naes, 1992). In NBR, each spectrum is divided by the range. In this study, MSC (R 2 = 0.70, RMSE = 11.0 %, RPIQ = 2.4) exhibited slightly better performance than NBR (R 2 = 0.67, RMSE = 11.7 %, RPIQ = 2.1), when combined with Cubist model (Table 2), albeit with the same performance as RAW spectra (control treatment). The NBR produced the best model result for the SVM, GPR, and PLS models.
The CRR presented inferior performance than expected, given that other studies commonly report this technique as an effective VIS-NIR-SWIR data preprocessing technique (Lagacherie et al., 2008;Nawar et al., 2016;Dotto et al., 2017Dotto et al., , 2018. In general, the scatter-corrective group gives similar, or better, performance than spectral-derivatives for sand and clay models. These results are in agreement with Dotto et al. (2017), who also reported better performance of the models with scatter-corrective preprocessing techniques compared to spectral-derivatives to predict soil organic carbon using a local spectral library from SC. All multivariate models applied in the present study achieved different performance with two spectral preprocessing groups. For clay content, the accuracy of the prediction models appears to depend on spectral preprocessing, except for Cubist model that achieved the same performance of prediction with and without spectral preprocessing technique. For sand, this was not so evident. The SVM seems to be more sensitive to spectral preprocessing applications, since R 2 dropped from 0.67 to 0.19 and from 0.77 to 0.28 for sand and clay content, respectively. The RMSE showed the inverse trend.

Effects of the multivariate models
Considering the performance of the models, Lacerda et al. (2016) developed a PLS prediction model to quantify soil texture from 3,750 soil samples using the topossequence model from three areas of São Paulo State, Brazil. These authors found predictions values in the validation set for sand (R 2 = 0.96, RMSE = 137.98 g kg -1 ) and clay (R 2 = 0.93, RMSE = 82.50 g kg -1 ) content. The R 2 and RMSE values are higher than those found in this study. The lowest R 2 found with the SC dataset can be explained in terms of the high heterogeneity of soil samples collected throughout the state. The soil samples used in this study are legacy soil samples and were sampled for another purpose than this spectroscopy study. Further, they were collected from the depth soil layer (0.00-0.50 m) showing great variability of the sand and clay content into that depth in soils with texture gradient. On the other hand, legacy soil samples bring new challenges and lead the techniques to limit.
The scatter plot ( Figure 5) of laboratory-measured versus VIS-NIR-SWIR-predicted values of sand and clay content based on Cubist model using the validation set showed quite low dispersion, with most of the values distributed close to the 1:1 line (red line), with small slope and intercept values, indicating good fit (Viscarra Rossel and Webster, 2012).
The Cubist model showed the best performance in comparison with the other models for the prediction of clay content (Table 3). The R 2 values of the validation set ranged from 0.69 to 0.83, whereas RMSE ranged from 9.79 to 7.29 %. This confirms the superior performance of Cubist in predicting soil properties, as stated in the literature (Minasny and Mcbratney, 2008;Viscarra Rossel and Webster, 2012;Stevens et al., 2013;Morellos et al., 2016;Viscarra Rossel et al., 2016). This result is in line with results reported by Viscarra Rossel and Webster (2012), who used Cubist to predict 24 soil properties, including sand, silt, and clay content, using a large soil dataset (n = 21,493) from all the states in Australia and found RMSE = 12.00 % and RMSE = 8.49 % for sand and clay content, respectively. They concluded that the rule-based model predicts sand and clay content well, working effectively with large and diverse datasets. A study presented by Minasny and McBratney (2008) showed that Cubist produced the best fit model (R 2 = 0.92) and lowest error (RMSE = 7.18 %) when compared to PLS and another data-mining model, Treenet, using mid-infrared (2500-25000 nm) spectra of soil samples from Australia.
The PLS is the most linear common multivariate model for quantitative spectroscopy analysis in soil. This model is based on the decomposition of spectral data into latent variables that capture most of the variance existing in the spectrum, and linear models are then created using the scores of the most correlated features (Morellos et al., 2016). However, in PLS, non-linear relationships can only be modeled in a limited way, and the model is a linear function of all wavenumbers, whereas in regression-rule models like Cubist, these non-linearities can be efficiently modeled using a set of comprehensible linear equations (Minasny and Mcbratney, 2008).
For clay, the SK test results showed that the models were divided into two groups (blue and dark red color groups, Figure 4), of which Cubist presented the best performance. However, for the mean RMSE values, there was no statistical difference in the SK test (α = 10 %). The prediction quality decreased from Cubist to SVM, with mean RMSE values increasing from 8.06 to 12.13 %, suggesting different performances between the models. These results again demonstrated the better performance of the Cubist model in comparison to the most common algorithms used in spectroscopy analysis.

CONCLUSIONS
The performance of preprocessing techniques fluctuated between models. In addition, spectra preprocessing before regression analysis does not improve sand prediction. Scatter-corrective preprocessing groups (MSC, NBR, DET, and CRR) produced similar or better performance than spectral-derivatives (FDSG and SDSG). Spectral-derivatives only showed better results with RF models for both attributes. The best multivariate model to predict sand and clay content from soil VIS-NIR-SWIR spectra was Cubist.
The predictive ability of the multivariate models decreased in the following order: Cubist>PLS>RF>GPR>SVM for sand and clay.
These legacy soil data can produce reliable information, and it can be used to populate a soil database as input to soil monitoring, as a primary reference to establish standards of the spectral behavior of soils in the Santa Catarina state.