Multiple linear regression and random forest to predict and map soil properties using data from portable X-ray fluorescence spectrometer (pXRF).

A determinacao de atributos do solo auxilia no correto manejo da sua fertilidade. O equipamento portatil de fluorescencia de raios-X (pXRF) foi recentemente adotado para determinar o teor total de elementos quimicos em solos, permitindo inferencias sobre atributos do solo. No entanto, esses estudos ainda sao escassos no Brasil e em outros paises. Os objetivos deste trabalho foram prever atributos do solo a partir de dados do pXRF, comparando-se os metodos de regressao linear multipla stepwise (SMLR) e de random forest (RF), alem de mapear e validar atributos do solo. 120 amostras de solo foram coletadas em tres profundidades e submetidas a analises laboratoriais. Utilizou-se o pXRF para leitura das amostras e determinou-se o teor total de elementos. A partir dos dados do pXRF, foram utilizadas SMLR e RF para predizer resultados laboratoriais, que refletem atributos do solo, e os modelos foram validados. O melhor metodo foi utilizado para espacializar os atributos do solo. Utilizando SMLR, os modelos apresentaram valores elevados de R² (≥0,8), porem maior acuracia foi obtida na modelagem com RF. A capacidade de troca de cations potencial e efetiva, materia orgânica do solo, pH, saturacao por bases e teores trocaveis de Ca, Al e Mg apresentaram ajustes adequados e predicoes acuradas com RF. Dos dez atributos do solo preditos por RF a partir de dados do pXRF, sete apresentavam CaO como a variavel mais importante para auxiliar as predicoes, seguido por P2O5, Zn e Cr. Os mapas gerados a partir de dados do pXRF usando RF apresentaram adequados valores de R² para seis atributos do solo, atingindo R2 de ate 0,83. O pXRF em associacao com RF pode ser usado para prever atributos do solo com elevada acuracia, com rapidez e a baixo custo, alem de proporcionar variaveis que auxiliam o mapeamento digital de solos.


INTRODUCTION
Soils present diverse physical, chemical, mineralogical and biological properties, which influence their diverse potentialities of use (Birkeland, 1999;Resende et al., 2014;Schaetzl;Anderson, 2005), such as plant growth. The characterization of those properties is of great importance for the proper management and conservation of soils (Severiano et al., 2009). For that, several laboratory analyses of different levels of complexity are employed, which helps making decisions on the correct management required according to the needs of the crops, so that the agricultural production may be increased (Lopes;Guilherme, 2016).
On the other hand, carrying out laboratory tests in a large number of samples requires more time and financial resources, as well as chemical reagents, which generate residues. Thus, the use of tools that quickly allow the evaluation of soil properties, at low cost and without residues production may facilitate the evaluation of more samples to characterize soils in more detail and for different purposes.
Portable X-ray fluorescence spectrometer (pXRF) is a tool used in works of several fields of study for identification and quantification of chemical elements present in varied materials (Ioannides et al., 2016;Milić, 2014;Peinado et al., 2010;Rouillon;Taylor, 2016;Terra et al., 2014;Weindorf;Zhang, 2011). This equipment emits high-energy X-ray beams, which cause the displacement of electrons from inner to outer orbits as they hit the atoms of the elements in the sample. In sequence, the displaced electrons return to their original orbits emitting a fluorescence characteristic of each chemical element, as it is related to the element atomic number. Thus, in a few seconds the equipment is able to determine the total contents of elements of the Periodic  Table between Mg and U, allowing its use both in the field and in the laboratory (Weindorf;Bakr;Zhu, 2014).
The pXRF generates a large data set, which may slow down their analyses and interpretation in detail. In this sense, the use of machine learning tools may accelerate the identification of data for characterizing soils. Several methods of analyzing large amount of data of both continuous and categorical variables have been used in works of various natures, such as the stepwise multiple linear regression (SMLR) (Juhos;Szabó;Ladányi, 2015;Menezes et al., 2016;Rodrigues;Corá;Fernandes, 2012). This analysis adjusts regression models from easily obtained variables to estimate data more difficult to be acquired, in which the addition or removal of predictive variables to the model is performed based on statistical tests, generating a final equation. Weindorf et al. (2012) evaluated the pXRF to discriminate spodic and albic horizons in the field, using SMLR to estimate organic carbon data from pXRF data, concluding that the equipment was adequate, contributing to the rapid generation of chemical data.
Another method that has been increasingly used for predictions is the so-called random forest (RF) (Breiman, 2001). This algorithm presents as advantages the possibility of using both numerical and categorical variables, modeling non-linear relationships, assessment of the importance of each variable for the generation of the final model, calculation of modeling errors, among others (Breiman, 2001;Wiener, 2002). However, despite of classifying the variables according to their importance to the model (Archer;Kimes, 2008), this method does not generate a final equation of the model, as opposed to SMLR. Therefore, it is sometimes referred to as a black-box method (Grimm et al., 2008), although some works have pointed out that this method is robust and provides better results than other methods for both spatial and non-spatial predictions (Hengl et al., 2015;Lies;Glaser;Huwe, 2012;Souza et al., 2016).
In recent years, most works involving digital soil mapping has been based on continuous variables for the area of interest, such as satellite images and digital elevation models and their derivatives (e.g. slope, topographic wetness index, curvatures, etc.), to spatialize soils information (Adhikari et al., 2014;Giasson et al., 2015;Menezes et al., 2014;Silva et al., 2016a;Taghizadeh-Mehrjardi et al., 2015). However, when working in smaller areas, mainly in developing countries, it is common to face difficulties in obtaining data with high spatial resolution, which tends to make the use of these variables unfeasible. In this sense, pXRF can be an alternative to obtain a large amount of punctual data that, after being spatialized, may contribute to spatial predictions (Silva et al., 2016b).
In spite of the advantages of using pXRF to analyze elemental composition of materials, very few works have used pXRF in Brazil and in other developing countries for studies with a variety of purposes, mainly regarding soils. In this sense, due to the search for methods to obtain soils information in rapid and economical ways, the objectives of this work were: (i) to predict results of laboratory analysis through SMLR and RF from data generated by pXRF, validating the generated models and; (ii) to evaluate the potential of pXRF to aid spatial prediction of analytical results, reflecting soil properties, generating and validating maps of soil properties.
The study area is occupied by typic Dystrophic Haplic Cambisols (95% of the area) followed by typic Dystrophic Regolithic Neosols (5%), classified using the Brazilian Soil Classification System (Embrapa, 2013), both with gravels, developed from metapelitic rocks. Soil samples were collected at three depths: 0 to 10 cm, 10 to 20 cm and 20 to 40 cm, at 40 places randomly distributed in the area, making up a total of 120 samples (Figure 1).

Laboratory analyses
Soil samples were air dried, passed through a 2 mm sieve and analyzed in the laboratory where the following soil properties were determined: soil pH in water, exchangeable contents of Ca 2+ , Mg 2+ and Al 3+ (Mclean et al., 1958), available K extracted with Mehlich-l, soil organic matter (OM) (Walkley;Black, 1934), remaining P (P-rem) (Alvarez; Fonseca, 1990), potential (T) and effective (t) cation exchange capacity (CEC), and base saturation (V).
The samples were also analyzed in the laboratory with the pXRF of Bruker model S1 Titan LE. This equipment contains 50 kV and 100 μA X-ray tubes. The software used was GeoChem, in the Trace (dual soil) configuration, recommended for soils, for 60 seconds, including two X-ray beams. The 120 samples collected were subjected to analysis in triplicate by pXRF and the accuracy of the equipment was evaluated through scanning standard reference materials 2710a and 2711a certified by the National Institute of Standards and Technology (NIST) as well as scanning an equipment standard sample (check sample -CS). From the NIST and CS certified samples, the recovery of the element contents obtained by pXRF (% of recovery = 100 x Obtained content / Total certified content) were calculated. The recovery percentages of the samples are presented in Table 1 only for the elements that were identified in all the samples of this work.

Analysis of data and modeling
The results of the laboratory analyses were submitted to descriptive statistics, in the three soil depths evaluated, to obtain the average, maximum and minimum values, standard deviation and coefficient of variation (CV). From the data of the pXRF, models were adjusted to predict the following soil properties: exchangeable Ca 2+ , Mg 2+ , K + , Al 3+ , P-rem, pH, potential CEC (T), effective CEC (t), soil organic matter (OM) and base saturation (V).  Soil samples were randomly separated into modeling and validation data sets, respectively, consisting of 75% and 25% of the total data. Also, the samples were subdivided and modeled in two ways: i) specific models, according to the three depths of sampling, with n = 40 for each depth, with 30 samples for modeling and 10 for validation; and ii) general model, including all samples (n = 120, 90 for modeling and 30 for validation).
In order to adjust the models for predicting soil property results from the pXRF data, two methods were tested: stepwise multiple linear regression (SMLR) and random forest algorithm (RF). The SMLR was generated through SigmaPlot software, backward method, in which the least important variables for model adjustment are removed, with 95% probability.
The random forest analysis was performed in R software, randomForest package (Liaw and Wiener, 2015), with the following parameters established: number of trees of the model (ntrees) = 1000, number of variables in each node (nodesize) = 5, and number of variables used in each tree (mtry) = one third of the total number of samples, as suggested by Liaw and Wiener (2002) for regression random forests.
The random forest adjustment results in the mean square of the residuals (MSE oob ), the percentage of the variance explained by the model and the importance of all the variables of the model in the prediction of the data, by the out-of-bag method. MSE oob is calculated when, for each iteration, only a few predictor variables are used to generate a tree. The MSE oob is calculated through Equation 1. The importance of the variables, also obtained by the algorithm, is a result of the average of the reduction of the accuracy in the prediction as one variable is left out of the model while the other variables are included. Thus, if a variable is removed, the more the prediction error increases, which means, the accuracy of the prediction decreases, the more important that variable is for the model adjustment (Breiman, 2001;Wiener, 2002). used in the modeling), consisting of 25% of the total data, to determine if the predictions by the models are valid for other observed data. For this, the estimated values for each sample of the independent subset were determined and the accuracy of the models was evaluated through the following statistical indices: coefficient of determination (R 2 ), adjusted R 2 (R 2 adj ) in relation to observed and estimated data, root mean square error (RMSE), and mean error (ME), according to Equations 2 and 3: in which y i is the real (observed) value, y i OOB is the mean of the predictions of OOB for the i th observation, n is the number of trees.

Accuracy of the models
The validation of the general and specific (per depth) models generated by SMLR and random forest was performed using the independent subset of data (not where n: number of observations, ei: values estimated by the model, and mi: values observed through laboratory analysis. The efficiency of the modeling methods (SMLR and random forest) was carried out in addition to the determination of the analytical results capable of being predicted with greater accuracy from the generated models. In this sense, the models that obtained the highest values of R 2 and R 2 adj and the smallest RMSE and ME comparing observed with estimated data were considered the best for prediction of the results of laboratory analysis from the pXRF data.

Spatial prediction of soil properties from pXRF data
From the definition of the best method for modeling, the laboratory results that presented high accuracy of predictions were spatialized for the entire study area. This procedure aimed to evaluate the possibility of using pXRF data as a basis for mapping soil properties (Duda et al., 2017;Silva et al., 2016b), providing easily obtainable variables, at low cost, rapidly and with no generation of chemical residues.
First in this procedure it was necessary to spatialize the variables obtained by pXRF for the entire study area, since the soil properties prediction models are based on pXRF data, which, in turn, only refer to the sites at where samples were collected. In order to do so, the inverse distance weighting (IDW) method was employed in the spatialization of the pXRF variables, allowing their subsequent use for mapping. The values inferred at non-sampled areas by IDW are estimated using linear combination of values at the sampled places, weighted by an inverse function of the distance from the point of interest to the sample points. The weights (λ i ) are expressed in Equation 4: where d i is the distance between two points, p is a power parameter, and n represents the number of sampled points used for the estimation.
The predicted maps of the soil properties were also validated with 25% of the samples (not used for the modeling) through the R 2 , R 2 adj , RMSE, ME and 1: 1 graphs (observed vs. estimated data).

Descriptive statistics
The analytical data of the samples used for modeling and validation are presented in Table 2. There is great variability of values of all evaluated soil properties, as demonstrated by the high coefficients of variation, in both modeling and validation data sets. This occurred, as expected, due to the different land uses of the area and soil management practices, ranging from native vegetation, where pH and nutrient contents are lower since no anthropic influence occurs, to cultivated areas, where these values are higher because of liming and fertilizers application. This variability of data can contribute to the generation of more reliable models with possible use for soils with different conditions, since the used values contemplate a wide range of values of the analyzed properties, such as P-rem varying from 10.8 to 47.1 mg dm -3 , and pH, from 4.4 to 7.7.
As a general trend, the exchangeable/available nutrient contents as well as pH, OM, T, t, and V decreased from the surface to the subsurface, contrary to the exchangeable Al that increases in depth. These facts are in agreement with the fertilizer applications and liming practices, which are carried out on the more superficial soil layers.Furthermore, although liming decreases the content of exchangeable Al, this product moves very little in depth in the soil, thus, its corrective effect is more concentrated on the layer in which it is incorporated (Alvarez; Ribeiro, 1999).    Table 3 presents the descriptive statistics for the pXRF data.

Modeling soil properties through stepwise multiple linear regression
Analyzing Figure 2, which shows the R 2 values from the SMLR models, it is noticed that high values were found with at least one model obtaining R 2 greater than 0.8 for all of the soil properties, except for T. Among the three depths, 0 to 10 and 20 to 40 presented, in general, higher values than 10 to 20. The latter only presented better adjustment for OM and t. These values indicate the potentiality of using pXRF to provide variables for adjusting prediction equations of soil properties in tropical regions. Works such as Sharma et al. (2015), who used pXRF data to perform CTC prediction in soils of the United States, obtaining adequate results using SMLR, corroborate the appropriate soil property predictions from pXRF data.   The general equation presented lower R 2 values in most cases, which may be due to the greater heterogeneity of the samples used in this modeling. In contrast, these results demonstrate that adjusting equations according to the depth of sampling tends to provide better models using SMLR. Souza et al. (2016) used SMLR for bulk density prediction, comparing models created only for A horizon, only for B horizon and a general one, encompassing the two horizons, and also obtained better adjustments for the equations for horizons separately in relation to the general model. Table 4 shows the equations with R² values greater than or equal to 0.80 generated for 0 to 10 cm, 10 to 20 cm, 20 to 40 cm. The general models did not reach R 2 of 0.80. It is noticed that, in the 17 equations presented, the CaO content was the one that appeared more often (15 equations), followed by SiO 2 (12 equations) and Fe (11 equations). Also, for 0 to 10 cm, 8 equations presented R 2 of at least 0.80, against 4 for 10 to 20 cm and 5 for 20 to 40 cm. Thus, these equations indicate that exchangeable Ca, Mg, K, Al as well as P-rem, pH, t, V(%) and OM could be adequately modeled by SMLR using pXRF data. Table 5 shows the results of random forest modeling: MSE oob and the percentage of variance explained for each model (general and specific). With the exception of Mg, it is observed that the percentage of the variance explained by the models for the analyzed soil properties decreased in depth, being greater, therefore, for the 0 to 10 cm depth and smaller for the 20 to 40 cm depth. However, the general model was the one that presented the lowest MSE oob and the highest percentages of the explained variance. This may be due to the greater amount of data used for this model (n = 90) relative to the models for only one depth (n = 30). This result is contrary to that found with SMLR modeling, in which the general models were mostly worse than the specific ones (by depth). Carvalho Junior et al. (2016) compared SMLR models generated with different sets of variables and number of samples to estimate the bulk density and noticed that R 2 values were lower for the models with greater amount of samples. In the same work, they noticed that the models generated by random forest with greater amount of data presented better adjustments than those with smaller amount of data, in agreement with the findings of this work. Table 5 indicates that the soil properties most explained by the general and specific random forest models were base saturation, exchangeable Ca and Al, and pH, whereas OM and T were the least explained. K was the variable with the highest MSE oob , indicating larger prediction errors (to be confirmed by the validation of the models).

Validation of models generated with stepwise multiple linear regression and random forest
The R² values resulted from the comparison between the observed and estimated values generated by SMLR and random forest for the validation of samples are presented in Tables 6 and 7. It is noted that the highest R² values were obtained in predictions with random forest rather than with SMLR for the soil properties, except for K. Available K was also the predicted soil property that presented the highest MSE oob values in the modeling phase (Table 5). Differences were verified between the R 2 values of the validation of the analyzed properties prediction with random forest models    and SMLR, especially exchangeable Ca and Al, pH and t, being them better predicted with random forest. This suggests that random forest presents greater potential for estimating analytical results, reflecting soil properties, from pXRF data. RMSE, ME and R 2 adj , presented in Tables 6 and 7, corroborate the best predictions with random forest in relation to the SMLR. Souza et al. (2016) compared model adjustments for predicting bulk density using SMLR and random forest, also obtaining better results with random forest, in consonance with this work.
In the validation of the models obtained with SMLR, only pH of the 20 to 40 cm depth model and of the general model, exchangeable Al in general model, and t and OM by the 20 to 40 cm depth model presented RMSE values lower than 1.0, while in the validation by random forest, Ca, pH, Al , Mg, t and OM showed values lower than that at all depths and in the general model (Tables 6  and 7). The absolute values of ME were also mostly smaller for the validation of the random forest models in relation to the SMLR models.   Table 5: Mean error of prediction by the out-of-bag method (MSE oob ) and percentage of the explained variance of the models originated using the random forest algorithm. 1 Ca, Mg, Al, T and t in cmol c dm -3 ; P-rem and K in mg dm -3 ; OM in dag kg -1 ; V in %; 2 Var exp (%) = percentage of the variance of the models explained.

Importance of variables
By analyzing the importance of the variables for the explanation of the data with random forest, eight out of the ten soil properties predicted through pXRF data had CaO as the most important variable (Table 8, Figure 3) and, among these ten soil properties, base saturation, and exchangeable Al and Ca had Cr as the second most important variable. P 2 O 5 was the most important variable to predict OM, followed by Zn, whereas SiO 2 was the most important to predict T, with P 2 O 5 as the second most important variable. Aldabaa et al. (2015) used pXRF, remote sensing data and visible infrared diffuse reflectance spectroscopy (VisNIR DRS) to predict values of electrical conductivity and verified that, among the pXRF variables, Cl and S were the most important elements for predictions.
The frequency that each pXRF variable appeared in the first three positions of importance for the predictions shows that CaO was the one that appeared most (8 times), followed by P 2 O 5, Zn, and Cr (6 times each), SiO 2 (2), Sr (1) and Fe (1). Figure 3 shows the values of importance for the main variables to help predict soil properties in order to show the greater importance of CaO in relation to the other important variables.
In contrast to the most important variables, the ones that appeared more often in the last three positions were Al 2 O 3 (7), Cu (4), Cl, Zr, V and Ti (3 times each), K 2 O (2), and SiO 2 , Mn Cr, Fe and Ni (1 each). It is worth noticing that Al 2 O 3 may have not been an important contributor to the prediction of exchangeable Al since the pXRF obtains total element contents, including both the exchangeable Al and the Al stuck in the structure of soil minerals. However, as the study area has managed areas, the exchangeable Al content is quite variable (Table 2), even having little variation of total Al contents as obtained by pXRF (Table  3), which may have hampered the models. Similar trends can be inferred for available K.

Mapping soil properties with random forest through pXRF data
Using random forest, which obtained better modeling and validation results than the SMLR, maps of some well predicted soil properties for the 0 to 10 cm layer were prepared and validated (Figure 4). The maps show that the highest contents of plant nutrients Ca and Mg, higher levels of OM, V, t, higher pH and lower exchangeable Al content were found in the areas of cultivated coffee, with the oldest crop being the one with better soil chemical conditions for plant development (only considering the chemical soil properties predicted here). Under eucalyptus plantation, the nutrient contents are lower, since this area was fertilized only at the moment of implantation, 5 years earlier the sampling. The areas with the lowest nutrient contents and pH are under native forest and native cerrado grasses, which do not present anthropogenic intervention, and reflect the high degree of weathering-leaching of these Brazilian cerrado soils (Resende et al., 2014).         Validation of these maps with an external set of samples (n = 10) resulted in 1:1 graphics between predicted and observed values of the soil properties ( Figure 5). For most predicted soil properties, high R 2 and R 2 adj values were found, except for available Mg, which had a R 2 of 0.30. For exchangeable Al (R 2 = 0.83), P-rem (R 2 = 0.80), exchangeable Ca (R 2 = 0.78) and t (R 2 = 0.73), adequate spatialized predictions were found, followed by base saturation (R 2 = 0.67), OM (R 2 = 0.66), and pH (R 2 = 0.54). These results indicate that, although this mapping procedure has accumulated errors, first on the spatialization of the pXRF variables by the IDW, and then during the random forest modeling and predictions, most generated soil property maps presented satisfactory accuracy. This demonstrates the potential of using pXRF as a source of variables to help predict soil properties also spatially, mainly in areas that lack continuous information in greater detail (e.g., digital elevation model and its derivatives), as it is the case of the study area of this work. In addition, by providing results quickly and inexpensively, it may favor gathering more observations (points visited) in the field and also, through predictions, reduce the number of laboratory analyses. The use of pXRF to improve spatial and non-spatial soil predictions was also found by Silva et al. (2016b), who used magnetic susceptibility and pXRF data, as well as continuous variables derived from digital elevation model for soil classes and properties prediction in Brazil, finding that magnetic susceptibility and pXRF data increased the models accuracy when associated with terrain data. Weindorf; Bakr; Zhu (2014), after presenting examples of correlations among the element contents obtained by pXRF and results of laboratory analysis, suggested that many works using this equipment would be performed focusing on predicting soil properties in the years to come. Here we demonstrated the potential of this equipment for predicting soil properties also in Brazilian soils, in accordance with Piikki et al. (2016), who used pXRF coupled with three other sensors to predict results of laboratory soil analyses in Kenya, observing that pXRF was frequently employed in good models. Sharma et al. (2014) used pXRF data to predict soil pH from linear regressions. Data collection through pXRF in this work was carried out in the laboratory; however, the use of this equipment in the field can accelerate the acquisition of data that is more difficult to be obtained, through adjustment of models with data from pXRF scanning in the field. Stockmann et al. (2016) evaluated the concentration of elements in soil profiles to infer about their parent materials using the pXRF in the field, in addition to making a comparison with the data obtained in the laboratory. In this way, future tests in this line of research are suggested for tropical soils, since the pXRF in association with robust algorithms can increase the amount of data on soils in Brazil both spatially and punctually, providing results rapidly, at low cost and without generation of chemical residues.

CONCLUSIONS
Soil properties such as exchangeable Ca, Mg, Al, pH, organic matter, base saturation, potential and effective CEC and P-rem could be predicted with high accuracy by random forest from the data obtained by pXRF, surpassing the predictions made by stepwise multiple linear regression. The variables obtained by pXRF allowed the spatial prediction of soil properties related to soil fertility, leading to the generation of accurate maps, which demonstrates the potential of pXRF to be used as a source of variables to help spatial prediction of soil properties rapidly, at low cost and without generating residues.