
Multiple linear regression and random forest to predict and map soil properties using data from portable X-ray fluorescence spectrometer (pXRF)

Regressão linear múltipla e random forest para predição e mapeamento de atributos do solo utilizando dados de espectrômetro portátil de fluorescência de raios-X (pXRF)

ABSTRACT

Determination of soil properties helps in the correct management of soil fertility. The portable X-ray fluorescence spectrometer (pXRF) has recently been adopted to determine total chemical element contents in soils, allowing inferences about soil properties. However, such studies are still scarce in Brazil and other countries. The objectives of this work were to predict soil properties using pXRF data, comparing stepwise multiple linear regression (SMLR) and random forest (RF) methods, and to map and validate soil properties. A total of 120 soil samples were collected at three depths and submitted to laboratory analyses. The samples were scanned with pXRF to determine total element contents. From the pXRF data, SMLR and RF were used to predict the laboratory results, which reflect soil properties, and the models were validated. The best method was used to spatialize soil properties. SMLR models had high R² values (≥0.8); however, the highest accuracy was obtained with RF modeling. Exchangeable Ca, Al and Mg, potential and effective cation exchange capacity, soil organic matter, pH, and base saturation were adequately fitted and accurately predicted with RF. Eight out of the 10 soil properties predicted by RF using pXRF data had CaO as the most important variable, followed by P2O5, Zn and Cr. Maps generated using RF from pXRF data had high accuracy for six soil properties, reaching R² of up to 0.83. pXRF in association with RF can be used to predict soil properties with high accuracy, at low cost and in little time, besides providing variables that aid digital soil mapping.

Index terms:
Soil analyses; spatial prediction; proximal sensor.

RESUMO

A determinação de atributos do solo auxilia no correto manejo da sua fertilidade. O equipamento portátil de fluorescência de raios-X (pXRF) foi recentemente adotado para determinar o teor total de elementos químicos em solos, permitindo inferências sobre atributos do solo. No entanto, esses estudos ainda são escassos no Brasil e em outros países. Os objetivos deste trabalho foram prever atributos do solo a partir de dados do pXRF, comparando-se os métodos de regressão linear múltipla stepwise (SMLR) e de random forest (RF), além de mapear e validar atributos do solo. 120 amostras de solo foram coletadas em três profundidades e submetidas a análises laboratoriais. Utilizou-se o pXRF para leitura das amostras e determinou-se o teor total de elementos. A partir dos dados do pXRF, foram utilizadas SMLR e RF para predizer resultados laboratoriais, que refletem atributos do solo, e os modelos foram validados. O melhor método foi utilizado para espacializar os atributos do solo. Utilizando SMLR, os modelos apresentaram valores elevados de R² (≥0,8), porém maior acurácia foi obtida na modelagem com RF. A capacidade de troca de cátions potencial e efetiva, matéria orgânica do solo, pH, saturação por bases e teores trocáveis de Ca, Al e Mg apresentaram ajustes adequados e predições acuradas com RF. Dos dez atributos do solo preditos por RF a partir de dados do pXRF, sete apresentavam CaO como a variável mais importante para auxiliar as predições, seguido por P2O5, Zn e Cr. Os mapas gerados a partir de dados do pXRF usando RF apresentaram adequados valores de R² para seis atributos do solo, atingindo R2 de até 0,83. O pXRF em associação com RF pode ser usado para prever atributos do solo com elevada acurácia, com rapidez e a baixo custo, além de proporcionar variáveis que auxiliam o mapeamento digital de solos.

Termos para indexação:
Análises de solo; predição espacial; sensor próximo.

INTRODUCTION

Soils present diverse physical, chemical, mineralogical and biological properties, which influence their potential uses, such as plant growth (Birkeland, 1999; Resende et al., 2014; Schaetzl; Anderson, 2005). The characterization of those properties is of great importance for the proper management and conservation of soils (Severiano et al., 2009). For that purpose, several laboratory analyses of different levels of complexity are employed, which support decisions on the correct management according to crop needs, so that agricultural production may be increased (Lopes; Guilherme, 2016).

On the other hand, carrying out laboratory tests on a large number of samples requires more time and financial resources, as well as chemical reagents, which generate residues. Thus, tools that allow soil properties to be evaluated quickly, at low cost and without producing residues may facilitate the evaluation of more samples, characterizing soils in more detail and for different purposes.

The portable X-ray fluorescence spectrometer (pXRF) is a tool used in several fields of study to identify and quantify the chemical elements present in varied materials (Ioannides et al., 2016; Milić, 2014; Peinado et al., 2010; Rouillon; Taylor, 2016; Terra et al., 2014; Zhu; Weindorf; Zhang, 2011). This equipment emits high-energy X-ray beams, which eject electrons from the inner shells of the atoms in the sample. Electrons from outer shells then fill the resulting vacancies, emitting fluorescence that is characteristic of each chemical element, since its energy is related to the atomic number of the element. Thus, in a few seconds the equipment is able to determine the total contents of elements of the Periodic Table between Mg and U, allowing its use both in the field and in the laboratory (Weindorf; Bakr; Zhu, 2014).

The pXRF generates a large data set, which may slow down its detailed analysis and interpretation. In this sense, machine learning tools may accelerate the identification of the data relevant for characterizing soils. Several methods for analyzing large amounts of data, with both continuous and categorical variables, have been used in works of various natures, such as stepwise multiple linear regression (SMLR) (Juhos; Szabó; Ladányi, 2015; Menezes et al., 2016; Rodrigues; Corá; Fernandes, 2012). This analysis fits regression models that use easily obtained variables to estimate data that are more difficult to acquire; predictor variables are added to or removed from the model based on statistical tests, generating a final equation. Weindorf et al. (2012) evaluated pXRF for discriminating spodic and albic horizons in the field, using SMLR to estimate organic carbon from pXRF data, and concluded that the equipment was adequate and contributed to the rapid generation of chemical data.

Another method that has been increasingly used for predictions is random forest (RF) (Breiman, 2001). The advantages of this algorithm include the possibility of using both numerical and categorical variables, the modeling of non-linear relationships, the assessment of the importance of each variable for the final model, and the calculation of modeling errors, among others (Breiman, 2001; Liaw; Wiener, 2002). However, despite ranking the variables according to their importance to the model (Archer; Kimes, 2008), this method does not generate a final equation of the model, as opposed to SMLR. Therefore, it is sometimes referred to as a black-box method (Grimm et al., 2008), although some works have pointed out that this method is robust and provides better results than other methods for both spatial and non-spatial predictions (Hengl et al., 2015; Lies; Glaser; Huwe, 2012; Souza et al., 2016).

In recent years, most works involving digital soil mapping have been based on continuous variables for the area of interest, such as satellite images and digital elevation models and their derivatives (e.g., slope, topographic wetness index and curvatures), to spatialize soil information (Adhikari et al., 2014; Giasson et al., 2015; Menezes et al., 2014; Silva et al., 2016a; Taghizadeh-Mehrjardi et al., 2015). However, when working in smaller areas, mainly in developing countries, it is common to face difficulties in obtaining data with high spatial resolution, which tends to make the use of these variables unfeasible. In this sense, pXRF can be an alternative for obtaining a large amount of point data that, after being spatialized, may contribute to spatial predictions (Silva et al., 2016b).

In spite of the advantages of using pXRF to analyze the elemental composition of materials, very few works have used pXRF in Brazil and in other developing countries, mainly regarding soils. In this sense, given the search for rapid and economical methods to obtain soil information, the objectives of this work were: (i) to predict results of laboratory analyses through SMLR and RF from data generated by pXRF, validating the generated models; and (ii) to evaluate the potential of pXRF to aid the spatial prediction of analytical results, which reflect soil properties, by generating and validating maps of soil properties.

MATERIAL AND METHODS

Study area

This work was carried out at Santa Luzia Farm, in the municipality of Campos Altos, Minas Gerais, Brazil, located between latitudes 19°35’05.33’’ and 19°35’17.80’’S and longitudes 46°16’14.46’’ and 46°15’24.34’’W, covering 17.1 ha. The climate of the region is Aw according to the Köppen classification, with annual average rainfall of 1,450 mm (Motta; Baruqui; Santos, 2004), dry winters and rainy summers, and monthly average temperature greater than 18 °C in all months of the year. The area has varying land uses: 5-year-old (9.7% of the area) and 1-year-old (51.8%) coffee plantations (Coffea arabica L.), a 5-year-old eucalyptus plantation (15.4%), and native vegetation of secondary forest (14.7%) and cerrado grasses (8.4%) (Figure 1).

The study area is occupied by typic Dystrophic Haplic Cambisols (95% of the area), followed by typic Dystrophic Regolithic Neosols (5%), classified according to the Brazilian Soil Classification System (Embrapa, 2013), both with gravels and developed from metapelitic rocks. Soil samples were collected at three depths (0 to 10 cm, 10 to 20 cm and 20 to 40 cm) at 40 sites randomly distributed in the area, making up a total of 120 samples (Figure 1).

Figure 1:
Study area location, land uses and sampling points for modeling and validation.

Laboratory analyses

Soil samples were air dried, passed through a 2 mm sieve and analyzed in the laboratory, where the following soil properties were determined: soil pH in water, exchangeable contents of Ca2+, Mg2+ and Al3+ (McLean et al., 1958), available K extracted with Mehlich-1, soil organic matter (OM) (Walkley; Black, 1934), remaining P (P-rem) (Alvarez; Fonseca, 1990), potential (T) and effective (t) cation exchange capacity (CEC), and base saturation (V).

The samples were also analyzed in the laboratory with a Bruker S1 Titan LE pXRF. This equipment contains 50 kV and 100 μA X-ray tubes. The software used was GeoChem, in the Trace (dual soil) configuration, recommended for soils, with a scanning time of 60 seconds, including two X-ray beams. The 120 samples were analyzed in triplicate by pXRF, and the accuracy of the equipment was evaluated by scanning the standard reference materials 2710a and 2711a, certified by the National Institute of Standards and Technology (NIST), as well as an equipment standard sample (check sample, CS). From the NIST and CS certified samples, the recovery of the element contents obtained by pXRF was calculated (% of recovery = 100 × obtained content / total certified content). The recovery percentages are presented in Table 1 only for the elements that were identified in all the samples of this work.
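The recovery calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the readings are hypothetical (they are not values from Table 1), and averaging the triplicate readings before computing the recovery is an assumption about how the replicates were combined.

```python
def recovery_percent(readings, certified):
    """Recovery (%) = 100 x obtained content / total certified content.

    `readings` holds the replicate pXRF readings for one element (assumed
    here to be averaged); `certified` is the certified content of the
    reference material, in the same unit.
    """
    if certified <= 0:
        raise ValueError("certified content must be positive")
    obtained = sum(readings) / len(readings)
    return 100.0 * obtained / certified


# Hypothetical triplicate readings (ppm) against a hypothetical certified value.
print(recovery_percent([440.0, 450.0, 460.0], certified=500.0))  # 90.0
```

A recovery close to 100% indicates that the pXRF reading agrees with the certified content for that element.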

Table 1:
Percentage of recovery of element contents by portable X-ray fluorescence spectrometer (pXRF) of National Institute of Standards and Technology (NIST) and pXRF equipment (CS) certified samples.

Analysis of data and modeling

The results of the laboratory analyses were submitted to descriptive statistics, in the three soil depths evaluated, to obtain the average, maximum and minimum values, standard deviation and coefficient of variation (CV). From the data of the pXRF, models were adjusted to predict the following soil properties: exchangeable Ca2+, Mg2+, K+, Al3+, P-rem, pH, potential CEC (T), effective CEC (t), soil organic matter (OM) and base saturation (V).
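As an illustration of the descriptive statistics listed above, the sketch below computes them with the Python standard library. The function name and the example values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which form was used.

```python
import statistics


def describe(values):
    """Average, minimum, maximum, standard deviation and CV (%) of one property."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,
    }


# Hypothetical pH values for one depth, for illustration only.
print(describe([4.0, 5.0, 6.0]))
```

The coefficient of variation expresses the standard deviation as a percentage of the mean, which makes the variability of properties measured in different units comparable.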

Soil samples were randomly separated into modeling and validation data sets, respectively, consisting of 75% and 25% of the total data. Also, the samples were subdivided and modeled in two ways: i) specific models, according to the three depths of sampling, with n = 40 for each depth, with 30 samples for modeling and 10 for validation; and ii) general model, including all samples (n = 120, 90 for modeling and 30 for validation).
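The random separation into modeling and validation sets can be sketched as follows. The function name, the use of integer sample identifiers and the fixed seed are assumptions made for reproducibility of the example; the text does not describe how the random draw was performed.

```python
import random


def split_modeling_validation(sample_ids, validation_fraction=0.25, seed=0):
    """Randomly split sample identifiers into (modeling, validation) lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split can be repeated
    n_validation = round(len(ids) * validation_fraction)
    return ids[n_validation:], ids[:n_validation]


# General model: all 120 samples -> 90 for modeling, 30 for validation.
modeling, validation = split_modeling_validation(range(120))
print(len(modeling), len(validation))  # 90 30

# Specific model: the 40 samples of one depth -> 30 for modeling, 10 for validation.
modeling_d, validation_d = split_modeling_validation(range(40))
print(len(modeling_d), len(validation_d))  # 30 10
```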

In order to adjust the models for predicting soil properties from the pXRF data, two methods were tested: stepwise multiple linear regression (SMLR) and the random forest algorithm (RF). The SMLR models were generated in SigmaPlot software using the backward method, in which the variables least important for model adjustment are removed, at the 95% probability level.
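For reference, the general form of the model fitted by SMLR and a common form of the backward removal rule can be written as below. This is a sketch under assumptions: the symbols are introduced here for illustration, the 95% probability level is read as a significance level of 0.05, and the exact removal criterion applied by the software is not stated in the text and may differ.

```latex
% Linear model relating a soil property y to k pXRF variables x_1, ..., x_k
\hat{y} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j

% Backward step (assumed form): starting from the model with all variables,
% remove the variable with the largest p-value while that p-value exceeds alpha
j^{*} = \arg\max_{j} \; p_j, \qquad \text{remove } x_{j^{*}} \text{ if } p_{j^{*}} > \alpha = 0.05
```

The procedure stops when every remaining variable satisfies the criterion, and the coefficients of the last fitted model give the final equation.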

The random forest analysis was performed in R software, with the randomForest package (Liaw; Wiener, 2015; https://cran.r-project.org/web/packages/randomForest/randomForest.pdf), with the following parameters: number of trees in the model (ntree) = 1000, minimum number of observations in each terminal node (nodesize) = 5, and number of variables tried at each split (mtry) = one third of the total number of predictor variables, as suggested by Liaw and Wiener (2002) for regression random forests.

The random forest adjustment yields the mean square of the residuals (MSEoob), the percentage of the variance explained by the model and the importance of every variable in the prediction, all obtained by the out-of-bag (OOB) method. Each tree is grown on a bootstrap sample of the data, and MSEoob is calculated from the observations left out of that sample, through Equation 1. The importance of a variable, also obtained by the algorithm, is the average reduction in prediction accuracy when that variable is left out of the model while the other variables are kept. Thus, the more the prediction error increases (that is, the more the accuracy decreases) when a variable is removed, the more important that variable is for the model adjustment (Breiman, 2001; Liaw; Wiener, 2002).

$$\mathrm{MSE}_{oob} = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \hat{y}_i^{OOB}\right]^2 \qquad (1)$$

in which $y_i$ is the real (observed) value, $\hat{y}_i^{OOB}$ is the mean of the OOB predictions for the ith observation, and $n$ is the number of observations.
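The two out-of-bag quantities described above can be illustrated with a minimal Python sketch. It does not reproduce the randomForest package: the function names, the data layout (one list of predictions and one in-bag mask per tree) and the toy prediction function are assumptions, and the importance is approximated by shuffling one predictor for an arbitrary prediction function rather than per tree on OOB data.

```python
import random


def oob_mse(y, tree_preds, in_bag):
    """Equation 1: average, for each observation, only the predictions of the
    trees that did not use it for training, then take the mean squared error."""
    squared_errors = []
    for i, y_i in enumerate(y):
        oob = [preds[i] for preds, bag in zip(tree_preds, in_bag) if not bag[i]]
        if not oob:  # observation was in-bag for every tree: skipped in this sketch
            continue
        y_hat = sum(oob) / len(oob)
        squared_errors.append((y_i - y_hat) ** 2)
    return sum(squared_errors) / len(squared_errors)


def permutation_importance(predict, X, y, column, repeats=10, seed=0):
    """Mean increase in MSE after shuffling the values of one predictor."""
    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    rng = random.Random(seed)
    baseline = mse(X)
    increases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [v] + row[column + 1:]
                    for row, v in zip(X, shuffled)]
        increases.append(mse(permuted) - baseline)
    return sum(increases) / repeats


# Toy example: three observations, three trees, each tree leaves two observations out.
y = [1.0, 2.0, 3.0]
tree_preds = [[1.0, 2.5, 3.0], [1.5, 2.0, 2.0], [0.5, 1.5, 3.0]]
in_bag = [[True, False, False], [False, True, False], [False, False, True]]
print(oob_mse(y, tree_preds, in_bag))

# Toy prediction function that depends only on the first predictor.
X = [[float(i), float(i % 3)] for i in range(8)]
target = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]
print(permutation_importance(model, X, target, column=0))
print(permutation_importance(model, X, target, column=1))
```

In the toy example, shuffling the second predictor does not change the error, because the prediction function ignores it, whereas shuffling the first predictor increases the error. This is the sense in which a larger loss of accuracy indicates a more important variable.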

Accuracy of the models

The validation of the general and specific (per depth) models generated by SMLR and random forest was performed using the independent subset of data (not used in the modeling), consisting of 25% of the total data, to determine if the predictions by the models are valid for other observed data. For this, the estimated values for each sample of the independent subset were determined and the accuracy of the models was evaluated through the following statistical indices: coefficient of determination (R2), adjusted R2 (R2 adj) in relation to observed and estimated data, root mean square error (RMSE), and mean error (ME), according to Equations 2 and 3:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(e_i - m_i\right)^2} \qquad (2)$$

$$\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(e_i - m_i\right) \qquad (3)$$

where n: number of observations, ei: values estimated by the model, and mi: values observed through laboratory analysis.
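The validation indices can be sketched in Python as follows. RMSE and ME follow Equations 2 and 3. Computing R2 as the squared Pearson correlation between observed and estimated values, and R2 adj with a single predictor (k = 1), are assumptions, since the text does not give these two formulas; the function names and example values are hypothetical.

```python
import math


def rmse(estimated, observed):
    """Equation 2: root mean square error."""
    n = len(observed)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, observed)) / n)


def mean_error(estimated, observed):
    """Equation 3: mean error (positive values indicate overestimation)."""
    return sum(e - m for e, m in zip(estimated, observed)) / len(observed)


def r_squared(estimated, observed):
    """Squared Pearson correlation between estimated and observed values (assumption)."""
    n = len(observed)
    mean_e = sum(estimated) / n
    mean_m = sum(observed) / n
    s_em = sum((e - mean_e) * (m - mean_m) for e, m in zip(estimated, observed))
    s_ee = sum((e - mean_e) ** 2 for e in estimated)
    s_mm = sum((m - mean_m) ** 2 for m in observed)
    return s_em ** 2 / (s_ee * s_mm)


def adjusted_r_squared(r2, n, k=1):
    """Adjusted R2 for n observations and k predictors (k = 1 is an assumption)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: every estimate is 0.5 above the observation.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 3.5, 4.5]
print(rmse(estimated, observed))        # 0.5
print(mean_error(estimated, observed))  # 0.5
print(r_squared(estimated, observed))   # 1.0
```

The example shows why the indices are read together: a constant overestimation leaves R2 at 1.0, while RMSE and ME reveal the bias.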

The efficiency of the modeling methods (SMLR and random forest) was evaluated, and the analytical results that could be predicted with the greatest accuracy from the generated models were identified. In this sense, the models with the highest values of R2 and R2 adj and the smallest RMSE and ME (in absolute value), comparing observed with estimated data, were considered the best for predicting the results of laboratory analyses from the pXRF data.

Spatial prediction of soil properties from pXRF data

Once the best modeling method was defined, the laboratory results that were predicted with high accuracy were spatialized for the entire study area. This procedure aimed to evaluate the possibility of using pXRF data as a basis for mapping soil properties (Duda et al., 2017; Silva et al., 2016b), providing variables that are obtained easily, rapidly, at low cost and with no generation of chemical residues.

The first step of this procedure was to spatialize the variables obtained by pXRF for the entire study area, since the soil property prediction models are based on pXRF data, which, in turn, refer only to the sites where samples were collected. To do so, the inverse distance weighting (IDW) method was employed in the spatialization of the pXRF variables, allowing their subsequent use for mapping. The values inferred by IDW at non-sampled places are estimated as a linear combination of the values at the sampled places, weighted by an inverse function of the distance from the point of interest to the sampled points. The weights (λi) are expressed in Equation 4:

$$\lambda_i = \frac{1/d_i^{\,p}}{\sum_{i=1}^{n} 1/d_i^{\,p}} \qquad (4)$$

where $d_i$ is the distance between the point of interest $x_0$ and the sampled point $x_i$, $p$ is a power parameter, and $n$ represents the number of sampled points used for the estimation.
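A minimal Python sketch of the IDW estimate with the weights of Equation 4 is given below. The function name, the coordinates and the values are hypothetical, and the power parameter p = 2 is an assumption, since the text does not report the value used.

```python
import math


def idw_predict(x0, points, values, p=2.0):
    """Estimate the value at x0 as a weighted mean of the sampled values,
    with weights proportional to 1 / d_i**p (Equation 4)."""
    inverse_distances = []
    for point, value in zip(points, values):
        d = math.dist(x0, point)
        if d == 0.0:  # x0 coincides with a sampled point: return its value
            return value
        inverse_distances.append(1.0 / d ** p)
    total = sum(inverse_distances)
    return sum(w / total * v for w, v in zip(inverse_distances, values))


# Two hypothetical sampled points with a pXRF reading of 10 and 20 (arbitrary unit).
points = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 20.0]
print(idw_predict((1.0, 0.0), points, values))  # midpoint: equal weights
print(idw_predict((0.5, 0.0), points, values))  # closer to the first point
```

With p = 2, a point located three times closer to the first sample than to the second gives the first sample nine times the weight, so the estimate stays near the closer reading. Larger values of p make the interpolated surface follow the nearest samples more closely.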

The predicted maps of the soil properties were also validated with 25% of the samples (not used for the modeling) through R2, R2 adj, RMSE, ME and 1:1 graphs (observed vs. estimated data).

RESULTS AND DISCUSSION

Descriptive statistics

The analytical data of the samples used for modeling and validation are presented in Table 2. All evaluated soil properties varied greatly, as demonstrated by the high coefficients of variation in both the modeling and validation data sets. This was expected given the different land uses and soil management practices in the area, ranging from native vegetation, where pH and nutrient contents are lower since no anthropic influence occurs, to cultivated areas, where these values are higher because of liming and fertilizer application. This variability can contribute to more reliable models, potentially applicable to soils under different conditions, since the data cover a wide range of values of the analyzed properties, such as P-rem varying from 10.8 to 47.1 mg dm-3 and pH from 4.4 to 7.7.

Table 2:
Descriptive statistics of soil properties in modeling and validation data sets.

As a general trend, the exchangeable/available nutrient contents, as well as pH, OM, T, t and V, decreased from the surface to the subsurface, in contrast to exchangeable Al, which increased with depth. These trends are in agreement with fertilizer application and liming practices, which are carried out on the more superficial soil layers. Furthermore, although liming decreases the content of exchangeable Al, lime moves very little down the soil profile; thus, its corrective effect is concentrated in the layer in which it is incorporated (Alvarez; Ribeiro, 1999).

The pXRF determined 16 elements in all analyzed samples, namely Al2O3, Fe, SiO2, CaO, P2O5, K2O, Cl, Ti, V, Cr, Mn, Ni, Cu, Zn, Zr and Sr. Table 3 presents the descriptive statistics for the pXRF data.

Table 3:
Descriptive statistics for data (ppm) obtained by the portable X-ray fluorescence spectrometer (pXRF) for the different data sets (per soil depth and general).

Modeling soil properties through stepwise multiple linear regression

Figure 2 shows the R2 values of the SMLR models. High values were found, with at least one model reaching R2 greater than 0.8 for every soil property except T. Among the three depths, 0 to 10 cm and 20 to 40 cm presented, in general, higher values than 10 to 20 cm, which only presented better adjustment for OM and t. These values indicate the potential of pXRF to provide variables for adjusting prediction equations of soil properties in tropical regions. Works such as Sharma et al. (2015), who used pXRF data to predict CEC in soils of the United States and obtained adequate results using SMLR, corroborate the suitability of pXRF data for predicting soil properties.

Figure 2:
Coefficient of determination (R2) of the equations for modeling soil properties through stepwise multiple linear regression. Al, Ca, Mg and K represent the exchangeable/available contents of these elements; OM - organic matter; P-rem - remaining P; T - potential cation exchange capacity; t - effective cation exchange capacity; V - base saturation.

The general equation presented lower R2 values in most cases, which may be due to the greater heterogeneity of the samples used in this modeling. These results demonstrate that adjusting equations according to sampling depth tends to provide better models with SMLR. Souza et al. (2016) used SMLR for bulk density prediction, comparing models created only for the A horizon, only for the B horizon, and a general one encompassing the two horizons, and also obtained better adjustments for the separate horizon equations than for the general model.

Table 4 shows the equations with R² values greater than or equal to 0.80 generated for the 0 to 10 cm, 10 to 20 cm and 20 to 40 cm depths. The general models did not reach R² of 0.80. Among the 17 equations presented, CaO content was the variable that appeared most often (15 equations), followed by SiO2 (12 equations) and Fe (11 equations). Also, for 0 to 10 cm, 8 equations presented R² of at least 0.80, against 4 for 10 to 20 cm and 5 for 20 to 40 cm. Thus, these equations indicate that exchangeable Ca, Mg and Al, available K, P-rem, pH, t, V and OM could be adequately modeled by SMLR using pXRF data.

Table 4:
Stepwise multiple linear regression equations with R² ≥0.80, at different depths, to predict exchangeable Ca (cmolc dm-3), exchangeable Mg (cmolc dm-3), available K (mg dm-3), exchangeable Al (cmolc dm-3), pH, effective cation exchange capacity (t) (cmolc dm-3), base saturation (V) (%), remaining P (P-rem) (mg dm-3) and organic matter content (OM) (%) from pXRF data (ppm).

Modeling soil properties through random forest

Table 5 shows the results of random forest modeling: MSEoob and the percentage of variance explained by each model (general and specific). With the exception of Mg, the percentage of variance explained by the specific models decreased with depth, being greatest for the 0 to 10 cm depth and smallest for the 20 to 40 cm depth. However, the general model presented the lowest MSEoob and the highest percentages of explained variance. This may be due to the greater amount of data used for this model (n = 90) relative to the models for a single depth (n = 30). This result is contrary to that found with SMLR modeling, in which the general models were mostly worse than the specific ones (by depth). Carvalho Junior et al. (2016) compared SMLR models generated with different sets of variables and numbers of samples to estimate bulk density and noticed that R2 values were lower for the models with more samples. In the same work, they noticed that the random forest models generated with more data presented better adjustments than those with less data, in agreement with the findings of this work.

Table 5:
Mean square error of prediction by the out-of-bag method (MSEoob) and percentage of the explained variance of the models generated using the random forest algorithm.

Table 5 indicates that the soil properties most explained by the general and specific random forest models were base saturation, exchangeable Ca and Al, and pH, whereas OM and T were the least explained. K was the variable with the highest MSEoob, indicating larger prediction errors (to be confirmed by the validation of the models).

Validation of models generated with stepwise multiple linear regression and random forest

The R² values resulting from the comparison between the observed values and those estimated by SMLR and random forest for the validation samples are presented in Tables 6 and 7. The highest R² values were obtained with random forest rather than with SMLR for all soil properties except K. Available K was also the soil property with the highest MSEoob values in the modeling phase (Table 5). The validation R² values of the random forest and SMLR models differed especially for exchangeable Ca and Al, pH and t, which were better predicted with random forest. This suggests that random forest has greater potential for estimating analytical results, which reflect soil properties, from pXRF data.

Table 6:
Validation data of the models generated by stepwise multiple linear regression.
Table 7:
Validation data of the models generated by random forest.

RMSE, ME and R² adj, presented in Tables 6 and 7, corroborate the better predictions of random forest in relation to SMLR. Souza et al. (2016) compared model fits for predicting bulk density using SMLR and random forest, also obtaining better results with random forest, in agreement with this work.

In the validation of the models obtained with SMLR, only pH in the 20 to 40 cm depth model and in the general model, exchangeable Al in the general model, and t and OM in the 20 to 40 cm depth model presented RMSE values lower than 1.0, whereas in the validation of the random forest models, Ca, pH, Al, Mg, t and OM showed RMSE values lower than 1.0 at all depths and in the general model (Tables 6 and 7). The absolute values of ME were also mostly smaller for the random forest models than for the SMLR models.
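The validation statistics in Tables 6 and 7 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used by the authors: ME is taken as the mean of predicted minus observed values, R² as the squared Pearson correlation between observed and predicted values, and R² adj uses one predictor by default. The function name and the example numbers are hypothetical.

```python
from math import sqrt


def validation_metrics(observed, predicted, n_predictors=1):
    """Return RMSE, ME, R2 and adjusted R2 for a validation set."""
    n = len(observed)
    if n != len(predicted) or n <= n_predictors + 1:
        raise ValueError("need equal-length inputs with n > n_predictors + 1")
    # Assumed sign convention: positive error means overestimation.
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(sum(e * e for e in errors) / n)
    me = sum(errors) / n
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    if sxx == 0 or syy == 0:
        raise ValueError("observed and predicted values must both vary")
    # Squared Pearson correlation between observed and predicted values.
    r2 = sxy ** 2 / (sxx * syy)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"RMSE": rmse, "ME": me, "R2": r2, "R2adj": r2_adj}


# Hypothetical validation values (not data from Tables 6 and 7).
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 3.5, 4.5, 5.5]
print(validation_metrics(obs, pred))
```

In this toy case the predictions are perfectly correlated with the observations (R² = 1) yet carry a constant bias of 0.5, which only RMSE and ME reveal. This is one reason to report the three statistics together.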

Importance of variables

The analysis of variable importance in the random forest models showed that eight out of the ten soil properties predicted from pXRF data had CaO as the most important variable (Table 8, Figure 3). Among these ten soil properties, base saturation and exchangeable Al and Ca had Cr as the second most important variable. P2O5 was the most important variable to predict OM, followed by Zn, whereas SiO2 was the most important to predict T, with P2O5 as the second most important variable. Aldabaa et al. (2015) used pXRF, remote sensing data and visible near-infrared diffuse reflectance spectroscopy (VisNIR DRS) to predict electrical conductivity and verified that, among the pXRF variables, Cl and S were the most important elements for the predictions.

Table 8:
Importance of portable X-ray fluorescence spectrometer (pXRF) variables in decreasing order to predict soil properties.

Figure 3:
Most important variables of portable X-ray fluorescence spectrometer (pXRF) (importance increases from 0 to 60) for prediction of soil properties with random forest.

The frequency with which each pXRF variable appeared in the first three positions of importance for the predictions shows that CaO appeared most often (8 times), followed by P2O5, Zn, and Cr (6 times each), SiO2 (2), Sr (1) and Fe (1). Figure 3 presents the importance values of the main variables, highlighting the greater importance of CaO in relation to the other important variables.

In contrast to the most important variables, the ones that appeared most often in the last three positions were Al2O3 (7), Cu (4), Cl, Zr, V and Ti (3 times each), K2O (2), and SiO2, Mn, Cr, Fe and Ni (1 each). It is worth noting that Al2O3 may not have been an important contributor to the prediction of exchangeable Al because the pXRF obtains total element contents, including both the exchangeable Al and the Al held in the structure of soil minerals. Moreover, as the study area includes managed areas, the exchangeable Al content is quite variable (Table 2), despite the small variation in total Al contents obtained by pXRF (Table 3), which may have hampered the models. Similar trends can be inferred for available K.
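The tallies of first-three and last-three positions described above amount to a simple count over the per-property importance rankings. The sketch below shows that count; the property names and the shortened rankings are hypothetical placeholders and do not reproduce Table 8.

```python
from collections import Counter


def position_counts(rankings, k=3):
    """Count how often each variable is in the first and last k positions."""
    top, bottom = Counter(), Counter()
    for ordered_vars in rankings.values():
        top.update(ordered_vars[:k])      # k most important variables
        bottom.update(ordered_vars[-k:])  # k least important variables
    return top, bottom


# Hypothetical, shortened rankings (most to least important); not Table 8.
rankings = {
    "property_A": ["CaO", "Cr", "Zn", "Fe", "Ti", "Cu", "Al2O3"],
    "property_B": ["P2O5", "Zn", "CaO", "Fe", "Cu", "Ti", "Al2O3"],
}
top, bottom = position_counts(rankings)
print(top.most_common())
print(bottom.most_common())
```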

Mapping soil properties with random forest through pXRF data

Using random forest, which obtained better modeling and validation results than SMLR, maps of some well-predicted soil properties for the 0 to 10 cm layer were prepared and validated (Figure 4). The maps show that the highest contents of the plant nutrients Ca and Mg, higher levels of OM, V and t, higher pH and lower exchangeable Al content were found in the areas cultivated with coffee, with the oldest crop being the one with the best soil chemical conditions for plant development (considering only the chemical soil properties predicted here). Under eucalyptus plantation, the nutrient contents are lower, since this area was fertilized only at the time of planting, 5 years before sampling. The areas with the lowest nutrient contents and pH are under native forest and native cerrado grasses, which have had no anthropogenic intervention and reflect the high degree of weathering-leaching of these Brazilian cerrado soils (Resende et al., 2014).

Figure 4:
Maps of predicted soil properties for 0 to 10 cm depth spatialized with random forest; P-rem: remaining P; t: effective cation exchange capacity; V: base saturation; OM: soil organic matter.

Validation of these maps with an external set of samples (n = 10) resulted in 1:1 plots of predicted versus observed values of the soil properties (Figure 5). For most predicted soil properties, high R² and R² adj values were found, except for exchangeable Mg, which had an R² of 0.30. Adequate spatialized predictions were found for exchangeable Al (R² = 0.83), P-rem (R² = 0.80), exchangeable Ca (R² = 0.78) and t (R² = 0.73), followed by base saturation (R² = 0.67), OM (R² = 0.66), and pH (R² = 0.54). These results indicate that, although this mapping procedure accumulates errors, first in the spatialization of the pXRF variables by IDW and then during random forest modeling and prediction, most generated soil property maps presented satisfactory accuracy. This demonstrates the potential of using pXRF as a source of variables to help predict soil properties spatially as well, mainly in areas that lack detailed continuous information (e.g., digital elevation model and its derivatives), as is the case of the study area of this work. In addition, by providing results quickly and inexpensively, it may favor gathering more observations (points visited) in the field and also, through predictions, reduce the number of laboratory analyses. The use of pXRF to improve spatial and non-spatial soil predictions was also reported by Silva et al. (2016b), who used magnetic susceptibility and pXRF data, as well as continuous variables derived from a digital elevation model, for prediction of soil classes and properties in Brazil, finding that magnetic susceptibility and pXRF data increased model accuracy when associated with terrain data.

Figure 5:
Plots of observed and estimated values resulting from random forest prediction of soil properties for the whole study area. Ca, Mg and Al refer to exchangeable contents; t - effective cation exchange capacity; V - base saturation; OM - soil organic matter; P-rem - remaining P.
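The first step of the mapping procedure, the spatialization of each pXRF variable by inverse distance weighting (IDW), can be sketched as below. The power of 2, the coordinates and the CaO readings are assumptions made for illustration and are not taken from this study; the function name `idw` is likewise hypothetical.

```python
from math import hypot


def idw(samples, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    samples: iterable of (x_i, y_i, value_i) tuples.
    """
    num = den = 0.0
    for xi, yi, vi in samples:
        d = hypot(x - xi, y - yi)
        if d == 0.0:
            return vi  # target coincides with a sampled point
        w = 1.0 / d ** power  # closer samples receive larger weights
        num += w * vi
        den += w
    if den == 0.0:
        raise ValueError("no samples supplied")
    return num / den


# Hypothetical coordinates (m) and CaO readings; not data from this study.
cao_samples = [(0.0, 0.0, 0.10), (10.0, 0.0, 0.30), (0.0, 10.0, 0.20)]
grid = [(x, y) for x in (0.0, 5.0, 10.0) for y in (0.0, 5.0, 10.0)]
cao_surface = {cell: idw(cao_samples, *cell) for cell in grid}
# A fitted model (e.g. a random forest) would then be applied cell by cell
# to the interpolated pXRF values to obtain the soil property map.
```

Because the model is applied to interpolated rather than measured pXRF values, any interpolation error propagates into the predicted map, which is the accumulation of errors mentioned above.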

Weindorf; Bakr; Zhu (2014), after presenting examples of correlations between element contents obtained by pXRF and results of laboratory analyses, suggested that many works using this equipment would be performed focusing on predicting soil properties in the years to come. Here we demonstrated the potential of this equipment for predicting soil properties also in Brazilian soils, in accordance with Piikki et al. (2016), who used pXRF coupled with three other sensors to predict results of laboratory soil analyses in Kenya, observing that pXRF was frequently employed in good models. Sharma et al. (2014) used pXRF data to predict soil pH from linear regressions.

Data collection through pXRF in this work was carried out in the laboratory; however, the use of this equipment in the field can accelerate the acquisition of data that are more difficult to obtain, through the adjustment of models with data from pXRF scanning in the field. Stockmann et al. (2016) evaluated the concentration of elements in soil profiles to infer their parent materials using the pXRF in the field, in addition to making a comparison with the data obtained in the laboratory. Future tests in this line of research are therefore suggested for tropical soils, since the pXRF in association with robust algorithms can increase the amount of data on soils in Brazil, both spatially and at point locations, providing results rapidly, at low cost and without generation of chemical residues.

CONCLUSIONS

Soil properties such as exchangeable Ca, Mg, Al, pH, organic matter, base saturation, potential and effective CEC and P-rem could be predicted with high accuracy by random forest from the data obtained by pXRF, surpassing the predictions made by stepwise multiple linear regression. The variables obtained by pXRF allowed the spatial prediction of soil properties related to soil fertility, leading to the generation of accurate maps, which demonstrates the potential of pXRF to be used as a source of variables to help spatial prediction of soil properties rapidly, at low cost and without generating residues.

ACKNOWLEDGEMENTS

The authors would like to thank the National Council for Scientific and Technological Development (CNPQ), the Coordination for the Improvement of Higher Education Personnel (CAPES) and the Minas Gerais Foundation for Research Support (FAPEMIG) funding agencies for the financial support that enabled us to develop this work. The authors also thank Luiz da Silva Teixeira.

REFERENCES

  • ADHIKARI, K. et al. Constructing a soil class map of Denmark based on the FAO legend using digital techniques. Geoderma, 214-215(2014):101-113, 2014.
  • ALDABAA, A. A. A. et al. Combination of proximal and remote sensing methods for rapid soil salinity quantification. Geoderma, 239:34-46, 2015.
  • ALVAREZ V., V. H.; FONSECA, D. M. Definição de doses de fósforo para a determinação da capacidade máxima de adsorção de fosfato e para ensaios de casa de vegetação. Revista Brasileira de Ciência do Solo, 14:49-55, 1990.
  • ALVAREZ V., V. H.; RIBEIRO, A. C. Calagem. In: RIBEIRO, A. C.; GUIMARÃES, P. T. G.; ALVAREZ V., V. H. (Eds.). Recomendações para o uso de corretivos e fertilizantes em Minas Gerais - 5° Aproximação. Viçosa: CFSEMG, 1999. p.43-60.
  • ARCHER, K. J.; KIMES, R. V. Empirical characterization of random forest variable importance measures. Computational Statistics and Data Analysis, 52(4):2249-2260, 2008.
  • BIRKELAND, P. W. Soils and geomorphology. 3rd. ed. New York: Oxford University Press, 1999. 448p.
  • BREIMAN, L. Random forests. Machine Learning, 45(1):5-32, 2001.
  • CARVALHO JUNIOR, W. de et al. Regressão linear múltipla e modelo Random Forest para estimar a densidade do solo em áreas montanhosas. Pesquisa Agropecuária Brasileira, 51(9):1428-1437, 2016.
  • DUDA, B. M. et al. Soil characterization across catenas via advanced proximal sensors. Geoderma, 298:78-91, 2017.
  • EMPRESA BRASILEIRA DE PESQUISA AGROPECUÁRIA - EMBRAPA. Sistema Brasileiro de Classificação de Solos. 3rd. ed. Brasília: Embrapa, 2013. 353p.
  • GIASSON, E. et al. Instance selection in digital soil mapping: A study case in Rio Grande do Sul, Brazil. Ciência Rural, 45(9):1592-1598, 2015.
  • GRIMM, R. et al. Soil organic carbon concentrations and stocks on Barro Colorado Island - Digital soil mapping using Random Forests analysis. Geoderma, 146(1-2):102-113, 2008.
  • HENGL, T. et al. Mapping soil properties of Africa at 250 m resolution: Random forests significantly improve current predictions. Plos One, 10(6):0125814, 2015.
  • IOANNIDES, D. et al. A preliminary study of the metallurgical ceramics from Kition, Cyprus with the application of pXRF. Journal of Archaeological Science: Reports, 7:554-565, 2016.
  • JUHOS, K.; SZABÓ, S.; LADÁNYI, M. Influence of soil properties on crop yield: A multivariate statistical approach. International Agrophysics, 29(4):433-440, 2015.
  • LIAW, A.; WIENER, M. Classification and regression by random forest. R News, 2(December):18-22, 2002.
  • LIAW, A.; WIENER, M. Package “randomForest.” 2015. Available in: <https://cran.r-project.org/web/packages/randomForest/randomForest.pdf>. Access in: January 3, 2017.
  • LIES, M.; GLASER, B.; HUWE, B. Uncertainty in the spatial prediction of soil texture. Comparison of regression tree and Random Forest models. Geoderma, 170:70-79, 2012.
  • LOPES, A. S.; GUILHERME, L. R. G. A career perspective on soil management in the Cerrado Region of Brazil. Advances in Agronomy, 137:1-72, 2016.
  • MCLEAN, E. O. et al. Aluminum in soils: I. Extraction methods and magnitudes in clays and Ohio soils. Soil Science Society of America Proceedings, 22(5):382-387, 1958.
  • MENEZES, M. D. de et al. Solum depth spatial prediction comparing conventional with knowledge-based digital soil mapping approaches. Scientia Agricola, 71(4):316-323, 2014.
  • MENEZES, M. D. de et al. Spatial prediction of soil properties in two contrasting physiographic regions in Brazil. Scientia Agricola, 73(3):274-285, 2016.
  • MILIĆ, M. pXRF characterisation of obsidian from central Anatolia, the Aegean and central Europe. Journal of Archaeological Science, 41:285-296, 2014.
  • MOTTA, P. E. F.; BARUQUI, A. M.; SANTOS, H. G. Levantamento de reconhecimento de média intensidade dos solos da região do Alto Paranaíba, Minas Gerais. 1. ed. Rio de Janeiro: Embrapa Solos, 2004. 238p.
  • PEINADO, F. M. et al. A rapid field procedure for screening trace elements in polluted soil using portable X-ray fluorescence (pXRF). Geoderma, 159(1-2):76-82, 2010.
  • PIIKKI, K. et al. Performance evaluation of proximal sensors for soil assessment in smallholder farms in Embu County, Kenya. Sensors, 16(11):1-21, 2016.
  • RESENDE, M. et al. Pedologia: Base para distinção de ambientes. 6th. ed. Lavras: Editora UFLA, 2014. 378p.
  • RODRIGUES, M. S.; CORÁ, J. E.; FERNANDES, C. Soil sampling intensity and spatial distribution pattern of soils attributes and corn yield in no-tillage system. Revista Brasileira de Ciencia do Solo, 36:599-609, 2012.
  • ROUILLON, M.; TAYLOR, M. P. Can field portable X-ray fluorescence (pXRF) produce high quality data for application in environmental contamination research? Environmental Pollution, 214:255-264, 2016.
  • SCHAETZL, R. J.; ANDERSON, S. Soil: Genesis and Geomorphology. 1st. ed. New York: Cambridge University Press, 2005. 817p.
  • SEVERIANO, E. D. C. et al. Potencial de uso e qualidade estrutural de dois solos cultivados com cana-de-açúcar em Goianésia (GO). Revista Brasileira de Ciência do Solo, 33(1):159-168, 2009.
  • SHARMA, A. et al. Characterizing soils via portable X-ray fluorescence spectrometer: 3. Soil reaction (pH). Geoderma, 232-234:141-147, 2014.
  • SHARMA, A. et al. Characterizing soils via portable X-ray fluorescence spectrometer: 4. Cation exchange capacity (CEC). Geoderma, 239:130-134, 2015.
  • SILVA, S. H. G. et al. Retrieving pedologist’s mental model from existing soil map and comparing data mining tools for refining a larger area map under similar environmental conditions in Southeastern Brazil. Geoderma, 267:65-77, 2016a.
  • SILVA, S. H. G. et al. Proximal sensing and digital terrain models applied to digital soil mapping and modeling of Brazilian Latosols (Oxisols). Remote Sensing, 8:614-635, 2016b.
  • SOUZA, E. DE et al. Pedotransfer functions to estimate bulk density from soil properties and environmental covariates: Rio Doce basin. Scientia Agricola, 73(6):525-534, 2016.
  • STOCKMANN, U. et al. Utilizing portable X-ray fluorescence spectrometry for in-field investigation of pedogenesis. Catena, 139:220-231, 2016.
  • TAGHIZADEH-MEHRJARDI, R. et al. Comparing data mining classifiers to predict spatial distribution of USDA-family soil groups in Baneh region, Iran. Geoderma, 253-254:67-77, 2015.
  • TERRA, J. et al. Análise Multielementar de solos: Uma proposta envolvendo equipamento portátil de fluorescência de raios X. Semina: Ciências Exatas e Tecnológicas, 35(2):207-214, 2014.
  • WALKLEY, A.; BLACK, I. A. An examination of the Degtjareff method for determining soil organic matter and a proposed modification of the chromic acid titration method. Soil Science, 37(1):29-38, 1934.
  • WEINDORF, D. C. et al. Characterizing soils via portable x-ray fluorescence spectrometer: 2. Spodic and Albic horizons. Geoderma, 189-190:268-277, 2012.
  • WEINDORF, D. C.; BAKR, N.; ZHU, Y. Advances in portable X-ray fluorescence (PXRF) for environmental, pedological, and agronomic applications. Advances in Agronomy, 128:1-45, 2014.
  • ZHU, Y.; WEINDORF, D. C.; ZHANG, W. Characterizing soils using a portable X-ray fluorescence spectrometer: 1. Soil texture. Geoderma, 167-168:167-177, 2011.

Publication Dates

  • Publication in this collection
    Nov-Dec 2017

History

  • Received
    09 Apr 2017
  • Accepted
    22 June 2017
Editora da Universidade Federal de Lavras (Editora da UFLA), Caixa Postal 3037, 37200-900, Lavras, MG, Brazil. Telephone: 35 3829-1115
E-mail: revista.ca.editora@ufla.br