Hierarchical pedotransfer functions for predicting bulk density in Brazilian

: Bulk density (BD) is a soil physical property used as a soil quality indicator and variations in this measurement influence soil water content and carbon stock estimates. This study aims to compile a database of samples of bulk density, textural fractions, and organic carbon values, as well as evaluate the accuracy of published pedotransfer functions (PTF) that predict bulk density, and propose a hierarchical PTF to predict the bulk density of Brazilian Soils. The performance of eleven PTFs and the newly proposed PTFs were evaluated and compared using the root mean square error (RMSE) and coefficient of determination (R 2 ) based on a testing soil database collected from the literature. We noticed a slight improvement in accuracy when organic carbon and coarse and fine sand fractions were included as predictors alongside silt and clay. The best results with existing PTFs were obtained by PTF-A in Tomasella and Hodnett (1998) (RMSE = 0.20 g cm –3 ) and PTF-F in Benites et al. (2007) (RMSE = 0.17 g cm –3 ). Our proposed PTFs use textural fractions and organic carbon as predictors in a hierarchical form. The proposed PTF-4, which uses fine sand, coarse sand, clay, and organic carbon, presented the lowest value for RMSE (0.14 g cm –3 ) for BD prediction.


Introduction
Soil bulk density (BD), which is the ratio of the mass of an oven-dry soil sample to its volume (solids plus pores) (Hillel, 1998), is an essential property for assessing the sustainability of soil management practices in any region (Botula et al., 2015;Palladino et al., 2022).It is an efficient indicator of soil structure, reflecting compaction and its effects on soil-waterplant-atmosphere relationships (De Vos et al., 2005;Assouline, 2006).It influences soil solution fluxes, root growth and density, and seed germination.Soil bulk density is also a key parameter in the determination of soil organic carbon (SOC) and the stocks of other elements (Lettens et al., 2005;Don et al., 2011;Inagaki et al., 2017;Chen et al., 2018), and is commonly used as a predictor variable for certain physical hydraulic PTFs (Karup et al., 2017;Gunarathna et al., 2019;Silva et al., 2020).
Field sampling and direct measurements of BD can be expensive, labor-intensive, and timeconsuming (Kaur et al., 2002;Abdelbaki, 2016;Nasta et al., 2020).As a result, BD is only sometimes determined in soil surveys and routine soil laboratories (De Vos et al., 2005;Don et al., 2011).Corroborating Minasny and Hartemink (2011), it is not feasible to measure all soil physical and chemical properties continuously, particularly in areas with rock fragments and/or woody debris (Nanko et al., 2014;Sevastas et al., 2018).Therefore, it is necessary to use robust systems to estimate the soil properties of a given location.
We aimed to propose a hierarchical system of PTFs to predict BD based on a large database of Brazilian soils and to compare its performance with existing BD-PTFs from the literature using a testing database of Brazilian soil BD, regardless of genesis, type of land use and cover, and management.

Data selection and description
Different Brazilian soil databases were consulted to extract information on Bulk Density Data (BD), such as the Hydrophysical Database for Brazilian Soils (HYBRAS) (Ottoni et al., 2018), the Brazilian Soil Data Repository (SoilData/FEBR) (Samuel-Rosa et al., 2020), and other private Brazilian soil datasets provided by a number of researchers.The 3,050 observed BD data, including all information on sand, silt, and clay

Soils and Plant Nutrition
Research article soils contents (Table 1), were selected.This dataset covers all the 12 textural classes according to the United States Department of Agriculture (USDA) classification (Soil Survey Staff, 2014) (Figure 1).This soil database also contains information on coarse and fine sand data for several soils samples (1,081 samples) in addition to OC (2,827).Table 1 shows the summary statistics of the soil variables from the soil database of this study.The 990 samples have information on BD, clay, silt, coarse sand, fine sand, and OC data available.
The method used for quantifying organic carbon was oxidation with 0.0667 mol L -1 K 2 Cr 2 O 7 and titration with 0.1 mol L -1 Fe(NH 4 ) 2 (SO 4 ) 2 .6H 2 O, as described by Fontana and Campos (2017).Particle-size analysis was performed by dispersing the soil with NaOH solution.The total of sand fractions was obtained by sieving, and clay was determined by pipette and hydrometer.The difference between sand and clay determined silt.The total sand fraction was further divided into coarse (2.00-0.210mm) and fine (0.210-0.053 mm) fractions using the method described by Donagemma et al. (2017).

Evaluating PTFs
We tested twenty PTFs from the literature that use different variables to estimate BD.Some of these PTFs use variables such as the sum of bases, pH value, and cation values, in addition to the more commonly used ones such as texture and OC data.However, we found these PTFs ineffective for all samples as some need more detailed information.As a result, we selected only eleven already-published PTFs for tropical and temperate soils (Tomasella and Hodnett, 1998;Bernoux et al., 1998;Kaur et al., 2002;Benites et al., 2007;Ruehlmann and Körschens, 2009;Hollis et al., 2012;Al-Qinna and Jaber, 2013;Botula et al., 2015;Abdelbaki, 2016) with the best performance (lowest RMSEs) among the twenty models tested.These selected PTFs are presented in Table 2.

A hierarchical PTFs system for BD predictions
A hierarchical system was tested using textural properties (fine sand, coarse sand, total sand, silt, and clay) and OC as predictors.Four models were tested to estimate soil bulk density, as shown in Table 3.The simplest model, Function 1, includes only the three textural fractions (sand, silt, and clay) as predictors.Function 4 contains all the soil variables as inputs, excluding total sand.Function 2 includes OC in addition to the three textural fractions, and Function 3 includes coarse sand (0.2-2 mm), fine sand (0.05-0.2 mm), silt and clay.
The Caret package (Kuhn, 2008) in R software (R Core Team, 2022) was used to fit the PTFs, evaluate goodness of fit, and validate the selected functions.Linear models (lmStepAIC) were used to fit the equations, with significant predictors selected automatically using the smallest Akaike Interaction Criteria (AIC).We also evaluated RMSE and R 2 for the selected functions.To assess the prediction accuracy of the chosen model, we used 10-fold cross-validation (Souza et al., 2016;Haddad et al., 2018;Palladino et al., 2022), reporting the average values of RMSE and R 2 over ten runs.The dataset was split into ten folds in each run, with 90 % of the samples used for calibration and the remaining 10 % for validation.

Model assessment
Evaluation of PTF predictive performances was based on the root mean square error (RMSE) Eq. ( 1) and adjusted coefficient of determination (R adj 2 ) Eq. ( 2) of the function according to the following equations: in which BD obs,i is the i-th observed value, BD pred,I the i-th predicted value, N the number of observations, and p the number of predictors.The closer to zero the RMSE, the more accurate the predictions.The R 2 identifies values between 0.0 and 1.0, where a value of 1.0 indicates a perfect fit.
The results obtained by the available PTFs and the proposed hierarchical functions are plotted against the observed values in a 1:1 line (identity line) for comparison.The accuracy of the linear models was assessed using the scatter plots of residuals (observed minus predicted) versus the observed values of BD.

Results
Our database contains soil samples from all textural classes, as shown in Figure 1.The clayey texture is the most represented, with 643 samples, while silty texture has the smallest number of samples, only six.The lowest BD was observed in clay and silt clay loam classes (0.63 g cm -3 ), whereas the highest BD value was observed in a sample from the sandy clay class (1.96 g cm -3 ) (Figure 2).
The soil classes were classified according to the World Reference Base for Soil Resources (WRB) classification (FAO, 2015), which includes twelve classes.The Acrisol class (796 samples) and Ferralsol class (671 samples) were the most predominant in our dataset, reflecting their common occurrence in Brazil.On the other hand, the Luvisol class was the least represented, with only four samples in our database (Figure 3).Not all 3,050 samples provided fine sand, coarse sand, and OC information.Therefore, the number of observed data varied among the PTFs fitted.PTF-1 used 3,050 samples, PTF-2 2,827 samples, PTF-3 1,081 samples and PTF-4 (which included coarse and fine sand, silt, clay, and OC) was available for 990 observed BD data.This subset of data was used to evaluate the hierarchical system, where BD was predicted using PTF-1, 2, and 3, which had fewer variable predictors.A correlation analysis was performed on a set of 990 samples (Figure 4), which included all the mandatory information for all the PTF that was tested.The visual results of BD-PTFs can be seen in Figure 5A-K, which shows BD's observed and predicted values.
Evaluated PTF-A and -F had the best indices (RMSE, R 2 ).However, PTF-F exhibited more significant variation, particularly for high BD values (RMSE = 0.18 g cm -3 ) (Table 4).Among the BD-PTFs analyzed in Table 4, the PTF-D, -G, -I had the highest RMSE values (0.27, 0.42, and 0.34 g cm -3 ), indicating poor performance.Additionally, the BD estimates obtained using the PTF-J had a low correlation coefficient (R 2 = 0.06), indicating poor goodness of fit.The highest residues were observed for BD values below 1.0 and above 1.5 g cm -3 .
The coefficients and predictors of our proposed hierarchical functions, selected using step AIC, are presented in Table 5.The calibration statistics of the four developed PTFs are provided in Table 6, while the 10-fold cross-validation statistics are shown in Table 7.Despite differences in sample size, the sample standard deviation (SD) of all four training datasets was 0.21 g cm -3 .The value of RMSE calib decreased with an increasing number of predictors in PTF-4.To correct the RMSE calib to the sample standard deviation, we divided RMSE by SD.This yielded a corrected RMSE calib of 0.81 and 0.66 for PTF-1 and PTF-4, respectively.The presentation of the visual outcomes for our proposed hierarchical PTFs can be observed in Figure 6A-D.Additionally, Figure 7A-D illustrates the results of the residual analysis.Furthermore, Figure 8 provides a comprehensive evaluation of the PTFs applied to the same set of samples (990 samples).

Discussion
Soils with a predominance of sand fractions (i.e., sand, sand clay, sand, loam) tend to show high values for BD.
In contrast, clayey soils are quite unpredictable as they can have low (e.g., 0.63 g cm -3 ) and high BD values (e.g., above 1.80 g cm -3 ).We omitted Organosols, which typically show very low BD values, in our analysis.A correlation analysis was performed on 990 samples, which included all the mandatory information for all tested PTF.A correlation heatmap is a type of plot that visualizes the strength of the relationship between two variables.The correlation analysis shows that BD is negatively correlated with silt and clay and positively correlated with total sand, fine and coarse sand (Figure 4).This is one reason why certain published BD-PTFs use only one or two particle size fractions as predictors (Bernoux et al., 1998;Benites et al., 2007).Despite the moderately negative correlation (-0.39) between BD and OC, we kept OC in some of our functions as it is usually available in soil surveys.Bulk density is negatively correlated with clay content and positively correlated with sand content, which reflects sandy soils (e.g., Arenosols, Spodosols) typically having high values of BD, and clayey soils, especially the wellstructured Ferralsols and Acrisols in the tropics, having low BD values (Ottoni et al., 2018;Batjes et al., 2020).On the other hand, positive correlations between BD and total sand, fine and coarse sand are demonstrated in PTFs 1, 2, 3, and 4.
The Al-Qinna and Jaber (2013) proposed several PTFs that were evaluated using various methodologies to estimate the BD of Jordanian soils, located in an arid region.We tested one of these PTFs (PTF-I) that uses Bulk density prediction Sci.Agric.v.81, e20220255, 2024 sand and log (OC) content, which was developed using stepwise regression.They found an RMSE equal to 0.126 g cm -3 for their analysis whereas in our study this PTF performed poorly (Table 4).
We also tested the PTF proposed by Abdelbaki (2016), which uses only the OC content as a predictor of BD (PTF-K).The author utilized an extensive soil database from the US (SSURGO) comprising approximately 174,339 samples, with OC content ranging from 0-to 58 %, and BD values varying from 0.30 to 2.30 g cm −3 .In his study, the author found an RMSE of 0.13 g cm −3 , superior to other PTFs tested.However, our study obtained an RMSE value of 0.20 g cm −3 when using this same PTF.This same author tested several PTFs, and the ones proposed by Ruehlmann and Körschens (2009) and Hollis et al. (2012) had the best performances.The PTF by Ruehlmann and Körschens (2009) uses only the OC content and was developed using a global dataset (PTF-G).The PTF proposed by Hollis et al. (2012) was developed using a European dataset, which includes OC, sand, and clay content (PTF-H).In our study, we obtained RMSE values of 0.42 and 0.19 g cm −3 for the PTFs by Ruehlmann and Körschens (2009) and Hollis et al. (2012), respectively (PTF-G, H) (see Table 4).
The results of the statistical indices (RMSE, R 2 ) for the calibration and the validation of the models were very similar.Pedotransfer function number 1 estimated the BD for all samples in the database (n = 3,050).Since the other PTFs use predictor variables that are not present in all samples, the number of samples used to calibrate PTF 2, 3 and 4 were, 2,827, 1,081 and 990, respectively.
The proposed PTF-1 can be compared with the PTFs B and E, developed by Bernoux et al. (1998) and Benites et al. (2007), in which predictor parameters also correspond to granulometric fractions.The PTFs B and E obtained RMSE values equal to 0.22 and 0.18 g cm -3 , respectively, while PTF-1 had the lowest RMSE (0.17 g cm -3 , as shown in Table 7).The proposed PTF-2, which uses textural fractions and OC to estimate BD, can be compared to other PTFs as PTF-A, -C, -D, -F, -H, -I, -J that presented RMSE values equal to 0.22, 0.20, 0.27, 0.17, 0.19, 0.34, and 0.21 g cm -3 (Table 4), respectively, for the BD estimate.The BD prediction by PTF-2 showed a lower RMSE (0.16 g cm -3 ), presenting the same RMSE value in the cross-validation assessment (Table 7).
The proposed PTF-3 was fitted using 1,081 samples, equiring fine and coarse sand as predictors instead of total sand.However, PTF-3, like PTF-1, uses only particle size fractions as predictors.When comparing predictions for these 1,081 samples, the RMSE values were 0.159 and 0.168 g cm -3 , respectively, showing a slight improvement in the accuracy when total sand is fractionated.
The last proposed PTF is PTF-4, which uses textural fractions and OC content as predictors, showed a lower RMSE (0.14 g cm -3 ) in validation and, therefore, is considered a more accurate PTF when fine and coarse sand, clay and OC are available.On the other hand, PTF-1 showed higher residuals, underestimating (positive values) the BD prediction for samples with values above 1.60 g cm -3 , and overestimating (negative values) the BD for samples with BD less than 1.0 g cm -3 .The PTF with the smallest error was PTF-4 with an RMSE of 0.14 g cm -3 , while the worst (RMSE = 0.17 g cm -3 ) was observed when using only simple textural fractions (sand and clay -PTF-1).
Satisfactory performance is indicated when the RMSE for BD predictions falls within the range of 0.12 to 0.25 g cm −3 , as defined by De Vos et al. (2005).Using this criterion, some evaluated models show good performance for Brazilian soils.As suggested by Al-Qinna and Jaber (2013), it is essential to balance the simplicity, applicability, accuracy, precision, and reliability of models generated in specific circumstances.Therefore, we believe that our four proposed PTFs are suitable for use in various scenarios within Brazilian territory where there is a lack of available BD data.
The script and data bank will be available on the GitHub of the first author of this manuscript.
We proposed four new simple functions to predict bulk density (BD) values for most Brazilian Soils.These functions use granulometric fractions and organic carbon as predictor variables.A hierarchical system of BD predictions showed a slight improvement in accuracy when organic carbon and coarse and fine sand fractions were included as predictors.However, the BD predictions still showed high residuals for soils with low BD values (i.e., below 1.00 g cm -3 ) and high BD values (i.e., above 1.50 g cm -3 ).Incorporating a structural parameter representing the arrangement of soil particles is crucial to improving the accuracy of BD predictions.The best results with existing PTFs were obtained for PTF-A and -F, proposed by Tomasella and Hodnett (1998) and Benites et al. (2007), respectively.

Figure 1 -
Figure 1 -Distribution of the 3,050 points in the textural triangle according to the United States Department of Agriculture (USDA) classification.

Figure 2 -
Figure 2 -Box-plots of observed data in all textural soil classes.The circles represent the outliers.

Figure 3 -
Figure 3 -Box-plots of observed data in all World Reference Base for Soil Classes.The circles represent the outliers.
PTFs of Tomasella and Hodnett (1998) (PTF-A), and Bernoux et al. (1998) (PTF-B, C), were developed using Amazonian soil samples.Bernoux et al. (1998) found an R 2 of approximately 0.50 for estimating BD with clay and OC content.Tomasella and Hodnett (1998) obtained an R 2 value almost equal to 0.60 when estimating BD with silt, clay, and OC content.Benites et al. (2007) also developed PTFs for Brazilian Soils across most biomes and found an R 2 value of 0.63 when predicting BD with clay and OC content (PTF-E, F).Boschi et al. (2018) evaluated a set of 222 soil profile data from all Brazilian biomes, totaling 884 samples, and found that the PTF proposed by Benites et al. (2007) (PTF-F) had good performance (RMSE = 0.19 g cm -3 ).

Figure 8 -
Figure 8 -Comparison of bulk density estimates by the hierarchical system using four pedotransfer functions (PTFs) for the same data set.

Table 1 -
Ranges of bulk density (BD), soil texture (according to the USDA classification), and organic carbon (OC) for the assessed Brazilian soil databases.

Table 4 -
Results obtained by evaluating the BD-PTFs for the presented database.
RMSE = root mean square error; R 2 = coefficient of determination; Number of samples evaluated by the eleven BD-PTFs respecting the ranges of the input variables.

Table 7 -
Statistics (RMSE and R 2 ) of the k-fold cross validation.

Table 6 -
Goodness of fit criteria (RMSE and R 2 ) for the calibration functions (PTF) to predict bulk density.