How accurate are pedotransfer functions for bulk density for Brazilian

The aim of this study was to evaluate the performance of pedotransfer functions (PTFs) available in the literature to estimate soil bulk density (ρb) in different regions of Brazil, using different metrics. The predictive capacity of 25 PTFs was evaluated using the mean absolute error (MAE), mean error (ME), root mean squared error (RMSE), coefficient of determination (R2) and the regression error characteristic (REC) curve. The models performed differently when comparing observed and estimated ρb values. In general, the PTFs showed a performance close to the mean value of the bulk density data, considered as the simplest possible estimation of an attribute and used as a parameter to compare the performance of existing models (null model). The models developed by Benites et al. (2007) (BEN-C) and by Manrique and Jones (1991) (M&JB) presented the best results. The separation of data into two layers according to depth (0-10 cm and 10-30 cm) demonstrated better performance in the 10-30 cm layer. The REC curve allowed for a simple and visual evaluation of the PTFs.


Introduction
Bulk density (ρb) is an important soil physical property known to affect soil water movement, root growth, seed germination and root density (Mouazen et al., 2003;Dexter, 2004). This property has received particular attention due to its importance in weight-to-volume conversions used to assess soil organic carbon (OC) stocks. Soil is known to contain the largest terrestrial carbon pool and can act as an important sink or source for atmospheric CO 2 .
Despite their importance, ρb databases are in short supply because direct measurements of undisturbed samples are labor-intensive and time-consuming. Furthermore, ρb is highly variable in space and time (Alletto and Coquet, 2009).
A number of studies have also tested the performance of available PTFs and observed that these functions are relatively inaccurate when applied to different environments (Kaur et al., 2002;De Vos et al., 2005;Benites et al., 2007;Al-Qinna and Jaber, 2013;Nanko soils? et al., 2014). These studies assessed the performance of PTFs based on several error measures required for a reliable evaluation. The root mean squared error (RMSE) and mean error (ME) are the most commonly used measures (De Vos et al., 2005;Benites et al., 2007;Al-Qinna and Jaber, 2013;Nanko et al., 2014).
The aim of this study was to evaluate the predictive capability of PTFs available in the literature to estimate soil ρb in different regions of Brazil, using different metrics (mean absolute error (MAE), root mean squared error (RMSE), mean error (ME) and regression error characteristic (REC) curve).

Selected pedotransfer functions (PTFs)
We evaluated 25 PTFs available in the literature. The PTFs selected, R 2 values and the size of the data set used to generate the PTFs are shown in Table 1. PTF equations were adjusted to the same units for better comparison. The details about land use, location, ρb range, and the method used to determine the ρb, which was extracted from the original papers, are presented in Table 2.

Data set
The original data set used to evaluate the performance of the 25 PTFs consisted of 222 soil profiles (888 soil layers) distributed in different biomes in Brazil and with different uses (native vegetation, pasture, integrated crop-livestock and integrated crop-livestock-forest systems) (Figure 1).

Soils and Plant Nutrition |
We eliminated samples that had a missing ρb value or whose sum of particle size fractions (sand, silt, and clay) did not equal 100 %. The final data set consisted of 884 soil samples. The evaluation of the 25 PTFs was first performed by considering the whole data set (884 samples). In a second analysis, the data set was divided into two subgroups according to soil depth: the layer between 0 and 10 cm (0-10 cm) and the layer between 10 and 30 cm (10-30 cm).

Evaluation criteria
The most useful approach to an evaluation of predictive models should be based on a set of complementary indices (Donatelli et al., 2004;Nanko et al., 2014). In the present study, the predictive capacity of the 25 PTFs was evaluated using the mean absolute error (MAE), RMSE, ME and R 2 . MAE is the value of the error in absolute terms, and lower values indicate better predictive ability. RMSE is the standard deviation of the error in the prediction, and, again, lower values indicate better model performance. ME quantifies systematic errors and indicates tendencies to Models for soil bulk density estimate Sci. Agric. v.75, n.1, p.70-78, January/February 2018 overestimate or underestimate. R 2 represents the fraction of the total variance that is explained by the model. Higher R 2 values are desirable.

Regression error characteristic curve -REC curve
The evaluation of the PTFs was complemented by the use of the REC curve. The regression error characteristic (REC) curve is a promising alternative for the evaluation of PTFs. The REC curve allows for comparisons of the performance of different models simultaneously using visual assessment, which facilitates interpretation by the user (Mittas and Angelis, 2010).
The REC curve represents the cumulative distribution function of the errors. On the x-axis the REC curve plots the error and on the y-axis the accuracy of a regression function (Mittas and Angelis, 2010). Accuracy is defined as the fraction of points that fit within an error level, ranging from 0 to 1. Using this construction, we can evaluate the level of accuracy as the number of points for a given value of error.
REC curves are built in a way that makes inspecting model performance a simple task of checking which model is closest to the (0, 1) point (upper left corner) (Figure 2A). If the REC curve of one model is above the others, it is an indication that it is the best model for that metric, be it absolute error or squared error, or their relative forms (e.g., model c in Figure 2B). Thus, when comparing models a and b, if the REC curve from model a overlies the curve of model b, we can affirm that model a is superior to model b ( Figure 2B). Model c in Figure 2B is superior to the others.
The REC curve also allows us to estimate the model error by measuring the area over the curve (AOC). If the selected error measure is the squared error, the AOC corresponds to an underestimation of the mean squared error (MSE), and for the absolute error, the AOC corresponds to an underestimation of the MAE.
Finally, the REC curve allows for a comparison of the quality of a particular model with the simplest possible estimation of an attribute from a population. In this study, the mean prediction corresponds to the simplest possible estimation of an attribute, and we used these values as a parameter to compare the performance of existing models. If the use of the mean values is better than a particular model, this model is not recommended  for that situation. The mean value will hereafter be referred to as the null model. When used in such manner, comparing models with REC curves is no different from using the MAE or RMSE. The benefits of the REC curve are related to other easily visualized statistics, which can be the R 2 or the Kolmogorov-Smirnov Statistic to check if two samples are generated from the same distribution. Furthermore, the REC allows for quick visualization of the error median (50 % horizontal line) or any level of confidence for different readers, e.g. one can look at the errors for 80 % confidence or 95 %.
REC curves were constructed following the steps proposed by Mittas and Angelis (2010): 1 -The model was set up, and the error in the predictions was evaluated; 2 -The error values were sorted in ascending order; 3 -For each error value that was not repeated, a point with the error value and the percentage of points with smaller errors were included in the graph.
Step 1 was not necessary because we used available PTFs (Table 1). The RMSE was used as a measurement of error. Comparisons between curves were also made, by applying the Wilcoxon signed rank test at the 0.05 level, with the Holm correction for multiple comparisons. This test assesses whether population mean ranks differ (paired difference test).

General evaluation
The descriptive statistics of the data set are shown in Table 3. The data set used has a wide range of ρb values, with a minimum of 0.81 g cm -3 , a maximum of 1.84 g cm -3 , and a mean value of 1.42 g cm -3 (Table 3). The PTFs tested were developed from data sets with different uses, origins, ρb ranges and methods of determination ( Table  2). The range of ρb values in the data set used for the comparisons was within the range of values observed in the data sets used to generate the PTFs ALEX, BEN, GRIG, HAN, H&B, JEF, KAUR, LEO, PREV, R&K and T&S (Table  2 and Table 3). The PTFs BERX and HUNT used methods different from those used in this study; the PTFs ALEX and H&B used more than one method; and the PTFs HAN, HONG, and M&J did not inform the method used. The others used the core method.
The models performed differently when observed and estimated ρb values were compared ( Figure 3 and Table 4). The ρb values were underestimated, except for the R&K, H&B-B and LEO (A and B) functions ( Figure 3, Table 4). Underestimations in the prediction of ρb were also observed by De Vos et al. (2005), which they attributed to the high proportion of topsoil data used in the calibration of the PTFs. These authors used forest soil data, in which the surface ρb tended to be lower than the subsurface values (Tamminen and Starr, 1994;De Vos et al., 2005), which can lead to underestimated ρb values. The data set used in this study are composed of soil data from the upper 30 cm of the soil profile, considered as topsoil (De Vos et al., 2005;Martin et al., 2009). Additionally, 40 % of our data is from natural vegetation (ρb = 1.28 g cm -3 ) and integrated systems (crop-livestock ρb = 1.33 g cm -3 ; crop-livestock-forest ρb = 1.39 g cm -3 ), which presented a lower topsoil ρb than pasture (ρb = 1.45 g cm -3 ). Data only from topsoil and under different uses may explain the underestimates.
Lower ME values were observed for BEN-C and M&J-B indicating that these models were less biased ( Table 4). The most biased models were HONG, R&K,     ). On the other hand, the most biased models tested by Nanko et al. (2014) were H&B and ALEX, which were among the models with lower values. They used data from volcanic soils, with a range of ρb between 0.13 and 1.78 g m -3 and a very low mean of 0.6 g cm -3 . The best indices (MAE, RMSE, and ME) were observed in the M&J-B, ALEX, BEN-B, BEN-C, M&J-A and T&H models. Out of the models with lower errors (MAE, RMSE, and ME), the T&H model had the highest R 2 value and, therefore, represented a more realistic model. When considering the models developed from the data from Brazilian soils (BEN-A, BEN-B, BEN-C, BERX, and T&H), we observed that three of them are among the best models: BEN-B, BEN-C and T&H.
The worst indices were observed in the HONG, R&K, and KAUR models. Furthermore, HONG, R&K, BEN-C, LEO-B, and C&P models provided similar estimates of ρb ( Figure 3) for a data set with a broad range of values (0.86 g cm -3 to 1.8 g cm -3 ) ( Table 3). This finding reveals that these models do not provide reliable predictions of ρb. The performance of the PTFs using data sets that differ from those used in their development is uncertain and may not be satisfactory, especially when the data sets are sourced from different geographic areas (Wösten et al., 2001) (Table 5).
In addition to the difficulty associated with extrapolating these models due to the specificity of the data sets used in their development (Table 3), the method used to generate the PTFs also appears to affect the results (Martin et al., 2009;Jalabert et al., 2010;Suuster et al., 2011). These methods are quite varied, ranging from simple regressions to more powerful methods such as neural networks, regression trees (Martin et al., 2009;Jalabert et al., 2010;Ghehi et al., 2012), and the nearestneighbor method (Nemes et al., 2010). In the case of the PTFs assessed in our study, the methods used were mostly simple or multiple regressions using the least squares method.
The analysis of input variables did not present any concluding remark. The T&H model, using OC and two granulometry fractions was one of the best five models (M&J-B, ALEX, BEN-B, BEN-C, M&J-A, and T&H); despite one of the worst models (KAUR) also using OC and two granulometry fractions. The other models used Models for soil bulk density estimate Sci. Agric. v.75, n.1, p.70-78, January/February 2018 only OC or OM and one granulometry fraction. Therefore, when considering the origin of the data set used to develop the model, the number, and type of input variables, the model generated from a Brazilian data set with OC and two granulometry fractions as input was able to predict ρb more effectively.
When comparing the observed and predicted values, certain patterns in results appear inadequate to act as a good predictor (Figure 3). An exhaustive discussion of these patterns is beyond the scope of this study; thus, two patterns were observed: the horizontal lines (e.g., BEN-C) and the 'flat top' (e.g., HUNT-A). The horizontal line found in BEN-C is easily explained by the small magnitude of the angular coefficients (10 -3 -10 -5 ) compared with the linear coefficient (1.5688). Although the data set used in this study contains ρb values between 0.81 and 1.84 g cm -3 , the BEN-C model predicted values between 1.14 and 1.51 g cm -3 . This range is even narrower for the LEO-B model, which predicted values between 1.33 and 1.71 g cm -3 . The previous analysis developed for BEN-C and LEO-B can be extended to the R&K and C&P models.
The 'flat top' pattern was observed in the HUNT-A, HUNT-B, and KAUR models and, to a lesser extent, in LEO-A and FED. An evaluation of the parameters of the HUNT-A equation and the values of the data set allowed us to conclude that the predicted values would lie between 1.04 and 1.36 g cm -3 . The maximum value is lower than the average ρb value for the data set (1.42 g cm -3 ), resulting in a flat aspect on the graph.
When the data set was separated according to depth (0-10 cm and 10-30 cm), the values of the indices showed only slight variations ( Table 6). The M&J-B, BEN-B, BEN-C, ALEX, M&J-A, and T&H continued to perform well, and the HONG, R&K, and KAUR models had the worst performance. The lowest RMSE values were observed for the BEN-C and M&J-B and the highest for the HONG model for both soil depths.

REC curve
In the evaluation using the REC curve, the models were separated into different graphs for easier visualization ( Figure 4A and B). The 12 best models are presented in Figure 4A and the remainder in Figure 4B. Table 4 also presents the values of AOC calculated from squared error (AOC_square) and absolute error (AOC_abs), and the results of the comparisons of the REC curves by the Wilcoxon test at the 0.05 level with the Holm correction for multiple testing.
Among the 12 best models, the Wilcoxon test showed that ALEX, BEN-B, H&B-B, M&J-A and T&H models presented a distribution similar to the null model. Otherwise, HUNT-B, H&B-A, BEN-A, PREV, and P&O models presented a different distribution from the null model, with an inferior performance for the majority of the error range ( Figure 4A and Table  4). BEN-C and M&J-B also presented a different distribution from the null model, with a superior performance for a particular range of errors. The better performance of the M&J-B model is likely due to the variability of the data used in its generation (12,000 soil profiles) (Manrique and Jones, 1991). The result for BEN-C is probably related to similarity between data sets (Benites et al., 2007).
The 13 worst models presented a distribution dissimilar from the null model, with an inferior performance for the whole error range, showing that these models have low predictive ability ( Figure 4B and Table 4). The HONG, R&K, and KAUR models had the worst results, with much poorer performance than the null model (Figure 4B and Table 4). De Vos et al. (2005) observed poor performance for both the KAUR and HUNT models using forest soil data from Belgium. These models were devel-  Kaur et al. (2002) 0  (Table 2). Furthermore, the HUNT model used different methods to determine the ρb. Nemes et al. (2010) emphasize that the model selected should be one developed from similar areas as regards soil genesis. However, even the PTFs developed for Brazilian soils, such as those generated by Benites et al. (2007), showed a performance slightly superior to the null model and for specific conditions. The Wilcoxon test for REC curves according to depth showed similar results when considering all the data set (Table 6). For depths up to 10 cm ( Figure 5A and B), ALEX, BEN-B, H&B-B, M&J-A, and T&H, together with M&J-B and LEO-A models, continued to present a distribution similar to the null model ( Table 6). The BEN-C was the only model with a distribution different from the null model, but, again the gain in their application was very small and only for specific error ranges. The other five models, out of the 12 best ones HUNT-B, BEN-A, H&B-A, PREV, and P&O, presented a distribution different from the null model, with inferior performance for the majority of the error range ( Figure 5A).  For depths of 10 to 30 cm ( Figure 5C and D), the Wilcoxon test showed that 10 of the 12 best models did not differ from the null model (Table 6). A slight improvement occurred with the application of ALEX, M&J-B, BEN-B, and BEN-C models, but also for a particular error range and, therefore, cannot be interpreted as better performance of these models compared to the null model ( Figure 5C). Previous studies have shown the influence of soil depth on the prediction of ρb. De Vos et al. (2005) found a 24 % improvement in the performance of PTFs in RMSE for ρb prediction at greater depths. Heuscher et al. (2005) noted that depth was responsible for approximately 1 % of the variation in ρb, and the greatest variation observed was 7 %. Benites et al. (2007) did not observe an improvement in the accuracy of the PTFs after separating the soils by depth (0 to 30 cm and 30 to 100 cm). Tranter et al. (2007) found better results when using the depth expressed on a logarithmic scale. Nemes et al. (2010) observed a decrease in the bias after separating the soil by depth.
In general, only BEN-C presented a slight improvement in ρb estimates for all situations (all data set, depths up to 10 cm and depths of 10 to 30 cm), when compared to the null model. The number of models with better results for topsoil was more restricted than for the subsurface. Having a worse performance for the topsoil as compared with the subsurface soil raises concerns as regards the estimation of soil C stock in which 0 to 30 cm are commonly considered as a reference layer (IPCC, 1997). Given the higher concentration of OC in the surface compared to the lower layer (Table 3), the error will be greater where it matters most.
The method used to determine the predictors of a PTF or the estimated variable (ρb) also seems to affect their performance. Both ρb and OC (or OM) can be measured by different methodologies (Sleutel et al., 2007;Blake and Hartge, 1986). The LOI and wetoxidation methods are the two most commonly used methods for quantifying OM. However, the LOI method has no standard protocol and involves potentially confounding factors (Hoogsteen et al., 2015) even if the granulometric data could present variations since national and international classification systems often use quite different particle size ranges. In fact, Wösten et al. (2001) pointed out that there is no single source of variability, either PTF-related or soil-related, that can explain the uncertainty in every calculated functional aspect of soil behavior. The authors suggest using large and reliable data sets as well as PTFs developed from soils with similar attribute ranges to those used for the predictions. Here, we could see that similar ranges of soil attributes are not a guarantee of good predictions. The REC curve allows us to show that the 12 PTFs with the best indices presented, in general, a performance similar to the null model. BEN-C and M&J-B models presented the best results. The improvements observed were insignificant and were found only in specific error ranges.

Conclusions
The 25 models tested performed differently when observed and estimated bulk density values are compared, and the BEN-C and M&J-B models presented the best results.
The separation of data into two layers according to depth (0-10 cm and 10-30 cm) demonstrated a worse performance for the 0-10 cm layer, which raises concerns about the estimation of soil C in the upper layers that contain most of the organic matter.
The use of the REC curve as a form of analysis allowed for a simple and visual evaluation of the performance of the models.
The pedotransfer functions tested in this study showed a performance close to that of the null model (mean value) when estimating bulk density for soils from different regions of Brazil, indicating little or no additional benefit from the use of the null model.