Bulk Density Prediction for Histosols and Soil Horizons with High Organic Matter Content

: Bulk density (Bd) can easily be predicted from other data using pedotransfer functions (PTF). The present study developed two PTFs (PTF1 and PTF2) for Bd prediction in Brazilian organic soils and horizons and compared their performance with nine previously published equations. Samples of 280 organic soil horizons used to develop PTFs and containing at least 80 g kg -1 total carbon content (TOC) were obtained from different regions of Brazil. The multiple linear stepwise regression technique was applied to validate all the equations using an independent data set. Data were transformed using Box-Cox to meet the assumptions of the regression models. For validation of PTF1 and PTF2, the coefficient of determination (R 2 ) was 0.47 and 0.37, mean error -0.04 and 0.10, and root mean square error 0.22 and 0.26, respectively. The best performance was obtained for the PTF1, PTF2, Hollis, and Honeysett equations. The PTF1 equation is recommended when clay content data are available, but considering that they are scarce for organic soils, the PTF2, Hollis, and Honeysett equations are the most suitable because they use TOC as a predictor variable. Considering the particular characteristics of organic soils and the environmental context in which they are formed, the equations developed showed good accuracy in predicting Bd compared with already existing equations.


INTRODUCTION
Pedotransfer functions (PTF) have been developed in order to convert available data into required data (Bouma, 1989).Those analyses or parameters routinely determined that consume less time and resources are considered as available soil data.The available data should be more easily obtained than predict attribute (required data), because on the contrary, it would be more convenient to make a direct measurement than to predict the response variable (McBratney et al., 2002).
Measurement of bulk density (Bd), for example, takes time and is difficult to achieve in some soils (Sequeira et al., 2014;Nasri et al., 2015).In addition, since Bd is not a soil classification parameter itself, it is not often found in soil survey reports in Brazil, leading to a scarcity of such information for the 0.00-1.00m soil layer (Heuscher et al., 2005;Benites et al., 2007).In such cases, PTFs are alternatives for filling gaps in soil databases (Odgers et al., 2012) since they can generate or estimate Bd data indirectly (Xiangsheng et al., 2016).
Bulk density is an important soil physical parameter, due to its relation with several soil properties and processes, such as hydraulic properties, soil compaction, and erosion (Wischmeier and Mannering, 1969;Aria and Paris, 1981;Hillel, 1998;Fidalski and Tormena, 2007;Silva et al., 2008;Rodríguez-Lado et al., 2015).Additionally, Bd is an important parameter in many calculations, e.g., the soil porosity equation (Blake and Hartge, 1986) and models (Nasri et al., 2015).Especially in the context of this study, Bd is important for converting soil weight into volume, which is necessary for determining the soil carbon (C) stock.Knowledge of C stocks over large areas, such as within a nation, is essential for monitoring changes from global warming that have raised concerns over the depletion of soil C (Eswaran et al., 1993).
In tropical conditions, only a few studies have focused on organic soils (total organic carbon content -TOC >80 g kg -1 (Ebeling et al., 2011), according to the Brazilian Soil Classification System -SiBCS (Santos et al., 2013).These ecosystems are considered fragile and very susceptible to degradation (Chimner and Ewel, 2005).In order to overcome the lack of information on organic soil, Valladares (2003) compiled a Brazilian soil database from graduate study theses and dissertations, which was the basis of this study.Database compilation is a first step for PTF development, and this has been one of the main limitations in Brazil.Although information management techniques have become more accessible due to advances in informatics, most of the information recorded for Brazilian soils is still dispersed (Benedetti et al., 2008).
If an organic soil database is available, the development of the PTF for estimating Bd for organic soils is possible, and it is also possible to test the PTFs published in the literature, such as in Adams (1973), Honeysett and Ratkowsky (1989), Manrique and Jones (1991) and Tamminen and Starr (1994) models.In Brazil, the first attempts to develop PTFs for estimating Bd associated with C stocks in the 0.00-1.00m layer were conducted by Bernoux et al. (1998) and Tomasella and Hodnett (1998) in the Amazon Basin.After that, Benites et al. (2007) used PTFs for Bd determination in the main Brazilian soils and found that, among the different modeling techniques, regression functions using only soil TOC, clay content, and sum of bases (SB) provided the best results, explaining 66 % of Bd variance.In all the PTFs already developed in Brazil, none of them were developed solely with organic soils.
The aim of the present study was to develop new PTFs to predict Bd in the organic horizons of Brazilian soils and evaluate the accuracy of the PTFs already published in the literature.

Database acquisition
Information on organic soils and horizons was obtained from studies carried out in different regions of Brazil (Figure 1), the basis of which was a database of organic soils compiled by Valladares (2003).The survey included dissertations, theses, and studies from educational and research institutes developed from 2003 to 2014.
A total of 210 soil profiles, encompassing 648 horizons with organic characteristics (total organic C greater than or equal to 80 g kg -1 of soil) were found.The profiles surveyed consisted mostly of Histosols, Gleysols (Gleissolo Melânico and Gleissolo Tiomórfico), Cambisols (Cambissolo Hístico and Cambissolo Húmico), Leptosols (Neossolo Litólico), and Podzols (Espodossolo).There was not uniformity or harmonization among the soil analyses presented.Thus, Bd values were available for only 280 horizons, and they were used to develop and test PTFs.The way the full database was used for developing and validating the PTFs is shown in figure 2.

Development of soil bulk density PTFs
From 230 organic soil horizons, two PTFs were generated from stepwise linear regression analysis, here referred to as PTF1 and PTF2.Sand, clay, silt, TOC, SB, CEC, Al 3+ , H + , V, P, and pH(H 2 O) were used as explanatory variables for modeling Bd (dependent variable).Statistical analyses were performed using R software (R Core Team, 2013).To meet the assumptions of regression analysis, data were transformed using the Box-Cox transformation of the MASS package (Venables and Ripley, 2002) for R software.The residual normality of the equations developed was checked using the Shapiro-Wilk test.Means were analyzed by the t-test and homogeneity by the Breuch-Pagan test, using the lmtest package (Zeileis and Hothorn, 2002).The equations also met the assumptions of the overall regression test of the gvlma package (Peña and Slate, 2014).In addition, descriptive analysis of equation residues was performed.

PTFs published in the literature evaluated
Nine PTFs of Bd already published in the literature were evaluated from the database presented in this study.The PTFs analyzed were obtained from Hollis et al. (2012) for Europe, Jeffrey (1970) for northern England, Honeysett and Ratkowsky (1989) for Tasmanian forest soil (Australia), Adams (1973) for a Welsh forest, Tamminen and Starr (1994) for Finish forest soil, Manrique and Jones (1991) for the USA, Bernoux et al. (1998) and Tomasella and Hodnett (1998) for the Brazilian Amazon, and Benites et al. (2007) for different areas of Brazil.The equations, their coefficient of determination (R 2 ), and number of samples (n) used to fit the equations are shown in table 1.

Accuracy assessment of PTFs
The accuracy of the equations was evaluated from information from 50 soil horizons, which constituted an independent data set (not used for modeling).Scatterplot graphics of observed and predicted values a, mean error (ME), and root mean square of error (RMSE) were used for assessing the accuracy of the PTFs.Calculation of ME and RMSE was performed through the following equations: Rev Bras Cienc Solo 2017;41:e0160158 the tendency of over-or underestimating mean error, respectively (De Vos et al., 2005).The closer ME is to zero, the better the fit of the equation.

Descriptive statistics
The number of samples (n) shown is different for each attribute because the horizons evaluated do not contain data on all the variables (Table 2).For instance, sand and clay data may be available for a particular horizon but not for base saturation (V), whereas another site may have information on V but not clay content.); R 2 : coefficient of determination; TOC: total organic carbon; TOC m : total organic carbon determined in muffle furnace; SB: sum of bases; Ln: Neperian logarithm.In Benites et al. (2007), clay and TOC are represented in g kg  The coefficient of variation (CV) of the attributes is high (>50 %), except for pH (Table 2).Asymmetry and kurtosis values are related to data distribution, with data considered symmetric when asymmetry is zero.Data showed positive symmetry (>0), indicating left-skewed distribution (left of the median value).
Kurtosis for sand, clay, TOC, and Bd shows a flatter pattern (platykurtic distribution).The values for silt, sum of bases (SB), cation exchange capacity (CEC), Al 3+ , H + , base saturation (V), P, and pH are more concentrated around the mean and median (leptokurtic distribution) (Table 2).This behavior is not suitable for linear regression because it tends to show a low association with other variables, as was observed in the Bd of the present study.None of the variables showed normal distribution according to the Shapiro-Wilk test (p<0.05;Table 2).
Considering equation validation based on independent data, Bd was the only variable with a CV lower than 50 % (only 18.5 %) (Table 3).The other variables show high dispersion values.Clay showed the best data asymmetry (close to zero), and sand, silt, TOC, Bd, and SB were left skewed.Only pH distribution tended to be right skewed.
Sand and SB data were more concentrated (kurtosis >0.263).Clay, silt, TOC, and pH showed a flatter distribution (kurtosis <0.263), and Bd exhibited mesokurtic distribution.Only Bd and pH exhibited normal distribution (p>0.05,Table 3).In general, the group of independent data used for validation represents the shape of training data distribution.

RESULTS AND DISCUSSION
The development of the PTFs prioritized easily obtainable variables, an important principle in producing prediction equations.The attributes of table 2 were used as input data for regression, and the stepwise procedure selected the significant variables to integrate the equations.As such, only total organic carbon (TOC) and clay were significant for Bd prediction.
The TOC content is the most significant variable in Bd prediction models (Jalabert et al., 2010).However, Benites et al. (2007) reported that TOC is more important than clay for Bd prediction in the 0.00-0.10m soil layer, but clay is the most important at 0.30-1.00m.Two equations were developed in the present study, PTF1 and PTF2 (Table 4).PTF2 was developed based only on TOC, but PTF1 was based on TOC and clay, although clay content is not frequently assessed in organic soils.Equation exponents were determined by data transformation using the Box-Cox transformation.This procedure identifies the most adequate value to use as exponent, promoting normality of equation residues.
The parameters of both equations produced were significant (p<0.05), as were the equations (p m <0.05) (PTF1 and PTF2).R 2 was 0.58 and 0.52 and R 2 adj was 0.56 and 0.52 for PTF1 and PTF2, respectively.SE was 0.19 Mg m -3 for PTF1 and 0.11 Mg m -3 for PTF2 (Table 4).Analysis of variance also shows that the variables were significant at 5 % for both equations (Table 5).

Residue analysis
Residue analysis of the two PTFs proposed is shown in table 6. Residue mean tendency toward zero is a favorable condition.The low standard deviation (SD) of PTF2 and the maximum and minimum values close to zero indicate equation accuracy.The PTF2 residues were close to zero, which is also a positive result.Asymmetry and kurtosis were adequate, indicating normal distribution.In general, the PTF2 residues are slightly better than those of the PTF1.
The Shapiro-Wilk test indicated data normality, with p=0.34 for PTF1 and p=0.21 for PTF2.Residue association for a normal distribution indicates that the mean of model errors tends towards zero, as confirmed by the Student t test (Table 7).These features are desirable for regression analysis.); Clay, in g kg -1 ; TOC: total organic carbon (g kg -1 ).The errors also meet the homoscedasticity assumption (homogeneous distribution of variances), as calculated by the Breusch-Pagan test (p=0.23 and 0.81 for PTF1 and PTF2, respectively).Homoscedasticity is important in validating regression functions.

Model evaluation: error parameters
The performance of the PTFs was assessed by RMSE, ME, and the determination coefficient (R 2 ), fitting them to the independent sample and comparing them to the PTFs developed in other studies.Figure 3 shows that all equations, except for those of Honeysett and Adams, underestimated Bd.The Tomasella and Benites equations exhibited the worst performance, with likelihood of higher ME and RMSE because they were developed for mineral soils.The Bernoux and Manrique equations were also limited, showing high RMSE values.
The equations with the best performance were PTF1, PTF2, Hollis, and Honeysett (Figure 3), with RMSE from 0.22 to 0.26 and ME from -0.12 to 0.05.In general, these were the best parameters for validating the equations based on independent data.Furthermore, these equations showed the highest correlations.
Validation of the Hollis equation explained 29 % of Bd variance, with error probability of ± 39 % (Hollis et al., 2012).In the present study, the Hollis equation explained 36 % of Bd variance, with error probability of ± 13 %.The authors observed that for all the equations developed, Bd prediction is relatively limited when applied to a data set with validation of independent data, even when they were obtained from a same area or soil type.
The PTF1, which showed the highest accuracy, adopts TOC and clay as predictor variables.
As such, use of this equation is recommended to estimate Bd in organic soils.The high clay content in soils must be considered.Table 2 shows average clay content of 350 g kg -1 , with a maximum value of 900 g kg -1 , indicating the significant presence of the mineral fraction.However, clay content is not analyzed for many organic soils.In these cases, the alternative manner to determine Bd uses the PTF2, Hollis, and Honeysett equations, which contain only TOC as a predictor variable.
The PTF2, Hollis, and Jeffrey equations showed the best performance (which contain only TOC), with lower RMSE and ME (Figure 3).The Hollis equation applies the log values of TOC data to linearize the equations, that is, the Bd of organic soils showed a good logarithm relationship with TOC.To produce PTF2, the Box-Cox transformation was used to meet linear regression assumptions.
Distribution of the results predicted by the equations tested is shown in figure 4. The box-plots for the PTF1, PTF2, Hollis, Jeffrey, and Honeysett equations were closest to those of the observed data.In turn, the box-plots for the Benites, Tomasella, and Bernoux equations indicated that they provided the worst Bd prediction.The latter three equations used more than two variables for Bd prediction, including chemical properties.This suggests that chemical attributes are not good variables for predicting Bd in organic soils.
The methods used for chemical and physical analysis of organic soils have been called into question in other studies (Pereira et al., 2006;Gatto et al., 2009).Most procedures are calibrated to mineral soils, hindering interpretation of results.The coefficients of determination obtained by the authors (Table 1) are higher than those obtained by the equations proposed in this study, with the exception of the Manrique equation.These equations were developed for a wider range of soil classes, and with the exception of the equations developed in Brazil, in environments with predominantly temperate climates.The only equation developed exclusively for soils with high organic matter content is the Hollis equation, however, for a temperate climate.The variability of soils with high organic matter content is wide because the database comprises organic soils of a predominantly tropical country, but with data from cold regions, such as southern Brazil and montane regions, especially of the states of Minas Gerais and Rio Janeiro.Removal of outliers was tested, but there was no significant improvement in R 2 .The technique that enabled improvement was the Box-Cox transformation.For the same reasons, the RMSE and ME were also affected, but in validation, they showed similar values to the equations tested (Figure 3).The soil samples evaluated had high C content, characterizing them as one of the most restricted groups among the different soil types.Despite attempts to search for specifications in development of predictive equations for organic soils, limitations in generating and processing the data set were observed.Difficulties include differences in methods of soil analysis and differences in organic soil categories.Soils with high TOC levels encompass a wide range of soil classes, with different compositions.For instance, Histosols can exhibit sapric, hemic, or fibric characteristics.This limitation, however, was not found by Hollis et al. ( 2012), who combined histic and folic soils in a single group, given that they did not observe significant differences in mean Bd.The limitations reported do not compromise the findings of the present study, which proposes specific and accurate equations for Bd prediction in organic soils.
Therefore, the most significant limitations are the use of different methods of soil analysis and high data variability.Carbon content in soils that make up the database was determined by different methods, such as combustion in a muffle furnace, titration with dichromate (Walkley and Black, 1934;Yeomans and Bremner, 1988), and automatic elemental analyzers (CHN), which commonly increases variation in C content among the samples (Pereira et al., 2006;Gatto et al., 2009).These variations may have affected the validation parameters of the equations produced.
One of the qualities of regression models, called regression toward the mean, determines equation parameters based on mean training data values.Therefore, the values obtained by different methods tend to be similar to the mean obtained using different analysis methods.In that case, linear regression is the most appropriate method for reducing these effects in predictive equations, irrespective of the use of an equation developed by the present study or by other authors.

CONCLUSIONS
The use of PTF1, produced using stepwise linear regression, is recommended for predicting Bd in organic soils with data available on clay content, given that it uses not only TOC, The analysis of clay content is scarce for organic soils, and in the absence of these data, the use of the PTF2, Hollis, and Honeysett equations is recommended because they provided the highest accuracy in Bd determination based on TOC as the only predictor variable.
The better performance of the equations produced and applied to data collected in Brazil highlights the importance of the environmental context, especially soil and climate conditions, in the development and accuracy of pedotransfer functions.

Figure 1 .
Figure 1.Map of soil profiles with organic horizons.

Figure 2 .
Figure 2. Steps of equation development and validation.

Figure 3 .
Figure 3. Observed and predicted values of functions developed and obtained from other studies, considering the validation data set (50 samples).Bd: soil bulk density; RMSE: root mean square error; ME: mean error.

Figure 4 .
Figure 4. Boxplot for soil bulk density (Bd), which was observed (Obs) and predicted by the functions tested using the independent sample.

Table 1 .
Pedotransfer functions published by others authors and tested in this study n: number of training samples; PTF: pedotransfer function; na: not-available; Bd: bulk density (Mg m -3

Table 4 .
Pedotransfer functions developed to predict bulk density and the respective indexes of regression

Table 5 .
Analysis of variance for the regressions produced

Table 6 .
Residuals of the regression analysis for the PTF1 and PTF2 models Rev Bras Cienc Solo 2017;41:e0160158