Pedotransfer functions for estimating the van Genuchten model parameters in the Cerrado biome

HIGHLIGHTS: Machine learning algorithms were superior to stepwise regression in estimating water content saturated and residual parameters. The high variability of the fit parameters α and n produced a low precision of the PTFs developed for such parameters. The variables sand, clay, microporosity, and microporosity were the most important variables for the development of PTFs. ABSTRACT The Cerrado biome has presented challenges in reconciling its agricultural expansion with water availability. In this sense, water resources planning and management are fundamental for the economic, social, and environmental development of the Cerrado biome, which has been hampered by the lack of data, especially those referring to irrigation strategies, such as, for example, the water retention curve. The water retention curve is essential to understand water dynamics in the soil; however, obtaining it can be laborious, opening an opportunity for Pedotransfer Functions (PTFs). The current study aimed to develop and evaluate PTFs to estimate the fit parameters of the van Genuchten model for the Cerrado biome. Multiple Linear Regression (MLR) and four machine learning (ML) algorithms were used to develop the PTFs. The ML algorithms were the Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), Support Vector Regression (SVR), and K Nearest Neighbors (KNN). Two combinations of soil data were evaluated, and the predictor variables used in each set were different. Using the RF and SVR models, the best estimates were obtained concerning the parameter θs (saturated water content). As for θr (residual water content), the models showed a moderate predictive capacity. For the other parameters, the models did not perform satisfactorily for α and n (fit parameters).


Introduction
The Cerrado biome has presented challenges in reconciling its agricultural expansion with water availability, especially in regions already experiencing conflicts in water availability (Ferreira et al., 2021).Thus, integrated water resource planning in the Cerrado biome is necessary to establish strategies aiming to increase water use efficiency by different users.
A water retention curve (WRC) is a mathematical representation of the drying process of a particular soil.It is a nonlinear, empirical relationship between the suction (or tension) exerted by soil on the surrounding moisture and the soil water content (Campbell, 1974).The WRC shape depends on soil physic-hydric properties.It is considered fundamental information for understanding water dynamics in the soil, and water balance calculations require efficient irrigation management.This curve resembles an inverted smoothed S where the upper and lower bounds correspond to saturated water content and residual water content, respectively.Several mathematical models were developed as an attempt to adequately represent the general shape of the curve, such as the models by Campbell (1974), van Genuchten (1980), Hutson & Cass (1987), Durner (1994), Fredlund & Xing (1994), Kosugi (1994), Seki (2007), and others.
Parameters of the WRC are estimated from local information on the water content observed values for each tension applied soil water.However, when dealing with large areas, such as the Cerrado biome, direct estimation of the retention curve parameters is not feasible at an appropriate scale.Such determination requires time and laborious routines, making using Pedotransfer Functions (PTFs) appealing as an indirect way of obtaining retention curves.PTFs allow the estimation of WRC parameters from physical-hydraulic attributes that are easy to measure and at low costs, such as sand, silt, clay, organic matter, and bulk density, among others (Vereecken et al., 2010).
This study aimed to develop pedotransfer functions to estimate van Genuchten model parameters for the Cerrado biome using multiple linear regression models and machine learning algorithms.

Material and Methods
The data used in this study were obtained from the Research Group on Water Resources of Embrapa Cerrados and the Hybras dataset (Ottoni et al., 2018).Initially, soil samples with data from WRCs and sand, silt, and clay, bulk density, particle size, total porosity, macroporosity, and microporosity were selected.The georeferenced samples (Figure 1) within the territorial limit of the Cerrado biome (Brazil), with a buffer of up to 100 km, were selected, totaling 188 samples (Figure 1).As for the non-georeferenced samples, 384 samples were selected, of which 216 belong to the states of Goiás (GO), 103 to Tocantins (TO), and 65 to Mato Grosso do Sul (MS), totaling at first 572 WRCs selected for the Cerrado region.
The methods for determining soil properties were reported in the consulted databases.For the obtention of soil water content, a tension table was used for tensions between saturation and tension of 6 kPa, and a pressure chamber for higher tensions (EMBRAPA, 2017).For the determination of the granulometric contents, the pipette method was used.To obtain the bulk density, the volumetric ring method, and for the particle density, the volumetric flask and pycnometer methods (EMBRAPA, 2017).Total porosity (Pt) was based on bulk density (Bd) and particle density (Dp) (Pt = 1-Bd/Dp).The macroporosity was calculated by the difference between total porosity and microporosity (EMBRAPA, 2017).
The model used to represent the water retention curves was the van Genuchten (1980) equation.The model components, that is, the response variable (θ (ψ) ), the predictor (ψ) and the set of model parameters are defined in Eq. 1.
The SWRC Fit software (Seki, 2007) was used to obtain the estimated parameters of the van Genuchten (1980) model, namely, θ s , θ r , α, and n.It is worth mentioning that a minimum number of observed responses corresponding to each tension value (ψ, θ (ψ) ) is necessary for model fitting.This number must be equal to or higher than the number of parameters.For instance, at least four points are required for those WRC fittings.
PTFs were developed only for the method that presented the overall best performance in estimating the WRC parameters for the Cerrado biome.The training subset was organized (1) considering two sets of predictors, namely, A1: sand, silt, clay, bulk density, particle density, total porosity, macroporosity, and microporosity; and A2: sand, silt, clay, bulk density, and particle density.Two different sets of predictors were used to visualize how the PTFs development models would behave.
In the development of the PTFs, five methods were evaluated: Multiple Linear Regression (MLR), Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), Support Vector Regression (SVR), and K Nearest Neighbors (KNN), the last four algorithms being of machine learning.The PTFs were all developed in an R environment (R Core Team, 2019).
For the MLR, the Stepwise method was used to select the 'best' set of predictors.This method consists of adding more significant predictor variables or removing less significant ones during each model construction stage (Eq.2) (Olubi, 2021).The PTF was developed using only the predictor set A1 in this case.
adjustment tolerance of the models.The radial kernel function was used for this study, and the model with the lowest RMSE value was selected.
Finally, KNN is a model that estimates the variable as a function of the average distance of its nearest neighbors in the data set, and for this, a measure of distance is used.In this case was adopted the Euclidean distance (Kohli et al., 2021).This means that models were developed with different numbers of nearest neighbors controlled by the hyperparameter k, and the one with the lowest RMSE value was selected.The R package kknn (Schliep & Hechenbichler, 2013) was used for this.
The repeated holdout method (Tanner et al., 2019) was used to validate the generated models.The database was divided into two independent subsets, namely, 'training' and 'test' , consisting of 70 and 30% of the data, respectively.
The hyperparameters of each machine learning model were adjusted by the k-folds cross-validation method with repetitions (k = 10, n = 3).Finally, the performance of the tested methods was evaluated using the test set, allowing the evaluation of the generalization capacity of the PTFs.
In general, the predictive capacity of PTF is evaluated from indices that measure the errors between predicted and observed data (Nasta et al., 2021).To evaluate the performance of the PTFs developed for the WRC parameters, the following statistical indexes were used: the coefficient of determination (R²), the mean error (ME), and the root mean squared error (RMSE) were used.These indexes are commonly used in the evaluation of PTFs (Nguyen et al., 2017;Nasta et al., 2021).The R² expresses the degree of agreement between the observed values and those predicted by the PTFs (Eq.3); corresponds to the squared Pearson correlation (r), assuming values between 0 and 1.The ME (Eq.4) is an overall measure that indicates if the model tends to overestimate (ME > 0) or underestimate (ME < 0), based on an average of model residuals the response variable, and the RMSE indicates the average absolute error magnitude without account to the sign (positive or negative) of the model residuals (Eq.5). where: Y i -variable to be estimated (parameters of the van Genuchten (1980) model: θ s , θ r , α, and n); β 0 -intercept of multiple linear regression; β 1 … β n -angular coefficients linked to soil predictor variables; and, X 1 ... X n -soil predictor variables (sets A1 and A2).
In addition, the Shapiro-Wilk test was used to verify the normality of the data, and those variables that showed a tendency to non-normality were transformed using the decimal logarithm function.
For the MARS model, the R package called earth was used.MARS is an algorithm that automatically models nonlinearity and interactions between variables where the training sets were divided into linear segments fitted into polynomial curves (splines) with different numbers of interactions and joined by knots (Hastie et al., 2009).For this, models with different numbers of interactions and nodes were developed, and the model with the lowest RMSE value was selected.
The RF is a model that combines regression trees, providing the average prediction of all trees (Liaw & Wiener, 2022).RF uses bootstrap sampling, i.e., it randomly draws a sample (with replacement), keeping the original size of the data in each tree.The R package randomForest (Liaw & Wiener, 2022) was used for this.A bootstrap sampling was performed for each generated tree, with the number of variables selected at each split of the tree controlled by the 'mtry hyperparameter' , and the model with the lowest RMSE value was selected.
The SVR is an algorithm based on the hyperplane fit that separates the points in an n-dimensional space, where n is the number of predictor variables (Zhong et al., 2019).The R package e1071 (Meyer et al., 2019) was used for this.The model uses the kernel function to optimize the obtaining of the hyperparameters C (cost) and γ (gamma), responsible for the ∑ where: y j and ŷ j -estimated and observed values of the response variable, respectively; y j -mean of the y j values; N -number of samples; Σ(ŷ j -y j ) 2 -variance explained by the model; and, Σ(y j -y j ) 2 -total variance. (2) (3) (4) (5)

Results and Discussion
The database used in this study presented textured soils with high levels of sand and clay, with most soils classified as clayey (A -clayey) (Figure 2).The soils of the Cerrado biome are mostly considered Oxisols, soils with high clay content and well structured, and may have high hydraulic conductivity.These characteristics confer one of the main differences between tropical soils and soils from temperate climates and, consequently, the reason for the inaccuracy of FPTs developed in these regions when applied to tropical soils.
Table 1 shows the descriptive statistics of the parameters of the van Genuchten (1980) equation.The parameter θ s presented the lowest variability compared to the other parameters of the WRC, resulting in low values of both standard deviation and CV.Conversely, θ r , α, and n presented the highest variability expressed by high values for both standard deviation and CV values.The CV of the n parameter also showed a large difference between the training (76.23%) and the test set (45.15%) in the test set.The α parameter presented the greatest variability, resulting in a large difference between its average values obtained from the training and test sets, around 280%.
The high variability of the α parameter is observed in several studies.Vereecken et al. (2010) and Gupta et al. (2022) emphasize this high variability of α as inherent to the parameter and state that retention curves with high values of α indicate soils with considerable sand contents, as verified in the database used in this study for the Cerrado biome (Table 1).
Regarding the results of the Shapiro-Wilk test, all variables showed a tendency toward non-normality in the data distribution, so these variables were submitted to transformation using the decimal logarithm function to construct the multiple linear regression.The normality trend was confirmed for all variables after the transformation.
The general ability of the methods for estimating the parameters of the WRC model was evaluated from the average value of the performance criteria (Table 2).The RF model showed the best performance for the θ s and θr parameters.For α and n fit parameters, the R² values were approximately zero for all models and predictor sets, with the MLR as the best model.
ME values for θ r were relatively low, varying between -0.009 and 0.004 m 3 m -3 for sets A1 and A2.As for θ s , the MARS and SVR models were overestimated, with mean residuals around 5.0 and 6.0 m 3 m -3 .The ME value for the α parameter indicates that the model overestimates the unknown true values on average, with mean residuals around 3.0 m -1 for set A1 and around 1.0 m -1 for set A2.For n, the ME values also indicated overestimation, but the means residuals were lower, corresponding to the half part of ME for α (1.5 m -1 ), except for the MLR.
Regarding the RMSE values, set A1 presented a variation close to zero for θ s and θ r , and higher values for α, with a mean equal to 10.18 m -1 for A1 and A2 predictor sets, except for the MLR.These high values of RMSE and ME, as well as the low values of R² for the parameters α and n, can be attributed to the high variability of the soil and, consequently, the model fitting difficulty, as mentioned by Vereecken et al. (2010).
It is observed that, in general, machine learning models present a better performance when compared to multiple linear regression.Araya & Ghezzehei (2019) highlighted the potential of machine learning algorithms due to the nonlinearity between the physical-hydraulic properties of the soil, allowing these more robust methods to perform better than the models considered more straightforward.
Figure 3 shows the estimates obtained by the best model for each parameter (θ s , θ r , α, n).Note that the fitting lines (red line) are far from the 1:1 line for parameters α and n, indicating that the adjustments were poor.As for the soil water content, Table 2. Mean values of the summary statistics used to evaluate the performance of the method tested for PTFs development to estimate the parameters of the van Genuchten (1980) model using the predictor sets A1 and A2 the behavior is more acceptable, emphasizing the saturated water content, which presented the best fit of all the evaluated parameters.Barros et al. (2013) developed PTFs to estimate van Genuchten parameters in northeastern Brazil, finding R² values equal to 0.12 and 0.21 for parameters α and n, respectively, using soil texture data, bulk density, and organic matter.Other authors, such as Bai et al. (2022) andBaumann et al. (2022), also found difficulties in estimating the α and n parameters considering different types of soils and other regions of the world.
When analyzing the results of the A1 predictors set, PTFs performance improved for estimating θ s and θ r .Fatichi et al. (2020) highlight the importance of soil structural properties for assessing soil water content.The classification of the importance of the variables allows for visualizing the predictor variables that contributed the most in each round (repetition) in the development of the PTFs (Figure 4).The number of repetitions performed is found on the Y-axis, a hundred in this case, and the predictor variables on X-axis.It is observed that the structural variables, macroporosity, and microporosity, were crucial in all models evaluated (the bars are close to 100, indicating that they were considered in almost all repetitions).For example, analyzing the KNN model, Figure 4A, concerning the relevance of macroporosity, it is observed that for θ s , the bar reaches the value of 100.On the other hand, the variable θ r appears only in 60 repetitions.
Regarding the MARS model (Figure 4B), the independent variable total porosity is not very important as a predictor of θ s and θ r , occurring in less than 25 of the 100 repetitions.For RF (Figure 4D), sand, clay, macroporosity, and microporosity were considered essential for θ s and θ r , occurring in the 100 repetitions.In MLR, Figure 4C, macroporosity was deemed important only for θ r (less than 25 repetitions).Finally, the SVR, in Figure 4E, macroporosity occurred in all repetitions for θ s and only 60 of the 100 repetitions for θ r .It is worth mentioning that the macroporosity and microporosity variables are correlated so that the models can select one or the other to explain the variables θ s and θ r .
In addition to the macroporosity, microporosity sand and clay were selected in the PTFs development, considered necessary in all models evaluated.This behavior can be explained by the low variability of sand and clay contents in most of the soil samples that presented high levels of sand and clay in this study.The silt, in turn, because of its higher variability, was selected as a predictor in residual water content estimation by using the KNN, RF, MLR, or SVR models.
On the other hand, the total porosity was not selected for estimating the response variable θr due to its great correlation with this response.It is worth remembering that the total porosity was not used in the θ s estimation; therefore, it is not considered in its analysis.A similar explanation can be given to the behavior of the Bd and Pd variables.It is observed that they had a higher occurrence in the saturated water content estimation than in the residual due to the density correlation in the θs calculation.
Although the MLR did not present the best performance among the evaluated models, it allows PTF to be represented as an equation and, consequently, can be directly applied, which differs from the machine learning algorithms used in this study.Thus, the PTFs obtained for the parameters θ s , θ r , α, and n using the predictor set A1 are presented in Table 3.

Conclusions
1. Machine learning algorithms performed better when compared to multiple linear regression to estimate the parameters θ s and θ r .The variables sand and clay, as well as the incorporation of macroporosity and microporosity in the predictor, set A1, improved the performance of the machine learning algorithms in the estimation of saturation and residual water contents.
2. In general, the PTFs developed for the parameters α and n of the van Genuchten equation presented low performances in all models and predictor sets, tending to parameter overestimation.The models MLR were superior to machine learning algorithms to estimate α and n.For the parameter θ r , the predictive capacity was moderate using machine learning algorithms and low using multiple linear regression.As for the parameter θ s , the results were better using machine learning algorithms, and the equations obtained by multiple linear regression offer moderate predictive capacity.

Figure 1 .
Figure 1.Location of georeferenced soil samples in the Cerrado biome

Figure 2 .
Figure 2. Texture triangle of soil samples from the Cerrado biome used to develop the pedotransfer functions for estimating soil water retention curve parameters MA -Very clayey; A -Clay; AS -Silty clay; AAr -Sandy clay; FA -Loamy loam; FAS -Silty clay loam; FAAr -Sandy clay loam; F -Loamy; FS -Silty loam; FAr -Sand loam; S -Silt; ArF -Loamy sand; Ar -Sand Figure 3.Estimated parameters of the van Genuchten (1980) equation obtained by the best-performing models (A, B, C, D) RF, and (E, F) MLR compared to the parameters of the van Genuchten (1980) equation estimated from the original data of tension and volumetric soil moisture in each selected location in the Cerrado biome θs -Saturated water content; θr -Residual water content; α, n -Fitting parameters; A1 -Set predictor A1; A2 -Set predictor A2

Figure 4 .
Figure 4. Classification of the importance of predictor variables in the estimation of saturation (θ s ) and residual (θ r ) soil water content parameters using the A1 predictor set for the tested methods: (A) KNN, (B) MARS, (C) MLR, (D) RF, and (E) SVR

Table 3 .
Coefficients of pedotransfer functions obtained by stepwise multiple linear regression