Prediction of Topsoil Texture Through Regression Trees and Multiple Linear Regressions

Users of soil survey products are mostly interested in understanding how soil properties vary in space and time. The aim of digital soil mapping (DSM) is to represent the spatial variability of soil properties quantitatively to support decision-making. The goal of this study is to evaluate DSM techniques (Regression Trees RT and Multiple Linear Regressions MLR) and the ability of these tools to predict mineral fraction content under a wide variability of landscapes. The study site was the entire Guapi-Macacu watershed (1,250.78 km2) in the state of Rio de Janeiro in the Southeast region of Brazil. Terrain attributes and remote sensing data (with 30 m of spatial resolution) were used to represent landscape co-variables selected as an input in predictive models in order to develop the explanatory variables. The selection of sampling sites was based on the Latin Hypercube algorithm. A representative set of one hundred points with feasible field access was chosen. Different input databases were tested for prediction of mineral fraction content (harmonized and original data). The Spline algorithm was used to harmonize data according to the GlobalSoil.Net consortium standards. The results showed better performance from the RT models, using input from an average of six covariates; the simplest MLR model used twice as many input variables, creating more complex models without gaining precision. Furthermore, better R2 values were obtained using RT models, irrespective of harmonization of soil data. The harmonized dataset from the 0.00-0.05 and 0.05-0.15 m layers, in general, presented better results for the clay and silt, with R values of 0.52 (0.00-0.05 m) and 0.69 (0.05-0.15 m), respectively. Prediction of sand content showed better results when the original depth data was used as an input, although all regression tree models had R values greater than 0.52. The RT models provided a better statistical index than MLR for all predicted properties; however, the variance between models suggests similarity of performance. Regarding harmonization of soil data, both input databases (harmonized or not) can be used to predict soil properties, since the variance of model performance was low and generalization of the soil maps showed similar trends. The products obtained from the digital soil mapping approach make it possible to integrate the factor of uncertainties, providing easier interpretation for soil management and land use decisions.


INTRODUCTION
Soil maps are widely used as primary information in land management and protection of natural resources.Soil scientists face great challenges in meeting societal demand for soil information on appropriate scales to support decisions regarding land use and management of natural resources.Digital soil mapping techniques are able to provide useful soil information on an appropriate scale and in digital format.Soil texture or mineral particle size is a highly variable soil physical property, and studies on a regional and local scale have shown spatial dependence at short distances interfering in crops yield (Marques Júnior and Lepsch, 2000), which indicates the essential role of this property in growing crops, engineering projects, and land protection and conservation (White, 2006).The effects of soil texture on land capability, water and nutrient storage, and vegetation distribution and composition are well known globally (Klingebiel, 1963;Jenny, 1980;Silver et al., 2000;Fernandez-Illescas et al., 2001).Particle size distribution in soils is directly related to topographic indexes and slope (Leão et al., 2010(Leão et al., , 2011)).
Accurate prediction of soil texture is very important for agronomical purposes, particularly in precision farming, since soil texture is directly related to yield (He et al., 2013;Gozdowski et al., 2014).In this context, pedometric tools can be useful in predicting soil properties variability (spatial and vertical), such as soil particle size (Moore et al., 1993;Arrouays et al., 1995;McBratney et al., 2000).Usually, soil sampling at various depth intervals and description of horizons/layers are performed according to morphological properties related to pedogenesis.The measured values of soil properties correspond to the depth of the horizon, which varies according to the type of soil profile.However, surveys with specific objectives usually sample the soil according to predefined depths.
The global soil survey consortium (GlobalSoilMap project) has proposed standard depths (vertical soil profile) to expand the database of soil properties.The six pre-defined depths correspond to the following layers: 0.00-0.05,0.05-0.15,0.15-0.30,0.30-0.60,0.60-1.00,and 1.00-2.00m.The specifications of the GlobalSoilMap project suggest that data at these six depth intervals can be harmonized through soil depth function, usually applying the equal-area quadratic spline (Arrouays et al., 2014).The legacy dataset has been successfully harmonized by Nussbaum et al. (2018), allowing comparison between different models applied to predict soil properties for distinct soil layers.
Spatial prediction of soil properties using statistical tools and pedometric concepts is supported by correlation with landscape attributes derived from a digital elevation model (DEM) and remote sensing data (Dobos et al., 2000;McBratney et al., 2003).Application of these techniques was exemplified at Moore et al. (1993), McBratney et al. (2003), and Odeh et al. (1994).Most papers concerning prediction of soil properties focus on carbon storage and hydrological properties.Multivariate linear regression models and/or tree-based models applied to predict soil properties are exemplified by Moore et al. (1993), Henderson et al. (2005), Eldeiry and Garcia (2008), Vasques et al. (2008), Ließ et al. (2012), Minasny et al. (2013), and Carvalho Junior et al. (2014a).
The study hypothesis is based that the application of digital mapping techniques can aid the quantitative spatial prediction of soil particle size components at distinct depths.In this sense, the main goal of this study was to compare two different models used to predict the spatial distribution of soil mineral particle-size fractions (clay, silt, and sand).

Study area and soil sampling
The Guapi-Macacu watershed is located in the Southeast region of Brazil, in the state of Rio de Janeiro.In Brazil, the watershed is established as a territorial unit for managing water resources and land use by the National Policy on Water Resources (Law No, 9433/97).The Guapi-Macacu watershed is one of twelve in the Guanabara Bay Hydrographic Region.It has a catchment area of 1,250.78km² and a perimeter of 199.2 km.The climate is classified as tropical rainy with a dry winter (Aw, according to the Köppen classification system), supporting different land uses, such as agriculture, cattle raising, and a preserved natural park under typical rainforest vegetation (Atlantic Forest).The area is located between the UTM coordinates 7488481-7526005 mS and 699292-752193 mW (horizontal datum WGS-84), and it has a wide variety of landscape features.The region is located in the central part of Guanabara Graben, known as the Macacu Sedimentary Basin, which was formed by several deposition sequences from tectonic events at the beginning of the Tertiary (Ferrari, 2001).As an example of landscape variability, elevation varies from sea level (0 m) up to 2,600 m within the watershed (Hora et al., 2010).
The study was conducted in 2010 and 2011, and figure 1 shows the location of sampling sites in the Guapi-Macacu watershed in Rio de Janeiro, Brazil.
Conditioned Latin Hypercube Sampling (cLHS) was used to best achieve distribution of the sampling sites according to landscape attributes while considering the feasibility of acquiring the samples (Minasny and McBratney, 2006;Roudier et al., 2012).To set the parameters for conditioning the sampling scheme, a buffer size of 100 m to each side of the mapped roads (source: national database in scale 1:50,000 from Brazilian Institute of Geography and Statistics -IBGE), the number of sample points (100), correlation and data weight (0.5 and 1.0, respectively), and number of iterations (20,000) were set.All of these input parameters are required, and the values can be adjusted based on the specific research area and limitations (Minasny and McBratney, 2006).The selection of sampling points within this watershed was based on the parameters of spatial position, elevation slope, curvature, and land use map (Fidalgo et al., 2008).The urbanized areas were excluded, and selection of sampling sites was restricted to the buffer area (Roudier et al., 2012) defined by preliminary analyses (Carvalho Junior et al., 2014b).Rev Bras Cienc Solo 2018;42:e0170167 Sand, silt, and clay content were obtained according to the procedures described by Claessen et al. (1997).The analytical results of the mineral fraction content correspond to the depth of the genetic horizons/layers, as identified in the field soil survey.The soil was classified according to the Brazilian System of Soil Classification -SiBCS (Santos et al., 2013) and the corresponding classes in the World Reference Base for Soil Resources -WRB (WRB, 2014).The typical classes that occur in the area are Planosols (Planossolos), Cambisols (Cambissolos), Gleysols (Gleissolos), Ferralsols (Latossolos), Fluvisols, and Regosols (Neossolos Flúvicos and Litólicos, respectively).

Input covariates
The main Geographic Information System (GIS) used was ArcGIS Desktop v.10 (ESRI, 2010).Terrain attributes were obtained through the System for Automated Geoscientific Analyses -SAGA v.2.1.4 (Conrad, 2007).This software is focused on landscape analysis but can be used for soil mapping (Conrad, 2007;Hengl, 2009).
Additional analyses related to remote sensing data were performed on Erdas Imagine v.9.1.software (Erdas Systems, 2008), and the landform map was created on the Geographic Resources Analysis Support System (Grass, 2013), with the Geomorphons add-on.The DEM was generated by interpolation of the primary elevation data and the drainage network, and was restricted to the watershed limits.The primary elevation data involved contour lines and precision elevation points extracted from the official Brazilian charts, 1:50,000 scale (Brazilian Institute of Geography and Statistics, and Brazilian Geological Service).An elevation model with spatial resolution of 30 m was generated by the "Topo to Raster" tool in the ArcGIS Desktop v.10.After the interpolation procedure, "sink" cells were completely filled so the final DEM would not result in interpolation failures in the model and derivatives.
Digital landscape attributes were generated from the adjusted DEM to expand the set of predictive variables used as input for the predictive models.Analyses were first performed to understand the variability of terrain variables and soil properties in the watershed.They included visual evaluation of the maps and descriptive statistics parameters (mean, standard deviation, minimum and maximum).After this procedure, thirty-seven terrain variables were selected for testing as predictor variables, as described below.
Attributes derived from the DEM and stream networks, such as elevation, slope, curvature, Compound Topographic Index (CTI), and the Euclidean distance from stream networks were generated by "Surface tools" in the "Spatial Analyst" toolbox (ArcGIS Desktop v.10).The CTI was obtained by a sequence of commands in the "grid" module of ArcINFO.These attributes were first tested, and they showed effectiveness in predicting soil classes in the same watershed (Pinheiro et al., 2013).Terrain co-variables were derived from the 30 m resolution DEM and drainage network using the "Terrain Analysis" toolbox in SAGA (System for Automated Geoscientific Analyses) to provide enough quantitative data to represent landscape features and environmental functions to predict soil properties.The derived covariates included: (i) 15 terrain attributes related to relative position and relief features [mass balance index, mid slope position, modified catchment area, multiresolution index of ridge top flatness (MrRTF), multiresolution index of valley bottom flatness (MrVBF), normalized height, protection index, slope height, valley depth, hillshade, channel network base level, altitude above the channel, vertical overland flow distance, SAGA wetness index]; and (ii) nine climatic properties (sky view factor and simplified sky view factor, solar radiation, total insolation, terrain view factor, wind effect, diffuse insolation, direct insolation, duration of insolation).
Additional information about the procedures for creating terrain attributes in SAGA can be found in Olaya (2004).
The landform map was generated by the geomorphon method (Jasiewicz and Stepinski, 2013), in the Geographic Resources Analysis Support System (Grass), with the geomorphon Rev Bras Cienc Solo 2018;42:e0170167 add-on.The pre-defined parameters to create the landforms correspond to 45 cells (1,350 m) of search radius and 1 st of flatness threshold.
Remote sensing data from Landsat 5 TM (reference image of September/2011) were used as input variables, represented by six spectral bands (1, 2, 3, 4, 5, and 7) and three indexes calculated from the spectral bands of Landsat 5 TM, also with a 30 m spatial resolution.The indexes were the Normalized Difference Vegetation Index (NDVI), the iron oxide index (ratio between band 3/band 1), and the clay mineral index (ratio between band 5/band 7).The iron oxide index highlights the presence of iron oxides and sulfates, and the clay mineral index highlights the presence of clay minerals, such as alunite, illite, kaolinite, and montmorillonite (Sabins, 1997;Chagas et al., 2013).These last two indexes are commonly used in remote sensing applied to geology studies to recognize hydrothermal alteration and unaltered rocks (Sabins, 1999).The geology map was created by vectoring tools in ArcGIS Desktop v.10, based on the official Brazilian charts in a 1:50,000 scale (Geological Survey of Brazil and Department of Mineral Resources), and was simplified according to the type of parental material in four classes: alkaline rocks, granite/gneiss, sedimentary rocks, and quaternary sediments.
All layers were projected in the Universal Transverse Mercator (UTM) coordinate system, and the horizontal data according to the Geocentric Reference System for the Americas -Sirgas 2000, Zone 23S.

Modeling procedures
The procedures used to predict soil properties (sand, silt, and clay) were regression trees and multiple linear regression.The statistical procedures were implemented in the R software (R Development Core Team, 2014).
Multiple Linear Regressions (MLR) have been widely used to predict the response of a dependent variable from a set of independent variables, as a function of the correlations between them.The MLR algorithm was implemented using the lm( ) command, with stepwise (backward) analysis, fitting the model by removing variables according to confidence level (95 %).The approximation through least-squares was used to validate and constitute the best linear unbiased estimators of the regression parameters (Berry and Feldman, 1985;Vasques et al., 2008).
Regression Trees (RT) are implemented through the Recursive Partition and Regression Trees package, named "rpart" (Therneau et al., 2017), primarily based on the CART (classification and regression trees) algorithm (Breiman et al., 1984).The logic of the tree-based methods is a binary procedure, which is obtained by recursive partitioning of the dataset in two subsets.These methods are more homogeneous, based on the importance of the covariates over the data.This procedure is repeated recursively until the number of subsets reaches a minimum, or there are no gains in fitting models through further subdivisions.The pre-defined parameters were complexity parameter (cp) equal to 0.001 (default) and the model was fitted as analysis of variance (Anova) according to the least square mean error.Each partitioning tends to minimize the difference between two subgroups at each node, and subdivisions that do not improve the fitted model are pruned by cross-validation.Finally, terminal nodes represent the predictive value as the average of all measured values (Vasques et al., 2008).
Assuming that the influence of terrain variables on soil properties is markedly closer at the soil surface (Florinsky et al., 2002), and topsoil models are stronger than subsoil models (Henderson et al., 2005), this study focused on prediction of sand, silt, and clay content in the topsoil layer.
To accomplish the proposed goals, the analysis was organized into two steps.In the first step, the soil data from the original database was used as input for the predictive algorithms.The second step applied the predictive algorithms to the harmonized database at the regular depths of the surface layers (0.00-0.05 and 0.05-0.15m).
Rev Bras Cienc Solo 2018;42:e0170167 To meet the specifications of the GlobalSoilMap project (Arrouays et al., 2014), a new database was created from the original to represent the harmonized data in the 0.00-0.05and 0.05-0.15m layers.Harmonization of soil properties at regular depths was performed using the soil depth function to interpolate the data.The spline function proposed by Ponce-Hernandez et al. (1986) represents a nonparametric function, called an equal-area spline, appropriated to model soil properties (Bishop et al., 2009;Malone et al., 2009).
The equal-area spline function considers each horizon as the pre-defined interval and the knots of each horizon lie between horizon boundaries, with one inflexion in each interval.
The knots should lie as near as possible to the inflexion and as far from boundaries as possible, which, in essence, preserves the mean value of the soil property (Odgers et al., 2012).The spline functions were applied from the original data collected in the horizon layer (genetic horizons) to harmonize at six pre-defined depths according to the GlobalSoilMap project.From the output data generated by this procedure, the data from the first two layers (0.00-0.05 and 0.05-0.15m) were selected to represent the topsoil layer.This procedure was performed to contrast the results obtained by the different input databases (harmonized data at two depths, and original data).
Maps and graphs were generated to compare performance between models (multivariate linear regressions and regression trees), and between input data (original depth and harmonized at 0.00-0.05and 0.05-0.15m).The results were compared through the coefficient of determination (R 2 ), root mean square error (RMSE), complexity of the model (number of variables used), and map generalization.All statistical procedures used to create the maps and graphs were created on R and RStudio software, and the final layout was built with the support of ArcGIS Desktop v.10.

Landscape covariates and importance in predicting soil texture
A brief description of the covariates and their respective importance in modelling the variability of soil texture in accordance with the methods tested are presented in table 1.
The predictive models (MLR and RT) tend to prioritize input variables that provide significant explanatory effects (Faraway, 2002).Redundant predictors act as noise to the estimation and should be removed to make the model as simple as possible, according to the principle of Occam's razor, which advocates that the simplest theory should be chosen from among all theories (Young et al., 1996).
According to table 1, all models included at least one input covariate from remote sensing data (Landsat bands or derived indexes) to predict the soil properties of interest.The geology map was selected by all MLR models, regardless of the input dataset (original or harmonized), showing the direct relationship between parental material and soil texture.
On the other hand, some of the terrain attributes, such as elevation, geomorphons, CTI, mass balance index, wind effect, and direction and duration of insolation, were not selected by any model tested in this study.One possible explanation is that the models discard variables correlated with each other, as highlighted by Beven and Binley (1992) and Deng et al. (2008).For example, a pairs of terrain attributes can vary simultaneously such as elevation and slope, insolation attributes, and analytical hill shading.Modified catchment area and altitude above the channel were used in one model each, showing their restricted influence on prediction of soil properties.Sabins (1997Sabins ( , 1999)), Chagas et al. (2013), Pinheiro et al. (2013 (1,2,3) 0

Diffuse insolation
Incoming solar radiation reflected by atmospheric components (KWH m -2 yr -1 ) Böhner and Antonic (2009), Wilson and Gallant (2000) (3) 0 Rev Bras Cienc Solo 2018;42:e0170167 Pinheiro et al. (2017) for the same area.The terrain attributes of valley depth, normal height, and SAGA wetness index were important only for predicting silt content, which was also observed in the field, and table 1 shows each covariate used to predict the texture components.Higher silt contents were related to less developed soils, such as Regosols (Neossolos Litólicos Distróficos), Cambisols (Cambissolos Haplicos Tb Distróficos), and Fluvisols (Neossolos Flúvicos Tb Distróficos).The Fluvisols had the highest values for wetness index due to their low slopes, and their occurrence was related to deep and irregular fluvial deposits in the broader valleys.The attributes of analytical hillshade and altitude above the channel were used only in sand prediction models.Topsoil layers with high sand content were also related to proximity to river channels and young soils.
It is well known that success in modeling environmental characteristics is related to the quality of the input data, associated with a powerful set of predictive covariates (Zhu, 2001;Minasny et al., 2003).In this sense, this study can contribute to improving the modeling techniques applied to the mapping of soil properties.As for the input data, harmonization of the data allows creation of a map for the target properties corresponding to a layer of pre-defined thickness, which can be useful for agricultural purposes, for example.As for predictive covariates, by testing a large set of environmental data, it was possible to identify primarily those covariates related to soil texture, but that also may be related to other soil properties in this watershed, such as cation exchange capacity (CEC) and soil types.However, to build better predictive models for soil particle size, further studies are necessary for determining the appropriate input covariates to understand the relationships between landscape attributes and soil variability.

Variability of soil texture in the watershed
The statistical description of soil properties (sand, clay, and silt) based on soil sample analysis (original data and harmonized data) (0.00-0.05 and 0.05-0.15m) is presented in table 2.
The Guapi-Macacu watershed exhibited substantial variability in soil types, predominantly Ferrasols -Latossolos (28 %), Acrisols -Argissolos (24 %), Cambisols -Cambissolos (18 %), and Gleysols -Gleissolos (15 %).Soils with high sand content were common along the Macacu and Guapi-Açu floodplains, which have wide texture variation due to river deposition systems and events.In the floodplains, particularly near river deltas and in estuarine deposits, Histic horizons and Gleysols with low pH (<4.5) were documented.Clayey soils show a wide area of distribution and were primarily derived from granite and gneiss parent materials.Some Acrisols have abrupt textural changes, with sandy surface horizons above clayey horizons (Santos et al., 2013).Parent materials of sedimentary rock origin are limited in the watershed and, in general, the soils formed have clayey textures and xanthic properties (WRB, 2014).A detailed analysis of soil-landscape relationships and soil genesis in the area can be consulted in Pinheiro et al. (2017).
The spline fitted curve of the profile (Figure 2) illustrates the original data (mean value corresponding to the depth of layer/horizon) and the harmonized data (fitted curve) according to the pre-defined depth intervals.The profile distribution of sand decreased with depth (Figure 2c).In contrast, the clay distribution increased substantially below 0.50 m of depth, and in the same layer, the silt fraction decreased drastically (Figures 2a and 2e).Silt content had decreasing values with depth to the bottom of the solum (approximately 1 meter).Below the solum depth, the increase in silt content in particular can be related the influence of weathered parental material.This textural pattern is typical for Haplic Ferralsols (Dystric), classified as Latossolo Vermelho-Amarelo according to the Brazilian System of Soil Classification -SiBCS (Santos et al., 2013), which are predominant in the watershed (Pinheiro et al., 2013).The spline function was executed on all sample point data to improve the capacity of preliminary results to predict the soil mineral fraction in the surface layer and to standardize the input database according to the GlobalSoilMap project (Arrouays et al., 2014).

Digital mapping of mineral soil particle size fraction
The linear models showed a greater range and more even distribution of output values compared to the predicted attributes.With the regression tree output, the values were discreet and points were often linear and parallel to the 'x' axis (Figures 3a,3b,3c,3g,3h,3i,3m,3n,and 3o).Rev Bras Cienc Solo 2018;42:e0170167 In general, the RMSE showed low values when the original database was used, although the data range was large.The data range for sand was 112.13-144.05g kg -1 , for clay was 94.68-106.35g kg -1 , and for silt was 46.10-60.27g kg -1 .When utilizing the RT models, the output values according to the terminal nodes (leaves) and the plot results show a horizontal trend (Figure 3) that suggests a grouping of output values subordinated to the homogeneity of the output nodes.This homogeneity is intrinsic to the method since the output values are grouped in terminal nodes.
The RT models fitted the predicted results better than MLR models, and had the lowest values for RSME errors (94.68 for clay content, 112.13 for sand content, and 46.10 for silt content) suggesting better predicted results than the RLM models, which exhibited 101.32 for clay content, 138.43 for sand content, and 57.66 for silt content.These values were within the range that was proportional to the magnitude of the input data values.A lower RMSE is associated with greater predictive ability, but this index cannot be used to compare different properties since it depends directly on the scale of values (Henderson et al., 2005).Regarding clay content, the mean value for RSME was 103.00 for the MLR models and 96.78 for the RT models.The lowest index values were obtained for silt content, with a mean value of 46.10 for the RT models and 58.53 for the MLR models.The highest mean values of RSME were found in sand prediction (140.70 in MLR models and 115.65 in RT) due to the assumed natural range of sand content, which is larger than that of the other components.A detailed discussion about model performance is presented below.

Evaluation and selection of the models to represent topsoil texture
Results of predictive models (multiple linear regressions and regression trees) for the three soil properties and different input databases are summarized in table 3.
General analysis of the models primarily showed better performance of regression trees than multiple linear regressions for all three properties of mineral soil fraction content, regardless of the database used.Nussbaum et al. (2018) observed that linear regression models are unstable with a large number of covariates.The authors indicate that difficulties in working with datasets with a large number of covariates are chances of over-fitting calibration data, multi-collinearity, and noisy covariates.Unsatisfactory performance from linear regression models in predicting soil particle size, probably due to inter-correlation among covariates, was observed by Chagas et al. (2016).
Some positive aspects of the models were the ability to quickly tune parameters and to yield insight into decision rules and predictors.Vasques et al. (2008) had different results predicting total soil carbon, in which the stepwise multiple linear regressions showed better performance than regression tree models.
Sand content showed better performance in the regression tree models, in which all R 2 values were greater than 0.52 in the RT models, and the highest value was 0.58. ( 1 Variance between coefficients of determination (R 2 ); Adj R 2 = adjusted R 2 ; RT = regression tree; MLR = multiple linear regression.
Rev Bras Cienc Solo 2018;42:e0170167 Meanwhile, multiple linear regressions had R 2 values ranging from 0.34 to 0.39.Among the predicted soil properties, silt content had the best performance, in which all R 2 values were higher than 0.65 in all RT models (Figure 3 and Table 3).Clay content prediction also showed better performance with RT models, reaching an R 2 of 0.52 with the harmonized data in the 0.00-0.05m layer; the lowest R 2 value (0.46) was in the 0.05-0.15m layer.The R² values in this study have higher correlations than the values described by Henderson et al. (2005) using tree-based models for prediction of particle size fractions in Australian topsoil layers (R 2 values reaching 0.44).Similar values for mineral soil prediction were obtained by Sudduth et al. (2010) studying soils in Missouri (USA) where the clay, sand, and silt had R 2 values of 0.56, 0.28, and 0.68, respectively.The values for clay and silt contents are considered relatively good, due to the high variability of these properties in soils.However, for sand content, the value are considered low.Lower values for soil texture prediction (average values of R 2 lower than 0.20) were obtained by Carvalho Junior et al. (2014a) in a hillslope environment in Brazil, using the GlobalSoilMap harmonized depths.
The coefficient of determination (R 2 ) is a well-known index used to evaluate regression models.However, comparison between models with a different number of variables is more appropriate through the adjusted R 2 .This index is also useful in comparing models with distinct input datasets, since the algorithm compensates for different sample sizes (Hair et al., 2009).The adjusted R 2 and the R 2 showed similar patterns of variability, with low values mostly ranging from 0.13 to 0.68 (Table 3).The variability of the predictive models for each soil property was compared through variance in the coefficients of determination, showing small values of variance between MLR and RT models; the greatest variability was related to clay prediction.Carvalho Junior et al. (2004a) observed similar performance between MLR and RT models used to predict soil texture components.
Concerning the number of covariates, the best performance model (R 2 = 0.69) used the lowest number of terrain variables (3), which suggests a strong correlation among those terrain variables (band 4 of Landsat, the clay mineral index, and the protection index) and the silt content.In general, the RT models used 3 to 7 covariates, and an average of 6 covariates.
Tree models produced discreet output values in the terminal nodes (leaves), and for that reason, they were considered a good technique for separating a dataset into homogeneous groups.The range of terminal nodes was from 5 to 8, with an average of 7. Similar results regarding the number of covariates and terminal nodes of regression trees was demonstrated by Vasques et al. (2008) for soil carbon prediction when models used, on average, seven covariates and ten terminal nodes.Figure 4 presents the maps of the soil mineral fractions with the terminal node values related to the area.
In general, the surface horizons of soils in the watershed showed sand contents higher than silt and clay, as observed in the field survey.This was correlated with the soil classes, predominantly Ferrasols and Acrisols, with a clay content increasing along with soil depth.Another reason is the occurrence of surface laminar erosion and landslides, removing finer soil particles, such as silt and clay, due to steep slopes that are common in the watershed, as observed by Pinheiro et al. (2017).This may be observed in figure 4, particularly in the northern portion of the watershed within the mountain range, which showed high values of sand content.These observations were corroborated in the field survey and by interpretation of analytical data.Moriasi et al. (2007) suggested that qualitative analysis (visual comparison) and quantitative statistics should be used in evaluation of model performance, particularly in watershed modeling.Clay content distribution showed that the topsoil layer had lower clay content in the floodplains and greater content near the main river channels, which was influenced by deposition of small particles in suspension in depositional environments of slow moving water.According to the prediction map and histograms  (Figure 4), more than 54 % of the watershed area had clay content in the topsoil layer, ranging from 280 to 300 g kg -1 .In contrast, greater sand contents were related to active floodplains and the sand content was lowest at the mouth of the watershed where clay content was highest due to the depositional environment.In this area, the topsoil has the highest values of total organic carbon, which is also related to the estuarine depositional environment near sea level.
In general, the soils in the watershed had irregular distribution of silt, with a trend of high silt content where Fluvisols predominate.Areas with high silt content in the surface layer were also identified in landscapes with a steep slope in the mountain range, with poorly developed soils (Regosols) and the presence of rock outcrops.Both types of soils show the strong influence of parental material enhancing an incipient pedogenic process and structural development.
The main differences in the final products (attribute maps) are inherent to the models since the RT model produces discreet output values corresponding to each terminal node (Vasques et al., 2008), instead of the wide range of continuous values presented by the MLR models.
The analysis indicated that differences among databases (original and harmonized data) were small, which suggests that they can likewise be used for modeling.Thus, soil scientists are encouraged to harmonize their data, as proposed by the GlobalSoilMap project, and in this way contribute to the global soil database of soil properties.
The products from the digital mapping approach may enhance soil survey reports, providing easier interpretation for soil management and the uncertainties associated with soil property predictions.Additionally, the digital soil map products provide higher resolution property predictions, which can be combined to develop many use-oriented indexes to target particular management issues related to soil-landscape function.All these beneficial outcomes from digital soil mapping can be used to address land use decisions in the Guapi-Macacu watershed, Rio de Janeiro, and other locations where these maps are developed.

CONCLUSIONS
The regression tree models performed better for all the predicted properties and soil depths tested, although multiple linear regression showed similar results.The harmonized dataset at the 0.00-0.05and 0.05-0.15m layers, in general, had better results for clay and silt properties, with values of 0.52 for clay in the 0.00-0.05m layer, and 0.69 for silt in the 0.05-0.15m layer.The prediction of sand content showed better results with the original data depth as input, although all regression tree models for this attribute had R 2 values greater than 0.52, and small variance among them (0.0007).Variance between the coefficients of determination was small; thus, both databases (original and harmonized) may equally be applied to modeling the soil properties in the watershed.
The generalization of soil texture components (sand, clay, and silt) performed by the regression tree methods were consistent with field observations and the watershed landscape characteristics.This evidence supports a relationship between terrain attributes and topsoil properties, which can be determined by field observations and model predictions.
The number of covariates reflected the complexity of the models.The RT models used an average of six covariates (up to seven), whereas the MLR models had an average of sixteen predictors.More research is needed to create additional efficient input variables to help resolve soil variability and improve the accuracy of soil map products.

Figure 1 .
Figure 1.The location of sampling sites in the Guapi-Macacu watershed within Rio de Janeiro -Brazil.

Figure 2 .
Figure 2.An example of clay, sand, and silt content variability with depth within a soil profile, with original data and harmonized data created from a spline function.

Figure 3 .
Figure 3. Plotted results of clay, sand, and silt content predictions for the three input datasets (original data, and harmonized layers of 0.00-0.05and 0.05-0.15m).RT = Regression Tree; MLR = Multiple Linear Regression; RMSE = Root Mean Square Error; Y axis = predicted values; X axis = observed values.

Figure 4 .
Figure 4. Prediction maps for sand (original data), clay (0.00-0.05 m), and silt (0.05-0.15 m) contents in the watershed with the graphs illustrating the area related to the terminal nodes.

Table 1
(Resende et al., 1988)of the geology map, slope, MrRTF, and Euclidean distance from stream networks; these factors were used as input in models to predict sand and clay contents.The relationship of Euclidean distance from hydrography is important particularly for alluvial soils, since they have layers with wide variation regarding thickness and properties(Resende et al., 1988).Clay content had the Euclidean distance from streams

Table 1 .
Terrain attributes in the watershed: description, references, and contribution to predictive models

Table 3 .
Summary of results for all models tested in the watershed according to soil particle size, model type, and depth