Multinomial Logistic Regression and Random Forest Classifiers in Digital Mapping of Soil Classes in Western Haiti

Digital soil mapping (DSM) has been increasingly used to provide quick and accurate spatial information to support decision-makers in agricultural and environmental planning programs. In this study, we used a DSM approach to map soils in western Haiti and compare the performance of the Multinomial Logistic Regression (MLR) with Random Forest (RF) to classify the soils. The study area of 4,300 km is mostly composed of diverse limestone rocks, alluvial deposits, and, to a lesser extent, basalt. A soil survey was conducted whereby soils were described and classified at 258 sites. Soil samples were collected and subjected to physical and chemical analyses. Recursive Feature Elimination (RFE) was used to select the most important covariates from auxiliary data, such as climate, lithology, and morphometric properties to describe the soil-landscape relationship. Mapping performance was assessed by the Kappa index and overall accuracy derived from a confusion matrix generated using a 5-fold cross validation process. In addition, an external mapping validation was carried out using an independent soil dataset. Accordingly, the soil dataset was split into 80 % and 20 % for training and validation of the models, respectively. No significant statistical difference (Z = 0.56< |1.96|) was found between maps generated with both classifiers (Kappa index 0.45 for MLR and 0.42 for RF). Based on the Kappa values, the classification performance can be characterized as moderate for both algorithms. Surprisingly, the RF classifier outperformed MLR in the validation process (Kappa values of 0.55 and 0.33, respectively). These results suggest a higher generalization ability of RF. However, no significant statistical difference (Z = 1.83< |1.96|) was observed. The soil map derived from RF indicated the occurrence of Leptosols (48.5 %), Gleysols (19.6 %), Chernozems (8 %), and Fluvisols (6.6 %) in most of the study area. The DSM approaches proved suitable for mapping soils in western Haiti and could be used in other parts of the country, thereby closing information gaps with regard to Haitian soils.


INTRODUCTION
In the last decades, soil mapping has become increasingly automated by means of computational tools, owing to the great advances in computational technology, the development of geographic information systems (GIS), and the availability of more extensive geographic data.This development has been in some ways fostered by the current demands for soil data to address environmental issues and climate change (Grunwald, 2010;Minasny and McBratney, 2016;Muñoz-Rojas et al., 2017), and has contributed to the creation of a sub-discipline of soil science known as digital soil mapping -DSM (Minasny and McBratney, 2016).The framework for DSM was conceptualized by McBratney et al. (2003) and is based on the Scorpan model.As a spatial information system, DSM was created from numerical models to produce digital information that explains temporal and spatial variation in soil classes and properties based on correlated environmental variables.Comprehensive reviews of methods and techniques used for DSM were published by McBratney et al. (2003) and Scull et al. (2003).
In comparison to conventional soil mapping, the uncertainty and accuracy associated to the entire process of mapping can be quantified by DSM approaches.They provide fast and increasingly accurate information by using modern GIS techniques, spatial data from digital elevation models, and remote sensing imagery together with auxiliary data (Bacon et al., 2010;Moonjun et al., 2010).
In several countries, the development and use of predictive models using DSM techniques has constantly evolved, producing more precise information (Grunwald et al., 2011;Arrouays et al., 2017).Yet, there is a lack of application of DSM in some regions of the world, such e.g., Haiti, where the only soil map, representing the whole country, was produced at a scale of 1:250,000 (Libohova et al., 2017).Soil resources in Haiti are not well known and the few studies that have been carried out are largely exploratory.Notable examples of Haitian soil studies are the studies of Sweet (1926) in the Artibonite area, Haspil and Butterlin (1955), Colmet-Daage and Lagache (1965), and Guthrie and Shannon (2004) in west and southwest Haiti.The most recent studies are those of Chaves et al. (2010) on the Mapou region (Southeast of the country), of Hylkema (2011) on soil fertility in five pilot watersheds, and of Libohova et al. (2017), who developed a detailed soil map at a 1:24,000 scale for an area of 3,000 ha located in the Cul-de-Sac depression.
When performing soil mapping, the decision on which predictive model to use and how to select the best set of predictive covariates is important.Many algorithms for categorical classification, as of soil classes, are implemented in a variety of available software programs (McBratney et al., 2000).Among the most popular classifiers are Neural Networks, Fuzzy logic, tree-based model (Cart and Random Forest), and Logistic Regression.
The large amount of spatial data available for use as soil predictors has intensified the need to use data-mining technique to establish a more efficient relationship between variables and predictors and to optimize the selection of a promising predictor set (Hastie et al., 2009;Kuhn and Johnson, 2013;Heung et al., 2014).Some studies addressed the comparison of data mining approaches for predicting soil classes by different predictive models.Nevertheless, despite the broad success of DSM techniques using tree-based model and logistic regression, few studies have compared the performance of promising classifiers for soil class mapping, such as logistic regression and Random Forest (Hengl et al., 2007;Collard et al., 2014;Taghizadeh-Mehrjardi et al., 2015;Pahlavan-Rad et al., 2016;Camera et al., 2017;Heung et al., 2017).Under similar conditions of western Haiti, this kind of studies is rarer.
In this context, the main objective of this research was to assess the use of DSM techniques for mapping soils in western Haiti, specifically by assessing the importance of environmental covariates in soil mapping and comparing the performance of Multinomial Logistic Regression (MLR) with Random Forest (RF) classifiers.

Study area
The study area (between 71° 42' 39" W and 73° 4' 22" W longitude and 18° 15' 30" N and 18° 58' 25" N latitude) is located in the western region of Haiti and covers approximately 4,300 km 2 (Figure 1).According to Köppen classification system, the regional climate is tropical (Aw), with low temperature variation in plains and hills during the year; while at higher elevations (>1,500 m) the climate is subtropical highland (Cwb), with dry winters and hot and humid summers.
The original vegetation of the study area consisted of highland forests, pine forests, xerophilous forest/grassland, savanna, and mangrove forest (Robar, 1984).Currently, due to the demographic pressure and deforestation, the natural vegetation has been incredibly reduced, being transformed into small fragments of isolated plants and species (Koohafkan and Lilin, 1989;Churches et al., 2014).
Western Haiti is geologically complex, composed of various sedimentary and igneous rocks folded, faulted, and fractured by tectonic activities in the Caribbean region (Potter et al., 2004).It is crossed by the Cul-de-Sac depression, a tectonic depression where the hypersaline Lake Azueï or Etang Saumâtre lies.Aside from basaltic rock, the lithology is markedly formed by sedimentary formations composed of diverse limestone rocks, in addition to alluvial sediments, dejection cones, and mud banks.Tertiary carbonate formations, such as marine limestone and limestone marlstone, represent a significant percentage of Haitian territory (Woodring et al., 1924).It is worth mentioning that carbonate formations account for more than 65 % of the lithology of the area.
The region is geomorphologically dominated by mountainous landforms and deep valleys, as is representative of the general geomorphological features in other parts of Haiti.The landform system was shaped by the regional endogenic forces and erosional processes.The elevation ranges from sea level to 2,666 m (Morne La Selle), the highest point in the country.The relief is rugged with slopes between 0 and 62 degrees.Agricultural land, forestland (mixed deciduous, coniferous, and bushland), mixed urban or built-up land, and bare exposed rock are the most common categories of land use in the region.This land use pattern has accelerated the erosional processes that are considered a particularly severe problem in Haiti (Bayard et al., 2006).
The area of this study encompasses the Haitian capital, Port-au-Prince, along with its metropolitan region including the municipalities of Gressier, Leogâne, Grand-Goâve, Petit-Goâve, Cabaret, Arcahaie, Thomazeau, Ganthier, Fond-Verette, Cornillon, and Kenscoff.The area has a population of approximately 4 million people, corresponding to approximately 36 % of the country population (IHSI, 2015).

Soil survey and classification
A pedological survey was carried out in two phases: the first based on a conventional survey approach, in which the sampling was undertaken in transects; and in the second phase, a randomly stratified sample was established by selecting the locations for soil sampling based on the sampling density of the first phase.During these surveys, soils at 258 sites were described and sampled, accounting for 140 complete soil profiles and, 118 auger observations and mini pits (Figure 1).The soil samples collected were subjected to physical, chemical, and mineralogical analyses, according to the guidelines of Claessen (1997).
The soils were classified according to the Brazilian Soil Classification System -SiBCS (Santos et al., 2006) at the suborder level (Table 1).Eight soil groups were observed in the study area: Cambisols, Chernozems, Fluvisols, Gleysols, Leptosols, Luvisols, Nitisols, and Vertisols.The corresponding reference soil groups in the World Reference Base (WRB, 2015) were used to delineate the soil map units in the mapping process.
Due to the low level of detail which characterizes the soil survey scale (1:50,000), coupled with the spatial resolution of most covariates (30 m), nine soil map units were delineated, of which four represented a single component (Chernozems, Gleysols, Leptosols, and Fluvisols), for being considered major components of the study area.Likewise, soil associations were formed by two or three soil components (Table 1).The associations define the map unit by grouping soil taxonomic units that are distinctly and geographically associated and can be mapped individually in more detailed scale mapping.Among the map units formed by associations, the extension of Luvisols and Vertisols is minor.Leptosols and rocky outcrops occur as a complex and were therefore mapped within a single map unit. (1)Map units.

Soil covariates
A set of 14 environmental covariates, including lithology, indices based on satellite imagery, climatic maps, and primary and secondary terrain properties derived from a digital elevation model (DEM), was used as predictor variable set in the soil mapping (Table 2).The DEM was obtained from images from a Shuttle Radar Topography Mission (SRTM) with a spatial resolution of 30 m.Before calculating the topographic properties, the DEM was preprocessed by removing the sinks.The following properties were derived from the DEM: elevation above sea level, aspect, slope, topographic wetness index, flow direction, curvature, channel network base level, terrain surface texture, vertical distance to channel network, and relative slope position.In addition to the terrain properties, mean precipitation and lithology were incorporated as variables, representing the climate and parent material soil formation factors, respectively.
The vegetation covariate, a normalized difference vegetation index (NDVI) calculated by equation 1, was derived from Landsat 8 satellite images, sensor OLI and the clay mineral index (CMI), according to Boettinger et al. (2008) (Equation 2).
The maps of covariates were generated using ArcGIS v.10.1,SAGA 2.1.0,and R software.
The references used for generating the covariates and their relationship with soil formation factors are presented in table 2. Since the topographic properties are 1 arc-second (30 m) resolution data, the lithological map was rasterized (1:500,000) at 30 m.Likewise, the low-resolution precipitation map (≈ 30 arc-second or 1 km) was resampled at 30 m (Table 2).

Training and validation datasets
For the classification training and validation, the data were partitioned with "Data Splitting functions" using R and the "caret" package (Kuhn, 2017) at 80 % for model training and 20 % for independent validation.This random stratified splitting tries to preserve the overall distribution of the data within each class.The soil data distribution in classes is shown in figure 2.

Covariate selection
To identify an optimal subset of covariates from the set of all available covariates (Table 2), we used Recursive Feature Elimination (RFE), which is a backward selection algorithm that iteratively eliminates the least promising predictors from the model based on an initial predictor importance measure (Kuhn and Johnson, 2013).It is implemented within the caret package (Kuhn, 2017) and available for model training within a variety of models, such as the support vector machine and linear model.We chose to run RFE with Random Forest, as it can deal with both numeric and categorical variables and capture the non-linear relationship between predictor and response variables as well as the interactions between predictors (Heung et al., 2017).The methodological efficiency of variable selection using RFE for soil class mapping was demonstrated by Brungard et al. (2015) and for soil nutrient prediction by Jeong et al. (2017).

Classifiers for soil mapping
In the process of digital soil mapping, the target variable is represented by the soil classes described in the area and organized in soil map units (Table 1), and the explanatory variables are represented by environmental covariates correlated with the soils, according to the soilenvironment relationship conceptualized by McBratney et al. (2003) in the Scorpan Model.
For classification by Random Forest (RF) and Multinomial Logistic Regression (MLR) classifiers, we used the packages Random Forest (Liaw and Wiener, 2015), nnet (Ripley and Venables, 2016), and caret (Kuhn, 2017) in R software, version 3.3.2(R Development Core Team, 2016).The models were run in parallel using default tuning for selecting the best model, considering the "mtry" for RF and "decay" value for MLR.In the following, we give a brief description of the RF and MLR classifiers.

Random Forest (RF)
Random Forest was developed by Breiman (2001) to perform regression and classification.
The model is based on the construction of a large set of random trees during the model training, leading to a single prediction.Each tree of the forest is constructed based on a bootstrap sample (sample with replacement) of the original training data with each bootstrap set, disregarding about one-third of the observations (Breiman, 2002;Hastie et al., 2009).The best split of each node of the tree is only searched among a Rev Bras Cienc Solo 2018;42:e0170133 randomly selected subset of the total number of predictors and the final prediction in the regression case is the average of the individual tree value, whereas for classification, the correct soil class is determined by a majority vote of the trees (as was the case in this study) (Liaw and Wiener, 2015).
Random Forest uses the out-of-bag (oob) samples, i.e., training observations not included in the bootstrap, to estimate the classification error.The oob samples are also used to construct a different variable importance measure (mean square errors, Gini index) to express the strength of each variable in the prediction (Hastie et al., 2009).Here, for categorical classification, the Gini index determines the importance of each predictive covariate in discriminating the soils: the higher the value of the index for a particular variable, the greater is its contribution to discriminating the class.
The model has two parameters to be tuned "mtry" (number of variables randomly sampled as candidates at each split) and "ntree" (number of trees to grow).To determine the best value of "mtry", Breiman (2002) suggests starting with the default value (square root of the number of predictors for classification) and trying a default value twice as high and half as low to search for the optimal value.The number of "ntree", on the other hand, does not require specific judgment for the setting of the parameter.
As a tree-based model, RF has advantages compared to linear models such as Multinomial Logistic Regression.It is able to model non-linear relationships between predictors and the response variable to handle noise data (observations with missing covariate data) and other situations in which a small dataset is associated with a large number of covariates (Collard et al., 2014).Although RF has shown better performance for soil class mapping when compared to a set of other classifiers (Pahlavan-Rad et al., 2016;Heung et al., 2017), the studies performed by Collard et al. (2014), Taghizadeh-Mehrjardi et al. (2015), and Camera et al. (2017) showed that MLR performed better than RF.

Multinomial Logistic Regression (MLR)
The MLR classifier is part of the family of generalized linear models and is used when the response variable has more than two categories (Hosmer and Lemeshow, 1989).This helps predict the probability of occurrence of each soil class in the landscape studied.
The MLR applies a non-linear log transformation that allows to calculate the probability of occurrence of any number of classes of a dependent variable (in this case, soil class "y") based on explanatory variables.Due to the sigmoidal behavior of its curve, the MLR differs from linear regression models.
The probability of occurrence of y 1 is π 1 and that of y 2 is π 2 = 1 -π 1 .Logistic regression relates probability π 1 to a set of predictors using the logit link function (Equation 3): in which α is a constant, x a vector of predictive variables, and j is the model vector of coefficients.The model is analogous to a logistic regression model, except that the probability distribution of the response variable is multinomial instead of binomial and there are J − 1 equations instead of one, so that: π j (x) = P (Y = j | x), where x explanatory variables, with x = 1, and ∑ j π j (x)= 1.Hence, the counts at J categories of Y can be treated as multinomial with probabilities of {π 1 (x), π 2 (x),….. π j (x)}.
Finally, the equation 4 expresses multinomial logistic models in terms of probabilities of answers (in this case, x soil class):

Eq. 4
Rev Bras Cienc Solo 2018;42:e0170133 Contrary to linear regression models that use the least squares as estimator, MLR coefficients are typically estimated using maximum likelihood.A more extensive description is found in the literature, notably of Hosmer and Lemeshow (1989).
Evaluation of MLR for soil mapping has been reported in a variety of situations regarding soil data acquisition (legacy polygon map, soil pits, and soil data derived from spectral reflectance spectroscopy), landscape, and level of soil classification (i.e.order, group, family and soil map unit component).Due to differences in data handling, the performance of MLR has been compared to a set of other classifiers for soil mapping purposes (Hengl et al., 2007;Collard et al., 2014;Brungard et al., 2015;Pahlavan-Rad et al., 2016;Heung et al., 2017) with some differences in accuracy depending on the sampling scheme, covariate scale and resolution, landscape pattern, and use and level of soil classification.
Soil classes were predicted at the order level by Vasques et al. (2015) using MLR with a coupled model combining field data with visible-near infrared (vis-NIR) spectral data.The use of soil data extracted from a legacy soil polygon map is more common since field surveys and laboratory analysis are the most expensive aspects of soil mapping.This approach has become widespread and was assessed by Giasson et al. (2008), Kempen et al. (2009), and ten Caten et al. ( 2011), who each used a conventional map at a 1:50,000 scale to extract training data for the soil classes.Similarly, data extracted from a 1:80,000 scale soil map were used by Figueiredo et al. (2008) and Collard et al. (2014) used soil data extracted from a polygon map to update a reconnaissance soil map (1:250,000) and compare MLR to other classifiers, observing a satisfactory performance of MLR for most soil classes.
An assessment of MLR, which compared the use of legacy data with ground data from soil pit data was performed by Heung et al. (2017), resulting in a better MLR performance when using a MLR-bagging than a single MLR model.
The accuracy of MLR for soil class mapping proved moderately accurate compared to other classifiers.For mapping soils in the Rio Doce Basin, Brazil, Souza et al. (2014) used MLR and found a kappa value of 0.35.A similar accuracy was found by Hengl et al. (2007), with a kappa of 0.36 for mapping WRB soil groups.Mapping soils in a northern province of Iran for taxonomic levels (great group, subgroup, and series), Pahlavan-Rad et al. ( 2016) reported a difference of around 10 % when comparing MLR to RF model in soil mapping on Northern of Iran.Nevertheless, soil mapping approaches used in those studies were different from the one proposed in this study.

Classifier evaluation
Classification is seen as a statistical and probabilistic process that tries to bring digital mapping as close to reality as possible.Accuracy is usually expressed in terms of indexes computed from a confusion matrix, which establishes the agreement between the classified map and the set of reference ground data (Foody, 2004).The confusion matrix compares, class by class, the relationship between the reference data from the field and the corresponding results obtained by classification (Congalton and Green, 1999).Thus, to assess the model fit, we used a 5-fold cross validation, and compared the Kappa index of the classification with the kappa value of the external validation.This validation was performed by standard procedure, in which the model adjusted with the training dataset was applied to the independent dataset.The agreement between the predicted classes and the observations in the independent dataset was then measured by means of the confusion matrix technique, as described by Congalton and Green (1999).
The accuracy of the maps produced by the RF and MLR classifiers was evaluated by the confusion matrix indexes, expressed by the Kappa index and Overall Accuracy.The quality of the classification associated with the Kappa statistics, as proposed by Landis and Koch (1977), can be classified as: 0.00-0.20 = slight; 0.21-0.40= fair; 0.41-0.60= moderate; 0.61-0.80= substantial; and 0.81-1.00= almost perfect.
Rev Bras Cienc Solo 2018;42:e0170133 The soil maps predicted using RF and MLR classifiers were validated using an independent dataset, and the metrics of the classification (Kappa and variance) were compared by means of the Z test (Congalton and Green, 1999), at significance level of 0.05.The Z test is calculated as presented on equation 5: Eq. 5 where K ^1 and K ^2 stand for the Kappa values of map 1 and map 2, respectively; and "VÂR" stands for the variance of the respective map.

Covariates and selection of optimal dataset
In the exploratory data analysis of the covariates by hierarchical grouping (Figure 3), a high correlation was observed between elevation and channel network (cnw) (0.97).
The NDVI and CMI (0.89) and relative slope position (rsp) and vertical distance to channel network (vdcn) (0.89) were also highly correlated.
To run the data-mining approach using RFE, we defined the minimal predictor number as 5 of (all) 14 covariates.The accuracy and oob error for the groups of covariates are shown in figure 4. The dataset with eight covariates resulted in the highest accuracy (0.53), equal to the accuracy of the datasets with 13 and 14 covariates.Although the kappa value was one unit lower when reducing the number of covariates, the reduction of the covariates in the model is justified by considering parsimony in the model.
The eight covariates included in the most accurate model consisted of: CNW, elevation, lithology, NDVI, precipitation, slope, terrain surface texture, and VDCN.
In soil mapping, we proceeded with these selected covariates.Although CNW presents more than 97 % of correlation with elevation, both covariates were included in the set of best predictors by the data mining approach applied through the RFE algorithm.This selection showed diversity of the sources of selected covariates, including the categorical map (lithology), maps from satellite imagery (NDVI), and topographic maps.

Classification by Random Forest
The RF model was trained using 5-fold cross validation repeated 5 times, using the predictors covariates selected with the RFE.It was verified that the best model parameter was the one with "mtry" = 2, oob (out-of-bag) error of 48 %, overall accuracy of 52 % (Table 3).This oob error criteria is lower than the one reported by Taghizadeh-Mehrjardi et al. (2015) who obtained 78 %.Using the Gini index, irrelevant covariates with weak correlation with the soils were eliminated by a data-mining approach using the RFE algorithm.The variables channel network base level (CNW), elevation, terrain surface texture, precipitation, NDVI, VDCN, slope, lithology, and curvature were the most important predictive covariates (Figure 5).These results appear to be consistent with those reported in the literature, where elevation is regarded as the most common variable used in digital soil mapping (McBratney et al., 2003).On the other hand, lithology and curvature contributed least to improve the model performance, as their relevance in terms of explaining the spatial distribution of the soils was ranked low (Figure 4).With regard to lithology, its poor capability to discriminate the soils in the study area may be related to the low spatial variability, since limestone occurs in more than 65 % of the area.
It is worth mentioning that the single soil components (LP, GL, FL) and soil associations NV and VX showed a fair contribution of the covariates CNW and elevation, indicating   Rev Bras Cienc Solo 2018;42:e0170133 a well-defined relationship of these soils with the landscape features (Figure 6).Slope was inefficient in explaining the spatial distribution of the map unit NV, since this soil association depicts large slope ranges.On the other hand, the soils of this map unit were found to be adequately correlated with precipitation, due to occurring in the most humid part of the study area.
Map unit LP1 was stronger correlated with slope and terrain surface texture, while CM distribution was reasonably explained by slope, precipitation, terrain surface texture, and lithology.Map unit LV was best explained by CNW, precipitation, NDVI, and terrain surface texture, although these correlations were relatively weak.
The soil map produced by the RF classifier is shown in figure 7. Leptosols and Rocky outcrops (LP1) were found mostly in the uplands in the northern part of the study area, corresponding to the steepest and most eroded slopes.Correspondingly, Leptosols (LP) occurred extensively in the study area and were found in both the northern and southern Gleysols (GL) were commonly found in the central part, particularly in concave terrains where drainage is inhibited.Also, they occupied coastal areas near mangroves and other wetlands.
The occurrence of Chernozems (CH) was associated with hills, foothills, and usually calcareous material.These soils develop on gentle slopes and can be found from lower heights to about 1,000 m above the sea level, where rainfall is not abundant (≈ 762-1,194 mm yr -1 ) in the study area.Chernozems have been used for diverse cultivation systems such as common bean-corn rotation, cassava, and sweet potatoes.
Nitosols + Leptosols (NT) are found on the uplands, particularly in the moistest part of the region, at higher elevations (≈ 1,524-2,683 m).This association corresponds to the most weathered soils in the area and are mainly used for vegetable cultivation.Moreover, a typical pine forest (Pinus occidentalis) occurs on those well-drained soils, on hill tops.
Fluvisols (FL) occupy the flatlands on recent alluvial deposits, in the central part of the area.They are poorly drained and frequently flooded.Nevertheless, they are intensively cultivated, as they are seen as very productive lands.
The association Luvisols + Chernozems (LV) are the least representative map unit, which occurs mostly in the southwestern part of the area and is located mainly on mid slopes.Similarly, the association Vertisols + Chernozems (VR) was found to be irrelevant in terms of geographical extent, occurring on lowlands as well as hills, and developed on low slopes.It occupies mostly lowlands of the southwestern part, at elevations between 274 and 396 m.Rev Bras Cienc Solo 2018;42:e0170133

Classification by Multinomial Logistic Regression
Curvature showed no contribution to the soil mapping, while lithology contributed to 3 %.The variable importance of the other covariates (15 to 100 %), indicated that elevation and cnw (channel network base level) were the most important predictive variables.Precipitation presented a contribution of 49 % and some morphometric properties, e.g., terrain surface texture, vdcn (vertical distance to channel network), and slope, reached up to 40 % on the global scale (Figure 8).
The set of covariates showed different rankings of importance, compared to those presented in the RF model, particularly for the terrain surface texture and slope.In MLR, the slope was considered the fourth most important variable, whereas in RF, it was placed in seventh position.2015) showed an about 10 % higher accuracy of MLR than RF, when mapping soils in northern Iran.
For all soil map units in this study, the classification accuracy by MLR was equal to or higher than for those classified by RF.The accuracy by MLR varied from 19 to 79 % among soil classes, whereas for RF, the highest value was 72 % and the lowest 13 %, while one map unit (LV) was not classified, neither by RF nor MLR (Table 3 and 4).
The low representativeness of the map unit LV (Luvisols + Chernozems), with a total of five observations during the training step, was trivial for the learning pattern and therefore misclassified.This fact can be related to the process of random recursive binary partitioning of soil observations to build a forest of trees, in which the number of samples in the training and internal validation does not always include the less representative classes (Strobl et al., 2009).The partitioning algorithm tends to favor the most representative target category, which typically leads to drastic overfitting.
The quality of the classification assessed by external validation, according to Landis and Koch (1977), can be characterized as moderate for RF, and fair for the MLR classifier.The Kappa index of 0.33 for the MLR classification was lower than what was found by Hengl et al. (2007) and by Figueiredo et al. (2008), which verified a Kappa index of 0.36 and 0.39, respectively.The soil unit LP (48.5 %) and LP1 (5.2 %), corresponding, respectively, to Leptosols and Leptosols-Rocky outcrops in RF mapping, account together for 53.7 % of the whole area (Table 7).This large proportion of area covered by these map units confirms once again that the soils of the region are relatively young.Additionally, this pattern reflects the strong influence of morphogenetic processes on pedogenesis.The association of Vertisols-Chernozems is not very expressive, with a total of less than 2 % of the area by both classifiers, MLR and RF.

CONCLUSIONS
Of the covariates selected in this mapping process, channel network base level and elevation had the strongest power in separating the soils and therefore stressed the close relationship between the landscape and spatial distribution of soils in Western Haiti.On the other hand, the low importance of lithology in the mapping process may be due to its low variability, since limestone occurred in more than 65 % of the study area.
The close soil-landscape relationship was highlighted by the high contribution of topographic covariates to satisfactorily explain the soil distribution in the western region of Haiti.
The two evaluated classifiers proved capable of satisfactorily mapping the soils of the region, although RF outperformed MLR, reflecting its ability for learning and generalizing soil data of larger geographic areas.
Digital sol mapping techniques using MLR and RF can be used to produce soil maps for other parts of Haiti, as they were efficient in mapping the soils of the study area (overall accuracy of 0.47 and 0.64, respectively).Moreover, they can provide information for environmental planning and response programs of the country in which soil degradation has reached an advanced degree.

Figure 1 .
Figure 1.Spatial distribution of the soil pits described and sampled in western Haiti, Caribbean.

Figure 2 .
Figure 2. Distribution of soil classes in the training and validation dataset (n = 258).

Figure 4 .
Figure 4. Overall accuracy and oob error per group of predictor covariates selected using RFE.

Figure 7 .
Figure 7. Soil map generated with classifier Random Forest -western Haiti, Caribbean.

Table 3 .
Confusion matrix of soil classification using Random Forest

Table 4 .
Camera et al. (2017)ure which ranked third in RF, ranked sixth in MRL.Also, a difference was observed for the variable curvature: it contributed to about a 10 % decrease in the MSE in RF while in MLR, it has no contribution.The confusion matrix for the classification using MLR is displayed in table4.It shows a Kappa of 0.45 and overall accuracy of 0.55.Out of nine soil classes, MRL failed to classify one (LV) and it classified CM with an accuracy of 19 % while the other soil classes were classified with an accuracy level, ranging from 39 to 79 %.Importance of covariates in soil mapping by Multinomial Logistic Regression (MLR) classifier.Confusion matrix of soil classification using Multinomial Logistic Regression RF (0.78).Consequently, for lower levels of soil classification (great group and family), the authors reported that RF outperformed Logistic Regression.On the other hand, contrary to what was found in this study,Camera et al. (2017)observed a better performance of RF than MLR for predicting soil groups inCyprus, and Taghizadeh-Mehrjardi et al. ( LP = Leptosols; LP1 = Leptosols + rocky outcrops; CM = Cambisols + Fluvisols; GL = Gleysols; CH = Chernozems; NT = Nitisols + Leptosols; FL = Fluvisols; LV = Luvisols + Chernozems; VR = Vertisols + Chernozems; User Ac. = user's accuracy; Prod.Ac. = producer's accuracy.Rev Bras Cienc Solo 2018;42:e0170133

Table 6 .
Conditional Kappa of soil mapping by Random Forest (RF) and Multinomial Logistic Regression (MLR) for training and validation

Table 7 .
Area of soil classes mapped by Random Forest and Multinomial Logistic Regression (MLR)