DIGITAL SOIL MAPPING USING MULTIPLE LOGISTIC REGRESSION ON TERRAIN PARAMETERS IN SOUTHERN BRAZIL

Soil surveys are necessary sources of information for land use planning, but they are not always available. This study proposes the use of multiple logistic regressions on the prediction of occurrence of soil types based on reference areas. From a digitalized soil map and terrain parameters derived from the digital elevation model in ArcView environment, several sets of multiple logistic regressions were defined using statistical software Minitab, establishing relationship between explanatory terrain variables and soil types, using either the original legend or a simplified legend, and using or not stratification of the study area by drainage classes. Terrain parameters, such as elevation, distance to stream, flow accumulation, and topographic wetness index, were the variables that best explained soil distribution. Stratification by drainage classes did not have significant effect. Simplification of the original legend increased the accuracy of the method on predicting soil distribution.


INTRODUCTION
Soil surveys are recognized as important sources of information for land use planning and management.In Brazil, soil surveys for most of the country are available only at small scale (1:750,000), and just a small portion of the Brazilian territory has semidetailed or detailed soil surveys because of funding limitations.
A more cost-effective approach to traditional, large scale soil surveys would be to map soils of representative areas within homogeneous regions and use the soil-landscape relationships to predict soil distribution on non-surveyed areas (Schneider & Klamt, 1996).This approach is similar to the reference area method (Lagacherie et al., 2001), which is based on the hypothesis that it is possible to sample a reference area including most of the soil classes of a region.Based on this area, the prediction of soil distribution on other areas may be facilitated if the landscape is modeled by digital terrain analysis (Hengl & Rossiter, 2003), and if relationships between soils and landscape are modeled.Multiple logistic regression have been used successfully in soil science and related fields, as to predict landslide hazard (Ohlmacher & Davis, 2003) or probability of occurrence of soil drainage classes (Campling et al., 2002;Kravchenko et al., 2002), or to relate the presence of a non-calcareous clay-loam horizon to terrain attributes (King et al., 1999).As soil map units are categorical variables, multiple logistic regression may be suitable for predicting occurrence of soil classes from landscape variables, with the advantage of allow to associate the prediction with probabilities of occurrence of soil mapping units.
Although previous works used logistic regression to estimate the occurrence of specific soil characteristics instead of soil taxonomic classes or mapping units, they suggest that logistic regressions may have potential for producing soil maps from terrain parameters based on relationships between these parameters and soil occurrence.This study evaluates a new method of extrapolation on landscape parameters by testing how well multiple logistic regressions can reproduce a soil map of a reference area.

MATERIAL AND METHODS
The study area was the Sentinela do Sul county, located in southeastern Rio Grande do Sul, Brazil (UTM 22S, Easting from 434,000 m to 453,000 m, and Northing from 6,596,000 m to 6,626,000 m), comprising an area of 253 km 2 with minimum elevation at sea level and maximum elevation of 525 m.The regional climate is subtropical (Köppen classification: Cfa II 1 d), with mean annual temperature 17.6ºC and mean annual precipitation 1600 mm.
The relief is nearly flat in the alluvial plains, gently sloping or strongly sloping in the colluvial deposits, and very steep in the highlands, the landform that dominates the landscape.There are two different types of parent material: a granite-gneiss mixture with associated migmatite, and Cenozoic sediments, including sediments of gravitational and colluvial sources (Silveira, 1984).Recent and old fluvial deposits complete the transition of sedimentation between the complexes of granites and the coastal sedimentation on the eastern side of the county, making the parent material distribution complex.
A detailed reconnaissance soil survey of the county at scale 1:50,000 (Klamt et al., 1996), produced according to usual soil survey procedures that included extensive field work and photointerpretation, established eight mapping units (MU1, MU2, ... and MU8) (Table 1).Most MUs are combinations (undifferentiated group, complex, or association) of two or more of the nine different soil taxonomic units found in the area because of the difficulty in separate individual soil classes as consociations (i.e., mapping units with only one dominant soil class) given the complexity of soillandscape relationships, where each distinguishable landform usually had several classes of soils (Klamt et al., 1996).
The digital elevation model (DEM) used had a resolution of 3 arc sec, corresponding to a pixel size of approximately 92 m, and was obtained from USGS SDTS -SRTM (Shuttle Radar Topography Mission) Table 1 -Soil mapping units in the 1:50,000 detailed reconnaissance soil survey map of Sentinela do Sul, RS, Brazil (Klamt et al., 1996).(Rabus et al., 2003).The DEM was used, directly or as a component, to calculate other nine soil prediction variables for each pixel: slope gradient, profile curvature, planar curvature, curvature (combination of planar and profile curvature), flow direction, flow accumulation, flow length, stream power index (SPI), and topographic wetness index (TWI) (Wolock & McCabe, 1995).Each of these landform parameters was selected to be tested as explanatory variable because they were expected to represent changes on soil-forming factors and, therefore, are believed to be informative on the occurrence of soil mapping units.
For the establishment of relationships between these explanatory variables and soil distribution, logistic regressions were used because they allow to predict probabilities of occurrence of soil mapping units based on terrain variables, an advantage when compared to other prediction techniques.The multiple logistic regression model is a non-linear transformation of the linear regression, which allows to estimate the probability of occurrence of any number of classes of a dependent variable (in this case, soil mapping units) based on explanatory variables (Hosmer & Lemeshow, 1989).
Data sampling for training points consisted of 7,500 random map observations (one observation per each 3.5 ha) consisting of DEM, DEM derived parameters, and soil classes, as classified in the soil survey.The option for not using the entire area as training points intended to allow to test the results of the prediction in a data set different that the training set of points, which in this case was defined as the entire study area.Hengl & Rossiter (2003) used operator-selected representative sampling points, the option for using random points intended to eliminate subjectivity and to allow simple reproducibility.The data sampled in ArcView 3.2 environment (ESRI, 1999) was exported as tables and analyzed statistically using software Minitab version 11 (Minitab Inc., 1996).
Best sets of logistic regressions explaining the soil distribution were selected based upon two prediction methods: 1) stratified prediction, by modeling the entire study area without stratification of areas by drainage classes; and 2) unstratified prediction, by modeling the entire study area by stratification of areas by drainage classes.This separation in drainage classes was based on TWI values and used a threshold value of TWI = 7.7 (well drained < 7.7 < poorly drained), after applying a 3 pixel by 3 pixel mean filter to the computed TWI map.The threshold value was determined by testing filtered TWI classification looking for classes that would better reproduce drainage classes groups as represented on the original soil map.
Parameters for multiple nominal logistic regressions were calculated in Minitab environment us-ing soil mapping unit as response or dependent variable, classified in eighth nominal classes (UM1 to UM8).A step-by-step procedure was used to obtain the best fit set of logistic regressions, starting with a larger number of variables and excluding the variables considered less related to variations on the response variable.For each prediction method, a set of best fit regressions was selected based on criteria as goodness to fit tests (Pearson and deviance), log-likelihood, odds ratio, and Z test (Hosmer & Lemeshow, 1989).A set of logistic regressions was determined for the unstratified prediction method, and two sets of logistic regressions were selected for the stratified prediction method, respectively for well drained areas and for poorly drained areas.These sets of logistic regressions were used to estimate the probabilities of occurrence of each soil mapping unit.These sets of equations were organized as scripts in ArcView 3.2 environment, assigning a probability value to each pixel for the entire study area and creating eight maps, each of them representing for the entire study area the probability of occurrence of each soil mapping unit.Using the function Map Query of ArcView Spatial Analyst Extension on the probability maps, two soil maps were estimated (one for stratified and one for not stratified procedures) by assigning to every single pixel the denomination of the soil mapping unit that had the larger probability of occurrence on that pixel.
The accuracy of the estimated soil maps was determined by using error matrices (Congalton, 1991) comparing all pixels of the estimated maps to the original soil map, by this way using a larger dataset than in the training points dataset, and intending to evaluate what would be the effects of using logistic regressions for extrapolating prediction of soil occurrence to other areas than only the training points.The four map accuracy indicators were used: 1) overall accuracy, calculated by dividing the number of correctly-classified pixels by the total number of pixels; 2) producer accuracy, which measures how well an area has been classified, or what the proportion of each mapping unit was mapped according to the original soil map; 3) user accuracy, which measures the reliability of the map, or how well the estimated map is reproducing the original map in a specific point; and 4) Cohen's Kappa statistic (Cohen, 1960), which corrects for agreements that would happen by chance between the original soil map (with and without legend simplification) and the estimated soil maps.

RESULTS AND DISCUSSION
The unstratified prediction of soil mapping units with multiple logistic regressions selected vari- ables elevation, distance to streams, and TWI to explain the occurrence of soil mapping units; these variables related to water accumulation and water table depth.For the stratified prediction, variables selected to explain the occurrence of soil types in well drained areas were elevation, distance to streams, and slope, showing that soil occurrence in these areas is more related to factors that affect water movement and soil erosion processes.For poorly drained areas the selected variables were elevation, distance to streams, and flow accumulation, indicating that soil distribution in these areas is more related to factors that affect soil drainage and water table depth (Table 2).These sets of best predictors are related to long-term known relationships between soil forming factors, landform, and soil distribution, which associates soil distribution to erosion processes in steep, well drained areas, and with water dynamics, as water table depth (variable elevation) and water movement and accumulation, in poorly drained areas.
Table 2 presents three sets of equations, respectively for unstratified area and for area stratified by drainage classes.Each set of equations allows calculate the probability of occurrence of soil mapping units.For example, from the first column of Table 2 we may extract the following equation: log [p1/p8] = 10.073-(0.9359 * TWI) -(0.030455 * elevation) + (0.0044564 * distance to streams).
For evaluating the reproducibility of the original soil map, the concordance between the original soil map and the two newly generated maps (stratified and unstratified by drainage classes) (Figure 1) was evaluated using error matrices (Table 3).Considering the major classification errors, without stratification MU1, MU3, MU5, and MU6 were underclassified, while MU4 and MU7 were overclassified.When the study area was stratified, MU2, MU4, MU7, MU7, and MU8 were overclassified and MU3 and MU6 were underclassified.
Attempts to classify the entire landscape never achieved better results than a 48% overall accuracy (Kappa = 36%), obtained without drainage class separation, while the map estimated with separation of drainage classes had an overall accuracy of 45% (Kappa = 31%).Overall accuracy and Kappa indices were considered unsatisfactory in both cases, although they are in the same magnitude that values found by Hengl & Rossiter (2003).Best user accuracy was obtained for MU3 (61.5%),MU5 (54.5%), and MU7 (51.0%) when drainage classes were not separated, and MU8 (86.9%),MU4 (85.4%), and MU6 (67.7%), when drainage classes were separated (Table 3).
For most soil mapping units, higher producer and user accuracy was achieved when the study area    1 Log-likelihood = -9007.9;test that all slopes are zero: G = 8661.2,DF = 21, P < 0.001. 2 Log-likelihood = -3534.8;test that all slopes are zero: G = 2135.6,DF = 9, P < 0.001.was not stratified.Although it would be expected higher accuracy when separating areas by drainage classes, this was not observed.Given the characteristics of the SRTM DEM, its precision may not have reproduced small variations of elevation, and its resolution may have not showed the actual terrain variations at short distance.Significant differences with stratification of a study area in hill land and plains were obtained by Hengl & Rossiter (2003), who affirm that although stratification has advantages, "it would be more practical to develop a single data set and predictive map of the entire area at once".
To improve the capacity to reproduce the original soil map, the legend was simplified by grouping similar mapping units based on their higher taxonomic categorical level, aiming to verify if a simplified categorical legend, which would usually correspond to a cartographic scale reduction, could be better predicted.The simplified legend joined mapping units MU2, MU3, and MU4 (reclassified as MU24) and mapping units MU6, MU7, and MU8 (reclassified as MU68).Thus, the simplified legend was formed by mapping units MU1, MU24, MU5, and MU68.Same procedures with and without stratification of the area by drainage classes were used for estimating multiple logistic regressions and evaluating map accuracy.
For the maps estimated using the simplified legend (Figure 2), the procedure without stratifica-UA = user accuracy, PA = producer accuracy, OA = overall accuracy, K = Kappa index tion by drainage classes had an overall accuracy of 71% (Kappa = 54%), which is an increase of 48% in relation to the same procedure without stratification using the original legend.The map estimated with stratification of the area by drainage classes had overall accuracy of 68% (Kappa = 51%), an increase of 51% in comparison to when the original legend was used.Overall percent correct and Kappa were similar for situations with or without stratification of the area by drainage classes.Higher user accuracy was obtained with stratification by drainage classes in most of the mapping units, with values of 83.7% for MU24 and 70.5% for MU68.Higher producer accuracies were obtained for the procedure without stratification by drainage classes for soil mapping units MU24 (86.6%) (Table 4), and major errors were underclassification of MU1 and overclassification of MU68 when not stratifying by drainage classes.When stratification was used, major errors were overestimation of MU1 and underestimation of occurrence of MU68.
Although the use of a simplified legend makes the soil map lose precision (more soil classes included in a map unit), it makes the soil map gain accuracy, i.e., the capacity to reproduce a reference soil map, either using an original field surveyed or a map with simplified legend.

Figure 1 -
Figure 1 -Field surveyed soil map with original legend (a), and maps estimated with similar legend, respectively without stratification by drainage classes (b) and with stratification by drainage classes (c).

Table 2 -
Multiple logistic regressions for estimation of probabilities of occurrence of soil mapping units.

Table 3 -
Error matrices comparing the field surveyed soil map with the original legend with soil maps estimated either with stratification or without stratification by drainage classes.

Table 4 -
Error matrices comparing the field surveyed soil map with simplified legend with soil maps estimated either with stratification or without stratification by drainage classes.