Decision trees for digital soil mapping on subtropical basaltic steeplands

When soil surveys are not available for land use planning activities, digital soil mapping techniques can be of assistance. Soil surveyors can process spatial information faster, to assist in the execution of traditional soil survey or predict the occurrence of soil classes across landscapes. Decision tree techniques were evaluated as tools for predicting the ocurrence of soil classes in basaltic steeplands in South Brazil. Several combinations of types of decicion tree algorithms and number of elements on terminal nodes of trees were compared using soil maps with both original and simplified legends. In general, decision tree analysis was useful for predicting occurrence of soil mapping units. Decision trees with fewer elements on terminal nodes yield higher accuracies, and legend simplification (aggregation) reduced the precision of predictions. Algorithm J48 had better performance than BF Tree, RepTree, Random Tree, and Simple Chart.


Introduction
Soil surveys are almost a mandatory source of information for land use planning.However, finer scale soil maps are scarce in developing countries because conventional mapping techniques are time-consuming and costly.Moreover, conventional soil maps do not explicitly document a pedologist's mental model of the soillandscape relationships that guide mapping work and which could support subsequent land use planning process.To overcome this limitation, the association of traditional soil survey techniques and new technologies of digital soil mapping (DSM) may improve the overall process of soil mapping, making the process more quantitative.
Environmental variables that control soil variation and distribution across landscapes can be quantified and mapped.Several approaches have been applied in DSM: logist ic regressions (Figueiredo et al., 2008;Giasson et al., 2006;Giasson et al., 2008), discriminant analysis (Bell et al., 2000;Pavlik and Hole, 1997), fuzzy logic (Zhu, 1997), kriging (Voltz at al., 1997;Knotters at al., 1995), and decision trees (Lagacherie and Holmes, 1997;Bui and Moran, 2001).Classification and regression tree analysis represent a supervised approach to classification.Predictive soils mapping using decision tree analysis, which split up the datasets into blocks by a tree, can increase mapping efficiency and accuracy by extracting relationships between soil types and environmental variables, applying these relationships to predict soil types for unmapped areas and explicitly documenting the process.
For the establishment of relationships between these variables and soil spatial distribution, decision trees may be used for predicting the occurrence of soil map units based on terrain and hydrologic variables.In decision tree analysis, observations enter at the root node and a test is applied to best separate the observations Sci.Agric.(Piracicaba, Braz.), v.68, n.2, p.167-174, March/April 2011 into classes, making groups purer.The observation then passes along to the next node and the process of testing the observations to split them into classes continues until the observation reaches a terminal node.Observations reaching a particular terminal node are classified the same way.Many terminal nodes may make the same classification.Several different paths may be followed for an observation to become part of a particular class.
The objective of this study was to test the use of decision tree algorithms for producing digital soil maps.Using as predictive variables soil distribution information and terrain parameters, several decision tree algorithms were used with variable number of objects in terminal nodes of trees.Predicted soil maps were evaluated compared to conventional soil survey map.

Material and Methods
The study area is located in Vale dos Vinhedos, Bento Gonçalves (Rio Grande do Sul State, Brazil) (Figure 1), and comprises an area of 6,735 km 2 between longitudes 51º34'31.86"W and 51º33'1.86"W and latitudes 29º10'31.78"S and 29º09'1.78"S. It is located in the physiographic region Planalto das Araucárias, a plateau formed by basaltic rocks of the Serra Geral Formation with relief characterized by steep slopes (IBGE, 1986).The regional climate is subtropical with mild summer (Köppen Cfb class), with mean temperatures of the coldest month between -3ºC and 13ºC, mean temperature of the warmest month lower than 22ºC, and precipitation well distributed along the year (Moreno, 1961).A detailed soil survey of the study area at scale 1:10,000 was produced according to conventional soil survey procedures, including extensive field work, airphoto interpretation, and soil taxonomic classification according to SiBCS -Brazilian Soil Classification System (Embrapa, 2006).The final map was produced with a legend comprising six soil map units (Table 1).
For evaluating the use of decision tree for predictive soil mapping, two soil maps were evaluated: i) a soil map with original six soil map units legend, and ii) a soil map with modified legend, grouping soil map units by major soil groups (order) in the Brazilian Soil Classification System (Embrapa, 2006), resulting in a four map units legend, named as follows: MU1 (Ultisols or Argissolos -according to Brazilian Classification), MU2 (Inceptisols/Cambissolos), MU34 (Molisols/ Chernossolos, grouping of MU3 and MU4), and MU56 (Entisols/Neossolos, grouping of MU5 and MU6).
A digital cartographic base was created by aerophotogrametric survey with digital restitution of planimetry and altimetry with 5 m between contour lines.A Triangular Irregular Network (TIN) digital elevation model (DEM) was produced by linear interpolation of contour lines, and a parabolic function was used to adjust the relief representation on valleys and elevations.The TIN was converted to a raster DEM with a spatial resolution of 5 m.The raster DEM was used for calculating nine predictive variables: slope gradient, profile curvature, planar curvature, curvature (combination of planar and profile curvature), flow direction, flow accumulation, flow length, Stream Power Index (SPI), and Topographic Wetness Index (TWI) (Wolock and McCabe, 1995).Each of these hydrologic or landform parameters was selected to be used as predictive variable because they may represent changes on soil-forming factors and, therefore, are believed to be informative on the occurrence of soil map units.
Data sampling for training consisted of 1,333 points (one observation per each 0.005 km 2 ) distributed randomly among soil map units as test points or training  points (67% and 33%, respectively).The option for using random points intended to eliminate subjectivity, to allow simple reproducibility (Hengl and Rossiter, 2003), and to have sample points distributed proportionally to areas occupied by each soil map unit.In each random point all data layers were sampled.The attribute table of the layer of points received fields with values of elevation, DEM derived parameters, and soil map units (using both original and simplified legend).The data sampled in ArcView 3.2 environment (ESRI, 1999) was exported as tables and converted into a comma delimited file (CSV format) used for decision tree analysis, using the software Weka (Witten and Frank, 2005).Decision tree analysis tested combinations of eight minimum number of elements in terminal nodes (150, 125, 100, 75, 50, 25, 10, or 5) and five decision tree algorithms (J48, BF Tree, Rep Tree, Random Tree, and Simple Chart).All of these algorithms are decision tree techniques which use a supervised machine learning approach.J48 is an implementation of a decision tree technique that is based on the C4.5 algorithm which was originally proposed by Quinlan (1986).REPtree method is also based on C4.5 algorithm and can produce classification (discrete outcome) or regression trees (continuous outcome).It sorts numeric attributes only once.SimpleCart method is a decision tree analysis based on Breiman et al. (1984).BFTree is a best-first decision tree learner and it is a learning algorithm for supervised classification learning.Best-first decision trees represent an alternative approach to standard decision tree techniques such as the C4.5 algorithm since they expand nodes in best-first order instead of a fixed depth-first order.
Decision tree structures were generated by partitioning the data recursively into a number of groups, each division being chosen to differentiate the response variable in the resulting nodes.Results were evaluated individually for each resulting tree of a set of 80 generated trees using error matrices.Additionally, accuracy statistics were computed for weighted error matrices, used because it was considered that not all mapping errors are equally serious for soil map users.Therefore, weights were assigned for calculating weighted error matrices using subjective criteria based on the degree of the importance of the maping mistake for land use planning.In the weight matrices, diagonals and the off-diagonals range from 0 to 1, with a value of 0 indicating that the mistake is more serious, and as the values increase towards 1, the mistake is considered decreasingly serious.A value 1 means that two classes are considered identical for accuracy assessment.
Criteria for evaluation of better response and selection of best tree were number of soil map units predicted, number of correctly classified instances, kappa index, mean absolute error, and size of the tree.For each legend scenario (original and grouped legend), the best combination of algorithm and number of objects in terminal nodes were selected.The output was imported in ArcView (ESRI, 1999) by implementing the generated decision rules and creating a map for each soil map unit.These maps were overlayed, resulting in predicted soil maps.
Accuracies of the produced soil maps were determined by error matrices (Congalton, 1991).These are matrices with columns representing reference soil map units (i.e. the original soil survey, with the original or aggregated legends) and rows representing predicted digital soil map units.Each cell in the matrix contains the proportion in the mapped class of its row that was in fact observed in the class of its column.The diagonal represent agreement between the original map and predicted map and off diagonals represent misclassifications.In our study, error matrices comparing all pixels of the whole study area were used.
Four map accuracy matrices were calculated: i) overall accuracy, which is the proportion of correctly-classified pixels compared to total number of pixels; ii) producer accuracy, which is the probability of a pixel in a given unit of soil to be classified correctly; iii) user accuracy, which is the probability of a pixel classified as a given soil mapping unit to be correctly classified; and iv) Kappa index which compares the agreement between original and predicted soil maps against that which might be expected by chance alone (Cohen, 1960).Accuracy statistics were additionally calculated for weighted error matrices, which consider that not all mapping errors are equally serious for soil map users.Weights for calculating weighted error matrices were assigned subjectively, using as criterium the authorsṕ erception of the importance of soil characteristics for changing soil behavior when used for different agricultural land use types.

Results and Discussion
In algorithm runs reproducing the soil map with the original legend, the best performance (overall accuracy = 66.4%,Kappa index = 0.518, mean absolute error = 0.094, 145 terminal nodes) was observed using algorithm J48 with a minimum number of elements at the terminal node of N = 5 (Table 2).Although other combinations produced greater overall accuracy (up to 71.7%) or kappa indices (up to 0.575), they could only estimate five soil classes.Using an aggregated legend (legend of four classes), the best combination was algorithm J48 and minimum number of elements at terminal node of N = 25, which produced an overall accuracy of 73.6%, kappa index of 0.583, mean absolute error of 0.160, with a tree of 29 terminal nodes.Other combinations generated greater overall accuracy (up to 71.7%) or kappa index (up to 0.575), but only three soil map units were predicted (Table 2).
Algorithm J48 performed best, predicting the occurrence of all soil map units with higher overall accuracy and higher kappa indexes (Table 2).Variations in the number of objects in terminal node influenced accuracy, size of trees, and number of predicted classes.Small  number of objects in terminal nodes reduced mean absolute error and increased overall accuracy, kappa index, size of the tree, and number of predicted soil map units.When using same algorithms, overall accuracy was equal or larger producing maps with grouped map legend (four map units legend), although generating maps with higher mean absolute error.When using the grouped legend, more combinations of algorithms and sizes of terminal node were able to estimate the correct number of soil classes (Table 2).
Original soil maps (with original and aggregated legends) and estimated soil maps prepared using these optimal combinations described above are presented in Figure 2. A comprehensive evaluation of the reproducibility of the conventional soil map with digital soil mapping techniques using error matrices, in this case considering the agreement between conventional and estimated soil maps, is presented in Tables 3 and 4. Overall accuracies were 68% and 69%, and Kappa indexes were 0.54 and 0.56, respectively for original and grouped legend maps.Based on the weight matrices (Tables 5 and 6), accuracy statistics were computed for weighted error matrices.Error evaluation using weighted error matrix for the estimated map with original legend (Table 7) presented an overall accuracy of 68% and kappa index of 0.571.For the map estimated with grouped legend (Table 8), overall accuracy was 80% and kappa index was 0.574 Error matrices show user's accuracies from 0.28 to 0.78 for unweighted error matrices (Tables 3 and 4) and from 0.52 to 0.88 for weighted error matrices (Tables 7  and 8), with higher user accuracies for weighted error matrices occurring for MU4 (which represents 27.3% of the area) (Table 7) for the original legend map and MU34 (which represents 34.0% of the area) (Table 8) for the grouped legend map.Producer's per-class reliability was  computed similarly, using the table column totals.Moving down the columns (predicted classes), errors of omission occur when the mapper fails to correctly identify this reference site in its true class, i.e., the mapper failed to map the site in its correct class, leading to a lower producer accuracy.Producer accuracies ranged from 0.23 to 0.87 for unweighted error matrices and from 0.44 to 0.91 for weighted error matrices.For weighted error matrices (Tables 7 and 8), higher producer accuracies were associated to MU2, which covers 41.8% of the study area, both for the original and aggregated legend maps.For weighted error matrices (Tables 7 and 8), lower producer's accuracies were found for MU6 (which represents 5.1% of the area) and MU1 (which represents 15.1% of the area), for the original and aggregated legend maps, respectively.For unweighted error matrices (Tables 3 and 4), lower producer accuracies were found for MU6 which represents 5.1% of the study area for the original legend map (Table 3) and MU56 (which represents 9.2% of the area) for the aggregated legend map (Table 4).When predicting soil classes using the selected combination of algorithms and number of elements on terminal node, both user and mapper accuracies were higher for soil classes that associated with larger mapped areas.These accuracies increased when   using weighted error matrices as more importance is attributed to separation of soil units that are considered more distinct.The use of random sampling generated different number of samples per soil map unit and represents map units that occupy larger areas with more samples.Because of this, lower user's and producer's accuracies were obtained for map units less represented in the random sampling, such as MU3, MU5, and MU6.The use of stratified sampling could reduce errors in the classification of these map units that occupy smaller areas and are underrepresented in the random sampling.

Conclusions
The algorithm J48 performed better than other tested algorithms for decision tree analysis.Soil map legend simplification by class aggregation resulted in only a small increase in both overall accuracy and Kappa index.This map legend simplification reduced precision and is not recommended, as not all soil map units could be discriminated.Application of weighted error matrices for evaluating estimated maps with original legend did not increase the overall accuracy, but increased kappa indices slightly, while the use of weighted error matrices for the evaluation of maps estimated with aggregated legend increased slightly the overall accuracy and kappa index.The use of weighted error matrices may be beneficial in some cases, although assigning weights is a critical step due to their subjective nature.

Figure 1 -
Figure 1 -Location of the study area in the Vale dos Vinhedos region of the Rio Grande do Sul State, Brazil.Map projection UTM Zone 22J, datum SAD69.

Figure 2 -
Figure 2 -Soil maps of the study area: (a) soil map with six class legend, (b) estimated soil map with six class legend, (c) soil map with four class legend, and (d) estimated soil map with four class legend.

Table 1 -
Soil map units of the conventional soil map.

Table 2 -
Results of decision tree analysis in function of variations of algotithms and size of terminal nodes.

Table 3 -
Error matrix for evaluation of maps with original map legend.

Table 5 -
Weights assigned for the weighted error matrix for original legend map evaluation.

Table 4 -
Error matrix for evaluation of maps with grouped map legend.

Table 6 -
Weights assigned for the weighted error matrix for grouped legend map evaluation.

Table 8 -
Weighted error matrix for evaluation of maps with grouped map legend.

Table 7 -
Weighted error matrix for evaluation of maps with original map legend