Acessibilidade / Reportar erro

Preprocessing procedures and supervised classification applied to a database of systematic soil survey

ABSTRACT:

Data Mining techniques play an important role in the prediction of soil spatial distribution in systematic soil surveying, though existing methodologies still lack standardization and a full understanding of their capabilities. The aim of this work was to evaluate the performance of preprocessing procedures and supervised classification approaches for predicting map units from 1:100,000-scale conventional semi-detailed soil surveys. Sheets of the Brazilian National Cartographic System on the 1:50,000 scale, “Dois Córregos” (“Brotas” 1:100,000-scale sheet), “São Pedro” and “Laras” (“Piracicaba” 1:100,000-scale sheet) were used for developing models. Soil map information and predictive environmental covariates for the dataset were obtained from the semi-detailed soil survey of the state of São Paulo, from the Brazilian Institute of Geography and Statistics (IBGE) 1:50,000-scale topographic sheets and from the 1:750,000-scale geological map of the state of São Paulo. The target variable was a soil map unit of four types: local “soil unit” name and soil class at three hierarchical levels of the Brazilian System of Soil Classification (SiBCS). Different data preprocessing treatments and four algorithms all having different approaches were also tested. Results showed that composite soil map units were not adequate for the machine learning process. Class balance did not contribute to improving the performance of classifiers. Accuracy values of 78 % and a Kappa index of 0.67 were obtained after preprocessing procedures with Random Forest, the algorithm that performed best. Information from conventional map units of semi-detailed (4th order) 1:100,000 soil survey generated models with values for accuracy, precision, sensitivity, specificity and Kappa indexes that support their use in programs for systematic soil surveying.

Keywords:
machine learning algorithms; random forest; tacit soil-landscape relationships; digital soil mapping

Introduction

One of the challenges of modeling soil classes for digital soil mapping has been to reproduce soil-landscape relationships through tacit information on conventional soil maps (Hudson, 1992Hudson, B.D. 1992. The soil survey as paradigm-based science. Soil Science Society of America Journal 56: 836-841.). A possible strategy for overcoming this is to make the assumption that conventional soil survey databases implicitly carry information on soil-landscape relationships. Databases derived from soil surveys and those from soil predictive covariates, such as relief and parent material covariates (McBratney et al., 2003McBratney, A.B.; Santos, M.L.M.; Minasny, B. 2003. On digital soil mapping. Geoderma 117: 3-52.), can then be analyzed to produce patterns of soil spatial variation with techniques that belong to the field conceived as Knowledge Discovery in Databases or KDD, of which data preprocessing and data mining are essential steps in the entire process (Fayyad et al., 1996Fayyad, U.M.; Shapiro, G.P.; Smyth, P. 1996. From data mining to knowledge discovery: an overview. p. 1-34. In: Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., eds. Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, Menlo Park, CA, USA.).

Optimal routines for the application of data mining techniques are far from reaching a consensus on digital soil surveys (Bagatini et al., 2016Bagatini, T.; Giasson, E.; Teske, R. 2016. Expansion of pedological maps for physiographically similar areas by digital soil mapping. Pesquisa Agropecuária Brasileira 51: 1317-1325 (in Portuguese, with abstract in English).; Behrens and Scholten, 2007Behrens, T.; Scholten, T. 2007. A comparison of data mining techniques in predictive soil mapping. p. 353-365. In: Lagacherie, P.; McBratney, A.; Voltz, M., eds. Digital soil mapping: an introductory perspective. Elsevier, Amsterdam, The Netherlands. (Developments in Soil Science, 31).), but they can accelerate the generation of information on spatial distribution of soil classes. Classification algorithms such as Artificial Neural Networks (ANN) and Decision Trees (DT) have been widely used for soil survey modeling (Behrens et al., 2005Behrens, T.; Förster, H.; Scholten, T.; Steinrücken, U.; Spies, E.D.; Goldschmitt, M. 2005. Digital soil mapping using artificial neural networks. Journal of Plant Nutrition and Soil Science 168: 21-33.; Silva et al., 2013Silva, C.C.; Coelho, R.M.; Oliveira, S.R.M.; Adami, S.F. 2013. Digital soil mapping of the Botucatu sheet (SF-22-Z-B-VI-3): data training on conventional maps and field validation. Brazilian Journal of Soil Science 37: 846-857 (in Portuguese, with abstract in English).). ANN simulates biological neural networks; its basic component, a neuron, receives input signals that are aggregated and compared to a threshold or bias of the neuron. If the aggregated signal is greater than the bias, the neuron will be activated and the output signal generated by an activation function (Zhou, 2012Zhou, Z.H. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, Cambridge, UK.). Neurons are linked by weighted connections to form a network. DT uses the divide and conquer process, based on the values of information gained, to create classification rules visually similar to trees (Witten et al., 2016Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, USA.). Algorithms with integrated approaches are also being tested for pedological modeling. Bayesian Neural Networks, which integrate the maximization of probability estimation by Bayes’ theorem with ANN (Zhou, 2012Zhou, Z.H. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, Cambridge, UK.) and Random Forest, an Ensemble Method that uses the strategy of Bootstrap Aggregating to create a stronger classifier, based on random DT (Zhou, 2012Zhou, Z.H. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, Cambridge, UK.; Breiman, 2001Breiman, L. 2001. Random forests. Journal of Machine Learning Research 45: 5-32.), are expected to produce very robust models (Hastie et al., 2009Hastie, T.; Tibshirani, R.; Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Stanford, CA, USA. (Springer Series in Statistics).; Chagas et al., 2017Chagas, C.S.; Pinheiro, H.S.K.; Junior, W.C.; Anjos, L.H.C.; Pereira, N.R.; Bhering, S.B. 2017. Data mining methods applied to map soil units on tropical hillslopes in Rio de Janeiro, Brazil. Geoderma Regional 9: 47-55.).

The aim of this research was to evaluate the performance of data preprocessing procedures and supervised classification approaches applied to conventional map units and environmental covariates as reference data sources for predicting soil map units.

Materials and Methods

Studied settings

The research was carried out in the Geographic Information System (GIS) environment with map polygons and legend from three 1:50,000-scale sheets of the 1:100,000-scale soil survey maps of the Brotas and Piracicaba quadrangles, in the state of São Paulo, Brazil (Figure 1).

Figure 1
Sheets on 1:50,000 scale from study area (São Paulo, Brazil).

The studied region has its largest extension located on the Peripheral Depression, but also has part of it on the Basaltic Cuestas, both being geomorphological provinces of the state of São Paulo, elevation ranging from 453 to 1069 m. On these landscapes, relief classes range from nearly level to very steep and lithology is mostly of sedimentary rocks but also, in the province of Cuestas, of basaltic rocks. Köppen's climate are Cwa and Aw (Alvares et al., 2013Alvares, C.A.; Stape, J.L.; Sentelhas, P.C.; Gonçalves, J.L.M.; Sparovek, G. 2013. Koppen's climate classification map for Brazil. Meteorologische Zeitschrift 22: 711-728. DOI 10.1127/0941-2948/2013/0507.
https://doi.org/10.1127/0941-2948/2013/0...
).

Databases

A 30-m resolution digital elevation model generated from the Brazilian Institute of Geography and Statistics (IBGE) 1:50,000-scale toposheets provided seven predictive relief attributes: Elevation, Slope Gradient, Relief Class, Profile Curvature, Plane Curvature, Distance to Drainage, and Topographic Wetness Index (TWI). Geological Formation or Lithology, as on the geological map of the state of São Paulo (1:750,000-scale) (Perrota et al., 2005Perrota, M.M.; Salvador, E.D.; Sachs, L.L.B. 2005. Geological Map of the State of São Paulo. 1:750,000-scale. = Mapa Geológico do Estado de São Paulo. CPRM-Serviço Geológico do Brasil, Brasília, DF, Brazil (in Portuguese).), was the 8th predictive variable.

The target variable was either locally named Soil Units or soil classes in the 2nd (suborder), 3rd (great group) or 4th (subgroup) level of the Brazilian System the Brazilian System of Soil Classification (SiBCS) (Santos et al., 2013Santos, H.G.; Jacomine, P.K.T.; Anjos, L.H.C.; Oliveira, V.A.; Lumbreras, J.F.; Coelho, M.R.; Almeida, J.A.; Cunha, T.J.F.; Oliveira, J.B. 2013. Brazilian System of Soil Classification. = Sistema Brasileiro de Classificação de Solos. 3ed. Embrapa, Brasília, DF, Brazil (in Portuguese).) extracted from the Brotas (Almeida et al., 1981Almeida, C.L.F.; Oliveira, J.B.; Prado, H. 1981. Semi detailed Soil Survey of the State of São Paulo: Brotas Sheet (SF-22-Z-B-III) Map, 1:100,000 scale. = Levantamento Pedológico Semidetalhado do Estado de São Paulo: Quadrícula de Brotas (SF-22-Z-B-III). Instituto Agronômico, Campinas, SP, Brazil (in Portuguese).) and Piracicaba (Oliveira and Prado, 1989Oliveira, J.B.; Prado, H. 1989. Semi detailed Soil Survey of the State of São Paulo: Piracicaba sheet (SF-23-Y-A-IV) Map, 1:100,000-scale. = Carta Pedológica Semidetalhada do Estado de São Paulo: Piracicaba (SF-23-Y-A-IV). Instituto Agronômico, Campinas, SP, Brazil (in Portuguese).) 1:100,000-sheet soil surveys. Equivalence among map Soil Units, SiBCS subgroups, and U.S. Soil Taxonomy subgroups (Soil Survey Staff, 2014) is shown on Table 1. Mapping concepts used for Soil Units were those from the original 1:100.000-sheet soil surveys (e.g. Oliveira, 1999Oliveira, J.B. 1999. Soils of the Piracicaba Sheet = Solos da Folha de Piracicaba. Instituto Agronômico, Campinas, SP, Brazil (Boletim Científico, 48) (in Portuguese).). Nomenclature of soil classes in the three categorical levels of SiBCS plus Soil Unit names applied to a database of soil map units enabled the structuring of four matrices of predictive attributes plus soil classes.

Table 1
Equivalence of mapped Soil Units to soil classes of the Brazilian System of Soil Classification (SiBCS) and U.S. Soil Taxonomy.

In order to favor machine learning of soil classes with a low number of instances, minority classes were merged, producing a larger number of instances per class. This was the case for the classes of Hydromorphic soils (Glei), Orthents (Litólicos), certain Alfisols with abrupt textural changes (Diamante), and Spodosols (Podzóis).

Data mining

Preprocessing

Efficiency of the procedures for variable discretization, data selection, under- and oversampling, class balancing, and variable selection (Table 2) was evaluated using the Weka software, version 3.8.0 by “Hold-Out” (Supplied Test Set - 2/3 for training, 1/3 for test).

Table 2
Summary of preprocessing procedures.

The predictive variable “relief class” was obtained by discretization of the slope gradient into the following classes: 0-4, 4-8, 8-20, 20-45, 45-75, > 75 %. These variables (slope gradient and relief class) were used simultaneously in both the continuous and discrete forms in data matrices. Discretization of continuous predictive variables in arbitrary classes was also carried out for profile curvature and plane curvature. The discrete classes described above can better represent the hydrodynamic behavior of landscapes than continuous variables. Discretization was tested for the variables TWI and distance to drainage using intervals of equal ranges and equal frequencies. The adopted interval for defining criteria was the one that best improved the performance of the classifier, as evaluated by the “Hold-Out” method.

Resampling procedures were evaluated in order to improve the predictive power of the models in the less representative soil classes in the reference area. Undersampling, oversampling, and class balancing procedures were applied to 2/3 of the map unit database. Resampling was applied at three levels, zero (0.0), representing the original distribution of the data, one (1.0), the balanced distribution of soil classes, and 0.5, a distribution involving undersampling of majority classes and oversampling of minority classes. The test database for these procedures was 1/3 of the instances using the “Hold-Out” method (Supplied Test Set).

Ranking predictive variables by importance was carried out by chi-square (χ2) and information gain, two commonly used feature-selection methods.

Soil class prediction

Preprocessing procedures were evaluated using four algorithms, Random Forest, J48, MLP and Bayes Net to explore their capabilities to predict soil map units (Table 3). Performance of classifiers for each prediction class was evaluated by accuracy, true and false positive rates (TPR and FPR), and by the area under the curve (Bradley, 1997). To evaluate the global performance of classifiers, we used global accuracy (Weiss and Zhang, 2003Weiss, S.M.; Zhang, T. 2003. Performance analysis and evaluation. p. 425-440. In: Nong, Y., ed. The handbook of data mining. Lawrence Erlbaum, Mahwah, NJ, USA.), weighted mean precision, weighted average of the true positive (weighted mean sensitivity) and the false positive (one minus weighted mean specificity), and the Kappa index (Cohen, 1960Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46.).

Table 3
Algorithms used for supervised classification.

Results

Preprocessing procedures

Discretization

Profile and plane curvature performed better as discrete variables, whereas elevation, distance to drainage and topographic wetness index (TWI) had better performance as continuous variables. Slope had better performance used in conjunction with its discrete form (relief class). Small differences in performance were considered for pre-selecting the type of predictive attribute. The best adjustment results in these cases are shown in Table 4.

Table 4
Best algorithm performances after discretization procedures. Classification in soil unit. Accuracy = Global Accuracy; TPR = weighted average of true positive rates; FPR = weighted average of false positive rates.

Accuracy values were around 50 % and Kappa indices between 0.35 and 0.45, meaning fair and good agreement (Table 4). The weighted averages of true positive rates (average sensitivity) for the best models were between 42 % and 50 % (Table 4). The specificity of the rules generated by the models was high, indicated by the low mean of false positive rates, with values from 5 % to 9 % (Table 4).

Data selection

Data selection was fundamental to the acceptance of models generated by all evaluated algorithms. Accuracy ranged from 65 % to 78 %, and the Kappa index from 0.50 to 0.67, representing good and very good agreement (Table 5). Composite map units (soil associations and soil complexes) from conventional soil maps showed inconsistency (reduction of predictive performance) and were therefore removed.

Table 5
Performance of the algorithms after discretization and data selection. Classification in soil units. Accuracy = Global Accuracy; Precision = weighted average precision; TPR = weighted average of true positive rates; FPR = weighted average of false positive rates.

Weighted average of true positive rates (TPR) (mean sensitivity) for models generated by conventional map units were between 65 % and 68 % (Table 5). Weighted average of false positive rates (FPR) ranged from 6 to 16 %.

Resampling procedures (subsampling, class balancing and oversampling) did not favor model performance. There was a considerable reduction in global accuracy and class precision as soil unit distribution approached the fully balanced distribution (1.0) (Table 6). Databases always produced models with better performance when they were not submitted to resampling procedures.

Table 6
Performance of the algorithms for holdout accuracy (2/3 training and 1/3 test) after subsampling (0.5), class balancing (1.0) and resampling keeping the original data distribution (0.0).

Results for variable selection showed that all the predictive variables were important for the generated models, with classifier performance reduction as any predictive variable was removed from databases. Evaluation of variables by chi-square (χ2) and information gain methods showed attributes in the following descending order of predictive power: elevation, geology, distance to drainage, slope gradient, relief class, topographic wetness index (TWI), profile curvature and plane curvature (Table 7).

Table 7
Assessment of covariates by their prediction power based on information gain and chi-square (χ2).

Algorithms

Global model evaluation used the following metrics: accuracy (overall accuracy), precision, weighted mean of true positive rates, weighted mean of false positive rates and Kappa index (Table 5). The algorithm with the best overall performance was Random Forest, with an Ensemble Method for Prediction approach that generated 20 decision trees for the creation of models. Algorithm J48 (Decision Tree) presented results close to those of Random Forest, but always inferior, whereas the algorithms MLP (Artificial Neural Networks) and Bayes Net (Bayesian Neural Networks) showed worse performance, despite Kappa indexes of 0.57 and 0.50, respectively (Table 5).

Precision per class was evaluated in order to differentiate the performance of the models of each algorithm per predicted class. Results showed that precision per class followed the results for global model evaluation (Table 8). In general, models with better overall performance showed better accuracy performance per prediction class. Exceptions occurred for Campestre and Baguari soil units, where the accuracy of MLP algorithm was greater than that obtained by J48. For Sete Lagoas, Engenho, Alva and Diamante soil units, the model generated by the J48 algorithm showed better performance than that developed by Random Forest. Precision for Itaguaçu, Santana and Alva units was greater in the model generated by Bayes Net algorithm than by MLP, in which they obtained zero precision. The São Lucas unit had better precision in the model generated by MLP algorithm (Table 8).

Table 8
Precision of algorithms evaluated by Hold-Out (2/3 training and 1/3 test) in each soil map unit and classification in Soil Units.

Assessment of categorical levels of SiBCS

As for the evaluated categorical levels of SiBCS (Suborder, Great Group and Subgroup), there was little difference in accuracy and Kappa between the categories. A small decrease in performance was associated with an increase in the detail of the categorical level, the algorithm MLP (Artificial Neural Networks) was an exception, and had better performance with classification at the 3rd categorical level of SiBCS (Great Group) (Table 9). Predictions of conventional map units classified by the Brazilian System of Soil Classification (SiBCS) (Table 9) were very similar to classification by soil units (Table 5), even though slightly better.

Table 9
Best performance of the algorithms evaluated by Hold-Out (2/3 training and 1/3 test), with classification in the 2nd, 3rd and 4th categorical levels of SiBCS.

Results for the weighted average of true positive rates in each class by the algorithm of best performance with classification at the 4th level (Subgroup) of SiBCS was 78 %, indicating good average sensitivity (average chance of the classifier to hit a particular class) (Table 10). The weighted mean false positive rates were 13 %, indicating that the classification rules also showed good average specificity (average chance of the classifier failing in a given class) due to the low number of false positive occurrences (Table 10).

Table 10
Performance per class of the Random Forest algorithm. Classification in the 4th level (Subgroup) of SiBCS. TP = true positive rate; FP = false positive rate; AUC = area under th e curve.

The lowest sensitivity values were for “Planossolos” (Alfisols with abrupt textural changes), “Espodossolos” (Spodosols), “Chernossolos” (Mollisols), “Latossolos Amarelos and Vermelho-Amarelos psamíticos” (coarse- and fine-loamy, Xanthic and Typic Oxisols), and “Latossolos Vermelhos Eutroférricos” (Rhodic Eutrudox). Rules created for the remaining classes showed good sensitivity, the best results being obtained for “Argissolos Vermelho-Amarelos” (Arenic and Grossarenic Paleudults and Paleudalfs) and “Latossolos Amarelos Distróficos úmbricos, textura media” (coarse- and fine-loamy Typic Hapludox), with sensitivity values close to 0.9 (Table 10).

The low false positive rates indicate that the rules created by the Random Forest algorithm (committee of 20 decision trees) showed good specificity with the conventional map units and classification at the 4th level (Subgroup) of the SiBCS (Table 10). Values obtained for the area under the curve were quite satisfactory, ranging from 0.7 to 1.0 (Table 10).

Discussion

Preprocessing procedures were extremely important to improving the performance of the models generated by the algorithms. When evaluating data selection, certain information showed inconsistent for machine learning (Han et al., 2011Han, J.; Kamber, M.; Pei, J. 2011. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, USA.), as they drastically reduced the performance of the evaluated models in supervised classification. This was observed for map units of soil associations and soil complexes. Composite map units (soil associations or complexes) are supposed to carry greater complexity of soil forming factors than those present in soil consociations. Thus, as soil forming factors relief and parent material were used for deriving covariates for soil prediction, this greater complexity could affect the results. Exclusion of composite map units from the training set does not preclude to map areas with features associated with these map units since soil complex and soil associations, composed of two or more soil consociations, are represented by the single unit of the main consociation in the training set. Reducing complexity of predictive covariates has been a successful strategy for improving the prediction of soil map units (Ten Caten et al., 2012Ten Caten, A.; Dalmolin, R.S.D.; Ruiz, L.F.C. 2012. Digital soil mapping: strategy for data pre-processing. Revista Brasileira de Ciência do Solo 36: 1083-1091.).

To deal with the substantial amount of information extracted from the training areas (1,013,329 instances after preprocessing) we used the Hold-Out method, 2/3 for training and 1/3 for model testing, increasing the amount of information for training and testing the models, optimizing the analysis procedure in relation to computational capacity or processing time.

Results from class balancing followed the pattern found by Crivelenti et al. (2009)Crivelenti, R.C.; Coelho, R.M.; Adami, S.F.; Oliveira, S.R.M. 2009. Data mining for soil-landscape relationships inference in digital soil mapping. Pesquisa Agropecuária Brasileira 44: 1707-1715 (in Portuguese, with Abstract in English)., with a decrease in performance of classifiers after class balancing (training in equal frequency classes). This indicates that, in this case, undersampling majority classes was detrimental to the machine learning process, probably due to the failure of classifiers to learn important relationships. Therefore, even though an improvement in the prediction of minority classes after balancing was expected, class balancing was detrimental to the overall performance of the models.

Random Forest, a supervised classification method with ensemble approach, produced models with the best performances, similar to the findings of Dias et al. (2016)Dias, L.M.S.; Coelho, R.M.; Valladares, G.S.; Assis, A.C.C.; Ferreira, E.P.; Silva, R.C. 2016. Soil class prediction by data mining in an area of the São Francisco sedimentary basin. Pesquisa Agropecuária Brasileira 51: 1396-1404 (in Portuguese, with Abstract in English)., and Chagas et al. (2017)Chagas, C.S.; Pinheiro, H.S.K.; Junior, W.C.; Anjos, L.H.C.; Pereira, N.R.; Bhering, S.B. 2017. Data mining methods applied to map soil units on tropical hillslopes in Rio de Janeiro, Brazil. Geoderma Regional 9: 47-55., surpassing J.48, a decision tree algorithm. This opposes the findings of Ten Caten et al. (2013)Ten Caten, A.; Dalmolin, R.S.D.; Pedron, F.A.; Ruiz, L.F.C.; Silva, C.A. 2013. An appropriate data set size for digital soil mapping in Erechim, Rio Grande do Sul, Brazil. Revista Brasileira de Ciência do Solo 37: 359-366., in terms of accuracy and kappa indexes above 70 % when using decision trees in smaller datasets than those studied here. The use of Bootstrap Aggregating (Bagging) in the Random Forest algorithm shows advantages due to the combination of classifiers (Zhou, 2012Zhou, Z.H. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, Cambridge, UK.).

The high performance of the model generated by the Random Forest algorithm at the 4th level (Subgroup) of SiBCS (accuracy above 78 % and kappa index above 67 %) indicates that the approach has great potential for producing digital pedological maps compatible to medium and high intensity reconnaissance (4th order) soil surveys.

Some of the minority classes were better predicted by models with lower global performance. However, in all cases the information gain with these individual classes was not significant enough to improve the overall performance of the models. This fact may be due, in the main, to the prevalence of certain classes in the training area, which resulted in assigning great weight to small decreases in the performance of these majority classes. The high accuracy level in most of the predicted classes is indicative of the high predictive power of the models tested.

Conclusions

Composite soil map units (soil complex and soil associations) proved to be inadequate for the machine learning process, since their exclusion from the training dataset improved overall prediction.

When modeling soil map units for pedological mapping, training on unbalanced databases outperformed training on balanced databases, showing no need for class balancing for machine learning on the studied dataset.

The Random Forest algorithm had good performance for soil class prediction and, though requiring preprocessing procedures, outscored algorithms of different approaches, such as single Decision Trees, Artificial Neural Networks and Bayesian Classification.

The applied predictive variables (terrain attributes and geology) trained on 1:100,000 soil survey maps showed excellent performance for predicting soil map unit distribution, and can be used to create digital pedological maps consistent with high intensity reconnaissance soil survey (4th order) maps.

References

  • Almeida, C.L.F.; Oliveira, J.B.; Prado, H. 1981. Semi detailed Soil Survey of the State of São Paulo: Brotas Sheet (SF-22-Z-B-III) Map, 1:100,000 scale. = Levantamento Pedológico Semidetalhado do Estado de São Paulo: Quadrícula de Brotas (SF-22-Z-B-III). Instituto Agronômico, Campinas, SP, Brazil (in Portuguese).
  • Alvares, C.A.; Stape, J.L.; Sentelhas, P.C.; Gonçalves, J.L.M.; Sparovek, G. 2013. Koppen's climate classification map for Brazil. Meteorologische Zeitschrift 22: 711-728. DOI 10.1127/0941-2948/2013/0507.
    » https://doi.org/10.1127/0941-2948/2013/0507
  • Bagatini, T.; Giasson, E.; Teske, R. 2016. Expansion of pedological maps for physiographically similar areas by digital soil mapping. Pesquisa Agropecuária Brasileira 51: 1317-1325 (in Portuguese, with abstract in English).
  • Behrens, T.; Förster, H.; Scholten, T.; Steinrücken, U.; Spies, E.D.; Goldschmitt, M. 2005. Digital soil mapping using artificial neural networks. Journal of Plant Nutrition and Soil Science 168: 21-33.
  • Behrens, T.; Scholten, T. 2007. A comparison of data mining techniques in predictive soil mapping. p. 353-365. In: Lagacherie, P.; McBratney, A.; Voltz, M., eds. Digital soil mapping: an introductory perspective. Elsevier, Amsterdam, The Netherlands. (Developments in Soil Science, 31).
  • Breiman, L. 2001. Random forests. Journal of Machine Learning Research 45: 5-32.
  • Chagas, C.S.; Pinheiro, H.S.K.; Junior, W.C.; Anjos, L.H.C.; Pereira, N.R.; Bhering, S.B. 2017. Data mining methods applied to map soil units on tropical hillslopes in Rio de Janeiro, Brazil. Geoderma Regional 9: 47-55.
  • Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46.
  • Crivelenti, R.C.; Coelho, R.M.; Adami, S.F.; Oliveira, S.R.M. 2009. Data mining for soil-landscape relationships inference in digital soil mapping. Pesquisa Agropecuária Brasileira 44: 1707-1715 (in Portuguese, with Abstract in English).
  • Dias, L.M.S.; Coelho, R.M.; Valladares, G.S.; Assis, A.C.C.; Ferreira, E.P.; Silva, R.C. 2016. Soil class prediction by data mining in an area of the São Francisco sedimentary basin. Pesquisa Agropecuária Brasileira 51: 1396-1404 (in Portuguese, with Abstract in English).
  • Fayyad, U.M.; Shapiro, G.P.; Smyth, P. 1996. From data mining to knowledge discovery: an overview. p. 1-34. In: Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., eds. Advances in knowledge discovery and data mining. American Association for Artificial Intelligence, Menlo Park, CA, USA.
  • Han, J.; Kamber, M.; Pei, J. 2011. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, USA.
  • Hastie, T.; Tibshirani, R.; Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Stanford, CA, USA. (Springer Series in Statistics).
  • Hudson, B.D. 1992. The soil survey as paradigm-based science. Soil Science Society of America Journal 56: 836-841.
  • McBratney, A.B.; Santos, M.L.M.; Minasny, B. 2003. On digital soil mapping. Geoderma 117: 3-52.
  • Oliveira, J.B. 1999. Soils of the Piracicaba Sheet = Solos da Folha de Piracicaba. Instituto Agronômico, Campinas, SP, Brazil (Boletim Científico, 48) (in Portuguese).
  • Oliveira, J.B.; Prado, H. 1989. Semi detailed Soil Survey of the State of São Paulo: Piracicaba sheet (SF-23-Y-A-IV) Map, 1:100,000-scale. = Carta Pedológica Semidetalhada do Estado de São Paulo: Piracicaba (SF-23-Y-A-IV). Instituto Agronômico, Campinas, SP, Brazil (in Portuguese).
  • Perrota, M.M.; Salvador, E.D.; Sachs, L.L.B. 2005. Geological Map of the State of São Paulo. 1:750,000-scale. = Mapa Geológico do Estado de São Paulo. CPRM-Serviço Geológico do Brasil, Brasília, DF, Brazil (in Portuguese).
  • Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, USA.
  • Santos, H.G.; Jacomine, P.K.T.; Anjos, L.H.C.; Oliveira, V.A.; Lumbreras, J.F.; Coelho, M.R.; Almeida, J.A.; Cunha, T.J.F.; Oliveira, J.B. 2013. Brazilian System of Soil Classification. = Sistema Brasileiro de Classificação de Solos. 3ed. Embrapa, Brasília, DF, Brazil (in Portuguese).
  • Si, J.; Nelson, B.J.; Runger, G.C. 2003. Artificial neural network 410 models for data mining. p. 41-66. In: Ye, N., ed. The handbook of data mining. Lawrence Erlbaum, Mahwah, NJ, USA.
  • Silva, C.C.; Coelho, R.M.; Oliveira, S.R.M.; Adami, S.F. 2013. Digital soil mapping of the Botucatu sheet (SF-22-Z-B-VI-3): data training on conventional maps and field validation. Brazilian Journal of Soil Science 37: 846-857 (in Portuguese, with abstract in English).
  • Soil Survey Staff. 2014. Keys to Soil Taxonomy. 12ed. USDA-Natural Resources Conservation Service, Washington, DC, USA.
  • Ten Caten, A.; Dalmolin, R.S.D.; Ruiz, L.F.C. 2012. Digital soil mapping: strategy for data pre-processing. Revista Brasileira de Ciência do Solo 36: 1083-1091.
  • Ten Caten, A.; Dalmolin, R.S.D.; Pedron, F.A.; Ruiz, L.F.C.; Silva, C.A. 2013. An appropriate data set size for digital soil mapping in Erechim, Rio Grande do Sul, Brazil. Revista Brasileira de Ciência do Solo 37: 359-366.
  • Weiss, S.M.; Zhang, T. 2003. Performance analysis and evaluation. p. 425-440. In: Nong, Y., ed. The handbook of data mining. Lawrence Erlbaum, Mahwah, NJ, USA.
  • Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, USA.
  • Zhou, Z.H. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall, Cambridge, UK.

Edited by

Edited by: Thomas Kumke

Publication Dates

  • Publication in this collection
    20 May 2019
  • Date of issue
    Sep-Oct 2019

History

  • Received
    21 May 2017
  • Accepted
    07 May 2018
Escola Superior de Agricultura "Luiz de Queiroz" USP/ESALQ - Scientia Agricola, Av. Pádua Dias, 11, 13418-900 Piracicaba SP Brazil, Phone: +55 19 3429-4401 / 3429-4486 - Piracicaba - SP - Brazil
E-mail: scientia@usp.br