Preprocessing procedures and supervised classification applied to a database of systematic soil survey

Valadares, Alan Pessoa; Coelho, Ricardo Marques; Oliveira, Stanley Robson de Medeiros

doi:10.1590/1678-992X-2017-0171


Soils and Plant Nutrition • Sci. agric. (Piracicaba, Braz.) 76 (5) • Sep-Oct 2019 • https://doi.org/10.1590/1678-992X-2017-0171


ABSTRACT:

Data mining techniques play an important role in predicting the spatial distribution of soils in systematic soil surveys, but existing methodologies still lack standardization and their capabilities are not fully understood. The aim of this work was to evaluate the performance of preprocessing procedures and supervised classification approaches for predicting map units from 1:100,000-scale conventional semi-detailed soil surveys. Three 1:50,000-scale sheets of the Brazilian National Cartographic System were used for developing models: “Dois Córregos” (“Brotas” 1:100,000-scale sheet), and “São Pedro” and “Laras” (“Piracicaba” 1:100,000-scale sheet). Soil map information and predictive environmental covariates for the dataset were obtained from the semi-detailed soil survey of the state of São Paulo, from the Brazilian Institute of Geography and Statistics (IBGE) 1:50,000-scale topographic sheets and from the 1:750,000-scale geological map of the state of São Paulo. The target variable was the soil map unit, expressed in four ways: the local “soil unit” name and the soil class at three hierarchical levels of the Brazilian System of Soil Classification (SiBCS). Different data preprocessing treatments and four algorithms, each based on a different approach, were tested. Results showed that composite soil map units were not adequate for the machine learning process. Class balancing did not improve the performance of the classifiers. An accuracy of 78 % and a Kappa index of 0.67 were obtained after preprocessing with Random Forest, the best-performing algorithm. Information from conventional map units of the semi-detailed (4^th order) 1:100,000 soil survey generated models whose accuracy, precision, sensitivity, specificity and Kappa index values support their use in programs for systematic soil surveying.

Keywords:
machine learning algorithms; random forest; tacit soil-landscape relationships; digital soil mapping

Soil Units	Soil classes at the 4^th level of the SiBCS (abbreviations as in the Brazilian System of Soil Classification, Santos et al., 2013)	U.S. Soil Taxonomy
Alva	PVAd e PVAe abrúptico, A moderado, textura arenosa/média	Sandy over Fine-loamy, Arenic and Typic Paleudult
Areia Quartzosa	RQo típico, A moderado	Typic Quartzipsamment
Baguari	PVAd e PVAe típico e abrúptico, A moderado, textura média e média/argilosa	Fine-loamy, Typic Kandiudult
Barão Geraldo	LVdf típico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Rhodic Hapludox
Campestre	PVe nitossólico e NVe típico, A moderado, textura argilosa/muito argilosa	Fine, Rhodic Kandiudult and Kandiudalf
Canela	PVd e PVAd típico, A moderado, textura média e média/argilosa	Fine-loamy over Fine, Typic Kandiudult
Coqueiro	LVAd psamítico e típico, A moderado e fraco, textura média	Coarse-loamy, Typic Hapludox
Diamante	SXe e SXd típico e vertissólico, A moderado, textura média/argilosa	Fine-loamy over Fine, Vertic, Albaquic and Typic Hapludalf
Engenho	MTf e MTo típico, textura argilosa	Very Fine and Fine, Typic Paleudoll
Estruturada	NVef e NVdf típico, A moderado, textura argilosa e muito argilosa	Very Fine and Fine, Kandiudalfic Eutrudox
Hidromórficos	GXvd, GXve, GXbd e GXbe típico, A moderado e proeminente, textura argilosa	Fine, Aquept, Aquent, Aquox, Aquult, Aqualf
Hortolândia	LVd típico, A moderado, textura média	Fine-loamy, Rhodic Hapludox
Itaguaçu	NVdf latossólico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Kandiudalfic Eutrudox and Rhodic Kandiudox
Laranja Azeda	LVAd típico, A moderado, textura média	Fine-loamy, Typic Hapludox
Limeira	LVd típico, A moderado, textura argilosa e muito argilosa	Very Fine and Fine, Rhodic Hapludox
Litólicos	RLe e RLm típicos, A moderado e chernozêmico, textura média	Loamy, Lithic Udorthent
Monte Cristo	PVAd e PVAe abrúptico e arênico abrúptico, A moderado, textura arenosa/média e média/argilosa	Sandy over Fine-loamy and Sandy over Fine, Arenic Kandiudult and Arenic Kandiudalf
Olaria	NXd típico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Typic and Rhodic Kandiudult
Podzóis	ESKo típico, textura arenosa/média	Sandy over Coarse-loamy, Humod
Ribeirão Preto	LVef típico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Rhodic Eutrudox
Santa Cruz	PVAd e PVAe abrúptico, A moderado, textura média/argilosa, média/muito argilosa e argilosa/muito argilosa	Fine-loamy over Fine, Typic Kandiudult and Typic Kandiudalf
Santana	NXe chernossólico, textura média/argilosa	Fine-loamy over Fine, Typic Paleudoll
São Lucas	LAd e LVAd psamítico, A moderado, textura média	Coarse-loamy, Typic Hapludox and Kandiudox
Serrinha	PVAd, PVAe, PAd e PAe arênico abrúptico, A moderado e fraco, textura arenosa/média	Sandy over Fine-loamy, Arenic Paleudult, Grossarenic Paleudult, Arenic Paleudalf and Grossarenic Paleudalf
Sete Lagoas	CYbd e CYbe típico, A moderado e proeminente, textura argilosa e média	Fine and Fine-loamy, Fluventic and Typic Dystrudept
Taquaraxim	CXbd e CXbe típico, A moderado e proeminente, textura média e argilosa	Fine-loamy, Typic Dystrudept
Três Barras	LAd úmbrico, textura média	Coarse- and Fine-loamy, Xanthic and Typic Hapludox

Procedures	Importance / Application
Stratified sampling	Stratified data sampling separating training and testing datasets in the reference area.
Data Selection	Identification and exclusion of inconsistent information.
Discretization	Transformation of continuous quantitative variables into categorical ones.
Undersampling	Resampling that gradually removes instances of the majority classes when training on unbalanced classes.
Oversampling	Resampling that replicates instances of the minority classes when training on unbalanced classes.
Class Balancing	Resampling with standardization of the distribution (frequency) of prediction classes.
Selection of Variables	Evaluation of the predictive power of each explanatory variable and elimination of those detrimental to machine learning.
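
The resampling procedures listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 25 % test fraction, the target class size and the toy pixel data are all assumptions made for illustration, and plain random removal stands in for the "gradual elimination" described in the table.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_of, test_fraction=0.3, seed=0):
    """Split rows into train/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def resample_to(rows, label_of, target, seed=0):
    """Bring every class to `target` rows: drop rows at random from larger
    classes (undersampling), replicate rows at random in smaller ones
    (oversampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            balanced.extend(rng.sample(members, target))
        else:
            extra = [rng.choice(members) for _ in range(target - len(members))]
            balanced.extend(members + extra)
    return balanced


def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins equal-width categories (0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy data: (elevation_m, soil_unit). Values are invented for illustration.
pixels = [(500 + i, "Serrinha") for i in range(80)] + \
         [(700 + i, "Limeira") for i in range(20)]
label = lambda row: row[1]

train, test = stratified_split(pixels, label, test_fraction=0.25)
balanced = resample_to(train, label, target=30)
print(Counter(map(label, train)), Counter(map(label, test)))
print(Counter(map(label, balanced)))
print(equal_width_bins([500, 550, 600, 719], n_bins=4))
```

In this sketch the split is done before any resampling, so replicated minority-class rows never leak into the test set.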

Algorithm (classifier)	Reference	Type of approach
J48 (C4.5)	Quinlan (1993): Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, USA.	Decision Tree (divide-and-conquer process based on the information gain of the data)
Random Forest	Breiman (2001): Breiman, L. 2001. Random forests. Machine Learning 45: 5-32.	Ensemble (bootstrap aggregating based on random decision trees)
Multi-Layer Perceptron	Si et al. (2003): Si, J.; Nelson, B.J.; Runger, G.C. 2003. Artificial neural network models for data mining. p. 41-66. In: Ye, N., ed. The handbook of data mining. Lawrence Erlbaum, Mahwah, NJ, USA.	Artificial Neural Networks (transfer functions based on input signals, connection weights and neuron biases)
Bayes Net	Hall et al. (2009)	Bayesian Classifiers (integrates a Bayesian probability function into an ANN)
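
To make the "bootstrap aggregating" entry above concrete, the following is a heavily simplified Python sketch of the idea behind an ensemble of randomized trees. It is an illustration only: one-level trees (stumps) on a single randomly chosen categorical covariate replace the full random decision trees of Random Forest, and the function names, covariate values and soil-unit associations in the toy data are invented.

```python
import random
from collections import Counter


def fit_stump(sample, feature):
    """One-level 'tree': map each value of one feature to its majority class."""
    votes = {}
    for features, label in sample:
        votes.setdefault(features[feature], Counter())[label] += 1
    table = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    # Fallback for feature values never seen in this bootstrap sample.
    default = Counter(label for _, label in sample).most_common(1)[0][0]
    return feature, table, default


def fit_bagged_stumps(data, n_estimators=25, seed=0):
    """Bootstrap aggregating: each stump is fitted on a resample (with
    replacement) of the data and on one randomly chosen feature."""
    rng = random.Random(seed)
    feature_names = sorted(data[0][0])
    ensemble = []
    for _ in range(n_estimators):
        sample = [rng.choice(data) for _ in data]       # bootstrap sample
        ensemble.append(fit_stump(sample, rng.choice(feature_names)))
    return ensemble


def predict(ensemble, features):
    """Majority vote over all stumps."""
    votes = Counter()
    for feature, table, default in ensemble:
        votes[table.get(features[feature], default)] += 1
    return votes.most_common(1)[0][0]


# Toy training set; the covariate/soil-unit pairings are invented.
data = [({"geology": "basalt", "relief": "gentle"}, "Limeira")] * 20 + \
       [({"geology": "sandstone", "relief": "rolling"}, "Serrinha")] * 20

ensemble = fit_bagged_stumps(data)
print(predict(ensemble, {"geology": "basalt", "relief": "gentle"}))
print(predict(ensemble, {"geology": "sandstone", "relief": "rolling"}))
```

Each stump alone is a weak predictor; the point of the sketch is that resampling the data and randomizing the covariate give the ensemble diverse members whose votes are then combined.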

Algorithms^a	Accuracy (%)	Error (%)	Precision (%)	TPR (%)	FPR (%)	Kappa (%)
J48	50.19	49.81	48.10	50.20	5.50	44.70
MLP	47.87	52.13	44.90	47.90	8.80	39.76
Bayes Net	42.52	57.48	37.30	42.50	6.50	35.78
^a Random Forest could not be used with this dataset due to computational limitations.

Algorithms	Accuracy (%)	Error (%)	Precision (%)	TPR (%)	FPR (%)	Kappa (%)
Random Forest	78.13	21.87	77.70	78.10	12.80	67.0
J48	76.09	23.91	75.30	76.10	13.50	64.0
MLP	71.64	28.36	71.00	71.60	15.70	57.0
Bayes Net	65.75	34.25	65.00	65.70	15.30	50.0
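
The metrics in the table above can all be derived from a confusion matrix. The sketch below shows one common way to do so, assuming (this is not stated in the source) that precision, TPR and FPR are per-class values averaged with weights proportional to class size, and that Kappa is Cohen's Kappa. The function name and the 2 × 2 example matrix are invented for illustration.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    total = sum(map(sum, confusion))
    row = [sum(confusion[i]) for i in range(k)]                        # true counts
    col = [sum(confusion[i][j] for i in range(k)) for j in range(k)]   # predicted counts
    correct = sum(confusion[i][i] for i in range(k))

    accuracy = correct / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row[i] * col[i] for i in range(k)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # Per-class rates, averaged with weights proportional to true class size.
    precision = tpr = fpr = 0.0
    for i in range(k):
        tp = confusion[i][i]
        fp = col[i] - tp
        weight = row[i] / total
        precision += weight * (tp / col[i] if col[i] else 0.0)
        tpr += weight * (tp / row[i] if row[i] else 0.0)
        fpr += weight * (fp / (total - row[i]) if total != row[i] else 0.0)
    return {"accuracy": accuracy, "error": 1 - accuracy, "precision": precision,
            "tpr": tpr, "fpr": fpr, "kappa": kappa}


# Invented two-class example: rows are true classes, columns are predictions.
metrics = classification_metrics([[40, 10],
                                  [5, 45]])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under this weighting the averaged TPR is algebraically equal to the overall accuracy, which is consistent with the near-identical Accuracy and TPR columns in the table (for example 78.13 and 78.10 for Random Forest).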

Soil units	Random Forest (1.0)	Random Forest (0.5)	Random Forest (0.0)	J48 (1.0)	J48 (0.5)	J48 (0.0)	Bayes Net (1.0)	Bayes Net (0.5)	Bayes Net (0.0)
Weighted average precision (%)	72.10	75.80	76.50	68.80	71.40	71.70	40.68	61.46	65.00
Precision per soil unit
Alva	0.131	0.206	0.394	0.098	0.128	0.372	0.041	0.159	0.756
Areia Quartzosa	0.730	0.803	0.805	0.699	0.758	0.755	0.696	0.738	0.736
Baguari	0.433	0.701	0.754	0.360	0.564	0.617	0.161	0.347	0.427
Barão Geraldo	0.563	0.663	0.717	0.517	0.597	0.631	0.155	0.179	0.195
Campestre	0.282	0.495	0.648	0.211	0.346	0.453	0.068	0.123	0.291
Canela	0.761	0.803	0.782	0.709	0.740	0.731	0.501	0.558	0.589
Coqueiro	0.205	0.455	0.746	0.147	0.260	0.378	0.036	0.061	0.077
Diamante	0.500	0.429	0.000	0.231	0.292	0.000	0.000	0.000	0.000
Engenho	0.071	0.127	0.231	0.042	0.079	0.072	0.041	0.071	0.067
Estruturada	0.441	0.565	0.639	0.327	0.434	0.506	0.082	0.111	0.127
Hidromórficos	0.423	0.587	0.646	0.406	0.523	0.570	0.269	0.330	0.356
Hortolândia	0.666	0.680	0.699	0.577	0.585	0.584	0.289	0.320	0.380
Itaguaçu	0.732	0.688	0.700	0.673	0.660	0.560	0.087	0.313	0.167
Laranja Azeda	0.370	0.584	0.718	0.291	0.457	0.590	0.160	0.326	0.574
Limeira	0.743	0.773	0.747	0.695	0.728	0.718	0.510	0.573	0.575
Litólicos	0.371	0.614	0.653	0.300	0.481	0.525	0.187	0.368	0.467
Monte Cristo	0.720	0.735	0.748	0.632	0.642	0.668	0.238	0.285	0.376
Olaria	0.350	0.553	0.655	0.259	0.398	0.529	0.098	0.194	0.277
Podzóis	0.053	0.121	0.250	0.038	0.045	0.083	0.027	0.056	0.000
Ribeirão Preto	0.371	0.394	0.381	0.241	0.263	0.212	0.096	0.129	0.089
Santa Cruz	0.461	0.680	0.699	0.420	0.581	0.589	0.321	0.463	0.486
Santana	0.676	0.800	0.867	0.430	0.630	0.565	0.108	0.350	0.579
São Lucas	0.144	0.341	0.526	0.113	0.199	0.277	0.049	0.063	0.054
Serrinha	0.862	0.806	0.792	0.840	0.794	0.780	0.762	0.742	0.744
Sete Lagoas	0.410	0.463	0.532	0.436	0.493	0.592	0.236	0.323	0.393
Taquaraxim	0.587	0.666	0.774	0.516	0.585	0.674	0.363	0.418	0.449
Três Barras	0.902	0.893	0.904	0.868	0.860	0.868	0.698	0.688	0.728

Covariates	Information Gain	χ²
Elevation	0.885	2,141,866
Geology	0.774	1,258,271
Distance to Drainage	0.219	270,550
Slope Gradient	0.108	168,237
Relief Class	0.074	74,552
TWI	0.054	58,413
Profile Curvature	0.049	45,768
Plane Curvature	0.028	23,355
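
The two ranking scores in the table above can be computed for a categorical (or previously discretized) covariate as sketched below. This is a generic illustration of information gain and of the Pearson chi-squared statistic, not the authors' code; the function names and the eight-pixel toy dataset are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | covariate) for a categorical covariate."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def chi_squared(values, labels):
    """Pearson chi-squared statistic of the covariate-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(values, labels))
    value_totals, label_totals = Counter(values), Counter(labels)
    stat = 0.0
    for v, nv in value_totals.items():
        for y, ny in label_totals.items():
            expected = nv * ny / n
            stat += (observed[(v, y)] - expected) ** 2 / expected
    return stat


# Invented toy data: geology separates the two soil units, relief does not.
soil = ["Limeira"] * 4 + ["Serrinha"] * 4
geology = ["basalt"] * 4 + ["sandstone"] * 4
relief = ["gentle", "rolling"] * 4

for name, covariate in [("geology", geology), ("relief", relief)]:
    print(name, round(information_gain(covariate, soil), 3),
          round(chi_squared(covariate, soil), 3))
```

The chi-squared statistic grows with the number of samples, so very large values such as those in the table are to be expected when every map pixel is an instance; the ranking of covariates, not the absolute magnitude, is what the comparison relies on.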

Soil Unit	Random Forest	J48	MLP	Bayes Net
Alva	0.411	0.550	0.000	0.806
Areia Quartzosa	0.826	0.809	0.760	0.736
Baguari	0.778	0.708	0.709	0.441
Barão Geraldo	0.726	0.670	0.510	0.205
Campestre	0.678	0.552	0.644	0.276
Canela	0.800	0.749	0.624	0.591
Coqueiro	0.744	0.612	0.000	0.000
Diamante	0.333	0.600	0.000	0.000
Engenho	0.000	0.083	0.000	0.000
Estruturada	0.687	0.586	0.487	0.120
Hidromórficos	0.656	0.625	0.419	0.356
Hortolândia	0.727	0.645	0.582	0.376
Itaguaçu	0.721	0.596	0.000	0.154
Laranja Azeda	0.728	0.640	0.689	0.578
Limeira	0.758	0.751	0.587	0.586
Litólicos	0.700	0.640	0.559	0.477
Monte Cristo	0.783	0.694	0.621	0.375
Olaria	0.693	0.585	0.425	0.322
Podzóis	0.667	0.200	0.000	0.000
Ribeirão Preto	0.494	0.338	0.000	0.075
Santa Cruz	0.718	0.695	0.643	0.489
Santana	0.947	0.735	0.000	0.481
São Lucas	0.574	0.409	0.804	0.067
Serrinha	0.792	0.781	0.753	0.742
Sete Lagoas	0.566	0.635	0.624	0.412
Taquaraxim	0.790	0.714	0.602	0.453
Três Barras	0.911	0.889	0.781	0.733

Algorithms	Accuracy (%), SiBCS 2^nd level	Accuracy (%), SiBCS 3^rd level	Accuracy (%), SiBCS 4^th level	Kappa (%), SiBCS 2^nd level	Kappa (%), SiBCS 3^rd level	Kappa (%), SiBCS 4^th level
Random Forest	78.69	78.61	78.18	67.90	67.81	67.42
J48	76.77	76.62	76.25	65.00	64.89	64.57
MLP	71.74	72.22	71.45	57.42	57.78	57.42
Bayes Net	66.71	66.37	65.84	51.28	50.86	50.42

TP rate	FP rate	Precision	AUC	Soil class (abbreviations as in the Brazilian System of Soil Classification, SiBCS; Santos et al., 2013)	U.S. Soil Taxonomy
0.701	0.001	0.811	0.972	CXbd e CXbe típicos, A moderado e proeminente, textura média e argilosa	Fine-loamy, Typic Dystrudept
0.683	0.006	0.578	0.990	CYbd e CYbe típicos, A moderado e proeminente, textura argilosa e média	Fine and Fine-loamy, Fluventic and Typic Dystrudept
0.000	0.000	0.000	0.707	ESKo típico, textura arenosa/média	Sandy over Coarse-loamy, Humod
0.601	0.006	0.663	0.973	GXvd, GXve, GXbd e GXbe típicos, A moderado e proeminente, textura argilosa	Fine, Aquept, Aquent, Aquox, Aquult, Aqualf
0.180	0.002	0.564	0.860	LAd e LVAd psamíticos, A moderado	Coarse-loamy, Typic Hapludox and Kandiudox
0.902	0.001	0.922	0.997	LAd úmbrico, textura média	Coarse- and Fine-loamy, Xanthic and Typic Hapludox
0.300	0.000	0.770	0.903	LVAd psamítico e típico, A moderado e fraco, textura média	Coarse-loamy, Typic Hapludox
0.617	0.001	0.739	0.965	LVAd típico, A moderado, textura média	Fine-loamy, Typic Hapludox
0.850	0.005	0.759	0.994	LVd típico, A moderado, textura argilosa e muito argilosa	Very Fine and Fine, Rhodic Hapludox
0.617	0.002	0.737	0.986	LVd típico, A moderado, textura média	Fine-loamy, Rhodic Hapludox
0.682	0.002	0.738	0.987	LVdf típico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Rhodic Hapludox
0.212	0.000	0.607	0.913	LVef típico, A moderado, textura argilosa ou muito argilosa	Fine and Very Fine, Rhodic Eutrudox
0.010	0.000	0.143	0.743	MTf e MTo típicos, textura argilosa	Very Fine and Fine, Typic Paleudoll
0.611	0.000	0.733	1.000	NVdf latossólico, A moderado, textura argilosa ou muito argilosa	Fine and Very Fine, Kandiudalfic Eutrudox and Rhodic Kandiudox
0.546	0.001	0.655	0.964	NVef e NVdf típicos, A moderado, textura argilosa e muito argilosa	Very Fine and Fine, Kandiudalfic Eutrudox
0.527	0.001	0.669	0.943	NXd típico, A moderado, textura argilosa e muito argilosa	Fine and Very Fine, Typic and Rhodic Kandiudult
0.759	0.000	0.837	0.991	NXe chernossólico, textura média/argilosa	Fine-loamy over Fine, Typic Paleudoll
0.707	0.002	0.763	0.988	PVAd e PVAe abrúpticos e arênico abrúpticos, A moderado, textura arenosa/média e média/argilosa	Sandy over Fine-loamy and Sandy over Fine, Arenic Kandiudult and Arenic Kandiudalf
0.572	0.000	0.490	0.998	PVAd e PVAe abrúpticos, A moderado, textura arenosa/média	Sandy over Fine-loamy, Arenic and Typic Paleudult
0.571	0.021	0.714	0.931	PVAd e PVAe abrúpticos, A moderado, textura média/argilosa, média/muito argilosa e argilosa/muito argilosa	Fine-loamy over Fine, Typic Kandiudult and Typic Kandiudalf
0.604	0.006	0.769	0.943	PVAd e PVAe típico e abrúpticos, A moderado, textura média e média/argilosa	Fine-loamy, Typic Kandiudult
0.902	0.235	0.793	0.907	PVAd, PVAe, PAd e PAe arênicos abrúpticos, A moderado e fraco, textura arenosa/média	Sandy over Fine-loamy, Arenic Paleudult, Grossarenic Paleudult, Arenic Paleudalf and Grossarenic Paleudalf
0.836	0.003	0.788	0.994	PVd e PVAd típicos, A moderado, textura média e média/argilosa	Fine-loamy over Fine, Typic Kandiudult
0.412	0.001	0.677	0.947	PVe nitossólico e NVe típico, A moderado, textura argilosa/muito argilosa	Fine, Rhodic Kandiudult and Kandiudalf
0.518	0.011	0.702	0.916	RLe e RLm típicos, A moderado e chernozêmico, textura média	Loamy, Lithic Udorthent
0.757	0.040	0.829	0.956	RQo típico, A moderado	Typic Quartzipsamment
0.000	0.000	0.000	0.964	SXe e SXd típicos e vertissólicos, A moderado, textura média/argilosa	Fine-loamy over Fine, Vertic, Albaquic and Typic Hapludalf
0.782	0.128	0.777	0.929	Weighted average

Escola Superior de Agricultura "Luiz de Queiroz" USP/ESALQ - Scientia Agricola, Av. Pádua Dias, 11, 13418-900 Piracicaba SP Brazil, Phone: +55 19 3429-4401 / 3429-4486
E-mail: scientia@usp.br


[1] *Corresponding author <rmcoelho@iac.sp.gov.br>