Geostatistics or machine learning for mapping soil attributes and agricultural practices

Applying the upcoming technologies in agriculture has been a major economic, environmental and social challenge for scientists and farmers. In order to overcome such challenge, this study evaluated the advantages and limitations of using geostatistics and machine learning for soil mapping in agricultural practices and soil surveys. The study occurred in Tocantins State, Brazil, and consisted into seven areas with a total extension of 17.24 km 2 , 222 meters regular gridded resulting in one-point sampling per 0.0493 km 2 of five randomly sampled cores within a 1 m circle radius. It was collected 332 georeferenced soil samples at 0-20 cm depth using an auger and then, soil laboratory analyses performed. Afterward, liming rate maps were originated from the predicted soil attributes clay, cation exchange capacity and base saturation comparing four methods: ordinary kriging, random forest, cubist, support vector machine and the best model results of each soil attribute. Evaluating the methods, the Pearson’s index presented strong results for soil attributes predicted by random forest and ordinary kriging. Machine learning methods can be successfully applied for soil mapping in agricultural practices and soil surveys using less soil samples rather than geostatistical framework.


INTRODUCTION
Applying the upcoming technologies in agriculture has been an enjoyable challenge for scientists and agricultural entrepreneurs. However, the challenge is to reach economic, environmental and social sustainability by using these technologies in the field of agricultural practices. Techniques such as geostatistics and machine learning have been performed in land management (Rodrigo-Comino et al., 2018), crop yield prediction (Adamchuk et al., 2017), soil type mapping (Demattê et al., 2015) and zone management (Castro-Franco et al., 2018). The geostatistics began in the field of geology by Krige (1951). Since then, these application of random theory functions are extremely used for spatialisation of georeferenced data and as it has been performed for agricultural purposes (Dowd, 1991;Shannon et al., 2018;Wackernagel, 2014). The machine learning approach estimates soil spatial arrangements using ancillary variables such as digital elevation models (DEM) and its covariates, and remote sensing data such as satellite images from Landsat 5 Thematic Mapper (McBratney et al., 2003;Fongaro et al., 2018;Gallo et al., 2018;Castro-Franco et al., 2018).
The foremost difference between the geostatistical and machine learning methods is how each one deals with spatial statistics and sampling data. The sampling density for geostatistics has to be spatially dependent, otherwise it is classical statistics and no prediction can be performed by using this method. Meanwhile, machine learning relies on georeferenced sampling data only. Both methods depend on georeferenced dataset; however, there is no need of spatial dependency on machine learning. Ordinary kriging is a stationary random function of geostatistics, which means to calculate an average of the radium of each point (Wackernagel, 2014). Other three random spatially functions are Random Forest (Flaxman et al., 2011) and Cubist (Quinlan & Ross, 1993), which are a decision tree normally used in regressions and classifications. The third one is the Support Vector Machine that uses a kernel function to generalise non-linear models (Vapnik, 2000). The three methods are classified as machine learning.
Each method has advantages and limitations when applied to agriculture. The main farmers' complain about geostatistics are the sampling density because the ideal grid for predicting chemical and physical soil attributes is less than one point per hectare (Cherubin et al., 2015;Nanni et al., 2011) and one point per 7.2 hectares, respectively (Nanni et al., 2011). Thus, the geostatistical method can be non-economic sustainable. In order to solve this issue, this study evaluates the advantages and limitations of using geostatistics or machine learning for soil mapping in agricultural practices and soil surveys. We create a liming rate map originated from the predicted soil attributes cation exchange capacity and base saturation comparing four methods: ordinary kriging, random forest, cubist, support vector machine and the best model results of each soil attribute.

Study area and sampling data
The study area is located in the municipality of Barra do Ouro, State of Tocantins, Brazil ( Figure 1). The site was subdivided into seven areas with total extension of 17.24 km 2 (1,724 hectares), 222 meters regular gridded (222 x 222 m) resulting in one-point sampling per 0.0493 km 2 (4.93 ha) of five randomly sampled cores within a 1 m circle radius. It was collected 332 georeferenced soil samples at 0-20 cm depth using an auger. The soil samples were dried (45 ºC for 24 h), grounded and sieved (2-mm mesh). Afterward, the soil chemical and physical analyses were performed as described in Donagemma et al. (2011) and Camargo et al. (2009).

Prediction of soil attributes
The ordinary kriging (OrdKrig), support vector machine (SVM), cubist (Cub) and random forest (RF) methods were applied to predict the soil attributes: clay (g kg -1 ), cation exchange capacity (CEC, cmol c dm -3 ) and base saturation (V%). We only predicted these three attributes because they are required in the liming rate calculation. The ancillary variables used in the machine learning process were (i) the Digital Elevation Model (DEM) retrieved from the Earth Resource Observation and Science Center that distributes the Shuttle Radar Topography Mission data (SRTM v.3, 30 m), and (ii) the Synthetic Soil Image (SYSI) generated from 27-year time-series (1985-2011) of Landsat 5 TM by the methodology described in Demattê et al. (2018). The SYSI represents bare soil pixels along time. The characterisation of the parameters used in the study area is summarized in the Table 1. The methods were performed using the software R (R Development Core Team, 2019) by specific packages as it follows: OrdKrig from "automap" package (Hiemstra et al., 2009), SVM from "e1071" package (Dimitriadou et al., 2011), Cub and RF from "caret" package (Kuhn, 2008). The fitted semivariogram was generated for the three soil properties before of interpolating them using OrdKrig. For machine learning methods (SVM, Cub and RF), the data were initially analysed and no value fields excluded. Then, the 307 sampling points left were randomly divided into 80% for training (247 samples) and 20% for validation purposes (60 samples). Finally, the 80% training data were cross-validated (3 fold, repeated 3 times) for each model (Heung et al., 2014).

Model evaluation
The ordinary kriging method was evaluated by its standard deviation and variance results (Mueller et al., 2004). The machine learning algorithms were evaluated analysing the predicted and observed values of each soil attribute by accessing the Root Mean Squared Error (RMSE), the Pearson correlation coefficient (r) and the Index Of Agreement (IOA) (Willmott et al., 2012).

Liming rate calculation
The liming rate calculation is based on the following classical formula: LR = [(V% 2 -V% 1 ) * CEC]/TNP (Molin et al., 2015;Raij, 1983). LR is the Liming Requirement in ton per hectare; V% 2 is the base saturation that is considered between 60-80% for most crops; V% 1 is the base saturation obtained in laboratory and predicted map; CEC is the cation exchange capacity in cmol dm -3 from laboratory and predicted map; and TNP is the total neutralizing power in %. For the LR calculation, we adopted base saturation of 80% for crop and TNP of 100%. We used the raster calculation of QGIS (QGIS Development Team, 2018) to achieve the LR. Basically, this tool retrieved from the V% and CEC maps the value of each pixel at the same location and then, the formula of liming rate was calculated generating the final map with a pixel resolution of 30 m.

RESULTS AND DISCUSSION
The Pearson's correlation coefficient (r) of the dataset presented moderate linearity between the soil attributes and the satellite bands of the SYSI, which represents the bare soil pixels (Figure 2). However, the DEM had low r values for all attributes (Figure 1). The linearity among them is a fundamental principle to identify the normal distribution needed before of mapping the soil attributes, and the evaluation of the kriging process in geostatistics is analysing its variance and standard variation. The variance and standard deviation increases as the spatial dependency decreases (Figure 3). Thus, whether the distance among the collected samples are higher than one sample per hectare, the soil chemical attributes have no fair prediction. For example, the recommended grid size for phosphorus and potassium prediction in Oxisol is less or equal to 1 sample per 100x100 m (Cherubin et al., 2015). Soil grids larger than 100x100 m display high variance and to some degree no spatial dependency, which reallocate them in the conventional statistics. For physical attributes, Figure 2: Pearson's correlation coefficient (p < 0.01) for all attributes used to map the area. SYSI is the Synthetic Soil Image and the numbers are the satellite's bands; DEM is the digital elevation model; V% is the base saturation; and CEC is the cation exchange capacity. Checkbox values no significant. it is acceptable grids up to one sample per 7.2 hectares (Nanni et al., 2011). These authors found moderate correlation for clay prediction (R 2 > 0.53).
The machine learning methods are evaluated by other meanings, which are the RMSE, r and IOA. The Index Of Agreement (IOA) and r values close to one means that the model prediction fitted well (Table 2). Taking this into account, the RF was the best algorithm to predict clay (IOA = 0.689 and r = 0.83) (Beguin et al., 2017;Vasava et al., 2019) and had close correlation to SVM predicting CEC. For mapping base saturation, the cubist model had better performance rather than the other two methods (Nussbaum et al., 2017). We basically performed the three methods in order to predict the soil attributes. Subsequently, we calculate the liming rate based on the results from RF, Cub, SVM, and the best predicted attribute independent of the method. This last one named Best Model (BM). The ordinary kriging was kept a part from the machine learning methods because our intention is to prove that machine learning methods can be a reasonable framework economically, socially and environmentally sustainable for agriculture. Furthermore, it was calculated the liming rate for an extent of the study area showing the practicability of machine learning

Soil attributes
Figure 4: Pearson's correlation coefficient (p < 0.01) for liming rate predicted by Ordinary Kriging (LR_OrdKrig), Support Vector Machine (LR_SVM), Random Forest (LR_RF), the best model (LR_BM), the extent of the study area (LR_Extent), and cubist (LR_Cubist) methods compared with the calculation from soil laboratory analyses (LR_Lab). methods because they can spatialize attributes or response variables without needs of grid size dependency. Checking the reliability of the calculation of all methods, the Pearson's index presented strong results for soil attributes predicted by RF and OrdKrig. These methods had better performance among others (Figure 4). The advantage of using machine

CONCLUSIONS
Machine learning methods can be successfully applied for soil mapping in agricultural practices and soil surveys using less samples rather than the geostatistical approaches. This conclusion is a key point to farmers that want to apply optimized methods in their agricultural day life. The machine learning frameworks, mainly Random Forest, proved to be economically and environmentally sustainable for agriculture.