REDUCTION OF SAMPLE SIZE IN THE ANALYSIS OF SPATIAL VARIABILITY OF NON-STATIONARY SOIL CHEMICAL ATTRIBUTES

In the study of spatial variability of soil attributes, it is essential to define a sampling plan with adequate sample size. This study aimed to evaluate, through simulated data, the influence of parameters of the geostatistical model and sampling configuration on the optimization process, and resize and reduce the sample size of a sampling configuration of a commercial area composed of 102 points. For this, an optimization process called genetic algorithm (GA) was used to optimize the efficiency of the geostatistical model estimation based on the Fisher information matrix. The simulated data evidenced that the variation of the nugget effect or practical range did not significantly alter the sample size. GA was efficient in reducing the sample size, determining for soil chemical attributes a sample size between 30 and 40 points (29.41 to 39.22% of the initial sampling grid). The presence of spatial dependence was observed for all soil chemical attributes in the two sampling configurations (initial and optimized). The optimized


INTRODUCTION
Soil quality is essential to sustainable development and preservation of ecosystems and biodiversity, and the variability of soil chemical attributes is influenced by differences in interactions between soil formation factors and processes, which contribute to the existence of spatial variability of crops (Artur et al., 2014). In addition, it is important for the cultivation system to reduce costs of applying inputs and possibilities of environmental problems in order to improve the management of the production process and maximize the profitability of production (Bernardi et al., 2014).
Geostatistical techniques allow studying the spatial variability of georeferenced attributes (Cressie, 2015). Thus, understanding the spatial variability of soil is important for planning a soil sampling configuration and crop management (Cherubin et al., 2014).
A reduced sampling plan, i.e., a sampling configuration with the smallest possible size, is important in experiments that involve the spatial variability, allowing a reduction of operational costs and minimization of quality loss of the obtained results (Guedes et al., 2014;Siqueira et al., 2014).
There are several traditional methodologies of spatial sampling that can be used to study the spatial variability of soil and select a sample size, such as stratified (Wang et al., 2012), systematic (Guedes et al., 2011;Wang et al., 2012;Cherubin et al., 2015), random (Guedes et al., 2011;Wang et al., 2012), lattice plus close pairs (Chipeta et al., 2017), and lattice plus infill samplings (Chipeta et al., 2017;Cheng et al., 2018). In contrast to traditional samplings that use a fixed number of samples, there is the sequential sampling in which the sample size increases item by item until it reaches a conclusion in order to accept or reject a hypothesis (Santos et al., 2017).
Moreover, the choice of a configuration and a sample size can be defined as an optimization problem. This methodology is used in the context of redefinition of a sampling configuration obtained from known information of an initial sampling configuration, in which a sampling configuration that minimizes the loss of information on the Engenharia Agrícola, Jaboticabal, v.39, special issue, p.56-65, sep. 2019 results of the analyses should be chosen (Guedes et al., 2014). One of these optimization processes is called genetic algorithm (GA), which consists of a search technique based on the process of evolution and adaptation of individuals of a population so that the fit ones remain in this population (Pessoa et al., 2015).
In addition, the process of resizing sampling configurations must consider a search criterion known as the objective function, which is minimized or maximized and expresses the optimization efficiency. There are criteria of optimization efficiency based on spatial prediction (mean or weighted variance, sum of the quadratic error, measure of accuracy, overall accuracy, etc.) (Guedes et al., 2011;Guedes et al., 2016;Szatmári et al., 2018), as well as criteria that consider the efficiency as the geostatistical model estimation, such as the objective function based on the inverse-Fisher information matrix (Zhu & Stein, 2005).
Previous studies involving optimized sampling configurations only optimized the sample size or sampling configuration. A methodology to simultaneously optimize sample size and sampling configuration was obtained by Guedes et al. (2014), who used the optimization algorithm called simulated annealing. However, the simulated annealing has as a disadvantage the direct relationship of the computational cost and the number of samples in the initial configuration.
Moreover, these studies have used georeferenced stationary variables, i.e., the average of the georeferenced variable throughout the area is constant. However, stationarity is not a characteristic not always identified in soil properties (Szatmári et al., 2018).
Considering non-stationary simulated and real data (soil chemical attributes), this study aimed (a) to evaluate the influence of parameters of the geostatistical model and the initial sampling configuration used in the optimization process; and (b) to propose and evaluate the resizing of a sampling configuration, aiming at reducing its sample size for a commercial area of soybean cultivation.

MATERIAL AND METHODS
Initially, a simulation study was carried out to reproduce a set of possibilities in the real data to be evaluated in this research, as well as to extend the theoretical-practical knowledge on the optimization of size and sampling configuration in soil chemical properties with non-stationary spatial dependence structure.

Study of simulations
Nine non-stationary simulated data sets were generated to combine parameters of the geostatistical model with low, medium, and high radius (range) and intensity (relationship between the nugget effect and sill) of spatial dependence. Simulations were generated with reference to the sampling configuration of the agricultural area considered in the practical study. The lattice plus close pairs configuration, composed of 100 sample points distributed in a 9 × 9 regular sampling grid with addition of 19 nearby points, which were randomly added to the regular grid, showing lower distances with some grid points than that between points of the regular grid, wad used. For this, a square area with x and y coordinates ranging from 0 to 1 was used, which represented a discretization of the study area ( Figure 1). The values of the regionalized variables were simulated for each simulated data set by a Monte Carlo experiment, which represented stochastic process realizations { ( ), ∈ }, where ( ), … , ( ) are observations of the georeferenced variable at = ( , ) ( = 1, … , ) sampled spatial locations, where ⊂ ℛ and ℛ is the two-dimensional Euclidean space (Mardia & Marshall, 1984). The georeferenced variable was expressed by a Gaussian linear spatial model (Uribe-Opazo et al., 2012) described in matrix notation by = + , where ~ ( , ∑), and the random error vector has E( ) = (null vector × 1) and covariance matrix ∑ = [(σ )], × , with elements σ = , , i,j = 1,...,n (Mardia & Mashall, 1984;Uribe-Opazo et al., 2012).
The georeferenced variable was considered nonstationary, and the vector of mean ( = , × 1) represented a directional trend of the georeferenced variable expressed by the model = β + β y, where = (β , β ) is a vector of unknown parameters, such that β and β need to be estimated and is the full-rank delineation matrix (Cressie, 2015).
Furthermore, it was assumed that ∑ is the nonsingular covariance matrix, such that ∑ = φ + φ (φ ), where φ is the nugget effect, is the identity matrix × , φ is the contribution, φ is the range function of the model, where the practical range ( = ( )) is the radius of spatial dependence, and (φ ) is a matrix × , which is a function of φ (Uribe- Opazo et al., 2012;De Bastiani et al., 2015).
Simulations were carried out at each test considering β = 10 and β = 3 and an exponential model to define the covariance with the parameter contribution (φ ) equal to 1 and all combinations of the following values for the practical range parameters ( = 0.45, 0.60, and 0.90) and nugget effect (φ = 0, 0.5, and 0.8).
An iterative optimization process of configuration and sample size was applied for each simulation of each test. This optimization process consists of two nested phases: the "external" and "internal" phases. A sampling plan with an established sample size was performed in the external process.
The internal phase, based on the methodology of the genetic algorithm, was applied in this sample size. This algorithm always seeks to obtain modifications in the optimization process, i.e., changes in the individuals of the population, always seeking an improvement in the objective function (Equation 1) (Zhu & Stein, 2005;Cressie, 2015). The flowchart shown in Figure 2 exemplifies the optimization process. The configuration and optimized sample size were obtained at the end of this process.

Practical study
Soil chemical properties were observed in the 2010/2011 cropping season in a commercial area of soybean production with 167.35 ha located at Fazenda Agassiz in Cascavel, PR, with minimum and maximum limits for the geographical coordinates of 24°57′30″ and 24°56′45″ South latitude and 53°35′ and 53°34′ West longitude, Datum WGS84, and an average elevation of 650 m. The soil is classified as a Dystroferric Red Latosol with clay texture. A total of 102 soil sample points of a lattice plus close pairs configuration were collected (Chipeta et al., 2017), with a minimum distance between regular grid points of 141 meters, and in some randomly selected places, sampling was performed with smaller distances (75 and 50 meters between pairs of points) ( Figure 3). The samples were located and georeferenced by a Global Positioning System (GPS) signal receiver in a Datum coordinate system WGS84, UTM (Universal Transverse Mercator) projection.
Soil samples were taken at each demarcated point ( Figure 3). Four soil subsamples were collected near these points at a depth of 0.0 to 0.2 m, mixed and stored in plastic bags, with samples of approximately 500 g, thus composing the sample representative of the plot. Chemical analyses were performed using the Walkley-Black method (Walkley & Black, 1934).
Descriptive and geostatistical analyses were performed for each soil chemical attribute. The existence of anisotropy using the non-parametric test of Maity & Sherman (2012) (MS) was evaluated at 5% significance level. The following models of the semivariance function were estimated by the maximum likelihood method (Uribe-Opazo et al., 2012;Cressie 2015): exponential, Gaussian, and Matérn family with shape parameters k = 1, 1.5, and 2. The choice of the model was performed by the crossvalidation technique (leave one out) (Faraco et al., 2008;Cressie, 2015). Subsequently, the spatial prediction by kriging of each soil chemical attribute was carried out in a grid of non-sampled locations in the agricultural area under study (Figure 3). The thematic map of each attributed was constructed considering this spatial prediction.
Subsequently, the GA was applied to each soil chemical attribute taking into account the same phases and criteria applied in the simulations (Figure 2). A small sample size configuration was obtained for each soil chemical attribute at the end of the optimization process, and exploratory and geostatistical analyses were carried out again.
Furthermore, the initial and optimized sample configurations were compared. The purpose of this comparison was to identify which one provided a better estimation of the variable in non-sampled locations. For this, the following measures were used: the mean of the kriging variance, overall accuracy (OA), and Kappa (Kp) and Tau (T) concordance indices. The studies of Guedes et al. (2014) and Landis & Koch (1977) are recommended for further details of the indices.
Simulations, GA implementation, and statistical and geostatistical analyses were performed in the software R (R Development Core Team, 2018) using the packages geoR (Ribeiro Jr. & Diggle, 2001) and sm (Maity & Sherman, 2012).

Study of simulations
The estimated values for the logarithm function of the determinant of the inverse-Fisher information matrix (V (θ)), obtained at the end of the optimization process, are very close and have a low dispersion of these minimum values in all simulations (Table 1). A relevant mean decrease in the estimated value of V (θ) (ranging from 68 to 108%) was observed when the estimated values of V (θ) were compared at the beginning and end of the optimization process, indicating an efficiency in the minimization of V (θ) ( Table 1). In addition, all the simulations presented a low variability of the estimated values of V (θ), which means that the optimization process determined a reduced size sample configuration with a higher minimization of V (θ), thus showing the efficiency of the process.
Engenharia Agrícola, Jaboticabal, v.39, special issue, p.56-65, sep. 2019 TABLE 1 Mean values according to the simulated practical range and the simulated nugget effect of the estimated value of the logarithm of the determinant of the inverse-Fisher matrix information (V (θ)) of the optimized sample, the percentage of decrease of (V (θ)) (Δ (%)) in relation to the beginning and end of the optimization process, and the reduced sample size (N). In parentheses is the standard deviation of these values. a is the simulated practical range (km); φ is the simulated nugget effect; and ∆ (%) = .
The simulation study showed no relationship between the estimated values of V (θ) (obtained at the end of the optimization process) and values of the nugget effect and practical range (Table 1). However, according to Landim (2006), better estimates of parameters of the geostatistical model are obtained when these models are based on semivariograms that show the lowest ratio between the nugget effect and sill and highest practical range.
In most cases, when the nugget effect or the practical range varied, no relevant change was observed in the reduced sample size, and its lowest dispersion was obtained for the simulation with the lowest nugget effect and highest practical range (φ = 0 and = 0.90). Considering all the simulations, the best sample configurations obtained by the optimization process had, on average, 35 to 39 points, thus reducing the number of sampling points by 62 to 66% in relation to the initial grid (Table 1).
On average, the smallest sample size was obtained with the lowest value of the nugget effect and the highest value of practical range (φ = 0 and = 0.90), while the largest sample size was obtained with the highest value of nugget effect and the second largest value of practical range (φ = 0.80; = 0.60). No pattern was identified for the arrangement of chosen points in all simulated cases when comparing the layout of points of the optimized sample grid (for an example of each simulation - Figure 4).

Practical study
The commercial area was initially composed of 102 sampling points. A minimum sample size ranging from 30 to 40 points was obtained after applying GA for each soil chemical attribute (Table 2), being the highest and lowest number of points in the reduced sample found for Mn and C content, respectively. This reduced sample size corresponded, respectively, to 39.22 and 29.41% of the number of points in the initial sample grid, i.e., a reduction of 60.78 to 70.59% in the initial grid and, consequently, in the cost with laboratory analysis of future studies.
These results were similar to those obtained in simulations and lower than the sample size optimized by simulated annealing proposed by Guedes et al. (2014) or the fixed sample size (50% of the initial grid) in the optimization of a sample configuration proposed by Guedes et al. (2011) using a hybrid genetic algorithm and considering the efficiency of spatial prediction. In addition, these results corroborate the findings of Dias et al. (2018), who evaluated the effect of sample densities and observed that a reduction in an interval of 60 to 80% of the sample grid allowed the identification of spatial variability.
There is no consensus in the literature regarding the number of samples to be collected per hectare. The results of the present study show a variation of one sample for every 4 to 6 ha, which is in line with the amplitude of the sample density found in the literature (Cherubin et al., 2014;Siqueira et al., 2014;Zonta et al., 2014) (Table 2). Table 2 shows that even with sample reduction, the descriptive statistics of soil chemical attributes obtained for the sampling configuration optimized by GA were similar to the results of the initial sampling configuration.
All soil chemical attributes showed a decrease in the value of the coefficient of variation (CV) when comparing the initial and optimized sampling configurations (Table 2). According to Schmidt et al. (2002), attributes with a high dispersion are theoretically better to evidence some locations than attributes with lower dispersion. CV is the coefficient of variation; Coef. X (r), Coef. Y (r): Person's linear correlation coefficient between soil chemical attributes and X and Y coordinates; and and * are the number of points of the original and optimized grid, respectively.
The presence of trend in the north direction (Y coordinate) for each soil chemical attribute was intensified when the optimized sampling configuration was used, which is due to an increase in the values of the Pearson's linear correlation coefficient (r) of each soil chemical attribute with the coordinates Y (coef. Y (r)). All soil chemical attributes showed no trend in the south direction (X coordinate) for both sampling configurations (initial and optimized) ( Table 2).
The null hypothesis that the spatial dependence structure is isotropic was not rejected (p-value> 0.05) in the non-parametric MS test of isotropy applied for each soil chemical attribute. In addition, for both sampling configurations, the best model of the semivariance function was the Matérn family model with k = 2 for Ca and C contents and the exponential model for Cu and Mn contents and pH.
All soil chemical properties in all sampling configurations presented spatial dependence when the estimated value of the relative nugget effect (RNE) was evaluated (Table 3) (Cambardella et al., 1994). The ratio between the nugget effect and sill (RNE), which characterizes the spatial dependence, decreased for most soil chemical attributes with a reduction in the number of points, a result that has also been found in the literature when considering different intensities of regular soil sampling (Souza et al., 2014). Thus, the lower the ratio between the nugget effect and sill is, the lower the variance of the estimate and hence the higher the confidence in the estimation (Landim, 2006).
Engenharia Agrícola, Jaboticabal, v.39, special issue, p.56-65, sep. 2019 An increase in the spatial dependence radius ( ) and a reduction in the estimated value of the nugget effect were observed for all soil chemical attributes when using the optimized sampling configuration in the estimation of the geostatistical model, which evidenced a difference in the estimated values of parameters of the geostatistical model. This reduction is related to a small reduction of randomness as the number of samples decreased, i.e., a sampling grid with a higher number of samples is associated with a higher variability in the measured values and also with a higher presence of sampling or measurement noise (Porto et al., 2011;Souza et al., 2014). TABLE 3. Estimated values of the parameters of the adjusted geostatistical model and objective function obtained by GA for the soil chemical attributes Ca (cmolc dm −3 ), C (g dm −3 ), Cu (mg dm −3 ), Mn (cmolc dm −3 ), and pH, considering the original and optimized sampling configurations.
Soil chemical attributes showed an estimated value of the spatial dependence radius ( ) ranging from 250 to 880 m when considering the initial sampling configuration and values from 280 to 1370 m when considering the reduced sampling configuration (Table 3). The increase in the range value produces a thematic map with more continuous structures, without the formation of small subregions, which facilitates the agricultural management. However, the thematic maps become less attenuated as the value of the nugget effect decreases, with a higher influence of neighboring samples to points to be estimated, which leads a higher precision of the neighborhood (Cressie, 2015).
Considering the optimized sampling configuration, all soil chemical attributes showed a reduction in the estimated values of V (θ) (from 4 to 192%) when compared with the estimated values of V (θ), obtained with the original sample.
The lowest reduction was obtained for pH, which presented the highest value for RNE, indicating it is close to the threshold defined as weak spatial dependence for original and optimized sampling configurations (RNE > 75%, Cambardella et al., 1994). Moreover, the highest reduction was obtained in the attribute that presented the smallest sample size in the reduced sampling configuration (Table 3).
The estimated values of the standard deviation of the parameters indicated a decrease for most of soil chemical attributes associated with parameters of the regression model, which explains the mean, nugget effect, practical range, and contribution when comparing the reduced and original sampling configurations. It shows that model estimation in the optimized configuration was more efficient than in the original configuration (Table 4) (Pigoto & Barreto, 2004). The maps of all soil chemical attributes constructed using the optimized and initial sampling configuration did not present visual similarities, which was confirmed by measurements of overall accuracy and Kappa and Tau concordance indices (OA<0.85; K<0.67; T<0.67) ( Figure 5) (Landis & Koch, 1977, Guedes et al., 2014. In relation to spatial prediction, this dissimilarity can be considered a disadvantage for the optimization process, which considered only one criterion associated with the efficiency of the estimation quality of the geostatistical model. However, criteria associated with the quality of the geostatistical model estimation or spatial prediction are not necessarily concomitant (Muller, 2007). Also, these measurements do not identify which sampling configuration is the best in the spatial prediction, but only indicate the similarity present between maps.
All soil chemical attributes presented a reduction in the mean value of the kriging variance when the reduced sampling configuration was used in the spatial prediction. Thus, kriging produced better estimates of the georeferenced variable in non-sampled locations when the optimized sampling configuration was used ( Figure 5).
For all soil chemical attributes, the visual analysis of the layout of selected locations (red dots in Figure 5) shows that the optimization process sought to select points of heterogeneous sub-regions or points close to each sub-area described by the thematic map of the original grid. A higher scattering of selected points was observed mainly for C, Cu, and pH, which presented intermediate practical range values when compared to other attributes ( Figure 5). Therefore, the algorithm sought a total area coverage, tending not to select contiguous samples, which produces better results regarding the analysis of spatial variability (Fattorini et al., 2015). Thus, in general, a scattering of the chosen sample points was observed throughout the study area.

CONCLUSIONS
The optimization process was efficient for the simulated and real data and resized the sample grid that involves the experiment, reducing its sample size and improving the estimates of the Gaussian spatial linear model.
The new optimized sampling configuration varied from 30 to 40 points for all soil chemical attributes, which corresponds respectively to 29.41 to 39.22% of the original grid. Thus, one sample at every 4 or 6 hectares would be required for the composition of the sampling configuration. These conclusions were obtained from an optimization process that considers previously known information, such as an initial sampling configuration and spatial dependence structure of the already estimated attributes. Thus, the implementation of an initial sampling configuration composed of 30 to 40 points and efficient in obtaining the results of the spatial variability analysis would be difficult.
Engenharia Agrícola, Jaboticabal, v.39, special issue, p.56-65, sep. 2019 Regarding the estimation of the spatial dependence structure, all soil chemical attributes in both sampling configurations presented moderate or strong spatial dependence when the relative nugget effect and practical range were simultaneously evaluated. Relevant differences were observed for all soil chemical properties between thematic maps constructed considering the configuration of sampling points of the original grid and reduced size. However, the values of the mean of kriging variance and deviations of model estimates showed that the optimized sampling configuration produced a better quality in describing the spatial dependence structure.
Regarding the simulated data, the variation in the nugget effect or practical range did not provide any relevant change in the reduced sample size in most cases.