SPATIAL VARIABILITY OF SOYBEAN YIELD THROUGH A REPARAMETERIZED T-STUDENT MODEL

The t-Student distribution has been used to the spatial dependence modelling of soybean yield as an alternative to the normal distribution, being used for data with heavier tails or discrepant values. However, a usual Student t-distribution does not allow direct comparisons of geostatistical methods with a normal distribution. The aim of this study was to assess the soybean yield spatial variability through a reparameterized t-Student linear model, comparing the results with those of a Gaussian linear model. For parameter estimation, a complete maximum likelihood (CML) method was used through an expectation-maximization (EM) algorithm. The maps constructed with both reparameterized t-Student and normal distributions are dissimilar and present a kappa index (K) equivalent to 0.64. The reparameterized t-Student distribution is an alternative in studying data with discrepant values, showing the ability to decrease the influence of these points.


INTRODUCTION
Geostatistics can assist in precision agriculture since its techniques allow constructing maps that determine the spatial dependence structure of yield associated with soil and plant attributes.Thus, it helps the producer to decide on the use of agricultural inputs in appropriate quantities and locations in order to increase yield, reduce losses, and maintain environmental quality.This technique is based on the regionalized variable theory proposed by Matheron, influenced by the observations made by Kriger.According to Vieira (2000), Kriger analyzed gold concentration data in South Africa and observed the impossibility of finding meaning in the variances without taking into account the distance between samples.Therefore, the values of a variable distributed in the space are correlated within a radius of spatial dependence.
In a spatial variability study, the results obtained by geostatistical methods can be influenced by discrepant data, leading to biased predictions (Cressie, 2015).A solution to the presence of discrepant data is the use of robust models, whose parameter estimation is less sensitive to these data.According to Manghi et al. (2016), class models of symmetric distributions allow reducing the influence of discrepant data, incorporating additional parameters that adjust the kurtosis of data distribution.The t-Student distribution belongs to the class of symmetric distributions and exhibits symmetry properties, greater flexibility regarding the degree of kurtosis, and has as additional shape parameter 0 v  , which defines the degrees of freedom of distribution (Assumpção et al., 2011;   2014).Lange et al. (1989) propose a reparametrization of the t-Student distribution from a transformation in the shape parameter v , allowing us to assume the existence of the second finite moment and thus a more direct comparison with the normal distribution.This reparametrization is justified by the importance that the spatial dependence modeling represents since the new shape parameter  is limited and this process allows estimating parameters by maximum likelihood (Nesi et al., 2013) and implementing the EM iterative algorithm (Dempster et al., 1977;Assumpção et al., 2014).
This study aimed to assess the spatial variability of soybean yield by means of a reparameterized t-Student linear model, comparing the results with a Gaussian linear model.For estimating these model parameters, a complete maximum likelihood (CML) method was used through an expectation-maximization (EM) algorithm.

Reparameterized t-Student distribution
Much of the statistical inference involving continuous random variables is based on normal distribution.However, to obtain reasonable inferences, assuming normality, it is necessary to ensure conditions such as symmetry and a certain value of kurtosis.Among the symmetric models alternative to the normal distribution is the t-Student distribution, which presents as an additional parameter the degree of freedom   0 vv that allows kurtosis modeling.A priori, this parameter can be fixed.However, Lange et al. (1989) recommend fixing it at 4 v  for a small data set and its estimation for a large data set.This distribution has been widely used in the study with real data because it has tails longer than the normal distribution and allows the discrepant points present in the data set to be encompassed (Lange et al., 1989;Osorio et al., 2007).Galea et al. (2002) suggest the t-Student distribution as an alternative to the normal distribution due to the statistical inference based on the t-Student distribution to combine conceptual and computational simplicity with generality, in addition to being applicable in a great variety of situations.An important feature of t-Student distribution is that when the degree of freedom v increases, the t-Student distribution approaches to the normal probability distribution.Lange et al. (1989)

 
,, where 1 is a vector of 1's of order n × 1, for 1 v  , and the covariance matrix

 
Cov Y is undefined.Lange et al. (1989) suggest the reparametrization of t-Student distribution for allowing the direct comparison between parameter estimation of the mean vector and the covariance matrix with the model assuming normality.The authors also mention that an improvement of inference is observed when the degree of freedom presents the transformation where, Ys can be written as: being the deterministic term ( ) , is the vector p × 1 of unknown parameters to be estimated, and () i es is a spatially correlated random component.
Equation ( 2) can be written in a matrix form as: , where X is a matrix n×p of columns with complete rank, with lines T i x and the variation between points in space is determined by some covariance function where 1 where Maximization of [eq.( 6)] is obtained by using an iterative process.In this case, the EM (expectation and maximization) algorithm was applied, being the stopping criterion the relative error (RE), where RE  Kano et al. (1993), were applied.For the reparameterized t-Student model, cross-validation is given by [eq.( 7)]: , being the i-th line of the matrix X, is the prediction at the location i s without considering the observation   ,,  r T  After choosing the estimation of  , the best Mátern family model was defined with different shape parameters  (Matérn, 1986) by using the lowest standard error.The map was constructed by means of the regression-kriging method (Michel & Kobiyama, 2015) since it allows the use of covariates.Finally, the maps constructed with the reparameterized t-Student distribution and normal distribution were compared using the Kappa index (K) (De Bastiani et al., 2012), used to measure the exactness of thematic classifications, i.e. it provides a measure of agreement between the reference map values and the model map values.This index is recommended as an adequate precision measure because it uses all elements of the error matrix, being defined by [eq.( 9)]: where N * is the total area, n ii is the area belonging to class i of the model and reference maps, n i+ is the area belonging to class i of the model map, n +i is the area belonging to class i of the reference map, and r is the number of classes.According to Krippendorff (2004) classification, K is classified with low similarity if K < 0.67, medium similarity if 0.67 < K < 0.80, and high similarity if K > 0.80.

Location and characteristics of the study area
Data on soybean yield, plant height, and pods per plant were collected from an experimental area of 47.95 ha located in Cascavel, the western region of Paraná, Brazil, with an approximate location of 24.83° S and 53.60° W, and an average altitude of 650 meters.The soil of this area is classified as a clayey Oxisol (Haplorthox) (EMBRAPA, 2011) and regional climate is a temperate super-humid climate type Cfa (Köeppen) with average annual temperature of 21 °C.All samples were georeferenced in the spatial coordinate system (UTM) by using a Trimble GPS25 (Global Positioning System) GEOEXPLORER 3 data receiver.Figure 1   In 2006, soybean was cultivated in this area by means of the no-tillage system.In 2007, data on soybean yield were collected, being estimated by considering the amount of soybean harvested from all plants distributed in two rows over a meter long, representing a plot.Grains were weighed for each plot and the water content was verified for subsequent correction to 13%.Yield value was converted into t ha −1 .The estimation of average plant height (Hgt), in cm, was performed at soybean vegetative peak by calculating the average of four plants over a linear meter.For the average number of pods per plant (N), four plants were chosen at each point and the number of pods was counted per plant at harvest time.
Statistical analyses were performed using the free software R, version 3.2.0(R Core Team, 2016).The following packages were used: geoR (Ribeiro Junior & Diggle, 2016) for studying the spatial data, map construction by regression kriging interpolation, and comparison of thematic maps; matrixcalc (Novomestky, 2012) for trace calculation; e1071 (Meyer et al., 2015) for calculating the asymmetry and kurtosis; and classInt (Bivand, 2015) for choosing the class intervals for continuous numerical variables.

RESULTS AND DISCUSSION
Table 1 shows the exploratory analysis of values found for the variables soybean yield (Prod) (t ha −1 ), average plant height (Hgt) (cm), and an average number of pods (N).The average soybean yield is 2.99 t ha −1 , with a minimum value of 1.50 t ha −1 and a maximum value of 5.53 t ha −1 .Moreover, 75% of the area presents a yield lower than or equal to 3.35 t ha −1 .Soybean yield is classified as heterogeneous since the coefficient of variation (CV) is 21.27%.The boxplot graph presented in Figure 2a detected a single discrepant point, which corresponds to the sample element 6, with coordinates (236325, 7250475), referring to the maximum yield value in the data set, being equivalent to 5.53 t ha −1 .According to the Postplot graph shown in Figure 2b, observation 6 is in a region where the nearest neighbors have a soybean yield between 2.60 and 2.94 t ha −1 .Rosangela C. Schemmer, Miguel A. Uribe-Opazo, Manuel Galea, et al. Eng. Agríc., Jaboticabal, v.37, n.4, p.760-770, jul./ago. 2017 766 In order to identify the spatial dependence structure of soybean yield as a function of the average plant height (Hgt) and an average number of pods per plant (N), the average soybean yield in the position 2 i sS    was considered as a spatial linear regression model given by: where 12 , , and 3  are the unknown parameters to be estimated.
Parameter estimation studies were performed by complete maximum likelihood (CML) using the EM algorithm of the spatial linear model defined in [eq.( 10)] and parameters of the spatial dependence structure Σ given in [eq.( 4)], considering the Matérn family with parameters 0.5, 1.0, 2.0, 5.0, 10   and 20 associated to shape parameters of the reparameterized t-Student   0.05, 0.067, 0.1, 0.143, and 0.2.
Table 2 shows the determination of the best shape parameter  of the reparameterized t- Student distribution associated to each shape parameter  of the Matérn family using the cross- validation criterion and trace defined by Equations ( 7) and ( 8).In bold is presented the choice of each parameter  for each  with the lowest values of cross-validation TABLE 2. Cross-validation and trace for the choice of the best shape parameter .

 
T r  : trace;  : shape parameter of the Matérn family.In bold is the best shape parameter  ; underlined is the lowest value of cross-validation and trace.An increase in area percentage was observed in the 1st, 2nd, and 5th classes of the map constructed with a normal distribution (Map 2) when compared to the map constructed with the reparameterized t-Student distribution (Map 1) (Figure 4 and Table 4).Consequently, the 3rd and 4th classes presented a reduction, with the 3rd class obtaining a greater reduction, equivalent to 6.08%, decreasing from 37.33 to 31.25% of the area.For comparison between maps, the kappa accuracy index (K) was calculated.This index is considered an appropriate measure by Anderson et al. (2001) since it uses all elements of the error matrix constructed from omission errors and designation between maps (De Bastiani et al., 2012).The obtained value of K = 0.64 allows classifying it as a low similarity.Consequently, the maps constructed with reparameterized t-Student and normal distributions are dissimilar due to the influence of the discrepant point.
As a complementary analysis, a new geostatistical study was carried out by removing the point 6, which was considered as discrepant and assuming that the data presented reparameterized t-Student distribution and normal distribution.The maps constructed without the discrepant point are shown in Figure 5.The kappa accuracy index for comparison between the new maps was K = 0.89, indicating a high similarity between maps (Krippendorff, 2004).Therefore, the interference of this Spatial variability of soybean yield through a reparameterized t-Student model Eng.Agríc., Jaboticabal, v.37, n.4, p.760-770, jul./ago. 2017 769 discrepant point in mapping is relevant.

CONCLUSIONS
When applying the methodology proposed in this study for soybean yield data with the covariates average height and an average number of pods per plant, the parameters estimated by complete maximum likelihood using the reparameterized t-Student distribution presented differences in the estimates of parameters that define the spatial dependence structure when compared to those obtained from a normal distribution.Consequently, differences were observed in soybean yield maps obtained from the different methods.Thus, the use of reparameterized t-Student distribution is an alternative in studying data with discrepant values, showing the ability to decrease the influence of these points.
, n, of the stochastic process Y .The covariance function   , iu C s s is used in the study of spatial dependence of the stationary process and it is specified by a three-dimensional vector given in [eq.(4)] (Uribe-Opazo et al., 2012): n × n.The parametric form of the covariance matrix Σ , represented in [eq.(4)], occurs for several stationary and isotropic processes, in which the covariance     , of the stochastic process reparameterized t-Student Y is given by   12 0. C   On the assumption that   ,, n  YX βΣ , where  represents the shape parameter, considered fixed and the unknown parameters of the model be estimated by maximizing the logarithm of the complete likelihood Spatial variability of soybean yield through a reparameterized t-Student model Eng.Agríc., Jaboticabal, v.37, n.4, p.760-770, jul./ago.2017 763 function defined by [eq.(5)]: the shape parameter  considered fixed, the criteria of cross-validation     VC  , presented by De Bastiani et al. (2015), and the trace criterion


without considering the i-th observation and ii h is the i-th diagonal element of the matrix Hat ( ), also called a projection matrix.Trace criterion consists of calculating the trace of the asymptotic covariance matrix of the estimated mean   ˆ,  μ Xβ as a criterion in choosing considering that the shape parameter is obtained by: For the two criteria, the best shape parameter  is determined by the lowest values of cross-validation shows the experimental area in a regular grid of 75 × 75 meters, totaling 83 observations for the 2006/2007 agricultural season.

Figure 3
Figure 3 shows the cross-validation   VC  and trace   r T  graphs for each  value of the Matérn family model related to those chosen in Table3.For the results of parameter estimation and the respective standard deviations considering the  values for each  selected in Table2.The lowest standard deviations of estimators correspond to the estimated values of 0

Figure
Figure 4a shows the soybean yield map constructed by means of regression kriging interpolation considering that the data have a reparameterized t-Student distribution with 0.05   and shape parameter of the Matérn model 0.5   with the following parameters estimated by CML

FIGURE 4 .
FIGURE 4. (a) Map 1: soybean yield with reparameterized t-Student distribution with 0.050   and Matérn family model with shape parameter 0.5   ; (b) Map 2: soybean yield with normal distribution and Matérn family model with shape parameter 0.5   .

FIGURE 5 .
FIGURE 5. (a) Map 1: soybean yield with reparameterized t-Student distribution with 0.050   and Matérn family model with shape parameter 0.5   without point 6; (b) Map 2: soybean yield with normal distribution and Matérn family model with shape parameter 1.0   without point 6.
state that if a random vector

TABLE 1 .
Descriptive statistics for the variable soybean yield (Prod), the covariates average plant height (Hgt) and an average number of pods (N).

TABLE 3 .
Estimation of the parameters  and  via EM algorithm for different and  .
.  : estimated parameters of the spatial linear regression model; ˆ. : estimated spatial parameters.

TABLE 4 .
Area percentage at each map class of soybean yield constructed with the reparameterized t-Student distribution and normal distribution.