LOCAL INFLUENCE FOR SPATIAL ANALYSIS OF SOIL PHYSICAL PROPERTIES AND SOYBEAN YIELD USING STUDENT'S t-DISTRIBUTION

The modeling and estimation of the parameters that define the spatial dependence structure of a regionalized variable by geostatistical methods are fundamental, since these parameters, underlying the kriging of unsampled points, allow the construction of thematic maps. One or more atypical observations in the sample data can affect the estimation of these parameters. Thus, the assessment of the combined influence of these observations by the analysis of Local Influence is essential. The purpose of this paper was to propose local influence analysis methods for the regionalized variable, given that it has n-variate Student’s t-distribution, and compare it with the analysis of local influence when the same regionalized variable has n -variate normal distribution. These local influence analysis methods were applied to soil physical properties and soybean yield data of an experiment carried out in a 56.68 ha commercial field in western Paraná, Brazil. Results showed that influential values are efficiently determined with n variate Student’s t -distribution. espacial.


INTRODUCTION
Geostatistics is an important tool to analyze spatial variability of soil properties and crop yield. Thematic maps based on geostatistics are excellent resources for the analysis of agricultural performance, and are considered the most complete option for the zoning of crop spatial variability (Molin, 2002). However, when the data set contains influential values, the maps diverge from reality and can therefore cause misinterpretation.
The identification of influential observations in the data set is known as diagnostic analysis (Paula, 2004) and was presented by Cook (1986) as a new method called "Local Influence" for diagnostic analysis of the Gaussian process. This analysis was based on the assumption that the model is reliable, and analyzed the power of conclusions for data or model disturbances.
Several studies of spatial analysis have been developed using diagnostic techniques to find observations that influence model sets. Christensen et al. (1992Christensen et al. ( , 1993 worked on studies that analyzed the prediction of parameters in linear spatial models by applying diagnostic techniques to find observations that can influence the estimation of the covariance matrix, used in universal kriging. Militino et al. (2006) tried to identify outliers in multivariate linear spatial models. Borssoi et al. (2009Borssoi et al. ( , 2011 conducted diagnostic studies of linear Gaussian spatial models, and developed several measures of local influence. The purpose of this paper was to propose a method that uses n-variate Student's t-distribution with a fixed degree of freedom and also uses the maximum likelihood function as method of estimating spatial variability parameters of soybean yield. From the estimation of these parameters, a study of local influence is developed, which analyzes the existence of influential observations in the spatial dependence structure (graph C i ) and in the linear predictor (graph l p ), considering the yield itself as explicative variable and soil penetration resistance (SPR) and soil density (Des) as covariates. Finally, these results were compared with those obtained by using the n-variate normal distribution.

Student's t linear spatial models
Let {Y(s), s ∈ S} be a stationary stochastic process, where μ is the vector of location parameters and ∑ the scaling matrix, ie Y ~ t n (μ,∑,v). Consider that Y = (Y(s 1 ),…, Y(s n ) T are registered at known locations in space known in s i and s j for i ≠ j = 1,…,n. Each Y(s i ) can be written as: where μ(s i ) is a deterministic and ε(s i ) a stochastic term, both depending on the parameter space where Y(s i ) operates. It is assumed that the stochastic error ε has zero mean, ie, (s i ) = E[ε(s i )] = 0, i = 1,…, n and that the variation of the points in space is determined by a covariance function C(.) of s i and s j . This covariance function is also specified by a parameter vector Φ = (ϕ 1 ,ϕ 2 ,ϕ 3 ) T , which are the parameters that determine the structure of spatial dependence.
For some known functions of S, such as X 1 (s),…,X p (s), we consider the linear spatial model in the mean of the process μ(.), as: μ(s i ) = β j X j (s i ), i = 1,…, n, whereas X 1 is a vector of um's, X 2 ,…,X p are the covariables and β 1 ,…,β p are unknown parameters.
Equation (1) can be written in matrix form, where Y = (Y(s 1 ),…,Y(s n )) T and ε = (ε(s n )) T are vectors nx1, μ = Xβ is a vector nx1, X is a matrix n x p composed by the vector of um's and by (p-1) covariates and β = (β 1 ,…,β p ) T is the vector of unknown parameters to be estimated.
Random error vector expectation ε, E(ε) = 0 is a zero vector, and its scaling matrix is Assuming Σ is not singular and that the matrix X is full rank (rank (X) = p), the scaling matrix Σ can assume a spatial structure of: where l n is the identity matrix nxn, ϕ 1 is the parameter that determines the nugget effect, ϕ 2 is the parameter that defines the sill or contribution, R = R(ϕ 3 ) is a symmetric nxn matrix, and the reach a is a function of ϕ 3 , ie α = g(ϕ 3 ).
The elements i and j of the matrix Σ are C(s i ,s j ) = C(h ij ) = ϕ 2 r ij , where h ij = s i -s j , whereas the elements r ij are from the matrix R of the form for i ≠ j and r ij = 1, for i = j = 1,…, n.
The joint probability density function of Y is: where Γ(.) is the gamma function. For v > 1, μ = Xβ is the vector of means and for v > 2, is the covariance matrix.
Let θ = (β T Φ T ) T be the vector of unknown parameters and Y the vector of the observed data, where Y ~ t n (Xβ,Σ,v) with fixed v. The likelihood function logarithm of the observed data Y in relation to θ is given by

Algorithm for parameter estimation (EM algorithm)
The parameter vector θ = (β T Φ T ) T is estimated by the EM algorithm, based on the mixture of normal distribution that obtains Student's t-distribution (Liu & Rubin, 1995).
We consider Y c = (Y,Y m ) as the complete data set with a parameterized density f(Y c |θ) by a vector of ndimensional parameters, θ∈Θ⊂ℜ s , with s=p+q (q=3), where, Y and Y m are the observed and non-observed data, respectively. The maximum likelihood estimate (ML) of θ can be obtained based on the complete data of the likelihood function log and the EM algorithm (Dempster et al., 1997). This algorithm consists of two steps: Expectation (Step E) and Maximization (Step M).

Local influence
A perturbation vector ω = (ω 1 ,…,ω n ) T is considered, varying in a region Ω⊂ℜ d , used as additive perturbation of Y ω = Y + ω. Let the likelihood function be L(θ,ω|Y) in the observed data and L c (θ,ω|Y c ) in the complete data for the perturbed model and assuming that there is ω 0 so that L(θ,ω 0 |Y) = L(θ|Y) and L c (θ,ω 0 |Y c ) = L c (|Y c ) for every θ, and also that (ω 0 ) is the ML estimator of θ for L(θ,ω|Y). Cook (1986) considers the displacement of the likelihood function log LD(ω) = 2[L( |Y) -L( (ω)|Y)] to assess the local influence of a small perturbation. Due to the difficulty of its application to complicated models, Zhu & Lee (2001) proposed an option with a shift function for In this case, the normal curve Cf g ,l of α(ω) in ω 0 in the direction of some unit vector l can be used to summarize the local behavior of f Q (ω). The normal curvature Cf Q ,l for α(ω) and ω 0 is defined: Poon & Poon (1999) include the conformal normal curvature Β fQ , l for ω 0 in the direction of some unit vector l, such as: |ω=ω 0, and B fQ , l is a function of one to one of the normal curvature, considering values between [0,1].
Given the matrix B fQ , l and considering where b ii are the elements of the main diagonal of matrix B fQ , l , we find graph C i according to order i, to analyze the existence of influential observations in the spatial dependence structure. Zhu & Lee (2001) stated that for C i , the i th point is influential in the spatial dependence structure if: where b ii is an element of matrix B fQ , l , and .
In the study of spatial data, universal kriging has been used as a measure of prediction, to obtain values for the regionalized variable at non-sampled points. Let Y 0 = Y(s 0 ) be the universal kriging predictor of location s 0 ∈ S. The mean of Y 0 is given by , where = (x 01 ,…,x 0p ) and x 0j = x j (s 0 ) for j = 1,…,p.
The predictor of the least mean square error is given by Thus, we have S(l) = l T (s 0 ,θ), where (s 0 ,θ) is a vector n x1 given by (8) In equation (8)  The direction of maximum local slope for the linear predictor is given by equation (9)   (9) Graph is used to assess the influence on the linear predictor,

MATERIAL AND METHODS
The experimental data were collected in the 2004/ 2005 growing season, through a research that was conducted in a field of commercial grain production (56.68 ha) in Cascavel, in western Paraná (approx. 24.95º S, 53.57º W, 650 m asl). The soil was classified as a clayey Oxisol (specifically Latossolo Vermelho distroférrico), with a long-term crop succession of oat in the winter and soybean in the summer. The local climate is mild, mesothermal, and super humid, Cfa (Köppen classification), with moderate temperatures and well-distributed rainfall. The mean winter temperature is below 16 °C, with possible frost, and summers are hot, with temperatures above 30 ºC.
The soybean variety COODETEC 216 (CD216) was planted in no-tillage in the experimental area. For the study, 47 plots were marked, all of which were georeferenced using a Trimble GeoExplorer 3 GPS receiver (Global Positioning System) and the static method with post-processed differential correction, for the correct location in the spatial system of geographic coordinates Universal Transverse Mercatur (UTM), using metric coordinates.
The Des was determined by the volumetric ring method (Kiehl, 1979), where one repetition was applied per depth per sampling. The SPR in the layers examined was measured with a penetrometer (SC-60, Soil Control; shaft 600 mm, diameter 9.53 mm), equipped with a cone at the tip (base area 129.3 mm², diameter 12.83 mm, corner angle 30°). A mean value of mechanical resistance to penetration was calculated based on four random replications per location and experimental plot.
Thereafter, the spatial analysis of soybean yield was performed, considering SPR and Des as their covariates in a linear spatial model. Then n-variate Student's t-distribution and n-variate normal distribution were taken into account, and the yield spatial dependence structure was determined for both, using the method of maximum likelihood. The linear spatial model parameters of soybean yield as a function of covariates and of covariance structure were estimated by: and where β 1 ,…,β 1 and ϕ 1 ,ϕ 2 , ϕ 3 are the unknown parameters to be estimated.
The cross-validation criterion, Akaike information criterion (Faraco et al., 2008), and the maximum value of the log-likelihood function (LLF) were used for the choice of model space. Local influence analysis was performed to identify influential points. Universal kriging interpolation of the variable under study was the final step along with the creation of thematic maps.

RESULTS AND DISCUSSION
The soybean yield mean was 3.23 t ha -1 , standard deviation SD = 0.38 t ha -1 and the coefficient of variation CV = 11.79 %, identifying data homogeneity. With increasing depth, the covariate of soil penetration resistance (SPR) decreased the values of the mean, first quantile (Q 1 ), median and trird qualite (Q 3 ), and increased akewness and kurtosis. It was observed that for SPR, in all layers, there are sample data above 2.60 MPa and that in RSP 0-10 , 50 % of the data were over 2.64 MPa. According to Canarache (1990), SPR values in the range [1.1 to 2.59] MPa are not very restrictive for roots, but root-limiting in the range [2.6 to 5.0] MPa. It was also observed that the standard deviation of SPR does not vary much in the three layers, remaining within 0.49 and 0.56 MPa.
The three layers have homogeneity (coefficient of variation below 30 %). For soil density (Des), all statistical measures showed increases from layer 0-10 cm to layer 10-20 cm, and reduction in statistical measures from layer 10-20 cm to layer 20-30 cm, which features a compacter layer below the soil surface, that prevents water infiltration. The mean, standard deviation, and density coefficient of variation do not change much in the three layers. As for SPR, homogeneity was observed for Des in the three layers (coefficient of variation below 30 %).
Coefficients of skewness and kurtosis were calculated. In covariates RSP 20-30 , Des 10-20 , and Des 20-30 , the calculated coefficients do not belong to the 95 % skewness and kurtosis coefficient confidence intervals constructed by Jones (1969) that characterize normal distribution of probability.
The box-plot graph of soybean yield (Figure 1a), shows an outlier point, with a value of 2.09 t ha -1 , which is the 13 th value of the data series, located in the experimental area (Figure 1b). Tables 1 and 2 show results of linear spatial model parameter estimates of Equation (10) of soybean yield, setting the following models: exponential (Exp), Gaussian (Gaus), and Matérn with kappa parameter at 0.7. Student's t-distribution was considered for data with degrees of freedom equal to 3 (v = 3) and normal distribution. The Kolmogorov-Smirnov test was used to confirm the hypothesis that the population distribution, from which the given sample was withdrawn, follows a t-distribution probability with v = 3 degrees of freedom. Table 2 shows that estimates of parameters (nugget effect), (contribution) and (function of range) have small variations among the three models, assuming Student's t-distribution with v = 3 or with normal distribution. For both distributions the lowest standard deviation was given by the Gaussian model. The degree of spatial variability classified by Canarache (1990) by using the coefficient of relative nugget effect (RNE) shows moderate spatial dependence of the variables studied. Table 3 shows results of the selection criteria for cross-validation models, Akaike information criterion (AIC) and maximum value of the log-likelihood function (LLF) for the soybean yield variate. Considering these results, it appears that both by the Student's t-distribution with v = 3 and by normal distribution, the best-fitting was the Gaussian model. Local Influence Analysis: Graphs C i in figure 2a,b, considering the limit of the graph defined in Equation (7), shows that by adopting t-distribution with v = 3, elements 13 and 30 are identified as influential for the structure of spatial dependence. However, when adopting normal distribution, elements Table 1

. Parameters estimated by LM assuming Student's t-distribution with v = 3 and normal for soybean yield using theoretical models: exponential (Exp), Gaussian (Gaus), and Matérn with k = 0.7 (Matérn)
Student's t-distribution with v = 3, N: normal distribution; In brackets, the standard deviations of each parameter estimated.

Table 2. Parameters estimated when adopting Student's t-distribution with v = 3 and normal distribution for the yield variable
Student's t-distribution with v = 3, N: normal distribution; In brackets, the standard deviations of each parameter estimated. EPR(%) = ϕ 2 /(ϕ 1 + ϕ 2 ): coefficient of relative nugget effect. 13, 23, and 30 are identified as influential for the structure of spatial dependence, with greater emphasis on element 30.
According to graphs l p in figure 2c,d, which assess the influence on the linear predictor, by adopting tdistribution with v = 3, element 13 continues to be considered influential. However, when normal distribution was adopted, element 31 was identified as influential.
Following with the analysis of graphs C i and l p , it was decided to remove two observations; 13 (2.09 t ha -1 ) and 30 (2.87 t ha -1 ) separately, to identify the influence of these points in the study of spatial variability. To distinguish the new data sets, we considered Prod: total data, Prod (13): data excluding element 13, and Prod (30): data excluding element 30.
Descriptive analysis after the removal of influential values: Table 4 shows that element 13 produced major changes in the descriptive measures besides the removal of element 30. The removal of these elements also altered data skewness and kurtosis, but according to criteria used to assess normality, new data sets still showed normality.
Parameter vector estimate after the removal of influential values: The parameters were estimated for the three models, for the tdistribution with v = 3 and for the normal distribution, for each one of the variables Prod, Prod (13), and Prod (30). Considering these results, it appears again that both by the t-distribution with v = 3 and by normal distribution, the best-fitting model is the Gaussian model, ie, the removal of influential points did not alter the choice of the set model.  Comparing the estimates of the parameter vector (Table 5) obtained from the data sets Prod (13) and Prod (30), with the estimates obtained with complete data (Prod), it appears that both for the t-distribution (v = 3) and for the normal distribution, estimates 1 and 3 decreased, and estimates 2 , 4 and 5 increased. Estimates 6 and 7 have different variations when adopting t-distribution and normal distribution.
Looking at table 6 and considering the tdistribution (v = 3), it appears that changes in the parameters of the structure of spatial variability after the removal of elements 13 (Prod (13)) and 30 (Prod (30)) were the same, showing reduction of the estimates 1 and 2 . The estimate 3 was reduced when obtained with data set Prod (13) and increased when obtained together with the data Prod (30). These changes were more significant for the data set Prod (13) than for Prod (30), when compared to yield (Prod) of the complete data.
However, considering normal distribution, after the removal of element 13 (Prod (13)), there was reduction of 1 and 2 and increase of 3 , and after the removal of element 30 (Prod (30)) 1 there was an increase and the estimates 2 and 3 were reduced. But again these changes were more significant for data set Prod (13) than for Prod (30), when compared to yield (Prod) of the complete data. Therefore, it can be said that the points identified as influential affect parameter estimates.
When the coefficient of relative nugget effect is examined in the three data sets, it is possible to note that there is no change of the spatial dependence, it remained moderate. Figure 3 shows the thematic maps of data sets Prod, Prod (13), and Prod (30), based on universal kriging interpolation, using the covariates for the estimation of parameters .
The maps were constructed using the Gaussian model adopting t-distribution (v = 3) and normal distribution. Table 7 shows the percentage of each class on the maps of the variable. From the maps of figure 3 and table 7, it can be observed in figure 3a,b that there was a reduction of the area with yield in the first class interval (2.7 and 3.1 t ha -1 ) changing from 35.84 to 31.81 % of the area, and also an increase in areas of the other classes, more pronounced in the 3.1 and 3.3 class t ha -1 , from 29.94 to 32.17 %. Comparing figure 3a,c, it is possible to note a reduction of the area with yield in the interval of the 1 st and 4 th classes, and increases of the other classes. Table 5. Parameters estimated when adopting t-distribution with v = 3 and normal for variables Prod, Prod (13), and Prod (30), using Gaussian (Gaus) Student's t-distribution with v = 3, N: normal distribution; In brackets, the standard deviations of each parameter estimated. Table 6. Spatial parameters estimated when adopting t-distribution (v = 3) and normal for variables Prod, Prod (13), and Prod (30), using Gaussian (Gaus) Student's t-distribution with v = 3, N: normal distribution; In brackets, the standard deviations of each parameter estimated. E(%) = ϕ 2 /(ϕ 1 + ϕ 2 ): coefficient of relative nugget effect.
By comparing figure 3d,e, a reduction in the areas corresponding to the 1 st , 2 nd and 4 th classes is observed, and an increase in the 3 rd class. Comparing figure 3d,f there is a reduction in the 1 st and 3 rd , and an increase in the 2 nd and 4 th classes. Thus, one can say that the points identified as influential affect the thematic maps underlying the management of the area with a view to future interventions in the soil treatment. It is therefore important to take all factors into consideration that may alter spatial analysis.

CONCLUSIONS
1. Observation 13 is not only an outlier, but also an influential value. This element (13) had greater influence than element 30 on the estimate of the Table 7. Percentage of class on the maps of soybean yield parameters, as well as on the construction of thematic maps, when t-distributions or normal distribution is adopted.
2. In a comparison of the maps, it was noted that the map generated by t-distribution is less changed with the removal of an influential value (13) than the map generated by normal distribution. Therefore, it is possible to perform spatial analysis from Student's t-distribution with little concern about influential values, which do not influence this distribution.
Technological Development (CNPq) and the Fundação Araucária for financial support.