Correlation maps to assess soybean yield from EVI data in Paraná State , Brazil

Vegetation indices are widely used to monitor crop development and generally used as input data in models to forecast yield. The first step of this study consisted of using monthly Maximum Value Composites to create correlation maps using Enhanced Vegetation Index (EVI) from Moderate Resolution Imaging Spectroradiometer (MODIS) sensor mounted on Terra satellite and historical yield during the soybean crop cycle in Paraná State, Brazil, from 2000/2001 to 2010/2011. We compared the ability of forecasting crop yield based on correlation maps and crop specific masks. We ran a preliminary regression model to test its ability on yield estimation for four municipalities during the soybean growing season. A regression model was developed for both methodologies to forecast soybean crop yield using leave-one-out cross validation. The Root Mean Squared Error (RMSE) values in the implementation of the model ranged from 0.037 t ha−1 to 0.19 t ha−1 using correlation maps, while for crop specific masks, it varied from 0.21 t ha−1 to 0.35 t ha−1. The model was able to explain 96 % to 98 % of the variance in estimated yield from correlation maps, while it was able to explain only 2 % to 67 % for crop specific mask approach. The results showed that the correlation maps could be used to predict crop yield more effectively than crop specific masks. In addition, this method can provide an indication of soybean yield prior to harvesting.


Introduction
Monitoring agricultural crops during the growing season is important to forecast yield prior to harvesting (González-Sanpedro et al., 2008).Several techniques have been developed to achieve accurate yield estimates, namely the linear regression analysis based on remote sensing data (Wall et al., 2008).This approach is based on estimating photosynthetic capacity from vegetation indices related to yield (Becker-Reshef et al., 2010).
The Moderate Resolution Imaging Spectroradiometer (MODIS) data have great potential to monitor biophysical parameters (Huete et al., 2002) and improve accuracy in crop yield assessment (Ren et al., 2008;Funk and Budde, 2009).The coarse spatial resolution is a limiting factor to the use of MODIS data, which results in mixed pixels that may not be suitable for crop yield models (Shao et al., 2015).
Several approaches have been tried to address this problem.Genovese et al. (2001) applied weighted Normalized Vegetation Index (NDVI), called CORINE-NDVI (CNDVI) to extract indicators for crop yield monitoring in Spain.The authors found that indicators based on CNDVI were more closely related to crop yield than those based on NDVI.Becker-Reshef et al. (2010) used a regression model to estimate wheat yield in Kansas, the United States, based on a percentage map using the pure pixels that allowed reliable yield estimates prior to harvesting.Maselli and Rembold (2001) found that improvement in estimates of yield capacity depends on the crop and the values of vegetation index considered in the area.The authors estimated yield in North African countries by correlating NDVI and yield.They reported that areas with low NDVI values could present high correlation due to the presence of grasses with a similar phenology, such as cereals.Combining both the percentage map and the correlation between NDVI and yield, Kastens et al. (2005) created the "yield-correlation masking" approach to estimate cereal yield in the United States.The authors found that vegetation in a region can integrate the growing conditions, which could be more indicative of crop potential.
Several studies in Brazil have achieved reliable results using the regression analysis to estimate yield (e.g.Gusso et al., 2013;Picoli et al., 2014), however, the crop mask of these methods does not always represent the reality of the area.Therefore, this study investigates the potential of using correlation maps to estimate soybean yield based on the regression analysis and find the suitable period to estimate yield in Brazil.
Soybean crop is cultivated during the summer season in Brazil.For the study region, the planting date occurs approximately between October and December, which corresponds to the period of high rainfall, while the harvesting period occurs from February to April (Ta-  ble 2). Figure 2 shows the average soybean crop cycle for the study region.Acquisition of cloud-free images was almost impossible, therefore, the monthly Maximum Value Composite (MVC) imagery was used.

Correlation Maps
There are many methodologies to establish a relationship between vegetation indices and final yield.The methods are often based on monthly vegetation indices values (Maselli and Rembold, 2001) or on accumulation over determined periods of the crop phenological stage (Tucker et al., 1980;Rasmussen, 1992;Genovese et al., 2001;Kastens et al., 2005;Ren et al., 2008).
To examine variation in correlation with crop development, this study was based on the first approach in which the monthly EVI values were regressed with the historical soybean yield values at the pixel level.The first step examined each month separately, for example, the 11 images from October were arranged and regressed with the annual yield values.This process was repeated for ev-ery month of the growing season.Thus, the correlation maps were built based on the crop cycle of each municipality.

Regression model
The correlation maps were used to build masks with different correlation ranges to assess the ability of each range to estimate yield.Kastens et al. (2005) used a similar approach by applying masks with different sizes to test its capacity to forecast yield.
In this study, the ranges were separated as the correlation increased.Therefore, the first mask corresponds to the range 0-10 % of correlation, the second 11-20 %, and this process was repeated until a correlation of 100 %.
For comparison purposes, we used crop specific masks to estimate yield and compared it with the methodology proposed here.For these crop specific masks, we followed the methodology described by Araújo et al. (2011), where multi-temporal color composites were created in RGB channels.The color composites were based on the soybean crop cycle in which the Red channel corresponds to the vegetative peak, and the Green and Blue channels correspond to the beginning of the crop cycle, thus, only the soybean crop was highlighted in the composite.This process was repeated until the end of the phenological cycle due to differences in the planting dates in the state.A color composite was generated for each 16-day period.
In order to create a soybean specific crop layer, pixels from the composites described above were selected with gray level values above 200 on the Red channel and below 200 on the Green and Blue channels.This resulted in a binary classification of the soybean areas.To generate the final annual soybean classification, the 16-day soybean masks were overlapped for the entire study area.This process was applied to generate a soybean crop mask for each year of study.Figure 3 illustrates the main steps of the study.
An automated process was applied to extract pixels from the time-series data based on the masks.We used the system developed by Esquerdo et al. (2011).This system requires the time-series and the coordinate location of the fields (extracted from the masks), which results in the EVI values corresponding to the masks.Then, these pixels were used as input data to estimate yield.A similar approach was used by (Fernandes et al., 2011).
For each of the correlation masks and approaches of crop specific masks to quantify yield, a linear regression model (equation 1) was developed to calculate the estimated yield: where Y is the estimated soybean yield; EVI is from the monthly MVC composite; a and b are the regression equation parameters.
The residual analysis was applied to verify homoscedasticity, normality and independence of residuals (Breusch and Pagan, 1979;Shapiro and Wilk, 1965).In addition, the absence of autocorrelation in the data was verified.As a final test, we developed regression models using a "leave-one-out" cross validation to validate the estimated yield.All error values were calculated by comparing the observed and estimated yield.The fraction of soybean yield variation, which was explained by the progressive addition of EVI in the linear regression analysis, was quantified by means of the coefficient of determination (R²).The Root Mean Squared Error (RMSE) was used to measure the model performance and the Mean Absolute Error (MAE) was used to measure the model accuracy.Furthermore, the Willmott index of agreement (d) (Willmott, 1981) was used to measure the degree of accuracy between estimated and observed values.

Correlation Maps
Figure 4 shows which period had the strongest correlation between EVI and historical yield.The low correlation for Oct and Nov indicated that EVI during this period did not result in a good correlation with yield data.This period corresponded to the beginning of the crop cycle, thus, pixels consist of a spectral mixture of both plant and soil.
In Cascavel and Toledo, the highest correlation between EVI and yield occured in Dec, Jan and Feb, while in Ponta Grossa, Dec and Jan showed the highest values.This can be explained because this period corresponds to the vegetative peak of the crop cycle and, consequently, high EVI values.Correlation maps for Castro were not able to detect periods with high correlation, since this municipality does not consist of large soybean crops, as observed in the other regions.
Table 3 presents the coefficient of determination (R²) for the linear regression analyses using the ranges from the correlation masks.As the correlation range increases, R² becomes stronger and, thus, the significance level is improved.Analyzing the crop growth, we confirm that the highest values of R² occurred during Dec, with Oct and Nov having the lowest predictive capacity for annual yield estimates.December, Jan and Feb approximately corresponded to the flowering and filling seed phenological stages in Cascavel and Toledo.In Ponta Grossa, Feb presented the highest correlations (R² < 0.5).
Evaluating each range of correlation mask, we found that a good coefficient of determination was reached at 0.60 for the western region.On the other hand, for the eastern region, ranges between 0.50 and 0.40 for Castro and Ponta Grossa, respectively, presented good results and for both regions, R² improved as the range increased achieving the highest values at 1.00.This is linked to soil and climate differences of both regions, where the western region is located in the humid subtropical climate with an annual average rainfall between 1800 mm and 2000 mm, Oxisol soil type and elevation between 600-750 m.The eastern region has temperate climate, annual average rainfall between 1600 mm and 1800 mm, inceptsol soils and elevation between 900-1100 meters (IAPAR, 2013;USDA, 2005).
Maselli and Rembold ( 2001) and Kastens et al. (2005) reported that the correlation map could perform well in regions of sparse crop distribution, such as in the eastern region of our study area, once it can express information on the fractional land cover in a pixel.Our analysis supports the general findings of these previous studies.

Estimated yield: correlation maps versus crop specific masks
The performance of all correlation masks and crop specific mask is shown in Figure 5.The approach of crop specific mask for the four municipalities studied showed low performance, however, it was lower in municipali-  ties in the eastern region (Castro and Ponta Grossa).This may be explained due to regional characteristics, such as irregular terrain and, consequently, smaller cropland areas compared to the western region.Kastens et al. (2005) pointed out that the correlation approach would be better applied in these types of areas since there are many difficulties in building crop specific masks with coarse resolution in areas with low production or where cropland areas are interspersed with non-cropland.
To compare the effectiveness of the proposed method, masks from 91 to 100 % correlation were compared with crop specific masks.The overall model performance is shown in Table 4, which demonstrates a higher precision and accuracy for the model generated for the correlation masks approach in terms of performance of the crop specific mask approach.
The RMSE for all municipalities using correlation masks was below 0.19 t ha −1 , while using the crop specific mask, it varied between 0.21 and 0.35 t ha −1 .These values are in agreement with Kastens et al. (2005) that used a yield-correlation mask to estimate soybean yield in Iowa and Illinois (USA) and obtained an RMSE 0.15 t ha −1 and 0.16 t ha −1 , respectively.In contrast, using the crop specific mask approach, Ren et al. (2008) obtained RMSE = 0.21 t ha −1 , Mkhabela et al. (2011) reported RMSE values below 0.65 t ha −1 to predict cereal grain in Canada.MAE values showed that the correlation masks approach had magnitude error lower (0.14 t ha −1 to 0.02 t ha −1 ) than the crop specific mask (0.3 t ha −1 to 0.18 t ha −1 ).It showed that models using correlation masks were more accurate than the crop specific mask.The index of agreement (d) for the correlation masks approach was higher than 0.9 for all municipalities and for crop specific mask was 0.64, 0.85, 0.01, and 0.58 in Cascavel, Toledo, Castro, and Ponta Grossa, respectively, that is, lower than expected.The identification of pixels well correlated with historical yield is a significant step in the context of forecasting operational yield.This is clearly shown here since the non-significant results of the crop specific mask approach are primarily due to the fact that EVI is representative of all crops in the area (pixel), thus, it is highly influenced by dominant crops, becoming less related to non-dominant crops that can often be the crop of interest (Bolton and Friedl, 2013;Kastens et al., 2005;Maselli and Rembold, 2001).However, this does not occur for the correlation masks approach because the focus is on highly correlated areas (pixels) with yield.This is similar to the results obtained by Huang et al. (2014), who reported an improvement in estimated grain yield in China using areas with a strong relationship between NDVI and crop yield.
A comparison of estimated yield with official data is given in Figure 6, where the coefficient of determination (R 2 ) ranged from 0.96 to 0.98 for the correlation masks approach; however, for the crop specific mask approach, it only ranged between 0.02 and 0.67 (Table 4).The poor linear relationship between the estimated and official yield data is likely due to the EVI being influenced by other targets resulting in spectral mixture issues, especially when using low spatial resolution images.This point was highlighted by Genovese et al. (2001), who showed improved estimates of yield values when the weighted NDVI eliminated noise.With the same ob-jective, Maselli and Rembold (2001) and Kastens et al. (2005) applied the correlation maps in their study areas, since the aim was to use areas of a good relationship with yield data. Kastens et al. (2005) reported that this technique is successfully employed in areas with sparse production.The eastern region in Paraná State has this characteristic, where municipalities have uneven relief, with the correlation masks approach estimating yield values close to the official data.This also helps to explain the significant errors generated by the model for this region using the crop specific mask.

Conclusions
A linear model to predict soybean yield was developed using correlation maps based on spectral data and historical yield.The correlation maps showed an increased performance of the yield model in all municipalities in relation to a more traditional crop specific approach.They provided results similar to the official yield reports.Our results showed that this approach improved the estimated yield, especially in areas with sparse production, as in the case of the municipality of Castro, where validation estimates using correlation maps obtained R² = 0.98 while crop specific masks obtained R² = 0.02.This study also showed the limitations of crop specific masks to estimate yield especially with low spatial resolution images.The correlation maps proved to be more efficient at predicting yield since it is based on the relationship of EVI with crop yield, eliminating factors that could influence the results.

Figure 1 -
Figure 1 -Paraná State and location of the studied municipalities.

Figure 2 -
Figure 2 -Average of eleven years of soybean crop cycle for the four municipalities.

Figure 3 -Figueiredo
Figure 3 -Flowchart of the main steps adopted in the study.EVI = Enhanced Vegetation Index; IDL = Interactive Data Language; Y= yield; LOOCV = Leave-one-out cross validation.

Figure 5 -
Figure 5 -Model performance for the four studied municipalities.RMSE = Root Mean Squared Error; MAE = Mean Absolute Error; d = index of agreement.

462-470, September/October 2016
in relation to other states in Brazil and it ranked first in the southern region(CONAB, 2013).According to the Köppen's climate classification map for Brazil(Alvarez et al., 2013), the climate in Paraná state is Cfa type (i.e., subtropical mesothermal climate) and Cfb type (i.e., temperate mesothermal climate).The average annual temperatures is 19 °C, with the hottest month averaging above 22 °C and the coldest below 18 °C.Agricultural statistics for the soybean growing season were collected from the Department of Agriculture and Supply of Paraná (SEAB) database and the Brazilian Institute of Geography and Statistics (IBGE).Both SEAB and IBGE obtain data through agricultural surveys with growers and cooperatives.These data were used to create a time series of soybean area and total production for the period between 2000/2001 and 2010/2011 growing seasons.Table1shows statistics for the soybean area and yield time series used in this study.

Table 1 − Planted area and historical yield for the study area from 2000/2001 to 2010/2011 crop seasons.
*SD = Standard deviation, CV = Coefficient of Variation, Trend = trend over the studied period Area = municipality area.Source: IBGE.

Table 2 −
Monthly percentage of planting and harvesting soybean crop season in Paraná State*.

Table 3 −
Coefficients of determination (R²) for linear regression analysis using ranges from correlation masks for the four study regions.

Table 4 -
Statistical analysis of model adequacy for the correlation masks (CM) and crop specific mask (CSM) models for each municipality.Models were tested using the "leave-one-year-out" approach.