Comparative analysis of orbital sensors in soybean yield estimation by the random forest algorithm

ABSTRACT Remote sensing has proven to be a promising tool for monitoring crops over large geographic areas. In addition, when combined with machine learning methods, it can be used to estimate crop yield. This study sought to estimate soybean yield through the enhanced vegetation index and normalized difference vegetation index. These vegetation indices were obtained from the moderate-resolution imaging spectro-radiometer (MODIS) sensors on the AQUA and TERRA satellites and the multispectral instrument (MSI) sensor on the Sentinel-2 satellites. The random forest (RF) algorithm was used to predict soybean yield, and the model estimates were compared with the actual yield of the plots. The RF algorithm performed well in estimating soybean yield (R² = 0.60 and RMSE = 0.50 for MSI; R² = 0.63 and RMSE = 0.59 for MODIS). Vegetation indices with imaging dates corresponding to the crop’s maturation had a higher degree of importance in the models' predictive ability. However, when comparing the actual and predicted soybean production values, differences of 145 kg ha⁻¹ and 4 kg ha⁻¹ were found for the MODIS and MSI models, respectively. Therefore, the MSI sensor integrated with machine learning algorithms accurately estimated crop yields.

Index terms: Remote sensing; yield estimation; machine learning.

INTRODUCTION
Brazil is the second largest soybean producer in the world, and part of this success is due to solid investments in technologies allowing this crop to adapt to its soil and climate conditions. The development of highly productive cultivars resistant to tropical climates, advances in plant mineral nutrition technology, and strategies for pest and disease control are among the most significant advances (Empresa Brasileira de Pesquisa Agropecuária - EMBRAPA, 2020).
During the last 10 years, the soybean cultivation area in Brazil increased from approximately 24 to 39 million hectares, with production rising from 75 to 124 million tons and an average yield of 3,526 kg ha⁻¹ (Companhia Nacional de Abastecimento - CONAB, 2023). The Brazilian soybean 2019/2020 harvest was approximately 125 million tons. The largest producers were the states of Mato Grosso, with about 36 million tons, Paraná with 21 million, Goiás with 13 million, and Rio Grande do Sul and Mato Grosso do Sul with 11 million tons each (CONAB, 2020). For the Paraná state, the productivity was 3,792 kg ha⁻¹ (Departamento de Economia Rural - DERAL, 2020).
As soybean productivity increases, production costs also rise. Thus, monitoring crop nutritional status and adopting more sustainable techniques that guarantee the rational use of inputs are becoming more necessary for farmers. Precision agriculture has emerged as a tool allowing the management of the production unit through its spatial and temporal variations using numerous methodologies, including remote sensing (RS).
Through images obtained by sensors embedded in satellites, RS can generate information about the physiological and developmental conditions of crops, even in large areas, in a practical and low-cost manner. This type of technology can be used to predict productivity in regions where crops are being grown (Weiss; Jacob; Duveiller, 2020). Although orbital sensors have limitations concerning resolution, especially regarding spectral and spatial accuracy and cloud cover, they are well positioned on stable platforms compared to airborne sensors, automatically generating images with less distortion (Singh et al., 2020).
The application of RS techniques to crops includes understanding the interaction processes between electromagnetic radiation and the different vegetation physiognomic types (Ponzoni; Shimabukuro; Kuplich, 2012). The spectral response of crops depends on a series of biochemical factors of the targeted plant species, in addition to the physical characteristics of the canopy. These factors are specific to the canopy architecture, plant development stages, agronomic parameters, and atmospheric conditions (Martins; Gallo, 2015). In this context, the enhanced vegetation index (EVI) and the normalized difference vegetation index (NDVI) are widely used (Ba et al., 2022) because they capture the status and trend of crop growth (Shammi; Meng, 2021). NDVI typically relies on the pigment absorption feature in the red and near-infrared regions. EVI relies on the electromagnetic spectrum's red, blue, and near-infrared regions. Compared to NDVI, EVI is less sensitive to different soil compositions (Huete; Justice; Leeuwen, 1999).
The estimation of agricultural production is essential for market planning and the adoption of public policies to combat hunger. RS has been widely used for data analyses of farming systems. However, this requires processing vast amounts of data from different orbital and suborbital platforms. Machine learning (ML) methods have been employed in this complex scenario due to their high capacity to process large amounts of input data and deal with non-linear tasks. Recently, advances in target detection technologies and ML methodologies have provided greater cost-effectiveness and solutions for better estimating the state of crops. This will soon be a routine practice in precision agriculture (Chlingaryan; Sukkarieh; Whelan, 2018).
In this context, several agricultural studies have integrated data from orbital sensors and ML algorithms. Stepanov et al. (2020) used data from the moderate-resolution imaging spectro-radiometer (MODIS) sensor to monitor soybean crop yields in the Far East of Russia. In turn, Habibi et al. (2021) analyzed the spatial variability of soybean plant density with images from the commercial PlanetScope sensor. Xin et al. (2013) developed models to estimate corn and soybean production efficiency with MODIS sensor data. Li et al. (2022) used data from the MODIS sensor associated with environmental variables to estimate wheat yields in northwest China.
Among the different ML methods, the decision tree-based random forest (RF) method has been widely used in different research areas (Minnoor; Baths, 2023), with good performance in estimating crop productivity (Alabi et al., 2022; Khanal et al., 2018). Furthermore, RF can identify the relative importance of each predictor for the response variable. This study used the RF regression algorithm to compare the performances of MODIS and MSI orbital sensors in estimating soybean yield through EVI and NDVI vegetation indices.

MATERIAL AND METHODS
This study was conducted in 16 agricultural plots, with areas between 12 and 150 ha, during the 2020/2021 harvest. The plots are located in the Pato Branco Regional Nucleus of the Paraná State Department of Agriculture and Supply (SEAB, Brazilian acronym), in municipalities in the southwest of Paraná State (Figure 1), in the southern region of Brazil.
The Pato Branco Regional Nucleus had a 12.9% increase in soybean cultivation area and a 33.9% increase in soybean grain production, totaling 1,292,682 tons. These values placed this regional nucleus fifth in production in Paraná State in the 2019/2020 harvest (DERAL, 2020).
The region's climate is predominantly Cfa and Cfb according to the Köppen classification, with a historical average annual precipitation (from 1977 to 2006) ranging from 1800 to 2000 mm (Agência Nacional de Águas - ANA, 2013). The soils comprise Latosol, Nitisol, Chernozem, and, in areas with greater slope, Neosol (EMBRAPA, 2006).
To obtain actual soybean production data for the 2020/2021 harvest, as well as the spatial variability within each plot, a John Deere GreenStar™ 3 (2630) harvest monitor was used, which recorded production data in cells of approximately 1.5 × 8.5 m (Figure 2).
To estimate soybean production, the predictor variables NDVI (Equation 1) and EVI (Equation 2), corresponding to the period from sowing to harvest (October to March), were generated using RS. Cloud-free visible and infrared images from the MSI sensor of the SENTINEL-2A and 2B (L2A) satellites of the European Space Agency (ESA) were obtained, with a spatial resolution of 10 m and a temporal resolution of five days, adjusted to the top of the atmosphere, totaling 16 variables.
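As an illustration of how the two indices are derived from band reflectances, the sketch below uses the standard definitions NDVI = (NIR - Red) / (NIR + Red) and EVI = G (NIR - Red) / (NIR + C1 Red - C2 Blue + L). The coefficient values (G = 2.5, C1 = 6, C2 = 7.5, L = 1) are the ones commonly used for MODIS and are an assumption here, as are the function names and the sample reflectances; the study itself computed the indices in QGIS.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def evi(nir: float, red: float, blue: float,
        gain: float = 2.5, c1: float = 6.0, c2: float = 7.5,
        soil: float = 1.0) -> float:
    """Enhanced vegetation index.

    gain, c1, c2 and soil are the commonly used MODIS coefficients
    (an assumption; the study does not list the values it used).
    """
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + soil)


# Hypothetical reflectances for a dense green canopy (not data from the study).
nir, red, blue = 0.45, 0.05, 0.03
print("NDVI:", round(ndvi(nir, red), 3))
print("EVI: ", round(evi(nir, red, blue), 3))
```

The blue band and the C1 and C2 terms are what distinguish EVI from NDVI: they compensate for the aerosol effect of the atmosphere, while the additive term in the denominator adjusts for the canopy background.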
Images from the MODIS sensor were obtained from the portal of the Brazilian Agricultural Research Corporation (EMBRAPA, Brazilian acronym), which distributes cloud-free images from the National Aeronautics and Space Administration (NASA). The MODIS image collection had 38 variables with a spatial resolution of 250 m and a temporal resolution of two days. The preparation of the indices and the data extraction were performed using QGIS 3.10 software.
The knowledge-discovery in databases (KDD) method is used to process a large dataset through the pre-processing, mining, and post-processing of data (Goldschmidt; Passos; Bezerra, 2015). In the pre-processing stage, the granularity of the soybean productivity data recorded in the plots was adjusted to the spatial resolutions of the MODIS and MSI images: the average productivity was calculated for each pixel of 250 and 10 m, respectively. After the adjustment, the average values of each sensor's EVI and NDVI vegetation indices were extracted using the Zonal Statistics tools of the QGIS 3.10 software.
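The granularity adjustment described above can be sketched as a simple grid aggregation: each harvest-monitor record is assigned to the pixel that contains it, and the records in each pixel are averaged. This is only a minimal illustration, assuming projected coordinates in metres and a pixel grid whose origin is at (0, 0); the function name and the sample records are hypothetical, and the study performed this step in QGIS.

```python
from collections import defaultdict
from statistics import mean


def mean_yield_per_pixel(points, pixel_size):
    """Average the yields of all points falling in each square pixel.

    points     -- iterable of (x, y, yield) tuples, x and y in metres
    pixel_size -- pixel side in metres (10 for MSI, 250 for MODIS)
    Returns a dict {(column, row): mean yield}.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        key = (int(x // pixel_size), int(y // pixel_size))
        cells[key].append(value)
    return {key: mean(values) for key, values in cells.items()}


# Hypothetical harvest-monitor records: (easting m, northing m, yield t/ha).
points = [(2.0, 3.0, 3.6), (8.0, 4.0, 4.0), (12.0, 3.0, 3.0), (260.0, 5.0, 5.0)]

msi_grid = mean_yield_per_pixel(points, 10)     # 10 m pixels, 100 m² each
modis_grid = mean_yield_per_pixel(points, 250)  # 250 m pixels, 6.25 ha each
print(len(msi_grid), "MSI pixels;", len(modis_grid), "MODIS pixels")
```

The same records collapse into far fewer MODIS pixels than MSI pixels, which is consistent with the very different numbers of instances in the two data sets.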
Processing and extracting the soybean vegetation indices in the selected plots made it possible to create two data sets for the 2020/2021 harvest: one for MSI, with 17 attributes and 74,570 instances, and another for the MODIS sensor, with 39 attributes and 311 instances. Subsequently, outliers and extreme values were identified and excluded.
The data mining step was carried out in RStudio (R Core Team, 2021), where each data set was first divided into 70% for training and 30% for model validation. RF regression was performed using the randomForest package, with production as the response variable and the vegetation indices as predictor variables.
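The 70/30 division can be illustrated with a short, reproducible random split. The sketch is written in Python rather than R, and the function name, the seed, and the use of integer stand-ins for the instances are assumptions made for illustration only.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for a data set of 311 instances (the size reported for MODIS).
rows = list(range(311))
train, test = train_test_split(rows)
print(len(train), "training instances;", len(test), "validation instances")
```

Fixing the seed keeps the partition identical between runs, so that differences between models are not caused by different validation samples.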
An ntree of 100 was set for the RF regression, and the predictor-variable settings were left at their defaults. Decision trees are represented as a set of rules that start at the tree's root and lead to one of its leaves. The final product of these decisions is a directed acyclic graph in which each node is either a leaf node corresponding to a class or a decision node containing a test on some attribute (Monard; Baranauskas, 2003).
The RF regression algorithm builds a multitude of decision trees during training and outputs the average prediction of the individual trees. According to James et al. (2013), each time a split in a tree is considered, a random sample of n predictors is chosen as split candidates from the complete set of predictors (Figure 3). From the analyses carried out by RF, it was possible to evaluate the performance of the models using data from the MODIS and MSI sensors, generate descriptive statistics, and consider which were the five most important variables for estimating soybean yield.
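The mechanism described above (bootstrap samples, a random subset of predictors at each split, and averaging over trees) can be sketched in a few dozen lines. This is a simplified teaching version, not the randomForest implementation used in the study: the depth limit, minimum node size, `n_try` value, and synthetic data are all assumptions. It also accumulates, for each predictor, the decrease in node impurity produced by its splits, averaged over the trees, which is the idea behind a purity-based importance ranking.

```python
import random


def sse(values):
    """Sum of squared errors around the mean: the node impurity for regression."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y, features):
    """Return (impurity decrease, feature, threshold) of the best split, or None."""
    parent, best = sse(y), None
    for f in features:
        for t in sorted({row[f] for row in X})[:-1]:  # keeps both sides non-empty
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def grow_tree(X, y, n_try, rng, purity, depth=0, max_depth=3, min_size=5):
    """Grow a regression tree; each split considers `n_try` random predictors."""
    leaf = sum(y) / len(y)
    if depth >= max_depth or len(y) < min_size:
        return leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), n_try))
    if split is None or split[0] <= 0:
        return leaf
    gain, f, t = split
    purity[f] += gain  # decrease in node impurity credited to predictor f
    children = []
    for keep_left in (True, False):
        idx = [i for i, row in enumerate(X) if (row[f] <= t) == keep_left]
        children.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                  n_try, rng, purity, depth + 1, max_depth, min_size))
    return (f, t, children[0], children[1])


def predict_tree(node, row):
    """Follow the decision rules from the root down to a leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=100, n_try=1, seed=0):
    """Fit trees on bootstrap samples; return the trees and mean purity gains."""
    rng = random.Random(seed)
    purity = [0.0] * len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        trees.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                               n_try, rng, purity))
    return trees, [p / n_trees for p in purity]


def predict_forest(trees, row):
    """Average the predictions of the individual trees."""
    return sum(predict_tree(tree, row) for tree in trees) / len(trees)


# Synthetic data (not from the study): yield follows index 0, index 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(40)]
y = [2.0 + 3.0 * row[0] + data_rng.gauss(0.0, 0.1) for row in X]

trees, importance = fit_forest(X, y, n_trees=100, n_try=1)
print("mean impurity decrease per predictor:", [round(v, 2) for v in importance])
print("predicted yield:", round(predict_forest(trees, [0.9, 0.5]), 2))
```

In this toy example the predictor that actually drives the response accumulates a much larger impurity decrease than the noise predictor, which is how such a ranking separates informative imaging dates from uninformative ones.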

RESULTS AND DISCUSSION
On the test data, the RF regression models had a coefficient of determination (R²) of 0.63 and 0.60 for vegetation indices generated from MODIS (Figure 4c) and MSI (Figure 4d) data, respectively. Root mean squared error (RMSE) values showed that the MSI and MODIS sensor data models were similar.
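For reference, the two performance measures can be computed as in the sketch below. R² is taken here as one minus the ratio of the residual to the total sum of squares; whether the study used this definition or the squared correlation between actual and predicted values is not stated, so this is an assumption, as are the function names and the sample values.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical yields in tons per hectare (not data from the study).
actual = [3.0, 3.5, 4.0, 4.5]
predicted = [3.2, 3.4, 4.1, 4.3]
print("RMSE:", round(rmse(actual, predicted), 3))
print("R²:  ", round(r_squared(actual, predicted), 3))
```

RMSE is expressed in the units of the response variable, so it is only comparable between models fitted to the same yield variable.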
These results were similar to those of other studies using vegetation indices as predictive variables to estimate crop productivity. For example, the estimation models from Johnson (2014) showed R² = 0.71 for soybean and R² = 0.77 for corn. Liu et al. (2020) applied the MODIS NDVI to estimate barley, rapeseed, and wheat yields in humid regions and obtained R² values between 0.53 and 0.70. Khanal et al. (2018) used a high-resolution multispectral image of bare land and topographic terrain to predict soil properties and corn yield using machine learning algorithms and obtained R² = 0.53 with the RF algorithm. Pantazi et al. (2016) estimated wheat production with counter-propagation artificial neural networks with an average overall accuracy of 78.3%. In the work of Li et al. (2022), vegetation indices from MODIS images were used to estimate wheat yields in Northwest China, with R² = 0.74 and RMSE of 0.758.
It is worth noting the recent use of unmanned aerial vehicles in monitoring crops, which allows greater autonomy when obtaining images, especially with their high spatial resolution. The study carried out by Alabi et al. (2022), for instance, monitored a soybean field in Nigeria with images of 12 cm spatial resolution and, with algorithms such as RF and the Cubist model, obtained an R² of 0.89, higher than the results presented in this research.
Regarding the average of actual and estimated productivity of the fields studied (Table 1), a difference of only 4 kg ha⁻¹ was found for the MSI data and 145 kg ha⁻¹ for the MODIS sensor. These results showed that the MSI sensor was more sensitive in generating the vegetation indices and consequently gave a better predictive performance by ML.
The RF regression algorithm performed well in estimating soybean yields in the 16 fields studied (Table 1 and Figure 3). Jeong et al. (2016) demonstrated the potential of the RF regression algorithm for estimating global and regional crop yields. RF can model complex cropping systems such as wheat, corn, and potato, and serves as an alternative statistical modeling method for crop yield prediction.
The regression models generated from MODIS and MSI images showed differences in predicting the average productivity of the plots measured in tons per hectare. However, they presented similar R² and RMSE. The differences in sensor resolution could explain this: despite their lower temporal resolution, the MSI images have a spatial resolution of 10 m, which allowed production to be estimated for each 100 m². In contrast, the MODIS sensor has a 250 m pixel, allowing production to be estimated only for every 6.25 ha.
Among these analyses, it was also possible to evaluate which five variables contributed most to generating the productivity estimation models. The importance of the input variables can be evaluated through the impurity measures implemented in the RF algorithm. These are extracted from the regression trees by calculating IncNodePurity, corresponding to the total decrease in node impurities from splitting on each predictor. An increase in the IncNodePurity value implies a reduction in the mean squared error, which means that the highest values represent the essential variables for the response (Habibi et al., 2021). Figure 5 shows the relative importance of the variables in the models used in this study.
Both EVI and NDVI vegetation indices, despite having differences in their composition and sensitivity to different soil compositions, as highlighted by Huete, Justice, and Leeuwen (1999), had the same importance in predicting soybean yield with the models generated for both MODIS and MSI sensors.
The most prominent predictor variables were those corresponding to January and February, equivalent to the physiological maturation stages of the soybean crop called R7 and R8. In these stages, the predominant characteristic is the appearance of the first normal pod with mature color on the main stem, progressing to 95% yellowing of the pods (Neumaier et al., 2020).
In contrast, the study carried out by Shammi and Meng (2021) in the Mississippi Delta with the MODIS EVI and NDVI indices was based on growth metrics from the beginning to the complete development of the pods (R3, R5, and R6), and predicted soybean crop productivity best, with 95% accuracy. However, the response variable inserted in the model was at the municipality level obtained through agricultural statistics, unlike the present study, which uses plot-level production data together with vegetation indices from orbital sensors.

Figure 1: Location of study areas.

C1 and C2 (Equation 2) = adjustment coefficients for the aerosol effect of the atmosphere.

Figure 2: General scheme for obtaining soybean harvest data in plots. Adapted from John Deere (2022).

Figure 4: Comparison of actual and predicted soybean yield dispersion in the field by using RF regression in testing data obtained from MODIS (a, c) and MSI (b, d) sensors.

Figure 5: Ranking of the most important predictor variables for the MODIS (a) and MSI (b) sensor models.

Table 1: Descriptive statistics of actual and estimated soybean yield (tons ha⁻¹) of the plots by RF.