Retrieval of leaf chlorophyll content in Gannan navel orange based on fusing hyperspectral vegetation indices using machine learning algorithms

ABSTRACT: Estimating leaf chlorophyll content from leaf reflectance spectra is efficient and nondestructive. The literature on optical indices (particularly chlorophyll indices) is extensive, but there is little consensus on which indices are robust for Gannan navel orange. To address this problem, this study investigated the performance of 19 published indices computed from RDS (raw data spectrum), FDS (first derivative data spectrum) and SDS (second derivative data spectrum) for estimating chlorophyll content in navel orange leaves. Single spectral indices and combinations of multiple spectral indices were compared for their accuracy in predicting chlorophyll a content (Chla), chlorophyll b content (Chlb) and total chlorophyll content (Chltot) in navel orange leaves using partial least square regression (PLSR), adaboost regression (AR), random forest regression (RFR), decision tree regression (DTR) and support vector machine regression (SVMR) models. Across the three datasets, RDS was the best modeling dataset, followed by FDS, and SDS was the worst. In modeling with multiple spectral indices, good results were obtained for Chla (NDVI750, NDVI800), Chlb (Datt, DD, Gitelson2) and Chltot (Datt, DD, Gitelson2) with the corresponding index combinations. Overall, AR was the best regression method in terms of prediction performance for both the single spectral index models and the multiple spectral indices models. The multiple spectral indices models outperformed the single spectral index models in predicting Chla and Chltot content using FDS data and SDS data, respectively.


INTRODUCTION
Gannan navel orange is a national geographical indication product of China. It is of high quality, rich in essential nutrients and enjoys the reputation of a famous Chinese fruit. As a major indicator of nutrient status, chlorophyll content is involved in various biochemical and physiological processes that are vital for navel orange production. Real-time and non-destructive assessment of navel orange chlorophyll content is important for evaluating crop productivity and improving the precise management of nitrogen (N) (LI et al., 2016). Traditional measurement of chlorophyll content via wet chemistry methods is precise, but it is costly, time-consuming and destructive to the leaves (LIU et al., 2018). Therefore, the traditional destructive wet chemistry method is undesirable for assessing chlorophyll content in navel oranges. In contrast, hyperspectral remote-sensing technology can be applied to estimate the spatio-temporal variations in the physical and chemical parameters of vegetation, including the chlorophyll content, at a relatively low cost compared to field measurements (MUTANGA et al., 2004; ZHANG et al., 2008; HE, 2013; PU et al., 2014; PENG et al., 2018; HOEPPNER et al., 2020; JI et al., 2020; LI et al., 2020).
Ciência Rural, v.53, n.3, 2023. Lian et al.
Hyperspectral remote-sensing methods for estimating vegetation chlorophyll content rely on observed spectral features via an empirical relationship that links the variables of interest to the sensitive bands, spectral indices, or spectral transform values (MEER, 2001; DORIGO et al., 2007; L. LIANG et al., 2012; LIANG et al., 2013). However, hyperspectral reflectance is always affected by the biophysical characteristics of vegetation, canopy shape and structure, atmospheric absorption and scattering, and soil backgrounds (THENKABAIL et al., 2000; A et al., 2010). Therefore, to capture more effective information from the spectra, derivative spectra were applied to minimize the influences of background interference and spectral noise. Several studies showed that the first derivative reflectance (FDR) was sensitive to crop chlorophyll (CHO & SKIDMORE, 2006). FENG et al. (2008) reported that the chlorophyll content was highly related to the ratio of the red-edge integral areas to the blue-edge integral areas. LIU et al. (2002) also demonstrated that the first derivative spectrum (FDS) at 740-760 nm was highly correlated with chlorophyll content in rice.
In hyperspectral imaging, given the spatial resolution, neighboring intensities are highly correlated, so a spectral signature contains a lot of redundant information. The main issue consists of extracting the discriminative information to reduce the set of relevant bands (LE BRIS et al., 2015). Spectral indices are an important method for extracting information from remotely sensed data because indices reduce, but do not eliminate, the effects of soil, topography, and view angle (RDJ & ARH, 1991; HATFIELD & PRUEGER, 2013; HATFIELD et al., 2015). Likewise, spectral indices are an important method for analyzing imaging spectrometer data (GITELSON et al., 2006). Several studies have successfully estimated chlorophyll content in leaves from their own study datasets using different vegetation indices. Extensive use has been made of visible ratios (BISUN & DATT, 1998), visible/NIR ratios (HABOUDANE et al., 2002; GITELSON et al., 2006), red edge reflectance-ratio indices (SIMS & GAMON, 2002), and spectral and derivative red edge indices (MILLER et al., 1990). Combinations of the red edge and infrared regions have been used to build spectral indices for measuring chlorophyll content (ZARCO-TEJADA et al., 2003; MAIN et al., 2011). In addition, BROGE & LEBLANC (2001) developed the triangular vegetation index (TVI) based on the area of a triangle with vertices at green, red and NIR wavelengths, which is sensitive to both chlorophyll content and LAI. However, most VIs have focused on retrieving chlorophyll content in specific species (MAIN et al., 2011; MALENOVSKY et al., 2013), which limits their stability and general use in other species.
The objectives of this study are: (1) to find indices suitable for navel orange leaves, because an index may not be suitable for all species datasets (LE MAIRE et al., 2008); (2) to explore the capability of indices derived from derivative spectra for retrieving chlorophyll content, since, to the best of our knowledge, derivatives of the indices have not yet been extensively applied in regression analysis; and (3) to establish the relationship between the spectral data and chlorophyll content using five regression algorithms that are commonly used for data analysis in remote sensing, and to compare the performance of these five regression methods.

Data acquisition and processing
The study was performed at a navel orange planting site located in the National Navel Orange Engineering Research Center, Gannan Normal University, GanZhou, China. Sixty leaf samples were collected with a lopper from canopy branches of ten trees on June 23, June 29, July 03 and July 04, 2018, between 9:00 and 11:00 or between 14:00 and 16:00. All leaf samples were within a normal range of maturity and health. A laboratory hyperspectral imaging system (GaiaField-V10E, Sichuan Dualix Spectral Image Technology Co. Ltd, Sichuan, China) was used to acquire hyperspectral images of the leaf samples in reflectance mode. The imaging spectrograph (ImSpectorV10E) covers the spectral range from 400 to 1000 nm with 360 wavebands (wavelength interval 1.67 nm). The average spectral reflectance of three points was used as the spectral value of each navel orange leaf sample.
The dataset of 60 leaf samples was randomly split into two parts: a calibration set of 45 samples and a validation set of 15 samples. First derivatives (FDS) and second derivatives (SDS) of the apparent absorbance spectra were calculated. The chlorophyll a content (Chla), chlorophyll b content (Chlb) and total chlorophyll content (Chltot) of the leaf samples were measured by spectrophotometry. The absorbance at 665 nm (A665) and 649 nm (A649) was subsequently measured with a V-5100 spectrophotometer. Chlorophyll content was calculated according to the reference using correction equations (1), (2) and (3), where V is the volume of the extraction solution (mL) and W is the weight of the leaf sample (g). The results for Chla, Chlb and Chltot were expressed in mg/g.
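As a rough illustration of how absorbance readings are converted to leaf chlorophyll content in mg/g, the sketch below uses a two-wavelength calculation of the kind described above. The correction equations themselves are not reproduced in this text, so the default coefficients are an assumption: they are values commonly quoted for 95% ethanol extracts and may differ from those used in the study. The function name and the example readings are hypothetical, and Chltot is taken as the sum of Chla and Chlb.

```python
def chlorophyll_mg_per_g(a665, a649, volume_ml, weight_g,
                         coef_a=(13.95, 6.88), coef_b=(24.96, 7.32)):
    """Convert absorbance at 665 nm and 649 nm to chlorophyll content in mg/g.

    coef_a and coef_b are ASSUMED coefficients (commonly quoted for 95%
    ethanol extracts); replace them with the values of the protocol in use.
    """
    if weight_g <= 0:
        raise ValueError("weight_g must be positive")
    # Pigment concentration in the extract, in mg/L.
    conc_a = coef_a[0] * a665 - coef_a[1] * a649
    conc_b = coef_b[0] * a649 - coef_b[1] * a665
    # mg/L * mL / (1000 mL/L) gives mg; dividing by the leaf weight gives mg/g.
    scale = volume_ml / (1000.0 * weight_g)
    chl_a = conc_a * scale
    chl_b = conc_b * scale
    return chl_a, chl_b, chl_a + chl_b


# Hypothetical readings: A665 = 0.5, A649 = 0.2, 25 mL of extract, 0.1 g of leaf.
chl_a, chl_b, chl_tot = chlorophyll_mg_per_g(0.5, 0.2, volume_ml=25.0, weight_g=0.1)
print(f"Chla={chl_a:.3f} mg/g, Chlb={chl_b:.3f} mg/g, Chltot={chl_tot:.3f} mg/g")
```

Keeping the coefficients as parameters makes it explicit that they depend on the solvent and on the reference protocol.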

Modeling methods
Some commonly used machine learning algorithms, such as partial least square regression (PLSR), adaboost regression (AR), random forest regression (RFR), decision tree regression (DTR) and support vector machine regression (SVMR), can cope with the strong nonlinearity of the functional dependence between spectral variables and biophysical or biochemical parameters. In this study, PLSR, AR, RFR, DTR and SVMR were applied to learn the relationship between the vegetation indices (VIs) and chlorophyll content by fitting a flexible model directly to the spectral datasets. K-fold cross validation with K = 5 was used to avoid model overfitting, and the grid search algorithm was used to find the optimal hyperparameters for the five algorithms.
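The coefficient of determination (R²) and root mean square error (RMSE) are used as evaluation indicators. Assuming that the relationship follows a normal distribution, R² ranges from 0 to 1. The higher the R² value is, the stronger the predicted relationship is. The lower the RMSE is, the better the performance of the model is. They are calculated as follows:

$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(y_{\alpha i}-\bar{Y}_{\alpha}\right)\left(y_{mi}-\bar{Y}_{m}\right)\right]^2}{\sum_{i=1}^{n}\left(y_{\alpha i}-\bar{Y}_{\alpha}\right)^2\,\sum_{i=1}^{n}\left(y_{mi}-\bar{Y}_{m}\right)^2} \tag{4}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{\alpha i}-y_{mi}\right)^2} \tag{5}$$

where $\bar{Y}_{\alpha}$ is the average predicted value, $\bar{Y}_{m}$ is the average measured value, $y_{\alpha i}$ is the output value, $y_{mi}$ is the measured data, and $n$ is the sample number.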
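For readers who want to reproduce the two evaluation indicators, the sketch below computes R² as the squared correlation between predicted and measured values (which uses both the average predicted and the average measured value, and is bounded between 0 and 1) and RMSE as the root of the mean squared difference. Treating R² as a squared correlation is an assumption based on the symbol definitions given above; the function names and sample values are hypothetical.

```python
import math


def r_squared(predicted, measured):
    """Squared Pearson correlation between predicted and measured values."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cross = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_m = sum((m - mean_m) ** 2 for m in measured)
    if ss_p == 0 or ss_m == 0:
        raise ValueError("R^2 is undefined for constant sequences")
    return cross ** 2 / (ss_p * ss_m)


def rmse(predicted, measured):
    """Root mean square error between predicted and measured values."""
    n = len(predicted)
    if n != len(measured) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)


# Hypothetical chlorophyll values in mg/g.
measured_demo = [1.0, 2.0, 3.0]
predicted_demo = [1.1, 1.9, 3.2]
print(r_squared(predicted_demo, measured_demo), rmse(predicted_demo, measured_demo))
```

Note that a squared correlation ignores systematic bias in the predictions, which is one reason RMSE is reported alongside it.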

Spectral reflectance and leaf chlorophyll content
The apparent reflectance spectra were subjected to first and second derivation, in agreement with SHI et al. (2013), in order to increase signal-to-noise ratios at the different wavelengths. The higher signal-to-noise ratios obtained by derivation appear to highlight the parts of the spectrum where important chemical information is located, and they are expected to subsequently improve the regression models for chlorophyll content determination. The content of Chla measured by spectrophotometry ranged from 0.45 mg/g to 2.73 mg/g, Chlb ranged from 0.08 mg/g to 0.75 mg/g and Chltot ranged from 0.53 mg/g to 3.48 mg/g.
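The derivative spectra (FDS and SDS) can be approximated numerically from the sampled reflectance. The text does not state which differentiation scheme was used, so the sketch below assumes plain finite differences on a uniformly spaced wavelength grid (central differences in the interior, one-sided at both ends); the function names are hypothetical.

```python
def first_derivative(reflectance, step_nm):
    """Finite-difference derivative of a uniformly sampled spectrum.

    Central differences are used for interior bands and one-sided
    differences for the first and last band.
    """
    n = len(reflectance)
    if n < 2:
        raise ValueError("need at least two bands")
    derivative = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        derivative.append((reflectance[hi] - reflectance[lo]) / ((hi - lo) * step_nm))
    return derivative


def second_derivative(reflectance, step_nm):
    """Second derivative obtained by differentiating the first derivative."""
    return first_derivative(first_derivative(reflectance, step_nm), step_nm)


# Hypothetical five-band spectrum sampled every 1.67 nm.
spectrum_demo = [0.10, 0.12, 0.18, 0.30, 0.45]
fds_demo = first_derivative(spectrum_demo, step_nm=1.67)
sds_demo = second_derivative(spectrum_demo, step_nm=1.67)
print(fds_demo)
print(sds_demo)
```

Finite differences can amplify high-frequency noise, so in practice spectra are often smoothed before differentiation; whether smoothing was applied here is not stated.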

Vegetation indices used in this study
A number of hyperspectral indices have been established to estimate chlorophyll content at both the leaf and canopy scales. In this study, using the leaf reflectance data, we calculated 19 published chlorophyll indices (Table 1). The spectral indices group comprised the five superior indices used by LIU et al. (2018), the top five of the 20 spectral indices used by MAIN et al. (2011) and nine indices used by DIAN et al. (2015).
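To make the index calculation concrete, the sketch below evaluates two of the indices named in this paper from a sampled spectrum by picking the band nearest to each nominal wavelength. The formulas of Table 1 are not reproduced in this text, so the forms used here are assumptions taken from how these indices are commonly cited in the literature: Vogelmann as R740/R720 and Datt as (R850 - R710)/(R850 - R680). The wavelength grid (360 evenly spaced bands from 400 to 1000 nm) and the helper names are likewise hypothetical.

```python
def band_index(wavelengths, target_nm):
    """Index of the band whose centre wavelength is closest to target_nm."""
    return min(range(len(wavelengths)), key=lambda i: abs(wavelengths[i] - target_nm))


def reflectance_at(spectrum, wavelengths, target_nm):
    """Reflectance of the band nearest to target_nm."""
    return spectrum[band_index(wavelengths, target_nm)]


def vogelmann(spectrum, wavelengths):
    """Vogelmann index, assumed form R740 / R720."""
    return (reflectance_at(spectrum, wavelengths, 740)
            / reflectance_at(spectrum, wavelengths, 720))


def datt(spectrum, wavelengths):
    """Datt index, assumed form (R850 - R710) / (R850 - R680)."""
    r850 = reflectance_at(spectrum, wavelengths, 850)
    r710 = reflectance_at(spectrum, wavelengths, 710)
    r680 = reflectance_at(spectrum, wavelengths, 680)
    return (r850 - r710) / (r850 - r680)


# Assumed sensor grid: 360 evenly spaced bands covering 400-1000 nm.
GRID = [400.0 + i * 600.0 / 359 for i in range(360)]

# Hypothetical leaf-like spectrum: low red reflectance rising through the red edge.
spectrum_demo = [0.05 if w < 690 else min(0.5, 0.05 + (w - 690) * 0.0075) for w in GRID]
print(vogelmann(spectrum_demo, GRID), datt(spectrum_demo, GRID))
```

The same functions can be applied to a first- or second-derivative spectrum to obtain the FDS- and SDS-based versions of an index.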

Analysis of modeling results
In this experiment, the raw data spectrum (RDS), first derivative data spectrum (FDS) and second derivative data spectrum (SDS) were each used to calculate the indices, and their performance in chlorophyll estimation was then examined. In addition, five regression methods were applied for modeling, with the main purpose of identifying the most suitable regression method and indices for estimating navel orange chlorophyll content. The consistency and robustness of the various VIs in estimating leaf chlorophyll content were assessed in two ways, namely, using a single index and using a combination of multiple spectral indices.

Single spectral index modeling
Table 2 lists the 45 indices that perform best for specific combinations of algorithm, dataset and inversion variable. For example, for the PLSR algorithm, the RDS dataset and the Chla variable, the best-performing index is TCARI, which has serial number 18 in Table 1.
The horizontal line in the table highlights the dataset with the best inversion result for each algorithm. Among the five machine learning algorithms, except for the RFR algorithm, the RDS dataset gave the best regression results, followed by FDS and finally SDS. However, the best estimate of Chlb (R²=0.931 and RMSE=0.045) in navel orange leaves was obtained when modeling with SDS data. Using RDS data as the input vector of the model, AR shows great potential for estimating Chla (R²=0.955 and RMSE=0.139) and Chltot (R²=0.966 and RMSE=0.150). Its estimate of Chlb is slightly worse, but still better than that of the other modeling methods. In addition, the R² values for Chla and Chltot are better than those for Chlb, possibly because the selected indices are not particularly sensitive to Chlb. As can be seen from Table 2, the AR algorithm gives the best regression results, regardless of the dataset or the regression variable.
In order to find the spectral indices suitable for estimating Chla, Chlb and Chltot from RDS data, FDS data and SDS data, we selected the best index according to the frequency of index occurrence: the index that occurs most often is identified as the best index. Across all the regression methods, Vogelmann appeared 10 times for the RDS data and was therefore considered the best spectral index for RDS data. Similarly, when modeling with FDS data and SDS data, DD and Vogelmann3 are the best modeling spectral indices, respectively. For the estimation of Chla, Chlb and Chltot, Vogelmann3, DD and Vogelmann would be recommended, respectively. Lastly, Vogelmann can be selected as an index that performs well across the various modeling methods and modeling datasets.

Multiple spectral index modeling
Combinations of different indices were used to estimate chlorophyll content, as shown in Table 3.
The AR models built on FDS data, RDS data and RDS data provided relatively robust results for predicting Chla (R²=0.966 and RMSE=0.121), Chlb (R²=0.921 and RMSE=0.048) and Chltot (R²=0.969 and RMSE=0.144), respectively, compared with the other models. Across the three datasets, there is little difference between RDS and FDS in regression performance, both of which are relatively good, while SDS is the worst. Consistent with the conclusion for single indices, AR is the best-performing algorithm among the five algorithms. In modeling with multiple spectral indices, good results were obtained for Chla (NDVI750, NDVI800), Chlb (Datt, DD, Gitelson2) and Chltot (Datt, DD, Gitelson2) with the corresponding index combinations.

Comparative analysis of the optimal models
The prediction results for Chla, Chlb and Chltot using both single spectral index and multiple spectral indices models with the RDS data, FDS data and SDS data are shown in Table 2 and Table 3. AR is again the best regression method, judging by the prediction performance of both the single spectral index models and the multiple spectral indices models. In comparison, the results of the multiple spectral indices models are generally better than those of the single spectral index models. In contrast, single spectral index models gave better prediction results for Chlb using the SDS data. In general, modeling with multiple spectral indices is suitable for the estimation of Chla and Chltot, and using a single spectral index is suitable for the retrieval of Chlb.

DISCUSSION
This study investigated the performance of 19 published indices using RDS data, FDS data and SDS data for the estimation of chlorophyll content in navel orange leaves. The aim was to understand which of the myriad of published VIs would be consistent or robust enough when applied to Gannan navel orange leaves. The indices varied greatly in terms of their original focus and intended targets, but they were tested nonetheless and produced interesting results. The dataset to which we applied the indices provides a more than adequate examination of their abilities, owing to the variety of leaf structures, leaf surfaces, moisture contents and chlorophyll contents present. In the navel orange leaves, the Chltot content is the highest, followed by Chla, and Chlb is the lowest. A common observation in the study was that AR appeared regularly at the top of the rankings for each of the scenarios. The phenological state of the leaves, as well as inherent differences between species, results in datasets with variable moisture contents, leaf surfaces, and leaf internal structures. The indices would have had different responses to these moisture and structure variations, which in turn could have influenced their ability to predict chlorophyll content. It can be assumed that the best performing indices probably show a decreased sensitivity to varying leaf structures or moisture contents and can be considered more robust than indices that only did well for crop species.
Results also have similarities to those reported by MAIN et al. (2011), where data from various species were used to test the performance of 70 published chlorophyll indices. Some of the same indices that performed well in the MAIN et al. (2011) study also perform well in this study (e.g. the Vogelmann3 index, the Datt index and the Vogelmann indices). The question of whether a single index or multiple indices should be used to estimate Gannan navel orange chlorophyll content has been answered in this paper. A number of robust and consistent indices for Chla, Chlb and Chltot are proposed, and they could therefore be seen as priority indices to be tested in any follow-up work. For instance, the Vogelmann index (VOGELMANN et al., 1993) consistently performed well in all modeling methods and was the best performer for the low Chlb. Further research is recommended on whether these indices can be applied to Gannan navel orange canopy spectra. Another direction for further study is to combine measured spectra with radiative transfer models (RTMs), so that simulated datasets can make up for the shortcomings of the measured dataset.

Table 2 - The performance of the vegetation indices to predict chlorophyll content (mg/g) according to the single index.
Table 3 - The performance of the vegetation indices to predict chlorophyll content (mg/g) according to the multiple index.

CONCLUSION
Firstly, the regression of Chla, Chlb and Chltot achieved good results, so hyperspectral vegetation indices can be used to estimate the Chla, Chlb and Chltot contents in navel orange leaves. Secondly, among the 19 published spectral indices, suitable indices for navel orange leaves were found for modeling with both a single index and multiple indices. Finally, when RDS data, FDS data and SDS data are used as input vectors of the models, AR is the best of the five modeling methods; it can be used to establish a quantitative monitoring model of chlorophyll content and to help control the nutritional status of navel orange fruit trees in real time.

Table 1 - Vegetation indices used in the study.