Prediction of quality parameters of food residues using NIR spectroscopy and PLS models based on proximate analysis

magalerambo@uft.edu.br

Abstract
Real-time prediction has become essential in biorefinery industries. Partial least squares (PLS) regression models were developed to predict the moisture, ash, volatile matter, fixed carbon and organic matter contents of coconut and coffee residues. In this study, 49 samples were collected and analyzed by near infrared spectroscopy. For external validation, 25% of the sample set was used. Moisture and volatile matter were predicted with coefficients of determination (R²cal) above 0.90 and standard errors of the estimate (RSD) of 14.4% and 2.26%, respectively. The ash and organic matter models showed R²cal > 0.77 and RSD values < 20.4%. In the external validation, the low deviations show the close agreement between reference and predicted values, with good prediction (R²pred > 0.70). All calibration models were acceptable for sample screening. This study demonstrates that PLS can be used to predict biomass composition of different species, with very low costs and


Introduction
The proximate composition of biomass is needed to analyze the overall process of any thermochemical conversion. It is also an important property for use in the food industry and for heat and power generation (Hosseinpour et al., 2017). From the proximate analysis, a large number of fuel and bioproduct parameters can be determined. In this way, the proposed models would assist in the optimal use of the feedstock based on these biomass properties (Feng et al., 2015).
Near infrared spectroscopy (NIRS) is a technique that measures the vibrations of functional groups, relating chemical composition to spectral data and thereby allowing the development of multivariate calibration models (Rambo et al., 2016; Xu et al., 2013). NIRS has been shown to be fast and precise in predicting the proximate analysis of biomasses using multivariate models (Qi et al., 2016; Rambo et al., 2015a). Such models have frequently been used in combination with different pretreatment methods to handle the complex spectroscopic data.
Recently, relationships have been developed using proximate analyses and chemometrics (NIRS coupled to multivariate methods). Hosseinpour et al. (2017, 2018) developed prediction methods to estimate the higher heating value (HHV) using iterative neural network-adapted partial least squares (INNPLS) and iterative network-based fuzzy partial least squares coupled with principal component analysis (PCA-INFPLS), respectively. Good correlations were found (R² > 0.86). Uzun et al. (2017) developed a prediction method using artificial neural network models, taking the data from proximate and ultimate analyses. The model presented a considerably higher correlation coefficient (0.96) and a low root mean square error (0.375). Qi et al. (2016), using proximate analysis of sawdust biomass with NIR spectroscopy and locally weighted partial least squares, obtained good prediction results (> 0.80).
This study aims to create correlation models for predicting the contents of moisture, ash, fixed carbon (FC), organic matter (OM) and volatile matter (VM) in coconut and coffee biomasses, thereby facilitating the application of screening analysis in biorefinery industries.

Coconut and coffee samples
Thirty-three coconut samples were collected in Brazil: 21 came from the Southeast region and the other 12 came from the North region. In addition, 16 coffee husk samples (previously collected) were used. A sufficient amount of each biomass was collected, dried at 105 °C, ground using a Romer micromill (RomerLabs, São Paulo, Brazil) and sieved for 20 min, and the 48 mesh fractions were used.

Proximate analysis
The proximate analysis was performed according to the American Society for Testing and Materials (ASTM) standards.
ASTM D 3173-87 (American Society for Testing and Materials, 2003) was used to determine the moisture content: the sample was heated in an oven (SP 100, SP Labor) at 105 ± 5 °C for 12 hours, or until constant mass. The volatile matter (VM) was determined by ASTM D 3175-07 (American Society for Testing and Materials, 2007) using 1 g of previously dried sample heated in a muffle furnace (1200DRP7, SP Labor) at 800 ± 10 °C for 8 minutes. The sample was then removed, cooled in a desiccator for 60 minutes and weighed, and its VM content was calculated.
ASTM D 3174-04 (American Society for Testing and Materials, 2004) was used to determine the ash content, which involved removing the organic constituents at high temperature in a furnace (1200DRP7, SP Labor) for 4 hours at 600 ± 10 °C.
The values of fixed carbon (FC) and organic matter (OM) were obtained indirectly by using the following equations:

Near infrared spectroscopy
The near-infrared spectra were acquired in triplicate (FOSS instrument, Hillerød, Denmark) between 1100 and 2500 nm using a diffuse reflectance detector, with 1 nm increments and averaging 32 successive scans.

Multivariate data analysis
The UNSCRAMBLER 10.3 software (Camo Software, Oslo, Norway) was used for the multivariate data analyses.
PLS-1 (Martens & Naes, 1996) was used to quantify moisture, ash, FC, VM and OM. The data were pre-treated by mean-centering, and the leave-one-out cross-validation method was used to determine the number of latent variables (LV) in the models.
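The leave-one-out selection of latent variables can be sketched as follows. This is a minimal pure-Python illustration of PLS-1 (NIPALS) with mean-centering, not the Unscrambler implementation used in the study; the function names (`pls1_fit`, `pls1_predict`, `loo_rmsecv`) and the demo data are hypothetical.

```python
import math

def pls1_fit(X, y, n_lv):
    """Fit PLS-1 by NIPALS on mean-centred data (illustrative sketch)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    comps = []
    for _ in range(n_lv):
        # Weight vector: direction in X most covariant with the y residual
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to model
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict one sample by repeating the deflation steps on it."""
    x_mean, y_mean, comps = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * p[j] for j in range(len(e))]
    return y_hat

def loo_rmsecv(X, y, max_lv):
    """RMSECV for 1..max_lv latent variables, leaving one sample out at a time."""
    results = []
    for n_lv in range(1, max_lv + 1):
        sq_err = 0.0
        for i in range(len(X)):
            model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_lv)
            sq_err += (pls1_predict(model, X[i]) - y[i]) ** 2
        results.append(math.sqrt(sq_err / len(X)))
    return results

# Hypothetical demo data: two predictors, response exactly linear in them
X_demo = [[1, 0], [0, 1], [2, 1], [1, 3], [3, 2], [4, 0]]
y_demo = [2 * a - b + 5 for a, b in X_demo]
print(loo_rmsecv(X_demo, y_demo, 2))
```

In practice the number of LV would be taken where RMSECV stops decreasing appreciably, which keeps the model from using more components than the data support.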
Several transformations were applied to choose the pre-treatment with the best results. Initially the raw data were tested, followed by the second derivative (D2), and then by D2 combined with detrend (DT) and D2 combined with standard normal variate (SNV) (Wise et al., 2006). The size of the optimal window to be used in the Savitzky-Golay algorithm (Enke & Nieman, 1976; Savitzky & Golay, 1964) was also evaluated. The models were assessed with the following statistical parameters: the coefficients of determination (R²cal and R²val); the root mean square error of calibration (RMSEC); the root mean square error of cross-validation (RMSECV); the root mean square error of prediction (RMSEP); the bias; the relative standard deviation (RSD); the relative error of the calibration and prediction sets (RE%); the range error ratio (RER); the number of LV; and the outliers excluded (using Studentized residuals versus leverage). The equations for calculating all the figures of merit used in this work follow Rambo et al. (2013).
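Two of the transformations named above can be sketched in pure Python. The weights below are the standard Savitzky-Golay convolution coefficients for a second derivative with a second-order polynomial and a 7-point window at unit spacing; SNV is the usual per-spectrum centering and scaling. The function names are hypothetical, and the order in which D2 and SNV are chained in the usage lines is an assumption, since the text does not state it.

```python
import math

# Savitzky-Golay weights: second derivative, 2nd-order polynomial,
# 7-point window, unit spacing between points.
SG7_D2 = [5 / 42, 0.0, -3 / 42, -4 / 42, -3 / 42, 0.0, 5 / 42]

def sg_second_derivative(spectrum):
    """Second derivative by 7-point Savitzky-Golay convolution.

    The three points at each edge are dropped, so the result is
    six points shorter than the input.
    """
    half = len(SG7_D2) // 2
    return [
        sum(c * spectrum[i + k - half] for k, c in enumerate(SG7_D2))
        for i in range(half, len(spectrum) - half)
    ]

def snv(spectrum):
    """Standard normal variate: centre and scale a single spectrum."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]

# Hypothetical absorbance values, for illustration only
raw = [0.31, 0.33, 0.36, 0.42, 0.51, 0.58, 0.60, 0.57, 0.49, 0.41, 0.36, 0.34]
d2 = sg_second_derivative(raw)
d2_snv = snv(d2)  # assumed order: derivative first, then SNV
print(len(raw), len(d2), len(d2_snv))
```

A wider window smooths more but also flattens narrow bands, which is why the window size is treated as something to be evaluated rather than fixed in advance.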
To ensure good prediction of new samples, the data set was randomly split into a calibration set (75% of the samples) and an external validation set (25% of the samples).
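A random 75/25 split of this kind can be sketched as below. The function name, the fixed seed and the sample identifiers are hypothetical; the text does not describe how the random selection was actually carried out.

```python
import random

def split_calibration_validation(sample_ids, validation_fraction=0.25, seed=0):
    """Randomly split sample IDs into (calibration, validation) lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_val = round(len(ids) * validation_fraction)
    return ids[n_val:], ids[:n_val]

# Hypothetical sample identifiers
ids = [f"S{i:02d}" for i in range(1, 21)]
calibration, validation = split_calibration_validation(ids)
print(len(calibration), len(validation))
```

Fixing the seed makes the split repeatable, so the same samples stay in the external set when the models are rebuilt with a different pre-treatment.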
According to the International Organization for Standardization (1994), the residual distribution and the linearity were analyzed by graphical evaluation.
According to ASTM E1655-00 (American Society for Testing and Materials, 2005), the bias and the R² values were investigated by a t-test and an F-test, respectively.
The regression coefficient plots were interpreted for each parameter to ensure that the correlations were real and not due to chance.

Reference analysis
Figure 1 shows the raw NIR spectra of the coconut and coffee samples. The main absorption bands are located at 1450/1470 nm, 1920 nm and 2090 nm, associated respectively with H-bonds of the OH groups, the O-H stretch of water and the O-H combination of polysaccharides.
Less intense bands appear at 1170/1270 nm and 2274 nm, associated respectively with lignin and polysaccharide components, in the C-H stretch 2nd overtone and the O-H stretch combination bands (Shenk et al., 2008).
The results of the descriptive analysis are listed in Table 1. The highest variability was observed for FC and ash, both for coconut samples, with high coefficients of variation (37.1 and 35.0, respectively). Parameters with low variation (VM and OM) were easier to model, because they present a small range for fitting the models. Although the external set contained only 7 samples, good variation was observed for all parameters, which is very important for future predictions with the calibration models.
The mean values found for the coconut and coffee samples are consistent with those reported in the literature for these constituents (Balasundram et al., 2017; Oliveira et al., 2018a). The composition of raw coconut and coffee samples usually varies because it depends on factors such as cultivation conditions, crop variety and processing method, which explains the differences in the composition reported by other authors (Araújo et al., 2017; Oliveira et al., 2018b).

Pre-treatments
The best pre-treatment on the spectra was selected based on the regression models that provided the lowest RE (%) and the highest RER values (Figure 2).
For the moisture parameter, the best choice was the second derivative from the Savitzky-Golay algorithm (second-order polynomial) with a window of 7 points (SG7), combined with detrend or SNV. For the ash and volatile matter models, D2 (SG7) provided the best results. For the organic matter constituent, the raw spectra presented the highest RER value. Caliari et al. (2017), when comparing different spectral treatments, found that the first derivative with SG5 was the best choice for sugarcane biomass when building regression models to estimate cellulose crystallinity. Xie et al. (2018) found multiplicative scatter correction (MSC) + D2 to be the best pre-treatment when modeling the ash, FC and VM of biochar from very different feedstocks.

Partial least square regression
To evaluate the prediction performance of the PLS models, the predicted physico-chemical constituents of the coconut and coffee biomasses were plotted against the reference values for the whole dataset (Figure 3). The agreement between these values shows that the models fit well and are not overfitted (they were built with a low number of factors). A forced fit uses more PLS components than the model needs: R² is then high, but the model fails to predict new samples (Strandberg et al., 2017).
The samples are linearly distributed around the diagonal line in the plot of reference versus PLS-predicted values (Figure 3). The PLS residuals are randomly distributed, as shown in Figure 3 (B, D), indicating the absence of systematic trends in the models and a normal distribution with satisfactory linearity.
From the results summarized in Table 2, the models show good coefficients of determination for calibration (R²cal), cross-validation (R²cv) and prediction (R²pred) for ash (R²cal 0.83, R²cv 0.72 and R²pred 0.79), and reasonable values for moisture and OM. Low (< 2.0%) and closely matching RMSEC and RMSECV values were observed, except for OM, where a significant difference was observed. Also, the OM model was not able to predict (R²pred < 0.70). FC and VM were the parameters that presented poor results (R²cal,val < 0.70), independently of the pre-treatment applied. These poor values can be attributed to the indirect way in which these parameters are obtained.
All the models were built with a maximum of 7 LV, and no more than 8.3% of the samples were removed as outliers. Gómez et al. (2018) developed predictive models for ash, VM and FC in coal samples using PLS and Fourier transform infrared spectroscopy (FTIR). Their validation required 6 and 8 LV for ash and VM, respectively, and RSD values of 20.89% and 1.95% were found. In the present work, similar values of RSD and LV were obtained.
Still, the models obtained in this study, when compared with literature data (Xue et al., 2015), required fewer LV and gave similar RSD and better R²cal and R²cv (except for the ash model, for which, however, they used 11 LV). Strandberg et al. (2017), using multivariate analysis to predict the fuel properties of biomass, found a low RMSEP (0.54) and a high R² (> 0.9) for VM. The RSD was high only for the ash model (26.3%). Satisfactory RSD values were found for the other models (≤ 14%), as well as good RER values, above 5.0, indicating that the calibration is acceptable for sample screening. The RER value is considered a better parameter for testing the quality of the fit, since outliers are no longer present in the models and so the concentration range of the constituent is well represented (Hayes, 2011).
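The figures of merit discussed above can be illustrated with a short sketch. The definitions below are commonly used forms and are an assumption here: the study takes its equations from Rambo et al. (2013), which may differ in detail, particularly for RSD. The function names and numbers are hypothetical.

```python
import math

def rmse(ref, pred):
    """Root mean square error between reference and predicted values."""
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(ref, pred)) / len(ref))

def bias(ref, pred):
    """Mean signed difference (predicted minus reference)."""
    return sum(p - r for r, p in zip(ref, pred)) / len(ref)

def r_squared(ref, pred):
    """Coefficient of determination of predictions against the reference."""
    mean_ref = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for r, p in zip(ref, pred))
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot

def rer(ref, pred):
    """Range error ratio: span of the reference values divided by the RMSE."""
    return (max(ref) - min(ref)) / rmse(ref, pred)

def rsd_percent(ref, pred):
    """Assumed form of RSD: RMSE as a percentage of the mean reference value."""
    return 100.0 * rmse(ref, pred) / (sum(ref) / len(ref))

# Hypothetical reference and predicted contents (%), for illustration only
ref = [10.0, 12.0, 14.0, 16.0, 18.0]
pred = [11.0, 11.0, 15.0, 15.0, 19.0]
print(rmse(ref, pred), bias(ref, pred), r_squared(ref, pred), rer(ref, pred))
```

Because RER divides the reference range by the error, a model can reach RER > 5 with a modest R² when the constituent spans a wide range, which is one reason the two statistics are reported side by side.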
The number of LV needed to explain the variability was high for the OM model, due to the existence of different types of molecular interactions.
The cross-validation and external validation results showed that the models developed here are promising for predicting future proximate analyses of coconut and coffee biomass. The results showed that an independent validation with new biomass species/samples was able to predict chemical composition, making NIR spectroscopy more robust and practical for industrial applications. Liu et al. (2010) used an independent validation including a species different from those used for calibration, and good results were obtained, with RSD < 14%. Fagan et al. (2011) predicted the moisture, ash and carbon content of two crops (164 samples) using PLS and NIR, obtaining R² > 0.88 in cross-validation, except for ash with an R² of 0.58, demonstrating the applicability for screening calibration, while Rambo et al. (2016), using 26 samples, were able to determine the ash and moisture contents with R² > 0.80. Similarly, Gómez et al. (2018), with only 28 samples, obtained models with a true correlation between the reference and predicted values, with errors below 1.2%.
The results showed that NIR in combination with PLS is able to quantify the composition of samples using multi-product calibration models.
In order to establish whether there is a true difference between the standard error values (RMSEC and RMSEP), the F-test was applied. All F calc values were lower than F tab (at the 95% confidence level), so it can be concluded that there is no statistically significant difference between the values, except for the OM model, which was inaccurate in the external validation. All t values were lower than the critical t value, suggesting that the results provided by the models agree with those of the standard method.
Finally, the low bias values indicate the absence of significant systematic errors.
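The F and t comparisons described above can be sketched as follows. The F statistic is taken here as the ratio of the squared errors (larger over smaller), and the bias t value follows the form commonly associated with ASTM E1655; both are stated as assumptions, since the text does not give the equations. Critical values are passed in from statistical tables rather than computed, and all names and numbers are hypothetical.

```python
import math

def f_ratio(rmse_a, rmse_b):
    """F statistic as the ratio of squared errors, larger over smaller."""
    big, small = max(rmse_a, rmse_b), min(rmse_a, rmse_b)
    return (big / small) ** 2

def bias_t_value(ref, pred):
    """t statistic for testing whether the bias differs from zero."""
    n = len(ref)
    errors = [p - r for r, p in zip(ref, pred)]
    b = sum(errors) / n
    # Standard deviation of the errors around the bias
    sdv = math.sqrt(sum((e - b) ** 2 for e in errors) / (n - 1))
    return abs(b) * math.sqrt(n) / sdv

def not_significant(calculated, tabulated):
    """True when the calculated statistic stays below the tabulated value."""
    return calculated < tabulated

# Hypothetical values, for illustration only
ref = [10.0, 12.0, 14.0, 16.0, 18.0]
pred = [11.0, 11.0, 15.0, 15.0, 19.0]
t_calc = bias_t_value(ref, pred)
f_calc = f_ratio(1.0, 2.0)
print(t_calc, f_calc)
```

The tabulated values depend on the confidence level and on the degrees of freedom of each set, so they need to be looked up separately for the calibration and validation sample counts.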

Regression coefficients interpretation
The interpretation of the regression coefficients is necessary to avoid possible accidental correlations (Rambo et al., 2015b). It is therefore important to assign the observed signals to the constituents in question. Figure 4 (A) shows the regression vector for moisture, where the significant bands at 1350 nm and 1920 nm are assigned to water. The weak band at 2270 nm (O-H stretch/C-O stretch combination) confirms the contribution of polysaccharides. The negative regions correlated with moisture were mainly due to the presence of carboxylic acids (1885 nm), which, together with the alkyl stretching of the first overtone, can be attributed to fatty acids. Proteins (1640 nm) and cellulose (2000 nm) also present a negative correlation with the moisture content (Yonenobu et al., 2009).
For the ash vector, the interpretation of the regression coefficients (Figure 4B) is not straightforward, since the correlation is probably indirect and NIR spectroscopy is limited to organic substances. A more general analysis nevertheless allows inorganic complex compounds to be correlated with the ash content, though with broader and fewer spectral bands (Rambo et al., 2013a). A correlation with crystalline (1480 nm) and semi-crystalline cellulose (1488 nm), assigned to the 1st overtone of the O-H stretch (Gómez et al., 2018; Zidan et al., 2012), was observed. A negative correlation with water was observed at 1780 nm, attributed to H-O-H symmetric bending and OM parameters (Shenk et al., 2008). For the ash regression vector, it is possible to observe that the mineral regions (N-H bend/N-H stretch between 1490 and 1550 nm) with low frequencies are more influential than organic matter, which is more prominent in the other regression vectors (Gómez et al., 2018).
The organic matter vector (Figure 4C) is assigned mainly to carbon bonds: the C-H stretch combination of amorphous cellulose (1330 nm), the C-H stretch combination (1440 nm), the C-H stretch 1st overtone of furanose/pyranose due to hemicelluloses (1724 nm), and the C-O stretch combination of polysaccharides (2270 nm). The negative correlation at 2090 nm is attributed to the O-H combination (Shenk et al., 2008; Zidan et al., 2012).
The regression coefficients have shown a real correlation with the modeled parameters. Thus, the proposed method is useful for routine screening analyses (RER > 5), providing fast and inexpensive results for the ash, moisture and OM parameters. For the VM and FC models, the calibration accuracy was inadequate (R² < 0.70), and they could not be applied for screening analysis of coconut/coffee biomasses.

External validation with coffee samples
External validation with samples of a different species is an alternative sampling approach that avoids problems such as the need to collect diverse samples, high costs and long analysis times, favoring fast biomass compositional analysis (Rambo et al., 2016), and it is used to evaluate the predictive ability of the models. The validation included different biomass species, and the relative errors are listed in Table 3.
The ash model had the highest relative error (~ 20% of the mean), whereas the moisture models presented the smallest deviations (< 20%). With these results, it is possible to conclude that the models are useful and reliable for screening calibration in biorefinery industries, using multiple biomass species with great variability in chemical composition.

Conclusions
PLS and NIR spectroscopy were useful for predicting fuel properties from the proximate analysis of coconut and coffee samples, thereby supporting real-time prediction in biorefinery industries. After acquisition of the spectra and removal of outlier samples, the best pre-treatments were D2 (SG7) and D2 (SG7) + detrend. The proposed approach presented an error of prediction lower than 27.70%, an error of calibration < 26.3% and R²cal,cv,pred higher than 0.70, showing its usefulness for routine screening analyses (RER > 5), providing fast and inexpensive results for ash, moisture