Uses of mid-infrared spectroscopy and chemometric models for differentiating between dried cocoa bean varieties

HIGHLIGHTS:
The ATR-FTIR technique discriminates between cocoa genotypes by chemometric methods.
The LDA and PLS-DA chemometric models achieve high reliability in the discrimination of food matrices.
It will be possible to improve the processes of dry cocoa bean classification in the industry.

ABSTRACT
Generally, the taxonomic classification of cocoa beans is based on the theobromine/caffeine ratio determined using high-performance liquid chromatography (HPLC). However, this technique involves laborious and time-consuming calculations. Attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy is a valuable, effective, and rapid tool for analyzing the chemical composition of food products. The objective of this study was to examine the potential of ATR-FTIR combined with chemometric tools such as principal component analysis (PCA), linear discriminant analysis (LDA), and partial least squares regression-discriminant analysis (PLS-DA) to discriminate between the Trinitario and Forastero dry bean cocoa varieties defined by theobromine and caffeine measurements via HPLC. The cocoa varieties were evaluated using HPLC analysis of 36 dry cocoa bean samples to determine the theobromine/caffeine ratio. Moreover, ATR-FTIR spectra were analyzed in the mid-infrared (MIR) region, and signals associated with theobromine and caffeine were identified and analyzed using the LDA and PLS-DA models. The LDA and PLS-DA models allowed the satisfactory differentiation between cocoa varieties, providing overall prediction capacity values of 98.2 ± 1.8% and 96.1 ± 2.4%, respectively. The results show the potential of ATR-FTIR spectroscopy for the reliable, fast, and easy differentiation of dried cocoa beans.



Introduction
Cocoa is a globally important agricultural commodity, and its quality depends on many factors, including postharvest processing, genotype, geographical origin, agronomic management, climate, and soil conditions (Barrientos et al., 2019; Kongor et al., 2016).
Criollo, Forastero, and Trinitario are three commonly available varieties (Żyżelewicz et al., 2018). The cultivars show genetic variability in their theobromine/caffeine ratio, which can therefore be used for their differentiation (Carrillo et al., 2014). High-performance liquid chromatography (HPLC) has facilitated the identification of these compounds; however, it is expensive. As an alternative, Fourier transform infrared (FTIR) spectroscopy combined with chemometrics has been used to determine the composition or purity of products (Vasquez-Vuelvas et al., 2020), with satisfactory results in various fields of research, as described by Juybar et al. (2020).
The integration of FTIR and chemometric tools has been successfully applied to food discrimination, differentiation, defect detection, quality prediction, and adulteration (Christou et al., 2018). Other studies have included the discrimination of espresso coffees (Belchior et al., 2019), quantification of defects in coffee (Craig et al., 2015), detection of food fraud (El Darra et al., 2017), antioxidant capacity of cocoa (Batista et al., 2016), classification of cocoa varieties (Barbin et al., 2018), and differentiation of coffee processed using different post-harvest techniques (Barrios et al., 2020).
As previously reported, the cocoa variety has been ascertained by measuring the ratio of theobromine to caffeine by HPLC; however, multivariate models based on FTIR information have not been used to predict the variety. This study aimed to examine the potential of attenuated total reflectance (ATR)-FTIR combined with chemometric tools such as principal component analysis (PCA), linear discriminant analysis (LDA), and partial least squares regression-discriminant analysis (PLS-DA) for differentiating between Trinitario and Forastero dry bean cocoa varieties defined by theobromine and caffeine measurements via HPLC.

Material and Methods
Thirty-six cocoa samples of Theobroma cacao L. were obtained directly from different farmers in the growing areas of Huila, Colombia. The origins and geographic locations of the cocoa samples are listed in Table 1. Raw cocoa samples (60 kg) were obtained from the fruits, fermented for 8 d in a wooden box (30 × 30 × 30 cm), and turned every 24 h to guarantee uniform fermentation. The samples were then spread on a meshed wooden tray with an area of approximately 120 × 120 cm, raised 130 cm above ground level.
The sun-drying process was conducted daily between 9 am and 4 pm until a moisture content of 6-7% on a wet basis (% w.b.) was achieved. This was monitored using a grain moisture tester (Gehaka G600, Gehaka AGRA, São Paulo, Brazil). To determine the final moisture content, the dried cocoa samples (5 g) were dehydrated in an oven (UF55, Memmert GmbH + Co. KG, Schwabach, Germany) at 105 °C for 24 h. The results were expressed as percentages on a dry basis (% d.b.).
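The relationship between the wet-basis and dry-basis moisture values mentioned above can be illustrated with a minimal sketch. It assumes the usual gravimetric definitions (water lost in the oven divided by the initial mass or by the dry mass); the function name and the example masses are hypothetical and are not taken from the study.

```python
def moisture_content(initial_mass_g, dry_mass_g):
    """Return (wet-basis %, dry-basis %) moisture from oven-drying masses.

    Assumes standard gravimetric definitions; both masses are in grams.
    """
    water_g = initial_mass_g - dry_mass_g
    wet_basis = 100.0 * water_g / initial_mass_g  # % w.b.
    dry_basis = 100.0 * water_g / dry_mass_g      # % d.b.
    return wet_basis, dry_basis


# Hypothetical 5 g sample that weighs 4.65 g after oven drying
wb, db = moisture_content(5.0, 4.65)
print(f"{wb:.2f}% w.b., {db:.2f}% d.b.")
```

Under these definitions, a sample at 7% w.b. corresponds to roughly 7.5% d.b., so the two bases give similar but not identical values in this moisture range.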
The water activity (a_w) of the samples was evaluated using a vapor sorption analyzer (VSA). To measure a_w, 2-3 g of dried and ground cocoa was placed inside a VSA (Aqualab Decagon Devices, Inc., Pullman, WA, USA), with prior calibration of the dew point sensor with four saturated aqueous salt solutions purchased from the instrument's manufacturer: 13.41 m LiCl (0.25 ± 0.003 a_w), 8.57 m LiCl (0.50 ± 0.003 a_w), 6.0 m NaCl (0.76 ± 0.003 a_w), and 2.33 m NaCl (0.92 ± 0.003 a_w). All measurements were performed in triplicate.
Dried cocoa beans without the skin were ground independently using a blender (Oster®, Colombia). The aqueous extractions were made in triplicate with 100 mg of dried cocoa powder in 25 mL Milli-Q water at 85 °C for 25 min in a water bath (WNE 45, Memmert, Schwabach, Germany) and stirred on a magnetic plate at 800 rpm for 10 min. The extracts were centrifuged at 9,000 rpm for 10 min with an EBA 200 centrifuge (Hettich, Kirchlengern, Germany) and filtered through 0.22 μm nylon filters.
Analysis was performed using an Agilent 1260 Infinity II series liquid chromatography instrument (Agilent Technologies, Santa Clara, CA, USA) with a Poroshell 120-C18 (2.7 μm, 4 μm, 4.6 × 150 mm) column. The injection volume was 20 µL, with a flow rate of 1 mL min⁻¹. Separation was performed using isocratic elution with methanol (Merck, Darmstadt, Germany) and water containing 0.2% acetic acid (20:80 v/v) for 10 min. The detection was performed using a diode array detector (DAD) at 280 nm. Theobromine and caffeine were identified by comparing their retention times and UV spectra with those of the standards (Borja Fajardo et al., 2022).
An Agilent Cary 630 FTIR spectrophotometer (Agilent Technologies, Santa Clara, CA, USA) with an ATR sampling accessory was used for the ATR-FTIR measurements, which were performed in a dry atmosphere at room temperature (20 ± 0.5 °C) (Craig et al., 2018). Approximately 0.5 g of dried cocoa powder was placed in the sampling accessory and pressed. All spectra were obtained in triplicate and recorded within the mid-infrared (MIR) range of 4000-650 cm⁻¹ with 4 cm⁻¹ resolution and 16 scans. Background subtraction was applied to all spectra. The ATR-FTIR standard spectra of theobromine (CAS 83-67-0, purity ≥ 98.0%) and caffeine (CAS 58-08-2, purity ≥ 99.0%) were recorded in triplicate under the same spectral conditions and were used to correlate with the signals of these compounds in the dried cocoa samples.
The theobromine and caffeine concentrations were expressed as means ± standard deviations (SD) of triplicate measurements for each sample. One-way analysis of variance (p ≤ 0.05) was performed using STATGRAPHICS Centurion XVIII (Manugistics, Inc., Rockville, MD, USA). To compensate for and remove the bias linked to the experimental acquisition of the spectra, the infrared spectral data were preprocessed using baseline correction. Subsequently, multiplicative scatter correction (MSC) was applied. Data processing was performed using R statistical software (version 3.6.3, R statistics, St. Louis, MO, USA) and the ChemoSpec (Hanson, 2022) R package.
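The scatter-correction step can be sketched as follows. This is a generic, minimal illustration in plain Python, not the ChemoSpec routine used in the study: it assumes the common MSC formulation in which each spectrum is regressed on the mean spectrum and then corrected with the fitted intercept and slope. The function name `msc` and the toy spectra are hypothetical.

```python
from statistics import mean


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    spectra: list of equal-length lists of absorbance values.
    Each spectrum s is fitted as s ≈ intercept + slope * reference and
    returned as (s - intercept) / slope.
    """
    n_points = len(spectra[0])
    reference = [mean(s[i] for s in spectra) for i in range(n_points)]
    ref_mean = mean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)

    corrected = []
    for s in spectra:
        s_mean = mean(s)
        # Ordinary least-squares slope and intercept of s on the reference
        slope = sum((r - ref_mean) * (x - s_mean)
                    for r, x in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected


# Toy example: the same profile with different offsets and scalings
base = [1.0, 2.0, 4.0, 3.0]
toy = [base, [2 * x + 1 for x in base], [0.5 * x - 0.2 for x in base]]
for row in msc(toy):
    print([round(v, 3) for v in row])
```

In the toy example the three spectra differ only by an additive offset and a multiplicative factor, so after correction they collapse onto the same profile, which is the effect MSC is meant to achieve for scatter-related variation.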
Multivariate statistical analysis of the spectroscopic data was performed using principal component analysis (PCA) to explain the data variability. To detect and remove outlier observations from the experimental data, multivariate control statistics, such as the residual sum of squares (RSS) and Hotelling's T², were used. Subsequently, LDA and PLS-DA were used to develop classification models to differentiate between the dried cocoa bean varieties established using HPLC. The analyses were performed using R statistical software with the DiscriMiner and Factoextra packages.
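As an illustration of the outlier screening, the sketch below computes Hotelling's T² for each observation from its PCA scores. It assumes mean-centered scores and the usual definition (the sum over retained components of the squared score divided by that component's score variance). The function name and toy scores are hypothetical, and the control limit, which is normally derived from an F distribution, is not computed here.

```python
from statistics import variance


def hotelling_t2(scores):
    """Hotelling's T² per observation.

    scores: rows are observations, columns are retained principal
    components; scores are assumed to be mean-centered.
    """
    n_components = len(scores[0])
    # Sample variance of the scores along each component
    variances = [variance([row[a] for row in scores])
                 for a in range(n_components)]
    return [sum(row[a] ** 2 / variances[a] for a in range(n_components))
            for row in scores]


# Toy scores on two components (hypothetical values)
toy_scores = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0], [0.0, 0.0]]
print(hotelling_t2(toy_scores))
```

Observations whose T² exceeds a chosen control limit would then be inspected as potential outliers, together with their residual sum of squares.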
LDA requires fewer, uncorrelated variables than observations. Therefore, the orthogonal eigenspace from the PCA (all principal components, which together summarized 100% of the variability of the original dataset) was used as the input for the LDA model. Additionally, the PCA scores used for LDA model training were screened using the mean decrease accuracy criterion of the Random Forest algorithm, computed with the randomForest R package. To validate the LDA and PLS-DA models, the samples were randomly divided 100 times into calibration (75%) and validation (25%) datasets. In each iteration, the chemometric models were trained with the calibration dataset (using "leave-one-out" cross-validation), and external validation was subsequently performed with the remaining 25% of the data.
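The repeated calibration/validation partition can be sketched as below. This is only an outline of the resampling scheme under stated assumptions: splitting is done at the sample level (36 samples), without stratification by variety, and with a fixed seed for reproducibility. None of these details are specified in the text, and the function name is hypothetical.

```python
import random


def repeated_splits(n_samples, n_repeats=100, calibration_fraction=0.75, seed=0):
    """Yield (calibration_indices, validation_indices) for each repetition."""
    rng = random.Random(seed)
    n_cal = round(n_samples * calibration_fraction)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = rng.sample(indices, n_samples)  # random permutation
        yield sorted(shuffled[:n_cal]), sorted(shuffled[n_cal:])


# 36 samples give 27 calibration and 9 validation samples per repetition
splits = list(repeated_splits(36))
cal, val = splits[0]
print(len(splits), len(cal), len(val))
```

A classifier would be fitted on each calibration subset and scored on the matching validation subset, so that the reported metrics are means and standard deviations over the 100 repetitions.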
The predictive ability of the classification models was evaluated based on the overall accuracy (%), sensitivity (Eq. 1), specificity (Eq. 2), precision (Eq. 3), recall (Eq. 4), and F-score (Eq. 5). Additionally, to select the best classifier, a multifactor analysis of variance (ANOVA) was performed considering the models and iterations as factors and the different classification metrics of the validation dataset as responses, comparing means using least significant difference (LSD) intervals (p ≤ 0.05). Residual validation of all ANOVA models was performed using the Shapiro-Wilk test to verify residual normality and the Ljung-Box test to check residual independence, and a multifactor ANOVA was performed on the squared residuals to verify the homoscedasticity hypothesis. Statistical assumptions were verified at p ≤ 0.05 using STATGRAPHICS Centurion XVIII (Manugistics, Inc., Rockville, MD, USA).
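Because Eqs. 1-5 are not reproduced in this text, the following sketch uses the standard confusion-matrix definitions of these metrics for a two-class problem. This is an assumption about the exact formulas used by the authors; the function name and the example counts are hypothetical. The sketch also assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives,
    and false negatives for the class treated as "positive".
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    recall = sensitivity  # recall and sensitivity share the same definition
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy_percent": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts for one validation split
print(classification_metrics(tp=8, fp=1, tn=9, fn=2))
```

With these definitions, sensitivity and recall are numerically identical, and the F-score is the harmonic mean of precision and recall.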

Results and Discussion
Firstly, the water activity and moisture content of the dried cocoa bean samples were between 0.32 and 0.42 a_w and between 6.38 and 7.52% d.b., respectively. The cocoa varieties identified using HPLC are listed in Table 2. Based on the theobromine/caffeine ratio, the dried cocoa beans were classified into the Trinitario and Forastero varieties. According to Carrillo et al. (2014) and Samaniego et al. (2020), theobromine/caffeine ratios between 3 and 9 are indicative of the Trinitario genotype, and values higher than 9 can be classified as the Forastero variety.
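The ratio-based rule described above can be written as a short sketch. The thresholds (3 to 9 for Trinitario, above 9 for Forastero) come from the text. The treatment of the boundary values as inclusive, the "unassigned" label for ratios below 3, the function name, and the example concentrations are assumptions made only for illustration.

```python
def classify_by_ratio(theobromine, caffeine):
    """Return (ratio, variety) from theobromine and caffeine concentrations.

    Both concentrations must be in the same units. Ratios below 3 are not
    covered by the rule quoted in the text and are returned as "unassigned".
    """
    ratio = theobromine / caffeine
    if ratio > 9:
        return ratio, "Forastero"
    if ratio >= 3:
        return ratio, "Trinitario"
    return ratio, "unassigned"


# Hypothetical concentrations in arbitrary but identical units
print(classify_by_ratio(10.0, 2.0))
print(classify_by_ratio(12.0, 1.0))
```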
As can be observed, the theobromine concentrations in both varieties were higher than the caffeine concentrations, indicating that theobromine was the predominant compound in the extracts, which is consistent with previous findings (Carrillo et al., 2014; Hernández-Hernández et al., 2022). Furthermore, significant differences in methylxanthine levels were observed. The Trinitario variety had higher theobromine and caffeine contents than the Forastero variety, as shown in Table 2. These results are similar to those reported by Carrillo et al. (2014) for Colombian cocoa beans from different growing areas and by Samaniego et al. (2020) for cocoa beans from Ecuador. The differences in methylxanthine content between our results and those in the literature can be attributed to different post-harvest treatments, geographical origins, agronomic management, climate, and soil conditions (Kongor et al., 2016). Theobromine and caffeine account for nearly the total concentration of alkaloids in Theobroma cacao and its derivative products, with the former being the most abundant methylxanthine and the latter found only in small amounts (Bartella et al., 2019). The methylxanthine concentrations in the Trinitario and Forastero cocoa varieties are shown in Table 3.
Dried cocoa bean samples of both varieties were analyzed (Table 2), and the theobromine/caffeine ratio could be related to bean quality (Álvarez et al., 2012). The results could be valuable for determining the genotype of dried cocoa beans and for their application in the cocoa industry, because the Forastero variety has been regarded as a precursor of ordinary and basic quality notes and is the primary raw material used in 80% of global chocolate production. The Trinitario variety is a hybrid resulting from crossing Criollo (more aromatic and floral notes, less bitter, and smoother than Forastero) with Forastero; it is used in close to 10-15% of chocolate production and is known to produce some wine flavor notes (Quiroz-Reyes & Fogliano, 2018).
The observed ATR-FTIR spectra showed typical vibration patterns of biological material constituents, such as proteins, lipids, and carbohydrates, reflecting the composition of dried cocoa beans and the influence of their genotype on absorbance unit variation. According to Baker et al. (2014), the most important spectral regions are the fingerprint region (1450-600 cm⁻¹), the amide I and II regions (1700-1500 cm⁻¹), and the higher wavenumber region of 3500-2550 cm⁻¹, which is associated with stretching vibrations (S-H, C-H, N-H, and O-H).
Several ATR-FTIR studies have been conducted on cocoa beans and chocolate. The bands at 963, 1018, 1076, and 1112 cm⁻¹ were associated with the C-O and C-C stretching modes (Hu et al., 2016; Batista et al., 2016). The peaks at 1283, 1360, and 1457 cm⁻¹ (Figure 1) were related to the O-C-H, C-C-H, and C-O-H bending modes, respectively. C-H deformation of the ring was observed at 882 cm⁻¹, and C-OH of the phenyl group was observed at 1143 and 1517 cm⁻¹. The alkene stretching vibrations (C=C) at 1663 and 1620 cm⁻¹ can be attributed to axial deformation of the N-H group of the aromatic ring due to the possible presence of alkaloids such as caffeine and theobromine, which usually show signals in the range 1750-1600 cm⁻¹ (Rojas et al., 2020). Additionally, the vibrations at 1645-1544 cm⁻¹ can be attributed to the C-C stretching of the aromatic ring (Batista et al., 2016). The bands at 2922, 2852, and 1743 cm⁻¹ are related to asymmetric and symmetric CH₂ stretching, as well as to the C=O stretching group of triglycerides adjacent to the C-O group in esters (Batista et al., 2016; Craig et al., 2018; Barrios et al., 2021). Finally, the wavenumber at 3003 cm⁻¹ was associated with the stretching vibration of the cis-olefinic double bond (Sánchez-Reinoso et al., 2017), and the bands associated with the phenol group at 3562-3322 cm⁻¹ were attributed to O-H stretching (Batista et al., 2016).
To confirm the presence of methylxanthines (alkaloids) in the ATR-FTIR spectra of the cocoa samples, theobromine (Figure 2A) and caffeine (Figure 2C) standards were analyzed according to a previously reported approach. The standard spectra were recorded in the wavenumber range of 1800-1400 cm⁻¹. In Figures 2B and D, the signals at 1650 and 1600 cm⁻¹ confirmed the presence of theobromine, and those at 1692 and 1642 cm⁻¹ confirmed the presence of caffeine in the cocoa samples (Figure 2F). These signals resembled those reported by Rojas et al. (2020), who argued that the vibrations of these alkaloids typically show signals within a wavenumber range of 1750-1600 cm⁻¹. The ATR-FTIR spectrum of the caffeine standard (Figure 2C) corresponded to that reported by Bahamon et al. (2018). Moreover, the vibrational band at 2954 cm⁻¹ was attributed to the C-H stretching of caffeine. Nugrahani et al. (2019) determined the specific spectral area of caffeine at 2967.27-2930.51 cm⁻¹ and its association with the -N-H amide functional group.
An exploratory PCA was performed with all spectral data (with baseline correction) using different preprocessing techniques, such as standard normal variate (SNV), MSC, and first and second derivatives. The best clustering results were obtained for MSC (Figure 3). As mentioned previously, the chemical compounds theobromine and caffeine, quantified using HPLC, were adequate for identifying the cocoa bean variety. These were found in the MIR spectral measurements of the samples in the wavenumber range between 1700 and 1600 cm⁻¹, suggesting that these chemical compounds can also be identified via ATR-FTIR in the biochemical fingerprint region.
As shown in Figure 3, the first two principal components summarized 68.55% of the spectral variability, indicating that most of the variability of the MIR information was explained by these components (Figure 3A). The first principal component (PC1) contributed the most (43.12%), followed by the second (PC2; 25.43%). The score plot indicated that Trinitario and Forastero could be separated into two groups. Trinitario cocoa samples were located on the negative axis of PC1 and were distributed throughout PC2, whereas Forastero samples were generally located at positive PC1 values and had similar distributions on PC2. From this grouping behavior, important differences were observed in the infrared spectra of the Forastero and Trinitario varieties, and the PCA explained the differences in the MIR profile and the grouping of the samples.
To help interpret the separation observed in the PCA, Figure 3B shows the highest loadings for the first two principal components. The loadings of PC2 showed a maximum positive value at 1744 cm⁻¹, which correlated with a decrease in the same area for PC1. This matches the wavenumber observed in Figure 1, suggesting that PC2 influences the separation of cocoa varieties owing to the ester functional group (C=O) observed in this spectral region.
The stretching of this functional group can be attributed to the lipid content of the cocoa samples (Cortés et al., 2019). This suggests that the differences in the lipid concentrations of each variety contributed to their differentiation. Similarly, the negative loading in the wavenumber range 1670-1500 cm⁻¹ is related to the stretching of the cis functional group C=C (1654 cm⁻¹) and broadening of the NH amide group (1531 cm⁻¹) identified in the FTIR spectrum (Craig et al., 2018; Cortés et al., 2019). These functional groups can be associated with theobromine and caffeine contents (Figure 2). The peak at 1167 cm⁻¹ did not correspond to a specific signal in the infrared spectrum (Figure 1). However, this signal has been related to the C-OH groups associated with the polyphenol content in chocolate (Hu et al., 2016). In PC1 (positive) and PC2 (negative), the influence of the 2922 and 2852 cm⁻¹ bands, related to the asymmetric and symmetric stretching of CH₂ and the C=O stretching group of the triglycerides, was evident. A strong contribution of the wavenumber at 3003 cm⁻¹, associated with the stretching vibration of the cis-olefinic double bond, was also observed (Sánchez-Reinoso et al., 2017).
According to the results obtained, analyzing the loadings coupled with infrared spectral information obtained from chemical standards and samples constitutes a valuable tool for identifying the MIR features that facilitate differentiation between the Trinitario and Forastero cocoa varieties.
To discriminate between these cocoa varieties, LDA and PLS-DA were performed on the preprocessed spectral information; the results are shown in Table 4.
Using both chemometric models, all Trinitario and Forastero cocoa samples were satisfactorily classified into the pre-established varieties that were differentiated using HPLC (Table 3). The overall accuracy, sensitivity, specificity, precision, recall, and F-score of the models were high for both the calibration and validation sets. As shown in Table 4, the two supervised classification methods resulted in an overall accuracy higher than 90%. However, the ANOVA-based LSD intervals highlighted that PLS-DA was significantly better than LDA for classifying the observations from the validation set: the PLS-DA model showed the highest overall accuracy (96.1 ± 2.4%), highest recall (0.98 ± 0.06), and highest F-score (0.97 ± 0.05). To our knowledge, this could be the first study to correlate ATR-FTIR spectra with primary data on theobromine/caffeine ratio quantification using HPLC in dry cocoa samples to discriminate the Trinitario and Forastero varieties using chemometric models, thus providing a satisfactory classification tool for differentiating the cocoa varieties in the two classes evaluated.

Table 4. Classification results and overall accuracy using the LDA and PLS-DA chemometric models for classifying cocoa varieties differentiated using HPLC
Different letters indicate statistically significant differences (p ≤ 0.05) obtained using Fisher's test between the chemometric models for the validation dataset. Results are expressed as mean ± standard deviation.
Additionally, the results showed that the LDA model was overfitted to the training dataset, as evidenced by the decrease (Table 4) in the classification metrics obtained with the test set relative to those obtained with the training set.
For the PLS-DA model, only slight decreases in the classification metrics were observed, indicating a better ability of PLS-DA to predict the cocoa variety of an unknown observation based on its spectral information. Hence, the ATR-FTIR spectroscopy technique, coupled with chemometric models, showed satisfactory and reliable classification performance for the varieties under study. Using PLS-DA, the highest overall accuracy of 96.1 ± 2.4% was obtained. Thus, PLS-DA is the most suitable chemometric model for solving the classification problem and can be considered a valuable tool for distinguishing between these cocoa varieties.

Conclusions
1. Attenuated total reflectance-Fourier transform infrared spectroscopy proved to be valuable for characterizing and identifying functional groups associated with chemical compounds such as theobromine and caffeine.
2. The results obtained using the discriminant chemometric models (LDA and PLS-DA) indicated that the proposed ATR-FTIR spectroscopy technique can be regarded as a promising alternative that is appropriate for identifying Trinitario and Forastero cocoa bean varieties.
3. The partial least squares regression-discriminant analysis model provided the most adequate and useful information for predicting the dry bean cocoa varieties.

Table 2. Cocoa varieties determined using HPLC
Means with different letters indicate statistically significant differences (p ≤ 0.05) according to Fisher's test. Results are expressed as mean ± standard deviation.

Table 3. Theobromine and caffeine concentration of Trinitario and Forastero dried cocoa samples