Introduction
Flavour and aroma are two important smoking characteristics of tobacco, and they depend mainly on the class of the tobacco leaves. In China, tobacco leaves are graded mainly by position and color. The positions on a tobacco plant are usually divided into three categories: lower (X), middle (C) and upper (B). The color is divided into three hue categories: red-brown (R), orange (F) and lemon (L). Groups of tobacco leaves are first formed by combining the position and color categories, and each group is then further divided into 3 or 4 grades according to quality. At present, however, most grading is done manually and relies on the experience of tobacco experts, which is slow, laborious and subjective. The efficiency and stability are also unsatisfactory.^{1} Therefore, it is necessary to develop a new method that is fast, efficient and more objective.
Near infrared (NIR) spectroscopy is a useful tool in analytical chemistry^{2} that is accurate, fast and non-destructive. It has been widely used in fields such as agriculture, medicine, oil and tobacco.^{3-5} Bin et al.^{6} and Wang et al.^{7} have already used NIR spectroscopy to classify tobacco leaves, with satisfactory results. However, they used a traditional desktop NIR spectrometer, which is expensive, bulky and cannot be brought to the purchasing spot. Compared with a traditional desktop NIR device, a hand-held NIR spectroscopy device is small, cheap, flexible and much more convenient, which makes it suitable for the tobacco purchasing spot.^{8} However, the spectral range of a hand-held NIR device (900-1700 nm) is usually narrower than that of a traditional desktop NIR instrument (1000-2500 nm), so building a model that suits the device is of great importance.
As a learning method based on sparse representation, the sparse representation classification (SRC) algorithm can capture the essential features of a signal or data set, and sparse representation has been widely used in signal de-noising,^{9,10} decoding,^{11} compressed sensing^{12} and machine learning.^{13,14} This work focuses on classifying spectral data acquired with a hand-held NIR device. The SRC algorithm was proposed by Li and Ngom;^{15,16} it can reflect the relevance between a feature and the target while controlling for the other features. It is robust to noise and suitable for classifying hand-held NIR spectral data.
In this paper, a novel SRC method for classifying tobacco leaves based on hand-held NIR spectroscopy is proposed. The method represents the large tobacco leaf NIR spectral data sparsely and adopts a transformation matrix to reduce dimensionality. With the advantages of de-noising, avoiding overfitting and being convenient in practice, the method is shown to improve the efficiency of classifying tobacco leaves on the spot during purchasing.
Experimental
Samples
Two experimental sets from different locations were chosen. The first was harvested in 2018 in Zhanyi City, Yunnan Province, China. It consists of 210 tobacco leaves in three categories (X2F, C3F and B2F), with the same number of leaves in each category. The second was harvested in 2018 in Luoping City, Yunnan Province, China. It consists of 320 tobacco leaves in five categories (X2F, C2F, C3F, B1F and B2F), again with the same number of leaves in each category. For each data set, half of the samples were chosen randomly as the training set and the other half as the test set, using the 2-fold cross-validation method.^{17} The details of the two data sets are shown in Table 1.
Data | Year | Location | Class^{a} | Features | Total samples | Training samples | Test samples |
---|---|---|---|---|---|---|---|
Data set 1 | 2018 | Zhanyi, Yunnan | X2F | 320 | 70 | 35 | 35 |
Data set 1 | 2018 | Zhanyi, Yunnan | C3F | 320 | 70 | 35 | 35 |
Data set 1 | 2018 | Zhanyi, Yunnan | B2F | 320 | 70 | 35 | 35 |
Data set 2 | 2018 | Luoping, Yunnan | X2F | 320 | 64 | 32 | 32 |
Data set 2 | 2018 | Luoping, Yunnan | C2F | 320 | 64 | 32 | 32 |
Data set 2 | 2018 | Luoping, Yunnan | C3F | 320 | 64 | 32 | 32 |
Data set 2 | 2018 | Luoping, Yunnan | B1F | 320 | 64 | 32 | 32 |
Data set 2 | 2018 | Luoping, Yunnan | B2F | 320 | 64 | 32 | 32 |
^{a}The different classes of tobacco leaves, formed by combination of the position and color categories, followed by division into grades according to the quality. X: lower; C: middle; B: upper; F: orange.
Equipment
The DLP NIRscan Nano is an ultra-portable spectrometer evaluation module that uses DLP (digital light processing) technology to achieve lower cost, smaller size and higher performance than traditional architectures. Replacing the linear array detector with a DLP digital micromirror device (DMD) in conjunction with a single-point detector adds programmable spectral filters and sampling techniques that were not previously available on NIR spectrometers. How this technology differs from the conventional one is shown in Figure 1. The device was provided by Texas Instruments. The spectral data were collected with the hand-held DLP-based spectroscopy device with 16 scans co-added. The spectral range of the device is 900 to 1700 nm and the resolution is 5.85 nm. Five absorbance spectra were collected from each tobacco leaf sample (shown in Figure 2), and the final spectrum of each sample was calculated by averaging the five spectra. The NIR spectral data of the two sets collected with the hand-held NIR device are shown in Figures 3a and 3b.
Theory of SRC algorithm
Sparse representation (SR)^{9} is a parsimonious principle stating that a sample can be approximated by a sparse linear combination of basis vectors.^{15} Non-orthogonal basis vectors can be learned by SR, and the basis vectors are allowed to be redundant. SR involves two techniques. First, given a basis matrix, learning the sparse coefficients of a new sample is called sparse coding. Second, given training data, learning the basis vectors is called dictionary learning. The model can be statistically formulated as

$$b = Ax + e \qquad (1)$$
In equation 1, b represents the training samples, A = [a_{1}, ..., a_{k}] is the dictionary, a_{i} is one of the dictionary atoms, x and e are the sparse coefficient vector and the error term, respectively, and k is the model parameter. The SR model rests on the following assumptions: (i) the error term e is Gaussian distributed with zero mean and isotropic covariance; (ii) the dictionary atoms are usually Gaussian distributed, and the coefficient vector should follow a sparsity-inducing distribution; (iii) x is independent of e.
As mentioned above, an SR model usually involves two steps: sparse coding and dictionary learning. If the coefficient vectors of the training dictionary are all non-negative, sparse coding can be formulated as the following model:

$$\min_{x} \frac{1}{2}\lVert b - Ax \rVert_{2}^{2} \quad \text{s.t.} \quad x \geq 0 \qquad (2)$$
The sparse coding model of equation 2 is called non-negative least squares (NNLS).^{16} NNLS is easy to interpret and convenient in practice, and it is suitable for NIR spectral data. The classification of hand-held spectral data based on the SRC algorithm therefore has the following steps: (i) the spectra of all samples are collected with a hand-held NIR device and divided randomly into training samples and test samples; (ii) all the training samples are used to build a training dictionary with the NNLS algorithm; (iii) each test sample is expressed in terms of this training dictionary by sparse representation; (iv) the regression residual is computed for each class, and the test sample is assigned to the class with the minimum residual.
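The steps above can be illustrated with a minimal Python sketch. It assumes that the training samples themselves serve as the dictionary atoms and that the NNLS problem is solved with `scipy.optimize.nnls`; the function name `src_predict` and the synthetic data are hypothetical and are not taken from the original implementation.

```python
import numpy as np
from scipy.optimize import nnls


def src_predict(train_X, train_y, test_X):
    """Assign each test sample to the class with the smallest NNLS residual."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    A = train_X.T  # assumed dictionary: one column (atom) per training sample
    classes = np.unique(train_y)
    predictions = []
    for b in np.asarray(test_X, dtype=float):
        x, _ = nnls(A, b)  # non-negative coefficients of b over all atoms
        residuals = []
        for c in classes:
            mask = train_y == c
            # reconstruct b using only the atoms (and coefficients) of class c
            residuals.append(np.linalg.norm(b - A[:, mask] @ x[mask]))
        predictions.append(classes[int(np.argmin(residuals))])
    return np.array(predictions)


# Synthetic illustration: three classes with different band positions
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 20)
prototypes = np.stack([np.exp(-((grid - c) / 0.1) ** 2) for c in (0.2, 0.5, 0.8)])
train_y = np.repeat(np.arange(3), 10)
train_X = prototypes[train_y] + 0.01 * rng.normal(size=(30, 20))
test_y = np.repeat(np.arange(3), 5)
test_X = prototypes[test_y] + 0.01 * rng.normal(size=(15, 20))
print(np.mean(src_predict(train_X, train_y, test_X) == test_y))
```

Because the coefficients are constrained to be non-negative, most of them are driven to zero, so each test sample is explained by a few training samples, and the class-wise residual indicates which class those samples belong to.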
Results and Discussion
First, the original spectral data are pre-processed. Multiplicative scatter correction and the gap-segment derivative (first derivative, 5-point window, second-order polynomial)^{18} are used. The pre-processing results are shown in Figure 4 and indicate that the resolution of the spectral data has been improved.
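A minimal sketch of such a pre-processing chain is given below. Multiplicative scatter correction is written out with NumPy, using the mean spectrum as the reference, which is an assumption. The derivative step uses a Savitzky-Golay filter (`scipy.signal.savgol_filter`) with the same window and polynomial order as a stand-in; it is not the gap-segment algorithm itself, and the function names are hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    # assumption: the mean spectrum is used when no reference is supplied
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, 1)  # fit s ~ slope * ref + offset
        corrected[i] = (s - offset) / slope    # remove additive and multiplicative effects
    return corrected


def first_derivative(spectra, window=5, polyorder=2):
    """First derivative along the wavelength axis (Savitzky-Golay stand-in)."""
    return savgol_filter(np.asarray(spectra, dtype=float), window_length=window,
                         polyorder=polyorder, deriv=1, axis=1)


def preprocess(spectra):
    """Apply scatter correction followed by the first derivative."""
    return first_derivative(msc(spectra))
```

For a matrix with one spectrum per row, `preprocess(spectra)` returns an array of the same shape.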
After pre-processing, the dimensionality of the data is still high. Therefore, principal component analysis (PCA)^{19} is used to reduce it. Table 2 shows the cumulative percentage of variance for the first six principal components (PCs) of the two data sets; the six PCs account for 99.78% and 99.88% of the variance, respectively, so they describe the two data sets well. After PCA, the dimensionality of data set 1 is reduced from 210 × 320 to 210 × 6 and that of data set 2 from 320 × 320 to 320 × 6.
Data set | PC1^{a} / % | PC2^{a} / % | PC3^{a} / % | PC4^{a} / % | PC5^{a} / % | PC6^{a} / % |
---|---|---|---|---|---|---|
Data set 1 | 88.53 | 98.43 | 98.89 | 99.35 | 99.65 | 99.78 |
Data set 2 | 80.64 | 97.34 | 99.47 | 99.72 | 99.84 | 99.88 |
^{a}Cumulative percentage of variance. PCA: principal component analysis.
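The cumulative percentages in Table 2 and the reduced score matrices can be obtained along the following lines. This is a sketch that assumes PCA is computed by singular value decomposition of the mean-centred data; the function name `pca_scores` is hypothetical and random data stand in for the spectra.

```python
import numpy as np


def pca_scores(X, n_components=6):
    """Return PCA scores and the cumulative explained variance in percent."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each variable
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)           # variance fraction per component
    cumulative = 100.0 * np.cumsum(explained)[:n_components]
    scores = Xc @ Vt[:n_components].T             # project onto the leading components
    return scores, cumulative


# Placeholder data with the shape of data set 1 (210 samples, 320 variables)
rng = np.random.default_rng(0)
scores, cumulative = pca_scores(rng.normal(size=(210, 320)))
print(scores.shape)
```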
As mentioned in the Experimental section (“Samples” sub-section), the two-fold cross-validation method was used to partition the samples equally into training and test samples. The linear discriminant analysis (LDA),^{20} support vector machine (SVM)^{21,22} and SRC^{15,16} algorithms were used to solve the classification problem. The radial basis function (RBF) was used as the kernel for the SVM algorithm. In addition, the particle swarm optimization (PSO) algorithm^{23} was used to optimize the accuracy of the SVM classifier: it generates candidate parameters randomly and estimates the best values of the regularization and kernel parameters of the SVM model. All the classifiers were run on the same training and test splits for a fair comparison. To ensure that every sample would be chosen, all the algorithms were run 10 times on each data set. The comparison results are shown in Tables 3 and 4.
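The repeated partitioning can be sketched as follows. The sketch assumes that each of the 10 runs draws one random split in which every class is halved, which is consistent with the per-class counts in Table 1 and the "35 × 10" test totals; the function name and the seed are hypothetical.

```python
import numpy as np


def repeated_half_splits(labels, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs, halving every class at random in each repeat."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        train, test = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            half = len(idx) // 2
            train.extend(idx[:half])   # first half of the shuffled class
            test.extend(idx[half:])    # remaining half
        yield np.array(train), np.array(test)


# Labels with the layout of data set 1: 70 leaves in each of three classes
labels = np.repeat(["X2F", "C3F", "B2F"], 70)
for train_idx, test_idx in repeated_half_splits(labels):
    pass  # fit every classifier on train_idx and evaluate it on test_idx
```

Reusing the same index pairs for all classifiers keeps the comparison on identical training and test samples.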
Class^{a} (test samples) | LDA average classification accuracy / % | SVM average classification accuracy / % | SRC average classification accuracy / % |
---|---|---|---|
X2F (35 × 10) | 331 / 350, 94.57 | 332 / 350, 94.86 | 341 / 350, 97.43 |
C3F (35 × 10) | 310 / 350, 88.57 | 303 / 350, 86.57 | 307 / 350, 87.71 |
B2F (35 × 10) | 327 / 350, 93.43 | 336 / 350, 96.00 | 350 / 350, 100.00 |
Total average accuracy | 968 / 1050, 92.19 | 971 / 1050, 92.47 | 998 / 1050, 95.05 |
^{a}The different classes of tobacco leaves, formed by combination of the position and color categories, followed by division into grades according to the quality. X: lower; C: middle; B: upper; F: orange. LDA: linear discriminant analysis; SVM: support vector machine; SRC: sparse representation classification.
Class^{a} (test samples) | LDA average classification accuracy / % | SVM average classification accuracy / % | SRC average classification accuracy / % |
---|---|---|---|
X2F (32 × 10) | 266 / 320, 83.12 | 290 / 320, 90.62 | 297 / 320, 92.81 |
C2F (32 × 10) | 210 / 320, 65.62 | 282 / 320, 88.12 | 266 / 320, 83.12 |
C3F (32 × 10) | 292 / 320, 91.25 | 301 / 320, 94.06 | 283 / 320, 88.43 |
B1F (32 × 10) | 183 / 320, 57.19 | 229 / 320, 71.56 | 278 / 320, 86.88 |
B2F (32 × 10) | 112 / 320, 35.00 | 164 / 320, 51.25 | 281 / 320, 87.81 |
Total average accuracy | 1063 / 1600, 66.44 | 1266 / 1600, 79.12 | 1405 / 1600, 87.81 |
^{a}The different classes of tobacco leaves, formed by combination of the position and color categories, followed by division into grades according to the quality. X: lower; C: middle; B: upper; F: orange. LDA: linear discriminant analysis; SVM: support vector machine; SRC: sparse representation classification.
Tables 3 and 4 show that the SRC algorithm achieves the highest total classification accuracy on both data sets (95.05% and 87.81%), compared with 92.19% and 66.44% for LDA and 92.47% and 79.12% for SVM. The improvement is most pronounced for the close classes B1F and B2F, which share the same position and color: SRC reaches 86.88% and 87.81%, whereas SVM reaches only 71.56% and 51.25%. For C2F and C3F, SRC is slightly less accurate than SVM (83.12% and 88.43% versus 88.12% and 94.06%). The main reason is that the PCA-LDA and PCA-SVM algorithms classify the samples by the minimum Euclidean distance in the feature space, which becomes ineffective when the classes and features are close. In contrast, the SRC algorithm uses the redundancy of the dictionary to relate the training samples and the test samples. It can capture the essential features of the signal or data and is strongly robust to noise. As a result, unlike the LDA and SVM algorithms, it can achieve effective classification even when the classes and features are close.
The aim of the paper is to grade tobacco leaves on the spot during purchasing with a hand-held DLP-based NIR device, so the classification results must be available as quickly as possible. Computation time is therefore also a very important factor for field use. The average computation time of each algorithm was recorded, and the results are shown in Figure 5. All experiments on the two data sets were performed on a laptop computer (Intel Core i7-6700 CPU, 3.70 GHz, 8 GB RAM, 64-bit Windows 10 Professional operating system). Figure 5 shows that the SRC classification algorithm is much more efficient than the other two methods.
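The text does not state how the average computation time was measured. One simple way to record it, given here only as an assumption, is to average the wall-clock time of repeated calls; the helper name `average_runtime` is hypothetical.

```python
import time


def average_runtime(func, *args, repeats=10):
    """Return the mean wall-clock time in seconds of func(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats


# Example with a placeholder workload standing in for one classifier run
elapsed = average_runtime(sorted, list(range(10000)), repeats=10)
print(f"{elapsed:.6f} s per run")
```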
Conclusions
The paper proposes a novel classification tool based on hand-held NIR technology for grading different classes of tobacco leaves. The experimental results show that the SRC algorithm outperforms the LDA and SVM algorithms in both overall classification accuracy and computational efficiency. The results suggest that a hand-held DLP-based NIR device could be an effective tool for grading tobacco leaves on the spot during purchasing.