
Sparse Estimation of the Precision Matrix and Plug-In Principle in Linear Discriminant Analysis for Hyperspectral Image Classification

ABSTRACT

In this paper, a new method for supervised classification of hyperspectral images is proposed for the case in which the size of the training sample is small. It consists of replacing, in the Mahalanobis distance, the maximum likelihood estimator of the precision matrix by a sparse estimator. The method is compared with two other existing sparse versions of LDA on both real and simulated images.

Keywords:
high dimensionality; linear discriminant analysis; hyperspectral image

1 INTRODUCTION

Hyperspectral sensors collect information from the earth’s surface in hundreds of bands of the electromagnetic spectrum with relatively narrow bandwidths. Each pixel of a hyperspectral image can be regarded as a high-dimensional vector, $x = (x_1, \ldots, x_p)'$, called the feature vector, whose entries $x_j$, $1 \le j \le p$, correspond to the spectral reflectance collected in band $j$ of the electromagnetic spectrum.

Image classification is an essential step which consists in defining classification rules that assign every data point x to one of K classes based on its feature vector. In supervised image classification, a classifier is trained on a training data set of size n. Since the collection of such training data is either expensive or time-consuming, there is usually only a small amount of labeled data available.

The huge number of features (p) of hyperspectral data and the limited availability of training sample data (n) greatly degrade the performance of traditional classification methods such as linear discriminant analysis (LDA), as [21] emphasizes. This problem is typical in high-dimensional data analysis.

For example, if p > n the sample covariance matrix S is singular and LDA cannot be applied. If n is not much greater than p, S can be calculated, but due to the large number of estimated parameters (p(p + 1)/2) it will have a significant amount of sampling error, and its inverse will be a highly biased estimator of $\Sigma^{-1}$ [1]. On the other hand, because the bandwidth decreases as p grows, many bands are highly correlated and hyperspectral data can contain a significant amount of redundant spectral information.

Recently, two algorithms have been proposed, sparse discriminant analysis (SDA) [5] and penalized LDA [20], which simultaneously solve the singularity problem of the covariance matrix (when p > n) and allow variable selection, by imposing an ℓ1-type penalty on the optimal scoring problem, from which the LDA rule can be derived, and on the discriminant vectors in Fisher’s discriminant problem, respectively. In the context of high dimensionality, the estimation of the covariance matrix and of its inverse, the precision matrix, is already a research area with important results. Most of the proposals rely on sparsity assumptions and require some kind of regularization strategy; in this framework, it is assumed that many of the matrix entries to be estimated are zero [10; 8; 15; 9]. Using sparse estimates of the precision matrix and the plug-in principle in the Mahalanobis distance allows LDA to be extended to high-dimensional problems [7]. This strategy will be called plug-in LDA.

On the other hand, we are also interested in non-parametric tree-based classification methods, introduced by Breiman, Friedman, Olshen and Stone in the mid-1980s. This approach can be expected to outperform LDA when the decision boundary is highly non-linear. However, these methods can suffer from a phenomenon known as overfitting, which essentially means that they follow the errors in the training data set too closely and therefore perform poorly on new observations. For this reason, trees can be very sensitive to small changes in the training sample, which may dramatically affect the final estimated tree [11].

The aim of this paper is to compare the performance of SDA and penalized LDA with that of plug-in LDA based on Kashlak’s proposal for precision matrix estimation [13]. Since random forest (RF) is a very popular tree-based classification method, it is also included for comparative purposes [2; 6; 14].

This paper is organized as follows. We review LDA, SDA and penalized LDA in Section 2, and in Section 3 we present our proposal, plug-in LDA. Section 4 describes the accuracy assessment measures used to compare the classifiers. A simulation study and applications to real hyperspectral images are presented in Section 5, and Section 6 contains the conclusions.

2 REVIEW OF LDA AND PENALIZED LDA

There are three distinct arguments that result in the LDA classifier: the multivariate Gaussian model, Fisher’s discriminant problem, and the optimal scoring problem.

Consider K populations (classes) on which a p-dimensional random vector x is defined, and assume that $x \sim N(\mu_k, \Sigma_w)$, where $\mu_k$ is the mean vector of class k and $\Sigma_w$ is the pooled within-class covariance matrix common to all K classes. The Bayesian approach classifies a new observation $x^*$ into the class k that minimizes the (prior-adjusted) Mahalanobis distance

$$\delta_k(x^*) = (x^* - \hat{\mu}_k)'\,\hat{\Sigma}_w^{-1}\,(x^* - \hat{\mu}_k) - 2\log \pi_k \qquad (2.1)$$

where $\hat{\mu}_k$ and $\hat{\Sigma}_w$ denote the maximum likelihood estimates of $\mu_k$ and $\Sigma_w$, obtained from a training sample $x_1, \ldots, x_n$.
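
For illustration, a minimal base-R sketch of rule (2.1) is given below; the objects mu_hat (a p × K matrix of class means), Omega_hat (an estimate of $\Sigma_w^{-1}$) and prior (the class prior probabilities) are hypothetical inputs.

```r
# Minimal sketch of the LDA rule in (2.1): assign x_new to the class with the
# smallest prior-adjusted Mahalanobis distance.
classify_lda <- function(x_new, mu_hat, Omega_hat, prior) {
  K <- ncol(mu_hat)
  delta <- sapply(seq_len(K), function(k) {
    d <- x_new - mu_hat[, k]
    drop(t(d) %*% Omega_hat %*% d) - 2 * log(prior[k])
  })
  which.min(delta)  # index of the predicted class
}
```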

LDA, considered as arising from Fisher’s discriminant problem, consists in finding a set of vectors $\hat{\beta}_1, \hat{\beta}_2, \ldots$, called discriminant vectors, that sequentially solve

$$\underset{\beta_k \in \mathbb{R}^p}{\text{maximize}}\ \ \beta_k'\hat{\Sigma}_b\beta_k \quad \text{subject to}\ \ \beta_k'\hat{\Sigma}_w\beta_k = 1\ \text{ and }\ \beta_j'\hat{\Sigma}_w\beta_k = 0,\ j < k, \qquad (2.2)$$

where $\hat{\Sigma}_b$ is the standard estimate of the between-class covariance matrix. Since $\hat{\Sigma}_b$ has rank at most K − 1, there exist at most K − 1 non-trivial solutions to (2.2), $\hat{\beta}_1, \ldots, \hat{\beta}_{K-1}$, called the discriminant vectors. The classification rule reduces to projecting a new observation onto the space spanned by the discriminant vectors and assigning it to the class whose projected mean is closest.
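
Problem (2.2) can be solved through the eigendecomposition of $\hat{\Sigma}_w^{-1}\hat{\Sigma}_b$. A short base-R sketch, assuming n > p so that $\hat{\Sigma}_w$ is invertible (Sw, Sb and K are hypothetical inputs):

```r
# Discriminant vectors as the leading eigenvectors of Sw^{-1} Sb,
# rescaled so that beta' Sw beta = 1.
fisher_directions <- function(Sw, Sb, K) {
  eig <- eigen(solve(Sw) %*% Sb)
  B <- Re(eig$vectors[, seq_len(K - 1), drop = FALSE])  # at most K - 1 directions
  apply(B, 2, function(b) b / sqrt(drop(t(b) %*% Sw %*% b)))
}
```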

A third formulation that yields the LDA classification rule is optimal scoring, used by [5], which transforms the classification problem into a regression problem by solving

$$\underset{\theta_k,\,\beta_k}{\text{minimize}}\ \ \|Y\theta_k - X\beta_k\|^2 \quad \text{subject to}\ \ \theta_k'Y'Y\theta_k = 1, \qquad (2.3)$$

where Y = (y_{ik}) denotes an n × K indicator matrix with y_{ik} = 1 if the i-th observation belongs to the k-th class and y_{ik} = 0 otherwise, and θ_k is a K-dimensional vector of class scores. The solutions $\hat{\beta}_k$ of (2.3) are proportional to the solutions of (2.2) (see [19]).
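
To see the link with regression more explicitly: for a fixed score vector θ_k satisfying the constraint, the β_k minimizing (2.3) is simply the least-squares regression of the scored response Yθ_k on X. A minimal sketch of this inner step (X, Y and theta_k are hypothetical objects, with n > p and a centered X assumed):

```r
# One inner step of optimal scoring (2.3) for a fixed score vector theta_k:
# enforce the normalization, then regress the scored response on X.
theta_k <- theta_k / sqrt(drop(t(theta_k) %*% crossprod(Y) %*% theta_k))  # theta' Y'Y theta = 1
beta_k  <- qr.solve(X, Y %*% theta_k)                                      # least-squares fit
```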

Among the proposals to extend LDA to high-dimensional problems, those that produce more interpretable models are of particular interest; we need extensions that perform variable selection and classification simultaneously. In the linear regression model, the lasso is a regularization method that, by imposing a penalty on the size of the estimated parameter vector, shrinks some coefficient estimates exactly to zero and thus performs variable selection. Using this approach, [5] propose SDA by adding a penalization to (2.3):

$$\underset{\theta_k,\,\beta_k}{\text{minimize}}\ \ \|Y\theta_k - X\beta_k\|^2 + \gamma\,\beta_k'\Omega\beta_k + \lambda\|\beta_k\|_1 \quad \text{subject to}\ \ \theta_k'Y'Y\theta_k = 1,\ \ \theta_k'Y'Y\theta_i = 0,\ i < k, \qquad (2.4)$$

where Ω is a positive definite matrix and γ and λ are nonnegative tuning parameters. SDA produces sparse discriminant vectors, since for λ large enough some entries of β_k are shrunk exactly to zero.

Penalized Fisher’s linear discriminant, proposed by [20], is based on imposing an ℓ1 penalty in equation (2.2):

$$\underset{\beta_k}{\text{maximize}}\ \ \beta_k'\hat{\Sigma}_b\beta_k - \lambda\|\beta_k\|_1 \quad \text{subject to}\ \ \beta_k'\tilde{\Sigma}_w\beta_k = 1\ \text{ and }\ \beta_j'\tilde{\Sigma}_w\beta_k = 0,\ j < k, \qquad (2.5)$$

where $\tilde{\Sigma}_w$ is some full-rank estimate of $\Sigma_w$, such as $\tilde{\Sigma}_w = \mathrm{diag}(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_p^2)$, where $\hat{\sigma}_j^2$ is the j-th diagonal element of $\hat{\Sigma}_w$. This method also produces sparse discriminant vectors. Although there is a correspondence between the critical points of problems (2.4) and (2.5), the solutions are not necessarily the same [20].

3 PLUG-IN LDA

In high-dimensional settings, to estimate the precision matrix associated with a multivariate Gaussian distribution it is usual to assume that it is sparse. The most common estimation methods employ ℓ1-penalized maximum likelihood [10; 8; 4]. As an alternative, [13] proposes a novel distribution-free estimator that controls the false positive rate in the selection of the non-zero entries of the estimated precision matrix. This proposal starts from an initial estimator $\hat{\Omega}_0$, such as the debiased graphical lasso [12] or the debiased ridge estimator [3], and, iteratively using a binary search procedure, locates the densest estimator near $I_p$ or, alternatively, the sparsest estimator close to $\hat{\Omega}_0$. Plugging such a sparse estimate into the Mahalanobis distance, we propose a new sparse version of LDA, which we will call plug-in LDA.
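
To make the plug-in strategy concrete, the sketch below estimates a sparse precision matrix and plugs it into rule (2.1). The graphical lasso (R package glasso) is used here only as a stand-in sparse estimator because its interface is widely known; in the paper itself the estimate is obtained with Kashlak’s procedure. The objects Xtr, ytr, Xte and the value rho = 0.1 are hypothetical.

```r
library(glasso)

# Plug-in LDA sketch: replace the ML precision matrix in the Mahalanobis
# distance (2.1) by a sparse estimate (here: graphical lasso as a stand-in).
plugin_lda_fit <- function(Xtr, ytr, rho = 0.1) {
  classes <- sort(unique(ytr))
  mu_hat  <- sapply(classes, function(k) colMeans(Xtr[ytr == k, , drop = FALSE]))
  Xc      <- Xtr - t(mu_hat)[match(ytr, classes), ]   # within-class centering
  S_w     <- crossprod(Xc) / nrow(Xtr)                # pooled ML covariance
  Omega   <- glasso(S_w, rho = rho)$wi                # sparse precision estimate
  prior   <- as.numeric(table(ytr)[as.character(classes)]) / length(ytr)
  list(mu = mu_hat, Omega = Omega, prior = prior, classes = classes)
}

plugin_lda_predict <- function(fit, Xte) {
  scores <- apply(Xte, 1, function(x)
    sapply(seq_along(fit$classes), function(k) {
      d <- x - fit$mu[, k]
      drop(t(d) %*% fit$Omega %*% d) - 2 * log(fit$prior[k])
    }))
  fit$classes[apply(scores, 2, which.min)]
}
```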

4 ACCURACY ASSESSMENT MEASURES

In selecting the best classification algorithm, a very important task is the choice of appropriate performance evaluation measures. Accuracy assessment is traditionally conducted using the sample confusion matrix, in which the classification results on the validation data set are compared to the “ground truth.” Based on the confusion matrix, a variety of measures can be calculated, including Cohen’s kappa coefficient [8] and the F1 score [18].

Cohen’s kappa statistic is a chance-corrected measure of agreement. Its values lie in [−1, 1], with 1 representing complete agreement and 0 meaning no agreement beyond chance (independence). A negative value implies that the agreement is worse than random.

Table 1:
The confusion matrix for a multi-class classification involving k classes.

Using notation similar to Cohen (1960), the kappa coefficient of agreement, κ, is estimated from the confusion matrix as

$$\hat{\kappa} = \frac{p_0 - p_e}{1 - p_e} \qquad (4.1)$$

where p_0 is the proportion of cases correctly classified (i.e., the overall accuracy) and p_e is the expected proportion of cases correctly classified by chance (under the independence hypothesis):

$$p_0 = \frac{\sum_{i=1}^{k} n_{ii}}{N} \qquad (4.2)$$

$$p_e = \frac{\sum_{i=1}^{k} n_{i.}\, n_{.i}}{N^2} \qquad (4.3)$$
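
A base-R sketch of (4.1)-(4.3), computing $\hat{\kappa}$ directly from a k × k confusion matrix of counts:

```r
# Cohen's kappa from a k x k confusion matrix, following (4.1)-(4.3).
kappa_hat <- function(cm) {
  N  <- sum(cm)
  p0 <- sum(diag(cm)) / N                     # overall accuracy
  pe <- sum(rowSums(cm) * colSums(cm)) / N^2  # chance agreement
  (p0 - pe) / (1 - pe)
}

# Example: cm <- table(reference = y_true, prediction = y_pred); kappa_hat(cm)
```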

Precision and recall (sensitivity) are two further important evaluation measures. For a multiclass problem, the per-class precision and recall can be computed as follows:

$$\mathrm{recall}_i = \frac{n_{ii}}{n_{.i}}, \quad 1 \le i \le k \qquad (4.4)$$

$$\mathrm{precision}_i = \frac{n_{ii}}{n_{i.}}, \quad 1 \le i \le k \qquad (4.5)$$

The F1 value combines precision and recall into a single number, which makes it easier to compare the joint performance of precision and recall across algorithms. The F1 score is the harmonic mean of precision and recall:

$$F1_i = \frac{2}{\frac{1}{\mathrm{recall}_i} + \frac{1}{\mathrm{precision}_i}}, \quad 1 \le i \le k \qquad (4.6)$$

The weighted F1 is defined as

$$wF1 = \frac{\sum_{i=1}^{k} n_{.i}\, F1_i}{N} \qquad (4.7)$$
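
A companion sketch of (4.4)-(4.7) on the same confusion matrix, taking the columns as the reference classes so that the column totals play the role of $n_{.i}$:

```r
# Per-class precision, recall, F1 and weighted F1 following (4.4)-(4.7);
# cm is a k x k confusion matrix with reference classes in the columns.
f1_scores <- function(cm) {
  N         <- sum(cm)
  recall    <- diag(cm) / colSums(cm)            # n_ii / n_.i
  precision <- diag(cm) / rowSums(cm)            # n_ii / n_i.
  F1        <- 2 / (1 / recall + 1 / precision)  # harmonic mean
  wF1       <- sum(colSums(cm) * F1) / N         # weighted by class support
  list(precision = precision, recall = recall, F1 = F1, weighted_F1 = wF1)
}
```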

To obtain more reliable estimates of these measures, the training data set was randomly divided into three groups, or folds, of equal size (n = p). Each fold in turn is treated as the training set and the remaining folds as the validation set. This process yields three estimates of each measure, which are averaged to obtain a single value [11].
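
A minimal sketch of this resampling scheme; X, y and the helper evaluate(), which is assumed to return a named vector of accuracy measures for a fitted classifier, are hypothetical:

```r
# Three-fold scheme: each fold serves once as the training set and the two
# remaining folds as the validation set; the three estimates are averaged.
set.seed(1)
folds <- sample(rep(1:3, length.out = nrow(X)))
estimates <- sapply(1:3, function(f) {
  train <- folds == f
  evaluate(X[train, ], y[train], X[!train, ], y[!train])  # hypothetical helper
})
rowMeans(estimates)  # averaged kappa, weighted F1, ...
```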

5 APPLICATIONS

Experiments were conducted to test the performance of the SDA, penalized LDA, plug-in LDA and RF algorithms on one simulated image and one real image.

For SDA, plug-in LDA and RF we use the R packages sda, sparseMatEst and randomForest, respectively.
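
As an illustration, the RF classifier can be fitted to the labeled training pixels and applied to the whole image as follows (Xtr, ytr and Xte are hypothetical objects; the calls for the sparse LDA variants depend on the interfaces of the respective packages and are omitted):

```r
library(randomForest)

# Random forest on the training pixels; ntree left at a common default.
rf  <- randomForest(x = Xtr, y = factor(ytr), ntree = 500)
map <- predict(rf, Xte)   # predicted class for every pixel to be mapped
```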

5.1 Simulated Data

The first experimental data set (simulated image) consists of 100 × 100 pixels, 200 bands and four classes. Each pixel x_{ij}, 1 ≤ i, j ≤ 100, is a p-dimensional random vector, where p is the number of bands, and we assume that $x_{ij} \sim N(\mu_k, \Sigma)$ if pixel (i, j) belongs to class k, 1 ≤ k ≤ 4. Figures 1a and 1b show an ideal class image with four regions, acting as the reference image, and the RGB composition of the simulated random image under the normality assumption. The size of the training set is 160 (40 pixels of each class). Table 2 shows the Frobenius norm and the estimated kappa coefficient for the five classifiers; the validation is made with the entire image. Figure 2 and Table 2 show that penalized LDA has the best performance, followed by SDA and plug-in LDA, while RF has the worst performance.
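
A sketch of how such an image can be generated; the class means, the common covariance and the layout of the reference regions below are illustrative choices, not necessarily those used in the experiment:

```r
library(MASS)

# Simulate a 100 x 100 image with p = 200 bands and four Gaussian classes,
# stored as a (10000 x p) pixel matrix.
set.seed(1)
p      <- 200
Sigma  <- diag(p)                                               # illustrative covariance
mu     <- sapply(1:4, function(k) rep(0.5 * (k - 1), p))        # p x 4 class means
labels <- as.vector(matrix(rep(1:4, each = 2500), nrow = 100))  # reference classes (four blocks)
pixels <- matrix(NA_real_, nrow = length(labels), ncol = p)
for (k in 1:4) {
  idx <- which(labels == k)
  pixels[idx, ] <- mvrnorm(length(idx), mu[, k], Sigma)
}

# Training set: 40 pixels per class (160 in total), as in the experiment.
train_idx <- unlist(lapply(1:4, function(k) sample(which(labels == k), 40)))
```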

Figure 1:
(a) Ideal class image with four regions. (b) RGB-composition of simulated random image assuming a normal multivariate distribution.

Figure 2:
Classification results for the simulated image.

Table 2:
Frobenius norm (estimated kappa coefficient).

5.2 Real Data

Figure 3 shows a 209 × 167 extract of an EO-1 Hyperion image obtained from http://eo1.usgs.gov. The study area is EO1H2290822012007110P11 T. 44 non-informative bands were removed, as well as the water absorption bands 120-132, 165-182, 185-187 and 221-224, leaving 160 bands for further analysis. Based on prior knowledge of the area, it was decided to classify the image into four classes: water, urban area, high vegetation index area and low vegetation index area (blue, white, green and brown, respectively). The size of the training sample is 160.

Figure 3:
Real image: false color composite image (R-G-B = band 120-60-2).

Figure 5 shows the thematic maps produced by the five classifiers. When executing the lda function of the R package MASS, the warning “variables are collinear” appears. This problem, caused by a training sample that is too small to estimate the covariance matrix, produces an unacceptable thematic map (see Figure 5a).

Figure 4:
Google Maps extract: (a) city area, (b) mountain area.

Figure 5:
Classification results for the real image.

If the algorithms are ranked according to the accuracy measures presented in Table 3, the first ones are SDA and penalized LDA, followed by plug-in LDA and finally RF. A visual comparison of the thematic maps (see Figures 5c, 5d and 5e) with the ground truth (Figures 4a and 4b) suggests that SDA had the best performance, being very similar to penalized LDA and plug-in LDA, while RF had the worst performance.

Table 3:
Estimated kappa coefficient and weighted F1 score averaged over the 3 folds.

6 CONCLUSION

This paper proposes a new version of sparse LDA, called plug-in LDA, for the supervised classification of hyperspectral images. It consists of replacing the maximum likelihood estimator of the precision matrix used in the Mahalanobis distance with a recently proposed sparse estimator. The results obtained show that plug-in LDA outperforms RF and behaves similarly to SDA and penalized LDA, in both the real and the simulated image.

In the context of hyperspectral image classification, extracting totally pure training samples is difficult, since there are often incorrectly labeled pixels, which can have a strong negative impact on the classification result [16; 17]. The main advantage of the plug-in strategy is the possibility of replacing, in the Mahalanobis distance, the estimators of both the location and the covariance matrix with robust alternatives, which could turn plug-in LDA into a robust classification algorithm.

REFERENCES

  • 1
    J. Bai & S. Shi. Estimating High Dimensional Covariance Matrices and Its Applications. Annals of Economics and Finance, 12(2) (2011), 199-215.
  • 2
    L. Breiman. Random forests. Machine Learning, 45 (2001), 5-32.
  • 3
    P. Buhlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19 (2015), 1212-1242.
  • 4
    T. Cai & W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106 (2011), 672-684.
  • 5
    L. Clemmensen, T. Hastie, D. Witten & B. Ersbøll. Sparse Discriminant Analysis. Technometrics, 53(4) (2011), 406-413. doi:10.1198/TECH.2011.08118.
  • 6
    T.N. Do, P. Lenca, S. Lallich & N.K. Pham. Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees. Advances in Knowledge Discovery and Management, 1 (2009).
  • 7
    J. Fan, Y. Feng & Y. Wu. Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2) (2009), 521-541. doi:10.1214/08-AOAS215.
  • 8
    O. Banerjee, L. El Ghaoui & A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9 (2008), 485-516.
  • 9
    J. Fan, Y. Liao & H. Liu. An overview on the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1) (2016), 1-34. doi:10.1111/ectj.12061.
  • 10
    J. Friedman, T. Hastie & R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3) (2007), 432-441.
  • 11
    G. James, D. Witten, T. Hastie & R. Tibshirani. “An Introduction to Statistical Learning with Applications in R”. Springer, 2 ed. (2017).
  • 12
    J. Jankova & S. Van De Geer. Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics, 9 (2015), 1205-1229.
  • 13
    A. Kashlak. Non-asymptotic error controlled sparse high dimensional precision matrix estimation. arXiv:1903.10988 [stat.ME], (2019).
  • 14
    C. Li. The Application of High-dimensional Data Classification by Random Forest based on Hadoop Cloud Computing Platform. Chemical Engineering Transactions, 51 (2016).
  • 15
    W. Liu & X. Luo. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis, 135 (2015), 153-162. doi:10.1016/j.jmva.2014.11.005.
  • 16
    M. Popov, S. Alpert, V. Podorvan, M. Topolnytskyi & S. Mieshkov. Method of Hyperspectral Satellite Image Classification under Contaminated Training Samples Based on Dempster-Shafer’s Paradigm. Central European Researchers Journal, 1 (2015), 82-93.
  • 17
    L. Qingming, Y. Haiou & J. Junjun. Label Noise Cleansing with Sparse Graph for Hyperspectral Image Classification. Remote Sensing, 11 (2019), 1116. doi:10.3390/rs11091116.
  • 18
    Y. Sasaki. The truth of the F-measure. Teach Tutor Mater, (2007), 1-5.
  • 19
    T. Hastie, A. Buja & R. Tibshirani. Penalized Discriminant Analysis. The Annals of Statistics, 23(1) (1995), 73-102.
  • 20
    D. Witten & R. Tibshirani. Penalized classification using Fisher’s linear discriminant. J R Stat Soc Series B Stat Methodol, 73(5) (2011), 753-772. doi:10.1111/j.1467-9868.2011.00783.x.
  • 21
    H. Zou. Classification with high dimensional features. WIREs Computational Statistics, 11 (2019). doi:10.1002/wics.1453.

Publication Dates

  • Publication in this collection
    05 Sept 2022
  • Date of issue
    Jul-Sep 2022

History

  • Received
    30 Sept 2021
  • Accepted
    24 Mar 2022