Influence of wavelet transform settings on NIR and MIR spectrometric analyses of diesel, gasoline, corn and wheat

SHORT REPORT

Luiz Alberto Pinto(I); Roberto K. H. Galvão(I); Mário César U. Araújo(II),*

*e-mail: laqa@quimica.ufpb.br

(I) Instituto Tecnológico de Aeronáutica, Divisão de Engenharia Eletrônica, 12228-900 São José dos Campos-SP, Brazil

(II) Universidade Federal da Paraíba, CCEN, Departamento de Química, CP 5093, 58051-970 João Pessoa-PB, Brazil

ABSTRACT

This paper investigates the influence of wavelet family, length and number of resolution levels on the performance of multivariate calibration models obtained in the wavelet domain. Twenty-one physical and chemical properties of diesel, gasoline, corn and wheat were determined by near/mid infrared spectrometry employing partial least-squares (PLS) and stepwise regression (SR) in the original and wavelet domains. Through proper selection of the wavelet transform settings, average RMSEP reductions of 8.2% (PLS) and 27.0% (SR) were obtained with respect to the original domain. However, the SR models presented considerable sensitivity with respect to the choice of transform settings. In this case, an analysis of variance indicated that the number of resolution levels is the most important factor to be considered.

Keywords: multivariate calibration, wavelet transform, analysis of variance, mid and near infrared spectrometry, food and fuel analysis

Introduction

Over the past two decades, the wavelet transform (WT)1,2 has been employed in a variety of chemometrics applications such as denoising,3,4 signal compression,5 baseline correction,6 classification7 and multivariate calibration.8-10 In the context of multivariate calibration, WT can be used to compress the data set prior to the use of regression techniques such as principal component regression (PCR)11 or partial least squares (PLS).12 Variable selection algorithms may also be employed to choose an appropriate subset of wavelet coefficients for use with multiple linear regression (MLR).8,13,14

Although several papers have been published on the use of WT for multivariate calibration, the choice of a suitable wavelet for a particular application is still an open problem. In fact, unlike the Fourier transform, which is restricted to the use of sine and cosine basis functions, WT can be implemented with a wide variety of wavelets.2,4 In signal compression and denoising applications, the choice of wavelet could be guided by the minimum description length (MDL) criterion, as described elsewhere.4 However, such a criterion does not take into account the x-y relationship between the instrumental responses and the property of interest for multivariate calibration.

The choice of wavelet usually entails the selection of a family, such as Daubechies (db), Symlet (sym) and Coiflet (coif), as well as an order (i.e., wavelet length) within that family.2,4,15 Some authors opted to test several wavelets and choose the most appropriate one on the basis of the performance of the resulting model.10,16,17 Chalus et al.,16 for instance, tested the db2, db6 and sym6 wavelets. Eriksson et al.17 employed db4, sym8 and coif2. Nicolai et al.10 tested 16 wavelets from the Daubechies (db2, db4, db6, db8, db10, db18), Symlet (sym2, sym4, sym6, sym8, sym10) and Coiflet (coif1, coif2, coif3, coif4, coif5) families. Other authors only reported the use of a single wavelet in their work, such as sym8,18,19 or db4.9,12

In addition to the choice of wavelet, another issue that may affect the results of multivariate calibration is the number of resolution levels employed in WT.9,16,18,19 However, this aspect has received comparatively little attention from researchers and has even been omitted in some papers.12,17,20 Therefore, more detailed investigations concerning this issue would be of value.

The present paper investigates the influence of wavelet family, length and number of resolution levels on the predictive performance of a multivariate calibration model. More specifically, the investigation is aimed at determining whether such WT settings have a significant effect on the result and which setting deserves the most attention from the analyst. For this purpose, four datasets are employed, namely: near-infrared (NIR) absorbance spectra of 170 diesel samples, mid-infrared (MIR) absorbance spectra of 104 gasoline samples, NIR reflectance spectra of 80 corn samples and NIR reflectance spectra of 100 wheat flour samples. A total of 21 physical and chemical properties are considered. Multivariate calibration is carried out by using PLS, as well as MLR with variable selection by stepwise regression.21

Background

The wavelet transform can be implemented in a computationally efficient manner by using a digital filter bank algorithm.22 The basic structure of the filter bank consists of a pair of low-pass and high-pass filters, followed by a dyadic downsampling operation.14,15,22 The downsampled outputs of the low-pass and high-pass filters are termed approximation and detail coefficients, respectively. The filtering/downsampling operations can be reapplied to the approximation coefficients up to the number M of decomposition levels specified by the analyst. The transform result consists of the approximation coefficients at the last level in addition to all detail coefficients. The low-pass and high-pass filters are typically of finite length, and, therefore, each approximation or detail coefficient corresponds to a section of the original signal. This spatial localization feature is one of the main advantages of WT over the Fourier transform.1,2,15
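The filter bank recursion can be illustrated with a short sketch. The code below is only a minimal example under stated assumptions: it uses the two-tap Haar filters and periodic extension at the borders (both chosen here for brevity, not taken from the paper), expects signals whose length stays even at every level, and the function names `dwt_level` and `wavedec` are hypothetical.

```python
import math

def dwt_level(signal, low, high):
    """One filtering + dyadic downsampling step (periodic border extension)."""
    n = len(signal)
    approx, detail = [], []
    for start in range(0, n, 2):  # dyadic downsampling: keep every other output
        approx.append(sum(low[k] * signal[(start + k) % n] for k in range(len(low))))
        detail.append(sum(high[k] * signal[(start + k) % n] for k in range(len(high))))
    return approx, detail

def wavedec(signal, low, high, levels):
    """Reapply dwt_level to the approximation coefficients `levels` times."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, d = dwt_level(approx, low, high)
        details.append(d)
    # Transform result: last-level approximation plus all detail coefficients
    return approx, details

s = 1.0 / math.sqrt(2.0)
haar_low, haar_high = [s, s], [s, -s]   # assumption: Haar filters for illustration
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = wavedec(x, haar_low, haar_high, levels=2)
```

With orthogonal filters such as these, the sum of squares of all returned coefficients equals that of the input signal, which is a convenient sanity check for any implementation.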

The most common wavelet filters employed in multivariate calibration belong to the Daubechies, Symlet and Coiflet families.9,10,12,16-19 These families differ by features such as symmetry and smoothness.15,23 Each family comprises filters of different length L. The dbN, symN and coifN filters have length L = 2N, 2N and 6N, respectively. Parameter N is termed the filter order. For illustration, Figure 1 presents the Daubechies, Symlet and Coiflet low-pass filters of length 12, 18, 24 and 30. The high-pass filters are obtained by reversing the corresponding low-pass filters and changing the sign of every other element of the sequence.14,15
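As a small illustration of the length rule and of the low-pass to high-pass construction, the sketch below uses the closed-form db2 coefficients (chosen only because they are short; the paper works with lengths 12 to 30). The function names are hypothetical, and the choice of which elements change sign is one common convention, assumed here.

```python
import math

def filter_length(family, order):
    """Length L of the dbN, symN and coifN filters (2N, 2N and 6N)."""
    factor = {"db": 2, "sym": 2, "coif": 6}[family]
    return factor * order

def highpass_from_lowpass(low):
    """Reverse the low-pass filter and flip the sign of every other element."""
    reversed_low = low[::-1]
    return [(-1) ** k * c for k, c in enumerate(reversed_low)]

# db2 low-pass filter in closed form (illustrative example only)
r3 = math.sqrt(3.0)
db2_low = [c / (4.0 * math.sqrt(2.0)) for c in (1 + r3, 3 + r3, 3 - r3, 1 - r3)]
db2_high = highpass_from_lowpass(db2_low)

# Orders that give the filter lengths 12, 18, 24 and 30 in each family
db_orders = [n for n in range(1, 21) if filter_length("db", n) in (12, 18, 24, 30)]
coif_orders = [n for n in range(1, 21) if filter_length("coif", n) in (12, 18, 24, 30)]
```

The resulting high-pass filter sums to zero and is orthogonal to the low-pass filter, as expected for this kind of filter pair.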


Experimental

Data sets

Four data sets were employed in the present investigation. The first data set consists of NIR absorbance spectra of 170 diesel samples, recorded in the range 1000-1600 nm with resolution of 0.5 nm.24 The second data set comprises MIR absorbance spectra of 104 gasoline samples in the range 2500-15400 nm with resolution of 2 nm.25 The third data set is publicly available and consists of NIR reflectance spectra of 80 corn samples in the range 1100-2500 nm with resolution of 2 nm.26 Data from instrument "mp5" were employed. The fourth data set, also publicly available,27 consists of NIR reflectance spectra of 100 wheat flour samples in the range 1000-2500 nm with resolution of 2 nm. The spectra of the four data sets are presented in Figure 2.


The physical and chemical properties under consideration in each data set are presented in Table 1. Henceforth these properties will be denoted by codes P1-P9.

As can be seen in Figure 2, the spectra display undesirable baseline features. For this reason, first derivative spectra were calculated by using a Savitzky-Golay filter,28 with a 2nd order polynomial and an 11 point window. The resulting spectra, which were used throughout the work, are presented in Figure 3.
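For readers who wish to reproduce this preprocessing step, the following sketch computes a Savitzky-Golay first derivative with a 2nd order polynomial and an 11 point window. It relies on the fact that, for a symmetric window, the least-squares slope at the window centre reduces to a simple weighted sum. The function name is hypothetical, and the treatment of the spectrum endpoints (here, they are simply dropped) is an assumption, since the paper does not state it.

```python
def savgol_first_derivative(y, half_window=5, spacing=1.0):
    """First derivative from a local quadratic fit over 2*half_window + 1 points.

    For a symmetric window the fitted slope at the centre equals
    sum(i * y[centre + i]) / sum(i**2), with i = -half_window..half_window.
    Only interior points are returned (no border extension).
    """
    offsets = range(-half_window, half_window + 1)
    norm = sum(i * i for i in offsets) * spacing
    deriv = []
    for centre in range(half_window, len(y) - half_window):
        deriv.append(sum(i * y[centre + i] for i in offsets) / norm)
    return deriv

# Check on a quadratic sampled every 0.5 units: the derivative of x**2 is 2*x
xs = [0.5 * k for k in range(30)]
ys = [x * x for x in xs]
dy = savgol_first_derivative(ys, half_window=5, spacing=0.5)
```

Because the local model is quadratic, the derivative of any quadratic signal is recovered exactly at the interior points.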


Within each data set, 70% of the samples were used for construction of the models. These samples were selected by applying the Kennard-Stone algorithm29 to the derivative spectra. For PLS calibration, the modelling samples were further divided into a calibration set (50% of the overall dataset) and a validation set (20% of the overall dataset). The remaining 30% of the samples formed a prediction set, which was used to evaluate the performance of the resulting models.
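The Kennard-Stone selection can be sketched as follows. This is a minimal version assuming Euclidean distances between (derivative) spectra; the function name and the toy one-dimensional "spectra" are hypothetical, and the further division of the modelling samples into calibration and validation sets is not shown because the paper does not detail it.

```python
def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n = len(samples)
    # Start with the two most distant samples
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist2(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Squared distance from each remaining sample to its nearest selected sample
    nearest = {i: min(dist2(samples[i], samples[s]) for s in selected) for i in remaining}
    while len(selected) < n_select and remaining:
        # Pick the sample that is farthest from the already selected ones
        chosen = max(remaining, key=lambda i: nearest[i])
        selected.append(chosen)
        remaining.remove(chosen)
        del nearest[chosen]
        for i in remaining:
            nearest[i] = min(nearest[i], dist2(samples[i], samples[chosen]))
    return selected

# Toy data: 10 one-variable "spectra"; 70% go to model building
spectra = [[0.0], [1.0], [2.0], [3.0], [10.0], [4.5], [7.0], [9.0], [5.0], [6.0]]
modelling = kennard_stone(spectra, n_select=7)
prediction = [i for i in range(len(spectra)) if i not in modelling]
```

The samples left out by the selection form the prediction set, so the two index lists never overlap.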

Wavelet transform

The wavelet transform of the derivative spectra was calculated with the Daubechies, Symlet and Coiflet filters, which are the most commonly used in multivariate calibration.9,10,12,16-19 Four filter lengths were employed, namely 12, 18, 24 and 30, as shown in Figure 1. Constant extension ("smooth padding of order zero")23,30 was used to reduce border effects at the endpoints of the spectra. The number of decomposition levels was varied from one up to the maximum number for which the spatial localization features of WT are not lost. This limiting situation occurs when the wavelet filters span the entire length of the downsampled approximation coefficients.31 In the Matlab software, this maximum number of decomposition levels can be obtained by using function "wmaxlev" from the Wavelet Toolbox. It is worth noting that the maximum number of decomposition levels depends on the filter length and the number of spectral variables. Table 2 summarizes the WT settings employed in the investigation.
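The dependence of the maximum number of levels on filter length and number of variables can be sketched as below. The rule used, floor(log2(n / (L - 1))) for n variables and filter length L, is assumed here to match what routines such as "wmaxlev" compute and should be checked against the toolbox documentation. The function name is hypothetical, and the value of 1201 variables is only an illustrative figure.

```python
def max_decomposition_level(n_variables, filter_length):
    """Largest level M such that (filter_length - 1) * 2**M <= n_variables.

    Equivalent to floor(log2(n_variables / (filter_length - 1))), written with
    integers to avoid floating-point rounding at exact powers of two.
    """
    if filter_length < 2:
        raise ValueError("filter_length must be at least 2")
    level = 0
    while (filter_length - 1) * 2 ** (level + 1) <= n_variables:
        level += 1
    return level

# Illustrative case: a spectrum with 1201 variables and the four filter lengths
levels = {L: max_decomposition_level(1201, L) for L in (12, 18, 24, 30)}
```

Longer filters reach the limit sooner, so the range of levels that can be explored shrinks as the filter length grows.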

In order to reduce computational workload in the model-building process, a preliminary compression procedure was applied to the wavelet coefficients. Compression was carried out by discarding the smallest wavelet coefficients (in absolute value) while retaining 99% of explained variance.32
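One possible reading of this compression step is sketched below. It assumes that each wavelet coefficient (a column of the coefficient matrix) is ranked by its sum of squares over the samples, and that the largest coefficients are kept until 99% of the total is reached. This criterion and the function name are assumptions made for illustration; the exact procedure is described in the cited reference.

```python
def select_coefficients(coeff_matrix, retained=0.99):
    """Indices of the coefficients (columns) kept after compression.

    coeff_matrix: list of rows, one row of wavelet coefficients per sample.
    Columns are ranked by their sum of squares over all samples, and the
    largest ones are kept until `retained` of the total is reached.
    """
    n_cols = len(coeff_matrix[0])
    energy = [sum(row[j] ** 2 for row in coeff_matrix) for j in range(n_cols)]
    total = sum(energy)
    order = sorted(range(n_cols), key=lambda j: energy[j], reverse=True)
    kept, cumulative = [], 0.0
    for j in order:
        if cumulative >= retained * total:
            break
        kept.append(j)
        cumulative += energy[j]
    return sorted(kept)

# Hypothetical coefficients for two samples and four wavelet coefficients
coeffs = [[10.0, 0.1, 3.0, 0.05],
          [-9.0, 0.2, 2.5, 0.0]]
kept = select_coefficients(coeffs)
```

The same column indices would then be retained for every spectrum, so that the regression step sees a matrix with fewer variables.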

Multivariate calibration

PLS and stepwise regression (SR) were employed to build regression models in the wavelet, as well as in the original domains. For each property, the number of latent variables in PLS was chosen to minimize the root-mean-square error in the validation set. In stepwise regression, the α-entry and α-exit values21 were set to 0.01. The results for each parameter under consideration were evaluated in terms of the root-mean-square error of prediction (RMSEP) defined as

$$\mathrm{RMSEP} = \sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ and $\hat{y}_i$ are the reference and predicted parameter values for the ith sample of the prediction set, which comprises $N_p$ samples.
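The RMSEP is the square root of the mean squared difference between reference and predicted values over the prediction set. A minimal sketch is given below; the function name and the numerical values are hypothetical.

```python
import math

def rmsep(reference, predicted):
    """Root-mean-square error of prediction over the prediction set."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical reference and predicted values for three prediction samples
value = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```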

Software

All calculations were performed in Matlab® 6.5 R13 by using functions from the Wavelet and Statistics Toolboxes, as well as lab-made routines.

Results

Tables 3 and 4 show the PLS and SR results for the original spectral domain, as well as the best and worst results obtained in the wavelet domain. As can be seen, it is not possible to point out a single wavelet family, level or filter length that systematically leads to the best or worst outcomes. The Diff columns indicate the percentage difference between the RMSEP values obtained in the original and wavelet domains. In the PLS case, the average difference with respect to the original domain was -8.2% and +9.9% for the best and worst wavelet settings, respectively. For SR, the average differences were -27.0% and +34.2%.
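The sign convention of the Diff values (negative for an improvement over the original domain) is consistent with the calculation sketched below. The averaging over properties shown here, a plain mean of the per-property differences, is an assumption, and all numerical values are hypothetical rather than taken from Tables 3 and 4.

```python
def percent_difference(rmsep_wavelet, rmsep_original):
    """Relative RMSEP change with respect to the original domain, in percent."""
    return 100.0 * (rmsep_wavelet - rmsep_original) / rmsep_original

# Hypothetical (original, best wavelet) RMSEP pairs for three properties
pairs = [(2.0, 1.5), (0.40, 0.38), (10.0, 9.0)]
diffs = [percent_difference(wav, orig) for orig, wav in pairs]
average_diff = sum(diffs) / len(diffs)
```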

These results indicate that the wavelet transform may indeed be useful to improve the predictive ability of PLS and SR models. However, the SR outcome is more sensitive to the choice of wavelet settings as compared to PLS. Such a finding can be interpreted in two ways. On the one hand, it may be argued that the use of SR in the wavelet domain is risky in that poor results may be obtained given an inadequate choice of WT settings. On the other hand, the potential gains for SR may be significant. In fact, a comparison between Tables 3 and 4 reveals that the best wavelet settings for stepwise regression provide results that are superior, in most cases, to those obtained by PLS (either in the original or wavelet domains).

In light of these findings, it can be concluded that the choice of WT settings plays a more important role for SR than it does for PLS. In order to further investigate the influence of WT settings in the SR outcome, an analysis of variance (ANOVA)33 was carried out for each parameter under consideration. For this purpose, the RMSEP value was adopted as response variable. The wavelet family (Daubechies, Symlet, Coiflet), filter length (12, 18, 24, 30) and number of resolution levels (from one up to the maximum number allowed for each filter length) were the factors under analysis.

Figure 4 presents the ANOVA results obtained for each property and factor (WT setting) under consideration. The effect of a given factor on RMSEP is significant if the resulting p-value is small.33 It is worth noting that the vertical axis in Figure 4 corresponds to (1 - p). Therefore, significant effects are indicated by large bars.
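To make the role of the p-value more concrete, the sketch below computes the F statistic for a single factor by comparing between-group and within-group variability of RMSEP values. It is a one-way simplification of the multi-factor analysis used in the paper; the function name and the RMSEP values are hypothetical. The p-value would then be obtained from the F distribution with (k - 1, N - k) degrees of freedom using a statistics library, which is not shown here.

```python
def one_way_f_statistic(groups):
    """F statistic of a one-way ANOVA; `groups` is a list of lists of responses."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variability; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical RMSEP values grouped by number of resolution levels (1, 2, 3)
rmsep_by_level = [[0.30, 0.32, 0.31], [0.25, 0.27, 0.26], [0.40, 0.41, 0.42]]
f_value = one_way_f_statistic(rmsep_by_level)
```

A large F value corresponds to a small p-value, and therefore to a tall bar on the (1 - p) scale.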


As can be seen, for 11 out of the 21 properties, at least one factor displayed a significant effect at a confidence level of 95% (horizontal dashed line in Figure 4). This result again indicates that the choice of appropriate WT settings is indeed important in the SR framework. It is interesting to notice that most significant effects are associated with the number of resolution levels, rather than wavelet family or filter length. In fact, the number of decomposition levels had a significant effect on seven properties, as compared to five properties for wavelet family and only two properties for filter length. In addition, it is worth noting that the number of levels was the most influential factor for 12 out of the 21 properties. Therefore, the analyst should pay special attention to the choice of resolution levels when building the SR model in the wavelet domain.

Conclusions

This paper investigated the influence of three WT settings (wavelet family, filter length and resolution levels) on the predictive performance of PLS and SR models for NIR/MIR spectrometric analyses of diesel, gasoline, corn and wheat. A total of 21 physical and chemical properties were considered in this study.

The results show that the choice of WT settings does affect the results of both PLS and SR, providing the potential for gains with respect to modelling in the original spectral domain. In fact, through proper selection of those settings, average RMSEP reductions of 8.2% (PLS) and 27.0% (SR) were obtained with respect to the original domain. However, the SR outcome exhibited considerable sensitivity to the choice of WT settings. In fact, an inadequate choice could lead to an average RMSEP increase of 34.2%. In particular, an analysis of variance revealed that the number of resolution levels is the most important factor to be considered in this framework.

Acknowledgments

This work was partially supported by CAPES (PROCAD Grant 0081/05-1) and CNPq (research fellowships). The authors are also indebted to Dr. Fernanda Araújo Honorato and Mr. Gledson Emídio José for providing the gasoline and diesel data sets, respectively.

Submitted: April 30, 2009

Published online: August 17, 2010

FAPESP has sponsored the publication of this article.

  • 1. Alsberg, B. K.; Woodward, A. M.; Kell, D. B.; Chemom. Intell. Lab. Syst. 1997, 37, 215.
  • 2. Walczak, B.; Wavelets in Chemistry, Elsevier Science: New York, 2000.
  • 3. Galvão, R. K. H.; Filho, H. A. D.; Martins, M. N.; Araújo, M. C. U.; Pasquini, C.; Anal. Chim. Acta 2007, 581, 159.
  • 4. Cai, C. S.; Harrington, P. D.; J. Chem. Inf. Comput. Sci. 1998, 38, 1161.
  • 5. Walczak, B.; Massart, D. L.; Chemom. Intell. Lab. Syst. 1997, 36, 81.
  • 6. Shao, X.; Cai, W.; Pan, Z.; Chemom. Intell. Lab. Syst. 1999, 45, 249.
  • 7. Donald, D.; Coomans, D.; Yvetty, E.; Cozzolino, D.; Gishen, M.; Hancock, T.; Chemom. Intell. Lab. Syst. 2006, 82, 122.
  • 8. Galvão, R. K. H.; José, G. E.; Dantas Filho, H. A.; Araújo, M. C. U.; Silva, E. C.; Paiva, H. M.; Saldanha, T. C. B.; Souza, E. S. O. N.; Chemom. Intell. Lab. Syst. 2004, 70, 1.
  • 9. Díez, I. E.; Sáiz, J. M. G.; Pizarro, C.; Anal. Chim. Acta 2004, 515, 31.
  • 10. Nicolai, B. M.; Theron, K. I.; Lammertyn, J.; Chemom. Intell. Lab. Syst. 2007, 85, 243.
  • 11. Vogt, F.; Tacke, M.; Chemom. Intell. Lab. Syst. 2001, 59, 1.
  • 12. Trygg, J.; Wold, S.; Chemom. Intell. Lab. Syst. 1998, 42, 209.
  • 13. Brown, P. J.; Fearn, T.; Vannucci, M.; J. Am. Stat. Assoc. 2001, 96, 398.
  • 14. Coelho, C. J.; Galvão, R. K. H.; Araújo, M. C. U.; Pimentel, M. F.; Silva, E. C.; Chemom. Intell. Lab. Syst. 2003, 66, 205.
  • 15. Strang, G.; Nguyen, T.; Wavelets and Filter Banks, Wellesley-Cambridge Press: Wellesley, 1996.
  • 16. Chalus, P.; Walter, S.; Ulmschneider, M.; Anal. Chim. Acta 2007, 591, 219.
  • 17. Eriksson, L.; Trygg, J.; Johansson, E.; Bro, R.; Wold, S.; Anal. Chim. Acta 2000, 420, 181.
  • 18. Alsberg, B. K.; Woodward, A. M.; Winson, M. K.; Rowland, J. J.; Kell, D. B.; Anal. Chim. Acta 1998, 368, 29.
  • 19. Tan, H.; Brown, S. D.; Anal. Chim. Acta 2003, 490, 291.
  • 20. Jetter, K.; Depczynski, U.; Molt, K.; Niemöller, A.; Anal. Chim. Acta 2000, 420, 169.
  • 21. Draper, N. R.; Smith, H.; Applied Regression Analysis, 3rd ed., Wiley: New York, 1998.
  • 22. Mallat, S. G.; IEEE Transactions on Pattern Analysis and Machine Intelligence 1989, 11, 674.
  • 23. The Mathworks; Matlab 6.5 User's Guide, Natick, MA, USA.
  • 24. Galvão, R. K. H.; Araújo, M. C. U. In Comprehensive Chemometrics; Brown, S.; Tauler, R.; Walczak, B., eds., Elsevier: Oxford, 2009, vol. 3, pp. 233-283.
  • 25. Honorato, F. A.; Galvão, R. K. H.; Pimentel, M. F.; Barros Neto, B.; Araújo, M. C. U.; Carvalho, F. R.; Chemom. Intell. Lab. Syst. 2005, 76, 65.
  • 26. http://software.eigenvector.com/Data/Corn/index.html, accessed in December 2007.
  • 27. ftp://ftp.clarkson.edu/pub/hopkepk/Chemdata/Kalivas, accessed in December 2007.
  • 28. Beebe, K. R.; Pell, R. J.; Seasholtz, B.; Chemometrics - A Practical Guide, Wiley: New York, 1998.
  • 29. Kennard, R. W.; Stone, L. A.; Technometrics 1969, 11, 137.
  • 30. Santos, R. N. F.; MSc Dissertation, Instituto Tecnológico de Aeronáutica, São José dos Campos, Brasil, 2006.
  • 31. Santos, R. N. F.; Galvão, R. K. H.; Araújo, M. C. U.; Silva, E. C.; Talanta 2007, 71, 1136.
  • 32. Pontes, M. J. C.; Cortez, J.; Galvão, R. K. H.; Pasquini, C.; Araújo, M. C. U.; Coelho, R. M.; Chiba, M. K.; Abreu, M. F.; Madari, B. E.; Anal. Chim. Acta 2009, 642, 12.
  • 33. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. C. C.; Lewi, D. J.; Verbeke, J. E.; Handbook of Chemometrics and Qualimetrics - Part A, Elsevier: Amsterdam, 1997.
Publication Dates

Publication in this collection: 10 Feb 2011
Date of issue: Jan 2011

History

Accepted: 17 Aug 2010
Received: 30 Apr 2009

Sociedade Brasileira de Química Instituto de Química - UNICAMP, Caixa Postal 6154, 13083-970 Campinas SP - Brazil, Tel./FAX.: +55 19 3521-3151 - São Paulo - SP - Brazil
E-mail: office@jbcs.sbq.org.br