Influence of Wavelet Transform Settings on NIR and MIR Spectrometric Analyses of Diesel, Gasoline, Corn and Wheat

This paper investigates the influence of wavelet family and length, as well as the number of resolution levels, on the performance of models obtained by multivariate calibration in the wavelet domain. Twenty-one physical and chemical properties of diesel, gasoline, corn and wheat samples were determined by mid- and near-infrared spectrometry using partial least squares (PLS) and stepwise regression (SR) in the original and wavelet domains. With a suitable choice of the wavelet transform parameters, average RMSEP reductions of 8.2% (PLS) and 27.0% (SR) were obtained with respect to the original domain. However, the SR models were markedly sensitive to the choice of transform parameters. In this case, an analysis of variance indicated that the number of resolution levels is the most important factor to be considered.


Introduction
In the context of multivariate calibration, the wavelet transform (WT) can be used to compress the data set prior to the use of regression techniques such as principal component regression (PCR) 11 or partial least squares (PLS). 12 Variable selection algorithms may also be employed to choose an appropriate subset of wavelet coefficients for use with multiple linear regression (MLR). 8,13,14 Although several papers have been published on the use of WT for multivariate calibration, the choice of a suitable wavelet for a particular application is still an open problem. In fact, unlike the Fourier transform, which is restricted to the use of sine and cosine basis functions, WT can be implemented with a wide variety of wavelets. 2,4 In signal compression and denoising applications, the choice of wavelet could be guided by the minimum description length (MDL) criterion, as described elsewhere. 4 However, such a criterion does not take into account the x-y relationship between the instrumental responses and the property of interest for multivariate calibration.
The choice of wavelet usually entails the selection of a family, such as Daubechies (db), Symlet (sym) and Coiflet (coif), as well as an order (i.e., wavelet length) within that family. 2,4,15 Some authors opted to test several wavelets and choose the most appropriate one on the basis of the performance of the resulting model. 10,16,17 Chalus et al., 16 for instance, tested the db2, db6 and sym6 wavelets. Eriksson et al. 17 employed db4, sym8 and coif2. Nicolai et al. 10 tested 16 wavelets from the Daubechies (db2, db4, db6, db8, db10, db18), Symlet (sym2, sym4, sym6, sym8, sym10) and Coiflet (coif1, coif2, coif3, coif4, coif5) families. Other authors only reported the use of a single wavelet in their work, such as sym8 18,19 or db4. 9,12 In addition to the choice of wavelet, another issue that may affect the results of multivariate calibration is the number of resolution levels employed in WT. 9,16,18,19 However, this aspect has received comparatively little attention from researchers and has even been omitted in some papers. 12,17,20 Therefore, more detailed investigations concerning this issue would be of value.
The present paper investigates the influence of wavelet family, length and number of resolution levels on the predictive performance of a multivariate calibration model. More specifically, the investigation is aimed at determining whether such WT settings have a significant effect on the result and which setting deserves more attention from the analyst. For this purpose, four data sets are employed, namely: near-infrared (NIR) absorbance spectra of 170 diesel samples, mid-infrared (MIR) absorbance spectra of 104 gasoline samples, NIR reflectance spectra of 80 corn samples and NIR reflectance spectra of 100 wheat flour samples. A total of 21 physical and chemical properties are considered. Multivariate calibration is carried out by using PLS, as well as MLR with variable selection by stepwise regression. 21

Background
The wavelet transform can be implemented in a computationally efficient manner by using a digital filter bank algorithm. 22 The basic structure of the filter bank consists of a pair of low-pass and high-pass filters, followed by a dyadic downsampling operation. 14,15,22 The downsampled outputs of the low-pass and high-pass filters are termed approximation and detail coefficients, respectively. The filtering/downsampling operations can be reapplied to the approximation coefficients up to the number M of decomposition levels specified by the analyst. The transform result consists of the approximation coefficients at the last level in addition to all detail coefficients. The low-pass and high-pass filters are typically of finite length and, therefore, each approximation or detail coefficient corresponds to a section of the original signal. This spatial localization feature is one of the main advantages of WT over the Fourier transform. 1,2,15 The Daubechies, Symlet and Coiflet wavelet families 17-19 differ by features such as symmetry and smoothness. 15,23 Each family comprises filters of different length L. The dbN, symN and coifN filters have length L = 2N, 2N and 6N, respectively. Parameter N is termed the filter order. For illustration, Figure 1 presents the Daubechies, Symlet and Coiflet low-pass filters of length 12, 18, 24 and 30. The high-pass filters are obtained by reversing the corresponding low-pass filters and changing the sign of every other element of the sequence. 14,15
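The filter bank structure described above can be illustrated with a minimal Python sketch. It is not the Matlab implementation used in this work: the short db2 filter (length 4) is used only to keep the example compact, the function names are hypothetical, and the choice of which elements receive the sign change is an assumed convention.

```python
import math

# Daubechies db2 low-pass filter (length 4), chosen only for brevity.
SQRT3 = math.sqrt(3.0)
DB2_LOWPASS = [(1 + SQRT3) / (4 * math.sqrt(2)),
               (3 + SQRT3) / (4 * math.sqrt(2)),
               (3 - SQRT3) / (4 * math.sqrt(2)),
               (1 - SQRT3) / (4 * math.sqrt(2))]


def highpass_from_lowpass(lowpass):
    """Reverse the low-pass filter and flip the sign of every other element."""
    reversed_filter = lowpass[::-1]
    # Assumption: the sign flip is applied to the odd-indexed elements.
    return [c if k % 2 == 0 else -c for k, c in enumerate(reversed_filter)]


def dwt_step(signal, lowpass, highpass):
    """One filtering/downsampling stage with constant extension at the borders."""
    pad = len(lowpass) - 1
    extended = [signal[0]] * pad + list(signal) + [signal[-1]] * pad

    def filter_and_downsample(h):
        n_out = len(extended) - len(h) + 1
        filtered = [sum(h[j] * extended[i + len(h) - 1 - j] for j in range(len(h)))
                    for i in range(n_out)]
        return filtered[1::2]  # dyadic downsampling: keep every second sample

    return filter_and_downsample(lowpass), filter_and_downsample(highpass)


def wavelet_decompose(signal, lowpass, levels):
    """Reapply the stage to the approximation coefficients `levels` times."""
    highpass = highpass_from_lowpass(lowpass)
    approximation, details = list(signal), []
    for _ in range(levels):
        approximation, detail = dwt_step(approximation, lowpass, highpass)
        details.append(detail)
    return approximation, details
```

Calling `wavelet_decompose(spectrum, DB2_LOWPASS, levels=M)` returns the approximation coefficients at the last level together with the detail coefficients of every level, which is the transform result described in the text.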

Data sets
Four data sets were employed in the present investigation. The first data set consists of NIR absorbance spectra of 170 diesel samples, recorded in the range 1000-1600 nm with a resolution of 0.5 nm. 24 The second data set comprises MIR absorbance spectra of 104 gasoline samples in the range 2500-15400 nm with a resolution of 2 nm. 25 The third data set is publicly available and consists of NIR reflectance spectra of 80 corn samples in the range 1100-2500 nm with a resolution of 2 nm. 26 Data from instrument "mp5" were employed. The fourth data set, also publicly available, 27 consists of NIR reflectance spectra of 100 wheat flour samples in the range 1000-2500 nm with a resolution of 2 nm. The spectra of the four data sets are presented in Figure 2.
The physical and chemical properties under consideration in each data set are presented in Table 1. Henceforth these properties will be denoted by codes P1-P9.
As can be seen in Figure 2, the spectra display undesirable baseline features. For this reason, first-derivative spectra were calculated by using a Savitzky-Golay filter 28 with a second-order polynomial and an 11-point window. The resulting spectra, which were used throughout the work, are presented in Figure 3.
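As an illustration of the derivative step, the sketch below computes a Savitzky-Golay first derivative in plain Python. It relies on the fact that, for a first- or second-order polynomial fitted over a symmetric window, the derivative at the window centre reduces to a weighted sum with weights proportional to the offset k. The function name is hypothetical and, as a simplification, the points within half a window of each end are dropped rather than extrapolated.

```python
def savgol_first_derivative(y, half_window=5, step=1.0):
    """First derivative by Savitzky-Golay smoothing (polynomial order 1 or 2).

    half_window=5 gives the 11-point window; `step` is the spacing of the
    abscissa (e.g. in nm). Only interior points are returned.
    """
    offsets = range(-half_window, half_window + 1)
    norm = sum(k * k for k in offsets) * step  # equals 110 * step for 11 points
    return [sum(k * y[i + k] for k in offsets) / norm
            for i in range(half_window, len(y) - half_window)]
```

A constant baseline offset added to `y` leaves the output unchanged, because the weights sum to zero.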
Within each data set, 70% of the samples were used for construction of the models. These samples were selected by applying the Kennard-Stone algorithm 29 to the derivative spectra. For PLS calibration, the modelling samples were further divided into a calibration set (50% of the overall data set) and a validation set (20% of the overall data set). The remaining 30% of the samples formed a prediction set, which was used to evaluate the performance of the resulting models.
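The sample selection can be sketched as follows. This is a generic textbook form of the Kennard-Stone algorithm with Euclidean distances, written in plain Python for illustration; the function name is hypothetical, and the way the modelling samples were further split into calibration and validation sets is not reproduced here.

```python
def kennard_stone(samples, n_select):
    """Return the indices of n_select (>= 2) samples chosen by Kennard-Stone."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n = len(samples)
    # Start with the two samples that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda ij: dist2(samples[ij[0]], samples[ij[1]]))
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_select:
        nxt = max(remaining,
                  key=lambda i: min(dist2(samples[i], samples[s]) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

With `spectra` holding one derivative spectrum per sample, `kennard_stone(spectra, round(0.7 * len(spectra)))` would give the modelling samples, and the indices not returned would form the prediction set.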

Wavelet transform
The Daubechies, Symlet and Coiflet wavelet families were considered. 17-19 Four filter lengths were employed, namely 12, 18, 24 and 30, as shown in Figure 1. Constant extension ("smooth padding of order zero") 23,30 was used to reduce border effects at the endpoints of the spectra. The number of decomposition levels was varied from one up to the maximum number L for which the spatial localization features of WT are not lost. This limit situation occurs when the wavelet filters span the entire length of the downsampled approximation coefficients. 31 In the Matlab software, such a maximum number of decomposition levels can be obtained by using function "wmaxlev" from the Wavelet Toolbox. It is worth noting that the maximum number of decomposition levels depends on the filter length and the number of spectral variables. Table 2 summarizes the WT settings employed in the investigation.
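The dependence of the maximum level on filter length and signal length can be made concrete with a short sketch. The rule below, floor(log2(n / (filter length - 1))), is the one documented for `dwt_max_level` in the PyWavelets library; it is assumed here to be a reasonable stand-in for "wmaxlev", which may differ when n / (filter length - 1) is an exact power of two. The function name and the 700-point example are illustrative only.

```python
def max_decomposition_level(n_variables, filter_length):
    """Largest level for which the filter still fits the downsampled signal."""
    level = 0
    # Integer form of floor(log2(n_variables / (filter_length - 1))).
    while (filter_length - 1) * 2 ** (level + 1) <= n_variables:
        level += 1
    return level


# Illustrative use for a hypothetical spectrum of 700 variables.
for length in (12, 18, 24, 30):
    print(length, max_decomposition_level(700, length))
```

Longer filters reach the limit sooner, so fewer decomposition levels are available for them at a given number of spectral variables.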
In order to reduce computational workload in the model-building process, a preliminary compression procedure was applied to the wavelet coefficients. Compression was carried out by discarding the smallest wavelet coefficients (in absolute value) while retaining 99% of explained variance. 32
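One possible reading of this compression step is sketched below; the exact procedure of reference 32 may differ. Here each coefficient position (column) is ranked by its sum of squares over all samples, and the fewest positions whose cumulative sum reaches 99% of the total are kept. The function name and the use of the sum of squares as the variance measure are assumptions.

```python
def select_coefficients(coefficient_matrix, retained_fraction=0.99):
    """Return the sorted column indices to keep.

    coefficient_matrix: list of rows, one row of wavelet coefficients per sample.
    """
    n_columns = len(coefficient_matrix[0])
    energy = [sum(row[j] ** 2 for row in coefficient_matrix)
              for j in range(n_columns)]
    target = retained_fraction * sum(energy)
    kept, cumulative = [], 0.0
    # Visit columns from largest to smallest energy until the target is reached.
    for j in sorted(range(n_columns), key=lambda col: energy[col], reverse=True):
        if cumulative >= target:
            break
        kept.append(j)
        cumulative += energy[j]
    return sorted(kept)
```

The same column indices would then be used for the calibration, validation and prediction samples, so that all models operate on one common set of coefficients.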

Multivariate calibration
PLS and stepwise regression (SR) were employed to build regression models in the wavelet domain, as well as in the original domain. For each property, the number of latent variables in PLS was chosen to minimize the root-mean-square error in the validation set. In stepwise regression, the α-entry and α-exit values 21 were set to 0.01. The results for each parameter under consideration were evaluated in terms of the root-mean-square error of prediction (RMSEP), defined as

$$\mathrm{RMSEP} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left( y_i - \hat{y}_i \right)^2} \qquad (1)$$

where $y_i$ and $\hat{y}_i$ are the reference and predicted parameter values for the i-th sample of the prediction set, which comprises $N_p$ samples.
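For reference, the RMSEP figure of merit (root of the mean squared difference between reference and predicted values over the prediction set) can be computed with a few lines of Python. The function name is hypothetical and the numbers in the usage note are made up for illustration.

```python
import math


def rmsep(reference, predicted):
    """Root-mean-square error of prediction over the prediction set."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])` evaluates the square root of 4/3, since only the third sample contributes an error.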

Software
All calculations were performed in Matlab ® 6.5 R13 by using functions from the Wavelet and Statistics Toolboxes, as well as lab-made routines.

Results
Tables 3 and 4 show the PLS and SR results for the original spectral domain, as well as the best and worst results obtained in the wavelet domain. As can be seen, it is not possible to point out a single wavelet family, level or filter length that systematically leads to the best or worst outcomes. The Diff columns indicate the percentage difference between the RMSEP values obtained in the original and wavelet domains. In the PLS case, the average difference with respect to the original domain was -8.2% and +9.9% for the best and worst wavelet settings, respectively. For SR, the average differences were -27.0% and +34.2%.
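The sign convention of the Diff columns (negative values meaning a lower RMSEP in the wavelet domain) suggests a relative difference taken with respect to the original domain. The sketch below assumes that formula; the function name and the numerical values are illustrative and are not taken from Tables 3 and 4.

```python
def percent_difference(rmsep_original, rmsep_wavelet):
    """Relative RMSEP change, in percent, with respect to the original domain."""
    return 100.0 * (rmsep_wavelet - rmsep_original) / rmsep_original


# Illustrative values only: an RMSEP that drops from 1.00 to 0.73.
print(round(percent_difference(1.00, 0.73), 1))
```

Under this convention, a wavelet-domain model that performs worse than the original-domain model yields a positive Diff value.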
These results indicate that the wavelet transform may indeed be useful to improve the predictive ability of PLS and SR models. However, the SR outcome is more sensitive to the choice of wavelet settings as compared to PLS. Such a finding can be interpreted in two ways. On the one hand, it may be argued that the use of SR in the wavelet domain is risky in that poor results may be obtained given an inadequate choice of WT settings. On the other hand, the potential gains for SR may be significant. In fact, a comparison between Tables 3 and 4 reveals that the best wavelet settings for stepwise regression provide results that are superior, in most cases, to those obtained by PLS (either in the original or wavelet domains).
In light of these findings, it can be concluded that the choice of WT settings plays a more important role for SR than it does for PLS. In order to further investigate the influence of WT settings on the SR outcome, an analysis of variance (ANOVA) 33 was carried out for each parameter under consideration. For this purpose, the RMSEP value was adopted as the response variable. The wavelet family (Daubechies, Symlet, Coiflet), filter length (12, 18, 24, 30) and number of resolution levels (one up to the maximum number L) were the factors under analysis.
Figure 4 presents the ANOVA results obtained for each property and factor (WT setting) under consideration. The percentage difference between the RMSEP values obtained in the original and wavelet domains is indicated in the Diff % columns of Tables 3 and 4.
The effect of a given factor on RMSEP is significant if the resulting p-value is small. 33 It is worth noting that the vertical axis in Figure 4 corresponds to (1 − p). Therefore, significant effects are indicated by large bars.
As can be seen, for 11 out of the 21 properties, at least one factor displayed a significant effect at a confidence level of 95% (horizontal dashed line in Figure 4). This result again indicates that the choice of appropriate WT settings is indeed important in the SR framework. It is interesting to notice that most significant effects are associated with the number of resolution levels, rather than wavelet family or filter length. In fact, the number of decomposition levels had a significant effect for seven properties, as compared to five properties for wavelet family and only two properties for filter length. In addition, it is worth noting that the number of levels was the most influential factor for 12 out of the 21 properties. Therefore, the analyst is advised to pay special attention to the choice of resolution levels when building the SR model in the wavelet domain.

Conclusions
This paper investigated the influence of three WT settings (wavelet family, filter length and resolution levels) on the predictive performance of PLS and SR models for NIR/MIR spectrometric analyses of diesel, gasoline, corn and wheat. A total of 21 physical and chemical properties were considered in this study.
The results show that the choice of WT settings does affect the results of both PLS and SR, providing the potential for gains with respect to modelling in the original spectral domain. Through proper selection of those settings, average RMSEP reductions of 8.2% (PLS) and 27.0% (SR) were obtained with respect to the original domain. However, the SR outcome exhibited considerable sensitivity to the choice of WT settings: an inadequate choice could lead to an average RMSEP increase of 34.2%. In particular, an analysis of variance revealed that the number of resolution levels is the most important factor to be considered in this framework.

Table 1.
Physical and chemical properties under consideration and their respective range in each data set. ibp: initial boiling point; T10, T50, T85, T90: temperatures at which 10, 50, 85 and 90% of the sample has evaporated, respectively; fbp: final boiling point.

Table 3.
PLS results in the original and wavelet domains

Table 4.
Stepwise regression results in the original and wavelet domains