Effect of the Subsampling Ratio in the Application of Subagging for Multivariate Calibration with the Successive Projections Algorithm

Este artigo estuda o efeito da razão de subamostragem na abordagem subagging para regressão linear múltipla com seleção de variáveis pelo algoritmo das projeções sucessivas. Para isso, apresentam-se investigações envolvendo dados simulados e também determinação de umidade e proteína em trigo e temperaturas de destilação (T10 e T90), massa específica e enxofre em diesel por espectrometria no infravermelho próximo. Em termos de capacidade de predição e sensibilidade a ruído, os melhores resultados foram obtidos para razões de subamostragem em torno de 40%.


Introduction
The successive projections algorithm (SPA) 1 was developed to select subsets of variables with small multicollinearity for use in multiple linear regression (MLR) models.MLR-SPA has been employed, for example, in spectrometric determination of solubility of solids in beers, 2 glucose in human blood, 3 quality parameters of vegetable oils, 4 phenolic compounds in sea water, 5 sulphur in diesel, 6 and various other applications.A graphic user interface for MLR-SPA is publicly available at http://www.ele.ita.br/~kawakami/spa.
In addition to new analytical applications of MLR-SPA, several works have also been conducted on the implementation aspects of the algorithm itself.Gains in parsimony were achieved, for example, by identifying variables that can be removed from the model without compromising its prediction ability. 7Improvements were also obtained by exploiting the correlation with the dependent variable in the projections phase of MLR-SPA. 8ore recently, a modification was proposed to deal with the presence of unknown interferents in the samples to be analyzed. 9Improvements concerning computational issues have also been reported. 10,11n this context, it has been shown 12 that MLR-SPA results can be improved by using a statistical technique called subagging (subsample aggregating).Such a technique consists of combining different models obtained as the result of a process of subsampling. 13In the MLR-SPA-subagging case, the subsampling procedure can be regarded as a random splitting of the modelling data into calibration and validation sets.In a subsequent work, this approach was adapted for use in a calibration transfer framework. 14For this purpose, transfer samples were inserted in the validation set formed at each subsampling iteration.
An important factor, which was not addressed in the previous MLR-SPA-subagging works, 12,14 concerns the choice of the subsampling ratio employed in the calibration/validation splitting of the modelling samples.
In both investigations, 12,14 this ratio was arbitrarily set to approximately 60% (number of calibration samples/number of calibration plus validation samples).The present paper investigates the effect of the subsampling ratio on the resulting MLR-SPA-subagging model.For this purpose, case studies involving simulated data, as well as nearinfrared (NIR) spectrometric determination of moisture and protein in wheat and T10, T90, specific mass and sulphur in diesel are presented.The results are evaluated in terms of the prediction ability and sensitivity to spectral measurement noise.

MLR-SPA
In the original implementation of MLR-SPA, 1 it is assumed that the N mod samples available for modelling purposes have been divided into a calibration and a validation sets, with N cal and N val samples, respectively (where N mod = N cal + N val ).The instrumental response data of the calibration set are then arranged in a matrix X cal (N cal × K) where N cal and K denote the number of samples and variables, respectively.A series of projection operations involving the columns of matrix X cal is then employed to form K chains with M variables each, where M = min(N cal -1, K).The first element of the k th chain corresponds to variable x k .Each subsequent element in the chain is selected in order to display the least collinearity with the previous ones.Subsets of variables extracted from the chains are then evaluated on the basis of the prediction ability of the resulting MLR models in the validation set.The best subset of variables is then chosen according to a suitable performance criterion, such as the root-mean-square error of validation.Finally, a statistical hypothesis test is employed to remove variables from this subset without compromising the prediction ability of the MLR model. 7

MLR-SPA-subagging
The MLR-SPA outcome depends on the choice of calibration and validation sets from the available samples.In the MLR-SPA-subagging approach, 12 this aspect is exploited to generate a pool of different MLR models which are then aggregated into an ensemble model.Each individual model is obtained by randomly splitting the modelling samples into calibration and validation sets and then applying MLR-SPA.At the end, model aggregation is carried out by co-averaging the individual model predictions as (1)   where ŷ av and ŷ (n) denotes the predictions of the ensemble model and the nth individual model, respectively.In what follows, the number of subsampling iterations (i.e., the number of aggregated models) will be denoted by P, as in equation 1.Typically, it has been found 12 that the MLR-SPA-subagging procedure tends to converge after the aggregation of approximately P = 30 individual models.
It is worth noting that the co-averaging procedure expressed in equation 1 can also be reformulated in terms of the regression coefficients of the MLR models, i.e., (2)   (3) where (4) If a certain variable x k was not selected by MLR-SPA for inclusion in a particular individual model, the corresponding regression coefficient b k is set to zero in that model.For example, assume that K = 3 variables are available for selection and that P = 2 individual models are obtained in the MLR-SPA-subagging procedure.Furthermore, suppose that variables x 1 and x 3 are selected in the first MLR-SPA model and variables x 2 and x 3 are selected in the second MLR-SPA model.In this case, these models could be expressed as (5) ( 6) Equations 5 and 6 can be rewritten as ( 7) (8)   where a null regression coefficient was assigned to variables x 2 and x 1 in the first and second models, respectively.The coaveraging procedure can then be employed as in equation 4 with b 2 (1) = 0 and b 1 (2) = 0. Within this context, an important design parameter is the subsampling ratio q defined as (9) Vol.22, No. 11, 2011   For illustration, Figure 1 depicts the MLR-SPAsubagging procedure for two different subsampling ratios, namely q = 50% (Figure 1a) and q = 70% (Figure 1b).
As mentioned above, in previous works concerning the use of MLR-SPA-subagging, 12,14 the subsampling ratio q was arbitrarily set to approximately 60%.However, it may be argued that better ensemble models could be obtained by using a different choice for q, which motivates the present investigation.

Simulated data set
Simulated spectra were generated as in a previous work involving MLR-SPA. 7For this purpose, a linear relation was assumed between the matrix X of instrumental responses and the matrix Y of analyte concentrations: where N is a noise term.Three analytes (termed A, B, and C) and K = 300 spectral variables were considered.Matrix W (3 × 300) contains the proportionality coefficients between the analyte concentrations and the instrumental responses.The W-values adopted in this simulated study are presented in Figure 2a.A total of 200 spectra were generated by using a matrix Y (200 × 3) with concentration values randomly distributed in the range 1-10 (arbitrary units).Gaussian noise with zero mean and standard deviation of 0.1 was added to all spectra as in equation 10.The resulting spectra are presented in Figure 2b.
The overall set of N tot = 200 spectra was divided into a modelling set with N mod = 100 samples and a prediction set with N pred = 100 samples (N tot = N mod + N pred ) by applying the Kennard-Stone (KS) algorithm 19 to the matrix X (200 × 300) of instrumental responses.The modelling set was employed in the MLR-SPA-subagging procedure, as described in the previous section.The prediction set was only employed to evaluate and compare the performance of the resulting models.

Wheat data set
This public data set consists of NIR diffuse reflectance spectra of N tot = 100 wheat samples, along with reference values of moisture and protein content. 15,16The spectra were acquired in the range 1100-2500 nm with a 2 nm resolution, resulting in 701 spectral variables.
Figure 3 shows the NIR spectra of the 100 wheat samples.As can be seen in Figure 3a, the spectra display baseline shifts, which were eliminated by using a first-derivative Savitzky-Golay filter with a 2 nd -degree polynomial and a 21-point window. 17The resulting derivative spectra are shown in Figure 3b.Finally, the number of variables was reduced by discarding those for which the maximum signal intensity over all derivative spectra did not exceed 2% of the maximum signal intensity in the overall data set. 18The resulting spectra comprised K = 652 variables.
The overall set of N tot = 100 wheat samples was divided into a modelling set with N mod = 70 samples and a prediction set with N pred = 30 samples (N tot = N mod + N pred ) by applying the KS algorithm to the matrix X (100 × 652) of derivative spectra.

Diesel data set
This data set, which comprises 170 diesel samples collected from gas stations in the city of Recife (Pernambuco State, Brazil), was employed in a previous MLR-SPAsubagging study. 12The reference values for sulphur content, specific mass, and distillation temperatures (T10 and T90) were obtained according to the ASTM (American Society for Testing and Materials) D4294-90, 4615, and D86 methods, respectively.NIR spectra in the range 885-1600 nm were acquired using a FT-NIR/MIR spectrometer Perkin Elmer GX with an optical path length of 1.0 cm and a spectral resolution of 2 cm -1 .Systematic variations in the baseline were circumvented by using derivative spectra calculated with a Savitzky-Golay filter (2 nd -order polynomial, 11-point window).As a result, the number of spectral variables was K = 1431.The original and derivative NIR spectra of the 170 diesel samples are presented in Figures 4a and 4b, respectively.
The overall set of N tot = 170 diesel samples was divided into a modelling set with N mod = 85 samples and a prediction set with N pred = 85 samples (N tot = N mod + N pred ) by applying the KS algorithm to the matrix X (170 × 1431) of derivative spectra.

Evaluation of the MLR-SPA-subagging models
The MLR-SPA-subagging models were obtained for nine different subsampling ratios, namely q = 10%, 20%, ..., 90%.It is worth noting that such percentages are expressed in terms of the N mod modelling samples, as indicated in equation 9.In each case, the ensemble models were evaluated in terms of predictive ability and sensitivity to instrumental noise.The predictive ability was assessed by calculating the root-meansquare error in the prediction set (RMSEP) as where y i and ŷ i are the reference and the predicted values of the property under consideration for the ith prediction sample.
Sensitivity to instrumental noise was taken into account as suggested elsewhere [20][21][22] by calculating the 2-norm of the regression vector (||b av ||), which is defined as:   (12)   where b k av denotes the regression coefficient associated to variable x k in the ensemble model.In fact, by following the demonstration provided by Pinto et al., 20 it can be shown that s ŷ av = s noise ||b av ||, where s noise is the standard deviation of the instrumental noise (assumed to be homoscedastic and uncorrelated across the model variables) and s ŷ av is the standard deviation of the error in the ensemble model predictions resulting from the propagation of the instrumental noise.Ideally, improvements in the ensemble model should provide reductions in both RMSEP and ||b av ||.
It is worth noting that MLR-SPA-subagging has a stochastic nature due to the random subsampling operations.Therefore, given a certain subsampling ratio q and number of iterations P, the RMSEP and ||b av || values may vary for different realizations of the MLR-SPA-subagging procedure.For this reason, in order to assess the dispersion of the results, a Monte Carlo simulation 23 was carried out by calculating the average and standard deviation of the results over several realizations.In the present work, n MC = 25 realizations were employed in the Monte Carlo simulation.

Software
All calculations were carried out using the Matlab 2009b software.The subsampling operations in the MLR-SPA-subagging procedure were performed by using random permutations with the "rand" Matlab routine.

Simulated data set
Figure 5 presents the results obtained for analytes A and B with a fixed subsampling ratio (q = 70%) and a number of iterations P ranging from 1 to 50.The results for analyte C were similar to those obtained for analyte A and are thus omitted for brevity.As can be seen, there is a marked improvement in RMSEP and ||b av || for both analytes as the number of iterations P is increased.However, the gains become marginal after P = 30, which is in agreement with the findings reported elsewhere. 12he average results obtained for different subsampling ratios are shown in Figure 6.In this case, the number of iterations was set to P = 30, because the improvements for more iterations were marginal as discussed above.The error bars for RMSEP and ||b av || correspond to standard errors, which were calculated as the standard deviation s divided by the square root of the number of Monte Carlo realizations n MC (i.e., ).In the ||b av || × RMSEP plots presented in Figure 6, the best results are those situated closest to the origin (small values for both ||b av || and RMSEP).In this sense, Figure 6a indicates that appropriate subsampling ratios for determination of analyte A range from 20% to 40%.Within this range, the changes in ||b av || are minor and the difference between the smallest and largest RMSEP values are not significant according to an F-test at 95% confidence level.Other values for q (10% and 50-90%) lead to an increase in both ||b av || and RMSEP.The same comments regarding the choice of q can be applied to analyte B (Figure 6b).It is worth noting that the RMSEP and ||b av || results are generally worse for analyte B as compared to analyte A. Such a finding can be ascribed to the fact that the spectrum of analyte B is strongly overlapped by the two other analytes (A and C, as seen in Figure 2a).

Wheat data set
Figure 7 presents the protein and moisture results obtained for a fixed subsampling ratio (q = 70%) and a number of iterations P ranging from 1 to 50, as in Figure 5.As in the simulated case study, a marked improvement in RMSEP and ||b av || for both protein and moisture can be observed as the number of iterations is increased up to P = 30.In this case, it is worth noting that the results are statistically unstable for P < 10, as indicated by the large standard deviation values at the beginning of the curves.
The average results obtained for different subsampling ratios with P = 30 are shown in Figure 8.As can be seen in Figure 8a, appropriate subsampling ratios for moisture determination range from 30 to 70%.Within this range, the changes in ||b av || are minor and the difference between the smallest and largest RMSEP values are not significant according to an F-test at 95% confidence level.Smaller values for q (10 and 20%) lead to a noticeable increase in RMSEP, whereas larger values for q (80 and 90%) result in substantially larger ||b av || values.In the protein case (Figure 8b), the best RMSEP results were obtained for q ranging from 40 to 90%.By taking the ||b av || criterion into account, the best choice becomes q = 40%.
It is worth noting that the RMSEP values obtained for q = 10 and 20% were significantly larger than those obtained with the other subsampling ratios.In the protein case, for instance, the RMSEP for q = 10% was twice the value obtained for q ranging from 40 to 90%.Such a result for q = 10 and 20% may be ascribed to the small number of samples N cal employed in the calibration of each individual MLR-SPA model, which limits the maximum number M of spectral variables that can be selected, as M = min(N cal -1, K).This handicap was particularly adverse in the case of protein, because the bulk protein content in wheat involves a complex mixture of several components.Therefore, MLR-SPA models with few variables may not    be able to capture the various vibrational phenomenae involved in the NIR analysis of protein content.
On the other hand, the largest subsampling ratios (80 and 90%) yielded models with considerably high ||b av || values.In this case, since more calibration samples were employed, MLR-SPA was able to include a larger number of spectral variables in each individual model, which resulted in an increase in ||b av ||.Therefore, although suitable RMSEP values were obtained, the resulting MLR-SPA-subagging models are more sensitive to noise in the spectral measurements.This feature would compromise prediction accuracy if the models were applied to new measurements with lower signal-to-noise ratio, as illustrated elsewhere. 20n view of the above discussions, by taking into account the results of ||b av || and RMSEP for both properties, the most suitable subsampling ratios would be in the range 40-60%.

Diesel data set
As in the two case studies above, a marked improvement in RMSEP and ||b av || was observed for all diesel properties as the number of iterations was increased up to P = 30.Therefore, the corresponding graphs are omitted for brevity.The average results obtained for different subsampling ratios with P = 30 are shown in Figure 9. Again, the worst results in terms of either RMSEP or ||b av || were always obtained for the extreme values of q (10 and 90%).On the overall, the best results for these two metrics were obtained for q ranging from 30 to 50%.

Conclusions
This paper was concerned with the effect of the subsampling ratio on MLR-SPA-subagging models.For this purpose, investigations involving simulated data, Vol.22, No. 11, 2011   as well as near-infrared spectrometric determination of moisture and protein in wheat and distillation temperatures (T10 and T90), specific mass and sulphur in diesel were carried out.The results were evaluated in a multi-criterion framework by considering prediction ability (RMSEP) and sensitivity to spectral measurement noise (||b av ||).In view of these metrics, it was found that 30 subsampling iterations were sufficient to obtain convergence of the MLR-SPA-subagging procedure, which is in agreement with the findings of a previous study. 12The best results were obtained for subsampling ratios in the range 20-40% (simulated data), 40-60% (wheat) and 30-50% (diesel).Therefore q = 40% is found to be an appropriate compromise choice.In terms of the number N cal of calibration samples, these q percentages correspond to 20-40 samples (simulated data), 28-42 samples (wheat) and 26-43 samples (diesel).The smaller number of calibration samples required for the simulated dataset can be ascribed to the fewer sources of variability as compared to the wheat and diesel datasets, which involve actual physical/ chemical phenomenae.It is worth noting that the range of N cal values indicated above for these real-life datasets is in agreement with the guidelines of the ASTM E 1655 05 standard, 24 which recommends the use of at least 24 calibration samples.

Figure 2 .
Figure 2. (a) Pure spectra for analytes A, B, C and (b) mixture spectra.

Figure 3 .
Figure 3. (a) Original and (b) derivative spectra of the wheat samples.

Figure 4 .
Figure 4. (a) Original and (b) derivative spectra of the diesel samples.

Figure 5 .
Figure 5. MLR-SPA-subagging results as a function of the number of iterations: RMSEP for (a) analyte A and (b) analyte B, ||b av || for (c) analyte A and (d) analyte B. The solid and dashed lines represent the average result and the ±1s boundaries obtained from n MC = 25 Monte Carlo realizations.

Figure 6 .
Figure 6.||b av || versus RMSEP for (a) analyte A and (b) analyte B using different subsampling ratios.Standard errors are indicated by horizontal and vertical bars.

Figure 7 .
Figure 7. MLR-SPA-subagging results as a function of the number of iterations: RMSEP for (a) moisture and (b) protein, ||b av || for (c) moisture and (d) protein.The solid and dashed lines represent the average result and the ±1s boundaries obtained from n MC = 25 Monte Carlo realizations.

Figure 8 .
Figure 8. ||b av || versus RMSEP for (a) moisture and (b) protein using different subsampling ratios.Standard errors are indicated by horizontal and vertical bars.