Abstract
Genetic algorithm and multiple linear regression (GAMLR), partial least square (GAPLS), kernel PLS (GAKPLS) and LevenbergMarquardt artificial neural network (LM ANN) techniques were used to investigate the correlation between retention index (RI) and descriptors for 116 diverse compounds in essential oils of six Stachys species. The correlation coefficient LGOCV (Q²) between experimental and predicted RI for test set by GAMLR, GAPLS, GAKPLS and LM ANN was 0.886, 0.912, 0.937 and 0.964, respectively. This is the first research on the QSRR of the essential oil compounds against the RI using the GAKPLS and LM ANN.
essential oils; genetic algorithmkernel partial least squares; LevenbergMarquardt artificial neural network
ARTIGO
Quantitative structureretention relationships analysis of retention index of essential oils
Hadi Noorizadeh^{*} * email: hadinoorizadeh@yahoo.com ; Abbas Farmany; Mehrab Noorizadeh
Department of Chemistry, Faculty of Science, Islamic Azad University, Ilam Branch, Ilam, Iran
ABSTRACT
Genetic algorithm and multiple linear regression (GAMLR), partial least square (GAPLS), kernel PLS (GAKPLS) and LevenbergMarquardt artificial neural network (LM ANN) techniques were used to investigate the correlation between retention index (RI) and descriptors for 116 diverse compounds in essential oils of six Stachys species. The correlation coefficient LGOCV (Q^{2}) between experimental and predicted RI for test set by GAMLR, GAPLS, GAKPLS and LM ANN was 0.886, 0.912, 0.937 and 0.964, respectively. This is the first research on the QSRR of the essential oil compounds against the RI using the GAKPLS and LM ANN.
Keywords: essential oils; genetic algorithmkernel partial least squares; LevenbergMarquardt artificial neural network.
INTRODUCTION
An essential oil is a volatile mixture of organic compounds derived from odorous plant material by physical means.^{1} The composition of essential oil has been extensively investigated because of its commercial interest in the fragrance industry (soaps, colognes, perfumes, skin lotion and other cosmetics), in aromatherapy (relaxant), in pharmaceutical preparations for its therapeutic effects as a sedative, spasmolytic, antioxidant, antiviral and antibacterial agent.^{2,3} Recently it has also been employed in food manufacturing as natural flavouring for beverages, ice cream, candy, baked goods and chewing gum. Stachys L. (Lamiaceae, Lamioideae) is among the largest genera of Lamiaceae. Stachys consists of annual and perennial herbs and subshrubs showing extensive variation in morphological and cytological characters.^{4}Stachys L. is a large genus comprising over 300 worldwide species and is widely spread throughout Northern Europe and the Mediterranean.^{5} The constituents of essential oil of these spices includes: oxygenated monoterpenes, monoterpene hydrocarbons, oxygenated sesquiterpenes, sesquiterpene hydrocarbons, carbonylic compounds, phenols, fatty acids and esters. These entire compounds have been identified by gas chromatography (GC) and gas chromatographymass spectrometry (GCMS). GC and GCMS are the main methods for identification of these volatile plant oils. To increase the reliability of the MS identification, comprehensive twodimensional GCMS can be used. This technique is based on two consecutive GC separations, typically according to boiling point and polarity.^{6} The compounds are identified by comparison of retention index with those reported in the literature and by comparison of their mass spectra with libraries or with the published mass spectra data.^{7}
Chromatographic retention for capillary column gas chromatography is the calculated quantity, which represents the interaction between stationary liquid phase and gasphase solute molecule. This interaction can be related to the functional group, electronic and geometrical properties of the molecule.^{8,9}
Mathematical modeling of these interactions helps chemists to find a model that can be used to obtain a deep understanding about the mechanism of interaction and to predict the retention index (RI) of new or even unsynthesized compounds.^{10} Building retention prediction models may initiate such theoretical approach, and several possibilities for retention prediction in GC. Among all methods, quantitative structureretention relationships (QSRR) are most popular. In QSRR, the retention of given chromatographic system was modeled as a function of solute (molecular) descriptors. A number of reports, deals with QSRR retention index calculation of several compounds have been published in the literature.^{1113} The QSRR/QSAR models apply to multiple linear regression (MLR) and partial least squares (PLS) methods often combined with genetic algorithms (GA) for feature selection.^{14,15}
Because of the complexity of relationships between the property of molecules and structures, nonlinear models are also used to model the structureproperty relationships. Levenberg Marquardt artificial neural network (LM ANN) is nonparametric nonlinear modeling technique that has attracted increasing interest. In the recent years, nonlinear kernelbased algorithms as kernel partial least squares (KPLS) have been proposed.^{16,17} The basic idea of KPLS is first to map each point in an original data space into a feature space via nonlinear mapping and then to develop a linear PLS model in the mapped space. According to Cover's theorem, nonlinear data structure in the original space is most likely to be linear after highdimensional nonlinear mapping.^{18} Therefore, KPLS can efficiently compute latent variables in the feature space by means of integral operators and nonlinear kernel functions. Compared to other nonlinear methods, the main advantage of the kernel based algorithm is that it does not involve nonlinear optimization. It essentially requires only linear algebra, making it as simple as the conventional linear PLS. In addition, because of its ability to use different kernel functions, KPLS can handle a wide range of nonlinearities. In the present study, GAMLR, GAPLS, GAKPLS and LM ANN were employed to generate QSRR models that correlate the structure of some compound; with observed RI. The present study is a first research on QSRR of the essential oil compounds against the RI, using GAKPLS and LM ANN. The performance of these models was compared with those obtained by the GAMLR and GA PLS methods.
EXPERIMENTAL
Data set
Retention index of essential oils of six Stachys species, S. cretica L. ssp. vacillans Rech. fil., S. germanica L., S. hydrophila Boiss., S. nivea Labill., S. palustris L. and S. spinosa L., obtained by hydrodistillation, was studied by GC and GCMS, which contains 116 compounds^{19} (Table 1). This set was measured at the same condition with the Innowax column (60 m x 0.25 mm i.d.; 0.33 μ m film thickness) for GC measurement. GCMS analysis was also performed on an Agilent 6850 series II apparatus, fitted with a fused silica HP1 capillary column (30 m x 0.25 mm i.d.; 0.33 μ m film thickness). The retention index of these compounds was decreased in the range of 3710 and 1075 for both Octadecanoic acid and αPinene, respectively.
In order to evaluate the generated models, we used leavegroupout cross validation (LGOCV). Cross validation consists of the following: removing one (leaveoneout) or groups (leavegroupout) of compounds in a systematic or random way; generating a model from the remaining compounds, and predicting the removed compounds.
Descriptor calculation
All structures were drawn with the HyperChem software (version 6). Optimization of molecular structures was carried out by semiempirical AM1 method using the Fletcher Reeves algorithm until the root mean square gradient of 0.01 was obtained. Since the calculated values of the electronic features of molecules will be influenced by the related conformation. Some electronic descriptors such as dipole moment, polarizability and orbital energies of LUMO and HOMO were calculated by using the HyperChem software. Also optimized structures were used to calculate 1497 descriptors by DRAGON software version 3.^{20}
Software and programs
A Pentium IV personal computer (CPU at 3.06 GHz) with windows XP operational system was used. Geometry Optimization was performed by HyperChem (version 7.0 Hypercube, Inc.), Dragon software was used to calculate of descriptors. MLR analysis was performed by the SPSS Software (version 13, SPSS, Inc.) by using enter method for model building. Minitab software (version 14, Minitab) was used for the simple PLS analysis. Cross validation, GAMLR, GAPLS, GAKPLS, LM ANN and other calculation were performed in the MATLAB (Version 7, Mathworks, Inc.) environment.
THEORY
Genetic algorithm
Genetic algorithm has been proposed by J. Holland in the early 1970s but it was possible to apply them with reasonable computing times only in the 1990s, when computers became much faster. GA is a stochastic method to solve the optimization problems, defined by fitness criteria applying to the evolution hypothesis of Darwin and different genetic functions, i.e., crossover and mutation.^{21} In GA, each individual of the population, defined by a chromosome of binary values as the coding technique, represented a subset of descriptors. The number of genes at each chromosome was equal to the number of descriptors. The population of the first generation was selected randomly. A gene was given the value of one, if its corresponding descriptor was included in the subset; otherwise, it was given the value of zero.^{22} The GA performs its optimization by variation and selection via the evaluation of the fitness function η. Fitness function was used to evaluate alternative descriptor subsets that were finally ordered according to the predictive performance of related model by cross validation. The fitness function was proposed by Depczynski et al..^{23} The rootmeansquare errors of calibration (RMSEC) and prediction (RMSEP) were calculated and the fitness function was calculated by Equation 1.
where m_{c} and m_{p} are the number of compounds in the calibration and prediction set and n represent the number of selected variables, respectively. The parameter algorithm reported in Table 2.
Linear models
Multiple linear regression
A major step in constructing the QSRR model is finding a set of molecular descriptors that represent variation in the structural property of the molecules. The modeling and prediction of the physicochemical properties of organic compounds is an important objective in many scientific fields.^{24,25} MLR is one of the most modeling methods in QSRR. MLR method provides an equation that links the structural features to the RI of the compounds:
where a_{0} and a_{i} are intercept and regression coefficients of the descriptors, respectively. d_{i} has the common definition, variable or descriptor in this case, the elements of this vector are equivalent numerical values of descriptors of the molecules.
Partial least squares
PLS is a linear multivariate method for relating the process variables X with responses Y. PLS can analyze data with strongly collinear, noisy, and numerous variables in both X and Y.^{26} PLS reduces the dimension of the predictor variables by extracting factors or latent variables that are correlated with Y while capturing a large amount of the variations in X. This means that PLS maximizes the covariance between matrices X and Y. In PLS, the scaled matrices X and Y are decomposed into score vectors (t and u), loading vectors (p and q), and residual error matrices (E and F):
where a is the number of latent variables. In an inner relation, the score vector t is linearly regressed against the score vector u.
where b is regression coefficient that is determined by minimizing the residual h. It is crucial to determine the optimal number of latent variables and cross validation is a practical and reliable way to test the predictive significance of each PLS component. There are several algorithms to calculate the PLS model parameters. In this work, the NIPALS algorithm was used with the exchange of scores.^{27}
Nonlinear
Kernel partial least squares
The KPLS method is based on the mapping of the original input data into a high dimensional feature space ℑ where a linear PLS model is created. By nonlinear mapping Φ: x∈ℜ^{n}  Φ(x)∈ℑ, a KPLS algorithm can be derived from a sequence of NIPALS steps and has the following formulation:^{28}
1. Initialize score vector w as equal to any column of Y.
2. Calculate scores u = ΦΦ^{T}w and normalize u to u = 1, where Φ is a matrix of regressors.
3. Regress columns of Y on u: c = Y^{T}u, where c is a weight vector.
4. Calculate a new score vector w for Y: w = Yc and then normalize w to w=1.
5. Repeat steps 24 until convergence of w.
6. Deflate ΦΦ^{T} and Y matrices:
7. Go to step 1 to calculate the next latent variable.
Without explicitly mapping into the highdimensional feature space, a kernel function can be used to compute the dot products as follows:
ΦΦ^{T} represents the (n×n) kernel Gram matrix K of the cross dot products between all mapped input data points Φ(x_{i}),i = 1, ..., n. The deflation of the ΦΦ^{T} = Kmatrix after extraction of the u components is given by:
where I is an mdimensional identity matrix. Taking into account the normalized scores u of the prediction of KPLS model on training data Y is defined as:
For predictions on new observation data Y^_{t} , the regression can be written as:
where K_{t} is the test matrix whose elements are K_{ij} =K(x_{i}, x_{j}) where x_{i} and x_{j} present the test and training data points, respectively.
Artificial neural network
An artificial neural network (ANN) with a layered structure is a mathematical system that stimulates the biological neural network; consist of computing units named neurons and connections between neurons named synapses.^{2931} Input or independent variables are considered as neurons of input layer, while dependent or output variables are considered as output neurons. Synapses connect input neurons to hidden neurons and hidden neurons to output neurons. The strength of the synapse from neuron i to neuron j is determined by mean of a weight, W_{ij}. In addition, each neuron j from the hidden layer, and eventually the output neuron, are associated with a real value b_{j}, named the neuron's bias and with a nonlinear function, named the transfer or activation function. Because the artificial neural networks (ANNs) are not restricted to linear correlations, they can be used for nonlinear phenomena or curved manifolds.^{29} Back propagation neural networks (BNNs) are most often used in analytical applications.^{30} The back propagation network receives a set of inputs, which is multiplied by each node and then a nonlinear transfer function is applied. The goal of training the network is to change the weight between the layers in a direction to minimize the output errors. The changes in values of weights can be obtained using Equation 11:
where ΔW_{ij,n} is the change in the weight factor for each network node, α is the momentum factor, and F is a weight update function, which indicates how weights are changed during the learning process. There is no single best weight update function which can be applied to all nonlinear optimizations. One need to choose a weight update function based on the characteristics of the problem and the data set of interest. Various types of algorithms have been found to be effective for most practical purposes such as LevenbergMarquardt (LM) algorithm.
Levenberg Marquardt algorithm
While basic back propagation is the steepest descent algorithm, the LevenbergMarquardt algorithm^{32} is an alternative to the conjugate methods for second derivative optimization. In this algorithm, the update function, F_{n}, can be calculated using Equations 12 and 13:
where J is the Jacobian matrix, μ is a constant, I is a identity matrix, and e is an error function.^{33}
RESULTS AND DISCUSSION
Linear models
GAMLR analysis
To reduce the original pool of descriptors to an appropriate size, the objective descriptor reduction was performed using various criteria. Reducing the pool of descriptors eliminates those descriptors which contribute either no information or whose information content is redundant with other descriptors present in the pool. From the variable pairs with R > 0.90, only one of them was used in the modeling, while the variables over 90% and equal to zero or identical were eliminated. With the use of these criteria, 1014 out of 1497 original descriptors were eliminated and remaining descriptors were employed to generate the models with the GAMLR program. In order to minimize the information overlap in descriptors and to reduce the number of descriptors required in regression equation, the concept of nonredundant descriptors was used in our study. The best equation is selected on the basis of the highest multiple correlation coefficient leavegroupout cross validation (LGOCV) (Q^{2}), the least RMSECV and relative error of prediction and simplicity of the model. These parameters are probably the most popular measure of how well a regression model fits the data. Among the models proposed by GAMLR, one model had the highest statistical quality and was repeated more than the others.This model had five molecular descriptors including constitutional descriptors (sum of atomic van der Waals volumes (scaled on Carbon atom)) (Sv), topological descriptors (mean topological charge index of order1) (JGI1), atomcentred fragments (H attached to C0 (sp^{3}) no X attached to next C) (H046) and electronic descriptors (dipole moment (μ) and highest occupied molecular orbital (HOMO)). The best QSRR model obtained is given below together with the statistical parameters of the regression in Equation 14.
Since mean topological charge index of order1 coefficient is bigger in the equation, it is very important descriptor compared to the other descriptors in the model. The JGI1, H046 and HOMO displays a negative sign which indicates that when these descriptors increase the RI decreases. The Sv and μ displays a positive sign which indicates that the RI is directly related to these descriptors. The predicted values of RI are plotted against the experimental values for training and test sets in Figure 1a. The statistical parameters of this model, constructed by the selected descriptors, are depicted in Table 3.
GAPLS analysis
The colinearity problem of the MLR method has been overcome through the development of the partial leastsquares projections to latent structures (PLS) method. For this reason, after eliminating descriptors that had identical or zero values for greater than 90% of the compounds, 1097 descriptor were remained. These descriptors were employed to generate the models with the GAPLS and GAKPLS program. The best PLS model contains 7 selected descriptors in 3 latent variables space. These descriptors were obtained constitutional descriptors (number of rings) (nCIC), topological descriptors (Balaban centric index) (BAC), 2D autocorrelations (BrotoMoreau autocorrelation of a topological structure  lag 5/weighted by atomic masses) (ATS5m), geometrical descriptors ((3DBalaban index) (J3D), RDF descriptors (Radial Distribution Function  5.5/weighted by atomic van der Waals volumes) (RDF055v), functional group (number of terminal C(sp)) (nR#CH/X) and electronic descriptors (polarizability). For this in general, the number of components (latent variables) is less than the number of independent variables in PLS analysis. The Figure 1b shows the plot of predicted versus experimental values for training and test sets. The obtained statistic parameters of the GAPLS model were shown in Table 3. The data confirm that higher correlation coefficient and lower prediction error have been obtained by PLS in relative to MLR and these reveal that PLS method produces more accurate results than that of MLR. The PLS model uses higher number of descriptors that allow the model to extract better structural information from descriptors to result in a lower prediction error.
Nonlinear models
GAKPLS analysis
The leavegroupout cross validation (LGOCV) has been performed. The n selected descriptors in each chromosome were evaluated by fitness function of PLS and KPLS based on the Equation 1. In this paper a radial basis kernel function, k(x,y)= exp(xy^{2} /c), was selected as the kernel function with c = mσ^{2}where r is a constant that can be determined by considering the process to be predicted (here r set to be 1), m is the dimension of the input space and σ^{2} is the variance of the data.^{34} It means that the value of c depends on the system under the study. Figure 2a shows the plot of Q^{2} versus latent variable for this model. The 5 descriptors in 2 latent variables space chosen by GAKPLS feature selection methods were contained. These descriptors were obtained constitutional descriptors (number of bonds) (nBT), geometrical descriptors (span R) (SPAN), atomcentred fragments ((phenol/enol/carboxyl OH) and electronic descriptors (lowest unoccupied molecular orbital (LUMO) and polarizability. The Figure 2b shows the plot of predicted versus experimental values for training and test sets. High correlation coefficient and closeness of slope to 1 in the GAKPLS model reveal a satisfactory agreement between the predicted and the experimental values. For the constructed model, four general statistical parameters were selected to evaluate the prediction ability of the model for the RI. Table 3 shows the statistical parameters for the compounds obtained by applying models to training and test sets. The statistical parameters correlation coefficient (R^{2}), correlation coefficient LGOCV (Q^{2}), relative error (REP)% and root mean squares error (RMSE) was obtained for proposed models. Each of the statistical parameters mentioned above were used for assessing the statistical significance of the QSRR model. The data presented in Table 3 indicate that the GAPLS and GAMLR linear model have good statistical quality with low prediction error, while the corresponding errors obtained by the GAKPLS model are lower. In comparison with the results obtained by these models suggest that GAKPLS hold promise for applications in choosing of variable for LM ANN systems. This result indicates that the RI of essential oils possesses some nonlinear characteristics.
Description of some models descriptors
In the chromatographic retention of compounds in the nonpolar or low polarity stationary phases two important types of interactions contribute to the chromatographic retention of the compounds: the induction and dispersion forces. The dispersion forces are related to steric factors, molecular size and branching, while the induced forces are related to the dipolar moment, which should stimulate dipoleinduced dipole interactions. For this reason, constitutional descriptors, atomcentred fragments, functional groups and electronic descriptors are very important.
Constitutional descriptors are most simple and commonly used descriptors, reflecting the molecular composition of a compound without any information about its molecular geometry. The most common Constitutional descriptors are number of atoms, number of bound, absolute and relative numbers of specific atom type, absolute and relative numbers of single, double, triple, and aromatic bound, number of ring, number of ring divided by the number of atoms or bonds, number of benzene ring, number of benzene ring divided by the number of atom, molecular weight and average molecular weight.
Electronic descriptors were defined in terms of atomic charges and used to describe electronic aspects both of the whole molecule and of particular regions, such atoms, bonds, and molecular fragments. This descriptor calculated by computational chemistry and therefore can be consider among quantum chemical descriptor. The eigenvalues of LUMO and HOMO and their energy gap reflect the chemical activity of the molecule. LUMO as an electron acceptor represents the ability to obtain an electron, while HOMO as an electron donor represents the ability to donate an electron. The HOMO energy plays a very important role in the nucleophylic behavior and it represents molecular reactivity as a nucleophyle. Good nucleophyles are those where the electron residue is high lying orbital. The energy of the LUMO is directly related to the electron affinity and characterizes the susceptibility of the molecule toward attack by nucleophiles. Electron affinity was also shown to greatly influence the chemical behaviour of compounds, as demonstrated by its inclusion in the QSPR/QSRR.^{35,36}
The geometrical descriptors which use the modeled threedimensional coordinates. These descriptors attempt to describe the geometrical environments of carbon atoms. They are usually employed only in situations in which locked conformations are being studied.^{37}
Topological descriptors are based on a graph representation of the molecule. They are numerical quantifiers of molecular topology obtained by the application of algebraic operators to matrices representing molecular graphs and whose values are independent of vertex numbering or labeling. They can be sensitive to one or more structural features of the molecule such as size, shape, symmetry, branching and cyclicity and can also encode chemical information concerning atom type and bond multiplicity. Balaban index is a variant of connectivity index, represents extended connectivity and is a good descriptor for the shape of the molecules and modifying biological process. Nevertheless, some of chemists have used this index successfully in developing QSPR/QSRR models.
The radial distribution function descriptors are based on the distances distribution in the geometrical representation of a molecule and constitute a radial distribution function code. Formally, the radial distribution function of an ensemble of N atoms can be interpreted as the probability distribution of finding an atom in a spherical volume of radius r.^{38}
LM ANN analysis
With the aim of improving the predictive performance of nonlinear QSRR model, LM ANN modeling was performed. Descriptors of GAKPLS model were selected as inputs in LM ANN model. The network architecture consisted of five neurons in the input layer corresponding to the five mentioned descriptors. The output layer had one neuron that predicts the RI. The number of neurons in the hidden layer is unknown and needs to be optimized. In addition to the number of neurons in the hidden layer, the learning rate, the momentum and the number of iterations also should be optimized. In this work, the number of neurons in the hidden layer and other parameters except the number of iterations were simultaneously optimized. A Matlab program was written to change the number of neurons in the hidden layer from 2 to 7, the learning rate from 0.001 to 0.1 with a step of 0.001 and the momentum from 0.1 to 0.99 with a step of 0.01. The root mean square errors for training set were calculated for all of the possible combination of values for the mentioned variables in cross validation. It was realized that the RMSE for the training set are minimum when three neurons were selected in the hidden layer and the learning rate and the momentum values were 0.5 and 0.2, respectively. Finally, the number of iterations was optimized with the optimum values for the variables. It was realized that after 13 iterations, the RMSE for prediction set were minimum. The statistical parameters for LM ANN model in Table 3. Plots of predicted RI versus experimental RI values by LM ANN are shown in Figure 3. Obviously, there is a close agreement between the experimental and predicted RI and the data represent a very low scattering around a straight line with respective slope and intercept close to one and zero. The closeness of the data to the straight line with a slope equal to 1 shows the perfect fit of the data to a nonlinear model. It should be noted that the data shown in Figure 3 are the predicted values according to leavegroupout crossvalidation and a deviation from the regression line is expected for some points. The Q^{2}, which is a measure of the model fit to the cross validation set, can be calculated as:
where y_{i}, y_{î} and y^{} were respectively the experimental, predicted, and mean RI values of the samples. The accuracy of cross validation results is extensively accepted in the literature considering the Q^{2} value. In this sense, a high value of the statistical characteristic (Q^{2} > 0.5) is considered as proof of the high predictive ability of the model.^{39} However, several authors suggest that a high value of Q^{2} appears to be a necessary but not sufficient condition for a model to have a high predictive power and consider that the predictive ability of a model can only be estimated using a sufficiently large collection of compounds that was not used for building the model.^{40}
We believe that applying only LGOCV is not sufficient to evaluate the predictive ability of a model. Thus we employed a twostep validation protocol which contains internal (LGOCV) and external (test set) validation methods. The data set was randomly divided into training (calibration and prediction sets) and test sets after sorting based on the RI values. The training set consisted of 92 molecules and the test set, consisted of 24 molecules. The training set was used for model development, while the test set in which its molecules have no role in model building was used for evaluating the predictive ability of the models for external set. We reported that the retention index of this essential oil was mainly controlled by constitutional descriptors, functional groups and electronic descriptors.
The statistical parameters obtained by LGOCV for LM ANN, GAKPLS and the linear QSRR models are compared in Table 3. Inspections of the results of this table reveals a higher R^{2} and Q^{2} values and lower the RE for LM ANN model for the training and test sets compared with their counterparts for GAKPLS and other models. Moreover, the low values of rootmeansquare error of prediction for the samples in the test set confirm the prediction ability of the resulted models for the compounds that were not used in the modelbuilding step. In comparison with the plot of other models, the LM ANN predicted RI of each one of the training and test sets compound represent a uniform and linear distribution. This clearly shows the strength of LM ANN as a nonlinear feature selection method. The key strength of LM ANN is their ability to allow for flexible mapping of the selected features by manipulating their functional dependence implicitly. Neural network handles both linear and nonlinear relationship without adding complexity to the model. This capacity offset the large computing time required and complexity of LM ANN model with respect other models.
CONCLUSION
The essential oils are widely used in pharmaceutical, cosmetic and perfume industry, and for flavouring and preservation of several food products. GC and GCMS is one of the most powerful tools in analytical volatile compound (such as essential oils). In this study, an accurate QSRR model for estimating the retention index (RI) of essential oils of six Stachys species which obtained by GC and GCMS was developed by employing the two linear models (GAMLR and GAPLS) and two nonlinear models (GAKPLS and LM ANN). The most important molecular descriptors selected represent the constitutional descriptors, functional group and electronic descriptors that are known to be important in the retention mechanism of essential oils. Four models have good predictive capacity and excellent statistical parameters. A comparison between these models revealed the superiority of the GAKPLS and LM ANN to other models. It is easy to notice that there was a good prospect for the GAKPLS and LM ANN application in the QSRR modeling. This indicates that the RI of essential oils possesses some nonlinear characteristics. In comparison with two nonlinear models, the results showed that the LM ANN model can be effectively used to describe the molecular structure characteristic of these compounds. It can also be used successfully to estimate the RI for new compounds or for other compounds whose experimental values are unknown.
Recebido em 7/3/10; aceito em 12/8/10; publicado na web em 30/11/10

^{1}Vagionas, K.; Ngassapa, O.; Runyoro, D.; Graikou, K.; Gortzi, O.; Chinou, I.; Food Chem. 2007, 105, 1711.

^{2}Kim, N. S.; Lee, D. D.; J. Chromatogr., A 2002, 982, 31.

^{3}Eminagaoglu, O.; Tepe, B.; Yumrutas, O.; Akpulat, H. A.; Daferera, D.; Polissiou, M.; Sokmen, A.; Food Chem. 2007, 100, 339.

^{4}Radulovi, N.; Lazarevi, J.; Risti N.; Pali R.; Biochem. Syst. Ecol. 2007, 35, 196.

^{5}Kotsos, M. P.; Aligiannis, N.; Mitakou, S.; Biochem. Syst. Ecol. 2007, 35, 381.

^{6}Peters, R.; Tonoli, D.; van Duin, M.; Mommers, J.; Mengerink, Y.; Wilbers, A. T. M.; van Benthem, R.; de Koster, Ch.; Schoenmakers, P. J.; van derWal, Sj.; J. Chromatogr., A 2008, 1201, 141.

^{7}Li, Z. G.; Lee, M. R.; Shen, D. L.; Anal. Chim. Acta 2006, 576, 43.

^{8}AcevedoMart1nez, J.; EscalonaArranz, J. C.,; VillarRojas, A.; TellezPalmero, F.; PerezRoses, R.; Gonzalez, L.; CarrascoVelar, R.; J. Chromatogr., A 2006, 1102, 238.

^{9}Teodora, I.; Ovidiu, I.; J. Mol. Design 2002, 1, 94.

^{10}Ghasemi, J.; Saaidpour, S.; Brown, S. D.; J. Mol. Struct. 2007, 805, 27.

^{11}Qin, L. T.; Liu, Sh. Sh.; Liu, H. L.; Tong, J.; J. Chromatogr., A 2009, 1216, 5302.

^{12}Riahi, S.; Pourbasheer, E.; Ganjali, M. R; Norouzi, P.; J. Hazard. Mater. 2009, 166, 853.

^{13}Bombarda, I.; Dupuy, N.; Le Van Da, J. P.; Gaydou , E. M.; Anal. Chim. Acta 2008, 613, 31.

^{14}Deeb, O.; Hemmateenejad, B.; Jaber, A.; GardunoJuarez, R.; Miri, R.; Chemosphere 2007, 67, 2122.

^{15}Hemmateenejad, B.; Miri, R.; Akhond, M.; Shamsipur, M.; Chemom. Intell. Lab. Syst. 2002, 64, 91.

^{16}Kim, K.; Lee, J. M.; Lee, I. B.; Chemom. Intell. Lab. Syst 2005, 79, 22.

^{17}Rosipal, R.; Trejo, L. J.; J. Mach. Learning Res 2001, 2, 97.

^{18}Haykin, S.; Neural Networks, PrenticeHall: New Jersey, 1999.

^{19}Conforti, F.; Menichini, F.; Formisano, C.; Rigano, D.; Senatore, F.; Arnold, N. A.; Piozzi, F.; Food Chem 2009, 116, 898.

^{20}Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M.; DRAGONSoftware for the calculation of molecular descriptors; Version 3.0 for Windows, 2003.

^{21}Cai, W.; Xia, B.; Shao, X.; Guo, Q.; Maigret, B.; Pan, Z.; J. Mol. Struct 2001, 535, 115.

^{22}Goldberg, D. E.; Genetic Algorithms in Search, Optimization and Machine Learning, AddisonWesleyLongman: Reading, 2000.

^{23}Depczynski, U.; Frost, V. J.; Molt, K.; Anal. Chim. Acta 2000, 420, 217.

^{24}Citra, M.; Chemosphere 1999, 38, 191.

^{25}Hemmateenejad, B.; Shamsipur, M.; Miri, R.; Elyasi, M.; Foroghinia, F.; Sharghi, H.; Anal. Chim. Acta 2008, 610, 25.

^{26}Wold, S.; Sjostrom, M.; Eriksson, L.; Chemom. Intell. Lab. Syst. 2001, 58, 109.

^{27}Tang, K.; Li, T.; Anal. Chim. Acta 2003, 476, 85.

^{28}Rosipal, R.; Trejo, L. J.; J. Mach. Learning Res 2001, 2, 97.

^{29}Zupan, J.; Gasteiger, J.; Neural Network in Chemistry and Drug Design, WileyVCH: Weinheim, 1999.

^{30}Beal, T. M.; Hagan, H. B.; Demuth, M.; Neural Network Design, PWS: Boston, 1996.

^{31}Zupan, J.; Gasteiger, J.; Neural Networks for Chemists: An Introduction, VCH: Weinheim, 1993.

^{32}Guven, A.; Kara, S.; Exp. Sys. Appli 2006, 31, 199.

^{33}Salvi, M.; Dazzi, D.; Pelistri, I.; Neri, F.; Wall, J. R.; Ophthalmology 2002, 109, 1703.

^{34}Kim, K.; Lee, J. M.; Lee, I. B.; Chemom. Intell. Lab. Syst 2005, 79, 22.

^{35}Booth, T. D.; Azzaoui, K.; Wainer, I. W.; Anal. Chem 1997, 69, 3879.

^{36}Yu, X.; Yi, B.; Liu, F.; Wang, X.; React. Funct. Polym 2008, 68, 1557.

^{37}Karelson, M.; Molecular descriptors in QSAR/QSPR, WileyInterscience: New York, 2000.

^{38}Todeschini, R.; Consonni, V.; Handbook of molecular descriptors, WileyVCH: Weinheim, 2000.

^{39}Hemmateenejad, B.; Javadnia, K.; Elyasi, M.; Anal. Chim. Acta 2007, 592, 72.

^{40}Golbraikh, A.; Tropsha, A.; J. Mol. Graph. Model 2002, 20, 269.
Publication Dates

Publication in this collection
02 Mar 2011 
Date of issue
2011
History

Accepted
12 Aug 2010 
Received
07 Mar 2010