QSPR Study of Partition Coefficient (K o/w) of Some Organic Compounds Using Radial Basis Function-Partial Least Squares (RBF-PLS)

In this work, we introduce the radial basis function-partial least squares (RBF-PLS) method, which offers high accuracy and precision in quantitative structure-property relationship (QSPR) studies. Three QSPR methods were compared for the prediction of the n-octanol-water partition coefficients (K o/w) of some organic compounds. Multiple linear regression (MLR), partial least squares (PLS) and radial basis function-partial least squares (RBF-PLS) were employed to construct linear and nonlinear models for predicting K o/w. The theoretical descriptors, calculated with Dragon and Gaussian 98, were screened by stepwise regression; they encode different topological, geometrical and electronic aspects of the molecular structures. The root mean square error of prediction (RMSEP) was obtained for the training and prediction sets with the MLR, PLS and RBF-PLS models. The results showed that RBF-PLS performed better than PLS and MLR.


Introduction
If a third substance is added to a system of two immiscible liquids in equilibrium, the added component will distribute itself between the two liquid phases until the ratio of its concentrations in the two phases attains a certain value, the so-called distribution constant or partition coefficient. The measurement of liquid-liquid partition coefficients is extremely important in: (i) fundamental chemistry, for studying inorganic and/or organic complex equilibria; (ii) industrial chemistry, for optimization of production and waste treatment; and (iii) food chemistry, for purification and extraction of sugars, fat or caffeine. [1] The n-octanol/water partition coefficient is the ratio of the concentration of a chemical species in n-octanol to that in water for a two-phase system at equilibrium. The logarithm of this coefficient, log K o/w, has been shown to be one of the key parameters in quantitative structure-property relationship (QSPR) studies. The n-octanol-water partition coefficient is also a measure of the hydrophobicity and hydrophilicity of a substance. Hydrophobic interactions are very important in many areas of chemistry, including enzyme-ligand interactions, drug-receptor interactions, transport of drugs to the active site, the assembly of lipids in biomembranes, aggregation of surfactants, coagulation and detergency. [2,3] Hydrophobic "bonding" is actually not bond formation at all, but rather the tendency of hydrophobic molecules or hydrophobic parts of molecules to avoid water because they are not readily accommodated in the highly ordered hydrogen-bonded structure of water. [4] Hydrophobic interactions are favored thermodynamically because of the increased entropy of the water molecules that accompanies the association of nonpolar molecules, which squeeze out water. The hydrophobic "bonding" resulting from the unwelcome reception of nonpolar molecules in water involves van der Waals forces, hydrogen bonding of water molecules in a 3D structure and other interactions. [5]

Distribution coefficients can be measured for basic, acidic and neutral compounds.1][12][13][14][15][16] There are some reports on the application of MLR [17][18][19][20] and artificial neural network [21][22][23][24] modeling to predict the n-octanol/water partition coefficients of organic compounds.6][27][28][29] Experimental determination of K o/w is often complex and time-consuming and can be done only for already synthesized compounds. For this reason, a number of computational methods for the prediction of this parameter have been proposed. In this work, a QSPR study is performed to develop models that relate the structures of some acidic, basic and neutral organic compounds to their n-octanol-water partition coefficients. The radial basis function-partial least squares (RBF-PLS) method was used to predict the log K o/w of these organic compounds.
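As a small illustration of the definition given above, log K o/w follows directly from the equilibrium concentrations of the solute in the two phases. The sketch below is written in Python; the function name and the example concentrations are hypothetical and are not taken from the data set of this work.

```python
import math


def log_kow(conc_octanol, conc_water):
    """Return log10 of the n-octanol/water partition coefficient.

    Both equilibrium concentrations must be positive and in the same units.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical solute that is 100 times more concentrated in n-octanol.
print(log_kow(2.0e-2, 2.0e-4))
```

A positive value indicates a hydrophobic solute that prefers the n-octanol phase, and a negative value indicates a hydrophilic one.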

Theoretical background
In RBF-PLS, let X denote the matrix containing the independent variables. For m objects and n variables, its dimensionality is m×n. Matrix Y, describing the membership of the m objects in g classes, has dimensionality m×g and contains only ones and zeros. The principle of RBF-PLS can be summarized as follows: instead of applying PLS2 to the X and Y data matrices containing the initial data, it is applied to the matrices A and Y, where A is the so-called activation matrix, whose elements are defined as:

$$a_{ij} = Q\left(\lVert \mathbf{x}_i - \mathbf{c}_j \rVert;\ \sigma_j\right) \qquad (1)$$

where Q is the radial basis function, characterized by its center (c_j) and width (σ_j) parameters.
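The construction of the activation matrix can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the form of the radial basis function, so a Gaussian function is assumed here, with one center placed on each training object and a single shared width σ. The function name and the descriptor values are hypothetical.

```python
import math


def activation_matrix(X, centers, sigma):
    """Build A with a_ij = exp(-||x_i - c_j||^2 / sigma^2).

    X       : list of m objects, each a list of n descriptor values
    centers : list of k centers, each a list of n values (assumed choice)
    sigma   : width shared by all Gaussian functions (assumed)
    """
    A = []
    for x in X:
        row = []
        for c in centers:
            sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            row.append(math.exp(-sq_dist / sigma ** 2))
        A.append(row)
    return A


# Hypothetical scaled descriptors for three compounds (two descriptors each).
X = [[0.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
A = activation_matrix(X, centers=X, sigma=0.5)
for row in A:
    print([round(a, 4) for a in row])
```

With the training objects used as centers, A is a square, symmetric m×m matrix with ones on its diagonal, and PLS is then applied to A instead of X.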
In the linear PLS2 model [30,31] the centered data matrices A and Y are projected onto the low-dimensional score matrices T and U, respectively:

$$A = TP^{T} + E \qquad (2)$$

$$Y = UC^{T} + F \qquad (3)$$

where the matrices P and C represent the regression coefficients (loadings). When the C weights are not normalized, the linear inner relation between the score matrices T and U can be presented as:

$$U = T + H \qquad (4)$$

and then

$$Y = TC^{T} + F^{*} \qquad (5)$$

where the E, F, F* and H matrices contain residuals. The optimal number of factors is established using the cross-validation procedure. [31,32] All independent variables are scaled to the range [0, 1].
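The scaling of the independent variables to the range [0, 1] can be sketched as a column-wise min-max transformation. This is an assumption about how the scaling was done, since the text gives only the target range; the function name and the example values are hypothetical.

```python
def scale_columns(X):
    """Scale each column of X to [0, 1]; constant columns are mapped to 0."""
    n_cols = len(X[0])
    lows = [min(row[j] for row in X) for j in range(n_cols)]
    highs = [max(row[j] for row in X) for j in range(n_cols)]
    scaled = []
    for row in X:
        scaled.append([
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled, lows, highs


# Hypothetical raw values of two descriptors for three compounds.
X_raw = [[1.0, 10.0], [3.0, 20.0], [2.0, 10.0]]
X_scaled, lows, highs = scale_columns(X_raw)
print(X_scaled)
```

The minima and maxima are returned so that compounds in a prediction set can be scaled with the ranges found for the training set.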
PLS is a method for building regression models based on a latent variable decomposition relating two blocks, the matrices X and Y, which contain the independent variables x and the dependent variables y, respectively. These matrices can be simultaneously decomposed into a sum of f latent variables as follows:

$$X = TP^{T} + E \qquad (6)$$

$$Y = UQ^{T} + F \qquad (7)$$

in which T and U are the score matrices for X and Y, respectively, P and Q are the loading matrices for X and Y, respectively, and E and F are the residual matrices. The two matrices are correlated through the scores T and U for each latent variable as follows:

$$u_f = b_f t_f \qquad (8)$$

where b_f is the regression coefficient for the fth latent variable. The partition coefficient of new samples can be estimated from the new scores T', which are substituted in equation (9), leading to equation (10):

$$Y = TBQ^{T} + F \qquad (9)$$

$$Y_{new} = T'BQ^{T} \qquad (10)$$

where B is the diagonal matrix of the regression coefficients b_f. Applications of PLS have been discussed by some researchers. [31,34]

The general purpose of multiple linear regression (MLR) is to quantify the relationship between several independent or predictor variables and a dependent variable:

$$y = b_0 + b_1x_1 + b_2x_2 + \dots + b_mx_m \qquad (11)$$

where m is the number of independent variables, b_1, …, b_m are the regression coefficients, b_0 is the intercept and y is the dependent variable. MLR techniques based on least-squares procedures are very often used for estimating the coefficients involved in the model equation. [35]
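For a single response such as log K o/w, the PLS decomposition and the prediction from new scores can be sketched with the NIPALS algorithm. This is a minimal illustration, not the implementation used in this work: the choice of NIPALS, the function names and the data are assumptions. Each factor yields a weight vector, a loading vector and an inner regression coefficient b_f, and a new sample is predicted by computing its scores factor by factor.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_factors):
    """Fit a single-response PLS model by NIPALS on mean-centered data."""
    m, n = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / m for j in range(n)]
    y_mean = sum(y) / m
    E = [[row[j] - x_mean[j] for j in range(n)] for row in X]  # centered X
    f = [yi - y_mean for yi in y]                              # centered y
    weights, loadings, coefs = [], [], []
    for _ in range(n_factors):
        # Weight vector proportional to X^T y, normalized to unit length.
        w = [dot([E[i][j] for i in range(m)], f) for j in range(n)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [wj / norm for wj in w]
        t = [dot(E[i], w) for i in range(m)]  # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(m)], t) / tt for j in range(n)]  # loadings
        b = dot(f, t) / tt  # inner regression coefficient b_f
        # Deflate X and y before extracting the next factor.
        E = [[E[i][j] - t[i] * p[j] for j in range(n)] for i in range(m)]
        f = [f[i] - b * t[i] for i in range(m)]
        weights.append(w)
        loadings.append(p)
        coefs.append(b)
    return x_mean, y_mean, weights, loadings, coefs


def pls1_predict(model, x_new):
    """Predict the response of one new sample from its scores."""
    x_mean, y_mean, weights, loadings, coefs = model
    e = [xj - mj for xj, mj in zip(x_new, x_mean)]
    y_hat = y_mean
    for w, p, b in zip(weights, loadings, coefs):
        t = dot(e, w)  # score of the new sample on this factor
        y_hat += b * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


# Hypothetical data: two descriptors and a response that is linear in them.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]]
y = [2.0 * a - b + 1.0 for a, b in X]
model = pls1_fit(X, y, n_factors=2)
print(pls1_predict(model, [6.0, 2.0]))
```

When the number of factors equals the number of independent variables, as in this toy example, PLS reproduces the ordinary least-squares (MLR) solution; with fewer factors it gives a regularized model. In RBF-PLS the same procedure would be applied to the activation matrix A rather than to X.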

Data set
Experimental n-octanol-water partition coefficient (K o/w) data for some neutral, basic and acidic organic compounds were taken from reference 36.8][39][40] The names of these compounds and their experimental n-octanol-water partition coefficients are shown in Table 1. As can be seen, the set contains a total of 61 n-octanol-water partition coefficient values. The n-octanol-water partition coefficients calculated for these compounds by the MLR, PLS and RBF-PLS methods are tabulated in Table 2. The n-octanol-water partition coefficient values for these compounds were obtained under the same instrumental conditions.

Descriptor generation and screening
The n-octanol-water partition coefficients (K o/w) of solutes in separation methods are related to structural, topological, electronic and geometric properties of the solutes. These molecular features can be encoded quantitatively by numerical values called molecular descriptors, which are used to search for the best QSPR model of the n-octanol-water partition coefficients (K o/w). Figure 1 shows the distribution of log K o/w, which indicates that the experimental values are normally distributed and that their frequencies are reasonable. The 2D structures of the molecules were drawn using the Hyperchem 7 software. [41] The structures were pre-optimized with the molecular mechanics force field (MM+), and the final geometries were obtained with the semi-empirical AM1 method in the Hyperchem program. All calculations were carried out at the restricted Hartree-Fock level with no configuration interaction. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.001 kcal mol⁻¹. The resulting geometries were transferred into the Dragon program package, developed by the Milano Chemometrics and QSPR group, [42] to calculate about 1497 descriptors in the following classes: constitutional, topological, geometrical, charge, GETAWAY (geometry, topology and atoms-weighted assembly), WHIM (weighted holistic invariant molecular descriptors), 3D-MoRSE (3D-molecular representation of structure based on electron diffraction), molecular walk count, BCUT, 2D-autocorrelation, aromaticity index, Randić molecular profile, radial distribution function, functional group and atom-centered fragment.

The Hyperchem output files were also used by the Gaussian 98 program [43] to calculate two classes of descriptors: electrostatic (minimum and maximum partial charges, polarity parameters, charged surface area descriptors, etc.) and quantum chemical (dipole moment, HOMO and LUMO energies, etc.). These calculations were performed with the 6-31+G** basis set for all atoms at the B3LYP level. [44,45] No molecular symmetry constraint was applied; rather, full optimization of all bond lengths and angles was carried out at the B3LYP/6-31++G** level. Seven descriptors were used to build the different models. Y-shuffling was also performed and the result was 0.0921, which indicates that there is no chance correlation. The seven descriptors are complementary information content (neighborhood symmetry of 1-order) (CIC1), average eigenvector coefficient sum from distance matrix (VED2), partial charge weighted topological electronic charge (PCWTe), shape profile no. 01 (SP01), 3D-MoRSE signal 03 weighted by atomic masses (Mor03m), T total size index weighted by atomic van der Waals volumes (Tv) and mean molecular polarizability (α). These descriptors and their characterizations are shown in Table 3.

Selection of the optimum number of factors
The optimum number of factors (latent variables) to be included in the calibration model was determined by computing the prediction error sum of squares (PRESS) for cross-validated models using a high number of factors (half the total number of standards plus one). PRESS is defined as follows:

$$\mathrm{PRESS} = \sum_{i} (y_i - \hat{y}_i)^2 \qquad (12)$$

where y_i is the reference K o/w for the ith compound and ŷ_i represents the estimated K o/w. The cross-validation method employed was to eliminate only one compound at a time and then calibrate the PLS model on the remaining standards. Using this calibration, the K o/w of the compound that was left out was predicted. This process was repeated until each standard had been left out once. One reasonable choice for the optimum number of factors would be the number that yields the minimum PRESS value. However, since there is a finite number of compounds in the training set, in many cases the minimum PRESS value causes overfitting for unknown compounds that were not included in the model. To solve this problem, Haaland et al. [46][47][48] suggested a criterion in which the PRESS values for all previous factors are compared with the PRESS value at the minimum. The F-statistical test can be used to determine the significance of PRESS values greater than the minimum. In all instances, the number of factors for the first PRESS value whose F-ratio probability drops below 0.75 was selected as the optimum. The number of factors used in the PLS model is 5. Figure 2 shows the variation of the R² parameter with the number of factors. To show the effect of the number of factors on the consecutive RBF-PLS models, the variation of the root mean square error (RMSE) of cross-validation versus σ is presented in Figure 3. It is clear from this figure that at a σ value of 0.1 and two factors the RMSE value is at its maximum.
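The leave-one-out computation of PRESS described above can be sketched as follows. The routine takes the fitting and prediction steps as arguments, so that it could be called once per candidate number of factors; the function names are hypothetical, and a trivial mean-value model with made-up numbers stands in for the PLS calibration to keep the example short.

```python
def loo_press(X, y, fit, predict):
    """Leave-one-out PRESS: sum of squared errors of the held-out predictions."""
    press = 0.0
    for i in range(len(y)):
        # Calibrate on all compounds except the i-th one.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        # Predict the compound that was left out and accumulate its error.
        press += (y[i] - predict(model, X[i])) ** 2
    return press


# Stand-in model for illustration only: predict the mean training response.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_new):
    return model


X = [[0.1], [0.4], [0.9]]  # hypothetical descriptor values
y = [1.0, 2.0, 3.0]        # hypothetical log K o/w values
print(loo_press(X, y, fit_mean, predict_mean))  # prints 4.5
```

Repeating the call for models with 1, 2, 3, ... factors gives the PRESS curve from which the optimum number of factors is chosen.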

Selection of the number of descriptors
A basic parameter that is important in the different models is the number of descriptors. The liquid-liquid partition coefficient of solutes is related to some structural, electronic and geometric properties of the solute and solvent molecules. These molecular parameters, namely geometric, electronic and topological descriptors, are used to search for the best QSPR model of the liquid-liquid partition coefficients. Geometric descriptors were calculated using the optimized Cartesian coordinates and the van der Waals radius of each atom in the molecule. Stepwise multiple linear regression (MLR) was used for the selection of the important descriptors and for model construction. The descriptors that appear in the best MLR equation are shown in Table 3. As can be seen from the correlation matrix in Table 4, with one exception there is no significant correlation between the selected descriptors. The same 7 descriptors were used for the RBF-PLS model. The statistical parameters and specifications of the MLR, PLS and RBF-PLS models are shown in Table 5. As can be seen from this table, the values of the root mean square error of prediction (RMSEP) for the training and prediction sets are 0.0364 and 0.0533 for the RBF-PLS model, compared with 0.3050 and 0.3564 for the PLS model and 0.4022 and 0.4128 for the MLR model.
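The statistical parameters used to compare the models can be computed from the experimental and predicted values as sketched below. The formulas are the commonly used definitions and are an assumption here, since the text names the parameters without giving their expressions; in particular, RSEP is taken as a percentage relative to the sum of squared experimental values, and R² as the squared Pearson correlation coefficient. The function name and the numbers are hypothetical.

```python
import math


def prediction_statistics(y_obs, y_pred):
    """Return RMSEP, RSEP (%), MAE and R^2 for observed vs. predicted values."""
    n = len(y_obs)
    sq_err = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    rmsep = math.sqrt(sq_err / n)
    rsep = 100.0 * math.sqrt(sq_err / sum(o ** 2 for o in y_obs))
    mae = sum(abs(p - o) for o, p in zip(y_obs, y_pred)) / n
    # Squared Pearson correlation between observed and predicted values.
    mean_o = sum(y_obs) / n
    mean_p = sum(y_pred) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = s_op ** 2 / (s_oo * s_pp)
    return {"RMSEP": rmsep, "RSEP": rsep, "MAE": mae, "R2": r2}


# Hypothetical experimental and predicted log K o/w values.
stats = prediction_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
for name, value in stats.items():
    print(name, round(value, 4))
```

Computing the same statistics separately for the training set and the prediction set gives the pairs of values reported for each model.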
Comparison of these values and of the other statistical results in Table 5 indicates that the results obtained by RBF-PLS are better than those obtained using the MLR and PLS methods. This is believed to be due to the nonlinear capabilities of RBF-PLS. A Y-randomization test was also performed, in which the Y-block was shuffled while the descriptor block was kept unaltered. After 10 cases of Y-randomization for the model, the average squared correlation coefficient was 0.1011, which is low compared with the value found when considering the true Y. This result shows that there is no chance correlation. Figure 4 shows the plot of the RBF-PLS predicted values against the experimental values of the n-octanol-water partition coefficients (K o/w) for the molecules included in the data set. This plot illustrates that RBF-PLS is a powerful technique for the prediction of n-octanol-water partition coefficients. The residuals of the RBF-PLS calculated values of the n-octanol-water partition coefficients (K o/w) are plotted against the experimental values in Figure 5. The distribution of the residuals on both sides of the zero line indicates that no systematic error exists in the development of the RBF-PLS model.
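The Y-randomization test can be sketched as follows. To keep the example self-contained, a single-descriptor least-squares line stands in for the actual model; for such a line the R² of the fit equals the squared correlation between descriptor and response. The function names, the random seed and the data are hypothetical.

```python
import random


def squared_correlation(a, b):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab ** 2 / (s_aa * s_bb)


def y_randomization(x, y, n_runs=10, seed=0):
    """Average R^2 obtained after shuffling y while x is kept unaltered."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        y_shuffled = y[:]
        rng.shuffle(y_shuffled)
        scores.append(squared_correlation(x, y_shuffled))
    return sum(scores) / n_runs


# Hypothetical descriptor and response with a perfect linear relation.
x = [float(i) for i in range(30)]
y = [2.0 * v + 1.0 for v in x]
print("true R2:", squared_correlation(x, y))
print("mean R2 after shuffling:", y_randomization(x, y, n_runs=10))
```

A mean R² for the shuffled responses that is far below the R² of the true model supports the conclusion that the model is not the result of a chance correlation.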

Conclusion
MLR, PLS and RBF-PLS were used as feature mapping techniques for the prediction of the n-octanol-water partition coefficients of some neutral, basic and acidic organic compounds. The results obtained reveal the superiority of RBF-PLS over the MLR and PLS models. This is due to the ability of RBF-PLS to map the selected features by manipulating their functional dependence implicitly, unlike regression analysis. The descriptors appearing in these QSPR models provide information related to different molecular properties that can participate in the physicochemical processes affecting the n-octanol-water partition coefficients of the solutes.

Figure 1. The Gaussian distribution of log K o/w.

Figure 4. Plot of the calculated n-octanol-water partition coefficients against the experimental values.

Figure 5. Plot of the residuals vs. experimental values of n-octanol-water partition coefficients.

Table 1. Data set and their n-octanol-water partition coefficients (K o/w)

Table 2. Calculated n-octanol-water partition coefficients (K o/w) by MLR, PLS and RBF-PLS methods

Table 3. The descriptors used in model construction

Figure 2. Plot of PRESS vs. number of factors by PLS model.

Figure 3. Variation of root mean square error of cross-validation vs. σ.

Table 5. Statistical parameters obtained using the RBF-PLS, PLS and MLR models. These parameters are root mean square error of prediction (RMSEP), relative standard error of prediction (RSEP), mean absolute error (MAE) and square of the correlation coefficient (R²).
a Number of factors:

Table 4. Correlation matrix for the seven selected descriptors