Structure-flammability relationship study of phosphoester dimers by MLR and PLSa

Polyphosphonates and polyphosphates having good flame retardancy represent an important class of organophosphorus based polymer additives. In this analysis the flammability of 28 previously synthesized polyphosphoesters, modelled as dimmers, was explored using the multiple linear regression (MLR) and Partial Least Square (PLS) methodology. The statistical quality of the final MLR and PLS models was estimated using the following parameters: the squared correlation coefficient (r2 training = 0.917 and 0.976), the training root-mean-square errors (RMSEtr = 0.029 and 0.016) and the leave-seven-out cross-validation correlation coefficient (qL70 = 0.748 and 0.881), respectively. External validation was checked for a test set of seven compounds using several criteria. The MLR models had somewhat inferior fitting results. The final MLR and PLS models can be used for the estimation of limiting oxygen index (LOI) values of new polyphosphoester structures. The presence of phosphonate groups and increasing molecular branching in an isomeric series favour the dimer flammability.


Introduction
An important feature of most commercial polymers is to be non-flammable or flame retardant [1] .Other polymer properties, like as: glass transition temperature, thermal decomposition temperature, etc., have been previously studied by quantitative structure-property relationships [2,3] .
Different polyphosphoesters with fire retardant properties were reported in the literature, being included in materials like: polycarbonates, polyamides, thermosets, etc [5] .The flammability of phosphorous polymers was investigated in order to determine structural-property relationships, too [6,7] .Two types (R and S) of chirality were found for the monomer polyphosphoesters, which were geometry optimized using the MMFF94s force field [6] .Multiple linear regression (MLR), artificial neural networks (ANNs) and support vector machines (SVMs) were applied to correlate the limiting oxygen index (LOI) values to the structural calculated descriptors.Good fitting results and predictable models were obtained using the MLR and ANN approaches, the SVM modelling providing the poorest results.It was concluded that the monomer geometry is important for flame retardancy.
Our goal was to develop robust multiple linear regression (MLR) and the partial least squares (PLS) models that select a set of variables that efficiently predict the limiting oxygen index (LOI) values and guide new information on the flammability mechanism of polyphosphoesters [6] dimers.This parallel approach gives the opportunity to compare the quality of results supplied by the two methodologies.

Data set
We used a series of 28 previously synthesized polyphosphoesters [6] , which were modelled in the present study as dimers.The dataset in this investigation consisted of 28 RR, RS, SR and SS phosphoester dimers for compounds 1 to 14; compounds 15 to 28 had only one chiral centre, at the P2 phosphorous atom (see Figure 1).
Experimental data for the limiting oxygen index (LOI), expressed in % (Table 1), and used as dependent variable in this study, was previous reported in references [6] and [7] .Dimer molecular structures were built using the Marvin program [8] , which was used for drawing, displaying and characterizing chemical structures.Dimer conformers were pre-optimized using the 94s variant of the MMFF (Merck Molecular force field) [9] with coulomb interactions and the attractive part of the van der Waals interactions, included in the OMEGA software [10][11][12] .The following parameters were used for the conformer generation: a maximum of 400 conformers per compound, an energy cut-off of 10 kcal/mol relative to a global minimum identified from the search.SMILES notation was used as program input.The stereoisomers were generated using the 'Flipper' utility inside the Omega program.To avoid redundant conformers, any conformer having a RMSD fit outside 0.5 Å to another conformer was removed.
a Dedicated to the 150 th anniversary of the Romanian Academy.

Training and test set selection
The series of phosphoester dimers were divided into training and test set using several approaches: the partition against medoids (PAM) algorithm [16] ("cluster" package available in R [17] based on the Euclidian distance), the decreasing response order and the random splitting.In order to use same test set in both MLR and PLS approaches, seven out of twenty eight (25%) phosphoester dimers (compounds 2, 10, 11, 15, 17, 19 and 22, see Figure 1) were chosen as test set to validate the final models.The data structures and the LOI range values (in %), comprised in the test set (0.22-0.50) and the training set (0.18-0.55), are commensurate.

Multiple Linear Regression (MLR) and Partial Least Square (PLS)
Multiple linear regression (MLR) [18] has been applied after variable selection carried out by means of a genetic algorithm included in the QSARINS v. 2.2 program [19,20] using the RQK fitness function, with leave-one-out cross-validation correlation coefficient, which constrained the function to be optimized.In MLR, the number of 1549 calculated descriptors is too high compared to the number of compounds (N = 28) and an appropriate variable selection method was required.MLR calculations were carried out separately for each dataset: RR, RS, SR, SS.
In MLR calculations the structural data was normalized based on the autoscaling method, which can be described as: where for each variable m, XT MJ and X MJ are the j values for the m variable after and before scaling, respectively, m X is the mean and S M the standard deviation of the variable.
The PLS methodology is a generalization of the MLR one, having as main advantage the possibility to analyze the data with correlated, noise, and large number of independent variables [21] .In the PLS equation the latent variables were transformed as function of the original X IJ (i =1, 2,..., N; j=1, 2,..., K) variables, resulting following equation: where Ŷ I represents the calculated dependent variable, and b J the PLS coefficients.The obtained models were optimized by a procedure of outlier detection and based on variables with significant coefficients different from zero.When the variable selection was achieved, only the significant descriptors with coefficients different from zero were preserved in the final models (for noise elimination).Both methologies have as main goal to find out a mathematical model with minimum number of parameters and with good estimation capability.

Model validation
For the external validation of the MLR and PLS models several parameters were calculated: Q F3 [24] (models with values higher than 0.7 were considered acceptable), the CCC ext (the concordance correlation coefficient, with satisfactory values higher than 0.85) [25] , RMSE ext (root-mean-square errors) and MAE ext (mean absolute error) [26] and R 2 pred (a higher limit than 0.5 was considered as acceptable) [27] .The comparable thresholds used in this study for different validation criteria have been rigorously previously determined [25,28] .Other statistical parameters [29] were used for the external test set: (i) squared correlation coefficient (r 2 test ) between the predicted and observed activities as well as squared correlation coefficient by cross-validation (q 2 ); (ii) coefficient of determination for linear regressions with intercepts set to zero, i.e. r 2 0 (predicted versus observed activities), and r ′2 0 (observed versus predicted activities); (iii) slopes k and k' of the above mentioned two regression lines.The following conditions should be satisfied for a model with acceptable predictive ability: For the internal validation of the final models other parameters were employed: r 2 training (determination coefficient), q 2 L70 (leave-seven-out cross-validation coefficient; values higher than 0.7 were considered as acceptable), q 2 LOO (leave-one-out cross-validation coefficient), RMSE tr , MAE tr and CCC tr , calculated for the training set.
In the mean time higher r 2 training values must be accompanied by q 2 values as close to the r 2 training ones as possible [30] (to avoid over fitting, which was, also, checked by the RMSE and MAE values).
The risk of chance correlation was, also, verified by the Y-scrambling procedure (r 2 Scr and q 2 Scr ) and must have lower values than the original model.For calculation of r 2

Scr and q 2
Scr this process was repeated 999 times in case of PLS calculations and 2000 times in the MLR ones.
After the check of all validation parameters, the applicability domain for the models is required, because robust and validated models cannot be expected to reliably predict the modelled property for any type of compounds.The applicability domain is a theoretical region in physicochemical of response and chemical structure space for which a QSAR model should make predictions with a given reliability [30] .In the applicability domain only the predictions for those compounds that fall within this domain can be considered as reliable, not extrapolations of the model.In the Williams plot the standardized residuals versus the leverages (hi) was exploited to visualize the applicability domain for our final MLR models.

Results and Discussions
The major objective of this paper was the estimation of limiting oxygen index (LOI) of phosphoester dimers using molecular descriptors that can be computed directly from molecular structure and guide new information on the flammability mechanism.

MLR results
The relationship between the molecular descriptors and LOI values of the dimer derivatives is illustrated by the following Equations 8-11: RR model RS model Crisan, L., Iliescu where SEE represents the standard error of estimates, r 2 adj -the adjusted r 2 , F-the Fischer test, q 2 LOO -leave-one-out cross-validation coefficient.Other statistical results of models 8-11 are included in Tables 2, 3, 4.
The Williams (of the standardized residuals versus the leverage) plot was used to visualize the applicability domain of the final best MLR_RR model (Figure 2).This plot confirms the absence of outliers and influential points.All compounds were located within the applicability domain and were predicted accurately.
The MLR_RR model is completely satisfactory in the fitting and has high predictive power.The LOO (leave-one-out) cross-validation highlights that the model is stable, not obtained by chance, in fact the difference between r 2 training and q 2 LOO is small: 5.3%.This model is internally predictive with differences between q 2 LMO and q 2 LOO of -4.5%, and between r 2 training and r 2 LMO of 9.8%.The risk of chance correlation was, also, verified by the Y-scrambling procedure.The extremely low calculated r 2 Scr and q 2 Scr scrambling values (Table 2) indicate no chance correlation for the chosen models.3 and all calculated terms of Golbraikh and Tropsha (Table 4) confirm the predictive power of all MLR models.
Better statistical results and a more stable model to simulate polymer flammability were noticed in case of the RR dataset model compared to the others.
The edge adjacency matrix encodes information about the connectivity between graph edges [15] .EEig09d (eigenvalue 09 from edge adj.matrix weighted by dipole moments) takes into account the molecular polarity, being unfavourable for dimer flame retardancy.
The mean square distance index, denoted as MSD [15] , is calculated from the second-order distance distribution moment [31] .The MSD index decreases with increasing molecular branching in an isomeric series, which is favourable for dimer flammability.
GETAWAYs (Geometry, Topology, and Atom-Weights Assembly) are geometrical descriptors encoding information on the effective position of substituents and fragments in the molecular space [15] .Moreover, they are independent of molecule alignment and they, also, account to some extent for information on molecular size and shape as well as for specific atomic properties.Increased R2m+ (R maximal autocorrelation of lag 2/weighted by atomic masses) values favour the dimer flammability.Compounds containing phosphonate groups are favourable for the dimer flame retardancy.

PLS results
PLS calculations were performed with SIMCA-P+12 [32] program using 21 stereoisomers as a training set and 7 stereoisomers as a test set with the taken ratio of 75% for training set and 25% for test set in whole series of compounds.The large difference between the r 2 training and q 2 L70 values of the first calculated PLS model (lower than 0.3 is accepted) demonstrated the model over fit, and suggested the need for enhancement of the model quality.Therefore, the noise variables (with insignificant coefficient values) have been removed.Several PLS models were developed for the RR, RS, SR and SS datasets to increase their predictive power.In the final PLS_SS model compound 5 was omitted, being found as outlier, in accordance to the Hotelling's T 2 range plot [32] .
The final (four-components for the RR, RS, SR datasets and two-components for the SS dataset) PLS models are satisfactory in the fitting (Table 2).The over fitting of the models was exceeded by the remarkable high and close values of r 2 training and q 2 L70 , and was, also, checked by the RMSE and MAE (mean absolute error) parameters.In the same time similar RMSE values for the training and validation sets are observed (Tables 2 and 3).
PLS models with predictive power were obtained (see Tables 3 and 4), except the PLS_SS one, as seen from the values of Q and CCC ext parameters.The predicted LOI values for the RR dataset are given in Table 1.
The PLS models were internally validated using, also, 999 permutations in Y-scrambling.The calculated r 2 Scr and q 2 Scr scrambling values (Table 2) indicate no chance correlation for the chosen models.a from reference [6] and b from reference [7] .In the PLS modelling the terms having VIP values greater than 1 are the most relevant for explaining the dependent variable, and usually only these descriptors were interpreted.The descriptors showing the largest VIP values can simulate polymer flammability and are discussed below.
For all models higher values of the Randic shape index (-path/walk 4 and path/walk 5 -PW4 and PW5) are favourable for the flammability, while the MSD (Balaban mean square distance index) descriptor is unfavourable for flammability.They are topological descriptors obtained from molecular graph [15] .
Another group of significant descriptors is the class of 2D autocorrelation descriptors, which are computed from molecular graph as the sum of products of atom weights of the terminal atoms of all the paths for the considered path length (the so called lag) [15] .The most important 2D autocorrelation descriptors involved in our model are the Geary parameters.The positive coefficients of GATS6m -Geary autocorrelation of lag 6 weighted by mass, increase the flame retardancy of RR, RS and SR series, while for SS dimers, the same effect was observed for descriptor GATS5v -Geary autocorrelation of lag 5 weighted by van der Waals volume.The 3D-MoRSE descriptors provide 3D information from atomic coordinates using the same transform as in electron diffraction (which uses them to prepare theoretical scattering curves) [15] .For the RR and SR datasets, Mor13m-signal 13/weighted by mass, decrease the flame retardancy, for the RS dataset Mor15m-signal 15/weighted by mass is favourable for flammability, while for SS dataset these descriptors are insignificant.
Class of topological and frequency fingerprints descriptors are expressed as sum of topological distances between two elements or frequency of two atoms at a topological distance.Descriptors T(O..P) -the sum of topological distances between O..P, F07[O-S] -the frequency of O -S at topological distance 7, and F10[C-S] -the frequency of C -S at topological distance 10, with negative coefficients are unfavourable for flammability for RR, RS and SS datasets.
Three GETAWAY descriptors: one in the RR set: R5m-the R autocorrelation of lag 5/weighted by mass, and two in the SR set: HATS6v -the leverage-weighted autocorrelation of lag 6/weighted by van der Waals volume and HATS6p-leverage-weighted autocorrelation of lag 6/weighted by polarizability increase the dimer flame retardancy.
Compared to the MLR previously published monomer models [6] , the statistical results for fitting are improved in case of MLR and PLS dimer models.Additional structural information which influences the flame retardancy was included in the final dimer MLR (e.g. the number of phosphonates) and PLS (e.g.2D frequency fingerprints) models.

Conclusions
The MLR and PLS models developed for this series of dimer phosphoesters will be helpful to predict the LOI values of new untested compounds.Better statistical results and a more stable model to simulate polymer flammability were noticed in case of the RR dataset compared to the others, the presence of R chiral centre at the phosphorous atom being important for the dimer flammability.The mean square distance index and GETAWAY descriptors favour the dimer flammability, as well as increased number of phosphonates included in the dimer structure, as derived from both MLR and PLS methodologies.Better PLS fitting and predictivity results were obtained compared to the MLR ones for all datasets, except for the SS one.
Dimers including structures with R chiral centres gave more stable and predictive models compared to the previously published MLR monomer ones.
New structural information which influences the flame retardancy was included in the final MLR and PLS dimer models.Table 4. Golbraikh and Tropsha criteria [29] calculated for external validation of the MLR and PLS models (test set).

Figure 1 .
Figure 1.Dimer phosphoester structure.RR series: R chiral centre at P1, R chiral centre at P2; RS series: R chiral centre at P1, S chiral centre at P2; SR series: S chiral centre at P1, R chiral centre at P2; SS series: S chiral centre at P1, S chiral centre at P2; compounds 15 to 28 had only one chiral centre, at the P2 phosphorous atom.

2 F1 , Q 2 F2 , Q 2 F3
The RMSE (root-mean-square error) values for the training and validation sets are similar.The chosen MLR_RR model demonstrate a satisfactory stability in internal validation, has high fitting, internal and external predictive power.The high values of Q and CCC ext external validation parameters (see section 2.5) included in Table

Figure 2 .
Figure 2. Williams plot: standardized residuals of the MLR_RR model versus leverages, predicted by fitting.Training compounds are marked by white circles and test compounds by black circles.

Table 1 .
Experimental and predicted LOI values, structural descriptors included in the final MLR_RR model.

Table 2 .
Internal validation parameters of the MLR and PLS models (training set).

Table 3 .
External validation parameters of the MLR and PLS models (test set).