A quantitative relationship between Tgs and chain segment structures of polystyrenes

The glass transition temperature (Tg) is a fundamental characteristic of an amorphous polymer. A quantitative structure-property relationship (QSPR) based on an error back-propagation artificial neural network (ANN) was constructed to predict the Tgs of 107 polystyrenes. Stepwise multiple linear regression (MLR) analysis was adopted to select an optimal subset of molecular descriptors. Chain segments (or motion units) of the polymer backbones, 20 carbons in length (10 repeating units), were used to calculate the molecular descriptors reflecting polymer structures. The relatively optimal ANN conditions were obtained by adjusting various network parameters by trial and error. Compared to the model already published in the literature, the optimal ANN model with a [4-7-1] network structure in this paper is accurate and acceptable, even though its test set contains more samples. The results demonstrate the feasibility and power of using chain segment structures as representatives of polymers for developing Tg models of polystyrenes.


Introduction
The glass transition temperature (Tg) is also known as the glass temperature, or the transition temperature between the glassy and rubbery states of amorphous materials. Tg is a fundamental characteristic and is taken as the most crucial property of amorphous polymeric materials [1]. The theory of the glassy state and the glass transition remains unsolved and is regarded as the deepest and most interesting problem in solid-state theory. Although Tg can be determined experimentally, the discrepancies among Tg values reported in the literature may be quite large, because (1) the transition happens over a comparatively wide temperature range, and (2) many factors affect Tg values, including the structural, constitutional and conformational features of polymers, molecular weight, and experimental conditions such as the measuring method, the duration of the experiment, and the pressure during the measurement [2]. In addition, experimental determination of Tgs cannot be applied to polymers that have not yet been synthesized. Hence, it is necessary to develop theoretical methods for the prediction of Tgs.
Quantitative structure-property relationship (QSPR) models can be used to predict Tg values of polymers. This approach is based on the assumption that the variation in the physicochemical properties of compounds depends on changes in molecular structure, which can be characterized with descriptors. A major goal of the QSPR approach is to develop a mathematical relationship between the property of interest and structural features [3]. Several researchers have predicted Tgs of polymers with QSPR models. Van Krevelen [4] predicted Tgs by using the group additive property theory. This method is only applicable to polymers whose group contribution values are known. Bicerano [2] developed a more universal QSPR model with R² (the square of the correlation coefficient R) of 0.95 and a standard error (s) of 24.65 K for a data set of 320 polymers. The Tg model was based on the solubility parameter and the weighted sum of 13 topological bond connectivity parameters of the monomer structures. However, the model was not validated with a test set. Joyce et al. [5] built models for Tg prediction based on the monomer structures of 360 polymers. The model predicted the Tg values for a test set of polymers with a root mean square (rms) error of 35 K. Katritzky et al. [6] introduced a four-parameter model with R² = 0.928 for 21 medium molecular weight polymers and copolymers based on their repeat units. On a larger data set, Katritzky et al. [7] developed a QSPR for the molar glass transition temperature (Tg/M) of 88 uncross-linked linear homopolymers. The model has five molecular descriptors, and the s for Tg is 32.9 K. On the same data, Cao and Lin [8] developed a QSPR model (R² = 0.9056) by using five molecular descriptors that focus on the influence of chain stiffness and intermolecular forces. Yu et al. [9] applied stepwise multiple linear regression (MLR) to 107 polystyrenes and generated a QSPR model (R = 0.959 and s = 15.20 K) from a training set of 96 polystyrenes. The MLR model produced an rms error of 20.5 K for the test set comprising 11 polystyrenes. Recently, some quantum chemical descriptors calculated from repeating units or monomers were used to develop QSPR models for Tgs of polymers [10-12].
Due to the large and variable size of polymer molecules, the QSPR models stated above, together with QSPR models of other polymer properties, are built by extrapolation from monomer structures or repeating units [1]. These methods fail to account for the influence of neighboring repeating units. This matters especially for Tg, because the glass transition results from the Brownian motion of chain segments being frozen or thawed. In this work, chain segments (localized units or motion units) 20 carbons in length (10 repeating units) were used to calculate descriptors for their corresponding polystyrenes and to develop QSPR models for their Tgs.

Data set
Table 1 shows the experimental Tg data for 107 polystyrenes, which are taken from Brandrup et al. [13]. The Tg values of the entire set range from 208 to 490 K.
The pendant groups present on the benzene ring include halides, carbonyls, ethers, hydrocarbon chains, hydroxyl, hydroxyimino, aromatic rings, and other functional groups. These polystyrenes were randomly divided into a training set (70 polystyrenes) and a test set (37 polystyrenes).
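The random division into 70 training and 37 test samples can be sketched as follows. This is a minimal illustration only: the function name `split_dataset` and the seed are assumptions, since the text does not state how the random split was generated, and the sample numbers simply follow the 1-107 numbering of Table 1.

```python
import random

def split_dataset(sample_ids, n_train, seed=0):
    """Randomly partition sample IDs into a training list and a test list."""
    rng = random.Random(seed)  # assumed seed; the paper does not report one
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])

# Samples No. 1-107 as numbered in Table 1
train_ids, test_ids = split_dataset(range(1, 108), n_train=70)
print(len(train_ids), len(test_ids))
```

A different seed gives a different partition, so any reported error statistics depend on the particular split.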

Descriptor computation
A polymeric material consists of a mixture of giant molecules. Therefore, it is impossible to calculate descriptors directly from the molecular structures of the polymeric material. Two approaches have been adopted to resolve this problem. One is to use the repeating unit to calculate descriptors for the corresponding polymer. The other is to use the monomer as a representative of the corresponding polymer [1].
Tg is a single temperature used to represent the transition region, in which polymer chain segments change from a frozen state to a mobile one (or vice versa). Below the glass transition region, 1-4 chain atoms are involved in motion. Further, these motions are largely restricted to vibrations and short-range rotational motions. In the glass transition region, 10-50 chain atoms attain sufficient thermal energy to move in a coordinated manner. In the Tg region, these chain atoms (motion units) are mobilized first, before the whole molecule starts moving. On further heating, the increased energy allotted to the chains permits them to reptate out through entanglements rapidly and flow as individual molecules [14,15]. The structure of a polymer chain segment affects its glass transition and is correlated with Tg. According to the above theory of the glass transition, descriptors calculated from chain segments describe the structures affecting polymer Tgs more accurately than those calculated from repeating units and monomers. From a theoretical point of view, the longer the chain segment used to calculate descriptors, the more accurately the descriptors characterize the polymer. The motion units related to the glass transition of polymers usually contain 10-50 carbons in length. However, a segment that is too long makes the descriptors difficult to calculate, and a segment that is too short cannot sufficiently represent the structure of the motion units. Thus, chain segments 20 carbons in chain length were used to calculate molecular descriptors for the corresponding polymers.
Polymeric chain segments containing 20 main-chain carbons of polystyrenes were first sketched using ChemBioDraw Ultra 11.0 in the ChemBioOffice 2008 program. For example, the structure model consisting of 10 repeating units end-capped by two hydrogens (see Figure 1) was adopted as the representative structure of poly(styrene) (No. 65 in Table 1) to calculate the descriptors.
Subsequently, the sketched 2D molecular structures were converted to 3D structures and optimized using molecular mechanics (MM2 force field) in ChemBio3D Ultra 11.0, with a convergence criterion of a minimum rms gradient of 0.01 kcal/(mol·Å). The optimized molecules were saved in Sybyl mol2 (.mol2) format as the input files for the Dragon software [16]. Lastly, 4885 descriptors were calculated for each energy-minimized motion unit with the Dragon software. Descriptors with constant or near-constant values, and those with a pair correlation greater than or equal to 0.90, were removed in order to reduce redundant and non-useful information. After this exclusion, 551 descriptors remained to undergo descriptor selection. A relatively optimal subset of descriptors was obtained by applying MLR analysis in IBM SPSS Statistics 19.
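The descriptor pre-filtering step (removing constant or near-constant descriptors and one member of each pair with correlation of at least 0.90) can be sketched as below. This is an illustrative reimplementation, not the procedure built into Dragon: the function name `filter_descriptors`, the variance tolerance, and the rule of keeping the first descriptor of a correlated pair are all assumptions, and the data are synthetic.

```python
import numpy as np

def filter_descriptors(X, names, var_tol=1e-8, corr_max=0.90):
    """Drop near-constant columns, then one of each highly correlated pair."""
    X = np.asarray(X, dtype=float)
    # Step 1: remove constant or near-constant descriptors
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > var_tol]
    X, names = X[:, keep], [names[j] for j in keep]
    # Step 2: greedy pass over the absolute pair-correlation matrix
    corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
    selected = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with a kept column
        if all(corr[j, k] < corr_max for k in selected):
            selected.append(j)
    return X[:, selected], [names[j] for j in selected]

# Synthetic example: d2 nearly duplicates d1, d4 is constant
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=50), b, np.ones(50)])
X_filtered, kept = filter_descriptors(X, ["d1", "d2", "d3", "d4"])
print(kept)
```

Which member of a correlated pair is retained affects the later descriptor selection, so the greedy first-come rule used here is only one of several reasonable choices.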

Artificial neural network
The optimal descriptor subset was fed to an artificial neural network (ANN) as input vectors. ANNs are computational models that simulate the behavior of the human brain. Common networks consist of an input layer, some number of hidden layers (intermediate layers) and an output layer. Each layer includes a number of processing nodes, called neurons or units. Each node in the network is influenced by the nodes to which it is connected in a highly complex and parallel way. The degree of influence is dictated by the values of the links or connections. Through a training algorithm, the overall behavior of an ANN can be modified by adjusting the weights (the values of the links or connections). After learning from the input data set, an ANN acquires knowledge and can be applied to test set data not present in the training set. The output layer produces the predicted values of the properties of interest. One of the most popular algorithms applied in the training phase is the error back-propagation (BP) algorithm. The number of neurons in the hidden layer should be optimized by trial and validation until no obvious improvement is seen for the model [17].
In Equation 1, n is the number of samples in the training set; s is the standard error of estimate; R is the correlation coefficient; F is the Fisher ratio.
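The training procedure described above can be illustrated with a small back-propagation network with one hidden layer and a momentum term. This is a sketch under stated assumptions, not the software used in this work: the class name `BPNetwork`, the learning rate, the weight initialization, the full-batch squared-error updates, and the interpretation of a "sigmoid parameter" as the slope of the logistic function are all assumptions, and the data are synthetic targets already scaled to the interval (0, 1).

```python
import numpy as np

def sigmoid(z, beta=0.9):
    """Logistic function with slope beta (assumed meaning of 'sigmoid parameter')."""
    return 1.0 / (1.0 + np.exp(-beta * z))

class BPNetwork:
    def __init__(self, n_in=4, n_hidden=7, n_out=1,
                 lr=0.5, momentum=0.6, beta=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr, self.mom, self.beta = lr, momentum, beta
        self.vel = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1, self.beta)       # hidden layer
        self.y = sigmoid(self.h @ self.W2 + self.b2, self.beta)  # output layer
        return self.y

    def train_epoch(self, X, t):
        """One full-batch back-propagation step; returns the pre-update MSE."""
        y = self.forward(X)
        err = y - t
        # Derivative of sigmoid(beta * z) is beta * s * (1 - s)
        d_out = err * self.beta * y * (1 - y)
        d_hid = (d_out @ self.W2.T) * self.beta * self.h * (1 - self.h)
        n = len(X)
        grads = [X.T @ d_hid / n, d_hid.mean(axis=0),
                 self.h.T @ d_out / n, d_out.mean(axis=0)]
        params = [self.W1, self.b1, self.W2, self.b2]
        for i, (p, g) in enumerate(zip(params, grads)):
            # Momentum update: keep a fraction of the previous weight change
            self.vel[i] = self.mom * self.vel[i] - self.lr * g
            p += self.vel[i]  # in-place update of the stored weights
        return float(np.mean(err ** 2))

# Synthetic data: 70 samples, 4 inputs, targets in (0, 1)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (70, 4))
t = sigmoid(X @ np.array([[0.8], [-0.5], [0.3], [0.6]]))
net = BPNetwork()
losses = [net.train_epoch(X, t) for _ in range(2000)]
print(losses[0], losses[-1])
```

In practice, the inputs and the Tg targets would be scaled before training and the predictions scaled back to kelvin afterwards; training would stop once the error falls below the permitted value.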

Results and Discussions
The four molecular descriptors ChiA_B(e), SpMax_EA(bo), H7s and DLS_01 appearing in the MLR model (Equation 1) and the corresponding descriptor values are shown in Table 1. Their descriptor characteristics are listed in Table 2, and their definitions [18] are shown in Table 3. The results calculated with Equation 1 are depicted in Figure 2A. The rms errors of the Tgs of the training and test sets are 16.1 and 22.4 K, respectively.
The four descriptors were then fed to the ANN as input vectors. The optimal condition of the neural network was obtained by adjusting various parameters by trial and error. The architecture of the final optimum BP neural network is [4-7-1], with one hidden layer containing seven nodes, a permitted error of 0.00001, a momentum of 0.6, and a sigmoid parameter of 0.9. The results from the ANN method are listed in Table 1 and depicted in Figure 2B, which indicate that the predicted Tg values are close to the experimental ones. The rms error of the training set is 13.6 K (R = 0.939). The test set rms error is 17.1 K (R = 0.902), which is less than the test set error of the previous model (20.5 K) [9]. The mean relative error for the 107 polystyrenes in Table 1 is 3.4%, less than that of the model of Yu et al. [9] (3.7%). Furthermore, it should be noted that the test set in this paper contains 37 polystyrenes, more than the 11 polystyrenes in the test set of Yu et al. [9], and it is much easier to obtain better results on a small test set of polymers. In comparison with the previous model for the Tgs of polystyrenes [9], the statistical quality of our model is accurate and acceptable. Therefore, it is feasible to calculate molecular descriptors from chain segments of polymer backbones comprising 10 repeating units for developing a Tg model of polystyrenes.
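The two error measures quoted above can be computed as in the following sketch. The exact formulas are assumptions, since the text does not define them: the rms error is taken here as the square root of the mean squared residual (dividing by the number of samples), and the mean relative error as the mean of |calculated - experimental| / experimental, expressed in percent. The three Tg values are hypothetical and are not taken from Table 1.

```python
import numpy as np

def rms_error(t_exp, t_calc):
    """Root mean square error, in the same unit as the inputs (K)."""
    d = np.asarray(t_calc, dtype=float) - np.asarray(t_exp, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def mean_relative_error(t_exp, t_calc):
    """Mean of |calc - exp| / exp, in percent."""
    t_exp = np.asarray(t_exp, dtype=float)
    t_calc = np.asarray(t_calc, dtype=float)
    return float(np.mean(np.abs(t_calc - t_exp) / t_exp) * 100.0)

# Hypothetical experimental and calculated Tg values in K
tg_exp = [300.0, 350.0, 400.0]
tg_calc = [310.0, 340.0, 400.0]
rms = rms_error(tg_exp, tg_calc)
mre = mean_relative_error(tg_exp, tg_calc)
print(round(rms, 2), round(mre, 2))
```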
Table 2 shows that each descriptor in Equation 1 has a Sig. value near 0, well below the default level of 0.05, which suggests that these descriptors are significant for Tg. Moreover, all variance inflation factor (VIF) values are less than 2, far below the default threshold of 10. Thus these descriptors are "pure", without "mixing" or contamination from other descriptors, and each descriptor reflects particular molecular structural features affecting Tg.
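A VIF check of this kind can be sketched as follows, using the standard identity that the VIF of descriptor j, 1 / (1 - R_j²), equals the j-th diagonal element of the inverse of the descriptor correlation matrix. The function name `vif` and the synthetic data are assumptions for illustration; the actual values in Table 2 were obtained with SPSS.

```python
import numpy as np

def vif(X):
    """Variance inflation factors from the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)

# Four nearly independent synthetic descriptors: VIFs close to 1
X_indep = rng.normal(size=(70, 4))
vif_indep = vif(X_indep)

# Replace the last descriptor with a near-linear combination of two others
X_coll = X_indep.copy()
X_coll[:, 3] = X_coll[:, 0] + X_coll[:, 1] + 0.05 * rng.normal(size=70)
vif_coll = vif(X_coll)

print(np.round(vif_indep, 2), np.round(vif_coll, 1))
```

The second case shows how a descriptor that is almost a linear combination of others pushes the VIF far above the threshold of 10.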
According to the t-test, the most significant descriptor in the MLR model is ChiA_B(e), a 2D matrix-based descriptor [16]. ChiA_B(e) denotes the average Randic-like index from the Burden matrix weighted by Sanderson electronegativity and is defined as follows:

$$\mathrm{ChiA\_B(e)} = \frac{\mathrm{Chi\_M(e)}}{nBO} \qquad (2)$$

where nBO is the number of graph edges. Chi_M(e) is the Randic-like index calculated by applying Sanderson electronegativity as the vertex weighting scheme and an H-depleted molecular graph as a square matrix:

$$\mathrm{Chi\_M(e)} = \sum_{i=1}^{nSK-1} \sum_{j=i+1}^{nSK} \alpha_{ij} \left[ VS_i(M; e) \cdot VS_j(M; e) \right]^{-1/2} \qquad (3)$$

Here nSK is the number of graph vertices; VS_i(M; e) is the ith matrix row sum; α_ij are the elements of the adjacency matrix, which are equal to one for pairs of adjacent vertices and zero otherwise. ChiA_B(e) reflects information about interatomic distances, bond distances, ring types, planar and non-planar systems, and atom types [16]. A small ChiA_B(e) indicates small interatomic distances, which result in a low degree of rotational freedom and lead to a high Tg.
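The Randic-like index and its average can be sketched numerically, assuming the standard form in which each pair of adjacent vertices contributes the inverse square root of the product of their matrix row sums, and the average divides by the number of graph edges. The function names are assumptions, and for simplicity the example uses the plain adjacency matrix of propane's H-depleted graph as the matrix M; the real descriptor uses the Burden matrix weighted by Sanderson electronegativity, which is not reproduced here.

```python
import numpy as np

def randic_like_index(M, A):
    """Sum over adjacent pairs i < j of (VS_i * VS_j) ** -0.5."""
    M = np.asarray(M, dtype=float)
    A = np.asarray(A, dtype=float)
    vs = M.sum(axis=1)                  # row sums VS_i of the matrix M
    i, j = np.triu_indices_from(A, k=1) # all vertex pairs with i < j
    return float(np.sum(A[i, j] / np.sqrt(vs[i] * vs[j])))

def average_randic_like_index(M, A):
    """Randic-like index divided by the number of graph edges nBO."""
    n_bo = np.asarray(A, dtype=float).sum() / 2.0
    return randic_like_index(M, A) / n_bo

# H-depleted graph of propane: C1-C2-C3 (stand-in for M, an assumption)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
chi = randic_like_index(A, A)
chi_avg = average_randic_like_index(A, A)
print(round(chi, 4), round(chi_avg, 4))  # 1.4142 0.7071
```

With M equal to the adjacency matrix, the row sums are the vertex degrees and the index reduces to the classical Randic connectivity index.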
The second most significant descriptor is the GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) descriptor H7s (H autocorrelation of lag 7, weighted by I-state). H7s encodes information on structural fragments, such as the effective position of substituents and fragments in the molecular space, and accounts for molecular size and shape as well as for specific atomic properties [16]. A large H7s suggests that a polymer has a large side group, which decreases the volume ratio of the phenyl ring to the other substituent groups. The aromatic or cyclic structure in bulky side groups increases the rotational barrier of the backbone chain and leads to a high Tg. Therefore, a polymer with a large H7s, in which the phenyl ring makes up a smaller share of the side group, may have a low Tg.
The next most significant descriptor is the drug-like index DLS_01. DLS_01, a modified drug-like score based on Lipinski's four rules, is calculated as 1 minus the Lipinski Alert Index (LAI). LAI is defined as the ratio of the number of satisfied conditions to the total number of conditions, where the conditions are: (1) there are more than 5 H-bond donors; (2) there are more than 10 H-bond acceptors (N and O atoms); (3) the molecular weight (MW) is over 500; and (4) Moriguchi's logP (MLogP) is over 4.15 [16]. DLS_01 is related to the number of intermolecular hydrogen bonds, which increase the intermolecular forces and determine the size of molecular aggregates. Polymer molecules with a small DLS_01 hold together more strongly because of intermolecular hydrogen bonds, cannot move as easily, and therefore possess high Tgs.
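The definition of DLS_01 translates directly into a short calculation. The sketch below follows the four conditions as stated in the text; the function names are assumptions, and the input values in the example are hypothetical illustrations rather than Dragon output for any polymer in Table 1.

```python
def lipinski_alert_index(h_donors, h_acceptors, mol_weight, mlogp):
    """Fraction of the four Lipinski alert conditions that are satisfied."""
    alerts = [
        h_donors > 5,      # (1) more than 5 H-bond donors
        h_acceptors > 10,  # (2) more than 10 H-bond acceptors (N and O atoms)
        mol_weight > 500,  # (3) molecular weight over 500
        mlogp > 4.15,      # (4) Moriguchi's logP over 4.15
    ]
    return sum(alerts) / len(alerts)

def dls_01(h_donors, h_acceptors, mol_weight, mlogp):
    """DLS_01 = 1 - LAI."""
    return 1.0 - lipinski_alert_index(h_donors, h_acceptors, mol_weight, mlogp)

# Hypothetical hydrocarbon segment: no H-bond sites, heavy and lipophilic
score = dls_01(h_donors=0, h_acceptors=0, mol_weight=1043.5, mlogp=9.0)
print(score)  # 0.5
```

Because only four conditions exist, DLS_01 can take just five values (0, 0.25, 0.5, 0.75 and 1), and for 10-unit segments conditions (3) and (4) are likely to be met in most cases, so the descriptor mainly distinguishes polymers by their H-bond donor and acceptor counts.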
According to the t-test, the fourth and last significant descriptor in the MLR model is SpMax_EA(bo). The edge adjacency index SpMax_EA(bo) is derived from the H-depleted molecular graph and encodes the connectivity between graph edges. It is the leading eigenvalue of the edge adjacency matrix weighted by bond order. SpMax_EA(bo) reflects molecular shape and indicates the substituent position on the phenyl ring for styrenes [16]. Compared to styrenes with substituents in the p- or m-positions of the phenyl ring, a styrene with a substituent in the o-position usually has a larger SpMax_EA(bo), as can be seen from Table 1. Substituents in the o-position enhance the rotational barrier of the backbone chain, increase the rigidity of the polymer chains and result in higher Tgs [9]. Although a variety of factors affect the Tg values of polymeric materials, intermolecular forces and molecular flexibility (or rigidity) are two important ones. The descriptor DLS_01 reflects the intermolecular forces, while the descriptors ChiA_B(e), SpMax_EA(bo) and H7s indicate the stiffness of the polymer. Therefore, the four descriptors can predict Tgs sufficiently.
Figure 3 (Williams plot) was obtained to visualize the applicability domain of the ANN model in this paper. In a Williams plot of standardized residuals vs. leverages, only predictions for samples that fall within this domain may be considered reliable [19,20]. Figure 3 shows that only two samples in the training set, No. 44, poly(4-propoxysulfonylstyrene), and No. 67, poly(2,3,4,5,6-pentafluorostyrene), have leverage h values (0.448 and 0.456, respectively) greater than the warning leverage h* (= 0.214). However, their standardized residual values (0.430 and 0.075, respectively) are less than 3. Such high-leverage samples with small residuals are well fitted by the model; thus poly(4-propoxysulfonylstyrene) and poly(2,3,4,5,6-pentafluorostyrene) can stabilize the ANN model of polystyrenes and make it more accurate.
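The leverage values used in a Williams plot can be sketched as follows. The formulas are the commonly used ones and are assumptions here, since the text does not state them: the leverage of sample i is h_i = x_i^T (X^T X)^{-1} x_i, with X the training descriptor matrix including an intercept column, and the warning leverage is h* = 3(k + 1)/n for k descriptors and n training samples. With k = 4 and n = 70 this gives 0.214, which matches the value quoted above. The descriptor matrix below is synthetic.

```python
import numpy as np

def leverages(X_train, X_query=None):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i, with an intercept column added."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = X_train if X_query is None else np.asarray(X_query, dtype=float)
    X1 = np.column_stack([np.ones(len(X_train)), X_train])
    Q1 = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.inv(X1.T @ X1)
    # Row-wise quadratic form q_i^T (X^T X)^-1 q_i
    return np.einsum("ij,jk,ik->i", Q1, xtx_inv, Q1)

def warning_leverage(n_descriptors, n_train):
    """Warning leverage h* = 3 (k + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train

# Synthetic training matrix: 70 samples, 4 descriptors
rng = np.random.default_rng(0)
X = rng.normal(size=(70, 4))
h = leverages(X)
h_star = warning_leverage(n_descriptors=4, n_train=70)
print(round(h_star, 3))  # 0.214
print(int(np.sum(h > h_star)), "samples exceed the warning leverage")
```

Passing test set descriptors as `X_query` gives the leverages of new samples relative to the training set, which is how the applicability domain is checked for predictions.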

Conclusions
Four molecular descriptors calculated from main-chain segments comprising 10 repeating units were adopted for developing a QSPR model of Tgs for polystyrenes. MLR analysis was used to select the optimal subset of descriptors after molecular descriptor generation for each chain segment. The developed ANN model proved to be accurate and acceptable, with a mean relative error of 3.4% for the whole data set, which is less than that of the model published in the literature, even though our test set contains more samples. Therefore, it is feasible to calculate molecular descriptors from chain segments comprising 10 repeating units in length to develop an ANN model of Tgs for polystyrenes.
By analyzing the correlation between the 551 descriptors and T g s of 70 polystyrenes in the training set with stepwise MLR analysis in IBM SPSS Statistics 19, Equation 1 and the corresponding statistical results were obtained.

Figure 2. Plots of calculated vs. experimental Tg values of polystyrenes: (A) for MLR model; (B) for ANN model.

Figure 3. Williams plot for polystyrenes with a warning leverage of 0.214.

Table 1. Molecular descriptors and Tg data of 107 polystyrenes.
a Tg data were taken from Brandrup et al. [13]; b Tg data were calculated with the ANN model.

Table 2. Characteristics of descriptors appearing in MLR model.

Table 3. The symbol, class and definition for descriptors appearing in MLR models.