A New Application of PC-ANN in Spectrophotometric Determination of Acidity Constants of PAR

The acidity constants of 4-(2-pyridylazo)resorcinol (PAR) were determined by principal component analysis-artificial neural networks (PC-ANN), using simulated and experimental spectral data. Triprotic acid mass balance equations and the corresponding spectral profiles generated by a Gaussian model were used to simulate all required absorbance-pH data. Noise with zero mean and different standard deviations (1-3% of the maximum absorbance values) was superimposed on the simulated spectra. A triangular experimental design was used to select the simulated acidity constants. The effect of white noise at different levels on the prediction ability of the model was also studied. A fully experimental data set, photometric titration data of PAR in the pH range 1.50-13.00, was used as a test set. The obtained acidity constants are in good agreement with values previously reported using DATAN software.


Introduction
The accurate determination of acidity constants is often required in various chemical and biochemical areas. These values are of vital importance in understanding the distribution, transport behavior, receptor binding and mechanism of action of certain pharmaceutical preparations [1,2]. The acidity constants of organic reagents play a fundamental role in many analytical procedures such as acid-base titration, solvent extraction and complex formation reactions. However, the determination of the acidity constants of these molecules faces several drawbacks, such as low solubility in aqueous solutions and low values of the acidity constants. Therefore, in order to enhance the acidity constants on the one hand and to increase the solubility on the other, we are forced to use mixed solvents.
Spectroscopic methods are, in general, highly sensitive and are therefore suitable for studying chemical equilibria in solution. If the components involved can be obtained in pure form, or if their spectral responses do not overlap, such analysis is generally trivial. For many systems, particularly those with similar components, this is not the case, and these systems have been difficult to analyze. To overcome this problem, graphical and computational methods have to be employed. Up to the middle of the 1960s, the evaluation of equilibrium measurements was based on various graphical methods. These methods were reviewed in considerable detail by Rossotti and Rossotti [3]. From the middle of the 1960s, computers acquired ever greater importance in the evaluation of equilibrium measurement data, using multiple wavelengths or the full spectral domain to determine stability and acidity constants. The most relevant reports are on LETAGROP-SPEFO [4], SPECFIT [5], SQUAD [6] and HYPERQUAD [7]. All these computational approaches start from an initial proposal of a chemical equilibrium model that defines the species stoichiometries, rely on the mass-action law and mass balance equations (hard modeling methods), and involve least squares curve fitting procedures. In contrast, soft modeling approaches are free from the restriction of the mass-action law and do not require an initial species model to be set up. Artificial neural networks (ANNs) are among the most powerful techniques of soft or model-free computation. The use of soft modeling goes back to 1971, when Lawton and Sylvestre [8] introduced chemometrics-based methods for spectral analysis. A basic review of the applications of ANNs was published by Gasteiger and Zupan [9].
Recently, ANNs have been successfully applied in capillary zone electrophoresis [10][11][12], for modeling in ion chromatography [13] and in electrokinetic micellar chromatography [14], without the need to know or determine physicochemical parameters. An ANN is a well-known soft modeling tool that does not require a mathematical model of the input-output relationship to be known or established [15]. Here we used an ANN to establish a nonlinear relationship between the score vectors of the absorbance-pH data matrix as input and the acidity constants of a triprotic acid as output. To our knowledge, this is the first report on the application of PC-ANN to the determination of acidity constants using pH-absorbance titration data.

Theory
Multivariate spectrophotometric data were generated based on the Beer-Lambert law. A matrix C was calculated from the chemical reaction model and the appropriate formation (equilibrium) constants. The columns of C are the concentration profiles of the absorbing species of the chemical equilibrium system. The rows of the matrix A contain the respective absorption spectra of the components involved in the chemical equilibrium model. According to the Beer-Lambert law, the matrix product CA gives a matrix Y of the individual absorbance readings at all wavelengths and at each pH step. An error term with zero mean and a standard deviation in the range 1-3% of the maximum absorbance value, generated by the Gaussian random number generator of MATLAB and represented by the matrix R, is added [16][17][18]:
$$Y = CA + R \qquad (1)$$
Principal component analysis (PCA) was used to convert each spectral data set into a simpler, more compact and more usable form. It is noteworthy that PCA has been widely applied in data mining to investigate data structure. In PCA, new orthogonal variables (latent variables or principal components) are obtained by maximizing the variance of the data. The number of latent variables (factors) is much lower than the number of original variables, so the data can be visualized in a low-dimensional PC space (the space spanned by the principal components). While PCA greatly reduces the dimensionality of the space, it retains as much of the original information as possible in the new space. PCA is particularly useful in spectrophotometric studies because it divides the spectral information into two orthogonal matrices, scores and loadings: the scores carry the information most relevant to the concentration matrix, whereas the loadings carry the information most related to the matrix of pure spectral profiles. In this study PCA was used for data compression and to extract the score matrix used as the input for the ANN model.
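The data model of equation (1) and the score/loading decomposition can be illustrated with a minimal NumPy sketch. The matrix sizes, the random placeholder profiles and the use of an uncentred singular value decomposition are assumptions made only for illustration; the original program was written in MATLAB and its exact PCA settings are not stated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 24 pH steps, 4 absorbing species, 45 wavelengths.
n_ph, n_species, n_wl = 24, 4, 45

# C: concentration profiles (columns = species); each row sums to 1.
C = rng.random((n_ph, n_species))
C /= C.sum(axis=1, keepdims=True)

# A: placeholder pure spectra (rows = species).
A = rng.random((n_species, n_wl))

# R: zero-mean Gaussian noise with sd = 1% of the maximum absorbance.
Y_clean = C @ A
R = rng.normal(0.0, 0.01 * Y_clean.max(), size=Y_clean.shape)
Y = Y_clean + R  # equation (1): Y = CA + R

# PCA via SVD of the (uncentred, by assumption) data matrix.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
n_pc = 4
scores = U[:, :n_pc] * s[:n_pc]   # shape (n_ph, n_pc), related to C
loadings = Vt[:n_pc]              # shape (n_pc, n_wl), related to A

# Fraction of the total sum of squares captured by the first n_pc components.
explained = (s[:n_pc] ** 2).sum() / (s ** 2).sum()
```

Because the noise-free matrix CA has rank four here, the first four components capture nearly all of the variation and the remaining components mostly describe R.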
The theoretical aspects of artificial neural networks are described in several papers [19,20]. The development of the artificial neural network (ANN) has provided a powerful tool for nonlinear approximation. A multi-layered ANN with enough neurons can approximate almost any nonlinear input-output mapping to any required accuracy. An ANN is a mathematical structure designed to mimic the information processing functions of a network of neurons in the brain [21,22]. ANNs are highly parallel systems that process information through many interconnected units that respond to inputs through modifiable weights, thresholds and mathematical transfer functions. Each unit processes the pattern of activity it receives from other units and then broadcasts its response to still other units. ANNs are particularly well suited to problems in which large data sets contain complicated nonlinear relations among many different inputs [24][25][26].
The neural network used in this work is of the back-propagation (BNN) type. A typical back-propagation neural network has three layers: the input, the hidden and the output layers. The activation of a neuron is defined as the sum of the weighted input signals to that neuron:
$$u_j = \sum_i w_{ij} x_i + \mathrm{bias}_j \qquad (2)$$
where $w_{ij}$ is the weight of the connection to neuron $j$ in the actual layer from neuron $i$ in the previous layer, $x_i$ is the signal received from neuron $i$, and $\mathrm{bias}_j$ is the bias of neuron $j$. The sum $u_j$ of the weighted inputs is transformed with a transfer function to give the output level. Several functions can be used for this purpose, but the sigmoid function is the most widely applied [27]. This function is as follows:
$$y_j = \frac{1}{1 + e^{-u_j}} \qquad (3)$$
where $y_j$ is the output of neuron $j$. The BP network learns by adjusting its weights according to the error E (equation 4). The goal of training a network is to change the weights between the layers in a direction that minimizes the error E:
$$E = \sum_p \sum_k (t_{pk} - y_{pk})^2 \qquad (4)$$
The error E of a network is defined as the squared differences between the target values t and the outputs y of the output neurons, summed over the p training patterns and the k output nodes.
After the error is obtained, all weights are corrected in turn using the back-propagation derivative equations, and these two steps are repeated until the error reaches an acceptable value. In the back-propagation step, a learning rate and a momentum must be applied: the learning rate adjusts the rate of variation of the model, and the momentum carries the effect of the weight change from the last iteration into the next correction. To construct an ANN model, three sets of data must be used: a calibration set for training the initially random weights of the network, a prediction set for finding the optimum values of the parameters and the architecture of the network, and a test set for evaluating the efficiency of the model.
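The training loop described above (weighted-sum activation, sigmoid transfer function, summed squared error, and weight correction with a learning rate and momentum) can be sketched as follows. This is a minimal illustration, not the authors' MATLAB program: the function name `train_bp`, the toy data, the sigmoid output layer and the values of `lr` and `momentum` are all assumptions chosen so that the example converges.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid transfer function applied to the activation u."""
    return 1.0 / (1.0 + np.exp(-u))

def train_bp(X, T, n_hidden=8, lr=0.02, momentum=0.9, n_iter=2000, seed=0):
    """Batch back-propagation with momentum for one hidden layer (sketch)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
    b2 = np.zeros(n_out)
    params = (W1, b1, W2, b2)
    steps = [np.zeros_like(p) for p in params]  # previous weight changes
    errors = []
    for _ in range(n_iter):
        # Forward pass: weighted sums followed by the sigmoid.
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # Squared differences summed over patterns and output nodes.
        errors.append(float(((T - Y) ** 2).sum()))
        # Backward pass: derivatives of the error w.r.t. each activation.
        dY = -2.0 * (T - Y) * Y * (1.0 - Y)
        dH = (dY @ W2.T) * H * (1.0 - H)
        grads = (X.T @ dH, dH.sum(axis=0), H.T @ dY, dY.sum(axis=0))
        # Momentum keeps part of the previous change in the next correction.
        for p, step, g in zip(params, steps, grads):
            step *= momentum
            step -= lr * g
            p += step
    return params, errors

# Toy data (hypothetical): 20 patterns, 4 inputs, 3 targets scaled to (0, 1).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, (20, 4))
T = sigmoid(X @ rng.normal(0.0, 1.0, (4, 3)))
params, errors = train_bp(X, T)
```

In practice the targets (here pKa values) would need to be scaled into the output range of the sigmoid, or a linear output layer used; which of these the original program did is not stated.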

Data set
Three different data sets were used to establish the model: a calibration set, a prediction set and a test set [30].

Simulated data
In order to obtain simulated data similar to those of real experiments, we designed a number of triprotic acids with different sets of pKa values (pKa1, pKa2, pKa3) at reasonable intervals. A triangular experimental design was applied to construct the data set. For these virtual triprotic acids the spectral information had to be approximated: each spectrum was built as a combination of Gaussian bands, weighted according to the chemical behavior of the species in solution. Random noise was added to the absorbance values of each spectrum to make the spectra as realistic as possible. Figure 1 shows a sample of simulated spectra for the prediction set. For all numerical experiments, the spectra were simulated using Gaussian functions in the wavelength range 380-600 nm at 5 nm intervals; the means were 450, 480, 500 and 520 nm, the widths were 25, 28, 31 and 41 nm, and the maximal molar absorptivities were set to 1.2, 1.5, 1.4 and 2.0 mol^-1 cm^-1 for the four chemical species of a triprotic acid, respectively. Noise with zero mean and a constant standard deviation of 0.01-0.03 (1-3%) was added to the simulated data.
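A minimal sketch of this simulation is given below. It uses the band parameters quoted above and the standard distribution fractions of a triprotic acid; treating the quoted "width" as the Gaussian standard deviation, the pH grid of 1.5-13.0 in steps of 0.5, the unit total concentration and path length, and the function names are assumptions made for illustration.

```python
import numpy as np

def species_fractions(pH, pKa):
    """Fractions of H3A, H2A-, HA2-, A3- at each pH (rows sum to 1)."""
    h = 10.0 ** (-np.asarray(pH, dtype=float))
    h = h[:, None]
    K1, K2, K3 = 10.0 ** (-np.asarray(pKa, dtype=float))
    terms = np.hstack([h ** 3,
                       K1 * h ** 2,
                       K1 * K2 * h,
                       K1 * K2 * K3 * np.ones_like(h)])
    return terms / terms.sum(axis=1, keepdims=True)

def gaussian_spectra(wl, centres, widths, heights):
    """One Gaussian band per species; rows = species, columns = wavelengths."""
    wl = np.asarray(wl, dtype=float)[None, :]
    c = np.asarray(centres, dtype=float)[:, None]
    w = np.asarray(widths, dtype=float)[:, None]   # assumed to be the sd
    a = np.asarray(heights, dtype=float)[:, None]
    return a * np.exp(-0.5 * ((wl - c) / w) ** 2)

def simulate_titration(pKa, noise_level=0.01, seed=0):
    """Noisy absorbance-pH matrix for one virtual triprotic acid."""
    pH = np.linspace(1.5, 13.0, 24)      # 1.5 to 13.0 in steps of 0.5
    wl = np.arange(380, 601, 5)          # 380-600 nm in 5 nm steps
    C = species_fractions(pH, pKa)
    A = gaussian_spectra(wl, (450, 480, 500, 520),
                         (25, 28, 31, 41), (1.2, 1.5, 1.4, 2.0))
    Y = C @ A
    rng = np.random.default_rng(seed)
    Y = Y + rng.normal(0.0, noise_level * Y.max(), size=Y.shape)
    return pH, wl, C, Y

# Example with hypothetical pKa values inside the designed ranges.
pH, wl, C, Y = simulate_titration((2.0, 6.0, 11.0))
```

Calling `simulate_titration` once per designed acid, with `noise_level` set between 0.01 and 0.03, would give one absorbance-pH matrix per sample.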

Experimental data
These real data sets [29][30] are visible spectra of 4-(2-pyridylazo)resorcinol (PAR) in different compositions of methanol-water at designed pH intervals.

Model constructing
In this part an ANN model was trained to predict the pKa values using the spectral information of each sample as input data. Each sample corresponds to a matrix whose rows are the pH values and whose columns are the wavelengths. Principal component analysis was used to compress this matrix into more compact score and loading matrices. The sum of the first four PCs was used as the input vector for each sample. Several combinations of scores were tested, and the sum of the first four principal components (scores) was found to give the best calibration model. The values of pKa1, pKa2 and pKa3 form the output nodes. The PCA-ANN program was written in MATLAB 6.1 (Mathworks) in our laboratory, and the back-propagation strategy was used for training the network. The parameters of the model were optimized using the prediction set, and finally a 12-8-3 model was obtained. The real test set was used to evaluate the model under experimental conditions.
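One possible reading of "the sum of the first four PCs" is sketched below: the first four score vectors of a sample matrix are added element-wise, giving one value per pH step. The function name `summed_scores` and the sign convention are assumptions; singular vectors are only defined up to sign, so some convention is needed for the sum to be reproducible, and the one used by the original program is not stated.

```python
import numpy as np

def summed_scores(Y, n_pc=4):
    """Sum of the first n_pc PCA score vectors of one sample matrix.

    Y has one row per pH value and one column per wavelength.
    """
    Y = np.asarray(Y, dtype=float)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    scores = U[:, :n_pc] * s[:n_pc]
    loadings = Vt[:n_pc]
    # Assumed sign convention: flip a component when its loading sums
    # to a negative value, so the result does not depend on SVD signs.
    signs = np.where(loadings.sum(axis=1) < 0, -1.0, 1.0)
    return (scores * signs).sum(axis=1)

# Rank-one check with hypothetical positive profiles c (pH) and a (spectrum).
c = np.linspace(0.1, 1.0, 24)
a = np.linspace(0.5, 2.0, 45)
x = summed_scores(np.outer(c, a), n_pc=1)
```

For the rank-one example the single score vector equals `c` scaled by the norm of `a`, which gives a quick sanity check of the sign handling.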

Results and Discussion
The main aim of this work was to define and establish a theoretical model that predicts physicochemical properties of the molecules of interest from experimental data, without any prior knowledge of the system under study.
All simulations used in this study were based on theoretical knowledge and analytical relationships. A triangular design was applied to create different artificial acidity constants. In this design a triangle was used, with each of pKa1, pKa2 and pKa3 scaled along one of its sides. That is, pKa1 varied from 1 to 3, while pKa2 and pKa3 varied between 5-7 and 10-13, respectively, each along its own side of the triangle. Figure 2 shows all of the designed acids with their pKa values on the related triangle. The pKa values of all designed acids are listed in Table 1. Some of these acids were randomly selected for the prediction set. The concentrations of the components of each acid were obtained from the mass-balance relations and the corresponding simulated pH values. The absorbance data matrix was obtained by multiplying the concentration matrix by the pure spectral profiles of all components. The pH values were varied in the range 1.5 to 13.0 in steps of 0.5. The resulting data matrix was used as the input for the PC-ANN model. To reduce the dimension of the input matrix, PCA was applied. PCA decomposes the data set into the corresponding score and loading matrices; the scores were then used as input data because they carry the information about the concentration profiles, and as a consequence the uncertainty in the simulation of the spectral profiles could be neglected. Because the samples are triprotic acids at different pH values, four components are present in each sample, and therefore four principal components (PCs) were selected. The sum of these PCs at each pH value was used for each acid as the input nodes of the ANN model. The ANN model relates the scores to the concentrations, and hence to the pKa values, well because of the high performance of ANNs in predicting nonlinear and complex effects. The model was optimized using the prediction set; the momentum and learning rate values are 0.1 and 0.9, respectively, and 8 nodes were used in the hidden layer. The mean square errors (MSEs) for the training and prediction sets were plotted against the number of iterations (Figure 3); finally, at iteration number 9500 the ANN

Figure 1. A sample of simulated spectra of the prediction set.

Figure 3. Variations of MSE vs. the number of iterations for the training and the prediction sets.

Table 1. The simulated acidity constants of the assumed triprotic acids for the training and prediction sets

Table 2. Real and predicted pKa values of PAR using the constructed optimized model