SciELO - Scientific Electronic Library Online

 
vol.43 número1Biotransformation of the monoterpene, limonene, by Fusarium verticilloides índice de autoresíndice de materiabúsqueda de artículos
Home Pagelista alfabética de revistas  

Brazilian Archives of Biology and Technology

versión impresa ISSN 1516-8913

Braz. arch. biol. technol. vol.43 no.1 Curitiba  2000

http://dx.doi.org/10.1590/S1516-89132000000100001 

Non-parametric fitting of nonlinear equations to experimental data without use of initial guessing: a basic computer program

 

 

Emmanuel M. PapamichaelI*, Nickolaos P. EvmiridisII and Constaninos PotosisI

ISector of Organic Chemistry and Biochemistry,
IISector of Analytical and Inorganic Chemistry, Department of Chemistry, University of Ioannina, 45110 Ioannina, Greece

 

 


ABSTRACT

In this work, we present a non-parametric method, and the appropriate computer program, for fitting nonlinear multiparametric equations to experimental data. Our method is followed by computation of confidence limits of the parameter estimates. Its performance has been tested on several multiparametric equations, common in the fields of Biochemistry and Biotechnology, and it is a multiparametric expansion of the concept proposed by others for equations having more than two parameters. Good parameter estimates were obtained without a previous knowledge of initial parameter guessing values, and the proposed computer program converges rapidly, in all cases examined within this work.

Key words: Non-parametric methods, computer program, fitting of nonlinear equations.


RESUMO

Neste trabalho, apresentamos um método não-paramétrico, e o programa de computação apropriado, para ajustar equações multiparamétricas não lineares para dados experimentais. Nosso método é seguido pelo cálculo dos limites de confiança dos parâmetros estimados. Seu desempenho tem sido testado em várias equações multiparamétricas, comuns nos campos da Bioquímica e Biotecnologia, e é uma expansão multiparamétrica do conceito proposto por outros, para equações que tenham mais de dois parâmetros. Obtivemos estimativas de parâmetros confiáveis sem um conhecimento prévio de parâmetros iniciais de valores esperados, e o programa de computação proposto converge rapidamente, em todos os casos examinados dentro deste trabalho.


 

 

INTRODUCTION

A relatively high incidence of outliers is observed during measurements with sensors; some of them are due to the sensing device, others due to the chemistry of the process and others due to sampling procedures. The outliers are a real problem when few replicates are provided; this is a common practice in most kinetic determinations.

This outliers problem becomes a real nuisance when one needs to fit such a series of experimental data to a nonlinear multiparametric equation, that is always the case in Biochemistry, where best parameter estimates are required. The problem becomes more acute when good initial guessing values of the parameter estimates are required.

To overcome this situation different criteria of closeness of fit and/or different kind of fitting algorithms can be used; rules for rejecting outliers can, also, help but they are ineffective in several cases (Anscombe 1960). Alternatively, non-parametric methods have been developed (Eisenthal & Cornish-Bowden, 1974a) where the normality, the uniform variance and other requirements are replaced by an assumption of equally probable positive and negative errors. The latter is not always correct, but it is better than the assumptions made for the parametric methods.

In this report, we present a non-parametric method of fitting nonlinear multiparametric equations to experimental data, the corresponding computer program, and its statistical treatment. Initial guessing values for the parameters of the multiparametric equation under consideration are not required. Our method is an extension of that developed earlier (Eisenthal & Cornish-Bowden, 1974a), and it can be applied to model equations having more than two parameters.

 

PRINCIPLES

Function Transformation: In many cases the response of a monitored process as it is the rate of a biochemical reaction, is described by a nonlinear multiparametric model equation of the form:

y = f(x; a,b,c, . . ,p) (1),

where x and y are the independent variable and dependent response, respectively and a, b, c, . . , p the parameters. In practice the data xi,yi are obtained experimentally. Therefore, each experimental data point (xi,yi) should be considered as the known, while the equation parameters as the unknowns to be determined.

In principle, equation (1) can be transformed to the form of a: (i) line equation, (ii) plane equation, (iii) hyperplane equation, for a two or three or more parameters equation, respectively. In the general case the hyperplane equation has the form:

 

 

where P = {Pi} (i = 1, 2, . . .,p) is the vector of the axial components (unknown terms) of any hyperplane point in a hyperspace, and (i = 1, 2, . . .,p) is the intersection of a hyperplane with the ith axis (known terms). Consequently, a hyperplane is defined for each experimental data point (xi,yi) within the given hyperspace.

If errorless data points are applied to a known multiparametric equation the estimated parameter values a, b, c, . , p are the same for all data points and therefore the intersection of all defined hyperplanes is a point whose coordinates correspond to the true parameter values.

However, due to random experimental errors in measurements the hyperplanes are intersected in several ways, by two, by three, . , by p, while only the intersection by p will give points in a space of p dimensions.

The number of intersections: For equations with p parameters, and n experimental data points, the maximum total number of hyperplane intersections is given by the formula:

 

 

where equals to the number of intersections of n hyperplanes by p (Binomial Coefficients, n>p).

The coordinates of all hyperplane intersection points that correspond to parameter estimates are collected in p columns, and sorted in respect to their arithmetical values. The median value of each column is chosen as the best estimate of the corresponding parameter a, b, c, . . .,p.

The error structure: For an infinite number of independent observations (xi,yi), errors ei are supposed to be equally probable either positive or negative. Correspondingly, the true parameter values a, b, , p, are the coordinates of a point lying below or above of each hyperplane defined by each (xi,yi) data point. This supports our choice to define the median values as the estimates of a, b, , p.

The sample median is regarded as a more reliable estimate of the population average value than the sample mean, and alternative approaches to this mater could be certainly found (Cornish-Bowden & Eisenthal, 1974b). This is outside of the purpose of this manuscript. Herein, we would like to put the idea of non-parametric curve fitting of the multiparametric nonlinear equations; at least, good initial guessing values of the parameters were estimated. The latter being sometimes as the Ancient Greek proverb "The begining (is) the Half of Everything" (Platonis opera, Leges).

An example:

As an example it will be used a three parameter equation, which represents the calibration curve of chemiluminescence generated from oxidized pyrogallol by periodate (Papamichael & Evmiridis, 1988), the following:

 

 

Having a sufficient number n of experimental points we will compute estimates of the true values of parameters a, b and c without the knowledge of initial guessing values. For each experimental observation (xi,yi), equation (1) could be written as:

 

 

which is transformed to:

 

 

for yi 0. Equation (6) falls into a general form of plane Equation (7),

 

 

where P1 = a, P2 = b, P3 = c, and

 

 

Due to experimental errors in measurements of the response, equation (5) can be written as:

 

 

where a, b, and c are the true but unknown values of the a, b, and c parameters, xi are the errorless values of the independent variable, and ei is the difference between observed and true response values independently where errors are confined. The total number of intersections should be but only are considered as intersection points.

All mentioned herein, are referred to transformable equations to a suitable form of a hyperplane equation. On the other hand any equation can generally be transformed by arbitrary replacements and proper restrictions.

Equation (9a) can be transformed if an integer value is given, temporarily, to the parameter d (i.e. d=1, or d=2), and equation (9b) by making more than one transformations

 

CONFIDENCE REGIONS OF THE PARAMETER ESTIMATES

Number of regions: In a finite experiment with n data points, n hyperplanes can be drawn according to equation (2) which divide the hyperspace into 2n regions at the most. No one hyperplane could be parallel to another and/or do not pass through the origin. These restrictions are in line with the concept of generating hyperplanes from experimental data points that include random errors ei. This is illustrated in Fig.1 for n=10 and p=3.

 

 

Relative positions of regions and hyperplanes: According to theory there should not one single point of intersection of all n hyperplanes defined by the (xi,yi) data points, due to random experimental errors ei (Eq, 8). On the other hand, hyperplanes as they intersected each other divide hyperspace in regions.

Thus, each hyperplane associated with a specific experimental point is either above (ei positive) or below (ei negative) of the point the coordinates of which are the true parameter values a, b, c, . , p. That point is, obviously, within these regions and it can be used to define the confidence limits of the parameter estimates (Cornish-Bowden & Eisenthal, 1974b).

Designation of regions: Each ei error will be either positive(+) or negative(-) and it is associated to a specific region. Therefore, the total number of ei will be n, i.e. e1, e2, , en. Furthermore, if plus(+) and minus(-) signs of the errors ei are replaced by 1 and 0 respectively, the designation of the hyperspace regions becomes numerical. Thereafter, regions can be labeled either in a binary or in a decimal form. For example, a region in Fig. 1, labeled 24 in decimal form, corresponds to binary 0000011000 for n=10 data points or to binary 11000 for n=5 data points. In Fig. 2 a region labeled 696, in decimal form, corresponds to binary 1010111000.

 

 

Permutations: If each ei value has a median expectation of zero, and all ei-values are independent, then all possible permutations of signs among ei values are equally likely. Since the possible permutations of n signs is 2n the probability for each permutation to occur is 2-n. Confidence regions, as just defined, are rigorous but they extend to infinity and include estimates of the true parameter values that must be rejected as absurd.

Table 1 illustrates the way of calculation of Binomial Coefficients and Probabilities with different number of positive and/or negative signs of n errors ei from the 2n possible permutations. Calculations are given for an equation with three parameters having n=10, and for zero positive and ten negative signs, for one positive and nine negative signs, and so on up to ten positive and zero negative signs.

 

 

From Table 1 we conclude that there is a probability of about 25% for five positive and five negative signs, moreover there is a greater probability for four to six positive and six to four negative signs. This result provides a confidence region of about 66% for a, b, and c within experiment comprising those regions that predict about equal number of positive and/or negative signs of the ei values.

Runs: Another way to define confidence regions is that of considering the number of runs, which is the number of separate strings of 0s and 1s, in the binary notation, of a confidence region.

Any objection raised on defining confidence regions with the concept of permutations, can be removed by considering the number of runs of positive and negative signs in the series of errors, instead of the total number of positive and negative signs (Cornish-Bowden & Eisenthal, 1974b).

For a region 24 corresponds a sequence of signs  - - - - - + + - - -  containing three runs (for n=10) whereas for a region 696 corresponds a sequence of signs  + - + - + + + - - -  containing six runs. The number of runs in random sequence of binary digits obey to Binomial Distribution and for n digits they are permutations with m runs. The probability that there are m runs in n digits is then .

In Table2 is illustrated the calculation of the probabilities for positive and/or negative signs using the same example as in Table 1. A probability of about 50% is calculated for five or six positive (or negative) signs, whereas the probability for four to seven positive (or negative) signs is over than 80%. Conclusions about confidence regions are about the same, using both methods, the latter being more convenient and useful since may give enclosed confidence regions, and relatively small in extent.

 

 

CALCULATIONS

The median estimates of the parameters a, b, c, . , p are well calculated by solving successive systems of p equations, in p unknowns. For the example of equation (4), and for y ¹ 0 we may write:

 

 

where i, j, k =1, 2, 3, . ., n.

When it happens to be k=j or k=i or j=i then the respective counter is increased by one and we proceed so that to have always i ¹ j ¹ k. Another constraint for such a system should be (for all counters). These non-equalities are important for the presented method. The only constraint for this technique is the equation under consideration to be transformable in a form like that of equation (2).

A variety of methods could be chosen to solve a system of p equations in p unknown; in this work we preferred to use the Cholesky's or Croutãs method (Shoup, 1983). In the example, equation (10) could be solved to one unknown term, successively, and written as:

 

 

The above equations represent, one plane in a space of three dimensions which intersects (i) Z-axis at the point , (ii) Y-axis at the point - and (iii) X-axis at the point - . Consequently, intersections by three-planes will be defined by the planes of equations (11) and (12), of which the coordinates of their intersection points give suitable estimates of the a, b, and c parameters.

 

THE PROGRAM

The program is written in BASIC and has been developed under ZBASIC compiler for Macintosh computers. The user have to edit several program lines before "RUNning" it, as follows:

  1. Input the number of (a) data pairs and (b) parameters, of the used equation (constants ndp and npr).

  2. Select the binomial coefficient (binco), from a suitable TABLE, according to the data inputted just above, and assign that value to the appropriate constant. Edit DIM, and CLEAR statements.

  3. Edit program lines that describe the system of the linear simultaneous equations, which should correspond to the transformations of the used nonlinear equation; they found within multiple FOR - NEXT nested loops.

  4. Add as many as FOR - NEXT nested loops as it is the number of the parameters of the used nonlinear equation; add, also, the appropriate IF - THEN statements within nested loops, and edit the end of the last IF - THEN statement.

  5. Finally, add as many as [dx( )=...] assignments, of the loop counters, as it is the number of the parameters of the used equation, run the program and collect results.

 

RESULTS AND DISCUSSION

The presented method gave excellent results when applied to several multiparametric equations, common in Chemistry and Biochemistry; such as:

 

 

it is described thoroughly in this manuscript (Papamichael & Evmiridis, 1988; Papamichael & Evmiridis, 1991) as equation (4).

 

 

it describes the rate of chlorinating of 1,2-dichloroethane to 1,1,2-trichloroethane, and 1,1,1,2- and 1,1,2,2-tetrachloroethanes (Kafarof 1976).

 

 

it describes the rate of substrate depletion by an enzyme that displays non-Michaelis-Menten kinetics.

 

 

it is similar that the above one, and describes a different situation of non-Michaelis-Menten kinetic behavior.

 

 

it describes the same procedure as in (ii), by using two parameters, for a more convenient curve fitting.

Experimental (xi,yi) data pairs were used for both the ordinary and the non-parametric fitting of the above equations (i) to (iii); for equations (iv) and (v) were used simulated data produced accordingly (Papamichael & Evmiridis, 1991). For the ordinary fitting were used either commercial curve fitting packages (UltraFit, 1991), or self-written programs (Papamichael & Evmiridis, 1988).

The results are summarized in Table 3. Estimates of the parameters of all equations were found very close to their true values. The program was converged after few seconds in most cases; in case of equation (iv) of Table 3 the program converged after 1.6 h, most probably due to the large number of parameters and data points. In difficult situations, when the number of data points are relatively few compared to the number of parameters of the used equation, the program was given parameter estimates which can be used as good initial guessings for other methods.

 

 

Objections could be raised on the efficiency and/or the usefulness of the proposed method and program. At a first glance, (a) the method is not applicable for certain nonlinear equations e.g. the Gompertz equation y = ae-be-cx (Ratkowsky, 1983), or (b) all linear forms of nonlinear multiparametric equations can be fitted to any set of data points by multivariate linear regression methods without need of initial guessed values for their parameters.

Any equation can be generally transformed to a suitable form like equation (2) by arbitrary replacements and proper restrictions including the Gompertz one (Ratkowsky, 1983). On the other hand, in linear regression we make assumptions on which we based in order to accept that the minimum variance estimators are unbiased and normally distributed. However, to do so in nonlinear situations we need a very large number of data which are unavailable as far as it is concern the fields of Biochemistry and/or Biotechnology, in most cases (Ratkowsky, 1983).

Based on the above we can conclude that the efficiency, the statistical robustness, and the usefulness of the proposed method and program have been verified.

 

REMARKS

The Program Listing is available on request. It is written under Z-BASIC for Macintosh; however, there is a version of the Z-BASIC compiler for IBM compatibles. Alternatively, authors could help on transformation of the Program Listing under any available compiler.

 

REFERENCES

Anscombe .F.J. (1960), Rejection of outliers Technometrics, 2, 123-147.         [ Links ]

Eisenthal R., Cornish-Bowden A. (1974a), The direct linear plot: A new graphical procedure, Biochem J., 139, 715-720.         [ Links ]

Eisenthal R., Cornish-Bowden A. (1974b), Statistical considerations in the estimation of enzyme kinetic parameters by the direct linear plot and other methods, Biochem J., 139, 721-730.         [ Links ]

Kafarov V., (1976), Cybernetic Methods in Chemistry & Chemical Engineering, MIR Publishers, Moscow, pp.376-379.         [ Links ]

Papamichael E. M., Evmiridis N. P., (1988), Program based on the pattern search method: application to periodate determination using FIA analysis and chemiluminescence detection, TrA.C., 7, 366-370.         [ Links ]

Papamichael E. M., Evmiridis N. P., (1991), Investigation of the error stucture of the calibration curve for, periodate determination by FIA analysis and chemiluminescence detection, C.I.L.S., 7, 39-47.         [ Links ]

Plotonis opera, Leges (Plato's works laws), 6, 753.         [ Links ]

Shoup T. E, (1983), Numerical methods for the Personal Computer, PRENTICE HALL, INC. Englewood Clifts, New Jersey, pp.45-57.         [ Links ]

UltraFit, (1991), The non-Linear curve fitting package, BIOSOFT, Cambridge U.K., pp. 5-48.         [ Links ]

 

 

Received: February 15, 1999;
Revised: April 18, 1999;
Accepted June 06, 1999.

 

 

*Author for correspondence