TECHNICAL NOTE
Least squares regression with errors in both variables: case studies
Elcio Cruz de Oliveira^{I,*,#}; Paula Fernandes de Aguiar^{II}
^{*}e-mail: elciooliveira@petrobras.com.br
^{#}Programa de Pós-Graduação em Metrologia para Qualidade e Inovação
^{I}Petrobras Transporte S.A., Avenida Presidente Vargas, 328, Centro, 20091-060 Rio de Janeiro - RJ, Brasil / Pontifícia Universidade Católica do Rio de Janeiro, 22453-900 Rio de Janeiro - RJ, Brasil
^{II}Instituto de Química, Universidade Federal do Rio de Janeiro, 21945-970 Rio de Janeiro - RJ, Brasil
ABSTRACT
Analytical curves are normally obtained from discrete data by least squares regression. Least squares regression of data involving significant error in both x and y values should not be implemented by ordinary least squares (OLS). In this work, the use of orthogonal distance regression (ODR) is discussed as an alternative approach that takes the error in the x variable into account. Four examples are presented to illustrate the deviation between the results of the two regression methods. The examples studied show that, in some situations, the ODR coefficients must substitute for those of OLS, whereas in others the difference is not significant.
Keywords: orthogonal distance regression; least squares regression; error in x and y variables.
INTRODUCTION
Classical univariate regression is the most widely used regression method in Analytical Chemistry. It is generally implemented by ordinary least squares (OLS), fitting n points (x_{i}, y_{i}) to a response function, which is usually linear, and handling homoscedastic data.^{1} In this way, the amount of the unknown (x_{0}) is estimated from one or more measurements of its response (y_{0}). The algorithms for carrying out such analytical curve fitting are well established in the literature. When working with heteroscedastic data, Analytical Chemistry uses weighted linear regression.
However, a problem remains in the Analytical Chemistry community: error in the x-axis data. Classical linear regression, available in commercial software, assumes that x-variable errors are negligible, i.e., that x is error-free.^{1,2}
Analytical methods must typically be applicable over a wide range of concentrations. Therefore, a new analytical method is often compared with a standard method by the analysis of samples in which the analyte concentration may vary over several orders of magnitude. In this case, it is inappropriate to use the paired t-test because its validity rests on the assumption that any errors, either random or systematic, are independent of concentration.^{3}
Over wide ranges of concentration, this assumption may no longer be true. A second problem for fitting analytical curves appears when certified reference materials having negligible error are not available.^{4} Therefore, besides the error derived from the signal, the error from the x-axis data must also be considered. In these cases, OLS should not be used, and the literature suggests carrying out orthogonal distance regression (ODR).^{5,6} The aim of this work is to suggest how to handle cases in which errors in both variables must be considered, and to evaluate whether ODR and OLS yield metrologically different results.
METHODOLOGY
General
In general, it is assumed that only the response variable, y, is subject to error and that the predictor variable, x, is known with negligible error. However, there are situations for which the assumption that x is error-free is not justified. In these situations, regression methods are required that take the error of both variables into account. These methods are called errors-in-variables regression methods.^{6}
In OLS analysis, the best fit is chosen to minimize the residual errors in the y direction, i.e., Σ(Y_{i} − Ŷ_{i})^{2}, over all points. For the ODR model, however, the sum of the squares of the x residual, A^{2} = (X_{i} − X̂_{i})^{2}, and of the y residual, B^{2} = (Y_{i} − Ŷ_{i})^{2}, are both minimized. This model results in choosing the regression line that minimizes the sum of the squares of the perpendicular (orthogonal) distances from the data points to the line because, geometrically, C^{2} = A^{2} + B^{2}, as shown in Figure 1.^{7}
If η_{i} represents the true value of y_{i} and ξ_{i} the true value of x_{i}, with ε_{i} and δ_{i} representing the experimental errors in each, respectively, then

y_{i} = η_{i} + ε_{i} (1)

x_{i} = ξ_{i} + δ_{i} (2)

The model that describes the straight-line relationship between η and ξ is

η_{i} = β_{0} + β_{1}ξ_{i} (3)

Consequently, the combination of Equation (3) with Equations (1) and (2) yields

y_{i} = β_{0} + β_{1}x_{i} + (ε_{i} − β_{1}δ_{i}) (4)
where the last term includes the experimental errors. The fitted line is then the one for which the sum of the squared orthogonal distances, d_{i}, is least; the method has thus been called orthogonal distance regression (ODR). This is equivalent to finding the first principal component of a data set consisting of two variables and n samples.^{6}
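The equivalence to the first principal component can be checked numerically. The sketch below, with arbitrary illustrative data, computes the ODR slope both from the first principal component of the centered data and from the closed-form orthogonal-regression expression; both routes give the same line:

```python
import numpy as np

def odr_slope_pca(x, y):
    """Slope of the first principal component of the centered (x, y) data,
    which coincides with the orthogonal distance regression line."""
    X = np.column_stack([x - np.mean(x), y - np.mean(y)])
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, np.argmax(eigvals)]  # direction of largest variance
    return v[1] / v[0]

def odr_slope_direct(x, y):
    """Closed-form orthogonal regression slope (equal error variances)."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (syy - sxx + np.sqrt((syy - sxx)**2 + 4*sxy**2)) / (2*sxy)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
# Both routes give the same slope, illustrating the PCA equivalence
print(odr_slope_pca(x, y), odr_slope_direct(x, y))
```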
ODR statistics
The maximum likelihood method is the most widely used approach to solve regressions with errors in both axes; however, the literature reports other methods as well.^{8-12}
The likelihood function for n pairs of values (x_{i}, y_{i}), based on a multidimensional model suitable for describing experimental data fluctuations (the multivariate normal), is:^{13}

L = Π_{i=1}^{n} [1/(2πσ_{x}σ_{y})] exp{−[(x_{i} − µ_{xi})^{2}/(2σ_{x}^{2}) + (y_{i} − β_{0} − β_{1}µ_{xi})^{2}/(2σ_{y}^{2})]} (5)
Both variables are affected by random measurement errors: the x variable, with variance σ_{x}^{2}, and the y variable, with variance σ_{y}^{2}. The parameters β_{0} and β_{1} are estimated by b_{0} and b_{1}, respectively.
If both error variances are assumed constant, their ratio λ is known and can be defined as follows:

λ = σ_{y}^{2}/σ_{x}^{2} (6)
Substituting Equation (6) into Equation (5) yields the following:

L = Π_{i=1}^{n} [1/(2πσ_{x}^{2}√λ)] exp{−[1/(2σ_{x}^{2})][(x_{i} − µ_{xi})^{2} + (y_{i} − β_{0} − β_{1}µ_{xi})^{2}/λ]} (7)
and its logarithm is given by

ln L = −n ln(2πσ_{x}^{2}√λ) − [1/(2σ_{x}^{2})] Σ_{i=1}^{n} [(x_{i} − µ_{xi})^{2} + (y_{i} − β_{0} − β_{1}µ_{xi})^{2}/λ] (8)
Maximizing the log likelihood function (8) with respect to the disturbing (nuisance) parameters^{14} µ_{xi} yields

µ̂_{xi} = [λx_{i} + β_{1}(y_{i} − β_{0})]/(λ + β_{1}^{2}) (9)
Substituting Equation (9) into Equation (8) results in the profiled log likelihood function, which depends only on β_{0}, β_{1}, and σ_{x}^{2}.
Differentiating this new equation with respect to these three estimators and setting the derivatives equal to zero, which is the approach of Deming,^{8} gives the estimate b_{1} (Equation 10) as

b_{1} = {s_{y}^{2} − λs_{x}^{2} + [(s_{y}^{2} − λs_{x}^{2})^{2} + 4λs_{xy}^{2}]^{1/2}}/(2s_{xy}) (10)
where s_{y}^{2} and s_{x}^{2} are the variances of the y variable and the x variable, respectively, and s_{xy} is the covariance of y and x.
Because both variables are affected by random measurement errors, with the simplest case being λ = 1 (i.e., σ_{x}^{2} = σ_{y}^{2}), an unbiased estimation of the regression coefficients can be obtained by minimizing Σd_{i}^{2}, i.e., the sum of the squares of the perpendicular distances from the data points to the regression line, where the values of d_{i} are determined perpendicular to the estimated line.^{6}
The expressions for b_{1} and b_{0} are

b_{1} = {s_{y}^{2} − s_{x}^{2} + [(s_{y}^{2} − s_{x}^{2})^{2} + 4s_{xy}^{2}]^{1/2}}/(2s_{xy}) (11)

b_{0} = ȳ − b_{1}x̄ (12)
with s_{x}^{2} = Σ(x_{i} − x̄)^{2}/(n − 1), s_{y}^{2} = Σ(y_{i} − ȳ)^{2}/(n − 1), and s_{xy} = Σ(x_{i} − x̄)(y_{i} − ȳ)/(n − 1).
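The Deming/orthogonal estimators above can be sketched in Python. This is a minimal illustration under the stated assumptions (λ known and constant); the sample data are hypothetical:

```python
import numpy as np

def deming_fit(x, y, lam=1.0):
    """Deming regression slope and intercept.

    lam is the (assumed known) ratio of the y-error variance to the
    x-error variance; lam = 1 reduces to orthogonal distance regression.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.var(x, ddof=1)                 # variance of x
    syy = np.var(y, ddof=1)                 # variance of y
    sxy = np.cov(x, y, ddof=1)[0, 1]        # covariance of x and y
    b1 = (syy - lam*sxx
          + np.sqrt((syy - lam*sxx)**2 + 4*lam*sxy**2)) / (2*sxy)
    b0 = np.mean(y) - b1*np.mean(x)
    return b0, b1

x = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.11, 0.19, 0.42, 0.58, 0.83, 0.98]
b0, b1 = deming_fit(x, y, lam=1.0)
print(b0, b1)
```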
Confidence intervals
To test for bias, i.e., the equivalence of the compared methods, the 95% confidence intervals of the estimators from the linear equation y = b_{0} + b_{1}x, obtained after the orthogonal regression, were used to test whether the optimal estimators b_{0} = 0 and b_{1} = 1 are included in the spanned confidence intervals (CI):^{15}

CI(b_{0}) = b_{0} ± t·s_{b0} (13)

CI(b_{1}) = b_{1} ± t·s_{b1} (14)
The standard errors s_{b0} and s_{b1} of the estimators b_{0} and b_{1} are calculated from the expressions given by Mandel,^{16} and t is the Student t-factor with p = 95% and f = n − 2 degrees of freedom.
The ideal values b_{0} = 0 and b_{1} = 1 imply no bias between the compared methods, i.e., equivalence in the calibration results. A failure of the test for the axis intercept b_{0} implies a systematic bias, e.g., a bias caused by a wrong blank correction in one of the methods. If the test fails for the slope b_{1}, a proportional bias is implied. Combinations of the two errors can also occur.
OLS versus ODR
Mandel^{17} considers an approximate relationship between the ordinary least squares slope, b_{1}(OLS), and the orthogonal distance regression slope, b_{1}(ODR), as follows:

b_{1}(ODR) = b_{1}(OLS)/(1 − s^{2}_{ex}/s^{2}_{x}) (19)

where s^{2}_{ex} is the variance of a single x value (which involves replicate observations of the same x) and s^{2}_{x} is the variance of the x variable.
Table 1 shows the relationship between s^{2}_{ex} and the ratio b_{1}(ODR)/b_{1}(OLS) when a perfect system is considered, i.e., when s^{2}_{x} is constant and equal to 1.
On the basis of Table 1, Figure 2 shows that s^{2}_{ex}/s^{2}_{x} and b_{1}(ODR)/b_{1}(OLS) exhibit behavior close to linear when the variance of a single x value is much lower than that of the x variable, i.e., for ratios from 0.0 to 0.2.
When the ratio s^{2}_{ex}/s^{2}_{x} increases up to 0.5, the best regression seems to be quadratic, as shown in Figure 3, based on Table 1.
As s^{2}_{ex}/s^{2}_{x} approaches unity, b_{1}(ODR)/b_{1}(OLS) grows rapidly toward infinity, as shown in Figure 4, also based on Table 1.
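Reading Mandel's approximation as b_{1}(ODR) = b_{1}(OLS)/(1 − s^{2}_{ex}/s^{2}_{x}), the behavior just described can be reproduced numerically; the sketch below mirrors the construction of Table 1 (s^{2}_{x} held constant at 1):

```python
# Ratio b1(ODR)/b1(OLS) = 1 / (1 - s2ex/s2x) for increasing
# error-variance ratios: near-linear up to 0.2, then curving,
# then diverging as the ratio approaches 1.
ratios = {r: 1.0 / (1.0 - r) for r in [0.0, 0.1, 0.2, 0.5, 0.9, 0.99]}
for r, k in ratios.items():
    print(f"s2ex/s2x = {r:4.2f} -> b1(ODR)/b1(OLS) = {k:7.3f}")
```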
EXPERIMENTAL
Four case studies using ODR are discussed in this work. In these examples, equal errors are considered in both variables.
In the first study, a catalytic fluorimetric method is compared with a photometric technique for the determination of the level of phytic acid in urine samples, and confidence intervals are calculated to evaluate the equivalence between the methods. The instrumentation is described by March.^{18} Moreover, an inadequate yet frequently used approach, the t-test, is also applied to the same set of data.
In the second study, the regression of an analytical curve using a certified reference material (CRM) for the determination of the copper content in water by Flame Atomic Absorption Spectrometry (FAAS) is carried out by both the OLS and ODR approaches, and the results are then compared. The instrumentation is described by Oliveira.^{19}
The next study involves comparison of potentiometric stripping analysis (PSA) and atomic absorption spectroscopy (AAS) in determining lead in fruit juices and soft drinks. The instrumentation is described by Mannino.^{20}
For the last study, ammonium ions derived from the mainstream smoke of a cigarette are trapped (retained), extracted, and analyzed by ion chromatography with a conductivity detector.^{21} The analytical curve was built with reference materials (RM) instead of CRM; in this case, ODR is the recommended regression.
RESULTS AND DISCUSSION
All data were tested for the normality assumption by the Shapiro-Wilk test, for homoscedastic behavior by the Cochran test, for independence by the Durbin-Watson test, and for lack of fit by ANOVA.^{6}
All calculations were performed in Microsoft Excel.
Case study 1
The level of phytic acid in urine samples was determined by a catalytic fluorimetric (CF) method and the results were compared with those obtained using an established extraction photometric (EP) technique. The results, in mg/L, are the means of triplicate measurements, as shown in Table 2.
ODR line: ŷ = −0.056 + 0.996x
CI(b_{0}) = −0.056 ± 0.090 (−0.147, 0.034)
CI(b_{1}) = 0.996 ± 0.040 (0.955, 1.036)
On the basis of Equation (13), the confidence intervals include the optimal estimators for the slope b_{1} and intercept b_{0}, which are one and zero, respectively, showing equivalence between the two methods.
If the t-test^{3} were used, which is incorrect, the methods would not be considered equivalent because, at the 95% confidence level, the calculated t value (3.59) is higher than the critical t value (2.09).
Case study 2
Table 3 presents the concentrations and replicate measurements for the determination of copper in water by FAAS. Data regression shows that there is no difference between the ODR coefficients (see equations below) and those derived from OLS. As a CRM was used, s^{2}_{ex} is negligible when compared with s^{2}_{x}, so s^{2}_{ex}/s^{2}_{x} is very close to zero. Therefore, in this case, the error in the x-axis can be considered negligible in the regression of the analytical curve.
OLS line: ŷ = 0.0004 + 0.0784x
ODR line: ŷ = 0.0004 + 0.0784x
CI(b_{1}ODR) = 0.0784 ± 0.0007 (0.0778, 0.0790)
There is no significant difference between the two approaches when a certified reference material (CRM) is used to build the analytical curve. However, when a CRM is not available or is too expensive, a certified sample cannot be used. Under these conditions, the x-axis error must be taken into account in the regression, and the OLS line can differ from that derived from ODR.
For this example, an examination of the values of s_{ex} in Table 4 indicates that the ODR line must be considered rather than the OLS line, using the Deming approach given in Equation (19). Standard deviations of a single x value higher than 0.030 mg/mL cause b_{1}(ODR) to exceed the confidence interval (0.0778, 0.0790).
Case study 3
This case study compares a new potentiometric method (y variable) with a reference flameless AAS method (x variable) for the determination of Pb in fruit juices and soft drinks. The results, in mg/dm^{3}, are based on three replicate determinations for AAS and five replicate determinations for PSA, as shown in Table 5.
ODR line: ŷ = −0.0030 + 0.9686x
CI(b_{0}) = −0.0030 ± 0.0154 (−0.0109, 0.0170)
CI(b_{1}) = 0.9686 ± 0.0828 (0.8933, 1.0489)
On the basis of Equations 13, 14, and 15, the confidence intervals include the optimal estimators for the slope b_{1} and the intercept b_{0}, which are one and zero, respectively, showing equivalence between the methods.
By the t-test, the comparison of the methods is also considered equivalent because, at the 95% confidence level, the calculated t value (0.597) is lower than the critical t value (2.262).
Case study 4
Five reference materials were each measured once, providing the results shown in Table 6.
Considering the variance ratio s^{2}_{ex}/s^{2}_{x} for each RM as 10%, the b_{1}(ODR) coefficient can be estimated from the b_{1}(OLS) regression using Equation 19 as follows:

b_{1}(ODR) = b_{1}(OLS)/(1 − 0.10) ≈ 1.11·b_{1}(OLS)
CONCLUSIONS
From the studied examples, it was possible to observe that when the errors involving x-axis data can be considered metrologically negligible, OLS should be applied; otherwise, ODR should be used. It should be emphasized that the difference between the ODR and OLS results increases significantly as the ratio s^{2}_{ex}/s^{2}_{x} increases. Furthermore, classical comparison between methods must be done by ODR confidence intervals rather than the t-test, owing to the errors residing in both axes.
There is a need to evaluate the impact of error in the xaxis data before performing linear regression because the inadequate application of regression may lead to substantially different conclusions.
Received August 29, 2012
Accepted February 3, 2013
Published on the web June 4, 2013
1. Tellinghuisen, J.; Analyst 2010, 135, 1961.
 2. Synek, V.; Accred. Qual. Assur. 2001, 6, 360.
3. Miller, J. N.; Miller, J. C.; Statistics and Chemometrics for Analytical Chemistry, 4th ed., Prentice Hall: Harlow, UK, 2000.
 4. Martinez, A.; Riu, J.; Ruis, F. X.; Chemom. Intell. Lab. Syst. 2000, 54, 61.
5. Boggs, P. T.; Spiegelman, C. H.; Donaldson, J. R.; Schnabel, R. B.; J. Econometrics 1988, 38, 169.
6. Massart, D. L.; Vandeginste, B. G. M.; Buydens, S. J.; Lewi, P. J.; Smeyers-Verbeke, J.; Handbook of Chemometrics and Qualimetrics: Part A, Elsevier: Amsterdam, 1997.
7. Cornbleet, P. J.; Gochman, N.; Clin. Chem. 1979, 25, 432.
 8. Poch, J.; Villaescusa, I.; J. Chem. Eng. 2012, 57, 490.
 9. Kane, M.; Mroch, A.; Appl. Meas. Educ. 2010, 23, 215.
 10. Riu, J.; Rius, F. X.; J. Chemom. 1995, 9, 343.
 11. Riu, J.; Rius, F. X.; Anal. Chem. 1996, 68, 1851.
 12. Sousa, J. A.; Reynolds, A. M.; Ribeiro, A. S.; Accred. Qual. Assur. 2012, 17, 207.
 13. Danzer, K.; Wagner, M.; Fischbacher, C.; Fresenius J. Anal. Chem. 1995, 352, 407.
 14. Garthwaite, P. H.; Jolliffe, I. T.; Jones, B.; Statistical Inference, Prentice Hall International Limited: UK, 1995.
15. Wienold, J.; Traub, H.; Lange, L.; Giray, T.; Recknagel, S.; Kipphardt, H.; Matschat, R.; Panne, U.; J. Anal. At. Spectrom. 2009, 24, 1570.
 16. Mandel, J.; J. Qual. Techn. 1984, 16, 1.
 17. Mandel, J.; The Statistical Analysis of Experimental Data, Dover Publications: New York, 1964.
 18. March, J. G.; Simonet, B. M.; Grases, F.; Analyst 1999, 124, 897.
 19. Oliveira, E. C.; Monteiro, M. I. C.; Pontes, F. V. M.; Almeida, M. D.; Carneiro, M. C.; Silva, L. I. D.; Neto, A. A.; J. AOAC Int. 2012, 95, 560.
 20. Mannino, S.; Analyst 1982, 107, 1466.
21. Oliveira, E. C.; Muller, E. I.; Abad, F.; Dallarosa, J.; Adriano, C.; Quim. Nova 2010, 33, 984.
Publication Dates
Publication in this collection: 08 Aug 2013
Date of issue: 2013