
Adaptive basis selection for functional data analysis via stochastic penalization

Cezar A.F. Anselmo; Ronaldo Dias; Nancy L. Garcia

Departamento de Estatística, IMECC, UNICAMP, Caixa Postal 6065, 13081-970 Campinas, SP, Brazil. E-mails: cafa@ime.unicamp.br, dias@ime.unicamp.br, nancy@ime.unicamp.br

ABSTRACT

We propose an adaptive method of analyzing a collection of curves which can be, individually, modeled as a linear combination of spline basis functions. Through the introduction of latent Bernoulli variables, the number of basis functions, the variance of the error measurements and the coefficients of the expansion are determined. We provide a modification of the stochastic EM algorithm for which numerical results show that the estimates are very close to the true curve in the sense of L2 norm.

Mathematical subject classification: Primary: 62G05, Secondary: 65D15.

Key words: basis functions, SEM algorithm, functional statistics, summary measures, splines, non-parametric data analysis, registration.

1 Introduction

It is very common to have data that come as samples of functions, that is, the data are curves. Such curves can be obtained either from an on-line measuring process, where the data are collected continuously in time, or from a smoothing process applied to discrete data. Functional Data Analysis (FDA) is a set of techniques that can be used to study the variability of functions from a sample, as well as of their derivatives. The major goal is to explain the variability within and among the functions. For an extensive discussion of such techniques see Ramsay and Silverman (2002).

In this paper, we assume that each curve can be modeled as a linear combination of B-spline basis functions. The coefficients of the expansion can be obtained by the least squares method. However, in doing so, we get a solution for which the number of basis functions equals the number of observations, achieving interpolation of the data. Interpolation is not desirable since we have noisy data. To avoid this problem we propose a new method to regularize the solution using stochastic penalization. This is possible through the introduction of latent Bernoulli random variables which indicate the subset of basis functions to be selected and, consequently, the dimension of the space.

Before applying the model we have to understand the different sources of variability for curves. Variability for functions can be of two types: range and phase. The range variability gives the common pattern to each individual or function. Phase variability, on the other hand, can mask the common pattern of the functions. The usual situation, where both sources of variability are present, requires complex estimation techniques. As an example, consider height: it is well known that growth velocity is very high for young children and slows down as age increases. Different children have different growth velocities in scale (range variation) as well as in time (phase variation). The mean function - also called the (cross-)sectional mean - is a descriptive statistic which is widely used to give a rough idea of the process generating the curves. Figures 1.1-1.3 show some examples of variability. Notice in Figure 1.3 that the cross-sectional mean can be very misleading in the presence of phase variability. For all simulated results in this section we used low variance noise to better illustrate each individual curve.



Our approach, like most approaches in Functional Data Analysis, can only be used for curves in the presence of range variability. If phase variability is present, it is necessary to align the curves using a method introduced by Ramsay and Li (1998) called registration. After registration is done, we can find the mean of the registered curves - the structural mean. Figure 1.4 presents the same curves as Figure 1.3 after registration. It is clear that in this case the structural mean is much closer to the real curve than the arithmetic sectional mean of the original curves.


This paper is organized as follows. Section 2 presents the proposed model under range variability. An algorithm to estimate the curves, based on a modification of the stochastic EM algorithm, is given in Section 3. The numerical results presented in Section 4 are based on small simulation studies of two types of curves. The plots and the mean square errors (MSE) obtained show that the technique is highly successful and very adaptive. Moreover, the improvement observed as the number of curves increases leads us to believe that the method is consistent. In Section 5 we provide a modification of the continuous registration algorithm (Ramsay and Li 1998) which has better performance than the original one in some cases.

2 Proposed model

Suppose we have m individuals with ni observations at points xij ∈ A ⊆ ℝ, i = 1, …, m and j = 1, …, ni. Let Yij = Y(xij) be the response obtained from the i-th individual at the point xij. Assuming that the observations are subject only to range variability, we have the following model

Yij = g(xij) + εij,

where εij are normally distributed with zero mean and constant variance σ².

Rice (2000) suggested that random coefficients should be included in the model to account for individual curve variation. In fact, in many applications, the observed curves differ not only due to experimental error but also due to random individual differences among subjects. In this paper we propose one possible way of accommodating this random effect. Assume that g can be written as a random linear combination of spline functions as

g(x) = Σ_{k=1}^{K} Zki βki Bk(x),

where Bk(·) are the well known spline basis functions (cubic B-splines) and Zki are independent Bernoulli random variables with Zki ∈ {0, 1} and P(Zki = 1) = θki, for k = 1, …, K and i = 1, …, m. That is, for zi = (z1i, …, zKi),

P(Zi = zi) = Π_{k=1}^{K} θki^{zki} (1 − θki)^{1 − zki}.

To simplify the notation, let Yi = (Yi1, …, Yini), Zi = (Z1i, …, ZKi), βi = (β1i, …, βKi) and B(xi) = (B1(xi), …, BK(xi)), with xi = (xi1, …, xini).
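To fix ideas, the following R sketch simulates a single curve from this model. The grid, the number of candidate basis functions, the coefficients and the inclusion probabilities are illustrative values rather than quantities taken from the paper, and splines::bs() is used to build the cubic B-spline design matrix.

```r
# Illustrative simulation of the model: a curve built from cubic B-splines whose
# inclusion is governed by independent Bernoulli indicators Z_ki (assumed values).
library(splines)

set.seed(123)
n     <- 100                          # observations per curve (n_i = n)
K     <- 10                           # number of candidate basis functions
x     <- seq(0, 1, length.out = n)    # common sample grid

B     <- bs(x, df = K, degree = 3, intercept = TRUE)  # n x K design matrix B_k(x_ij)
theta <- rep(0.5, K)                  # inclusion probabilities theta_ki (assumed)
beta  <- rnorm(K, mean = 0, sd = 2)   # expansion coefficients beta_ki (assumed)
Z     <- rbinom(K, size = 1, prob = theta)            # latent indicators Z_ki

g <- drop(B %*% (Z * beta))           # g(x_ij) = sum_k Z_ki beta_ki B_k(x_ij)
y <- g + rnorm(n, sd = 0.25)          # Y_ij = g(x_ij) + eps_ij,  eps_ij ~ N(0, sigma^2)
```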

The conditional density of (Yi |Zi = zi) is given by:

where φ(·) denotes the standard multivariate normal density. Their joint density is

Thus the joint log-density of (Y, Z) with respect to a dominating measure can be written as:

Note that maximizing the complete log-likelihood f(y, z | σ², β, θ) is equivalent to solving a stochastic penalized least squares problem associated with (2.5). Since log(θ/(1 − θ)) < 0, increasing the number of variables decreases both the sum of squares and the last term in (2.5). Therefore, we can interpret the latent variables zki as regularization parameters, that is, as a stochastic penalization.

In addition, the sum of the zki random variables provides the number of basis functions needed to fit the model (the dimension of the space), and the values of zki indicate which variables should be included in the model for i = 1, …, m. However, this kind of representation leads to a limiting solution with zki = 1 and θki → 1 as K → ∞, which makes the model non-identifiable. One possible way to avoid this problem is to use the following transformation:

with 0 < λi < M for all i. Observe that θki goes to 1 − e^{−λi} for large values of βki. That is, large values of βki indicate that the basis function associated with this coefficient should be included in the model with high probability θki. In order to avoid extreme values of θki we suggest using M = 1, which limits the inclusion probability to 1 − e^{−1}. Thus, the complete log-likelihood can be rewritten as:

Without loss of generality, suppose from now on that ni ≡ n for all i = 1, …, m.

3 A variation of the stochastic EM algorithm

The EM algorithm was introduced by Dempster, Laird and Rubin (1977) to deal with estimation in the presence of missing data. In our model, the Zki are non-observable random variables, so the EM algorithm could be applied. More specifically, the algorithm iteratively finds the value of θ that maximizes the complete likelihood, where at each step the variables Zki are replaced by their expectations under the conditional density of (Z|Y), which is given by

where

The appealing feature of the EM algorithm is that it increases the incomplete likelihood function at each iteration. However, it is well known that the EM algorithm can reach a saddle point or a plateau. Moreover, there is a high computational cost in finding f(z|y). The stochastic EM (SEM) algorithm proposed by Celeux and Diebolt (1992) is an alternative that overcomes these limitations of the EM algorithm. At each iteration, the missing data are replaced by simulated values Zki generated according to f(z|y). To simulate these values we notice that

and the Metropolis-Hastings algorithm could be used. The theoretical convergence properties of the SEM algorithm are difficult to assess, since they involve the study of the ergodicity of the Markov chain generated by the SEM algorithm and the existence of the corresponding stationary distribution. For particular cases and under regularity conditions, Diebolt and Celeux (1993) proved convergence in probability to a local maximum. However, computational studies have shown that the SEM algorithm can even outperform the EM algorithm in several cases, for example with censored data (Chauveau 1995) and in the mixture case (Celeux, Chauveau and Diebolt 1996). A drawback of this procedure is that it requires thousands of simulations, so the computational cost would again be very high.

In this work we propose a modification of the simulation step of the SEM algorithm. At each step, instead of generating the Zki from the conditional density, we generate them from their marginal Bernoulli distribution, using estimates of θ obtained from the data. Notice that if the Metropolis-Hastings algorithm were used to generate from the conditional distribution, with the marginal as the proposal distribution, the acceptance probability would be

Therefore, small changes at each step make the acceptance probability very high and the performances of the approximation and the SEM algorithm do not differ substantially.

Specifically, the algorithm can be described as follows.

Algorithm 3.1

1. Fix the maximum number of basis functions to be used to represent each curve;

2. For each curve i, take the initial inclusion probabilities θki = 1/2, k = 1, …, K;

3. At iteration l:

(a) Simulate the Zki as Bernoulli random variables with success probability θki, repeating until more than 3 basis functions are selected;

(b) Estimate the coefficients βi of the selected basis functions using least squares;

(c) Estimate λi by maximizing the complete likelihood subject to λi < 1;

(d) Estimate the error variance σ² using maximum likelihood;

(e) Update θki = 1 − exp{·} through the transformation above, using the current estimates of λi and βki;

(f) Save the current coefficient estimates;

(g) If the stopping criterion is satisfied, stop; otherwise, return to 3a.

4. Summarize all the obtained curves.
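For concreteness, the R sketch below implements a simplified single-curve version of Algorithm 3.1. It is a rough illustration rather than the authors' implementation: λ is held fixed instead of being estimated in step (c), the update of the inclusion probabilities uses an assumed bounded transformation of the coefficients (which respects the 1 − e^{−1} cap mentioned in Section 2), a small floor keeps non-selected basis functions eligible in later iterations, and the function name and default values are hypothetical.

```r
# Simplified single-curve sketch of Algorithm 3.1 (hypothetical implementation).
library(splines)

fit_curve_sem <- function(x, y, K = 15, lambda = 1, max_iter = 1000, delta = 1e-4) {
  B     <- bs(x, df = K, degree = 3, intercept = TRUE)  # candidate cubic B-spline basis
  theta <- rep(0.5, K)                  # step 2: initial inclusion probabilities
  beta  <- numeric(K)
  g_old <- rep(mean(y), length(y))
  converged <- FALSE

  for (l in seq_len(max_iter)) {
    # (a) simulate Z from its marginal Bernoulli distribution, at least 4 selected
    repeat {
      Z <- rbinom(K, size = 1, prob = theta)
      if (sum(Z) > 3) break
    }
    sel <- which(Z == 1)

    # (b) least-squares estimates of the coefficients of the selected basis functions
    fit       <- lm.fit(B[, sel, drop = FALSE], y)
    beta[]    <- 0
    beta[sel] <- fit$coefficients
    beta[is.na(beta)] <- 0
    g_new     <- drop(B %*% beta)

    # (d) error variance estimate
    sigma2 <- mean((y - g_new)^2)

    # (e) update inclusion probabilities; the bounded form below is an assumption of
    #     this sketch (it never exceeds 1 - exp(-lambda)); the floor of 0.05 keeps
    #     non-selected basis functions eligible in later iterations
    theta <- pmax(0.05, 1 - exp(-lambda * beta^2 / (1 + beta^2)))

    # (g) stopping criterion: MSE between two successive curve estimates below delta
    if (mean((g_new - g_old)^2) < delta) { converged <- TRUE; break }
    g_old <- g_new
  }
  list(coef = beta, fitted = g_new, sigma2 = sigma2, theta = theta, converged = converged)
}
```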

In order to apply Algorithm 3.1 we need to specify:

• The maximum number of basis functions;

• The summarization procedure of the curves;

• The stopping criterion.

The maximum number of basis functions. There is no consensus on a criterion for fixing the maximum number of basis functions in an adaptive procedure. Dias and Gamerman (2002) suggest using at least 3b + 2 as a starting point in Bayesian non-parametric regression, where b is the number of bumps of the curve. For our approach, numerical experiments give evidence that a maximum of 4b + 3 basis functions is large enough.

Summary measures. For each iteration l in Algorithm 3.1, a coefficient vector is obtained for each curve i. This vector contains the estimates of the coefficients of the basis functions selected (through the Zki) for the ith curve, with the positions of the non-selected basis functions filled with zeros, and the estimate of the ith curve is given by

gi^(l)(x) = Σ_{k=1}^{K} βki^(l) Bk(x).

The final estimate for the ith curve could be

Although (3.3) provides a natural estimate for each curve, other summary measures can be proposed through weighted averages of the coefficients of the selected basis functions. For example, for m = 1 and taking L = 3 iterations with K = 5 basis functions, assume that at the three iterations we selected the sets of basis functions {1,3,4}, {2,3,5} and {3,4,5}, respectively. We could use

or

There are several other ways to weight the coefficients. Notice that we can summarize the estimates across the iterations to compute the estimate of the ith curve in a way completely analogous to the way we summarize the individual estimates to obtain the final estimate. Therefore, we drop the superscript l and present here only summaries of the coefficient vectors used to obtain the curve estimate.

Observe that (3.4) and (3.5) are the unweighted and weighted versions of

with

In this case, the weights take into account the number of times each basis function was considered in the model.

We can construct a similar summary measure by taking:

with

The weights proposed in (3.7) take into account two factors: the number of basis functions needed to approximate each curve and the number of times each basis function was selected. Another proposal is:

with

This equation is analogous to (3.7), but uses as weight the inverse of the number of basis functions needed to approximate each curve. That is, the larger the number of basis functions needed to approximate the curve, the smaller its weight.

After computing the summary coefficients, we can summarize the curve as

where the summary coefficient vector is obtained through (3.6), (3.7) or (3.8) and B is the design matrix given by the X variables.
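Since equations (3.4)-(3.8) are (weighted) averages of the saved coefficient vectors, the following R sketch illustrates the idea with two simple variants. The exact weights of the paper are not reproduced here; the weighting used below (averaging each coefficient over the iterations in which its basis function was selected) is only one plausible choice, and the function name is hypothetical.

```r
# Two illustrative summaries of the saved coefficient vectors (assumed weighting).
summarize_coefficients <- function(beta_mat) {
  # beta_mat: L x K matrix; row l holds the coefficients saved at iteration l,
  # with zeros in the positions of the non-selected basis functions
  counts <- colSums(beta_mat != 0)            # times each basis function was selected

  beta_unweighted <- colMeans(beta_mat)                    # plain average over iterations
  beta_weighted   <- colSums(beta_mat) / pmax(counts, 1)   # average over the iterations in
                                                           # which the basis was selected
  list(unweighted = beta_unweighted, weighted = beta_weighted, counts = counts)
}

# The final curve summary is then g_hat = B %*% beta_summary, with B the basis design
# matrix evaluated on the sample grid, e.g.:
# g_hat <- drop(B %*% summarize_coefficients(beta_mat)$weighted)
```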

In Section 4 we show that all these proposals provide very good estimates in terms of MSE.

The stopping criterion. We propose a flexible stopping criterion. Let δ > 0 be such that we wish to stop the estimation process when the MSE between two successive estimates is smaller than δ. For each estimated curve, if the maximum number of iterations (1000) is reached and the MSE is still larger than δ, set δ ← cδ, with c > 1, and restart the estimation process. In the simulation study we used c = 1.3 and obtained good results.
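A minimal sketch of this rule in R, reusing the hypothetical fit_curve_sem() from the sketch of Algorithm 3.1; the initial tolerance is an illustrative value, while c = 1.3 follows the text.

```r
# Flexible stopping rule: relax the tolerance by the factor c whenever the maximum
# number of iterations is reached without convergence, then restart the estimation.
run_with_flexible_stop <- function(x, y, delta = 1e-4, c = 1.3, max_iter = 1000) {
  repeat {
    fit <- fit_curve_sem(x, y, max_iter = max_iter, delta = delta)
    if (fit$converged) return(fit)
    delta <- c * delta      # MSE still above delta after max_iter iterations
  }
}
```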

4 Simulation

For all the simulations we used data with no outliers. Without loss of generality we took equally spaced observations for each curve and used the same sample grid for all individuals. Moreover, the curves are already registered and aligned by some registration method. How to register the curves is discussed in Section 5.

Simulations were run on a dual-processor Athlon machine with a 2.0 GHz processor and 1.5 GB of RAM.

The software used was Ox (http://www.nuff.ox.ac/uk/users/Doornik) and R (http://www.cran.r-project.org), running on a Linux platform.

The test curves used were

and

The observations Yt are generated from the curves above plus a noise term εt. The variables {εt, t = 1, …, n} are iid normal random variables with zero mean and standard deviation σ. For comparison, we ran the simulations in three cases: small (σ = 1/10), moderate (σ = 1/4) and large (σ = 1/2) standard deviation.

Instead of using the raw data to estimate the β's, we use a smoothed version of them called the structural mean. There are two ways of smoothing the data. The first takes the average of the data (discrete observations) and then smooths it. The second first smooths each curve and then takes the average of the smoothed curves. Simulation studies showed that there is no difference between these methods in terms of mean square error (MSE). Therefore, from now on, we use as input data the smoothed version of the curves obtained by first averaging the raw data and then smoothing the resulting curve.
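A minimal sketch of this "average first, then smooth" construction in R; the particular smoother (smooth.spline) and the function name are assumptions made for illustration only.

```r
# Structural mean used as input: average the raw curves, then smooth the average.
structural_mean <- function(Y, x) {
  # Y: m x n matrix of raw observations (m curves measured on the common grid x)
  ybar <- colMeans(Y)               # cross-sectional average of the raw data
  fit  <- smooth.spline(x, ybar)    # smoothing step (assumed smoother)
  predict(fit, x)$y                 # smoothed structural mean on the grid
}
```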

First, we analyze the MSE when we estimate the final curve using Equation (3.9) and weights given by (3.6), (3.7) and (3.8). We simulated 3 curves and added noise with small variance (σ = 1/10). Figure 4.5 presents the estimated curves. Notice that all three estimates are practically the same and coincide with the true curve. Convergence was attained after 21, 141 and 159 iterations, respectively. Figure 4.6 presents the same 3 curves but with higher variance noise (σ = 1/2). In this case, for one of the curves convergence was not attained even after 1000 iterations. Using the flexible stopping criterion described before, we needed just 8 extra iterations to achieve convergence.



To assess the performance of Algorithm 3.1, we ran several simulations with different values of m and different variances for the functions given by (4.1) and (4.2). As expected, the larger the variance, the lower the quality of the estimation. On the other hand, the estimators appear to be consistent (the larger the sample size, the better the estimate). Table 4.1 summarizes these results. All summary measures give approximately the same resulting curve. Measure (3.7a) gives better results for high variance or larger sample sizes, while (3.8a) is better in all other cases except one, where (3.6a) is better. The occurrence of several identical results is caused by the simplicity of the curves and the small sample size, which reduce the number of distinct solutions.
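As an end-to-end usage illustration of the sketches above, the following R code generates m noisy copies of a stand-in test curve (the curves (4.1) and (4.2) themselves are not reproduced here), forms the structural mean, fits it with the hypothetical SEM sketch and reports the MSE against the true curve; all values are illustrative.

```r
# Hypothetical end-to-end run: data generation, structural mean, fit, MSE.
set.seed(1)
g_true <- function(t) sin(2 * pi * t)          # stand-in test curve (not (4.1)/(4.2))
x      <- seq(0, 1, length.out = 100)
m      <- 10                                   # number of curves
sigma  <- 1 / 4                                # moderate noise level

Y <- t(replicate(m, g_true(x) + rnorm(length(x), sd = sigma)))  # m x n raw data

y_struct <- structural_mean(Y, x)              # sketch from this section
fit      <- run_with_flexible_stop(x, y_struct)    # sketches from Section 3
mse      <- mean((fit$fitted - g_true(x))^2)   # MSE against the true curve
```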

5 Registration

Registration techniques can be found in Ramsay and Li (1998) and Ramsay (2003). Implementations of these techniques are available in R and Matlab. The fda package in the R language provides two techniques: landmark and continuous registration.

The landmark technique is appropriate when the curves to be registered have prominent features such as valleys and peaks. Suppose we have curves g0, g1, …, gm to be registered and they present Q such properties at points tiq, q = 1, …, Q and i = 1, …, m. In fact, the beginning at ti0 and the end at ti,Q+1 of the ith curve are also considered, so that Q + 2 properties are used in computing the warping function h(t). This function h is such that the registration of the curve gi(t) is given by

The goal is to make a time transformation such that the Q + 2 properties are aligned in time. In the R package, the user provides the placement of the Q properties for each curve, which makes the process highly subject to error. For example, certain curves do not present one or more of the critical points, or there is ambiguity about where to place them. When there are too many curves and/or too many properties, the marking becomes tedious. However, Ramsay (2003) showed that automatic methods for mark identification can lead to serious mistakes.

The continuous registration method tries to solve these problems. It maximizes a measure of similarity among the curves,

where [a, b] is the observation interval. This measure takes into account the whole curve and not only the critical points, and it works well if h has good properties such as smoothness. However, it fails if gi and g0 also differ in range. This routine requires the functions gi and g0 to be given in functional form, which can be obtained using Fourier series or B-splines, for example.

In Figure 5.7 we present an example where we simulated m = 3 curves, adding low variance noise and shifting each curve by a uniform random variable on the interval [0, 2]. In this case, since the functions are periodic, we used a Fourier expansion with 6 terms. Figure 5.8 presents the registered curves. Observe that there is a noticeable difference between the structural mean and the true curve, caused by the failure of the registration of two of the curves.



Consider another example where the curves have range and phase variation. To obtain this effect, we fixed the horizontal axis as a reference and defined as ''bumps'' the pieces of the true curve between two zeroes. Then, for each curve i = 1, …, m and each bump q = 1, …, Q (Q = number of bumps), we generated iid random factors from a uniform distribution on the interval (0.5, 1.5). Figure 5.9 presents this transformation for 3 curves, with low variance noise added, for the raw data and for the functional data obtained using the Fourier expansion. Figure 5.10 presents the result of the registration done by the R routine registerfd(). Since the registration fails, the structural mean differs considerably from the true curve.



To overcome this problem, we propose a modification of the continuous registration procedure. Based on (5.2), we try to minimize the cumulative difference between gi and the reference curve g0; however, since each curve may have a different range, we normalize the curves using

From this point on, we can use an idea similar to that of Ramsay (2003), who suggests substituting each curve by its nth derivative, so that our goal is to minimize

for each curve i, i = 1, …, m. In this work we decided to use n = 0, that is, the curves themselves. The procedure is described by Algorithm 5.1. We used the sample points as a grid on which to search for the optimal shift.

Algorithm 5.1

1. Use Algorithm 3.1 to obtain the functional representation of each curve.

2. Normalize each curve using Equation (5.3).

3. Obtain the optimal shift for each of the normalized curves.

4. Shift each non-normalized curve gi by the optimal shift found in step 3.
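A possible R sketch of Algorithm 5.1 is given below. It is an illustration under stated assumptions: the normalization of (5.3) is taken to be a range normalization, the optimal shift is found by a grid search over the sample points (with n = 0, i.e., the curves themselves), and the function name is hypothetical.

```r
# Hypothetical shift-registration sketch in the spirit of Algorithm 5.1.
register_by_shift <- function(curves, x, ref = 1) {
  # curves: m x n matrix of functional values on the common grid x; ref: reference curve
  normalize <- function(g) (g - min(g)) / (max(g) - min(g))  # assumed form of (5.3)
  g0 <- normalize(curves[ref, ])

  shift_grid <- x - median(x)              # candidate shifts taken from the sample grid
  registered <- curves
  for (i in seq_len(nrow(curves))) {
    gi <- normalize(curves[i, ])
    # cumulative absolute difference between the shifted normalized curve and the
    # reference, evaluated by linear interpolation on the common grid
    crit <- sapply(shift_grid, function(d) {
      gi_shift <- approx(x, gi, xout = x + d, rule = 2)$y
      mean(abs(gi_shift - g0))
    })
    d_best <- shift_grid[which.min(crit)]
    # step 4: shift the non-normalized curve, g_i(t + d_best)
    registered[i, ] <- approx(x, curves[i, ], xout = x + d_best, rule = 2)$y
  }
  registered
}
```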

Figure 5.11 presents a sample of m = 7 curves having phase and range variation plus low variance noise. Figure 5.12 shows the registered curves after the application of Algorithm 5.1. All estimates using (3.6a), (3.7a) and (3.8a) were very similar and presented small MSE. Afterwards we amplified the noise by taking a variance 25 times larger; see Figure 5.13. Figure 5.14 shows the registered curves after the application of Algorithm 5.1. It seems that a good registration of the curves was achieved.




Acknowledgments. We would like to thank the referee for the comments and suggestions that made this work better and clearer. Also, we would like to thank Douglas Bates for making available the source code of the routine splineDesign() and Jim Ramsay for his ready response concerning the registration routines. This work was partially funded by CNPq 3000644/94-9, 475763/03-3 and 301054/93-2, FAPESP 01/00258-0 and 02/01554-5.

Received: 26/V/04. Accepted: 08/IX/04

#607/04.

  • [1] G. Celeux and J. Diebolt, A stochastic approximation type EM algorithm for the mixture problem, Stochastics Rep., 41 (1-2) (1992), 119-134.
  • [2] G. Celeux, D. Chauveau and J. Diebolt, Stochastic versions of the EM algorithm: an experimental study in the mixture case, J. of Statist. Comput. Simul., 55 (1996), 287-314.
  • [3] D. Chauveau, A stochastic EM algorithm for mixtures with censored data, J. Statist. Plann. Inference, 46 (1) (1995), 1-25.
  • [4] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B, 39 (1) (1977), 1-38. With discussion.
  • [5] R. Dias and D. Gamerman, A Bayesian approach to hybrid splines non-parametric regression, Journal of Statistical Computation and Simulation, 72 (4) (2002), 285-297.
  • [6] J. Diebolt and G. Celeux, Asymptotic properties of a stochastic EM algorithm for estimating mixing proportions, Comm. Statist. Stochastic Models, 9 (4) (1993), 599-613.
  • [7] J.O. Ramsay, Matlab, R and S-PLUS functions for Functional Data Analysis, ftp://ego.psych.mcgill.ca/pub/ramsay/FDAfuns, (2003).
  • [8] J.O. Ramsay and X. Li, Curve registration, J.R. Stat. Soc. Ser. B Stat. Methodol., 60 (2) (1998), 351-363.
  • [9] J.O. Ramsay and B.W. Silverman, Applied Functional Data Analysis: Methods and Case Studies, Springer Series in Statistics, Springer-Verlag, New York, (2002).
  • [10] J.A. Rice, Personal communication, (2000).
