
On the use of the Spectral Projected Gradient method for Support Vector Machines*

Debora Cores†; René Escalante; María González-Lima; Oswaldo Jimenez

Universidad Simón Bolívar, Departamento de Cómputo Científico y Estadística, Apdo. 89000, Caracas 1080-A, Venezuela. E-mails: cores@cesma.usb.ve / rene@cesma.usb.ve / mgl@cesma.usb.ve / oswjimenez@usb.ve

ABSTRACT

In this work we study how to solve the SVM optimization problem by using the Spectral Projected Gradient (SPG) method with three different strategies for computing the projection onto the constrained set. One of the strategies is based on Dykstra's alternating projection algorithm, since there is no closed-form expression for the projection onto the whole constrained set, while the projection onto each restriction is easy to compute exactly. We present another strategy based on the Karush-Kuhn-Tucker optimality conditions, which we call the Projected-KKT algorithm. We compare these strategies with a third one proposed by Dai and Fletcher. The three schemes have low computational cost and their use within the SPG algorithm leads to a solution of the SVM problem. We study the computational performance of the three strategies when solving randomly generated as well as real life SVM problems. The numerical results show that Projected-KKT is in general competitive with the Dai and Fletcher algorithm, and that it is more efficient for some specific problems. They both outperform Dykstra's algorithm in all the tests.

Mathematical subject classification: Primary: 46N10; Secondary: 47N40.

Key words: support vector machines, quadratic programming, Spectral Projected Gradient method.

1 Introduction

Support Vector Machines (SVM) [7, 12, 38] have been used in the past years for solving many real life applications. Some of them are pattern recognition and classification problems, as for example isolated handwritten digit recognition [9, 10, 12, 33, 34], object recognition [6], speaker identification [32], face detection in images [27, 28], and text categorization [24], as well as some nonlinear least squares problems such as the inverse density estimation problem [40]. The SVM technique is based on finding the minimizer of a singly linearly constrained quadratic problem subject to lower and upper bounds. Because of the nature of the applications, the quadratic problem to solve is large, with a dense (and usually ill-conditioned) Hessian. Therefore, efficient low cost optimization algorithms are required.

Different techniques and methods have been used (e.g. [14, 19, 27, 35]), some of them of projected gradient type. However, dealing with dense large scale problems is not easy. Moreover, the difficulty when using projected gradient type methods is that the exact projection onto the whole constrained set is not explicitly available.

In this paper we consider the Spectral Projected Gradient (SPG) method [5] to solve the SVM optimization problem. The SPG method has low computational and storage cost since it only requires first order information and few floating point operations per iteration, and it needs few function evaluations since it does not require an exhaustive line search strategy as other projected gradient type methods do. It utilizes a nonmonotone line search globalization scheme that guarantees convergence without requiring the objective function to decrease at every iteration. These properties of the SPG method make it very attractive for solving large scale problems, as shown in [5].

In order to solve the projection problem arising at each iteration of the SPG method, we present a new algorithm based on the solution of the Karush-Kuhn-Tucker (KKT) conditions, the so-called Projected-KKT algorithm. The idea is to solve a reduced version of the linear system associated with the KKT conditions by using a partition into three sets of indices. If the partition is optimal, the solution of the projection problem is found. When the partition is not optimal, the sets are updated trying to satisfy the strict complementarity conditions and a new linear system is solved. Exact formulas can be derived for the solution of the linear system, so each inner iteration is very cheap. The algorithm iterates until a solution is found. This idea is of the same spirit as that of the algorithm presented by Júdice and Pires in [25], but the Projected-KKT algorithm can handle lower and upper bounds, in contrast with the Júdice and Pires scheme. Besides, comparing both algorithms in the case of two index sets, we observe that they follow completely different strategies for moving the indices between sets. Therefore, both algorithms generate different sequences of iterates.

Because there is no closed-form expression for the projection onto the whole constrained set, while the projection onto each restriction is easy to compute exactly, we also consider, following Birgin et al. [3], a version of Dykstra's alternating projection method for solving the projection problem.

Dai and Fletcher in [14] propose an algorithm based on secant approximations to solve quadratic programming problems with a single linear constraint and bounds in the case that the Hessian matrix of the quadratic function is diagonal and positive definite. Therefore, this algorithm can also be used to solve the projection problem, as illustrated in [14].

In this work we propose to solve the SVM problem with the Spectral Projected Gradient (SPG) method but computing the projection over the whole constrained set by three different projection techniques: the proposed Projected-KKT algorithm, a version of the Dykstra's algorithm, and the recently developed algorithm based on secant approximations due to Dai and Fletcher. The nature of these methods is different but we consider that their comparison is fair since they are all viable, and potentially low cost, approaches for solving the quadratic problem.

The SPG method assures convergence to a stationary point of the constrained quadratic problem from any initial guess (see [3, 5]). Also, the SPG algorithm utilizes a nonmonotone line search strategy that does not force the objective function to decrease at each iteration, which reduces the number of function evaluations compared with a monotone line search.

The SPG method has also been used in the SVM context in [35, 41]. In [35] the inner quadratic subproblems (arising when using GVP or SPG methods) are solved via the bisection-like algorithm proposed by Pardalos and Kovoor [29]. In [41] the emphasis is on decomposition techniques and the development of parallel software. There, the secant-based algorithm by Dai and Fletcher is used for solving the large size projection subproblems. Our work studies different alternatives for solving the projection subproblem and their impact on the SPG method by performing extensive numerical experiments with SVM-type problems. We show that the Projected-KKT algorithm we propose is competitive with the secant-based algorithm introduced in [14].

The outline of the paper is the following. In the next two sections we give a brief introduction to SVM, we describe the optimization problem to be solved, and we introduce the SPG method when applied to the SVM optimization problem. The following section is devoted to presenting the projection strategies. The last section presents numerical results with the three strategies for randomly generated problems and for several interesting real life problems such as face image detection, object recognition, text categorization, and isolated handwritten digit recognition, among others.

2 The Support Vector Machine Problem

Support Vector Machines (SVM) are a classification technique developed by Vapnik and his group at AT&T Bell Laboratories [12, 38]. In this section, we give a brief introduction to this technique and describe the quadratic optimization problem to be solved when using SVM. The notation adopted in this section is, with slight modifications, that of [27].

Let us consider the given data set $\{(x_i, y_i),\ i = 1, \dots, m\}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{Z}$. In the SVM context, the optimization problem corresponding to the case in which the given data is linearly separable is the following

$$\min_{w,\,b} \ \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, m, \tag{1}$$

where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

In order to motivate this optimization problem let us consider, for simplicity, only the cases where $y_i = 1$ or $y_i = -1$, for all $i$. In fact, we will consider only these cases in the whole paper. Therefore, we have two index sets, $I_1 = \{i : y_i = 1\}$ and $I_{-1} = \{i : y_i = -1\}$, corresponding to two classes, say Class 1 and Class 2, respectively, according to which class each $x_i$ belongs to. Notice that $I_1 \cup I_{-1} = \{1, \dots, m\}$.

To solve problem (1) means to find the hyperplane with normal vector w that maximizes the separation margin between Class 1 and Class 2. Once a solution (w*, b*) of (1) is found, given any point x, the value of the function f (x)=sign(w*Tx + b*) indicates to which class this point belongs. Observe that the number of constraints of problem (1) is equal to the number of data samples which, for real life problems, is large.

The vectors from the given data set for which at least one of the constraints in the feasible set of (1) is satisfied as an equality are called support vectors. It is easy to see that these support vectors necessarily exist. Indeed, if $(w^*, b^*)$ is a solution of (1), then there exists a pair of indices, one in $I_1$ and one in $I_{-1}$, for which the corresponding constraint holds with equality; otherwise the separation margin could be enlarged, contradicting optimality.

The dual problem corresponding to problem (1) is a quadratic program of the form

$$\min_{\lambda} \ \frac{1}{2}\lambda^T D \lambda - e^T\lambda \quad \text{subject to} \quad y^T\lambda = 0, \quad \lambda \ge 0, \tag{2}$$

where $\lambda \in \mathbb{R}^m$ are the Lagrange multipliers, $e$ is the $m$-vector of all ones, and $D \in \mathbb{R}^{m \times m}$ is the symmetric positive semi-definite matrix defined as $D_{ij} = y_i y_j\, x_i^T x_j$. Let us call $\lambda^*$ a solution of (2). Then we can recover the solution $(w^*, b^*)$ of (1) by using duality theory. That is,

$$w^* = \sum_{i=1}^{m} \lambda_i^* y_i x_i, \tag{3}$$

where the support vectors $x_i$ are the ones corresponding to $\lambda_i^* > 0$, and $b^*$ solves the equation

$$y_i\,(w^{*T} x_i + b^*) = 1 \tag{4}$$

for any support vector $x_i$.

Then, the decision function, that is, the function that says if a given point $x$ is in Class 1 or 2, can be written as

$$f(x) = \sum_{i=1}^{m} \lambda_i^* y_i\, x_i^T x + b^*,$$

with $\lambda^*$ the optimal multiplier vector. This is saying that, for any $x \in \mathbb{R}^n$, if $f(x) > 0$ then $x$ is in Class 1, and if $f(x) < 0$ then $x$ is in Class 2.

Hence, in order to solve (1) we can just solve the dual problem (2). One advantage of doing this is that the constraints in (2) are simpler than in the original problem. Another important advantage is that, when working with the dual problem, the dimension of the feature space for the classification can be increased without increasing the dimension of the optimization problem to solve. This last comment will become clearer in the following.

If the data set is linearly nonseparable ([13]), that is, if there does not exist a solution of problem (1), then two variants are considered. On one hand, (1) is changed to

$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^{\,h} \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m, \tag{5}$$

where $h$ and $C$ are positive constants to be chosen and $\xi = (\xi_1, \dots, \xi_m)$ is a vector of variables that measure the amount of violation of the constraints.

Additionally, in the nonseparable case, the nonlinear decision surface is computed by mapping the input variable $x$ into a higher dimensional "feature space" and by working with linear classification in that space. In other words, for $x \in \mathbb{R}^n$ we define a mapping into a "feature vector":

$$\Phi(x) = \big(a_1\phi_1(x),\, a_2\phi_2(x),\, \dots\big),$$

where the $a_i \in \mathbb{R}$ and the $\phi_i$ are some real functions. So, in the nonseparable case, the primal problem is obtained by substituting the variable $x$ with the new "feature vector" $\Phi(x)$ in (5).

The dual problem corresponding to (5) (for $h = 1$) is given by

$$\min_{\lambda} \ \frac{1}{2}\lambda^T D \lambda - e^T\lambda \quad \text{subject to} \quad y^T\lambda = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, \dots, m, \tag{6}$$

where $D \in \mathbb{R}^{m \times m}$ is a symmetric positive semi-definite matrix with positive diagonal, defined as $D_{ij} = y_i y_j\, \Phi(x_i)^T \Phi(x_j)$.

Observe that the dimension of problem (6) is the same as that of problem (2) and, therefore, the computational effort for solving it is the same.

In practice, an explicit description of $\Phi$ is not needed. This is avoided by using the concept of kernel function $K$. Let $K$ be defined as $K(x, y) = \Phi(x)^T\Phi(y)$. Therefore, in problem (6) the matrix $D$ is given by $D_{ij} = y_i y_j K(x_i, x_j)$. Using the kernel definition, the decision function can be rewritten as

$$f(x) = \sum_{i=1}^{m} \lambda_i^* y_i\, K(x_i, x) + b^*,$$

which is now a nonlinear function given by a linear superposition of kernel functions (one for each support vector). Therefore, in theory, what is really needed is a function that preserves, in higher dimensional spaces, the properties of the inner product; these are the kernel functions. We refer the interested reader to [9]. In practice, different kernel functions are known and they are used for SVM problems. Some work has even been performed with non-positive semi-definite kernels such as the sigmoid (see [23]). Experimental knowledge and data distribution may suggest the kernel function to use, but finding the right kernel for each application is a difficult task that goes beyond our work. In our numerical experimentation we use the kernels suggested in the literature.
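As an illustration of how the matrix D and the decision function are assembled from a kernel, the following minimal sketch (in Python/NumPy, with function names of our own choosing) builds D for the polynomial kernel used later in the experiments; it is only a sketch, not the implementation used in this work.

```python
import numpy as np

def polynomial_kernel(X, Z, d=3):
    # K(x, z) = (1 + x^T z)^d, evaluated for all pairs of rows of X and Z
    return (1.0 + X @ Z.T) ** d

def build_D(X, y, kernel):
    # D_ij = y_i * y_j * K(x_i, x_j), the Hessian of the dual problem (6)
    return np.outer(y, y) * kernel(X, X)

def decision_function(x, X_sv, y_sv, lam_sv, b, kernel):
    # f(x) = sum_i lam_i * y_i * K(x_i, x) + b, summed over the support vectors
    return (lam_sv * y_sv) @ kernel(X_sv, x.reshape(1, -1)).ravel() + b
```

Here X is the m x n matrix of samples and y the vector of labels; only the rows of X associated with positive multipliers (the support vectors) enter the decision function.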

In this work we are interested in solving the convex quadratic problem (6). Observe that the Hessian of the objective function has dimension m x m, where m is the number of data samples; therefore it is usually large for real applications. Besides, the matrix is dense. Therefore, low cost and storage methods are preferable for solving it. We propose to solve it with the SPG method, which is described in the next section.

Finally, observe that if the data is linearly separable, rank(D) = n whenever n < m (which is usually the case in practice). Therefore, in these instances the matrix is singular. In the case of using a kernel, the rank of the matrix D depends on the dimension of the feature space given by the kernel: the larger this dimension, the larger the rank of D. We will show that our proposal, Projected-KKT, is very competitive when used within SPG in the case that the matrix D is singular or nearly singular.

3 Application of the Spectral Projected Gradient Method to the SVM Problem

Let $\Omega \subseteq \mathbb{R}^m$ be a closed and convex set. The Spectral Projected Gradient (SPG) method [5], applied to a constrained optimization problem of the form

$$\min_{\lambda \in \Omega} f(\lambda), \tag{7}$$

is a recently developed iterative technique, based on the Spectral Gradient (SG) method, that allows minimizing a nonlinear function over a convex set. The SG method solves unconstrained optimization problems by using as a search direction the negative gradient with a particular choice of the steplength named the spectral steplength (for more details see [30]). This direction is called the spectral gradient step. The SG method has shown to be an efficient technique for solving large scale problems, as shown in [30], where the authors compare the SG method with the popular extensions of the conjugate gradient methods PR+ and CONMIN developed by Gilbert and Nocedal [20] and Shanno and Phua [36], respectively. They conclude that the SG method outperforms CONMIN and PR+ in number of gradient evaluations and CPU time for most of the problems tested.

The SPG method finds a solution of problem (7) by using the projection of the spectral gradient step onto the convex set $\Omega$, as any projected gradient method does. That is,

$$\lambda^{k+1} = \lambda^k + \rho_k\, d^k, \tag{8}$$

where the superscript $k$ corresponds to the iteration, and

$$d^k = P_\Omega\big(\lambda^k - \alpha_k \nabla f(\lambda^k)\big) - \lambda^k,$$

with $P_\Omega$ denoting the projection operator onto the convex set $\Omega$. The parameter $\alpha_k$ is the spectral steplength at iteration $k$, and it is given by

$$\alpha_k = \frac{(s^{k-1})^T s^{k-1}}{(s^{k-1})^T w^{k-1}}, \tag{9}$$

where

$$s^{k-1} = \lambda^k - \lambda^{k-1} \quad \text{and} \quad w^{k-1} = \nabla f(\lambda^k) - \nabla f(\lambda^{k-1}).$$

The steplength $\rho_k$ is chosen at each iteration such that convergence is guaranteed from any initial guess by satisfying, for some integer $M > 0$, the following nonmonotone condition

$$f(\lambda^k + \rho_k d^k) \ \le \ \max_{0 \le j \le \min\{k,\, M-1\}} f(\lambda^{k-j}) \ + \ \gamma\, \rho_k\, \nabla f(\lambda^k)^T d^k, \tag{10}$$

with $\gamma \in (0, 1)$ a chosen parameter.

The SPG method does not guarantee descent of the objective function at every iteration, in contrast with the classic Projected Gradient method (see [26]), because of the spectral steplength (9) and the nonmonotone condition (10). In [5], the authors show that for many examples and applications this nonmonotone behavior of the objective function makes the method very competitive and sometimes preferable to other optimization techniques. Also, in the SPG method, the spectral choice of the steplength is the essential feature that adds efficiency to the projected gradient methodology. The SPG method is a low cost and storage technique since few floating point operations and only first order information are required. Moreover, the nonmonotone condition (10) allows the objective function to increase at some iterations, which implies fewer function evaluations than the Armijo type condition (see [15]) used frequently in the classical projected gradient method and other gradient type methods.

For solving the SVM problem (6), we are interested in projecting onto the set of restrictions $\Omega = \Omega_1 \cap \Omega_2$, where

$$\Omega_1 = \{\lambda \in \mathbb{R}^m : 0 \le \lambda_i \le C,\ i = 1, \dots, m\} \quad \text{and} \quad \Omega_2 = \{\lambda \in \mathbb{R}^m : y^T\lambda = 0\}, \tag{11}$$

where $y \in \mathbb{R}^m$ satisfies $|y_i| = 1$, for all $i$, and corresponds to problem (6). It is clear that $\Omega$ is a convex set.

The projection of any $u \in \mathbb{R}^m$ onto the set $\Omega_1$ (denoted by $P_{\Omega_1}(u)$) does not require additional floating point operations, and it is given, for $i = 1, \dots, m$, by

$$\big(P_{\Omega_1}(u)\big)_i = \min\{C,\ \max\{0,\ u_i\}\}, \tag{12}$$

and the projection onto the set $\Omega_2$ (denoted by $P_{\Omega_2}(u)$) only requires 2 inner products and it is given by

$$P_{\Omega_2}(u) = u - \frac{y^T u}{y^T y}\, y. \tag{13}$$

It is important to stress that there is no explicit formula for the projection of an arbitrary vector onto the whole set $\Omega$. In this work, we propose to use three different strategies for projecting onto the set $\Omega$. One strategy uses Dykstra's algorithm given in [8] and [16], the second one is based on the Karush-Kuhn-Tucker optimality conditions together with a combinatorial approach, in the same spirit as the Júdice and Pires algorithm [25], and the third one uses the Dai and Fletcher secant scheme [14]. We call the second approach the Projected-KKT algorithm, and it is the one developed in this work.
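To fix ideas, the following minimal sketch (Python/NumPy; the function name, default parameter values, and the simple halving backtracking are our own simplifications, not the exact implementation of [5]) shows how the SPG iteration (8)-(10) is applied to the SVM dual (6), with the projection onto $\Omega$ supplied as a user-chosen routine so that any of the three strategies of the next section can be plugged in.

```python
import numpy as np

def spg_svm(D, C, project, lam0, max_iter=1000, M=10, gamma=1e-4,
            alpha_min=1e-10, alpha_max=1e10, tol=1e-8):
    # Sketch of SPG for: min 0.5*lam'D*lam - e'lam  subject to lam in Omega.
    # `project(v)` must return P_Omega(v); see Section 4 for three choices.
    f = lambda lam: 0.5 * lam @ (D @ lam) - lam.sum()
    grad = lambda lam: D @ lam - 1.0
    lam = project(lam0)
    g = grad(lam)
    alpha = 1.0
    f_hist = [f(lam)]
    for _ in range(max_iter):
        d = project(lam - alpha * g) - lam            # projected spectral gradient step
        if np.linalg.norm(d, np.inf) <= tol:          # approximate stationarity
            break
        f_ref = max(f_hist[-M:])                      # nonmonotone reference value, cf. (10)
        rho, gtd = 1.0, g @ d
        while f(lam + rho * d) > f_ref + gamma * rho * gtd and rho > 1e-16:
            rho *= 0.5                                # plain halving instead of interpolation
        lam_new = lam + rho * d
        g_new = grad(lam_new)
        s, w = lam_new - lam, g_new - g
        sw = s @ w
        # spectral steplength (9), safeguarded in [alpha_min, alpha_max]
        alpha = np.clip(s @ s / sw, alpha_min, alpha_max) if sw > 0 else alpha_max
        lam, g = lam_new, g_new
        f_hist.append(f(lam))
    return lam
```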

In the next section, we describe Dykstra's and the Projected-KKT methods, and comment on the Dai and Fletcher strategy.

4 Projection Strategies

Given $u \in \mathbb{R}^m$, we are interested in solving the projection problem

$$\min_{\lambda} \ \frac{1}{2}\|\lambda - u\|_2^2 \quad \text{subject to} \quad y^T\lambda = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, \dots, m. \tag{14}$$

Problem (14) is a particular case of a quadratic programming problem, where the Hessian of the objective function is the identity matrix.

In the following we describe the three methods used for solving this problem.

4.1 Dykstra's Algorithm

Problem (14) is the minimization of a strictly convex quadratic function over the intersection of a finite collection of closed and convex sets in the vector space $\mathbb{R}^m$. From this point of view, problem (14) can be solved by means of the alternating projection method, first proposed by Dykstra [16] for closed and convex cones, and later extended by Boyle and Dykstra [8] to closed and convex sets. Here we consider a particular case, due to Escalante and Raydan ([17, 18]), of a more general application of Dykstra's algorithm for solving the least-squares matrix problem.

The alternating projection method dates back to von Neumann [39] who treated the problem of finding the projection of a given point in a Hilbert space onto the intersection of two closed subspaces. Later, Cheney and Goldstein [11] extended the analysis of von Neumann's alternating projection scheme to the case of two closed and convex sets. In particular, they established convergence under mild assumptions. However, the limit point is not necessarily the closest one in the intersection set. Therefore, the alternating projection method, proposed by von Neumann, is not useful for problem (14).

Dykstra [16] found a clever modification of von Neumann's scheme. It guarantees convergence to the closest point in the intersection of closed and convex sets that are not necessarily closed subspaces. Dykstra's algorithm can also be obtained via duality (see [22]).

Next we present a version of Dykstra's algorithm ([17] and [18]) applied to problem (14).

Algorithm 1 (Dykstra's algorithm):

Given $u \in \mathbb{R}^m$, set $\lambda^0 = u$ and $I^0 = 0$.

For $k = 0, 1, 2, \dots$,

$$\lambda^{k+\frac{1}{2}} = P_{\Omega_2}(\lambda^k), \qquad \lambda^{k+1} = P_{\Omega_1}\big(\lambda^{k+\frac{1}{2}} + I^k\big), \qquad I^{k+1} = \lambda^{k+\frac{1}{2}} + I^k - \lambda^{k+1}.$$

Here $I^k$ plays the role of the increments introduced by Dykstra [16]. There is no need to keep a vector of increments associated with $\Omega_2$, for all $k$, since in this case $\Omega_2$ is a hyperplane (i.e. a translated subspace).
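A minimal sketch of Algorithm 1 in Python/NumPy is given below (the function name and the stopping rule are ours; as discussed next and in [4], a simple change-in-iterate test is not a robust stopping criterion, so it is used here only for illustration).

```python
import numpy as np

def dykstra_projection(u, y, C, max_iter=10000, tol=1e-12):
    # Alternating projections onto Omega_2 (hyperplane) and Omega_1 (box),
    # keeping only the Dykstra increment associated with the box constraint.
    lam = u.copy()
    inc = np.zeros_like(u)
    yy = y @ y
    for _ in range(max_iter):
        half = lam - (y @ lam) / yy * y                   # projection onto the hyperplane, (13)
        lam_new = np.clip(half + inc, 0.0, C)             # projection onto the box, (12)
        inc = half + inc - lam_new                        # update the increment
        if np.linalg.norm(lam_new - lam, np.inf) <= tol:  # naive stopping rule
            return lam_new
        lam = lam_new
    return lam
```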

It is important to note that the numerical performance of Algorithm 1 may be influenced by the angle between the sets $\Omega_1$ and $\Omega_2$. The next proposition addresses this topic.

Proposition 4.1. Let $\theta$ denote the angle formed by the hyperplane defined by the equation $y^T\lambda = 0$ and the box constraints. Then,

$$\cos\theta = \frac{1}{\sqrt{m}}.$$

Proof. The normal vectors to the box constraints are the coordinate vectors $e_j$ with $j = 1, \dots, m$. In the SVM application, $|y_j| = 1$ for all $j$. Therefore,

$$\cos\theta = \frac{|y^T e_j|}{\|y\|_2\,\|e_j\|_2} = \frac{1}{\sqrt{m}} \quad \text{for all } j. \qquad \square$$

This result shows that for large values of m, the angle between the hyperplane and the box constraints is close to π/2, which may imply that Algorithm 1 requires many iterations for convergence. However, this angle is not the only element that influences the behavior of Algorithm 1. The choice of stopping criteria in Algorithm 1 is also very important. Birgin and Raydan in [4] present a simple example where many of the stopping criteria used for alternating projection methods fail, and they propose a robust one. The example presented in [4] consists of two convex sets: one hyperplane and a set containing lower and upper bounds on the variables, as in problem (14). Therefore, projecting onto simple convex and closed sets, as in the optimization problem (14), could take many iterations to converge.

4.2 Projected-KKT Algorithm

The Lagrangian function associated with problem (14) is given by

$$\mathcal{L}(\lambda, \mu, \omega, \upsilon) = \frac{1}{2}\|\lambda - u\|_2^2 + \mu\, y^T\lambda - \omega^T\lambda + \upsilon^T(\lambda - Ce), \tag{15}$$

where $\mu$, $\omega$ and $\upsilon$ are the Lagrange multipliers associated with the equality constraint, the lower bound constraints and the upper bound constraints, respectively. Therefore, the Karush-Kuhn-Tucker (KKT) conditions of problem (14) are

$$\lambda_i - u_i + \mu y_i - \omega_i + \upsilon_i = 0, \tag{16}$$

$$y^T\lambda = 0, \tag{17}$$

$$\omega_i \lambda_i = 0, \tag{18}$$

$$\upsilon_i (C - \lambda_i) = 0, \tag{19}$$

$$\omega_i \ge 0, \quad \upsilon_i \ge 0, \tag{20}$$

$$0 \le \lambda_i \le C, \tag{21}$$

for all $i = 1, \dots, m$. Given $\lambda$, let $(L, U, I)$ be a partition of the index set $N = \{1, \dots, m\}$ such that

$$L = \{i \in N : \lambda_i = 0\}, \quad U = \{i \in N : \lambda_i = C\}, \quad I = \{i \in N : 0 < \lambda_i < C\}. \tag{22}$$

Solving (14) means to find an optimal partition $(L, U, I)$ such that equations (16) to (21) are satisfied.

Let us denote the dimensions of the sets L and U by l and s respectively, so the dimension of the set I is m - l - s.

From now on we use $z_B$ to denote the vector whose components are the entries $z_i$ with $i \in B$. So, from the definitions of the sets $L$, $U$ and $I$, and equations (18) and (19), we have that

$$\lambda_L = 0, \qquad \lambda_U = C e_U, \qquad \omega_U = \omega_I = 0, \qquad \upsilon_L = \upsilon_I = 0, \tag{23}$$

where $e_U$ denotes the vector of all ones of dimension $s$.

Using this notation, we can reduce the KKT system to a system of equations of order (m + 1) x (m + 1), instead of order (3m + 1) x (3m + 1), whose unknowns are $\mu$, $\omega_L$, $\lambda_I$ and $\upsilon_U$. The reduced KKT equations can be written as

$$\lambda_I + \mu\, y_I = u_I, \tag{24}$$

$$\mu\, y_L - \omega_L = u_L, \tag{25}$$

$$\mu\, y_U + \upsilon_U = u_U - C e_U, \tag{26}$$

$$y_I^T \lambda_I + C\, e_U^T y_U = 0. \tag{27}$$

The Projected-KKT method is an iterative method that tries to find an optimal partition by starting with $\lambda^0$ and solving equations (24) to (27) by computing

$$\mu = \frac{y_I^T u_I + C\, e_U^T y_U}{m - l - s}, \tag{28}$$

$$\lambda_I = u_I - \mu\, y_I, \tag{29}$$

$$\omega_L = \mu\, y_L - u_L, \tag{30}$$

$$\upsilon_U = u_U - C e_U - \mu\, y_U. \tag{31}$$

The partition (L, U, I) is optimal if the following conditions are satisfied

$$\lambda_i \ge 0, \quad i \in I, \tag{32}$$

$$\lambda_i \le C, \quad i \in I, \tag{33}$$

$$\upsilon_i \ge 0, \quad i \in U, \tag{34}$$

$$\omega_i \ge 0, \quad i \in L. \tag{35}$$

Thus, the idea behind the proposed Projected-KKT algorithm is to rearrange the indices in the sets L, U and I that do not satisfy equations (32), (33), (34) and (35). The way these indices are eliminated from one set and added to another one is very simple, and it is explained in the following algorithm, where $X_i^k$ denotes the i-th component of the vector X at iteration k.

Algorithm 2 (Projected-KKT algorithm):

· Step 0: Given $u \in \mathbb{R}^m$, let $N = \{1, 2, \dots, m\}$ and $k = 0$.

· Step 1: $\lambda^k = P_{\Omega_2}(u)$.

· Step 2: If $0 \le \lambda_i^k \le C$ for all $i \in N$, stop: $\lambda^k$ is the solution of problem (14).

· Step 3: $\lambda^k = P_{\Omega_1}(\lambda^k)$.

· Step 4: Let $L_k = \{i \in N : \lambda_i^k = 0\}$, $U_k = \{i \in N : \lambda_i^k = C\}$ and $I_k = N - (L_k \cup U_k)$.

· Step 5: If $I_k = \emptyset$ then

-Step 5.1: If $y^T\lambda^k = 0$, stop: $\lambda^k$ is a vertex solution.

-Step 5.2: If $y^T\lambda^k \ne 0$, do:

* Step 5.2.1: $\lambda^{k+1} = P_{\Omega_2}(\lambda^k)$,

* Step 5.2.2: $k = k + 1$ and repeat Step 3 and Step 4.

· Step 6: Compute the solution of the reduced KKT system, $\mu$, $\lambda_{I_k}$, $\omega_{L_k}$ and $\upsilon_{U_k}$, using equations (28), (29), (30) and (31).

· Step 7: If conditions (32), (33), (34) and (35) are satisfied, then stop: λk is the solution of problem (14).

· Step 8: Let

$V_1 = \{i \in I_k : \text{(32) is not satisfied}\}$,

$V_2 = \{i \in I_k : \text{(33) is not satisfied}\}$,

$V_3 = \{i \in U_k : \text{(34) is not satisfied}\}$,

$V_4 = \{i \in L_k : \text{(35) is not satisfied}\}$.

· Step 9: $L_{k+1} = (L_k - V_4) \cup V_1$, $U_{k+1} = (U_k - V_3) \cup V_2$, $I_{k+1} = N - (L_{k+1} \cup U_{k+1})$.

· Step 10: If $I_{k+1} = \emptyset$ then

-Step 10.1: If $y^T\lambda^{k+1} = 0$, then stop: $\lambda^{k+1}$ is the solution of problem (14).

-Step 10.2: If $y^T\lambda^{k+1} \ne 0$, eliminate an index from $L_{k+1}$ or $U_{k+1}$ and add it to the set $I_{k+1}$.

· Step 11: $k = k + 1$ and go to Step 6.
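A compact sketch of Algorithm 2 in Python/NumPy follows; the function name, the tolerance in the vertex test, and the rule used in Step 10.2 for choosing which index to free are our own simplifications of the scheme above.

```python
import numpy as np

def project_kkt(u, y, C, max_iter=5000):
    # Project u onto {lam : y^T lam = 0, 0 <= lam <= C} by the partition idea of
    # Algorithm 2: split indices into L (lam_i = 0), U (lam_i = C) and I (free),
    # and solve the reduced KKT system (28)-(31) until (32)-(35) hold.
    lam = u - (y @ u) / (y @ y) * y                  # Step 1: project onto the hyperplane
    if np.all((lam >= 0) & (lam <= C)):
        return lam                                   # Step 2: already the solution
    lam = np.clip(lam, 0.0, C)                       # Step 3: project onto the box
    L, U = lam <= 0.0, lam >= C                      # Step 4: initial partition
    I = ~(L | U)
    for _ in range(max_iter):
        if not I.any():                              # Steps 5 / 10: empty free set
            vertex = np.where(U, C, 0.0)
            if abs(y @ vertex) < 1e-12:
                return vertex                        # vertex solution
            I[np.argmin(np.abs(u))] = True           # free one index (simplified choice)
            L &= ~I
            U &= ~I
        mu = (y[I] @ u[I] + C * y[U].sum()) / I.sum()   # (28)
        lam_I = u[I] - mu * y[I]                        # (29)
        omega_L = mu * y[L] - u[L]                      # (30)
        ups_U = u[U] - C - mu * y[U]                    # (31)
        if (lam_I >= 0).all() and (lam_I <= C).all() \
                and (omega_L >= 0).all() and (ups_U >= 0).all():
            lam = np.where(U, C, 0.0)                # optimal partition found
            lam[I] = lam_I
            return lam
        idx_I, idx_L, idx_U = map(np.flatnonzero, (I, L, U))
        V1, V2 = idx_I[lam_I < 0], idx_I[lam_I > C]  # Step 8
        V3, V4 = idx_U[ups_U < 0], idx_L[omega_L < 0]
        L[V4] = False; L[V1] = True                  # Step 9
        U[V3] = False; U[V2] = True
        I = ~(L | U)
    raise RuntimeError("Projected-KKT did not terminate (possible cycling)")
```

Here u and y are NumPy vectors of the same length, with the entries of y equal to ±1 as in problem (6).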

As we mentioned in the introduction, the proposed Projected-KKT algorithm can be viewed as a finite termination algorithm in the sense that three finite index sets are defined at each iteration. The indices move from one set to another depending on the solution of the KKT first order optimality conditions. This idea is of the same spirit as that of the active set method or the algorithm presented by Júdice and Pires in [25]. The Projected-KKT algorithm can handle lower and upper bounds, in contrast with the Júdice and Pires scheme. Comparing both algorithms in the case of two index sets, we observe that they follow completely different strategies for moving the indices between sets. Therefore, both algorithms generate different sequences of iterates. Also, the initial iterate in the Projected-KKT algorithm is found by making use of the type of problem being solved. This choice, which is relevant to the performance of the algorithm, is different from the choice of the initial index set in [25]. Moreover, the Projected-KKT algorithm exploits the fact that the Hessian of the objective function is the identity.

The algorithm is well defined since every iteration has a solution and, because there is a finite number of choices for the index sets, the algorithm terminates in a finite number of iterations unless cycling occurs. In practice, for diagonal positive definite Hessians, as in the projection problem, cycling never occurs. In fact, we have performed extensive experimentation using the algorithm for solving quadratic problems with an arbitrary positive semi-definite Hessian and we have observed that cycling is associated with rank deficiency of this matrix.

4.3 Dai and Fletcher method

Dai and Fletcher in [14] presented an optimization method for solving singly linearly constrained quadratic programming problems where the Hessian matrix is diagonal and positive definite, as in the projection problem (14). This optimization strategy is based on secant approximations. The authors show numerical results that illustrate the performance of a modification of the SPG method and their projection strategy for some randomly generated problems and real life SVM problems. Details on the Dai and Fletcher algorithm can be found in [14]. The Dai and Fletcher scheme will be used in this work as another projection strategy for solving problem (14), together with the SPG method presented in [5].
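The essential idea can be sketched as follows (Python/NumPy; this is only the plain secant iteration on the multiplier of the equality constraint, with names of our own choosing, and it omits the bracketing and safeguarding steps that make the actual algorithm of [14] robust). Writing the KKT conditions of (14) in terms of the multiplier $\mu$ of the equality constraint gives $\lambda(\mu) = \mathrm{mid}(0,\, u - \mu y,\, C)$ componentwise, so the projection reduces to finding a root of the monotone scalar function $r(\mu) = y^T\lambda(\mu)$.

```python
import numpy as np

def secant_projection(u, y, C, mu0=0.0, mu1=1.0, max_iter=100, tol=1e-12):
    # Simplified secant sketch for problem (14): find mu with r(mu) = y^T lam(mu) = 0,
    # where lam(mu) = clip(u - mu*y, 0, C); the method in [14] adds bracketing steps.
    lam = lambda mu: np.clip(u - mu * y, 0.0, C)
    r = lambda mu: y @ lam(mu)
    r0, r1 = r(mu0), r(mu1)
    for _ in range(max_iter):
        if abs(r1) <= tol or r1 == r0:
            break
        mu0, mu1, r0 = mu1, mu1 - r1 * (mu1 - mu0) / (r1 - r0), r1
        r1 = r(mu1)
    return lam(mu1)
```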

5 Numerical Experimentation

In this section we present the numerical results obtained by applying the SPG method for solving the SVM quadratic programming problem (6). For this quadratic problem the convex constrained set is denoted by $\Omega = \Omega_1 \cap \Omega_2$, where the sets $\Omega_1$ and $\Omega_2$ are given in equation (11). The projection onto the convex set $\Omega$ is computed by three different methods: the Dykstra algorithm presented in Section 4.1, the Projected-KKT algorithm presented in Section 4.2, and the recently developed method by Dai and Fletcher [14]. Our aim in this section is to compare the performance of the three different projection schemes when using the SPG method for solving random and real life SVM problems.

5.1 Test Problems

We consider two different kinds of test problems: the first class consists of synthetic data randomly generated and the second class is taken from real application databases. The kernels considered in this work are polynomials of degree d, given by $K(x, y) = (1 + x^T y)^d$, and a Gaussian function with parameter $\sigma$, given by $K(x, y) = \exp\!\big(-\|x - y\|_2^2 / (2\sigma^2)\big)$.

Randomly generated test problems: the first four databases of this group consist of 250 nonlinearly separable samples of dimension two, where each component of each sample was randomly generated in the interval [-0.5, 0.5] with a uniform distribution. The classification of each sample $x_i$ is given by

$$y_i = \begin{cases} 1 & \text{if } f_j(x_i) > 0, \\ -1 & \text{otherwise,} \end{cases}$$

where $j = 1, 2, 3, 4$. For these randomly generated problems, the mathematical expressions of the real separators and their level curves are known, so they can be drawn in $\mathbb{R}^3$ or in $\mathbb{R}^2$. From now on $x_i(k)$ denotes the k-th component of the vector $x_i$. Each value of j generates a database as follows (a small generation sketch is given after the list):

lineal_wk: corresponds to j = 1 and f1 (xi)=xi (1)+xi (2). No kernel is considered for this database.

sinoid_kp5: corresponds to j = 2 and f2(xi) = xi(2) - 0.25 sin(8xi(1)). A polynomial of degree five is considered as the kernel for this database.

polin3_kp3: corresponds to j = 3 and f3(xi) = xi(2) - 10(xi(1))^3. A polynomial of degree three is used as the kernel for this database.

norm2_kg: corresponds to j = 4 and f4(xi) = ||xi||_2^2 - (0.35)^2. A Gaussian kernel with σ = 0.3 is considered for this database.
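As a concrete illustration of the construction just described, the following sketch (Python/NumPy; the random number generator, seed, and helper name are our own choices) produces one of these synthetic databases.

```python
import numpy as np

def make_synthetic(separator, m=250, seed=0):
    # m two-dimensional samples, uniform in [-0.5, 0.5]^2, labeled by the sign
    # of the given separator function, as described above.
    rng = np.random.default_rng(seed)
    X = rng.uniform(-0.5, 0.5, size=(m, 2))
    y = np.where(separator(X) > 0, 1.0, -1.0)
    return X, y

# e.g. the sinoid separator f2(x) = x(2) - 0.25*sin(8*x(1))
X, y = make_synthetic(lambda X: X[:, 1] - 0.25 * np.sin(8.0 * X[:, 0]))
```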

The remaining databases of this group are:

Riply database: this database is taken from the Statistical Pattern Recognition Toolbox for Matlab and was proposed by Ripley in [31]. It consists of 250 nonlinearly separable samples of dimension two, with 125 samples in class 1 and 125 samples in class 2. We denote this database by riply_wk when no kernel is used, and by riply_kp5 when a polynomial of degree five is used as a kernel.

Madelon database: this database belongs to the NIPS 2003 Feature Selection Challenge [1]. It consists of 2000 samples of dimension 500, in which 1000 samples belong to class 1 and the other 1000 samples belong to class 2. We denote by madelom_wk this database without any kernel and by madelom_kg the same database when a Gaussian kernel with σ = 1 is used.

Real applications databases:

Sonar database: this data was used by Gorman and Sejnowski [21] in their study of the classification of sonar signals using a neural network. It consists of 208 samples of dimension 60 that are nonlinearly but polynomially separable, representing data from sonar signals bouncing off a metal cylinder and sonar signals bouncing off a roughly cylindrical rock. Here 111 samples belong to class 1 and 97 samples belong to class 2. We denote this database by sonar_kp3, since we use a polynomial of degree three as a kernel.

Images database: it contains 6977 non linearly separable samples representing human (2429 samples) and non human (4548 samples) faces taken from CBCL MIT [37]. Since the number of samples is large we generated four sets of samples: image500_kp2 containing 500 samples randomly selected from the original database; image1500_kp2 with 1500 samples randomly selected from the original database; image4500_kp2 consisting of 4500 samples randomly taken from the original database; and image6977_kp2 the whole database. Each sample set maintains the human to non human faces proportion of the complete database. For this database a polynomial of degree two is used as a kernel.

Arcene database: this database consists of 100 samples of dimension 10000, where 56 samples are in class 1 and 44 samples are in class 2. These samples come from mass spectrometry and contain information about cancerous and non-cancerous cells [1]. Since no kernel is used for this database, we denote it by arcene_wk.

Dexter database: this set of data belongs to text type classification, since it tries to determine which of the Reuters articles talk about "corporate acquisitions". It contains 300 samples of dimension 20000 with 150 samples in class 1 and 150 samples in class 2. We denote it by dexter_wk because no kernel is used.

Gisette database: the objective of this database is to distinguish the handwriting images of the digits "9" and "4". In this database there are 6000 samples of dimension 5000 with 3000 samples in class 1 and the other 3000 samples in class 2. We denote it by gisette_kp2 since a polynomial of degree two is used as a kernel.

The Arcene, Dexter and Gisette databases can be found in the NIPS 2003 Feature Selection Challenge [1].

Ada database: this database intends to determine persons with high income from census data. It is composed of 4147 samples of dimension 48, where 318 of those are in class 1 and the remaining ones are in class 2. We denote by ada_wk the database without kernel and by ada_kg the same database when a Gaussian kernel with σ = 0.5 is used.

Gina database: it is considered a handwritten digit recognition problem, since it tries to find out whether the handwritten digit in an image is odd or even. This database has 3153 samples of dimension 970, where 1603 samples are in class 1 and 1550 are in class 2. We denote it by gina_wk since no kernel is used.

Hiva database: it is related to data representing compounds that are active and non active against HIV/AIDS. It consists of 3845 samples of dimension 1617, where 3710 are in class 1 and 135 are in class 2. We denote it by hiva_wk.

Sylva database: samples in this database contain information about forest types. It has 13086 samples of dimension 216. We select a subset of 7500 samples from this database, such that 7047 samples are in class 1 and 453 are in class 2. We denote this subset of data by sylva_kp2 since a polynomial of degree two is used as a kernel.

The Ada, Gina, Hiva and Sylva databases are part of the WCCI Performance Prediction Challenge [2].

5.2 Numerical Results

The stopping criterion used in the SPG method is given by

$$\big\|P_\Omega\big(\lambda^k - \nabla f(\lambda^k)\big) - \lambda^k\big\|_p \le tol_p, \quad p = 2 \ \text{or} \ p = \infty, \tag{38}$$

where $tol_2 = 10^{-4}$ and $tol_\infty = 10^{-8}$. The other input parameters for the SPG method are: $\alpha_{\min} = 10^{-10}$, $\alpha_{\max} = 10^{10}$, $M = 10$, $\gamma = 10^{-10}$, $\sigma_1 = 0.1$ and $\sigma_2 = 0.9$. For more details about these input parameters see [5]. Additionally, the maximum numbers of function evaluations and iterations are 750000 and 500000, respectively.

In order to compare the numerical results attained by the SPG method with the three different projection strategies, we set: in the general Dykstra algorithm, the following stopping criteria,

and the maximum number of projection iterations per SPG iteration to 10000. For the Dai and Fletcher projection scheme, we used $|r(\lambda)| < 10^{-12}$ or $\Delta l < 10^{-14}$ as tolerances for the two criteria proposed by these authors in [14]. The proposed Projected-KKT is stopped when conditions (32), (33), (34) and (35) are satisfied or when the maximum number of inner iterations per SPG iteration, set to 5000, is reached.

The initial iterate for the SPG method is randomly generated and is the same for the three projection strategies used.

The algorithms were implemented in Fortran 90 using double precision. The experiments were run on a PC with an Intel Core 2 Duo processor at 2.13 GHz and 2 GB of RAM.

Table 1 summarizes the main features of the databases considered in this work. The variables m and n represent the number of samples in each database and their dimension, respectively; class 1 and class 2 in columns 4 and 5 correspond to the number of samples in each class. In column 6, denoted by kernel, wk means no kernel is used, kpd means a polynomial kernel of degree d is considered, and kg(σ) means that a Gaussian kernel with parameter σ is used. This table also includes the rank of the matrix D in problem (6) for each database, denoted by rankD. The last column, denoted by condD, shows the condition number of the matrix D for each database.

Table 2 shows the CPU time required by the SPG method with the three projection strategies. The columns denoted by Dykstra, DaiFlet and ProjKKT contain the CPU time for the SPG method with Dykstra's method, the Dai and Fletcher projection scheme and the proposed Projected-KKT algorithm, respectively. The numbers inside parentheses represent the percentage of CPU time consumed by each projection strategy within the SPG method. The best CPU time for each database and for each value of C is highlighted in Table 2. The asterisk mark (*) on the right side of a CPU time means that the maximum number of iterations established in the SPG method was reached before convergence.

Tables 3, 4 and 5 show the performance of the SPG method for different values of the parameter C when the projection onto the set Ω is computed by the Dykstra method, the Dai and Fletcher strategy and the proposed Projected-KKT, respectively. The third column in these tables is denoted by Flag and the values of this parameter can be 0 (meaning problem (6) was solved); 4 (meaning that the projection algorithm reached the maximum number of iterations allowed); 2 (the SPG method reached the maximum number of iterations allowed); and 3 (meaning the SPG method required more than the maximum number of function evaluations allowed). In these tables the variables #Iter in column 4, #EvalF in column 5, and MaxPIter in column 6 indicate the number of iterations, the number of function evaluations used by the SPG method, and the maximum number of iterations used by the projection strategy, respectively. The number appearing in column 7, denoted by MeanPIter, gives the average number of iterations for the projection algorithm. Finally, columns 8 and 9, denoted by pgtwon and pginfn, represent the values of the stopping criterion given by expression (38) when p = 2 and p = ∞, respectively.

The last table, Table 6, illustrates the type of solution obtained by the SPG method with different values of the parameter C for the three algorithms for projecting onto the set Ω. Columns 4, 5 and 6 show the number of support vectors that the optimal solution contains for each projection strategy. The last three columns of this table present the percentage of samples badly classified, i.e., samples that appear located on the wrong side of the separator.

In all the tables presented in this work, observe that for some databases we do not present the numerical results for all values of C, since the numerical behavior of the optimization technique is the same as the one obtained with the biggest value of C shown in the tables for each database.

Figures 1, 2, 3, 4, 5 and 6 show the real separator curves and the computed separator curves given by equations (3) and (4), using the optimal solution attained by the SPG method with Projected-KKT for different values of C. The figures for the SPG method with the Dai and Fletcher projection strategy are very similar to Figures 1, 2, 3, 4, 5 and 6 and therefore they are not shown here. The legend of these figures shows the symbols used to distinguish samples in class 1 from samples in class 2. Also, it is possible to distinguish, among all the samples, which of them are the support vectors in class 1 and the support vectors in class 2. These figures correspond to the databases with samples of dimension two (lineal_wk, sinoid_kp5, polin3_kp3, norm2_kg, riply_wk and riply_kp5), since they can be easily drawn.







From Table 2, observe that the SPG method together with the Dykstra alternating projection scheme requires more CPU time than the SPG method with the other two projection strategies, for all databases and all values of C. Moreover, for some values of C and some databases this method does not converge to a stationary point of problem (6). The SPG-Dykstra method is stopped when the maximum number of SPG iterations (500000) or the maximum number of Dykstra iterations (10000) is reached before convergence occurs. Moreover, the average percentage of CPU time consumed by Dykstra's algorithm within the SPG method, over all databases and all values of C, is 54.76%. In many cases, as for example for the databases lineal_wk, sylva7500_kp2, ada_kg, ada_wk, sonar_kp3 and riply_kp5, the percentage of CPU time consumed by the Dykstra algorithm is higher than this average. Observe that for ada_kg and C = 10 the CPU time used by the SPG method with the Dykstra strategy is 8 times the CPU time required by the SPG method with the Dai and Fletcher scheme. Also, for the database norm2_kg with C = 10000, the CPU time required by the SPG method with the Dykstra algorithm is 15 times the CPU time for the SPG method with Projected-KKT. The average percentages of CPU time used by the Dai and Fletcher scheme and the Projected-KKT strategy within the SPG method are 20.97% and 21.03%, respectively, over all problems solved. Moreover, in terms of efficiency of the three optimization methods, the SPG method using Projected-KKT has a slightly higher percentage of problems solved. So, the difference in average CPU time consumed by these two strategies is irrelevant.

From Tables 2, 3, 4 and 5, it is clear that the computational performance of the SPG method using the Dai and Fletcher projection scheme and using the proposed Projected-KKT is very similar. We can say that both strategies are competitive for solving problem (6), in sharp contrast with the SPG method with Dykstra's algorithm. For example, for the databases polin3_kp3, Sonar, Images, Ada and Hiva the performance of the SPG method using Dai and Fletcher is a little better than the performance of the SPG method with Projected-KKT. On the other hand, for the databases Riply, Gisette and Gina the SPG method with Projected-KKT shows a better computational behavior when compared with the SPG method using the Dai and Fletcher scheme. For the rest of the databases, the best computational performance varies with the method, the kernel used and the value of the parameter C. From now on, we focus our analysis only on the SPG method with the Dai and Fletcher and Projected-KKT schemes, since Dykstra's algorithm is not competitive with these two techniques.

Even when the Dai and Fletcher projection strategy and the Projected-KKT method are very precise in computing the projection required by the SPG method, small differences in the projection generate different iterations. These differences increase when the matrix D in problem (6) associated with the database is badly conditioned. It is clear from Tables 3, 4 and 5 that when the SPG method uses few iterations, the computational behavior of the two competitive strategies is similar, as for example for the databases Arcene, Dexter and Gisette, where the associated matrix D is not badly conditioned when compared with the rest of the databases. In problems where the SPG method uses more iterations, it seems that the difference in the projections computed by Dai and Fletcher and Projected-KKT increases, making the SPG method with these two projection strategies exhibit different computational behaviors, as for example in the databases Madelon, Gina and Riply, where the matrix D is badly conditioned. Another important observation is that the average number of projection iterations for the SPG method with the Dai and Fletcher strategy and the Projected-KKT scheme is not necessarily smaller when the SPG method uses fewer iterations. The computational performance of these two methods varies depending on the database (condition number of the matrix D) and the value of C.

It is also important to stress that for databases where the condition number of the matrix D is high, as for example the databases lineal_wk, sinoid_kp5 and norm2_kg, the computational behavior of the two competitive strategies is similar, but in these cases the dimension of the samples is two. So, the difficulty of solving the SVM problem (6) also depends on the size of the problem.

Table 6 indicates that the solutions obtained by the SPG method with the Dai and Fletcher strategy and the Projected-KKT scheme are almost equal for each database and each value of C.

Finally, we observe from Figures 1, 2, 3, 4, 5 and 6 that the proposed SPG with the Projected-KKT method classifies the samples in those databases. For the lineal_wk data (Fig. 1) the real level zero curve is very close to the computed level zero curve and it seems not to depend on the values of the parameter C. In contrast, for the nonlinearly separable data (sinoid_kp5, polin3_kp3 and norm2_kg), the computed level zero curve highly depends on the values of the parameter C and the kernel used. It is possible that in these nonlinearly separable samples the use of different kernels could give a better classification of the data. However, our interest in this work is to show that the proposed SPG with the Projected-KKT can solve the SVM problem (6), and that by experimenting with different kernels and values of C the solution could be improved.

6 Conclusions

In this work we proposed an algorithm for projecting onto the intersection of a single linear constraint and a box constrained set. This algorithm is based on the Karush-Kuhn-Tucker conditions for a quadratic programming problem over the described set. To the best of our knowledge, this is the first time that a KKT-type algorithm has been used in the SVM context for projecting onto this convex set. The proposed algorithm has finite termination unless cycling occurs. However, in practice, for well-conditioned diagonal positive definite matrices (such as the identity), cycling never occurs.

We studied how to solve the SVM optimization problem by using the Spectral Projected Gradient (SPG) method with the Projected-KKT method for solving the projection problem, as well as with the secant-based algorithm proposed by Dai and Fletcher and a version of Dykstra's method. The results obtained indicate that the Projected-KKT algorithm is competitive with the Dai and Fletcher projection strategy within the SPG method, and that both strategies outperform the SPG method with Dykstra's algorithm. Using the Projected-KKT projection strategy, more problems were solved than using the other two methods, indicating that it has a slightly higher efficiency. The number of support vectors attained by the two competitive methods is the same for almost all the problems tested, implying that both methods compute basically the same optimal separator. The computational behavior of these two methods highly depends on the size of the problem, the condition number of the Hessian matrix, the value of the parameter C, and the kernel used.

The results presented in this work are a contribution to the study of the behavior of optimization methods for solving large scale SVM problems.

Acknowledgements. The authors are grateful to M.A. Francisco and F. Vargas from Universidad Simón Bolívar for programming assistance and useful contributions.

Received: 02/XII/08.

Accepted: 10/V/09.

#CAM-102/08.

  • [1] Neural Information Processing Systems (NIPS) Conference: Feature Selection Challenge, 2003. http://www.nipsfsc.ecs.soton.ac.uk/datasets/
  • [2] IEEE World Congress on Computational Intelligence: Performance Prediction Challenge, 2006. http://www.modelselect.inf.ethz.ch/
  • [3] E. Birgin, J.M. Martínez and M. Raydan, Inexact spectral projected gradient methods on convex sets. IMA Journal of Numerical Analysis, 23 (2003), 539-559.
  • [4] E.G. Birgin and M. Raydan, Robust stopping criteria for Dykstra's algorithm. SIAM Journal on Scientific Computing, 26 (2005), 1405-1414.
  • [5] E.G. Birgin, J.M. Martinez and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10 (2000), 1196-1211.
  • [6] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V.N. Vapnik and T. Vetter, Comparison of view-based object recognition algorithms using realistic 3d models. In: J.C. Vorbrüggen, C. von der Malsburg, W. von Seelen and B. Sendhoff, editors, Artificial Neural Networks, ICANN'96, pages 251-256, Berlin 1996. Springer Lecture Notes in Computer Science, 1112 (1996).
  • [7] B.E. Boser, I.M. Guyon and V.N. Vapnik, A training algorithm for optimal margin classifiers. In: Proceedings of the 5th ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992.
  • [8] J.P. Boyle and R.L. Dykstra, A method for finding projections onto the intersections of convex sets in Hilbert spaces. Lecture Notes in Statistics, 37 (1986), 28-47.
  • [9] C.J.C. Burges, A tutorial on support vector machines for pattern recognition. In: U. Fayyad, editor, Data Mining on Knowledge Discovery, pages 121-167. Kluwer Academic Publishers, Netherlands (1998).
  • [10] C.J.C. Burges and B. Schölkopf, Improving the accuracy and speed of support vector machines. In: M. Jordan M. Mozer and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381. MIT Press, Cambridge, MA (1997).
  • [11] W. Cheney and A. Goldstein, Proximity maps for convex sets. Proc. Amer. Math. Soc., 10 (1959), 448-450.
  • [12] C. Cortes and V. Vapnik, Support vector networks. Machine Learning, 20 (1995), 1-25.
  • [13] N. Cristianini and J.S. Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, United Kingdom (2000).
  • [14] Y.H. Dai and R. Fletcher, New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program., 106 (2006), 403-421.
  • [15] J.E. Dennis Jr. and R.B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ (1983).
  • [16] R.L. Dykstra, An algorithm for restricted least-squares regression. J. Amer. Stat. Assoc., 78 (1983), 837-842.
  • [17] R. Escalante and M. Raydan, Dykstra's algorithm for a constrained least-squares matrix problem. Num. Linear Algebra Appl., 3(6) (1996), 459-471.
  • [18] R. Escalante and M. Raydan, Dykstra's algorithm for constrained least-squares rectangular matrix problems. Computers Math. Applic., 35(6) (1998), 73-79.
  • [19] M. Ferris and T. Munson, Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3) (2003), 783-804.
  • [20] J.C. Gilbert and J. Nocedal, Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim., 2 (1992), 21-42.
  • [21] R.P. Gorman and T.J. Sejnowski, Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1 (1988), 75-89.
  • [22] S.P. Han, A successive projection method. Math. Program, 40 (1988), 1-14.
  • [23] H. Lin and C. Lin, A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan (2003).
  • [24] T. Joachims, Text categorization with support vector machines: Learning with many relevant features. In: C. Nédellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML-98), pages 137-142. Springer, April 1998.
  • [25] J.J. Júdice and F.M. Pires, Solution of large-scale separable strictly convex quadratic programs on the simplex. Linear Algebra and its Applications, (1989), 215-220.
  • [26] J. Nocedal and S.J. Wright, Numerical Optimization. Springer, New York (1999).
  • [27] E.E. Osuna, R. Freund and F. Girosi, Support Vector Machines: Training and Applications. Technical Report A.I. Memo No. 1602, C.B.C.L. Paper No. 144, Massachusetts Institute of Technology, Cambridge, MA, USA (1997).
  • [28] E.E. Osuna, R. Freund and F. Girosi, Training support vector machines: An application to face detection. In: IEEE Conference on Computer Vision and Pattern Recognition, (1997), 130-136.
  • [29] P.M. Pardalos and N. Kovoor, An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46 (1990), 321-328.
  • [30] M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Opt., 7 (1997), 26-33.
  • [31] B.D. Ripley, Neural networks and related methods for classification (with discussion). J. Royal Statistical Soc. Series B, 56 (1994), 409-456.
  • [32] M. Schmidt, Identifying speaker with support vector networks. In: Interface'96 Proceedings. Sydney, Australia (1996).
  • [33] B. Schölkopf, C. Burges and V.N. Vapnik, Extracting support data for a given task. In: U.M. Fayyad and R. Uthurusamy, editors, Proceedings First International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA (1995).
  • [34] B. Schölkopf, C. Burges and V.N. Vapnik, Incorporating invariances in support vector learning machines. In: J.C. Vorbrüggen, C. von der Malsburg, W. von Seelen and B. Sendhoff, editors, Artificial Neural Networks, ICANN'96, pages 47-52, Berlin 1996. Springer Lecture Notes in Computer Science, 1112 (1996).
  • [35] T. Serafini, G. Zanghirati and L. Zanni, Gradient projection methods for quadratic programs and applications in training support vector machines. Optimization Methods and Software, 20 (2003), 353-378.
  • [36] D.F. Shanno and K.H. Phua, Remark on algorithm 500: Minimization of unconstrained multivariate functions. ACM Trans. Math. Software, 6 (1980), 618-622.
  • [37] K. Sung and T. Poggio, Example-based learning for view-based human face detection. Technical Report A.I. Memo No. 1521, C.B.C.L. Paper No. 112, Massachusetts Institute of Technology, Cambridge, MA, USA (1994).
  • [38] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995).
  • [39] J. von Neumann, Functional operators, Vol. II. The geometry of orthogonal spaces. Annals of Math. Studies, 22 (1950). Princeton University Press.
  • [40] J. Weston, A. Gammerman, M.O. Stitson, V. Vapnik, V. Vovk and C. Watkins, Density estimation using support vector machines. Technical Report CSD-TR-97-23, Royal Holloway College (1997).
  • [41] L. Zanni, T. Serafini and G. Zanghirati, Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7 (2006), 1467-1492.
  • *
    This work was partially supported by the Centro de Estadística y Software Matemático (CESMa) and the Decanato de Investigación y Desarrollo (DID) at USB.
  • †
    Corresponding author.