## On-line version ISSN 1807-0302

### Comput. Appl. Math. v.22 n.1 São Carlos  2003

On the convergence properties of the projected gradient method for convex optimization

A. N. Iusem*

Instituto de Matemática Pura e Aplicada (IMPA) Estrada Dona Castorina 110, 22460-320 Rio de Janeiro, RJ, Brazil
E-mail: iusp@impa.br

ABSTRACT

When applied to an unconstrained minimization problem with a convex objective, the steepest descent method has stronger convergence properties than in the nonconvex case: the whole sequence converges to an optimal solution under the only hypothesis of existence of minimizers (i.e. without assuming e.g. boundedness of the level sets). In this paper we look at the projected gradient method for constrained convex minimization. Convergence of the whole sequence to a minimizer assuming only existence of solutions has already been established for the variant in which the stepsizes are exogenously given and square summable. In this paper, we prove the result for the more standard (and also more efficient) variant, namely the one in which the stepsizes are determined through an Armijo search.

Mathematical subject classification: 90C25, 90C30.

Key words: projected gradient method, convex optimization, quasi-Fejér convergence.

1 Introduction

1.1 The problem

We are concerned in this paper with the following smooth optimization problem:

$$\min f(x) \tag{1}$$

$$\text{s.t. } x \in C, \tag{2}$$

where $f:\mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $C \subset \mathbb{R}^n$ is closed and convex. Each iteration of the projected gradient method, which we describe formally in subsection 1.3, basically consists of two stages: starting from the $k$-th iterate $x^k \in \mathbb{R}^n$, first a step is taken in the direction of $-\nabla f(x^k)$, and then the resulting point is projected onto $C$, possibly with additional one-dimensional searches in either one of the stages. The classical convergence result establishes that cluster points of $\{x^k\}$ (if any) are stationary points for (1)-(2), i.e. they satisfy the first order optimality conditions, but in general neither existence nor uniqueness of cluster points is guaranteed. In this paper we prove a much stronger result for the case in which $f$ is convex, namely that the whole sequence $\{x^k\}$ converges to a solution of (1)-(2) under the only assumption of existence of solutions. An analogous result is known to hold for the steepest descent method for unconstrained optimization, which we describe in the next subsection.

1.2 The steepest descent method

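Given a continuously differentiable $f:\mathbb{R}^n \to \mathbb{R}$, the steepest descent method generates a sequence $\{x^k\} \subset \mathbb{R}^n$ through

$$x^{k+1} = x^k - \beta_k \nabla f(x^k), \tag{3}$$

where $\beta_k$ is some positive real number. Several choices are available for $\beta_k$. The first one is to set the $\beta_k$'s exogenously, and a relevant option is

$$\beta_k = \frac{\alpha_k}{\|\nabla f(x^k)\|}, \tag{4}$$

with

$$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty. \tag{5}$$

Other options consist of performing an exact line minimization, i.e.

$$\beta_k = \operatorname{argmin}_{\beta > 0} f(x^k - \beta \nabla f(x^k)), \tag{6}$$

or an inexact linesearch, e.g. following an Armijo rule, namely

$$\beta_k = \bar\beta\, 2^{-\ell(k)}, \tag{7}$$

with

$$\ell(k) = \min\{ j \in \mathbb{Z}_{\ge 0} : f(x^k - \bar\beta 2^{-j} \nabla f(x^k)) \le f(x^k) - \sigma \bar\beta 2^{-j} \|\nabla f(x^k)\|^2 \}, \tag{8}$$

for some $\bar\beta > 0$, $\sigma \in (0,1)$.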

The basic convergence result on this method (and in general on descent direction methods), under either exact linesearches or Armijo searches, derives from Zangwill's global convergence theorem (see [16]) and establishes that every cluster point $\bar x$ of $\{x^k\}$, if any, is stationary, i.e. such that $\nabla f(\bar x) = 0$. In order to ensure existence of cluster points, it is necessary to assume that the starting iterate $x^0$ belongs to a bounded level set of $f$ (see [15] for this and other related results). The situation is considerably better when $f$ is convex: it is possible to prove convergence of the whole sequence to a minimizer of $f$ under the sole assumption of existence of minimizers (i.e. without any additional assumption on boundedness of level sets). Results of this kind for the convex case can be found in [8], [10] and [14] for the method with exogenously given $\beta_k$'s satisfying (4)-(5), in [12] for the method with exact linesearches as in (6), and in [2], [7] and [13] for the method with the Armijo rule (7)-(8). We observe that in the case of $\beta_k$'s given by (4)-(5) the method is not in general a descent one, i.e. it is not guaranteed that $f(x^{k+1}) < f(x^k)$ for all $k$.
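As an illustration of the Armijo rule (7)-(8), the following Python sketch starts from a trial stepsize and halves it until the sufficient decrease test $f(x - \beta \nabla f(x)) \le f(x) - \sigma \beta \|\nabla f(x)\|^2$ holds. It is a minimal sketch, not code from the paper: the function name `armijo_steepest_descent`, the default values of `beta_bar` and `sigma`, the tolerance `tol` and the iteration cap are assumptions made for the example. The test objective is convex with an unbounded set of minimizers, which is the setting of the full convergence results cited above.

```python
from typing import Callable, List

Vector = List[float]


def armijo_steepest_descent(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    x0: Vector,
    beta_bar: float = 1.0,   # initial trial stepsize (assumed value)
    sigma: float = 0.5,      # Armijo parameter in (0, 1) (assumed value)
    tol: float = 1e-16,      # stop when the squared gradient norm is below tol
    max_iter: int = 1000,
) -> Vector:
    """Steepest descent with stepsizes beta_bar * 2**(-j) found by backtracking."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        g_norm2 = sum(gi * gi for gi in g)
        if g_norm2 <= tol:
            break
        fx = f(x)
        beta = beta_bar
        # Halve beta until the sufficient decrease test is satisfied.
        while True:
            trial = [xi - beta * gi for xi, gi in zip(x, g)]
            if f(trial) <= fx - sigma * beta * g_norm2:
                break
            beta *= 0.5
        x = trial
    return x


# Convex objective whose minimizers form the unbounded line {x : x[0] = 1}.
f = lambda x: 0.5 * (x[0] - 1.0) ** 2
grad = lambda x: [x[0] - 1.0, 0.0]
x_star = armijo_steepest_descent(f, grad, [5.0, -3.0])
```

In this example the second coordinate is never moved, so the limit depends on the starting point, as expected when the minimizer is not unique.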

1.3 The projected gradient method

In this subsection we deal with problem (1)-(2). Convexity of $C$ makes it possible to use the orthogonal projection onto $C$, $P_C:\mathbb{R}^n \to C$, for obtaining feasible directions which are also descent ones; namely a step is taken from $x^k$ in the direction of $-\nabla f(x^k)$, the resulting vector is projected onto $C$, and the direction from $x^k$ to this projection has the above mentioned properties. We recall that a point $z \in C$ is stationary for problem (1)-(2) iff $\nabla f(z)^t (x - z) \ge 0$ for all $x \in C$. A formal description of the algorithm, called the projected gradient method, is the following:

Initialization: Take $x^0 \in C$.

Iterative step: If $x^k$ is stationary, then stop. Otherwise, let

$$z^k = P_C(x^k - \beta_k \nabla f(x^k)), \tag{9}$$

$$x^{k+1} = x^k + \gamma_k (z^k - x^k), \tag{10}$$

where $\beta_k$, $\gamma_k$ are positive stepsizes, for which, again, several choices are possible. Before discussing them, we mention that in the unconstrained case, i.e. $C = \mathbb{R}^n$, the method given by (9)-(10) with $\gamma_k = 1$ for all $k$ reduces to (3). Following [3], we will focus on three strategies for the stepsizes:

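i) Armijo search along the feasible direction: $\{\beta_k\} \subset [\tilde\beta, \hat\beta]$ for some $0 < \tilde\beta \le \hat\beta$ and $\gamma_k$ determined with an Armijo rule, namely

$$\gamma_k = 2^{-\ell(k)} \tag{11}$$

with

$$\ell(k) = \min\{ j \in \mathbb{Z}_{\ge 0} : f(x^k + 2^{-j}(z^k - x^k)) \le f(x^k) + \sigma 2^{-j} \nabla f(x^k)^t (z^k - x^k) \} \tag{12}$$

for some $\sigma \in (0,1)$.

ii) Armijo search along the boundary of $C$: $\gamma_k = 1$ for all $k$ and $\beta_k$ determined through (7) and the following two equations instead of (8):

$$\ell(k) = \min\{ j \in \mathbb{Z}_{\ge 0} : f(z^{k,j}) \le f(x^k) + \sigma \nabla f(x^k)^t (z^{k,j} - x^k) \} \tag{13}$$

with

$$z^{k,j} = P_C(x^k - \bar\beta 2^{-j} \nabla f(x^k)). \tag{14}$$

iii) Exogenous stepsize before projecting: $\beta_k$ given by (4)-(5) and $\gamma_k = 1$ for all $k$.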
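Strategy (i) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the stepsize before projecting is held constant (`beta`, a special case of a bounded sequence), stationarity is detected through the standard characterization that $x$ is stationary exactly when it equals the projection of $x - \beta \nabla f(x)$, and the names `projected_gradient_armijo`, `project`, `tol` and `max_iter` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def projected_gradient_armijo(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],  # orthogonal projection onto C
    x0: Vector,                           # assumed to lie in C
    beta: float = 1.0,                    # constant stepsize before projecting (assumed)
    sigma: float = 0.5,                   # Armijo parameter in (0, 1) (assumed)
    tol: float = 1e-16,
    max_iter: int = 1000,
) -> Vector:
    """One projection per outer iteration, then halving of gamma along z - x."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        z = project([xi - beta * gi for xi, gi in zip(x, g)])
        d = [zi - xi for zi, xi in zip(z, x)]
        if dot(d, d) <= tol:          # z coincides with x: x is stationary
            break
        slope = dot(g, d)             # directional derivative along z - x
        fx = f(x)
        gamma = 1.0
        # Armijo search on the segment between x and z (no further projections).
        while f([xi + gamma * di for xi, di in zip(x, d)]) > fx + sigma * gamma * slope:
            gamma *= 0.5
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return x


# Convex objective on the half-space C = {x : x[0] >= 2}; the solution set
# {(2, t) : t real} is unbounded.
f = lambda x: 0.5 * (x[0] - 1.0) ** 2
grad = lambda x: [x[0] - 1.0, 0.0]
project = lambda u: [max(u[0], 2.0), u[1]]
x_star = projected_gradient_armijo(f, grad, project, [5.0, -3.0])
```

Only one call to `project` is made per outer iteration; the inner loop evaluates `f` alone.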

Several comments are in order. First, observe that in the unconstrained case ($C = \mathbb{R}^n$) options (i) and (ii) reduce to the steepest descent method (3) with the Armijo rule given by (7)-(8), while option (iii) reduces to (3) with the $\beta_k$'s given by (4)-(5). Secondly, note that option (ii) requires a projection onto $C$ for each step of the inner loop resulting from the Armijo search, i.e. possibly many projections for each $k$, while option (i) demands only one projection for each outer step, i.e. for each $k$. Thus, option (ii) is competitive only when $P_C$ is very easy to compute (e.g. when $C$ is a box or a ball). Third, we mention that option (iii), as its counterpart in the unconstrained case, fails to be a descent method. Finally, it is easy to show that for option (iii) it holds that $\|x^{k+1} - x^k\| \le \alpha_k$ for all $k$, with $\alpha_k$ as in (4). In view of (5), this means that all stepsizes are "small", while options (i) and (ii) allow for occasionally long steps. Thus option (iii) seems rather undesirable. Its redeeming feature is that its good convergence properties hold also in the nonsmooth case, when $\nabla f(x^k)$ is replaced by a subgradient $\xi^k$ of $f$ at $x^k$. Subgradients do not give rise to descent directions, so that Armijo searches are not guaranteed to succeed, and therefore exogenous stepsizes seem to be the only available alternative. This is the case analyzed in [1]. We will not be concerned with option (iii) in the sequel.
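For the two sets mentioned above as having an easy projection, a box and a ball, the projection has a closed form. The sketch below uses hypothetical helper names (`project_box`, `project_ball`) and plain Python lists; it is meant only to show why each call is cheap.

```python
import math
from typing import List

Vector = List[float]


def project_box(u: Vector, lower: Vector, upper: Vector) -> Vector:
    """Projection onto {x : lower <= x <= upper}: clip each coordinate."""
    return [min(max(ui, lo), hi) for ui, lo, hi in zip(u, lower, upper)]


def project_ball(u: Vector, center: Vector, radius: float) -> Vector:
    """Projection onto the Euclidean ball of the given center and radius."""
    diff = [ui - ci for ui, ci in zip(u, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(u)                 # u is already in the ball
    scale = radius / dist              # shrink towards the center
    return [ci + scale * di for ci, di in zip(center, diff)]


p_box = project_box([3.0, -1.0], [0.0, 0.0], [1.0, 1.0])
p_ball = project_ball([3.0, 4.0], [0.0, 0.0], 1.0)
```

Both routines cost a number of operations proportional to the dimension, so repeating them inside an inner search loop, as option (ii) does, is affordable for these sets.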

Without assuming convexity of $f$, the convergence results for these methods closely mirror the ones for the steepest descent method in the unconstrained case: cluster points may fail to exist, even when (1)-(2) has solutions, but if they exist, they are stationary and feasible, i.e. $\langle \nabla f(\bar x), x - \bar x \rangle \ge 0$, $\bar x \in C$ for every cluster point $\bar x$ of $\{x^k\}$ and all $x \in C$. These results can be found in Section 2.3.2 of [3]; for the case of option (ii), they are based upon the results in [9].

When $f$ is convex, the stronger results for the unconstrained case, with $\beta_k$'s given by (4)-(5), have also been extended to the projected gradient method under option (iii): it has been proved in [1] that in such a case, the whole sequence $\{x^k\}$ converges to a solution of problem (1)-(2) under the sole assumption of existence of solutions. On the other hand, the current situation is rather worse for options (i) and (ii): as far as we know, neither existence nor uniqueness of cluster points for options (i) and (ii) has been proved, assuming only convexity of $f$. We will prove both for option (i) in the following two sections. The corresponding results for the less interesting option (ii) have very similar proofs, and we sketch them in Section 4.

Of course, results of this kind are immediate under stronger hypotheses on the problem: cluster points of $\{x^k\}$ certainly exist if the intersection of $C$ with some level set of $f$ is nonempty and bounded, and strict convexity of $f$ ensures uniqueness of the cluster point.

2 Preliminaries

This section contains some previously established results needed in our analysis. We prove them in order to make the paper closer to being self-contained. We start with the so called quasi-Fejér convergence theorem (see [7], Theorem 1).

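Proposition 1. Let $T \subset \mathbb{R}^n$ be a nonempty set and $\{a^k\} \subset \mathbb{R}^n$ a sequence such that

$$\|a^{k+1} - z\|^2 \le \|a^k - z\|^2 + \epsilon_k \tag{15}$$

for all $z \in T$ and all $k$, where $\{\epsilon_k\} \subset \mathbb{R}_+$ is a summable sequence. Then

i) $\{a^k\}$ is bounded.

ii) If a cluster point $\bar a$ of $\{a^k\}$ belongs to $T$, then the whole sequence $\{a^k\}$ converges to $\bar a$.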

Proof.

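i) Fix some $z \in T$. Applying iteratively (15) we get

$$\|a^k - z\|^2 \le \|a^0 - z\|^2 + \sum_{j=0}^{k-1} \epsilon_j \le \|a^0 - z\|^2 + \sum_{j=0}^{\infty} \epsilon_j.$$

Since $\{\epsilon_k\}$ is summable, it follows that $\{a^k\}$ is bounded.

ii) Let now $\bar a \in T$ be a cluster point of $\{a^k\}$ and take $\delta > 0$. Let $\{a^{\ell_k}\}$ be a subsequence of $\{a^k\}$ convergent to $\bar a$. Since $\{\epsilon_k\}$ is summable, there exists $k_0$ such that $\sum_{j=k_0}^{\infty} \epsilon_j < \delta/2$, and there exists $k_1$ such that $\ell_{k_1} \ge k_0$ and $\|a^{\ell_k} - \bar a\|^2 < \delta/2$ for any $k \ge k_1$. Then, for any $k > \ell_{k_1}$ we have:

$$\|a^k - \bar a\|^2 \le \|a^{\ell_{k_1}} - \bar a\|^2 + \sum_{j=\ell_{k_1}}^{k-1} \epsilon_j \le \|a^{\ell_{k_1}} - \bar a\|^2 + \sum_{j=k_0}^{\infty} \epsilon_j < \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$

We conclude that $\lim_{k \to \infty} a^k = \bar a$.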

Next we show that the linesearch for option (i) is always successful. We start with an immediate fact on descent directions.

Proposition 2. Take $\sigma \in (0,1)$, $x \in C$ and $v \in \mathbb{R}^n$ such that $\nabla f(x)^t v < 0$. Then there exists $\bar\gamma < 1$ such that $f(x + \gamma v) < f(x) + \sigma \gamma \nabla f(x)^t v$ for all $\gamma \in (0, \bar\gamma]$.

Proof. The result follows from the differentiability of f.

Next we prove that the direction $z^k - x^k$ in option (i) is a descent one.

Proposition 3. Take $x^k$ and $z^k$ as defined by (9)-(12). Then

i) $x^k$ belongs to $C$ for all $k$.

ii) If $\nabla f(x^k) \ne 0$, then $\nabla f(x^k)^t (z^k - x^k) < 0$.

Proof.

i) By induction. It holds for $k = 0$ by the initialization step. Assume that $x^k \in C$. By (9), $z^k \in C$. By (12), $\gamma_k \in [0,1]$. By (10) and convexity of $C$, $x^{k+1} \in C$.

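ii) A well known elementary property of orthogonal projections states that $\langle v - P_C(u), P_C(u) - u \rangle \ge 0$ for all $u \in \mathbb{R}^n$, $v \in C$. By (i), $x^k \in C$. Thus, in view of (9), taking $u = x^k - \beta_k \nabla f(x^k)$ and $v = x^k$,

$$\langle x^k - z^k, z^k - x^k + \beta_k \nabla f(x^k) \rangle \ge 0. \tag{16}$$

Since $\beta_k > 0$, it follows from (16) that $\nabla f(x^k)^t (z^k - x^k) \le -\|z^k - x^k\|^2 / \beta_k$. Since $z^k$ is computed only when $x^k$ is not stationary, we have $z^k \ne x^k$, and therefore $\nabla f(x^k)^t (z^k - x^k) < 0$.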
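The descent property can be checked numerically at a single point. A standard consequence of the projection inequality is the bound $\nabla f(x)^t (z - x) \le -\|z - x\|^2 / \beta$ for $z = P_C(x - \beta \nabla f(x))$ and $x \in C$; the sketch below evaluates both sides for a unit box. The helper name `feasible_direction_slope` and the sample data are assumptions made for the illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def feasible_direction_slope(
    grad_x: Vector,
    x: Vector,
    beta: float,
    project: Callable[[Vector], Vector],
) -> Tuple[float, float]:
    """Return (grad^t (z - x), -|z - x|^2 / beta) for z = project(x - beta * grad)."""
    z = project([xi - beta * gi for xi, gi in zip(x, grad_x)])
    d = [zi - xi for zi, xi in zip(z, x)]
    slope = sum(gi * di for gi, di in zip(grad_x, d))
    bound = -sum(di * di for di in d) / beta
    return slope, bound


# C is the unit box [0, 1]^2; x is feasible and the gradient value is sample data.
project_unit_box = lambda u: [min(max(ui, 0.0), 1.0) for ui in u]
slope, bound = feasible_direction_slope([1.0, -3.0], [0.5, 0.5], 0.5, project_unit_box)
# slope = -2.0 and bound = -1.0, so z - x is a descent direction here.
```

The slope is strictly negative whenever the projected point differs from `x`, which is what the Armijo search of option (i) needs in order to terminate.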

Corollary 1. If $\nabla f(x^k) \ne 0$, then $\gamma_k$ is well defined for the algorithm (9)-(12).

Proof. Consider Proposition 2 with $x = x^k$, $v = z^k - x^k$. By Proposition 3(ii), $\nabla f(x)^t v < 0$. Thus the assumption of Proposition 2 holds, and the announced $\bar\gamma$ exists, so that the inequality in (12) holds for all $j$ such that $2^{-j} \le \bar\gamma$. It follows that both $\ell(k)$ and $\gamma_k$ are well defined.

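Finally, we prove stationarity of the cluster points of $\{x^k\}$, if any.

Proposition 4. Let $\{x^k\}$ be the sequence defined by (9)-(12). If $\{x^k\}$ is infinite, $\bar x$ is a cluster point of $\{x^k\}$ and Problem (1)-(2) has solutions, then $\bar x$ is stationary for Problem (1)-(2).

Proof. Since $C$ is closed, $\bar x$ belongs to $C$ by Proposition 3(i). Let $\{x^{j_k}\}$ be a subsequence of $\{x^k\}$ such that $\lim_{k \to \infty} x^{j_k} = \bar x$. Observe that $\{\gamma_k\} \subset [0,1]$ by (11) and Corollary 1, and that $\{\beta_k\} \subset [\tilde\beta, \hat\beta]$. Thus, we may assume without loss of generality that $\lim_{k \to \infty} \gamma_{j_k} = \bar\gamma \in [0,1]$, $\lim_{k \to \infty} \beta_{j_k} = \bar\beta \ge \tilde\beta > 0$. By (10), (11) and (12),

$$0 < -\sigma \gamma_k \nabla f(x^k)^t (z^k - x^k) \le f(x^k) - f(x^{k+1}). \tag{17}$$

It follows from (17) that $\{f(x^k)\}$ is a decreasing sequence. Since $\{x^k\} \subset C$ by Proposition 3(i) and Problem (1)-(2) has solutions, $\{f(x^k)\}$ is bounded below, hence convergent, so that $\lim_{k \to \infty} [f(x^k) - f(x^{k+1})] = 0$. Taking limits in (17) along the subsequence, and taking into account (9), we get

$$0 = \bar\gamma \nabla f(\bar x)^t [P_C(\bar x - \bar\beta \nabla f(\bar x)) - \bar x], \tag{18}$$

using also continuity of both $\nabla f$ and $P_C$. Now we consider two cases. Suppose first that $\bar\gamma > 0$. Let $\bar u = \bar x - \bar\beta \nabla f(\bar x)$. Then, it follows from (18) that

$$0 = \nabla f(\bar x)^t [P_C(\bar u) - \bar x], \tag{19}$$

implying that $0 = (\bar u - \bar x)^t [P_C(\bar u) - \bar x]$. Since $\bar x$ belongs to $C$, an elementary property of orthogonal projections implies that $\bar x = P_C(\bar u) = P_C(\bar x - \bar\beta \nabla f(\bar x))$, and, taking into account that $\bar\beta > 0$, it follows easily that $\nabla f(\bar x)^t (x - \bar x) \ge 0$ for all $x \in C$, i.e. $\bar x$ is stationary for Problem (1)-(2).

Consider finally the case of $0 = \bar\gamma = \lim_{k \to \infty} \gamma_{j_k}$. Fix $q \in \mathbb{N}$. Since $\gamma_{j_k} = 2^{-\ell(j_k)}$, there exists $\bar k$ such that $\ell(j_k) > q$ for all $k \ge \bar k$, so that, in view of (12),

$$f(x^{j_k} + 2^{-q}(z^{j_k} - x^{j_k})) > f(x^{j_k}) + \sigma 2^{-q} \nabla f(x^{j_k})^t (z^{j_k} - x^{j_k}). \tag{20}$$

Taking limits in (20) with $k \to \infty$, and defining $\bar z = P_C(\bar x - \bar\beta \nabla f(\bar x))$, we get, for an arbitrary $q \in \mathbb{N}$,

$$f(\bar x + 2^{-q}(\bar z - \bar x)) \ge f(\bar x) + \sigma 2^{-q} \nabla f(\bar x)^t (\bar z - \bar x). \tag{21}$$

Combining (21) with Proposition 2, we conclude that $\nabla f(\bar x)^t (\bar z - \bar x) \ge 0$. Using now Proposition 3(ii), we get that $0 = \nabla f(\bar x)^t (\bar z - \bar x) = \nabla f(\bar x)^t (P_C(\bar u) - \bar x)$, i.e. (19) holds also in this case, and the conclusion is obtained with the same argument as in the previous case.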

Two comments are in order. First, no result proved up to this point requires convexity of $f$. Second, all these results are rather standard and well known (see e.g. Section 2.3.2 in [3] or [16]). The novelty of this paper lies in the following sections.

3 Convergence properties in the convex case

In this section we prove that when f is convex then the sequence generated by variant (i) of the projected gradient method (i.e. (9)-(12)) converges to a solution of Problem (1)-(2), under the only assumption of existence of solutions.

Theorem 1. Assume that Problem (1)-(2) has solutions. Then, either the algorithm given by (9)-(12) stops at some iteration $k$, in which case $x^k$ is a solution of Problem (1)-(2), or it generates an infinite sequence $\{x^k\}$, which converges to a solution $x^*$ of the problem.

Proof. In the case of finite stopping, the stopping rule states that $x^k$ is stationary. Since $f$ is convex, stationary points are solutions of Problem (1)-(2). We assume in the sequel that the algorithm generates an infinite sequence $\{x^k\}$.

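Let $\bar x$ be any solution of Problem (1)-(2). Using (10) and elementary algebra:

$$\|x^{k+1} - x^k\|^2 + \|x^k - \bar x\|^2 - \|x^{k+1} - \bar x\|^2 = 2 \langle x^k - x^{k+1}, x^k - \bar x \rangle = 2 \gamma_k \langle z^k - x^k, \bar x - x^k \rangle. \tag{22}$$

The already used elementary property of orthogonal projections can be restated as $\langle P_C(u) - u, v - P_C(u) \rangle \ge 0$ for all $u \in \mathbb{R}^n$ and all $v \in C$. In view of (9),

$$\langle z^k - x^k + \beta_k \nabla f(x^k), \bar x - z^k \rangle \ge 0. \tag{23}$$

By (23),

$$\begin{aligned} \langle z^k - x^k, \bar x - x^k \rangle &= \|z^k - x^k\|^2 + \langle z^k - x^k, \bar x - z^k \rangle \\ &\ge \|z^k - x^k\|^2 + \beta_k \langle \nabla f(x^k), z^k - x^k \rangle + \beta_k \langle \nabla f(x^k), x^k - \bar x \rangle \\ &\ge \|z^k - x^k\|^2 + \beta_k \langle \nabla f(x^k), z^k - x^k \rangle + \beta_k [f(x^k) - f(\bar x)] \\ &\ge \|z^k - x^k\|^2 + \beta_k \langle \nabla f(x^k), z^k - x^k \rangle, \end{aligned} \tag{24}$$

using the gradient inequality for the convex function $f$ in the second inequality, and feasibility of $x^k$, resulting from Proposition 3(i), together with optimality of $\bar x$ and positivity of $\beta_k$, in the third one. Combining now (22) and (24), and taking into account (10),

$$\gamma_k^2 \|z^k - x^k\|^2 + \|x^k - \bar x\|^2 - \|x^{k+1} - \bar x\|^2 \ge 2 \gamma_k \|z^k - x^k\|^2 + 2 \gamma_k \beta_k \langle \nabla f(x^k), z^k - x^k \rangle. \tag{25}$$

After rearrangement, we obtain from (25),

$$\|x^{k+1} - \bar x\|^2 \le \|x^k - \bar x\|^2 - \gamma_k (2 - \gamma_k) \|z^k - x^k\|^2 - 2 \gamma_k \beta_k \langle \nabla f(x^k), z^k - x^k \rangle \le \|x^k - \bar x\|^2 - 2 \gamma_k \beta_k \langle \nabla f(x^k), z^k - x^k \rangle, \tag{26}$$

using the fact that $\gamma_k$ belongs to $[0,1]$ in the second inequality. Now we look at the specific way in which the $\gamma_k$'s are determined. By (11), (12), for all $j$,

$$-\sigma \gamma_j \langle \nabla f(x^j), z^j - x^j \rangle \le f(x^j) - f(x^{j+1}). \tag{27}$$

Multiplying (27) by $2\beta_j/\sigma$, and defining

$$\epsilon_j = -2 \beta_j \gamma_j \langle \nabla f(x^j), z^j - x^j \rangle, \tag{28}$$

we get, since $\{f(x^j)\}$ is nonincreasing,

$$\epsilon_j \le \frac{2\beta_j}{\sigma} [f(x^j) - f(x^{j+1})] \le \frac{2\hat\beta}{\sigma} [f(x^j) - f(x^{j+1})]. \tag{29}$$

Summing (29) with $j$ between 0 and $k$,

$$\sum_{j=0}^{k} \epsilon_j \le \frac{2\hat\beta}{\sigma} [f(x^0) - f(x^{k+1})] \le \frac{2\hat\beta}{\sigma} [f(x^0) - f(\bar x)], \tag{30}$$

and it follows from (30) that $\sum_{k=0}^{\infty} \epsilon_k < \infty$. By (26) and (28),

$$\|x^{k+1} - \bar x\|^2 \le \|x^k - \bar x\|^2 + \epsilon_k. \tag{31}$$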

Let $S^*$ be the set of solutions of Problem (1)-(2). Since $\bar x$ is an arbitrary element of $S^*$ and the $\epsilon_k$'s are summable, (31) means that $\{x^k\}$ is quasi-Fejér convergent to $S^*$. Since $S^*$ is nonempty by assumption, it follows from Proposition 1(i) that $\{x^k\}$ is bounded, and therefore it has cluster points. By Proposition 4 all such cluster points are stationary. By convexity of $f$, they are solutions of Problem (1)-(2), i.e. they belong to $S^*$. By Proposition 1(ii), the whole sequence $\{x^k\}$ converges to a solution of Problem (1)-(2).
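The quasi-Fejér inequality behind this proof can be observed on a small instance. The sketch below runs option (i) with a constant stepsize before projecting and records, at each iteration, the quantity $\epsilon_k = -2 \beta_k \gamma_k \nabla f(x^k)^t (z^k - x^k)$, which is one choice of error term consistent with the argument above. The function name `run_with_fejer_terms`, the constant `beta`, the value of `sigma` and the test problem are assumptions made for the illustration; the problem has a whole segment of minimizers, so no strict convexity is available.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def run_with_fejer_terms(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    x0: Vector,
    beta: float = 1.0,    # constant stepsize before projecting (assumed)
    sigma: float = 0.5,   # Armijo parameter (assumed)
    max_iter: int = 100,
) -> Tuple[List[Vector], List[float]]:
    """Option (i); returns the iterates and the terms eps_k defined in the lead-in."""
    xs, eps = [list(x0)], []
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        z = project([xi - beta * gi for xi, gi in zip(x, g)])
        d = [zi - xi for zi, xi in zip(z, x)]
        if dot(d, d) == 0.0:          # z equals x: x is stationary
            break
        slope = dot(g, d)
        fx = f(x)
        gamma = 1.0
        while f([xi + gamma * di for xi, di in zip(x, d)]) > fx + sigma * gamma * slope:
            gamma *= 0.5
        x = [xi + gamma * di for xi, di in zip(x, d)]
        xs.append(x)
        eps.append(-2.0 * beta * gamma * slope)
    return xs, eps


# Minimizers on C = [0, 3]^2 form the segment x[0] + x[1] = 2.
f = lambda x: 0.5 * (x[0] + x[1] - 2.0) ** 2
grad = lambda x: [x[0] + x[1] - 2.0] * 2
project = lambda u: [min(max(ui, 0.0), 3.0) for ui in u]
xs, eps = run_with_fejer_terms(f, grad, project, [3.0, 3.0])

x_bar = [2.0, 0.0]                    # one particular minimizer
dist2 = [sum((xi - bi) ** 2 for xi, bi in zip(x, x_bar)) for x in xs]
fejer_ok = all(dist2[k + 1] <= dist2[k] + eps[k] for k in range(len(eps)))
sum_bound = (2.0 * 1.0 / 0.5) * (f(xs[0]) - f(x_bar))
```

On this instance the iterates stop at the minimizer `[1.0, 1.0]`, the squared distances to `x_bar` satisfy the inequality at every step, and the sum of the recorded terms stays below `sum_bound`.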

4 Option (ii): search along an arc

We sketch in this section the analysis corresponding to option (ii), which we restate next:

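Initialization: Take $x^0 \in C$.

Iterative step: If $x^k$ is stationary, then stop. Otherwise, take

$$x^{k+1} = P_C(x^k - \beta_k \nabla f(x^k)), \tag{32}$$

where $\beta_k$ is given by

$$\beta_k = \bar\beta\, 2^{-\ell(k)}, \tag{33}$$

with

$$\ell(k) = \min\{ j \in \mathbb{Z}_{\ge 0} : f(z^{k,j}) \le f(x^k) + \sigma \nabla f(x^k)^t (z^{k,j} - x^k) \} \tag{34}$$

and

$$z^{k,j} = P_C(x^k - \bar\beta 2^{-j} \nabla f(x^k)). \tag{35}$$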
Results for this variant follow closely those for option (i), developed in the previous two sections. Without assuming convexity of f, the following results hold:

Proposition 5. If $\{x^k\}$ is the sequence generated by (32)-(35), then

i) $\{x^k\} \subset C$.

ii) $\beta_k$ is well defined by (32)-(35).

iii) If $\nabla f(x^k) \ne 0$, then $\langle \nabla f(x^k), P_C(x^k - \beta_k \nabla f(x^k)) - x^k \rangle < 0$.

iv) If Problem (1)-(2) has solutions and $\bar x$ is a cluster point of $\{x^k\}$, then $\bar x$ is stationary for Problem (1)-(2).

Proof. Item (i) follows immediately from (32); for the remaining items see Proposition 2.3.3 and Lemma 2.3.1 in [3].
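The arc-search variant restated at the beginning of this section can be sketched as follows. In contrast with option (i), every trial stepsize requires its own projection. This is an illustrative sketch: the names `projected_gradient_arc_search`, `beta_bar`, `tol` and `max_iter`, the default parameter values and the test problem are assumptions, and stationarity is detected when the accepted trial point coincides with the current iterate.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def projected_gradient_arc_search(
    f: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    x0: Vector,                 # assumed to lie in C
    beta_bar: float = 1.0,      # initial trial stepsize (assumed)
    sigma: float = 0.5,         # Armijo parameter (assumed)
    tol: float = 1e-16,
    max_iter: int = 1000,
) -> Vector:
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        fx = f(x)
        beta = beta_bar
        while True:
            # One projection for each trial stepsize beta_bar * 2**(-j).
            z = project([xi - beta * gi for xi, gi in zip(x, g)])
            d = [zi - xi for zi, xi in zip(z, x)]
            if f(z) <= fx + sigma * dot(g, d):
                break
            beta *= 0.5
        if dot(d, d) <= tol:    # accepted point equals x: x is stationary
            break
        x = z
    return x


# f(x) = 2 * x[0]**2 on the box [-1, 3] x [0, 1]; two halvings are needed
# at the first iteration before the test is satisfied.
f = lambda x: 2.0 * x[0] ** 2
grad = lambda x: [4.0 * x[0], 0.0]
project = lambda u: [min(max(u[0], -1.0), 3.0), min(max(u[1], 0.0), 1.0)]
x_star = projected_gradient_arc_search(f, grad, project, [1.0, 0.5])
```

The trial points move along the projection arc rather than along a segment, so the accepted point is always of the form of a projection and no second stepsize is needed.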

For convex f, we have the following result.

Theorem 2. Assume that Problem (1)-(2) has solutions. Then, either the algorithm given by (32)-(35) stops at some iteration $k$, in which case $x^k$ is a solution of Problem (1)-(2), or it generates an infinite sequence $\{x^k\}$, which converges to a solution $x^*$ of the problem.

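Proof. The result for the case of finite termination follows from the stopping criterion and the convexity of $f$. For the case of an infinite sequence, we observe that the computations in the proof of Theorem 1 up to (26) do not use the specific form of the $\beta_k$'s and the $\gamma_k$'s, so that they hold for the sequence under consideration, where now $\gamma_k = 1$ for all $k$ and $\beta_k$ is given by (33)-(35). Thus, for every solution $\bar x$ of Problem (1)-(2), we have

$$\|x^{k+1} - \bar x\|^2 \le \|x^k - \bar x\|^2 + \epsilon_k, \tag{36}$$

with

$$\epsilon_k = -2 \beta_k \langle \nabla f(x^k), x^{k+1} - x^k \rangle. \tag{37}$$

Note that $\epsilon_k > 0$ for all $k$ by Proposition 5(iii). We prove next that $\{\epsilon_k\}$ is summable.

In view of (32)-(35) (particularly the criterion of the arc-search), we have

$$-\sigma \langle \nabla f(x^k), x^{k+1} - x^k \rangle \le f(x^k) - f(x^{k+1}). \tag{38}$$

Combining (37) and (38), and using then (33),

$$\epsilon_k \le \frac{2\beta_k}{\sigma} [f(x^k) - f(x^{k+1})] \le \frac{2\bar\beta}{\sigma} [f(x^k) - f(x^{k+1})]. \tag{39}$$

By (39), $\sum_{k=0}^{\infty} \epsilon_k \le \frac{2\bar\beta}{\sigma} [f(x^0) - f(\bar x)] < \infty$. In view of (36) it follows, as in Theorem 1, that $\{x^k\}$ is quasi-Fejér convergent to the solution set, and then Proposition 1(i) implies that $\{x^k\}$ is bounded, so that it has cluster points. By Proposition 5(iv) and convexity of $f$, all of them solve Problem (1)-(2). Finally, Proposition 1(ii) implies that the whole sequence $\{x^k\}$ converges to a solution.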

5 Final remarks

1. The purpose of this paper is theoretical, and it consists of determining the convergence properties of the projected gradient method in the case of a convex objective. We make no claims whatsoever on the advantages and/or drawbacks of this algorithm vis-à-vis others.

2. Despite the previous remark, we mention that some variants of the projected gradient method have proved to be quite successful from a computational point of view, particularly the spectral projected gradient method (SPG); see e.g. [5], [6], [4]. In this method $\beta_k$ is taken as a safeguarded spectral parameter, with the following meaning. Let

$$\eta_k = \frac{\langle x^k - x^{k-1}, \nabla f(x^k) - \nabla f(x^{k-1}) \rangle}{\|x^k - x^{k-1}\|^2}. \tag{40}$$

We mention that when $f$ is twice differentiable $\eta_k$ is the Rayleigh quotient associated with the averaged Hessian matrix $\int_0^1 \nabla^2 f(t x^k + (1 - t) x^{k-1})\,dt$, thus justifying the denomination of spectral parameter given to $\eta_k$. Then $\beta_k$ is taken as follows: $\beta_0$ is exogenously given; for $k \ge 1$, $\beta_k = \beta_{k-1}$ if $\langle x^k - x^{k-1}, \nabla f(x^k) - \nabla f(x^{k-1}) \rangle \le 0$. Otherwise, $\beta_k$ is taken as the median between $\tilde\beta$, $\eta_k$ and $\hat\beta$ ($\tilde\beta$ and $\hat\beta$ act as the "safeguards" for the spectral parameter). Since our strategy (i) encompasses such choice of $\beta_k$, the result of Theorem 1 holds for this variant. On the other hand, SPG as presented in [5], [6], includes another feature, namely a backtracking procedure generating a possibly nonmonotone sequence of functional values $\{f(x^k)\}$. In fact, (12) is replaced by

$$\ell(k) = \min\{ j \in \mathbb{Z}_{\ge 0} : f(x^k + 2^{-j}(z^k - x^k)) \le \psi_k + \sigma 2^{-j} \nabla f(x^k)^t (z^k - x^k) \}, \tag{41}$$

with $\psi_k = \max_{0 \le j \le m} f(x^{k-j})$ for some fixed $m$. The proof of Theorem 1 does not work for this nonmonotone Armijo search: one gets $\epsilon_k \le \frac{2\hat\beta}{\sigma} [\max_{0 \le j \le m} f(x^{k-j}) - f(x^{k+1})]$, but the right hand side of this inequality does not seem to be summable, as required for the application of the quasi-Fejér convergence result. The issue of the validity of Theorem 1 for SPG with this nonmonotone search remains an open problem. We end this remark by mentioning that a variant of SPG with a search along the arc, similar to our option (ii), has also been developed in [5]. Our comments above apply to this variant in relation with Theorem 2.
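The effect of the nonmonotone reference value can be seen in a small sketch. The function below returns the first exponent $j$ for which the trial point $x + 2^{-j}(z - x)$ passes the test with the maximum of the last $m + 1$ recorded function values in place of $f(x)$; with $m = 0$ it is the monotone test. The name `nonmonotone_armijo_exponent`, the use of fewer than $m + 1$ values when the history is shorter, the cap `max_halvings` and the sample data are assumptions made for the illustration, not the SPG implementation of [5], [6].

```python
from typing import Callable, List

Vector = List[float]


def nonmonotone_armijo_exponent(
    f: Callable[[Vector], float],
    x: Vector,
    z: Vector,
    grad_x: Vector,
    f_history: List[float],   # f at past iterates, last entry is f(x)
    m: int = 10,
    sigma: float = 0.5,
    max_halvings: int = 60,   # safety cap (assumed)
) -> int:
    """Smallest j with f(x + 2**-j (z - x)) <= psi + sigma * 2**-j * grad^t (z - x)."""
    psi = max(f_history[-(m + 1):])   # max over the last m + 1 values available
    d = [zi - xi for zi, xi in zip(z, x)]
    slope = sum(gi * di for gi, di in zip(grad_x, d))
    for j in range(max_halvings + 1):
        step = 2.0 ** (-j)
        trial = [xi + step * di for xi, di in zip(x, d)]
        if f(trial) <= psi + sigma * step * slope:
            return j
    raise RuntimeError("no acceptable stepsize found within max_halvings")


# One-dimensional sample data: f(x) = x**2 at x = 1 with trial point z = -3.
f = lambda x: x[0] ** 2
history = [10.0, 1.0]                 # an earlier, larger value and f(x) = 1
j_monotone = nonmonotone_armijo_exponent(f, [1.0], [-3.0], [2.0], history, m=0)
j_nonmonotone = nonmonotone_armijo_exponent(f, [1.0], [-3.0], [2.0], history, m=1)
```

With the larger reference value the search accepts a longer step (exponent 1 instead of 2), and the accepted point has the same function value as the current one, so the sequence of function values need not decrease.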

3. When $\nabla f$ is Lipschitz continuous in $C$ with known Lipschitz constant $L$, it is well known that the Armijo search can be avoided, by taking, as $\gamma_k$ in option (i) or $\beta_k$ in option (ii), a constant $q \in (0, 2/L)$, without affecting the convergence properties for the nonconvex case (i.e. Propositions 2-5). It is not difficult to verify that in the convex case Theorems 1 and 2 also remain valid for this choice of the stepsizes (in fact, the proofs are indeed simpler).
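In the form of option (ii), the constant stepsize variant reduces to the iteration $x^{k+1} = P_C(x^k - q \nabla f(x^k))$ with no inner search. A minimal sketch follows; the function name `projected_gradient_constant_step`, the stopping rule and the test problem, whose gradient has Lipschitz constant $L = 1$, are assumptions made for the example.

```python
from typing import Callable, List

Vector = List[float]


def projected_gradient_constant_step(
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    x0: Vector,
    step: float,              # constant stepsize, assumed to lie in (0, 2/L)
    tol: float = 1e-20,       # stop when the squared displacement is below tol
    max_iter: int = 10000,
) -> Vector:
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        z = project([xi - step * gi for xi, gi in zip(x, g)])
        if sum((zi - xi) ** 2 for zi, xi in zip(z, x)) <= tol:
            return z
        x = z
    return x


# f(x) = 0.5 * (x[0] - 1)**2 has L = 1, so any step in (0, 2) is admissible.
grad = lambda x: [x[0] - 1.0, 0.0]
project = lambda u: [min(max(u[0], 0.0), 4.0), min(max(u[1], -5.0), 5.0)]
x_star = projected_gradient_constant_step(grad, project, [4.0, -3.0], step=1.5)
```

With `step=1.5` the first coordinate overshoots the minimizer and oscillates around it with halving amplitude, which illustrates that stepsizes above $1/L$ are admissible as long as they stay below $2/L$.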

4. Unpublished results related to those in this paper were obtained by B.F. Svaiter in [16].

REFERENCES

[1] Alber, Ya.I., Iusem, A.N. and Solodov, M.V., On the projected subgradient method for nonsmooth convex optimization in a Hilbert space, Mathematical Programming, 81 (1998), 23-37.

[2] Bereznyev, V.A., Karmanov, V.G. and Tretyakov, A.A., The stabilizing properties of the gradient method, USSR Computational Mathematics and Mathematical Physics, 26 (1986), 84-85.

[3] Bertsekas, D., Nonlinear Programming, Athena Scientific, Belmont (1995).

[4] Birgin, E.G. and Evtushenko, Y.G., Automatic differentiation and spectral projected gradient methods for optimal control problems, Optimization Methods and Software, 10 (1998), 125-146.

[5] Birgin, E.G., Martínez, J.M. and Raydan, M., Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal on Optimization, 10 (2000), 1196-1211.

[6] Birgin, E.G., Martínez, J.M. and Raydan, M., SPG: software for convex constrained optimization (to be published in ACM Transactions on Mathematical Software).

[7] Burachik, R., Graña Drummond, L.M., Iusem, A.N. and Svaiter, B.F., Full convergence of the steepest descent method with inexact line searches, Optimization, 32 (1995), 137-146.

[8] Correa, R. and Lemaréchal, C., Convergence of some algorithms for convex minimization, Mathematical Programming, 62 (1993), 261-275.

[9] Gafni, E.N. and Bertsekas, D., Convergence of a gradient projection method, Technical Report LIDS-P-1201, Laboratory for Information and Decision Systems, M.I.T. (1982).

[10] Golstein, E. and Tretyakov, N., Modified Lagrange Functions. Moscow (1989).

[11] Hiriart Urruty, J.-B. and Lemaréchal, C., Convex Analysis and Minimization Algorithms, Springer, Berlin (1993).

[11] Iusem, A.N. and Svaiter, B.F., A proximal regularization of the steepest descent method, RAIRO, Recherche Opérationelle, 29, 2, (1995), 123-130.

[12] Kiwiel, K. and Murty, K.G., Convergence of the steepest descent method for minimizing quasi convex functions, Journal of Optimization Theory and Applications, 89 (1996), 221-226.

[13] Nesterov, Y.E., Effective Methods in Nonlinear Programming, Moscow (1989).

[14] Polyak, B.T., Introduction to Optimization, Optimization Software, New York (1987).

[15] Svaiter, B.F., Projected gradient with Armijo search. Unpublished manuscript.

[16] Zangwill, W.I., Nonlinear Programming: a Unified Approach, Prentice Hall, New Jersey (1969).