
COMPLEXITY OF FIRST-ORDER METHODS FOR DIFFERENTIABLE CONVEX OPTIMIZATION

Abstract

This is a short tutorial on complexity studies for differentiable convex optimization. A complexity study is made for a class of problems, an "oracle" that obtains information about the problem at a given point, and a stopping rule for algorithms. These three items compose a scheme, for which we study the performance of algorithms and problem complexity. Our problem classes will be quadratic minimization and convex minimization in ℝn. The oracle will always be first order. We study the performance of steepest descent and Krylov space methods for quadratic function minimization, and Nesterov's approach to the minimization of differentiable convex functions.

Keywords: first-order methods; complexity analysis; differentiable convex optimization.


1 INTRODUCTION

Due to the huge increase in the size of problems tractable with modern computers, the study of problem complexity and algorithm performance became essential. This was recognized very early by computer scientists and mathematicians working on combinatorial problems, and has recently become a central issue in continuous optimization. Complexity studies for these problems started in the former Soviet Union, and the main results are described in the book by Nemirovski & Yudin [14].

The special case of Linear Programming, which will not be tackled in this paper, started with Khachiyan [10], also in Russia, in 1978, and had an explosive expansion in the West with the creation of interior point methods in the 80's and 90's.

This paper starts with a brief introduction to the main concepts in the study of algorithm performance and complexity, following Nemirovski and Yudin, and then applies them to the study of the convex optimization problem

minimize ƒ(x), x ∈ ℝn,    (1)

where ƒ: ℝn → ℝ is a continuously differentiable function.

In Section 2 we introduce the general framework for the study of algorithm performance and problem complexity and present a simple example.

We dedicate Section 3 to the study of the special case of convex quadratic functions, because they are the simplest non-linear functions: if a method is inefficient for quadratic problems, it will certainly be inefficient for more general problems; if it is efficient, it has a good chance of being adaptable to general differentiable convex problems, because near an optimal solution the quadratic approximation of the function tends to be accurate. We study the performance of steepest descent and of Krylov space methods.

Section 4 will describe and analyze a basic method for unconstrained convex optimization devised by Nesterov [15], with “accelerated steepest descent” iterations. This method has become very popular, and the presentation and complexity proofs will be based on our paper [7].

Finally, we comment in Section 5 on improvements of this basic algorithm, presenting without proofs its extension to problems restricted to “simple sets” (sets onto which projecting a vector is easy).

2 SCHEMES, PERFORMANCE AND COMPLEXITY

A complexity study is associated with a scheme (Σ, 𝒪, τε), as follows.

  1. Σ is a class of problems.

    Examples: linear programming problems, unconstrained minimization of convex functions.

  2. 𝒪 is an oracle associated with Σ.

    The oracle is responsible for accessing the available information 𝒪(x) about a given problem in Σ at a given point x.

    Examples:

    • 𝒪(x) = {ƒ(x)} (zero order),

    • 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

  3. τε is a stopping rule, associated with a precision ε > 0.

    Examples: for the minimization problem (1), τε defined by

    ƒ (x) − ƒ* ≤ ε,

    ║∇ƒ (x)║ ≤ ε

    ║x − x*║ ≤ ε,

    where x* is a solution of the problem and ƒ* = ƒ(x*).

An instance in a scheme would be, for example: solve a convex minimization problem (Σ) using only first order information (𝒪) to a precision ║∇ƒ(xk)║ ≤ 10−6 (the rule τε with ε = 10−6).

Algorithms. The general problem associated with a scheme (Σ, 𝒪, τε) is to find a point satisfying τε, using as information only consultations to the oracle and any mathematical procedures that do not depend on the particular problem being solved.

The algorithms studied in this paper follow the black box model described now. An algorithm starts with a point x0 and computes a sequence (xk)k∈ℕ. Each iteration k accesses the oracle at xk and uses the information obtained by the oracle at x0, x1, ..., xk to compute a new point xk+1. It stops if xk+1 satisfies τε.

Algorithm 1. Black box model for (Σ, 𝒪, τε)

Data: x0, ε > 0, k = 0, I−1 = ∅

WHILE xk does not satisfy τε

  • Oracle at xk: compute 𝒪(xk)

  • Update information set: Ik = Ik−1 ∪ 𝒪(xk)

  • Apply rules of the method to Ik: find xk+1

  • k = k + 1.

Algorithm performance for (Σ, 𝒪, τε) (worst case performance)

Consider an algorithm for (Σ, 𝒪, τε).

  • The iteration performance is a bound on the number of iterations (oracle calls) needed to solve any problem in the scheme (Σ, 𝒪, τε). This bound will depend on ε and on parameters associated with each specific problem (initial point, space dimension, condition number, Lipschitz constants, etc.). In other words, it is the number of iterations needed to solve the “worst possible” problem in (Σ, 𝒪, τε).

  • The numerical performance is a bound on the number of arithmetical operations needed in the worst case. The numerical performance is usually proportional to the iteration performance for each given algorithm. In this paper we only study the iteration performance of algorithms.

  • The complexity of the scheme (Σ, 𝒪, τε) is the performance of the best possible algorithm for the scheme. It is frequently unknown, and finding it for different schemes is the main purpose of complexity studies.

The performance of any algorithm for a scheme gives an upper bound to its complexity. A lower bound to the complexity may sometimes be found by constructing an example of a (difficult) problem in Σ and deriving a lower bound valid for any algorithm based on the same oracle and stopping rule. This will be the case at the end of Section 3.

If the performance of an algorithm for a scheme matches the complexity of the scheme, or a fixed multiple of it, the algorithm is called optimal for the scheme.

First-order algorithms: In most of this paper we study first-order algorithms for solving the problem

minimize ƒ(x), x ∈ ℝn,

where ƒ: ℝn → ℝ is a differentiable function. The problem classes will be the special cases of quadratic and convex functions.

A first-order algorithm starts from a given point x0 and constructs a sequence (xk) using the oracle 𝒪(xk) = {ƒ(xk), ∇ƒ(xk)}. Each step computes a point xk+1 using the information set Ik = ∪kj=0 𝒪(xj), so that

xk+1 ∈ x0 + span{∇ƒ(x0), ∇ƒ(x1), ..., ∇ƒ(xk)},

where span(S) stands for the subspace generated by S.

In particular, the most well-known minimization algorithm is the steepest descent method, in which

xk+1 = xk − λk∇ƒ(xk),

where λk is a steplength. Each different choice of steplength (the rules of the method) defines a different steepest descent algorithm. This will be studied ahead in this paper.

Remark: In our algorithm model we used a single oracle, but there may be more than one. Typically, 𝒪0(x) = {ƒ(x)}, 𝒪1(x) = {∇ƒ(x)}, and 𝒪0 may be called more than once in each iteration. This is the case when line searches are used. The performance evaluation must then be adapted.

The notation O(·). Given two real positive functions g(·) and h(·), we say that g = O(h) if there exists some constant K > 0 such that g(·) ≤ Kh(·). This notation is very useful in complexity studies. For example, we shall prove that a certain steepest descent algorithm stops for k = ⌈(C/4) ln(1/ε)⌉, where C is a parameter that identifies the problem in Σ and ε is the precision. We may write k = O(C log(1/ε)), ignoring the coefficient 1/4.

2.1 Example: root of a continuous function

Here we present a simple example to illustrate how a complexity analysis works. Consider the following example of (Σ, 𝒪, τε), given by:

  • Σ: Given a continuous function ƒ : [0, 1] → ℝ with ƒ(0) ≤ 0 and ƒ(1) ≥ 0, find x ∈ [0, 1] such that ƒ(x) = 0.

  • 𝒪: For x ∈ [0, 1], 𝒪(x) = {ƒ(x)}.

  • τε: For ε > 0, τε is satisfied if |x − x*| ≤ ε for some root x*.

Remark: The stopping rule above is obviously not computable. We shall use a practical rule that implies τε.

Algorithm 2. Bisection

Data: ε ∈ (0, 1), a0 = 0, b0 = 1, k = 0

WHILE bk - ak > ε (stopping rule)

  • m = (ak + bk)/2

  • Compute ƒ(m) (oracle)

  • IF ƒ(m) ≤ 0, set ak+1 = m, bk+1 = bk

  • ELSE set bk+1 = m, ak+1 = ak

  • k = k + 1.
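A minimal sketch of Algorithm 2 in Python (the test function and the tolerance are illustrative choices, not part of the scheme):

```python
def bisection(f, eps):
    """Algorithm 2: return a point within eps of a root of f on [0, 1].

    Assumes f continuous with f(0) <= 0 and f(1) >= 0."""
    a, b, k = 0.0, 1.0, 0
    while b - a > eps:           # stopping rule: implies |x - x*| <= eps
        m = (a + b) / 2.0        # rules of the method: bisect the interval
        if f(m) <= 0:            # one oracle call per iteration
            a = m
        else:
            b = m
        k += 1
    return a, k

# Example: f(x) = x**3 - 1/4 has its root at 4**(-1/3) ~ 0.63.
x, k = bisection(lambda x: x**3 - 0.25, 1e-6)
print(x, k)   # k = ceil(log2(1/eps)) = 20 oracle calls
```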

Performance: the following facts are straightforward at all iterations:

  1. ƒ (ak) ≤ 0 and ƒ (bk) ≥ 0 and by the intermediate value theorem, τε is implied by bkak ≤ ε.

  2. bk − ak = 2−k.

If the algorithm does not stop at iteration k, then 2−k > ε, that is, k < log2(1/ε). We conclude that the stopping rule will be satisfied for k = ⌈log2(1/ε)⌉, where ⌈r⌉ is the smallest integer above r. Thus the performance of the scheme above is k = ⌈log2(1/ε)⌉ = O(log(1/ε)).

It is possible to prove that this is the best possible algorithm for this scheme, and hence the complexity of the scheme coincides with this performance.

    Remarks:
  1. Note that the only assumption here was the continuity of ƒ. With stronger assumptions (Lipschitz constants for instance), better algorithms are described in numerical calculus textbooks.

  2. The rules of the method are in the bisection calculation. Only the present oracle information 𝒪(m) is used at step k.

3 MINIMIZATION OF A QUADRATIC FUNCTION: FIRST-ORDER METHODS

Quadratic functions are the simplest nonlinear functions, and so an efficient algorithm for minimizing nonlinear functions must also be efficient in the quadratic case. On the other hand, near a minimizer, a twice differentiable function is usually well approximated by a quadratic function. A quadratic function is defined by

ƒ(x) = cᵀx + ½ xᵀHx,    (2)

where c ∈ ℝn and H is an n × n symmetric matrix. Then for x ∈ ℝn,

∇ƒ(x) = c + Hx,  ∇²ƒ(x) = H.

If x* is a minimizer or a maximizer of ƒ, then

∇ƒ(x*) = c + Hx* = 0,

and hence finding an extremal of ƒ is equivalent to solving the linear system Hx* = −c, one of the most important problems in Mathematics.

The behavior of a quadratic function depends on the eigenvalues of its Hessian H. Since H is symmetric, it is known that H has n real eigenvalues

μ1 ≤ μ2 ≤ ... ≤ μn,

which may be associated with n orthonormal (mutually orthogonal with unit norm) eigenvectors ν1, ν2, ..., νn. There are four cases, represented in Figure 1:

Figure 1
Quadratic functions.
  (i) If μ1 > 0, then H is a positive definite matrix, ƒ is strictly convex and its unique minimizer is the unique solution of Hx = −c.

  (ii) If μ1 < 0, then infx∈ℝn ƒ(x) = −∞, and ƒ(x) → −∞ along the direction ν1.

    Consider now the cases in which there are null eigenvalues. Let them be μ1 = μ2 = ... = μk = 0. Thus H is a positive semi-definite matrix.

  (iii) If cᵀνi = 0 for i = 1, ..., k, then ƒ has a k-dimensional set of minimizers.

  (iv) If cᵀνi < 0 for some i = 1, ..., k, then ƒ is unbounded below.
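These four cases can be checked numerically from an eigendecomposition of H. A small sketch (the tolerance tol is an arbitrary choice):

```python
import numpy as np

def classify_quadratic(H, c, tol=1e-12):
    """Classify f(x) = c'x + x'Hx/2 according to the four cases above."""
    mu, V = np.linalg.eigh(H)              # eigenvalues in increasing order
    if mu[0] > tol:
        return "(i) strictly convex: the minimizer solves Hx = -c"
    if mu[0] < -tol:
        return "(ii) unbounded below along the eigenvector v1"
    null = V[:, np.abs(mu) <= tol]         # eigenvectors of null eigenvalues
    if np.all(np.abs(null.T @ c) <= tol):
        return "(iii) a set of minimizers (dimension of the null space)"
    return "(iv) unbounded below along a null eigenvector"

print(classify_quadratic(np.diag([1.0, 0.0]), np.array([0.0, 1.0])))  # (iv)
```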

In this section, we study the following scheme:

  • Σ: the class of quadratic functions that are bounded below (cases (i) and (iii) above). A function in Σ has at least one minimizer x*. Without loss of generality, the study of algorithmic properties (not the implementation) may assume that x* = 0, and so the function becomes

ƒ(x) = ƒ* + ½ xᵀHx.    (3)
𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

τε: Given an initial point x0 ∈ ℝn, two rules will be used in the analysis:

  • Absolute error bound: ƒ(x) − ƒ* ≤ ε;

  • Relative error bound: ƒ(x) − ƒ* ≤ ε(ƒ(x0) − ƒ*).

These rules are not implementable, because they require the knowledge of x*, but they are very useful in the performance analysis.

Simplification: As we explained above, we assume that ƒ(·) has a minimizer x* = 0. After performing the analysis with this simplification, we substitute x − x* for x. A further simplification may be done by diagonalizing H, also without loss of generality.

3.1 Steepest descent algorithms

In the first half of the 19th century, Cauchy found that the gradient ∇ ƒ(x) of a function ƒ is the direction of maximum ascent of ƒ from x, and stated the gradient method. It is the most basic of all optimization algorithms, and its performance is still an active research topic.

Algorithm 3. Steepest descent algorithm (model)

Data: x0 ∈ ℝn, ε > 0, k = 0

WHILE xk does not satisfy τε

  • Choose a steplength λk > 0

  • xk+1 = xk − λk∇ ƒ(xk) = (I − λkH)xk

  • k = k + 1.

Steplengths: Each different method for choosing the steplengths λk defines a different steepest descent algorithm. Let us describe the two best known choices for the steplengths:

• The Cauchy step, or exact step,

λk = argminλ>0 ƒ(xk − λg),

the unique minimizer of ƒ along the direction −g, with g = ∇ƒ(xk). The steplength is computed by setting ∇ƒ(xk − λkg) ⊥ g and simplifying, which results in

λk = gᵀg / gᵀHg.    (4)
• The short step: λk < 2/μn, a fixed steplength.
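Both steplength rules fit the model Algorithm 3. A sketch for the quadratic case with x* = 0 (the test matrix, initial point and tolerance are illustrative choices):

```python
import numpy as np

def steepest_descent(H, x0, eps, rule="cauchy"):
    """Algorithm 3 for f(x) = x'Hx/2 (so x* = 0 and f* = 0).

    Stops by the relative error bound (5): f(x) <= eps * f(x0)."""
    f = lambda x: 0.5 * x @ H @ x
    mu_n = np.linalg.eigvalsh(H)[-1]        # largest eigenvalue
    x, k = x0.astype(float), 0
    while f(x) > eps * f(x0):
        g = H @ x                           # gradient at x
        lam = (g @ g) / (g @ H @ g) if rule == "cauchy" else 1.0 / mu_n
        x = x - lam * g
        k += 1
    return x, k

H = np.diag([1.0, 10.0])                    # condition number C = 10
x0 = np.array([10.0, 1.0])
for rule in ("cauchy", "short"):            # bounds: (C/4)ln(1/eps), (C/2)ln(1/eps)
    print(rule, steepest_descent(H, x0, 1e-6, rule)[1])
```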

Complexity results

Now we study the iteration performance of the steepest descent methods with these two steplength choices for minimizing a strictly convex quadratic function (case (i)). Given ε > 0 and x0 ∈ ℝn, we consider that τε is satisfied at a given x ∈ ℝn if

ƒ(x) − ƒ* ≤ ε(ƒ(x0) − ƒ*).    (5)

In both cases the algorithm stops in O(C log(1/ε)) iterations, where C = μn/μ1. At this moment the following question is open: find a steepest descent algorithm (by a different choice of λk) with performance O(√C log(1/ε)). This performance is achieved in practice for “normal problems” (but not for particular worst case problems) by Barzilai-Borwein and spectral methods, described in [3].

Theorem 1. Let C = μn/μ1 ≥ 1 be the condition number of H. The iteration performance of the steepest descent method with Cauchy steplength for minimizing ƒ, starting at x0 ∈ ℝn and with stopping criterion (5), is given by

k = ⌈(C/4) ln(1/ε)⌉ = O(C log(1/ε)).

Proof. We begin by stating a classical result for the steepest descent step, which is based on the Kantorovich inequality and is proved for instance in [12, p. 238]:

ƒ(xk+1) − ƒ* ≤ ((C − 1)/(C + 1))² (ƒ(xk) − ƒ*).

Using this recursively, we obtain

ƒ(xk) − ƒ* ≤ ((C − 1)/(C + 1))^{2k} (ƒ(x0) − ƒ*),

which implies that (5) is satisfied whenever ((C − 1)/(C + 1))^{2k} ≤ ε.

It is known that t ∈ [1, +∞) ↦ ((t − 1)/(t + 1))^t is an increasing function and that, for t > 1,

((t − 1)/(t + 1))^t < e^{−2}.

Consequently,

((C − 1)/(C + 1))^{2k} ≤ e^{−4k/C}.

If τε is not satisfied at an iteration k, then by (5), ((C − 1)/(C + 1))^{2k} > ε, and then e^{−4k/C} > ε,

which implies k < (C/4) ln(1/ε), completing the proof.

Example. In this example we show that the bound obtained in Theorem 1 is sharp, i.e., it cannot be improved. Take the following problem in ℝ2:

minimize ƒ(x) = ½ xᵀHx,  with H = diag(1, C),

so that μ1 = 1 and μn = C. Assume that the initial point of some iteration has the shape x = z(C, 1)ᵀ, for some z ∈ ℝ. Then

∇ƒ(x) = zC(1, 1)ᵀ.

Computing the steplength λ by (4), we obtain λ = 2/(C + 1), and then the next iterate will be

x⁺ = x − λ∇ƒ(x) = ((C − 1)/(C + 1)) z(C, −1)ᵀ.

It follows that

ƒ(x⁺) − ƒ* = ((C − 1)/(C + 1))² (ƒ(x) − ƒ*),

and this will be repeated at all iterations, with the worst possible performance as in Theorem 1.
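This zig-zag behavior is easy to check numerically; a short sketch (C = 10 is an arbitrary choice):

```python
import numpy as np

C = 10.0
H = np.diag([1.0, C])                # mu_1 = 1, mu_n = C
x = np.array([C, 1.0])               # initial point of shape z(C, 1)', z = 1
f = lambda x: 0.5 * x @ H @ x
for k in range(5):
    g = H @ x
    lam = (g @ g) / (g @ H @ g)      # Cauchy step (4); equals 2/(C+1) here
    x_new = x - lam * g
    print(lam, f(x_new) / f(x))      # ratio is ((C-1)/(C+1))**2 ~ 0.669
    x = x_new
```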

Theorem 2. Let C = μn/μ1 ≥ 1 be the condition number of H. The iteration performance of the steepest descent method with short steps λk = 1/μn, for minimizing ƒ starting at x0 ∈ ℝn and with stopping criterion (5), is given by

k = ⌈(C/2) ln(1/ε)⌉ = O(C log(1/ε)).

Proof. A simplification in the analysis can be made by diagonalizing the matrix H by using the orthonormal matrix whose columns are the eigenvectors of H. Thus, we can consider, without loss of generality, that H = diag(μ1, ..., μn). Given x0 ∈ ℝn, by the steepest descent algorithm with short steps,

xk+1 = (I − H/μn) xk.

Thus, for all i = 1, ..., n,

xᵢᵏ = (1 − μi/μn)ᵏ xᵢ⁰,  and so  |xᵢᵏ| ≤ (1 − 1/C)ᵏ |xᵢ⁰|.

Consequently, by the definition of ƒ,

ƒ(xk) − ƒ* ≤ (1 − 1/C)^{2k} (ƒ(x0) − ƒ*).

Proceeding like in the proof of Theorem 1, we obtain

(1 − 1/C)^{2k} ≤ e^{−2k/C}.

So, if τε is not satisfied at an iteration k, then k < (C/2) ln(1/ε), completing the proof.

    Remarks:
  • When the short steps 1/μn are used, the result is that for i = 1, ..., n, |xᵢᵏ| ≤ √ε |xᵢ⁰| for k ≥ (C/2) ln(1/ε). Hence not only ƒ(xk) ≤ ε ƒ(x0), but also ║xk║ ≤ √ε ║x0║ and ║∇ƒ(xk)║ ≤ √ε ║∇ƒ(x0)║.

  • The diagonalization of H can be made without loss of generality for the performance analysis, as we did in the proof of Theorem 2. This leads to an interesting observation about the constant C, for the case in which there are null eigenvalues (case (iii)). Assuming that μ1 = μ2 = ... = μp = 0, we see that for i = 1, ..., p, (∇ƒ(x))i = 0, and these variables xi remain constant forever, having no influence on the performance. The bounds in Theorems 1 and 2 remain valid for C = μn/μp+1.

3.2 Krylov methods

Krylov space methods are the best possible algorithms for minimizing a quadratic function using only first-order information. Let us describe the geometry of a Krylov space method for the quadratic (2).

• Starting at a point x0, define the line V1 = {x0 + θ∇ƒ(x0) | θ ∈ ℝ} and

x1 = argmin {ƒ(x) | x ∈ V1}.

This is actually the Cauchy step. We may write V1 = x0 + span{Hx0}.

• Second step: take the affine space defined by ∇ƒ(x0) and ∇ƒ(x1), V2 = x0 + span{∇ƒ(x0), ∇ƒ(x1)}, and note that since ∇ƒ(x1) = H(x0 + θ∇ƒ(x0)) = Hx0 + θH²x0,

V2 = x0 + span{Hx0, H²x0},

and the next iterate will be

x2 = argmin {ƒ(x) | x ∈ V2}.

This is a two-dimensional problem.

• k-th step: adding ∇ƒ(xk−1) to the set of gradients, we construct the set

Vk = x0 + span{Hx0, H²x0, ..., Hᵏx0},

and the next point will be

xk = argmin {ƒ(x) | x ∈ Vk},    (Pk)

a k-dimensional problem.

Since ∇ ƒ(xk) ⊥ Vk because of the minimization, either ∇ ƒ(xk) = 0 and the problem is solved, or Vk+1 is (k+1)−dimensional. It is then clear that xn is an optimal solution because Vn = ℝn. This gives us a first performance bound kn for the Krylov space method. This bound is bad for high-dimensional spaces.

Main question: how to solve (Pk). Without proof (see for instance [15, 20]), it is known that the directions (xk − xk−1) are conjugate, and any conjugate direction algorithm like Fletcher-Reeves [4, 18] solves (Pk) at each iteration with about the same work as in the steepest descent method. From now on we do a performance analysis of the Krylov space method, with stopping criterion

ƒ(xk) − ƒ* ≤ ε.

A result for the relative error bound will also be discussed in the end of the section. The analysis is quite technical and will result in (16).
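For small problems the iterates (6) can be computed directly, by minimizing ƒ over an explicit basis of the Krylov space; in exact arithmetic, conjugate gradient methods produce the same iterates at a far lower cost. A dense, illustration-only sketch with random test data:

```python
import numpy as np

def krylov_iterates(H, x0, kmax):
    """xk = argmin { f(x) : x in x0 + span{H x0, ..., H^k x0} }, f(x) = x'Hx/2."""
    iterates, basis, v = [x0.copy()], [], x0.copy()
    for _ in range(kmax):
        v = H @ v                            # next direction H^{k+1} x0
        basis.append(v.copy())
        B = np.column_stack(basis)           # write x = x0 + B a
        a = np.linalg.solve(B.T @ H @ B, -B.T @ (H @ x0))
        iterates.append(x0 + B @ a)
    return iterates

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + np.eye(5)                      # positive definite Hessian
x0 = rng.standard_normal(5)
for k, x in enumerate(krylov_iterates(H, x0, 5)):
    print(k, 0.5 * x @ H @ x)                # f(xk) vanishes at k = n = 5
```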

Definition 1. Given x0 ∈ ℝn and k ∈ ℕ, define the k-th Krylov space by

Kk = span{Hx0, H²x0, ..., Hᵏx0}.

Consider Vk = x0 + Kk and define the sequence (xk) by

xk = argmin {ƒ(x) | x ∈ Vk}.    (6)

Let Pk be the set of polynomials p: ℝ → ℝ of degree k such that p(0) = 1, i.e.,

Pk = {p | p(t) = 1 + a1t + a2t² + ··· + aktᵏ, a1, ..., ak ∈ ℝ}.    (7)
From now on we deal with matrix polynomials, setting t = H.

Lemma 1. A point x ∈ Vk if, and only if, x = p(H)x0 for some polynomial p ∈ Pk. Furthermore,

ƒ(x) = ½ x0ᵀ H p(H)² x0.    (8)

Proof. A point x ∈ Vk if, and only if,

x = x0 + a1Hx0 + a2H²x0 + ··· + akHᵏx0 = p(H)x0,

where p ∈ Pk. Furthermore,

ƒ(x) = ½ (p(H)x0)ᵀ H (p(H)x0) = ½ x0ᵀ (p(H))ᵀ H p(H) x0.

As H is symmetric, (p(H))ᵀH = Hp(H), completing the proof.

Lemma 2. For any polynomial p ∈ Pk,

ƒ(xk) ≤ ½ x0ᵀ H p(H)² x0.

Proof. Consider an arbitrary polynomial p ∈ Pk. From Lemma 1, the point x = p(H)x0 belongs to Vk. As xk minimizes ƒ in Vk, we have ƒ(xk) ≤ ƒ(x). Using (8) we complete the proof.

Lemma 3. Let A ∈ ℝn×n be a symmetric matrix with eigenvalues λ1, λ2, ..., λn. If q: ℝ → ℝ is a polynomial, then q(λ1), q(λ2), ..., q(λn) are the eigenvalues of q(A).

Proof. As A is a symmetric matrix, there exists an orthogonal matrix P such that A = PDPᵀ, with D = diag(λ1, λ2, ..., λn). If q(t) = a0 + a1t + ··· + aktᵏ, then

q(A) = a0I + a1PDPᵀ + ··· + ak(PDPᵀ)ᵏ = P q(D) Pᵀ.

Note that

q(D) = diag(q(λ1), q(λ2), ..., q(λn)),

which completes the proof.

3.2.1 Chebyshev Polynomials

The Chebyshev polynomials will be needed in the performance analysis of Krylov methods.

Definition 2. The Chebyshev polynomial of degree k, Tk: [−1, 1] → ℝ, is defined by

Tk(t) = cos(k arccos(t)).
The next lemma shows that Tk is, in fact, a polynomial (even though it does not look like one).

Lemma 4. For all t ∈ [−1, 1], T0(t) = 1 and T1(t) = t. Furthermore, for all k ≥ 1,

Tk+1(t) = 2t Tk(t) − Tk−1(t).    (9)

Proof. The first statements follow from the definition. In order to prove the recurrence rule, consider θ: [−1, 1] → [0, π], given by θ(t) = arccos(t). Thus,

cos((k + 1)θ(t)) = cos(kθ(t)) cos(θ(t)) − sin(kθ(t)) sin(θ(t))

and

cos((k − 1)θ(t)) = cos(kθ(t)) cos(θ(t)) + sin(kθ(t)) sin(θ(t)).

Adding these two equalities, Tk+1(t) + Tk−1(t) = 2 cos(kθ(t)) cos(θ(t)). But cos(kθ(t)) = Tk(t) and cos(θ(t)) = t. So,

Tk+1(t) = 2t Tk(t) − Tk−1(t),

completing the proof.
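The recurrence (9) gives a convenient way to compute the coefficients of Tk and to check Lemmas 4 and 5 numerically; a sketch (the degree and test grid are arbitrary choices):

```python
import numpy as np

def chebyshev_coeffs(k):
    """Coefficients of T_k, lowest degree first, via T_{k+1} = 2t T_k - T_{k-1}."""
    T = [np.array([1.0]), np.array([0.0, 1.0])]     # T0 = 1, T1 = t
    for _ in range(k - 1):
        shifted = np.append(0.0, T[-1])             # coefficients of t * T_k
        padded = np.append(T[-2], [0.0, 0.0])       # T_{k-1}, padded to degree k+1
        T.append(2.0 * shifted - padded)
    return T[k]

a = chebyshev_coeffs(7)              # T7 = 64t^7 - 112t^5 + 56t^3 - 7t
print(a[-1] == 2.0**6)               # leading coefficient 2^{k-1} (Lemma 5)
print(a[1] == (-1)**3 * 7)           # a1 = (-1)^{(k-1)/2} k for odd k (Lemma 5)
t = np.linspace(-1.0, 1.0, 101)
print(np.allclose(np.polyval(a[::-1], t), np.cos(7 * np.arccos(t))))   # Definition 2
```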

Lemma 5. If Tk(t) = aktᵏ + ··· + a2t² + a1t + a0, then ak = 2^{k−1}. Furthermore,

  1. If k is even, then a0 = (−1)^{k/2} and a2j−1 = 0, for all j = 1, ..., k/2;

  2. If k is odd, then a1 = (−1)^{(k−1)/2} k and a2j = 0, for all j = 0, 1, ..., (k−1)/2.

Proof. We prove by induction. The results are trivial for k = 0 and k = 1. Suppose that the results hold for all natural numbers less than or equal to k. Using the induction hypothesis, consider

Tk(t) = 2^{k−1}tᵏ + ···  and  Tk−1(t) = 2^{k−2}t^{k−1} + ··· .

By Lemma 4,

Tk+1(t) = 2t Tk(t) − Tk−1(t) = 2ᵏt^{k+1} + ··· ,

leading to the first statement. Suppose that (k + 1) is even. Then k is odd and (k − 1) is even. Thus, by the induction hypothesis, Tk has only odd powers of t and Tk−1 has only even powers. In this way, by (9), Tk+1 has only even powers of t. Furthermore, its independent term is

a0(Tk+1) = −a0(Tk−1) = −(−1)^{(k−1)/2} = (−1)^{(k+1)/2}.

On the other hand, if (k + 1) is odd, then k is even and (k − 1) is odd. Again by the induction hypothesis, Tk has only even powers of t and Tk−1 has only odd powers. Thus by (9), Tk+1 has only odd powers of t. Furthermore, its linear term is

a1(Tk+1) = 2a0(Tk) − a1(Tk−1) = 2(−1)^{k/2} + (−1)^{k/2}(k − 1) = (−1)^{k/2}(k + 1),

completing the proof.

The next lemma discusses a relationship between a Chebyshev polynomial of odd degree and polynomials of the set Pk, defined in (7).

Lemma 6. Consider L > 0 and k ∈ ℕ. Then there exists p ∈ Pk such that, for all t ∈ (0, L],

|p(t)| ≤ 1 / ((2k + 1)√(t/L)).

Proof. By Lemma 5, for all t ∈ [−1, 1], we have

T2k+1(t) = (−1)ᵏ(2k + 1) t (1 + c2t² + ··· + c2kt^{2k}),

where the polynomial in parentheses has only even powers of t. So, for all t ∈ [0, L],

|T2k+1(√(t/L))| ≤ 1.

Defining

p(t) = T2k+1(√(t/L)) / ((−1)ᵏ(2k + 1)√(t/L)),

which is a polynomial of degree k in t with p(0) = 1, we complete the proof.

3.2.2 Complexity results

Now we present the main result about the performance of Krylov methods for minimizing a convex quadratic function. This result is based on [19, Thm. 3, p. 170].

We use the matrix norm defined by

║A║ = max{║Ax║ : ║x║ = 1};    (10)

for a symmetric matrix, ║A║ is the largest absolute value of its eigenvalues.
Theorem 3. Let μn be the largest eigenvalue of H and consider the sequence (xk) defined by (6). Then for k ∈ ℕ,

ƒ(xk) − ƒ* ≤ μn║x0 − x*║² / (2(2k + 1)²),    (11)

and ƒ(xk) − ƒ* ≤ ε is satisfied for

k = O(║x0 − x*║ √(μn/ε)).    (12)
Proof. Without loss of generality, assume that x* = 0 (and so ƒ* = 0). By Lemma 2, for every polynomial p ∈ Pk,

ƒ(xk) ≤ ½ x0ᵀ H p(H)² x0 ≤ ½ ║H p(H)²║ ║x0║².

But, from Lemma 3 and (10),

║H p(H)²║ = max{μi p(μi)² | i = 1, ..., n}.

Considering the polynomial p ∈ Pk given in Lemma 6 with L = μn, and using the fact that all eigenvalues of H belong to (0, μn], we have

μi p(μi)² ≤ μi μn / ((2k + 1)² μi) = μn / (2k + 1)²,

proving (11). If τε is not satisfied at an iteration k, then ƒ(xk) > ε and consequently

μn║x0║² / (2(2k + 1)²) > ε,

which implies (12) and completes the proof.

Performance of the method for the relative error bound: A similar analysis for τε given by (5), also using Chebyshev polynomials, can be done using the condition number C. This is done in [14, 23], and the result is

k = O(√C log(1/ε)),

clearly better than the best performance of the steepest descent algorithm for the steplength rules studied above and, for reasonable values of μ1, better than (12).

Complexity bound. The Krylov space method uses at each iteration all the information gathered in the previous steps, and hence it seems to be the best possible algorithm based on first order information. In fact, Nemirovski & Yudin [14] prove that no algorithm using a first order oracle can have a performance more than twice as good as the Krylov space method.

For methods based on accumulated first order information there is a negative result described by Nesterov [15, p. 59]: he constructs a quadratic problem (which he calls “the worst problem in the world”) for which any such method needs at least

k ≥ K ║x0 − x*║ √(μn/ε)    (15)

iterations to reach τε, where K > 0 is a fixed constant.

We conclude that the best performance for a first order method must be between the bounds (12) and (15). So the complexity of the scheme is

k = O(║x0 − x*║ √(μn/ε)).    (16)
4 CONVEX DIFFERENTIABLE FUNCTIONS: THE BASIC ALGORITHM

In this section we study the performance of algorithms for the unconstrained minimization of differentiable convex functions. Quadratic functions are a particular case, and hence the performance bounds for first order algorithms will not be better than those found in the former section.

The role played by μn in quadratic functions will be played by a Lipschitz constant L for the gradient of ƒ (indeed, for a quadratic function the largest eigenvalue is a Lipschitz constant for the gradient), and we shall see that there are optimal algorithms, i.e., algorithms with the performance given by (16) with μn replaced by L. These algorithms were developed by Nesterov [15], and are also studied in our papers [7, 8].

Consider the scheme (Σ, 𝒪, τε) where

  • Σ: the class of minimization problems of a convex continuously differentiable function ƒ with a Lipschitz constant L > 0 for the gradient. It means that for all x, y ∈ ℝn,

║∇ƒ(x) − ∇ƒ(y)║ ≤ L║x − y║.    (17)
  • 𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

  • τε: defined by ƒ(x) − ƒ* ≤ ε, where x* is a solution of the problem and ƒ* = ƒ(x*).

Simple quadratic functions. The following definition will be useful in our development: we shall call “simple” a quadratic function φ: ℝn → ℝ with ∇²φ(x) = γI, γ ∈ ℝ, γ > 0. The following facts are easily proved for such functions (see the sketch after this list):

  • φ(·) has a unique minimizer ν ∈ ℝn (which we shall refer to as the center of the quadratic), and the function can be written as

φ(x) = φ(ν) + (γ/2)║x − ν║².    (18)

  • Given x ∈ ℝn,

ν = x − ∇φ(x)/γ

and

φ(ν) = φ(x) − ║∇φ(x)║² / (2γ).
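A quick numerical check of these identities on random data (the dimension, γ and the constant term are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 2.5
nu = rng.standard_normal(n)                       # center of the quadratic
phi = lambda x: 3.0 + 0.5 * gamma * (x - nu) @ (x - nu)    # phi(nu) = 3
grad = lambda x: gamma * (x - nu)                 # gradient of phi

x = rng.standard_normal(n)
print(np.allclose(nu, x - grad(x) / gamma))       # recovers the center
print(np.isclose(phi(nu), phi(x) - grad(x) @ grad(x) / (2 * gamma)))
```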

4.1 The algorithm

We now state the main algorithm and then study its properties. We include in the statement of the algorithm the definitions of the relevant functions (approximations of ƒ(·) and the simple quadratic defined below).

We begin by summarizing the geometrical construction at an iteration k, represented in Figure 2. The iteration starts with two points xk, νk ∈ ℝn and a simple quadratic function

φk(x) = ƒ(xk) + (γk/2)║x − νk║²,

whose global minimizer is νk.

Figure 2
The mechanics of the algorithm.

A point yk = xk + α(νkxk) is chosen between xk and νk. The choice of α is a central issue, and will be discussed later. All the action is centered on yk, with the following construction:

  • Take a gradient step from yk, generating xk+1.

  • Define a linear approximation of ƒ(·),

ℓ(x) = ƒ(yk) + ∇ƒ(yk)ᵀ(x − yk).
  • Compute a value α ∈ (0, 1), and define φα(x) = αℓ(x) + (1 − α)φk(x), with Hessian γk+1I = ∇²φα(x) = (1 − α)γkI, and let νk+1 be the minimizer of this simple quadratic. The iteration is completed by defining

φk+1(x) = ƒ(xk+1) + (γk+1/2)║x − νk+1║².
Now we state the algorithm.

Algorithm 4.

Data: x0 ∈ ℝn, ν0 = x0, γ0 = L, k = 0

REPEAT

  • Compute αk ∈ (0, 1) such that Lαk² = (1 − αk)γk

  • Set yk = xk + αk(νk − xk)

  • Compute ƒ(yk) and g = ∇ ƒ(yk)

  • Updates

    • xk+1 = ykg/L (steepest descent step)

    • γk+1 = (1 − αk)γk

      • For the analysis define

      • x ↦ φk(x) = ƒ(xk) + (γk/2)║x − νk║²

      • x ↦ ℓ(x) = ƒ(yk) + gᵀ(x − yk)

      • x ↦ u(x) = ƒ(yk) + gᵀ(x − yk) + (L/2)║x − yk║²

      • x ↦ φαk(x) = αkℓ(x) + (1 − αk)φk(x)

    • νk+1 = argmin φαk(·) = νk − (αk/γk+1) g

    • k = k + 1.
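A direct transcription of Algorithm 4 in Python. The test function is an illustrative convex quadratic (so its largest eigenvalue is a valid L and ƒ* = 0); the stopping rule uses the known optimal value, as in the analysis:

```python
import numpy as np

def nesterov(f, grad, L, x0, eps, fstar=0.0):
    """Algorithm 4: Nesterov's accelerated steepest descent."""
    x, v, gam = x0.copy(), x0.copy(), L           # x0, nu_0 = x0, gamma_0 = L
    k = 0
    while f(x) - fstar > eps:                     # stopping rule tau_eps
        # alpha_k in (0,1) solving L*alpha^2 = (1 - alpha)*gam
        alpha = (-gam + np.sqrt(gam**2 + 4.0 * L * gam)) / (2.0 * L)
        y = x + alpha * (v - x)
        g = grad(y)
        x = y - g / L                             # steepest descent step
        gam_next = (1.0 - alpha) * gam
        v = v - (alpha / gam_next) * g            # new center nu_{k+1}
        gam, k = gam_next, k + 1
    return x, k

H = np.diag(np.linspace(1.0, 100.0, 10))          # L = 100, f* = 0 at x* = 0
f = lambda x: 0.5 * x @ H @ x
x, k = nesterov(f, lambda x: H @ x, L=100.0, x0=np.ones(10), eps=1e-6)
print(k)    # within the bound (27): k <= 2 ||x0 - x*|| sqrt(L/eps)
```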

4.1.1 Analysis of the algorithm

The most important procedure in the algorithm is the choice of the parameter αk, which then determines yk at each iteration. The choice of αk is the one devised by Nesterov in [15, Scheme (2.2.6)]. Instead of “discovering” the values for these parameters, we shall simply adopt them and show their properties.

Once yk is determined, two independent actions are taken:

  1. A steepest descent step from yk computes xk+1.

  2. A new simple quadratic is constructed by combining φk(·) and the linear approximation ℓ(·) of ƒ(·) about yk:

φαk(x) = αkℓ(x) + (1 − αk)φk(x).

Our goal will be to prove two facts:

  • At any iteration k, φαk(νk+1) ≥ ƒ(xk+1).

  • For all x ∈ ℝn, φk+1(x) − ƒ(x) ≤ (1 − αk)(φk (x) − ƒ(x)).

From these facts we shall conclude that ƒ(xk) → ƒ* with the same speed as γk → 0, which easily leads to the desired performance result.

The first lemma shows our main finding about the geometry of these points. All the action happens in the two-dimensional space defined by xk, νk, νk+1. Note the beautiful similarity of the triangles in Figure 3.

Figure 3
Geometric properties of the steps.

Lemma 7. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

xk+1 = xk + αk(νk+1 − xk).

Proof. By the algorithm, we know that Lαk² = γk+1, and

xk + αk(νk+1 − xk) = xk + αk(νk − xk) − (αk²/γk+1) g = yk − g/L = xk+1,

completing the proof.

Lemma 8. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

φαk(νk) ≥ ƒ(yk).

Proof. By the definition of φαk,

φαk(νk) = αkℓ(νk) + (1 − αk)φk(νk).

But φk(νk) = ƒ(xk) ≥ ℓ(xk). Using this, the definition of ℓ and the fact that αk ∈ (0, 1), we have

φαk(νk) ≥ αkℓ(νk) + (1 − αk)ℓ(xk) = ƒ(yk) + gᵀ(αk(νk − yk) + (1 − αk)(xk − yk)).    (21)

By the definition of yk in the algorithm, νk − yk = (1 − αk)(νk − xk) and xk − yk = −αk(νk − xk). Substituting this in (21), we complete the proof.

Lemma 9. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

ƒ(xk+1) ≤ u(xk+1) ≤ φαk(νk+1),    (22)

and, for all x ∈ ℝn,

φk+1(x) ≤ φαk(x).    (23)

Proof. The first inequality follows from the Lipschitz condition (17) and the definition of u, which give ƒ(x) ≤ u(x) for all x ∈ ℝn. Since xk+1 and νk+1 are respectively global minimizers of u(·) and φαk(·), we have from (18) that, for all x ∈ ℝn,

u(x) = u(xk+1) + (L/2)║x − xk+1║²  and  φαk(x) = φαk(νk+1) + (γk+1/2)║x − νk+1║².    (24)

As ƒ(yk) = u(yk) and, from the last lemma, u(yk) ≤ φαk(νk), we only need to show that

u(yk) − u(xk+1) = φαk(νk) − φαk(νk+1).

The construction is shown in Fig. 3: since, by Lemma 7, xk+1 = xk + αk(νk+1 − xk),

yk − xk+1 = αk(νk − νk+1).

Using this, (24) and the fact that, by construction, αk² = γk+1/L, we obtain

u(yk) − u(xk+1) = (L/2)║yk − xk+1║² = (Lαk²/2)║νk − νk+1║² = (γk+1/2)║νk − νk+1║² = φαk(νk) − φαk(νk+1),

proving the second inequality of (22).

By construction,

φk+1(x) = ƒ(xk+1) + (γk+1/2)║x − νk+1║².

Comparing to (24) and using the fact that ƒ(xk+1) ≤ φαk(νk+1), we get (23), completing the proof.

Lemma 10. For any x ∈ ℝn and k ∈ ℕ,

φk(x) − ƒ(x) ≤ (γk/γ0)(φ0(x) − ƒ(x)).    (25)

Proof. By the definition of φαk and the fact that ℓ(x) ≤ ƒ(x), for all x ∈ ℝn,

φαk(x) ≤ αkƒ(x) + (1 − αk)φk(x).

Subtracting ƒ(x) from both sides, and using (23) and the definition of γk+1, we have

φk+1(x) − ƒ(x) ≤ (1 − αk)(φk(x) − ƒ(x)) = (γk+1/γk)(φk(x) − ƒ(x)).

Using this recursively, we get the result and complete the proof.

4.1.2 Complexity

The following lemma was proved by Nesterov [15, p. 77] with a different notation.

Lemma 11. Consider the sequence (γk) generated by Algorithm 4, i.e., given γ0 > 0,

γk+1 = (1 − αk)γk,  with αk ∈ (0, 1) such that Lαk² = (1 − αk)γk.

Then, for k ∈ ℕ, γk ≤ 4L/k².

Proof. As αk = √(γk+1/L), the sequence satisfies

γk+1 = (1 − √(γk+1/L)) γk.

Thus, the result follows directly from [7, Lemma 10].

Theorem 4. Consider the sequences generated by Algorithm 4 and assume that x* is an optimal solution. Then for k ∈ ℕ,

ƒ(xk) − ƒ* ≤ 4L║x0 − x*║²/k²,    (26)

and ƒ(xk) − ƒ* ≤ ε is satisfied for

k = ⌈2║x0 − x*║√(L/ε)⌉.    (27)

Proof. From Lemma 10, (25) holds in particular at x*:

φk(x*) − ƒ* ≤ (γk/γ0)(φ0(x*) − ƒ*).

Using the fact that ƒ(xk) = φk(νk) ≤ φk(x*) and the definition of φ0, we get

ƒ(xk) − ƒ* ≤ (γk/γ0)(ƒ(x0) − ƒ* + (γ0/2)║x0 − x*║²).    (28)

Since x* is a minimizer of the convex function ƒ, ∇ƒ(x*) = 0 and (17) give

ƒ(x0) − ƒ* ≤ (L/2)║x0 − x*║².

Applying this and the result of Lemma 11 in (28),

ƒ(xk) − ƒ* ≤ (γk/γ0)((L + γ0)/2)║x0 − x*║² ≤ (4L/k²)((L + γ0)/(2γ0))║x0 − x*║².

As γ0 = L, we have (26). If τε is not satisfied at an iteration k, then ƒ(xk) − ƒ* > ε and consequently

4L║x0 − x*║²/k² > ε,

which implies (27) and completes the proof.

So, the iteration performance of Algorithm 4 is

k = O(║x0 − x*║√(L/ε)),

which corresponds to the complexity (16) obtained for quadratic problems. Hence the algorithm is optimal.

5 CONVEX DIFFERENTIABLE FUNCTIONS: ENHANCED ALGORITHMS

The algorithm presented in the former section may be extended in several ways: the need for a previous knowledge of a Lipschitz constant L may be eliminated, a strong convexity constant akin to the smallest eigenvalue in the quadratic case may be used, and the algorithm may be extended to problems constrained to so-called simple sets. These extensions are treated in our paper [8] and in references therein.

In this section we state the extension of the basic algorithm to problems with simple constraints, without proofs. Consider the scheme (Σ, 𝒪, τε) where

  • Σ: the class of problems

    • minimize ƒ(x)

    • subject to x ∈ Ω,

  • where Ω ⊂ ℝn is a closed convex set and ƒ: ℝn → ℝ is convex and continuously differentiable, with a Lipschitz constant L > 0 for the gradient. We assume that Ω is a “simple” set, in the following sense: given an arbitrary point x ∈ ℝn, an oracle is available to compute the orthogonal projection of x onto Ω,

PΩ(x) = argmin{║x − y║ : y ∈ Ω}.

  • 𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x), PΩ(x)}.

  • τε: defined by ƒ(x) − ƒ* ≤ ε, where x* is a solution of the problem and ƒ* = ƒ(x*).

We now state the basic algorithm for constrained problems, without proofs. We keep in the statement the definition of the functions used in the analysis made in [8].

Algorithm 5.

Data: x0 ∈ Ω, ν0 = x0, γ0 = L, k = 0

REPEAT

  • Compute αk ∈ (0, 1) such that Lαk² = (1 − αk)γk

  • yk = xk + αk(νk − xk)

  • Compute ƒ(yk) and g = ∇ƒ(yk)

  • Updates

    • xk+1 = PΩ(yk − g/L) (projected steepest descent step)

    • γk+1 = (1 − αk)γk

      • For the analysis define

      • x ↦ φk(x) = ƒ(yk) + (γk/2)║x − νk║²

      • x ↦ ℓ(x) = ƒ(yk) + gᵀ(x − yk)

      • x ↦ u(x) = ƒ(yk) + gᵀ(x − yk) + (L/2)║x − yk║²

      • x ↦ φαk(x) = αkℓ(x) + (1 − αk)φk(x)

    • νk+1 = argminx∈Ω φαk(x) = PΩ(νk − (αk/γk+1) g)

    • k = k + 1.
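A sketch of Algorithm 5 for a box constraint, for which the projection oracle is a componentwise clipping (the test problem and the fixed iteration budget are illustrative choices):

```python
import numpy as np

def nesterov_projected(grad, proj, L, x0, kmax):
    """Algorithm 5: accelerated projected gradient on a simple set."""
    x, v, gam = proj(x0), proj(x0), L
    for _ in range(kmax):
        alpha = (-gam + np.sqrt(gam**2 + 4.0 * L * gam)) / (2.0 * L)
        y = x + alpha * (v - x)
        g = grad(y)
        x = proj(y - g / L)                   # projected descent step
        gam_next = (1.0 - alpha) * gam
        v = proj(v - (alpha / gam_next) * g)  # projected center update
        gam = gam_next
    return x

# minimize f(x) = ||x - c||^2 / 2 over the box [0, 1]^n (so L = 1);
# the solution is the projection of c onto the box.
c = np.array([2.0, -1.0, 0.5])
proj = lambda x: np.clip(x, 0.0, 1.0)
x = nesterov_projected(lambda x: x - c, proj, L=1.0, x0=np.zeros(3), kmax=200)
print(x)    # approaches (1, 0, 0.5)
```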

Consider the sequences generated by Algorithm 5. Then, as proved in [8, Thm. 2.6], at any iteration k before stopping,

ƒ(xk) − ƒ* ≤ (4/k²)(ƒ(x0) − ƒ* + (L/2)║x0 − x*║²),

and hence τε is satisfied for

k = ⌈2√((ƒ(x0) − ƒ* + (L/2)║x0 − x*║²)/ε)⌉.

This expression is similar to (27). In fact, if x* is a global minimizer of ƒ in ℝn, then

ƒ(x0) − ƒ* ≤ (L/2)║x0 − x*║²

may be introduced in the last expression, retrieving (27).

Estimations of the Lipschitz constant. Both Algorithms 4 and 5 and the algorithms discussed by Nesterov in [15, Chapter 2] make explicit use of a Lipschitz constant L for the function gradient. In [16], Nesterov describes a method for a more general problem, easily applied to the situations studied in this paper; it includes a scheme for estimating the Lipschitz constant. In [7, 8], the authors eliminate the use of L at the cost of an extra imprecise line search, and obtain an algorithm which keeps the optimal complexity properties and also inherits the global convergence properties of the steepest descent method for general continuously differentiable optimization. Besides this, the algorithm takes advantage of the knowledge of the strong convexity constant of the function, and [7] develops an adaptive procedure for estimating it. In another context – constrained minimization of non-smooth homogeneous convex functions – Richtárik [21] uses an adaptive scheme for guessing bounds for the distance between a point x0 and an optimal solution x*. This bound determines the number of subgradient steps needed to obtain a desired precision.

Extensions. Nesterov’s approach is applied to penalty methods by Lan, Lu & Monteiro [11], and interior descent methods based on Bregman distances are described by Auslender & Teboulle [1]. This method has been generalized to a non-interior method using projections by Rossetto [22]. Results for higher order methods are discussed in Nesterov & Polyak [17]. Accelerated versions of first-order algorithms following Nesterov’s approach were developed by Monteiro, Ortiz & Svaiter [13] and by Beck & Teboulle [2], with improved performance for benchmark problems.

6 CONCLUSIONS

In this paper we described what we believe to be the basic results in the study of algorithm performance and problem complexity for the minimization of convex functions, both unconstrained and with simple constraints.

Algorithms with proved low worst-case performance are not necessarily efficient for practical problems. Khachiyan’s algorithm [10] for linear programming is very inefficient, but had a great impact on the development of both continuous and discrete optimization. Karmarkar’s algorithm [9] for linear programming improved Khachiyan’s performance bound, and his bound was again improved later (see [6]). The effort to improve complexity led to better algorithms, which are nowadays used for solving large scale linear and quadratic problems in many domains. In fact, the largest linear programming problem treated up to now had 1.1 billion variables and 380 million constraints, and was solved by Gondzio & Grothey [5] using an interior point algorithm.

The conjugate gradient algorithm (Krylov space method) has optimal performance for quadratic problems, but its extension to more general problems is not straightforward. It was superseded by quasi-Newton methods, which are more efficient for non-quadratic problems, but coincide with it in the convex quadratic case. Note that the conjugate gradient method was not motivated by the complexity study, which came later.

Accelerated gradient methods are now in the phase of development, and we are not aware of any extensive comparison with classical algorithms. Research in this field is presently very active, and it is not clear to what classes of problems this approach will be applied and which methods will be the winners in practical applications to large-scale problems.

REFERENCES

  • [1]
    AUSLENDER A & TEBOULLE M. 2006. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697-725.
  • [2]
    BECK A & TEBOULLE M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183-202, March.
  • [3]
    BIRGIN EG, MARTÍNEZ JM & RAYDAN M. 2009. Spectral Projected Gradient Methods. In C.A. Floudas and P.M. Pardalos, editors, Encyclopedia of Optimization, pages 3652-3659. Springer.
  • [4]
    FLETCHER R & REEVES CM. 1964. Function minimization by conjugate gradients. Computer J., 7:149-154.
  • [5]
    GONDZIO J & GROTHEY A. 2006. Solving nonlinear financial planning problems with 10⁹ decision variables on massively parallel architectures. In M. Costantino and C.A. Brebbia, editors, Computational Finance and its Applications II, WIT Transactions on Modelling and Simulation, volume 43. WIT Press.
  • [6]
    GONZAGA CC. 1992. Path-following methods for linear programming. SIAM Review, 34(2):167- 224.
  • [7]
    GONZAGA CC & KARAS EW. 2013. Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming, 138(1-2):141-166.
  • [8]
    GONZAGA CC, KARAS EW & ROSSETTO DR. 2013. An optimal algorithm for constrained differentiable convex optimization. SIAM Journal on Optimization, 23(4):1939-1955.
  • [9]
    KARMARKAR N. 1984. A new polynomial time algorithm for linear programming. Combinatorica, 4:373-395.
  • [10]
    KHACHIYAN LG. 1979. A polynomial algorithm for linear programming. Doklady Akad. Nauk USSR, 244: 1093-1096. Translated in Soviet Math. Doklady 20:191-194.
  • [11]
    LAN G, LU Z & MONTEIRO RDC. 2011. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1-29.
  • [12]
    LUENBERGER DG & YE Y. 2008. Linear and Nonlinear Programming. Springer, New York, third edition.
  • [13]
    MONTEIRO RDC, ORTIZ C & SVAITER BF. 2012. An adaptive accelerated first-order method for convex optimization. Technical report, School of ISyE, Georgia Tech, July.
  • [14]
    NEMIROVSKI AS & YUDIN DB. 1983. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York.
  • [15]
    NESTEROV Y. 2004. Introductory Lectures on Convex Optimization. A basic course. Kluwer Academic Publishers, Boston.
  • [16]
    NESTEROV Y. 2013. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1):125-161.
  • [17]
    NESTEROV Y & POLYAK BT. 2006. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177-205.
  • [18]
    NOCEDAL J & WRIGHT SJ. 2006. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, 2nd edition.
  • [19]
    POLYAK BT. 1987. Introduction to Optimization. Optimization Software Inc., New York.
  • [20]
    RIBEIRO AA & KARAS EW. 2013. Otimização Contínua: aspectos teóricos e computacionais. Cengage Learning. In Portuguese.
  • [21]
    RICHTÁRIK P. 2011. Improved algorithms for convex minimization in relative scale. SIAM Journal on Optimization, 21(3):1141-1167.
  • [22]
    ROSSETTO DR. 2012. Tópicos em métodos ótimos para otimização convexa. PhD thesis, Department of Applied Mathematics, University of São Paulo, Brazil. In Portuguese.
  • [23]
    SHEWCHUK JR. 1994. An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science, Carnegie Mellon University, August.

Publication Dates

  • Publication in this collection
    Sep-Dec 2014

History

  • Received
    08 Dec 2013
  • Accepted
    09 Feb 2014