
COMPLEXITY OF FIRST-ORDER METHODS FOR DIFFERENTIABLE CONVEX OPTIMIZATION

Abstract

This is a short tutorial on complexity studies for differentiable convex optimization. A complexity study is made for a class of problems, an "oracle" that obtains information about the problem at a given point, and a stopping rule for algorithms. These three items compose a scheme, for which we study the performance of algorithms and problem complexity. Our problem classes will be quadratic minimization and convex minimization in ℝn. The oracle will always be first order. We study the performance of steepest descent and Krylov space methods for quadratic function minimization, and Nesterov's approach to the minimization of differentiable convex functions.

Keywords: first-order methods; complexity analysis; differentiable convex optimization.


1 INTRODUCTION

Due to the huge increase in the size of problems tractable with modern computers, the study of problem complexity and algorithm performance became essential. This was recognized very early by computer scientists and mathematicians working on combinatorial problems, and has recently become a central issue in continuous optimization. Complexity studies for these problems started in the former Soviet Union, and the main results are described in the book by Nemirovski & Yudin [14].

The special case of Linear Programming, which will not be tackled in this paper, started with Khachiyan [10], also in Russia, in 1978, and had an explosive expansion in the West with the creation of interior point methods in the 80's and 90's.

This paper starts with a brief introduction to the main concepts in the study of algorithm performance and complexity, following Nemirovski and Yudin, and then applies them to the study of the convex optimization problem

minimize ƒ(x), x ∈ ℝn,    (1)

where ƒ: ℝn → ℝ is a continuously differentiable function.

In Section 2 we introduce the general framework for the study of algorithm performance and problem complexity and present a simple example.

We dedicate Section 3 to the study of the special case of convex quadratic functions, because they are the simplest non-linear functions: if a method is inefficient for quadratic problems, it will certainly be inefficient for more general problems; if it is efficient, it has a good chance of being adaptable to general differentiable convex problems, because near an optimal solution the quadratic approximation of the function tends to be accurate. We study the performance of steepest descent and of Krylov space methods.

Section 4 will describe and analyze a basic method for unconstrained convex optimization devised by Nesterov [15], with “accelerated steepest descent” iterations. This method has become very popular, and the presentation and complexity proofs will be based on our paper [7].

Finally, we comment in Section 5 on improvements of this basic algorithm, presenting without proofs its extension to problems restricted to “simple sets” (sets onto which projecting a vector is easy).

2 SCHEMES, PERFORMANCE AND COMPLEXITY

A complexity study is associated with a scheme (Σ, 𝒪, τε), as follows.

  1. Σ is a class of problems.

    Examples: linear programming problems, unconstrained minimization of convex functions.

  2. 𝒪 is an oracle associated with Σ.

    The oracle is responsible for accessing the available information 𝒪(x) about a given problem in Σ at a given point x.

    Examples:

    • 𝒪(x) = {ƒ(x)} (zero order),

    • 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

  3. τε is a stopping rule, associated with a precision ε > 0.

    Examples: for the minimization problem (1), τε defined by

    ƒ (x) − ƒ* ≤ ε,

    ║∇ƒ (x)║ ≤ ε

    ║x − x*║ ≤ ε,

    where x* is a solution of the problem and ƒ* = ƒ(x*).

An instance in a scheme would be, for example: solve a convex minimization problem (Σ) using only first order information (𝒪) to a precision ║∇ƒ(xk)║ ≤ 10−6 (the rule τε with ε = 10−6).

Algorithms. The general problem associated with a scheme (Σ, 𝒪, τε) is to find a point satisfying τε, using as information only consultations to the oracle and any mathematical procedures that do not depend on the particular problem being solved.

The algorithms studied in this paper follow the black box model described now. An algorithm starts with a point x0 and computes a sequence (xk)k∈ℕ. Each iteration k accesses the oracle at xk and uses the information obtained by the oracle at x0, x1, ..., xk to compute a new point xk+1. It stops if xk+1 satisfies τε.

Algorithm 1. Black box model for (Σ, 𝒪, τε)

Data: x0, ε > 0, k = 0, I−1 = ∅

WHILE xk does not satisfy τε

  • Oracle at xk: compute 𝒪(xk)

  • Update information set: Ik = Ik−1 ∪ 𝒪(xk)

  • Apply rules of the method to Ik: find xk+1

  • k = k + 1.

Algorithm performance for (Σ, 𝒪, τε) (worst case performance)

Consider an algorithm for (Σ, 𝒪, τε).

  • The iteration performance is a bound on the number of iterations (oracle calls) needed to solve any problem in the scheme (Σ, 𝒪, τε). This bound will depend on ε and on parameters associated with each specific problem (initial point, space dimension, condition number, Lipschitz constants, etc.). In other words, it is the number of iterations needed to solve the “worst possible” problem in (Σ, 𝒪, τε).

  • The numerical performance is a bound on the number of arithmetical operations needed in the worst case. The numerical performance is usually proportional to the iteration performance for each given algorithm. In this paper we only study the iteration performance of algorithms.

  • The complexity of the scheme (Σ, 𝒪, τε) is the performance of the best possible algorithm for the scheme. It is frequently unknown, and finding it for different schemes is the main purpose of complexity studies.

The performance of any algorithm for a scheme gives an upper bound to its complexity. A lower bound to the complexity may sometimes be found by constructing an example of a (difficult) problem in Σ and deriving a lower bound valid for any algorithm based on the same oracle and stopping rule. This will be the case at the end of Section 3.

If the performance of an algorithm for a scheme matches the complexity of the scheme, or a fixed multiple of it, the algorithm is called optimal for the scheme.

First-order algorithms: In most of this paper we study first-order algorithms for solving the problem

minimize ƒ(x), x ∈ ℝn,

where ƒ: ℝn → ℝ is a differentiable function. The problem classes will be the special cases of quadratic and convex functions.

A first-order algorithm starts from a given point x0 and constructs a sequence (xk) using the oracle 𝒪(xk) = {ƒ(xk), ∇ƒ(xk)}. Each step computes a point xk+1 using the information set Ik = ∪kj=0 𝒪(xj), so that

xk+1 ∈ x0 + span{∇ƒ(x0), ∇ƒ(x1), ..., ∇ƒ(xk)},

where span(S) stands for the subspace generated by S.

In particular, the most well-known minimization algorithm is the steepest descent method, in which

xk+1 = xk − λk∇ƒ(xk),

where λk is a steplength. Each different choice of steplength (the rules of the method) defines a different steepest descent algorithm. This will be studied ahead in this paper.

Remark: In our algorithm model we used a single oracle, but there may be more than one. Typically, 𝒪0(x) = {ƒ(x)}, 𝒪1(x) = {∇ƒ(x)}, and 𝒪0 may be called more than once in each iteration. This is the case when line searches are used. The performance evaluation must then be adapted.

The notation O(·). Given two real positive functions g(·) and h(·), we say that g = O(h) if there exists some constant K > 0 such that g(·) ≤ Kh(·). This notation is very useful in complexity studies. For example, we shall prove that a certain steepest descent algorithm stops for k = ⌈(C/4) ln(1/ε)⌉, where C is a parameter that identifies the problem in Σ and ε is the precision. We may write k = O(C log(1/ε)), ignoring the coefficient 1/4.

2.1 Example: root of a continuous function

Here we present a simple example to illustrate how a complexity analysis works. Consider the following example of (Σ, 𝒪, τε), given by:

  • Σ: Given a continuous function ƒ : [0, 1] → ℝ with ƒ(0) ≤ 0 and ƒ(1) ≥ 0, find x ∈ [0, 1] such that ƒ(x) = 0.

  • 𝒪: For x ∈ [0, 1], 𝒪(x) = {ƒ(x)}.

  • τε: For ε > 0, τε is satisfied if |x − x*| ≤ ε for some root x*.

Remark: The stopping rule above is obviously not computable. We shall use a practical rule that implies τε.

Algorithm 2. Bisection

Data: ε ∈ (0, 1), a0 = 0, b0 = 1, k = 0

WHILE bk - ak > ε (stopping rule)

  • m = (ak + bk)/2

  • Compute ƒ(m) (oracle)

  • IF ƒ(m) ≤ 0, set ak+1 = m, bk+1 = bk

  • ELSE set bk+1 = m, ak+1 = ak

  • k = k + 1.
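A minimal sketch of Algorithm 2 in Python (the test function and the tolerance are illustrative choices, not part of the scheme):

```python
def bisection(f, eps):
    """Algorithm 2: return a point within eps of a root of f on [0, 1].

    Assumes f continuous with f(0) <= 0 and f(1) >= 0."""
    a, b, k = 0.0, 1.0, 0
    while b - a > eps:           # stopping rule: implies |x - x*| <= eps
        m = (a + b) / 2.0        # rules of the method: bisect the interval
        if f(m) <= 0:            # one oracle call per iteration
            a = m
        else:
            b = m
        k += 1
    return a, k

# Example: f(x) = x**3 - 1/4 has its root at 4**(-1/3) ~ 0.63.
x, k = bisection(lambda x: x**3 - 0.25, 1e-6)
print(x, k)   # k = ceil(log2(1/eps)) = 20 oracle calls
```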

Performance: the following facts are straightforward at all iterations:

  1. ƒ (ak) ≤ 0 and ƒ (bk) ≥ 0 and by the intermediate value theorem, τε is implied by bkak ≤ ε.

  2. bk − ak = 2−k.

If the algorithm does not stop at iteration k, then 2−k > ε, that is, k < log2(1/ε). We conclude that the stopping rule will be satisfied for k = ⌈log2(1/ε)⌉, where ⌈r⌉ is the smallest integer above r. Thus the performance of the scheme above is k = ⌈log2(1/ε)⌉ = O(log(1/ε)).

It is possible to prove that this is the best possible algorithm for this scheme, and hence the complexity of the scheme coincides with this performance.

    Remarks:
  1. Note that the only assumption here was the continuity of ƒ. With stronger assumptions (Lipschitz constants for instance), better algorithms are described in numerical calculus textbooks.

  2. The rules of the method are in the bisection calculation. Only the present oracle information 𝒪(m) is used at step k.

3 MINIMIZATION OF A QUADRATIC FUNCTION: FIRST-ORDER METHODS

Quadratic functions are the simplest nonlinear functions, and so an efficient algorithm for minimizing nonlinear functions must also be efficient in the quadratic case. On the other hand, near a minimizer, a twice differentiable function is usually well approximated by a quadratic function. A quadratic function is defined by

ƒ(x) = cᵀx + ½ xᵀHx,    (2)

where c ∈ ℝn and H is an n × n symmetric matrix. Then for x ∈ ℝn,

∇ƒ(x) = c + Hx,  ∇²ƒ(x) = H.

If x* is a minimizer or a maximizer of ƒ, then

∇ƒ(x*) = c + Hx* = 0,

and hence finding an extremal of ƒ is equivalent to solving the linear system Hx* = −c, one of the most important problems in Mathematics.

The behavior of a quadratic function depends on the eigenvalues of its Hessian H. Since H is symmetric, it is known that H has n real eigenvalues

μ1 ≤ μ2 ≤ ... ≤ μn,

which may be associated with n orthonormal (mutually orthogonal with unit norm) eigenvectors ν1, ν2, ..., νn. There are four cases, represented in Figure 1:

Figure 1
Quadratic functions.
  (i) If μ1 > 0, then H is a positive definite matrix, ƒ is strictly convex and its unique minimizer is the unique solution of Hx = −c.

  (ii) If μ1 < 0, then infx∈ℝn ƒ(x) = −∞, and ƒ(x) → −∞ along the direction ν1.

    Consider now the cases in which there are null eigenvalues. Let them be μ1 = μ2 = ... = μk = 0. Thus H is a positive semi-definite matrix.

  (iii) If cᵀνi = 0 for i = 1, ..., k, then ƒ has a k-dimensional set of minimizers.

  (iv) If cᵀνi < 0 for some i = 1, ..., k, then ƒ is unbounded below.
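These four cases can be checked numerically from an eigendecomposition of H. A small sketch (the tolerance tol is an arbitrary choice):

```python
import numpy as np

def classify_quadratic(H, c, tol=1e-12):
    """Classify f(x) = c'x + x'Hx/2 according to the four cases above."""
    mu, V = np.linalg.eigh(H)              # eigenvalues in increasing order
    if mu[0] > tol:
        return "(i) strictly convex: the minimizer solves Hx = -c"
    if mu[0] < -tol:
        return "(ii) unbounded below along the eigenvector v1"
    null = V[:, np.abs(mu) <= tol]         # eigenvectors of null eigenvalues
    if np.all(np.abs(null.T @ c) <= tol):
        return "(iii) a set of minimizers (dimension of the null space)"
    return "(iv) unbounded below along a null eigenvector"

print(classify_quadratic(np.diag([1.0, 0.0]), np.array([0.0, 1.0])))  # (iv)
```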

In this section, we study the following scheme:

  • Σ: the class of quadratic functions that are bounded below (cases (i) and (iii) above). A function in Σ has at least one minimizer x*. Without loss of generality, the study of algorithmic properties (not the implementation) may assume that x* = 0, and so the function becomes

ƒ(x) = ƒ* + ½ xᵀHx.    (3)
𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

τε: Given an initial point x0 ∈ ℝn, two rules will be used in the analysis:

  • Absolute error bound: ƒ(x) − ƒ* ≤ ε;

  • Relative error bound: ƒ(x) − ƒ* ≤ ε(ƒ(x0) − ƒ*).

These rules are not implementable, because they require the knowledge of x*, but they are very useful in the performance analysis.

Simplification: As we explained above, we assume that ƒ(·) has a minimizer x* = 0. After performing the analysis with this simplification, we substitute x − x* for x. A further simplification may be done by diagonalizing H, also without loss of generality.

3.1 Steepest descent algorithms

In the first half of the 19th century, Cauchy found that the gradient ∇ ƒ(x) of a function ƒ is the direction of maximum ascent of ƒ from x, and stated the gradient method. It is the most basic of all optimization algorithms, and its performance is still an active research topic.

Algorithm 3. Steepest descent algorithm (model)

Data: x0 ∈ ℝn, ε > 0, k = 0

WHILE xk does not satisfy τε

  • Choose a steplength λk > 0

  • xk+1 = xk − λk∇ ƒ(xk) = (I − λkH)xk

  • k = k + 1.

Steplengths: Each different method for choosing the steplengths λk defines a different steepest descent algorithm. Let us describe the two best known choices for the steplengths:

• The Cauchy step, or exact step,

λk = argminλ>0 ƒ(xk − λg),

the unique minimizer of ƒ along the direction −g, with g = ∇ƒ(xk). The steplength is computed by setting ∇ƒ(xk − λkg) ⊥ g and simplifying, which results in

λk = gᵀg / gᵀHg.    (4)
• The short step: λk < 2/μn, a fixed steplength.
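Both steplength rules fit the model Algorithm 3. A sketch for the quadratic case with x* = 0 (the test matrix, initial point and tolerance are illustrative choices):

```python
import numpy as np

def steepest_descent(H, x0, eps, rule="cauchy"):
    """Algorithm 3 for f(x) = x'Hx/2 (so x* = 0 and f* = 0).

    Stops by the relative error bound (5): f(x) <= eps * f(x0)."""
    f = lambda x: 0.5 * x @ H @ x
    mu_n = np.linalg.eigvalsh(H)[-1]        # largest eigenvalue
    x, k = x0.astype(float), 0
    while f(x) > eps * f(x0):
        g = H @ x                           # gradient at x
        lam = (g @ g) / (g @ H @ g) if rule == "cauchy" else 1.0 / mu_n
        x = x - lam * g
        k += 1
    return x, k

H = np.diag([1.0, 10.0])                    # condition number C = 10
x0 = np.array([10.0, 1.0])
for rule in ("cauchy", "short"):            # bounds: (C/4)ln(1/eps), (C/2)ln(1/eps)
    print(rule, steepest_descent(H, x0, 1e-6, rule)[1])
```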

Complexity results

Now we study the iteration performance of the steepest descent methods with these two steplength choices for minimizing a strictly convex quadratic function (case (i)). Given ε > 0 and x0 ∈ ℝn, we consider that τε is satisfied at a given x ∈ ℝn if

ƒ(x) − ƒ* ≤ ε(ƒ(x0) − ƒ*).    (5)

In both cases the algorithm stops in O(C log(1/ε)) iterations, where C = μn/μ1. At this moment the following question is open: find a steepest descent algorithm (by a different choice of λk) with performance O(√C log(1/ε)). This performance is achieved in practice for “normal problems” (but not for particular worst case problems) by Barzilai-Borwein and spectral methods, described in [3].

Theorem 1. Let C = μn/μ1 ≥ 1 be the condition number of H. The iteration performance of the steepest descent method with Cauchy steplength for minimizing ƒ, starting at x0 ∈ ℝn and with stopping criterion (5), is given by

k = ⌈(C/4) ln(1/ε)⌉ = O(C log(1/ε)).

Proof. We begin by stating a classical result for the steepest descent step, which is based on the Kantorovich inequality and is proved for instance in [12, p. 238]:

ƒ(xk+1) − ƒ* ≤ ((C − 1)/(C + 1))² (ƒ(xk) − ƒ*).

Using this recursively, we obtain

ƒ(xk) − ƒ* ≤ ((C − 1)/(C + 1))^{2k} (ƒ(x0) − ƒ*),

which implies that (5) is satisfied whenever ((C − 1)/(C + 1))^{2k} ≤ ε.

It is known that t ∈ [1, +∞) ↦ ((t − 1)/(t + 1))^t is an increasing function and that, for t > 1,

((t − 1)/(t + 1))^t < e^{−2}.

Consequently,

((C − 1)/(C + 1))^{2k} ≤ e^{−4k/C}.

If τε is not satisfied at an iteration k, then by (5), ((C − 1)/(C + 1))^{2k} > ε, and then e^{−4k/C} > ε,

which implies k < (C/4) ln(1/ε), completing the proof.

Example. In this example we show that the bound obtained in Theorem 1 is sharp, i.e., it cannot be improved. Take the following problem in ℝ2:

minimize ƒ(x) = ½ xᵀHx,  with H = diag(1, C),

so that μ1 = 1 and μn = C. Assume that the initial point of some iteration has the shape x = z(C, 1)ᵀ, for some z ∈ ℝ. Then

∇ƒ(x) = zC(1, 1)ᵀ.

Computing the steplength λ by (4), we obtain λ = 2/(C + 1), and then the next iterate will be

x⁺ = x − λ∇ƒ(x) = ((C − 1)/(C + 1)) z(C, −1)ᵀ.

It follows that

ƒ(x⁺) − ƒ* = ((C − 1)/(C + 1))² (ƒ(x) − ƒ*),

and this will be repeated at all iterations, with the worst possible performance as in Theorem 1.
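This zig-zag behavior is easy to check numerically; a short sketch (C = 10 is an arbitrary choice):

```python
import numpy as np

C = 10.0
H = np.diag([1.0, C])                # mu_1 = 1, mu_n = C
x = np.array([C, 1.0])               # initial point of shape z(C, 1)', z = 1
f = lambda x: 0.5 * x @ H @ x
for k in range(5):
    g = H @ x
    lam = (g @ g) / (g @ H @ g)      # Cauchy step (4); equals 2/(C+1) here
    x_new = x - lam * g
    print(lam, f(x_new) / f(x))      # ratio is ((C-1)/(C+1))**2 ~ 0.669
    x = x_new
```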

Theorem 2. Let C = μn/μ1 ≥ 1 be the condition number of H. The iteration performance of the steepest descent method with short steps λk = 1/μn, for minimizing ƒ starting at x0 ∈ ℝn and with stopping criterion (5), is given by

k = ⌈(C/2) ln(1/ε)⌉ = O(C log(1/ε)).

Proof. A simplification in the analysis can be made by diagonalizing the matrix H by using the orthonormal matrix whose columns are the eigenvectors of H. Thus, we can consider, without loss of generality, that H = diag(μ1, ..., μn). Given x0 ∈ ℝn, by the steepest descent algorithm with short steps,

xk+1 = (I − H/μn) xk.

Thus, for all i = 1, ..., n,

xᵢᵏ = (1 − μi/μn)ᵏ xᵢ⁰,  and so  |xᵢᵏ| ≤ (1 − 1/C)ᵏ |xᵢ⁰|.

Consequently, by the definition of ƒ,

ƒ(xk) − ƒ* ≤ (1 − 1/C)^{2k} (ƒ(x0) − ƒ*).

Proceeding like in the proof of Theorem 1, we obtain

(1 − 1/C)^{2k} ≤ e^{−2k/C}.

So, if τε is not satisfied at an iteration k, then k < (C/2) ln(1/ε), completing the proof.

    Remarks:
  • When the short steps 1/μn are used, the result is that for i = 1, ..., n, |xᵢᵏ| ≤ √ε |xᵢ⁰| for k ≥ (C/2) ln(1/ε). Hence not only ƒ(xk) ≤ ε ƒ(x0), but also ║xk║ ≤ √ε ║x0║ and ║∇ƒ(xk)║ ≤ √ε ║∇ƒ(x0)║.

  • The diagonalization of H can be made without loss of generality for the performance analysis, as we did in the proof of Theorem 2. This leads to an interesting observation about the constant C, for the case in which there are null eigenvalues (case (iii)). Assuming that μ1 = μ2 = ... = μp = 0, we see that for i = 1, ..., p, (∇ƒ(x))i = 0, and these variables xi remain constant forever, having no influence on the performance. The bounds in Theorems 1 and 2 remain valid for C = μn/μp+1.

3.2 Krylov methods

Krylov space methods are the best possible algorithms for minimizing a quadratic function using only first-order information. Let us describe the geometry of a Krylov space method for the quadratic (2).

• Starting at a point x0, define the line V1 = {x0 + θ∇ƒ(x0) | θ ∈ ℝ} and

x1 = argmin {ƒ(x) | x ∈ V1}.

This is actually the Cauchy step. We may write V1 = x0 + span{Hx0}.

• Second step: take the affine space defined by ∇ƒ(x0) and ∇ƒ(x1), V2 = x0 + span{∇ƒ(x0), ∇ƒ(x1)}, and note that since ∇ƒ(x1) = H(x0 + θ∇ƒ(x0)) = Hx0 + θH²x0,

V2 = x0 + span{Hx0, H²x0},

and the next iterate will be

x2 = argmin {ƒ(x) | x ∈ V2}.

This is a two-dimensional problem.

• k-th step: adding ∇ƒ(xk−1) to the set of gradients, we construct the set

Vk = x0 + span{Hx0, H²x0, ..., Hᵏx0},

and the next point will be

xk = argmin {ƒ(x) | x ∈ Vk},    (Pk)

a k-dimensional problem.

Since ∇ ƒ(xk) ⊥ Vk because of the minimization, either ∇ ƒ(xk) = 0 and the problem is solved, or Vk+1 is (k+1)−dimensional. It is then clear that xn is an optimal solution because Vn = ℝn. This gives us a first performance bound kn for the Krylov space method. This bound is bad for high-dimensional spaces.

Main question: how to solve (Pk). Without proof (see for instance [15, 20]), it is known that the directions (xk − xk−1) are conjugate, and any conjugate direction algorithm like Fletcher-Reeves [4, 18] solves (Pk) at each iteration with about the same work as in the steepest descent method. From now on we do a performance analysis of the Krylov space method, with stopping criterion

ƒ(xk) − ƒ* ≤ ε.

A result for the relative error bound will also be discussed in the end of the section. The analysis is quite technical and will result in (16).
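For small problems the iterates (6) can be computed directly, by minimizing ƒ over an explicit basis of the Krylov space; in exact arithmetic, conjugate gradient methods produce the same iterates at a far lower cost. A dense, illustration-only sketch with random test data:

```python
import numpy as np

def krylov_iterates(H, x0, kmax):
    """xk = argmin { f(x) : x in x0 + span{H x0, ..., H^k x0} }, f(x) = x'Hx/2."""
    iterates, basis, v = [x0.copy()], [], x0.copy()
    for _ in range(kmax):
        v = H @ v                            # next direction H^{k+1} x0
        basis.append(v.copy())
        B = np.column_stack(basis)           # write x = x0 + B a
        a = np.linalg.solve(B.T @ H @ B, -B.T @ (H @ x0))
        iterates.append(x0 + B @ a)
    return iterates

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + np.eye(5)                      # positive definite Hessian
x0 = rng.standard_normal(5)
for k, x in enumerate(krylov_iterates(H, x0, 5)):
    print(k, 0.5 * x @ H @ x)                # f(xk) vanishes at k = n = 5
```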

Definition 1. Given x0 ∈ ℝn and k ∈ ℕ, define the k-th Krylov space by

Kk = span{Hx0, H²x0, ..., Hᵏx0}.

Consider Vk = x0 + Kk and define the sequence (xk) by

xk = argmin {ƒ(x) | x ∈ Vk}.    (6)

Let Pk be the set of polynomials p: ℝ → ℝ of degree k such that p(0) = 1, i.e.,

Pk = {p | p(t) = 1 + a1t + a2t² + ··· + aktᵏ, a1, ..., ak ∈ ℝ}.    (7)
From now on we deal with matrix polynomials, setting t = H.

Lemma 1. A point x ∈ Vk if, and only if, x = p(H)x0 for some polynomial p ∈ Pk. Furthermore,

ƒ(x) = ½ x0ᵀ H p(H)² x0.    (8)

Proof. A point x ∈ Vk if, and only if,

x = x0 + a1Hx0 + a2H²x0 + ··· + akHᵏx0 = p(H)x0,

where p ∈ Pk. Furthermore,

ƒ(x) = ½ (p(H)x0)ᵀ H (p(H)x0) = ½ x0ᵀ (p(H))ᵀ H p(H) x0.

As H is symmetric, (p(H))ᵀH = Hp(H), completing the proof.

Lemma 2. For any polynomial p ∈ Pk,

ƒ(xk) ≤ ½ x0ᵀ H p(H)² x0.

Proof. Consider an arbitrary polynomial p ∈ Pk. From Lemma 1, the point x = p(H)x0 belongs to Vk. As xk minimizes ƒ in Vk, we have ƒ(xk) ≤ ƒ(x). Using (8) we complete the proof.

Lemma 3. Let A ∈ ℝn×n be a symmetric matrix with eigenvalues λ1, λ2, ..., λn. If q: ℝ → ℝ is a polynomial, then q(λ1), q(λ2), ..., q(λn) are the eigenvalues of q(A).

Proof. As A is a symmetric matrix, there exists an orthogonal matrix P such that A = PDPᵀ, with D = diag(λ1, λ2, ..., λn). If q(t) = a0 + a1t + ··· + aktᵏ, then

q(A) = a0I + a1PDPᵀ + ··· + ak(PDPᵀ)ᵏ = P q(D) Pᵀ.

Note that

q(D) = diag(q(λ1), q(λ2), ..., q(λn)),

which completes the proof.

3.2.1 Chebyshev Polynomials

The Chebyshev polynomials will be needed in the performance analysis of Krylov methods.

Definition 2. The Chebyshev polynomial of degree k, Tk: [−1, 1] → ℝ, is defined by

Tk(t) = cos(k arccos(t)).
The next lemma shows that Tk is, in fact, a polynomial (even though it does not look like one).

Lemma 4. For all t ∈ [−1, 1], T0(t) = 1 and T1(t) = t. Furthermore, for all k ≥ 1,

Tk+1(t) = 2t Tk(t) − Tk−1(t).    (9)

Proof. The first statements follow from the definition. In order to prove the recurrence rule, consider θ: [−1, 1] → [0, π], given by θ(t) = arccos(t). Thus,

cos((k + 1)θ(t)) = cos(kθ(t)) cos(θ(t)) − sin(kθ(t)) sin(θ(t))

and

cos((k − 1)θ(t)) = cos(kθ(t)) cos(θ(t)) + sin(kθ(t)) sin(θ(t)).

Adding these two equalities, Tk+1(t) + Tk−1(t) = 2 cos(kθ(t)) cos(θ(t)). But cos(kθ(t)) = Tk(t) and cos(θ(t)) = t. So,

Tk+1(t) = 2t Tk(t) − Tk−1(t),

completing the proof.
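The recurrence (9) gives a convenient way to compute the coefficients of Tk and to check Lemmas 4 and 5 numerically; a sketch (the degree and test grid are arbitrary choices):

```python
import numpy as np

def chebyshev_coeffs(k):
    """Coefficients of T_k, lowest degree first, via T_{k+1} = 2t T_k - T_{k-1}."""
    T = [np.array([1.0]), np.array([0.0, 1.0])]     # T0 = 1, T1 = t
    for _ in range(k - 1):
        shifted = np.append(0.0, T[-1])             # coefficients of t * T_k
        padded = np.append(T[-2], [0.0, 0.0])       # T_{k-1}, padded to degree k+1
        T.append(2.0 * shifted - padded)
    return T[k]

a = chebyshev_coeffs(7)              # T7 = 64t^7 - 112t^5 + 56t^3 - 7t
print(a[-1] == 2.0**6)               # leading coefficient 2^{k-1} (Lemma 5)
print(a[1] == (-1)**3 * 7)           # a1 = (-1)^{(k-1)/2} k for odd k (Lemma 5)
t = np.linspace(-1.0, 1.0, 101)
print(np.allclose(np.polyval(a[::-1], t), np.cos(7 * np.arccos(t))))   # Definition 2
```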

Lemma 5. If Tk(t) = aktᵏ + ··· + a2t² + a1t + a0, then ak = 2^{k−1}. Furthermore,

  1. If k is even, then a0 = (−1)^{k/2} and a2j−1 = 0, for all j = 1, ..., k/2;

  2. If k is odd, then a1 = (−1)^{(k−1)/2} k and a2j = 0, for all j = 0, 1, ..., (k−1)/2.

Proof. We prove by induction. The results are trivial for k = 0 and k = 1. Suppose that the results hold for all natural numbers less than or equal to k. Using the induction hypothesis, consider

Tk(t) = 2^{k−1}tᵏ + ···  and  Tk−1(t) = 2^{k−2}t^{k−1} + ··· .

By Lemma 4,

Tk+1(t) = 2t Tk(t) − Tk−1(t) = 2ᵏt^{k+1} + ··· ,

leading to the first statement. Suppose that (k + 1) is even. Then k is odd and (k − 1) is even. Thus, by the induction hypothesis, Tk has only odd powers of t and Tk−1 has only even powers. In this way, by (9), Tk+1 has only even powers of t. Furthermore, its independent term is

a0(Tk+1) = −a0(Tk−1) = −(−1)^{(k−1)/2} = (−1)^{(k+1)/2}.

On the other hand, if (k + 1) is odd, then k is even and (k − 1) is odd. Again by the induction hypothesis, Tk has only even powers of t and Tk−1 has only odd powers. Thus by (9), Tk+1 has only odd powers of t. Furthermore, its linear term is

a1(Tk+1) = 2a0(Tk) − a1(Tk−1) = 2(−1)^{k/2} + (−1)^{k/2}(k − 1) = (−1)^{k/2}(k + 1),

completing the proof.

The next lemma discusses a relationship between a Chebyshev polynomial of odd degree and polynomials of the set Pk, defined in (7).

Lemma 6. Consider L > 0 and k ∈ ℕ. Then there exists p ∈ Pk such that, for all t ∈ (0, L],

|p(t)| ≤ 1 / ((2k + 1)√(t/L)).

Proof. By Lemma 5, for all t ∈ [−1, 1], we have

T2k+1(t) = (−1)ᵏ(2k + 1) t (1 + c2t² + ··· + c2kt^{2k}),

where the polynomial in parentheses has only even powers of t. So, for all t ∈ [0, L],

|T2k+1(√(t/L))| ≤ 1.

Defining

p(t) = T2k+1(√(t/L)) / ((−1)ᵏ(2k + 1)√(t/L)),

which is a polynomial of degree k in t with p(0) = 1, we complete the proof.

3.2.2 Complexity results

Now we present the main result about the performance of Krylov methods for minimizing a convex quadratic function. This result is based on [19, Thm. 3, p. 170].

We use the matrix norm defined by

║A║ = max{║Ax║ : ║x║ = 1};    (10)

for a symmetric matrix, ║A║ is the largest absolute value of its eigenvalues.
Theorem 3. Let μn be the largest eigenvalue of H and consider the sequence (xk) defined by (6). Then for k ∈ ℕ,

ƒ(xk) − ƒ* ≤ μn║x0 − x*║² / (2(2k + 1)²),    (11)

and ƒ(xk) − ƒ* ≤ ε is satisfied for

k = O(║x0 − x*║ √(μn/ε)).    (12)
Proof. Without loss of generality, assume that x* = 0 (and so ƒ* = 0). By Lemma 2, for every polynomial p ∈ Pk,

ƒ(xk) ≤ ½ x0ᵀ H p(H)² x0 ≤ ½ ║H p(H)²║ ║x0║².

But, from Lemma 3 and (10),

║H p(H)²║ = max{μi p(μi)² | i = 1, ..., n}.

Considering the polynomial p ∈ Pk given in Lemma 6 with L = μn, and using the fact that all eigenvalues of H belong to (0, μn], we have

μi p(μi)² ≤ μi μn / ((2k + 1)² μi) = μn / (2k + 1)²,

proving (11). If τε is not satisfied at an iteration k, then ƒ(xk) > ε and consequently

μn║x0║² / (2(2k + 1)²) > ε,

which implies (12) and completes the proof.

Performance of the method for the relative error bound: A similar analysis for τε given by (5), also using Chebyshev polynomials, can be done using the condition number C. This is done in [14, 23], and the result is

k = O(√C log(1/ε)),

clearly better than the best performance of the steepest descent algorithm for the steplength rules studied above and, for reasonable values of μ1, better than (12).

Complexity bound. The Krylov space method uses at each iteration all the information gathered in the previous steps, and hence it seems to be the best possible algorithm based on first order information. In fact, Nemirovski & Yudin [14] prove that no algorithm using a first order oracle can have a performance more than twice as good as the Krylov space method.

For methods based on accumulated first order information there is a negative result described by Nesterov [15, p. 59]: he constructs a quadratic problem (which he calls “the worst problem in the world”) for which any such method needs at least

k ≥ K ║x0 − x*║ √(μn/ε)    (15)

iterations to reach τε, where K > 0 is a fixed constant.

We conclude that the best performance for a first order method must be between the bounds (12) and (15). So the complexity of the scheme is

k = O(║x0 − x*║ √(μn/ε)).    (16)
4 CONVEX DIFFERENTIABLE FUNCTIONS: THE BASIC ALGORITHM

In this section we study the performance of algorithms for the unconstrained minimization of differentiable convex functions. Quadratic functions are a particular case, and hence the performance bounds for first order algorithms will not be better than those found in the former section.

The role played by μn in quadratic functions will be played by a Lipschitz constant L for the gradient of ƒ (indeed, for a quadratic function the largest eigenvalue is a Lipschitz constant for the gradient), and we shall see that there are optimal algorithms, i.e., algorithms with the performance given by (16) with μn replaced by L. These algorithms were developed by Nesterov [15], and are also studied in our papers [7, 8].

Consider the scheme (Σ, 𝒪, τε) where

  • Σ: the class of minimization problems of a convex continuously differentiable function ƒ with a Lipschitz constant L > 0 for the gradient. It means that for all x, y ∈ ℝn,

║∇ƒ(x) − ∇ƒ(y)║ ≤ L║x − y║.    (17)
  • 𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x)} (first order).

  • τε: defined by ƒ(x) − ƒ* ≤ ε, where x* is a solution of the problem and ƒ* = ƒ(x*).

Simple quadratic functions. The following definition will be useful in our development: we shall call “simple” a quadratic function φ: ℝn → ℝ with ∇²φ(x) = γI, γ ∈ ℝ, γ > 0. The following facts are easily proved for such functions (see the sketch after this list):

  • φ(·) has a unique minimizer ν ∈ ℝn (which we shall refer to as the center of the quadratic), and the function can be written as

φ(x) = φ(ν) + (γ/2)║x − ν║².    (18)

  • Given x ∈ ℝn,

ν = x − ∇φ(x)/γ

and

φ(ν) = φ(x) − ║∇φ(x)║² / (2γ).
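A quick numerical check of these identities on random data (the dimension, γ and the constant term are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 2.5
nu = rng.standard_normal(n)                       # center of the quadratic
phi = lambda x: 3.0 + 0.5 * gamma * (x - nu) @ (x - nu)    # phi(nu) = 3
grad = lambda x: gamma * (x - nu)                 # gradient of phi

x = rng.standard_normal(n)
print(np.allclose(nu, x - grad(x) / gamma))       # recovers the center
print(np.isclose(phi(nu), phi(x) - grad(x) @ grad(x) / (2 * gamma)))
```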

4.1 The algorithm

We now state the main algorithm and then study its properties. We include in the statement of the algorithm the definitions of the relevant functions (approximations of ƒ(·) and the simple quadratic defined below).

We begin by summarizing the geometrical construction at an iteration k, represented in Figure 2. The iteration starts with two points xk, νk ∈ ℝn and a simple quadratic function

φk(x) = ƒ(xk) + (γk/2)║x − νk║²,

whose global minimizer is νk.

Figure 2
The mechanics of the algorithm.

A point yk = xk + α(νkxk) is chosen between xk and νk. The choice of α is a central issue, and will be discussed later. All the action is centered on yk, with the following construction:

  • Take a gradient step from yk, generating xk+1.

  • Define a linear approximation of ƒ(·),

ℓ(x) = ƒ(yk) + ∇ƒ(yk)ᵀ(x − yk).
  • Compute a value α ∈ (0, 1), and define φα(x) = αℓ(x) + (1 − α)φk(x), with Hessian γk+1I = ∇²φα(x) = (1 − α)γkI, and let νk+1 be the minimizer of this simple quadratic. The iteration is completed by defining

φk+1(x) = ƒ(xk+1) + (γk+1/2)║x − νk+1║².
Now we state the algorithm.

Algorithm 4.

Data: x0 ∈ ℝn, ν0 = x0, γ0 = L, k = 0

REPEAT

  • Compute αk ∈ (0, 1) such that Lαk² = (1 − αk)γk

  • Set yk = xk + αk(νk − xk)

  • Compute ƒ(yk) and g = ∇ ƒ(yk)

  • Updates

    • xk+1 = ykg/L (steepest descent step)

    • γk+1 = (1 − αk)γk

      • For the analysis define

      • x ↦ φk(x) = ƒ(xk) + (γk/2)║x − νk║²

      • x ↦ ℓ(x) = ƒ(yk) + gᵀ(x − yk)

      • x ↦ u(x) = ƒ(yk) + gᵀ(x − yk) + (L/2)║x − yk║²

      • x ↦ φαk(x) = αkℓ(x) + (1 − αk)φk(x)

    • νk+1 = argmin φαk(·) = νk − (αk/γk+1) g

    • k = k + 1.
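A direct transcription of Algorithm 4 in Python. The test function is an illustrative convex quadratic (so its largest eigenvalue is a valid L and ƒ* = 0); the stopping rule uses the known optimal value, as in the analysis:

```python
import numpy as np

def nesterov(f, grad, L, x0, eps, fstar=0.0):
    """Algorithm 4: Nesterov's accelerated steepest descent."""
    x, v, gam = x0.copy(), x0.copy(), L           # x0, nu_0 = x0, gamma_0 = L
    k = 0
    while f(x) - fstar > eps:                     # stopping rule tau_eps
        # alpha_k in (0,1) solving L*alpha^2 = (1 - alpha)*gam
        alpha = (-gam + np.sqrt(gam**2 + 4.0 * L * gam)) / (2.0 * L)
        y = x + alpha * (v - x)
        g = grad(y)
        x = y - g / L                             # steepest descent step
        gam_next = (1.0 - alpha) * gam
        v = v - (alpha / gam_next) * g            # new center nu_{k+1}
        gam, k = gam_next, k + 1
    return x, k

H = np.diag(np.linspace(1.0, 100.0, 10))          # L = 100, f* = 0 at x* = 0
f = lambda x: 0.5 * x @ H @ x
x, k = nesterov(f, lambda x: H @ x, L=100.0, x0=np.ones(10), eps=1e-6)
print(k)    # within the bound (27): k <= 2 ||x0 - x*|| sqrt(L/eps)
```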

4.1.1 Analysis of the algorithm

The most important procedure in the algorithm is the choice of the parameter αk, which then determines yk at each iteration. The choice of αk is the one devised by Nesterov in [15, Scheme (2.2.6)]. Instead of “discovering” the values for these parameters, we shall simply adopt them and show their properties.

Once yk is determined, two independent actions are taken:

  1. A steepest descent step from yk computes xk+1.

  2. A new simple quadratic is constructed by combining φk(·) and the linear approximation ℓ(·) of ƒ(·) about yk:

φαk(x) = αkℓ(x) + (1 − αk)φk(x).

Our goal will be to prove two facts:

  • At any iteration k, φαk(νk+1) ≥ ƒ(xk+1).

  • For all x ∈ ℝn, φk+1(x) − ƒ(x) ≤ (1 − αk)(φk (x) − ƒ(x)).

From these facts we shall conclude that ƒ(xk) → ƒ* with the same speed as γk → 0, which easily leads to the desired performance result.

The first lemma shows our main finding about the geometry of these points. All the action happens in the two-dimensional space defined by xk, νk, νk+1. Note the beautiful similarity of the triangles in Figure 3.

Figure 3
Geometric properties of the steps.

Lemma 7. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

xk+1 = xk + αk(νk+1 − xk).

Proof. By the algorithm, we know that Lαk² = γk+1, and

xk + αk(νk+1 − xk) = xk + αk(νk − xk) − (αk²/γk+1) g = yk − g/L = xk+1,

completing the proof.

Lemma 8. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

φαk(νk) ≥ ƒ(yk).

Proof. By the definition of φαk,

φαk(νk) = αkℓ(νk) + (1 − αk)φk(νk).

But φk(νk) = ƒ(xk) ≥ ℓ(xk). Using this, the definition of ℓ and the fact that αk ∈ (0, 1), we have

φαk(νk) ≥ αkℓ(νk) + (1 − αk)ℓ(xk) = ƒ(yk) + gᵀ(αk(νk − yk) + (1 − αk)(xk − yk)).    (21)

By the definition of yk in the algorithm, νk − yk = (1 − αk)(νk − xk) and xk − yk = −αk(νk − xk). Substituting this in (21), we complete the proof.

Lemma 9. Consider the sequences generated by Algorithm 4. Then for k ∈ ℕ,

ƒ(xk+1) ≤ u(xk+1) ≤ φαk(νk+1),    (22)

and, for all x ∈ ℝn,

φk+1(x) ≤ φαk(x).    (23)

Proof. The first inequality follows from the Lipschitz condition (17) and the definition of u, which give ƒ(x) ≤ u(x) for all x ∈ ℝn. Since xk+1 and νk+1 are respectively global minimizers of u(·) and φαk(·), we have from (18) that, for all x ∈ ℝn,

u(x) = u(xk+1) + (L/2)║x − xk+1║²  and  φαk(x) = φαk(νk+1) + (γk+1/2)║x − νk+1║².    (24)

As ƒ(yk) = u(yk) and, from the last lemma, u(yk) ≤ φαk(νk), we only need to show that

u(yk) − u(xk+1) = φαk(νk) − φαk(νk+1).

The construction is shown in Fig. 3: since, by Lemma 7, xk+1 = xk + αk(νk+1 − xk),

yk − xk+1 = αk(νk − νk+1).

Using this, (24) and the fact that, by construction, αk² = γk+1/L, we obtain

u(yk) − u(xk+1) = (L/2)║yk − xk+1║² = (Lαk²/2)║νk − νk+1║² = (γk+1/2)║νk − νk+1║² = φαk(νk) − φαk(νk+1),

proving the second inequality of (22).

By construction,

φk+1(x) = ƒ(xk+1) + (γk+1/2)║x − νk+1║².

Comparing to (24) and using the fact that ƒ(xk+1) ≤ φαk(νk+1), we get (23), completing the proof.

Lemma 10. For any x ∈ ℝn and k ∈ ℕ,

φk(x) − ƒ(x) ≤ (γk/γ0)(φ0(x) − ƒ(x)).    (25)

Proof. By the definition of φαk and the fact that ℓ(x) ≤ ƒ(x), for all x ∈ ℝn,

φαk(x) ≤ αkƒ(x) + (1 − αk)φk(x).

Subtracting ƒ(x) from both sides, and using (23) and the definition of γk+1, we have

φk+1(x) − ƒ(x) ≤ (1 − αk)(φk(x) − ƒ(x)) = (γk+1/γk)(φk(x) − ƒ(x)).

Using this recursively, we get the result and complete the proof.

4.1.2 Complexity

The following lemma was proved by Nesterov [15, p. 77] with a different notation.

Lemma 11. Consider the sequence (γk) generated by Algorithm 4, i.e., given γ0 > 0,

γk+1 = (1 − αk)γk,  with αk ∈ (0, 1) such that Lαk² = (1 − αk)γk.

Then, for k ∈ ℕ, γk ≤ 4L/k².

Proof. As αk = √(γk+1/L), the sequence satisfies

γk+1 = (1 − √(γk+1/L)) γk.

Thus, the result follows directly from [7, Lemma 10].

Theorem 4. Consider the sequences generated by Algorithm 4 and assume that x* is an optimal solution. Then for k ∈ ℕ,

ƒ(xk) − ƒ* ≤ 4L║x0 − x*║²/k²,    (26)

and ƒ(xk) − ƒ* ≤ ε is satisfied for

k = ⌈2║x0 − x*║√(L/ε)⌉.    (27)

Proof. From Lemma 10, (25) holds in particular at x*:

φk(x*) − ƒ* ≤ (γk/γ0)(φ0(x*) − ƒ*).

Using the fact that ƒ(xk) = φk(νk) ≤ φk(x*) and the definition of φ0, we get

ƒ(xk) − ƒ* ≤ (γk/γ0)(ƒ(x0) − ƒ* + (γ0/2)║x0 − x*║²).    (28)

Since x* is a minimizer of the convex function ƒ, ∇ƒ(x*) = 0 and (17) give

ƒ(x0) − ƒ* ≤ (L/2)║x0 − x*║².

Applying this and the result of Lemma 11 in (28),

ƒ(xk) − ƒ* ≤ (γk/γ0)((L + γ0)/2)║x0 − x*║² ≤ (4L/k²)((L + γ0)/(2γ0))║x0 − x*║².

As γ0 = L, we have (26). If τε is not satisfied at an iteration k, then ƒ(xk) − ƒ* > ε and consequently

4L║x0 − x*║²/k² > ε,

which implies (27) and completes the proof.

So, the iteration performance of Algorithm 4 is

k = O(║x0 − x*║√(L/ε)),

which corresponds to the complexity (16) obtained for quadratic problems. Hence the algorithm is optimal.

5 CONVEX DIFFERENTIABLE FUNCTIONS: ENHANCED ALGORITHMS

The algorithm presented in the former section may be extended in several ways: the need for a previous knowledge of a Lipschitz constant L may be eliminated, a strong convexity constant akin to the smallest eigenvalue in the quadratic case may be used, and the algorithm may be extended to problems constrained to so-called simple sets. These extensions are treated in our paper [8] and in references therein.

In this section we state the extension of the basic algorithm to problems with simple constraints, without proofs. Consider the scheme (Σ, 𝒪, τε) where

  • Σ: the class of problems

    • minimize ƒ(x)

    • subject to x ∈ Ω,

  • where Ω ⊂ ℝn is a closed convex set and ƒ: ℝn → ℝ is convex and continuously differentiable, with a Lipschitz constant L > 0 for the gradient. We assume that Ω is a “simple” set, in the following sense: given an arbitrary point x ∈ ℝn, an oracle is available to compute the orthogonal projection of x onto Ω,

PΩ(x) = argmin{║x − y║ : y ∈ Ω}.

  • 𝒪: 𝒪(x) = {ƒ(x), ∇ƒ(x), PΩ(x)}.

  • τε: defined by ƒ(x) − ƒ* ≤ ε, where x* is a solution of the problem and ƒ* = ƒ(x*).

We now state the basic algorithm for constrained problems, without proofs. We keep in the statement the definition of the functions used in the analysis made in [8].

Algorithm 5.

Data: x0 ∈ Ω, ν0 = x0, γ0 = L, k = 0

REPEAT

  • Compute αk ∈ (0, 1) such that Lαk² = (1 − αk)γk

  • yk = xk + αk(νk − xk)

  • Compute ƒ(yk) and g = ∇ƒ(yk)

  • Updates

    • xk+1 = PΩ(yk − g/L) (projected steepest descent step)

    • γk+1 = (1 − αk)γk

      • For the analysis define

      • x ↦ φk(x) = ƒ(yk) + (γk/2)║x − νk║²

      • x ↦ ℓ(x) = ƒ(yk) + gᵀ(x − yk)

      • x ↦ u(x) = ƒ(yk) + gᵀ(x − yk) + (L/2)║x − yk║²

      • x ↦ φαk(x) = αkℓ(x) + (1 − αk)φk(x)

    • νk+1 = argminx∈Ω φαk(x) = PΩ(νk − (αk/γk+1) g)

    • k = k + 1.
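A sketch of Algorithm 5 for a box constraint, for which the projection oracle is a componentwise clipping (the test problem and the fixed iteration budget are illustrative choices):

```python
import numpy as np

def nesterov_projected(grad, proj, L, x0, kmax):
    """Algorithm 5: accelerated projected gradient on a simple set."""
    x, v, gam = proj(x0), proj(x0), L
    for _ in range(kmax):
        alpha = (-gam + np.sqrt(gam**2 + 4.0 * L * gam)) / (2.0 * L)
        y = x + alpha * (v - x)
        g = grad(y)
        x = proj(y - g / L)                   # projected descent step
        gam_next = (1.0 - alpha) * gam
        v = proj(v - (alpha / gam_next) * g)  # projected center update
        gam = gam_next
    return x

# minimize f(x) = ||x - c||^2 / 2 over the box [0, 1]^n (so L = 1);
# the solution is the projection of c onto the box.
c = np.array([2.0, -1.0, 0.5])
proj = lambda x: np.clip(x, 0.0, 1.0)
x = nesterov_projected(lambda x: x - c, proj, L=1.0, x0=np.zeros(3), kmax=200)
print(x)    # approaches (1, 0, 0.5)
```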

Consider the sequences generated by Algorithm 5. Then, as proved in [8, Thm. 2.6], at any iteration k before stopping,

ƒ(xk) − ƒ* ≤ (4/k²)(ƒ(x0) − ƒ* + (L/2)║x0 − x*║²),

and hence τε is satisfied for

k = ⌈2√((ƒ(x0) − ƒ* + (L/2)║x0 − x*║²)/ε)⌉.

This expression is similar to (27). In fact, if x* is a global minimizer of ƒ in ℝn, then

ƒ(x0) − ƒ* ≤ (L/2)║x0 − x*║²

may be introduced in the last expression, retrieving (27).

Estimations of the Lipschitz constant. Both Algorithms 4 and 5 and the algorithms discussed by Nesterov in [15, Chapter 2] make explicit use of a Lipschitz constant L for the function gradient. In [16], Nesterov describes a method for a more general problem, easily applied to the situations studied in this paper; it includes a scheme for estimating the Lipschitz constant. In [7, 8], the authors eliminate the use of L at the cost of an extra imprecise line search, and obtain an algorithm which keeps the optimal complexity properties and also inherits the global convergence properties of the steepest descent method for general continuously differentiable optimization. Besides this, the algorithm takes advantage of the knowledge of the strong convexity constant of the function, and [7] develops an adaptive procedure for estimating it. In another context – constrained minimization of non-smooth homogeneous convex functions – Richtárik [21] uses an adaptive scheme for guessing bounds for the distance between a point x0 and an optimal solution x*. This bound determines the number of subgradient steps needed to obtain a desired precision.

Extensions. Nesterov’s approach is applied to penalty methods by Lan, Lu & Monteiro [11], and interior descent methods based on Bregman distances are described by Auslender & Teboulle [1]. This method has been generalized to a non-interior method using projections by Rossetto [22]. Results for higher order methods are discussed in Nesterov & Polyak [17]. Accelerated versions of first-order algorithms following Nesterov’s approach were developed by Monteiro, Ortiz & Svaiter [13] and by Beck & Teboulle [2], with improved performance for benchmark problems.

6 CONCLUSIONS

In this paper we described what we believe to be the basic results in the study of algorithm performance and problem complexity for the minimization of convex functions, both unconstrained and with simple constraints.

Algorithms with proved low worst-case performance are not necessarily efficient for practical problems. Khachiyan’s algorithm [10] for linear programming is very inefficient, but had a great impact on the development of both continuous and discrete optimization. Karmarkar’s algorithm [9] for linear programming improved Khachiyan’s performance bound, and his bound was again improved later (see [6]). The effort to improve complexity led to better algorithms, which are nowadays used for solving large scale linear and quadratic problems in many domains. In fact, the largest linear programming problem treated up to now had 1.1 billion variables and 380 million constraints, and was solved by Gondzio & Grothey [5] using an interior point algorithm.

The conjugate gradient algorithm (Krylov space method) has optimal performance for quadratic problems, but its extension to more general problems is not straightforward. It was superseded by quasi-Newton methods, which are more efficient for non-quadratic problems, but coincide with it in the convex quadratic case. Note that the conjugate gradient method was not motivated by the complexity study, which came later.

Accelerated gradient methods are now in the phase of development, and we are not aware of any extensive comparison with classical algorithms. Research in this field is presently very active, and it is not clear to what classes of problems this approach will be applied and which methods will be the winners in practical applications to large-scale problems.

REFERENCES

  • [1]
    AUSLENDER A & TEBOULLE M. 2006. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697-725.
  • [2]
    BECK A & TEBOULLE M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183-202, March.
  • [3]
    BIRGIN EG, MARTÍNEZ JM & RAYDAN M. 2009. Spectral Projected Gradient Methods. In C.A. Floudas and P.M. Pardalos, editors, Encyclopedia of Optimization, pages 3652-3659. Springer.
  • [4]
    FLETCHER R & REEVES CM. 1964. Function minimization by conjugate gradients. Computer J., 7:149-154.
  • [5]
    GONDZIO J & GROTHEY A. 2006. Solving nonlinear financial planning problems with 10⁹ decision variables on massively parallel architectures. In M. Costantino and C.A. Brebbia, editors, Computational Finance and its Applications II, WIT Transactions on Modelling and Simulation, volume 43. WIT Press.
  • [6]
    GONZAGA CC. 1992. Path-following methods for linear programming. SIAM Review, 34(2):167- 224.
  • [7]
    GONZAGA CC & KARAS EW. 2013. Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming, 138(1-2):141-166.
  • [8]
    GONZAGA CC, KARAS EW & ROSSETTO DR. 2013. An optimal algorithm for constrained differentiable convex optimization. SIAM Journal on Optimization, 23(4):1939-1955.
  • [9]
    KARMARKAR N. 1984. A new polynomial time algorithm for linear programming. Combinatorica, 4:373-395.
  • [10]
    KHACHIYAN LG. 1979. A polynomial algorithm for linear programming. Doklady Akad. Nauk USSR, 244: 1093-1096. Translated in Soviet Math. Doklady 20:191-194.
  • [11]
    LAN G, LU Z & MONTEIRO RDC. 2011. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1-29.
  • [12]
    LUENBERGER DG & YE Y. 2008. Linear and Nonlinear Programming. Springer, New York, third edition.
  • [13]
    MONTEIRO RDC, ORTIZ C & SVAITER BF. 2012. An adaptive accelerated first-order method for convex optimization. Technical report, School of ISyE, Georgia Tech, July.
  • [14]
    NEMIROVSKI AS & YUDIN DB. 1983. Problem Complexity and Method Efficiency in Optimization. John Wiley, New York.
  • [15]
    NESTEROV Y. 2004. Introductory Lectures on Convex Optimization. A basic course. Kluwer Academic Publishers, Boston.
  • [16]
    NESTEROV Y. 2013. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1):125-161.
  • [17]
    NESTEROV Y & POLYAK BT. 2006. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177-205.
  • [18]
    NOCEDAL J & WRIGHT SJ. 2006. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, 2nd edition.
  • [19]
    POLYAK BT. 1987. Introduction to Optimization. Optimization Software Inc., New York.
  • [20]
    RIBEIRO AA & KARAS EW. 2013. Otimização Contínua: aspectos teóricos e computacionais. Cengage Learning. In Portuguese.
  • [21]
    RICHTÁRIK P. 2011. Improved algorithms for convex minimization in relative scale. SIAM Journal on Optimization, 21(3):1141-1167.
  • [22]
    ROSSETTO DR. 2012. Tópicos em métodos ótimos para otimização convexa. PhD thesis, Department of Applied Mathematics, University of São Paulo, Brazil. In Portuguese.
  • [23]
    SHEWCHUK JR. 1994. An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science, Carnegie Mellon University, August.

Publication Dates

  • Publication in this collection
    Sep-Dec 2014

History

  • Received
    08 Dec 2013
  • Accepted
    09 Feb 2014