
On the divergence of line search methods

Walter F. Mascarenhas

Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão 1010, Cidade Universitária, 05508-090 São Paulo, SP, Brazil, E-mail: walterfm@ime.usp.br

ABSTRACT

We discuss the convergence of line search methods for minimization. We explain how Newton's method and the BFGS method can fail even if the restrictions of the objective function to the search lines are strictly convex functions, the level sets of the objective functions are compact, the line searches are exact and the Wolfe conditions are satisfied. This explanation illustrates a new way to combine general mathematical concepts and symbolic computation to analyze the convergence of line search methods. It also illustrates the limitations of the asymptotic analysis of the iterates of nonlinear programming algorithms.

Mathematical subject classification: 20E28, 20G40, 20C20.

Key words: line search methods, convergence.

1 Introduction

Line search methods are fundamental algorithms in nonlinear programming. Their theory started with Cauchy [4] and they were implemented in the first electronic computers in the late 1940's and early 1950's. They have been intensively studied since then and today they are widely used by scientists and engineers. Their convergence theory is well developed and is described at length in many good surveys, such as [11], and even in textbooks, like [2] and [12].

Line search methods, as discussed in this work, are used to solve the unconstrained minimization problem for a smooth function f: ℝⁿ → ℝ:

min { f(x) : x ∈ ℝⁿ }.     (1)

We see them as discrete dynamical systems of the form

where {xk} ⊂ ℝⁿ is the sequence we expect to converge to the solution of problem (1) and ek contains auxiliary information specific to each method. At the kth step we choose a search direction dk and analyze f along the line {xk + wdk, w ∈ ℝ}. We search for a step size αk ∈ ℝ such that the sequence xk satisfies constraints

that are simple and try to force the xk to converge to a local minimizer of f.

For example, Newton's method for minimizing a function can be written in the framework (2) by taking

D(f, xk, αk, dk, ek) = −∇²f(xk)⁻¹ ∇f(xk)

and E = 0. The BFGS method can be considered by taking ek ∈ ℝⁿˣⁿ and setting

for gk = ∇f(xk), gk+1 = ∇f(xk + αkdk) and sk = αkdk. A typical example of constraints C on the step size αk in (3) are the Wolfe conditions:

f(xk + αkdk) ≤ f(xk) + σ αk ∇f(xk)ᵀdk,     (6)

∇f(xk + αkdk)ᵀdk ≥ β ∇f(xk)ᵀdk,     (7)

where 0 < σ < β < 1. Usually, condition (6) enforces a sufficient decrease in the value of f from step k to k + 1 and condition (7) leads to steps sk = xk+1 − xk which are not too short.
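As an aside, the role of σ and β is easy to see in code. The sketch below is our own illustration (the function, names and values are arbitrary, and the conditions are written in the standard form used above) of testing whether a trial step size satisfies (6)-(7):

import numpy as np

def satisfies_wolfe(f, grad, x, d, alpha, sigma=1e-4, beta=0.9):
    """Check the Wolfe conditions for step size alpha along d,
    assuming 0 < sigma < beta < 1 and that d is a descent direction."""
    g0 = grad(x) @ d                    # directional derivative at x (negative for descent)
    x_new = x + alpha * d
    sufficient_decrease = f(x_new) <= f(x) + sigma * alpha * g0   # condition (6)
    not_too_short = grad(x_new) @ d >= beta * g0                  # condition (7)
    return sufficient_decrease and not_too_short

# Toy example: quadratic objective, steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, -2.0])
print(satisfies_wolfe(f, grad, x, -grad(x), alpha=1.0))   # True: exact minimizer on the line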

A typical theorem about the convergence of line search methods looks like this one, adapted from page 212 in Nocedal and Wright's book [12]:

Theorem 1. Suppose f: ℝⁿ → ℝ has continuous second order derivatives,

is bounded and there exists C > 0 such that

If the n × n matrix e0 is symmetric positive definite and the iterates xk generated by the BFGS method, as described in (2)-(5), satisfy the Wolfe conditions (6)-(7) then the sequence {xk} converges to a minimizer of f (which is unique since f is strictly convex by (9)).
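As a purely numerical aside (ours, not part of the original argument), the convex setting of Theorem 1 is easy to reproduce with off-the-shelf software: a standard BFGS implementation with a Wolfe line search converges quickly on a strongly convex function. A minimal sketch with an arbitrary test function:

import numpy as np
from scipy.optimize import minimize

# Strongly convex test function (ours); its unique minimizer is the origin.
f = lambda x: np.log(np.cosh(x[0])) + x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] * x[1]
res = minimize(f, x0=np.array([3.0, -4.0]), method='BFGS')   # Wolfe line search inside
print(res.x, res.fun)   # res.x is close to (0, 0)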

Starting in the 1960's, similar theorems have been proved for several line search methods. Since that time people have tried to find more general results regarding the convergence of line search methods. Most researchers were happy with constraints like the Wolfe conditions and acknowledged their need, because it is easy to enforce them in practice and it is also easy to build examples in which results like theorem 1 are false if similar constraints are not imposed. However, convexity constraints like (9) were regarded as too strong and undesirable. Much effort was devoted to eliminating them but progress was slow and frustrating due to the nonlinear nature of expressions like (5). As a consequence, M. Powell, one of the leading researchers in this area, wrote in [16] that:

Moreover, theoretical studies have suggested several improvements to algorithms, and they provide a broad view of the subject that is very helpful to research. However, because of the difficulty of analyzing nonlinear calculations, the vast majority of theoretical questions that are important to the performance of optimization algorithms in practice are unanswered...

A final answer regarding the need for the convexity hypothesis (9) in theorem 1 was first published in 2002 by Y. Dai [6] and it was somewhat surprising: if f is not convex then the iterates generated by the BFGS method may never approach any point z such that ∇f(z) = 0.

We were unaware of Dai's work and in the year his result was published we found a similar answer regarding the convergence of the BFGS method [13]. Our approach, however, was quite different from his. Our work was based on the observation that equations (4)-(7) have symmetries which can be exploited to build a counterexample for theorem 1 without the hypothesis (9). After the publication of [13] we generalized the argument used in that paper and our purpose in this work is to present this generalization.

Our approach is motivated by the work started by S. Lie in the late 1800's, in which symmetries have led to remarkable solutions of nonlinear differential equations [3]. We developed a technique to produce examples with search lines as in Figure 1.


The line search methods we discuss are invariant with respect to orthogonal changes of variables and scaling, in the sense that if they assign a step sk = xk+1 − xk to the point xk and objective function F, Q is an orthogonal matrix and λ ∈ ℝ, then the step s̃k corresponding to the objective function F̃(x) = F(λ⁻¹Qᵀx) at the point x̃k = λQxk is s̃k = λQsk. We argue that in relevant cases these symmetries lead to iterates as in Figure 1.
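For Newton's method this invariance can be checked directly. The sketch below is our own code, with an arbitrary smooth test function; it verifies numerically that the Newton step transforms as s̃k = λQsk:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 3, 0.7
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # an orthogonal matrix

# An arbitrary smooth test function with explicit gradient and Hessian.
b = rng.standard_normal(n)
F      = lambda x: 0.25 * np.dot(x, x) ** 2 + b @ x
grad_F = lambda x: np.dot(x, x) * x + b
hess_F = lambda x: np.dot(x, x) * np.eye(n) + 2.0 * np.outer(x, x)

# Gradient and Hessian of F~(y) = F(lambda^{-1} Q^T y), by the chain rule.
grad_Ft = lambda y: Q @ grad_F(Q.T @ y / lam) / lam
hess_Ft = lambda y: Q @ hess_F(Q.T @ y / lam) @ Q.T / lam ** 2

x  = rng.standard_normal(n)
s  = -np.linalg.solve(hess_F(x), grad_F(x))        # Newton step for F at x
xt = lam * Q @ x
st = -np.linalg.solve(hess_Ft(xt), grad_Ft(xt))    # Newton step for F~ at x~

print(np.allclose(st, lam * Q @ s))                # True: s~ = lambda*Q*s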

Actually, the possibility of cyclic behavior for line search methods was already mentioned by Curry in 1944 [5]. It was also discussed in [6, 9, 13, 15]. Here we go one step further and present a systematic way to build examples that display this behavior. The qualitative behavior of the iterates in our examples is captured by the concepts of flower and dandelion described in section 3. In intuitive terms the iterates can be seen as defining the petals of a flower and their accumulation points lie in a cycle that defines the flower's core. When the iterates approach the core along well defined directions we say that the flower is a dandelion.

In sections 2 and 7 we present concrete examples of dandelions, for Newton's method and the BFGS method, respectively. In these examples the line searches are exact, the first Wolfe condition is satisfied, the restrictions of the objective functions to the search lines are strictly convex and the level sets of f are compact, and yet the iterates have the cyclic asymptotic behavior illustrated in Figure 1. The BFGS and Newton's methods are among the most important line search methods and our examples refute the following conjecture:

If when applying the BFGS or Newton's methods we choose the first local minimizer along the search line then the iterates converge to a local minimizer of the objective function.

Besides symmetries, this work is based on a theorem proved by H. Whitney in 1934 [17]. Whitney's theorem regards the extension of Cm functions from subsets of ℝⁿ to all of ℝⁿ. It says that if a function F and its partial derivatives up to order m are defined in a subset E of ℝⁿ and F's Taylor series up to order m behave properly in E then F can be extended to a Cm function in ℝⁿ. Whitney's theorem is a handy tool to highlight the weak points of nonlinear programming algorithms.

In section 2 we illustrate how symmetries and Whitney's theorem can be combined to analyze nonlinear line search methods in particular situations as if they were linear. This analysis is not adversely affected by the number of dimensions. To the contrary, as we go to higher dimensions the number of free parameters at our disposal increases. We are then able to observe phenomena contrary to our 2 or 3 dimensional intuition. However, as the experience with Lax Pairs has shown [3], "exploiting symmetry" is easier said than done. The algebraic manipulations necessary to implement our ideas can be overwhelming. Although the example for Newton's method presented in section 2 is a direct consequence of symmetry, we would not be able to build the example for the BFGS method in section 7 without the software Mathematica. Fortunately, today we have the luxury of tools like Mathematica and can focus on the fundamental geometrical aspects of the line search methods.

Our arguments can be adapted to objective functions with mth order Lipschitz continuous derivatives or to the more general classes of functions discussed by C. Fefferman in [7], but we do not aim for utmost generality and restrict ourselves to objective functions with Lipschitz continuous second order derivatives, so that we can speak in terms of gradients and Hessians and avoid the use of higher order multilinear forms. On the other hand, [1] indicates that things are different for analytic objective functions, mainly because these functions are "rigid" and it is not possible to change them only locally, or, more technically, due to the lack of analytic partitions of unity.

This work has six more sections and an appendix. Section 2 motivates our approach by using it to analyze the convergence of Newton's method. The technical concepts that formalize our arguments are presented in sections 3 and 4. Section 5 discusses the Wolfe conditions and section 6 explains how to build examples in which the objective function is convex along the search lines. In section 7 we combine the results from the previous sections to build an example of divergence for the BFGS method. In the appendix we prove our claims.

Finally, we would like to emphasize that it is important to look at the results in this work from a broad perspective. The examples presented here should not be taken as evidence against the use of Newton's method or the BFGS method. To the contrary, these methods perform quite well in practice. In real life numerical algorithms are implemented in floating point arithmetic and rounding errors would break our examples apart (and introduce other subtle problems). This work highlights the limitations of the asymptotic analysis of these algorithms. Our examples show that, even if taken to extremes, complex nonlinear calculations may not be able to explain the practical behavior of nonlinear programming algorithms. The main advice one can extract from this work is the following rule of thumb:

If you find that it is difficult to prove that a line search method converges under certain conditions, and other people have tried the same for a couple of decades and did not succeed, then consider the possibility that in theory the method may actually fail under these conditions, even if all your numerical experiments indicate that it always converges.

2 Newton's method

We now describe a family of examples of divergence for Newton's method for minimization. The examples are parameterized by the step size α: given α > 0 we build an example in which all step sizes αk are equal to α. This section motivates the theory presented later on. Although the geometry underlying the examples is accurately described by Figure 1, the algebraic details make them look more complex than they really are. Thus, we suggest that you pay little attention to the formulae and focus on the structure of our argument, which can be summarized as follows:

(a) we guess general expressions for the iterates xk, function values fk, gradients gk and Hessians hk which we believe to be compatible with the symmetries in Newton's method and the theory presented below.

(b) we plug these expressions into the formula that define Newton's method and obtain equations relating our guesses in item (a).

(c) we solve these equations and the next sections guarantee the existence of an objective function F such that F(xk) = fk, ∇F(xk) = gk, ∇²F(xk) = hk and skᵀ∇²F(xk + wsk)sk > 0 for w ∈ ℝ and k big enough.

Following this recipe, we decomposed ℝ⁶ as a direct sum of a three dimensional "horizontal" subspace and a three dimensional "vertical" subspace and tried iterates xk, function values fk, gradients gk and Hessians hk of the form¹:

for λ ∈ (0,1), x̄ = (x̄h, x̄v) and ḡ = (ḡh, ḡv) with x̄h, x̄v, ḡh and ḡv ∈ ℝ³ and

where I is the 3 × 3 identity matrix, Qh and Qv are 3 × 3 orthogonal matrices, A is a symmetric 3 × 3 matrix and C is a 3 × 3 matrix. We then concluded that

are convenient: they are simple and after picking them we still have the freedom to choose A, C and the remaining data in order to satisfy the hypothesis of the theory presented in the next sections and obtain iterates xk which are consistent with the formula

that defines Newton's method with step size α. If we replace ∇F(xk) and ∇²F(xk) in (14) by gk and hk in (11) then λ, Qh and Qv cancel out and we obtain the equations

where

Notice that, due to the invariance of Newton's method with respect to orthogonal changes of variables and scaling, there is no "k" in (15)-(16). Equation (15) yields

Equations (10) and (11) show that gk equals 2⁻ᵏ times a fixed vector, gk+1 equals 2⁻⁽ᵏ⁺¹⁾ times that vector multiplied by D(2)Q, and fk+1 − fk = −2⁻⁽ᵏ⁺¹⁾. Therefore, if

then sk = xk+1 − xk is a descent direction, the line searches are exact (gk+1 = 0) and the first Wolfe condition

fk+1 − fk ≤ σ gkᵀsk

holds for every σ ∈ (0,1) below an explicit positive threshold.

We now apply the results from the next sections: items 4c and 4d in the definition of seed in section 4 require that

and section 6 says that to guarantee the convexity of the objective function along the search lines we should ask for

To complete the specification of the terms in (10)-(11) we chose the 9 entries of C and the 6 independent entries of A in order to satisfy (15)-(20). These equations and inequalities are linear in A and C and the following matrices satisfy them:

Equations (10)-(13), (16)-(19) and (21) define iterates and the function values, gradients and Hessians of the objective function at them. The next sections guarantee the existence of an objective function F with Lipschitz continuous second order derivatives such that F(xk) = fk, ∇F(xk) = gk and ∇²F(xk) = hk. Moreover, neither the pair of vectors obtained from D′(0) nor the pair obtained from D(0) and D(0)D(2) are aligned², the lines ℓk = {D(0)xk + wD(0)sk, w ∈ ℝ} are such that ℓr ∩ ℓk = ∅ for all r + 1 ≤ k ≤ r + 5, and theorem 3 and lemmas 1 and 3 in the following sections show that we have much freedom to choose the value of F(x) along the search segments {xk + wsk, w ∈ [0,1]}: if the function ψ: [0,1] → ℝ has Lipschitz continuous second order derivatives and

then F can be chosen so that F(xk + wsk) = (1/2)ᵏ ψ(w) for w ∈ [0,1] and k large. In fact, condition (20) and theorem 4 in section 6 show that F can be chosen so that skᵀ∇²F(xk + wsk)sk > 0 for w ∈ ℝ and k large and the level sets

Ω(f,z) = {x ∈ ℝⁿ such that f(x) ≤ z}

are bounded.

3 Flowers and Dandelions

We now present a framework to apply Whitney's theorem to study the convergence of line search methods. We describe examples in which the iterates xk and the function values fk, the gradients gk and the Hessians hk of the objective function are grouped into p converging subsequences, which we call petals (see Figure 2). The limits of these subsequences {xk}, {fk}, {gk} and {hk} are the members of periodic sequences {ck}, {φk}, {γk} and {θk}, so that limq→∞ xpq+r = cr for all r and cr+p = cr, limq→∞ fpq+r = φr for all r and φr+p = φr, γr+p = γr and θr+p = θr. In formal terms:

Definition 1. A flower (n, p, λ, xk, fk, gk, hk) is a collection formed by

1. λ ∈ (0,1) and positive integers n and p.

2. Sequences {xk} and {ck} in ℝⁿ and a constant M > 0 such that

(a) xi = xj ⇔ i = j.

(b) cj = ck ⇔ j ≡ k mod p.

(c) λᵏ ≤ M||xk − ck|| ≤ M²λᵏ.

3. Sequences {fk}, {φk} ⊂ ℝ, {gk}, {γk} ⊂ ℝⁿ and {hk}, {θk} ⊂ 𝒮ⁿ such that

where

𝒮ⁿ is the set of n × n symmetric matrices.


Notice that the mod in item 2.b and equation (22) in the definition above imply that the sequences {ck}, {γk} and {θk} have period p and item 2.c implies that as k → ∞ the sequence xk accumulates at the limit cycle defined by the ck. The next definitions and theorems relate the fk, gk and hk in a flower to an objective function F with Lipschitz continuous second derivatives:

Definition 2. Suppose U ⊂ ℝⁿ and V ⊂ ℝᵖ. We define Lipm(U,V) as the space of functions F: U → V with Lipschitz continuous mth derivatives. If V = ℝ then we denote this space simply by Lipm(U).

Definition 3. We define LC2(ℝⁿ) as the set of functions f in Lip2(ℝⁿ) for which there exist constants Cf and Rf ∈ ℝ, which depend on f, such that if ||x|| > Rf then ∇²f(x) is positive definite and ||∇²f(x)⁻¹|| ≤ Cf.
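As a sketch, in our own notation and using only the constants Cf and Rf of Definition 3, here is the growth estimate behind the claims made in the next paragraph. Fix x with ‖x‖ ≥ Rf + 1, set t0 = (Rf + 1)/‖x‖ and φ(t) = f(tx); then
\[
  \varphi''(t) \;=\; x^{T}\nabla^{2} f(tx)\, x \;\ge\; \|x\|^{2}/C_f
  \qquad\text{for } t \in [t_0,1],
\]
so, writing m and M for the minimum of f and the maximum of \(\|\nabla f\|\) on the sphere of radius \(R_f+1\), a Taylor expansion of \(\varphi\) at \(t_0\) gives
\[
  f(x) \;\ge\; m \;-\; M\,(\|x\|-R_f-1) \;+\; \frac{(\|x\|-R_f-1)^{2}}{2\,C_f}
  \;\longrightarrow\; +\infty \quad (\|x\|\to\infty).
\]
This makes every level set bounded, hence compact, and the same expansion applied to \(\varphi'\) shows that \(\nabla f(x)^{T}x > 0\) once \(\|x\| > R_f + 1 + M C_f\), so the critical points lie in a fixed ball.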

The class LC2(ℝⁿ) is interesting because if f ∈ LC2(ℝⁿ) then there exists a constant Df such that

As a consequence, the level sets

Ω(f,z) = {x ∈ ℝⁿ such that f(x) ≤ z}

are compact, as required by many theorems regarding the convergence of line search methods, and all points x with ∇f(x) = 0 are such that ||x|| < DfCf. Thus, the elements of LC2(ℝⁿ) have a compact set of critical points and if an algorithm fails to find one of them then the algorithm is to be blamed, not the objective function. The next theorem shows that flowers can be interpolated by functions in LC2(ℝⁿ):

Theorem 2. Given a flower (n, p, λ, xk, fk, gk, hk) there exists F ∈ LC2(ℝⁿ) such that F(xk) = fk, ∇F(xk) = gk and ∇²F(xk) = hk for all k.

If the flower is a dandelion then we can improve this result and specify the objective function and its derivatives along the segments {xk + wsk, w ∈ [0,1]}:

Theorem 3. If the functions {Fk}, {Gk} and {Hk} and the intervals {[ak,bk]} are compatible with the dandelion (m, n, p, λ, xk, fk, gk, hk) then there exist k0 ∈ ℕ and F ∈ LC2(ℝⁿ) such that F(xk + wsk) = Fk(w, λᵏ), ∇F(xk + wsk) = Gk(w, λᵏ) and ∇²F(xk + wsk) = Hk(w, λᵏ) for k > k0 and w ∈ [ak, bk].

This theorem will make sense after you read the following definitions:

Definition 4. A flower (n, p, λ, xk, fk, gk, hk) is a dandelion if there exist functions Xk ∈ Lip2([0,1], ℝⁿ) such that, for all k and Sk(z) = Xk+1(z) − Xk(z),

(a) Xk+p = Xk and xk = Xk(λᵏ).

(b) The vectors Sk(0), (0) and (0) are linearly independent.

(c) The vectors Sk(0), Sk+1(0) and (0) are linearly independent.

Definition 5. The intervals [ak,bk], defined for k ∈ ℤ, are compatible with a dandelion if

(a) ak = ak+p < 0 and bk+p = bk > 1 for k ∈ ℤ.

(b) The segments {Xk(0) + wSk(0), w ∈ [ak, bk]} and {Xr(0) + wSr(0), w ∈ [ar, br]} are disjoint if r + 1 ≤ k ≤ r + p − 1.

Definition 6. For all k ∈ ℤ, consider functions Fk ∈ Lip2(ℝ²), Gk ∈ Lip1(ℝ², ℝⁿ) and Hk ∈ Lip0(ℝ², 𝒮ⁿ). We say that {Fk}, {Gk} and {Hk} are compatible with a dandelion

if Fk+p = Fk, Gk+p = Gk and Hk+p = Hk and there exists k0 such that if k > k0, i ∈ {0,1}, w ∈ ℝ and z = λᵏ then

for Vk(w) = (1 − w)(0) + w(0) and Uk(w) = (1 − w)(0) + w(0).

If we do not care about the derivatives of F in directions normal to the segments {xk + wsk, w ∈ [0,1]} then the search for compatible functions Fk, Gk and Hk is simplified by the following lemma:

Lemma 1. Let (m, n, p, λ, xk, fk, gk, hk) be a dandelion and, for k ∈ ℤ, functions Fk ∈ Lip2(ℝ²) for which Fk+p = Fk. If there exists k0 ∈ ℕ such that

for i ∈ {0,1} and k > k0 then there exist functions Gk ∈ Lip1(ℝ², ℝⁿ) and Hk ∈ Lip0(ℝ², 𝒮ⁿ) such that {Fk}, {Gk} and {Hk} are compatible.

Therefore, we can focus on Fk(w,z) and neglect Hk(w,z) and Gk(w,z) for w ∉ {0,1}. In the next section we describe a class of dandelions for which we can restrict ourselves to functions Fk of the form Fk(w,z) = ψk(w).

4 Symmetric dandelions and their seeds

In this section we present a family of dandelions that includes the examples for the BFGS and Newton's methods mentioned in the introduction. These dandelions combine symmetries imposed by an orthogonal matrix Q with contractions dictated by a diagonal matrix D. They are defined by their seeds:

Definition 7. A seed S(n, p, d, Q, k, k, k, k) is a collection formed by

1. n, p ∈ ℕ and d = (d1, ..., dn) ∈ ℝⁿ such that p ≥ 3 and dn = max di.

2. An n × n orthogonal matrix Q such that Qᵖ = I and the diagonal matrix D(z) with ith diagonal entry equal to

commutes with Q for z ∈ ℝ.

3. A sequence {k} Ì n such that k+p = k, the points cr = D(0)Qr

r are distinct for 0< r < p and the vectors

are such that, for 0 < r < p, the vectors D'(0)r and D'(0)r are not aligned and neither are the vectors D(0)r and D(0)r+1.

4. Sequences {k} Ì , {k} Ì n and {k} Ì n such that

(a)

k+p = k, k+p = k and k+p = k for all k.

(b) (

k)ij = 0 if di + dj > dn.

(c) If dl = dn - 1 and J = {j | dj > 0} then

(d) If dn< 2, L = {l | dl = dn} and S = {(i,j) | di + dj = dn, di > 0, dj > 0} then

A seed and λ ∈ (0,1) generate a dandelion through the formulae

where

This result is formalized by the following lemma:

Lemma 2. If S(n, p, d, Q, k, k, k, k) is a seed and λ ∈ (0,1) then there exists a unique dandelion (S) with Xk, fk, gk, hk, φk, γk and θk as in (34)-(39). Moreover, (S)'s steps sk = xk+1 − xk are given by

for

k defined in (31).

It is easy to apply lemma 1 and theorem 3 to (S):

Lemma 3. If S(n, p, d, Q, k, k, k, k) is a seed, λ ∈ (0,1), xk, fk, gk and hk are given by (34)-(37) and the functions ψr ∈ Lip2(ℝ), defined for 0 ≤ r < p, satisfy

for i ∈ {0,1} then the functions Fk(w,z) = ψ(k mod p)(w) satisfy (30).

To build a symmetric dandelion and specify the value of the objective function along the search segments {xk + wsk, w ∈ [0,1]} it is enough to find the right seed and functions ψr. Once we find them, the existence of an objective function F with ∇F(xk) = gk, ∇²F(xk) = hk and F(xk + wsk) = ψ(k mod p)(w) for xk, fk, gk and hk in (34)-(37) is guaranteed by lemma 3 and theorem 3. To find a seed we proceed as in section 2: we plug (34)-(37) into the expressions that define

(a) the method of our interest,

(b) the compatibility conditions in the definition of seed and theorem 3,

(c) additional constraints, like the Wolfe conditions in section 5 and the convexity conditions in section 6.

and analyze the result. If equations (34)-(37) are compatible with the method's symmetries, as they are for the BFGS and Newton's methods, then we have a chance of handling these constraints and may even find closed form solutions for them. If equations (34)-(37) are not related to the method's symmetries then the dandelion will not bloom.

5 The Wolfe conditions

In this section we show that the Wolfe conditions

and

may not prevent the cyclic behavior illustrated in figures 1 and 2. In fact, we can have cyclic behavior even if we replace the second Wolfe condition (43) by the stronger requirement gk+1 = 0, which is called exact line search condition. These conditions are invariant with respect to orthogonal changes of variables and scaling and can be easily checked for the dandelions coming from a seed:

Lemma 4. Let S(n, p, d, Q, k, k, k, k) be a seed and (m, n, p, λ, xk, fk, gk, hk) the corresponding dandelion.

(a) If

= 0 for 0< r < p then
gk+1 = 0 for all k.

(b) If

< 0 for 0< r < p then the first Wolfe condition (42) holds for

(c) If

< 0 for 0< r < p then the second Wolfe condition (43) is verified for

6 Convexity along the search lines

Convexity is an important simplifying assumption in optimization. It is tempting to conjecture that strict convexity along the search lines guarantees the convergence of line search methods³. Sometimes even stronger conjectures are made, like having convergence if we choose a global minimizer or the first local minimizer along the search line. We now show that strict convexity along the search lines does not rule out the cyclic behavior depicted in figures 1 and 2. To do that, we present a theorem that yields an objective function which is strictly convex along the search lines of a dandelion that comes from a seed:

Theorem 4. Let S(n, p, d, Q, k, k, k, k) be a seed, λ ∈ (0,1) and xk, ck, fk, gk and hk be given by (34)-(37). Consider the convex hull of {ck, 0 ≤ k < p} and the lines ℓk = {ck + w(ck+1 − ck), w ∈ ℝ}, and assume that ℓr ∩ ℓk does not meet this convex hull for r + 1 ≤ k ≤ r + p − 1. If, for 0 ≤ r < p,

then there exist a function F ∈ LC2(ℝⁿ) and k0 such that if k > k0 then F(xk) = fk, ∇F(xk) = gk, ∇²F(xk) = hk and skᵀ∇²F(xk + wsk)sk > 0 for all w ∈ ℝ.
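For reference, the last inequality in Theorem 4 is exactly strict convexity of the restriction of F to the search lines: writing \(\varphi_k(w) = F(x_k + w s_k)\), a direct computation (our notation) gives
\[
  \varphi_k'(w) = \nabla F(x_k + w s_k)^{T} s_k,
  \qquad
  \varphi_k''(w) = s_k^{T}\, \nabla^{2} F(x_k + w s_k)\, s_k ,
\]
so requiring \(\varphi_k''(w) > 0\) for all \(w \in \mathbb{R}\) is the convexity property referred to in footnote 3.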

If the vector d that defines the seed S has more than two entries equal to 0 then the lines ℓr and ℓk with r + 1 ≤ k ≤ r + p − 1 are disjoint for almost all choices of the points that define the seed. Therefore, to build examples of divergence in which the objective function is strictly convex along the search lines we can focus on (44) and check the intersections later, just to make sure we were not unlucky. This is what we did in section 2 and will do again in the next section.

7 The BFGS method

In this section we present an example of divergence for the BFGS method in which the objective function is strictly convex along the search lines. The essence of this example is already present in [13] and we suggest that you read this reference as an introduction to this section. Our purpose here is to show that the concepts of flower, dandelion and seed reduce the construction of examples like the one in [13] to the solution of algebraic problems. Software like Mathematica or Maple can decide if such algebraic problems have solutions and find accurate approximations to these solutions. The validation of the example becomes then a question of using these approximate solutions wisely to check the requirements in the definitions of flower, dandelion and seed and the hypotheses of the lemmas and theorems in the previous sections.

We analyze the BFGS method with exact line searches, i.e., gk+1 = 0. In this case the Hessian approximations Bk are updated according to the formula:

where gk = ∇F(xk). The iterates xk evolve according to

Equations (45)-(46) are invariant with respect to orthogonal changes of variables and scaling, in the sense that if Q is an orthogonal matrix, λ ∈ ℝ and F, Bk and xk satisfy equations (45)-(46) then F̃(x) = F(λ⁻¹Qᵀx), B̃k = λ⁻²QBkQᵀ and x̃k = λQxk satisfy them too. It is hard to exploit these symmetries because the BFGS method was conceived to correct the matrices Bk. However, it has an additional symmetry: if we take Bk of the form

combined with the conditions

then equation (46) is automatically satisfied and (45) holds if

for ρk such that

Notice that if we assume (48) and take αk = 1 for 0 ≤ k < n and αk as in (50) for k ≥ n, then the αk are positive and the vectors gk, ..., gk+n−1 are linearly independent, because if we suppose that a combination ∑ µj gk+j with µ0 = 1 vanishes then (48) leads to a contradiction. Therefore, under (48)-(50) the matrices Bk defined by (47) are positive definite and to build an example of divergence for the BFGS method we can ignore αk and Bk and focus on (48)-(49).
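For readers who want to experiment numerically, here is a minimal sketch of the textbook BFGS recursion with (numerically) exact line searches. The test function, the starting data and the assumption that (45)-(46) are the standard update and step written below are ours, not the paper's:

import numpy as np
from scipy.optimize import minimize_scalar

def bfgs_exact(f, grad, x, B, iters=20):
    for _ in range(iters):
        d = -np.linalg.solve(B, grad(x))                  # search direction (assumed form of (46))
        alpha = minimize_scalar(lambda a: f(x + a * d)).x  # numerically "exact" line search
        s = alpha * d
        if np.linalg.norm(s) < 1e-12:
            break
        y = grad(x + s) - grad(x)
        # standard BFGS update of the Hessian approximation (assumed form of (45))
        B = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (y @ s)
        x = x + s
    return x

f    = lambda x: (x[0] - 1.0) ** 4 + 2.0 * (x[1] + 0.5) ** 2
grad = lambda x: np.array([4.0 * (x[0] - 1.0) ** 3, 4.0 * (x[1] + 0.5)])
print(bfgs_exact(f, grad, np.array([3.0, 2.0]), np.eye(2)))   # converges toward (1, -0.5)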

To apply the theory above we plug (34)-(37) into (48) and (49) and deduce that a seed S(n, p, d, Q, k, k, k, k) is consistent with the BFGS method if

where Z = D(λ⁻¹)Q. If we fix Z then (52) becomes similar to an eigenvalue problem, with the ρ's playing the role of eigenvalues and the corresponding vectors playing the role of eigenvectors. We solved equations (52) for the seed vectors and the ρk in the case

λ equal to a specific value in (0,1) and n = 6, as described in the following lemma:

Lemma 5. If Q and D(λ) are the matrices above, dn = 3, n = 6 and λ is the specific value just mentioned, then there exist vectors in ℝ⁶ and coefficients ρk ∈ ℝ that satisfy (52), are periodic with period 11 in k, and are such that the six vectors in the set

are linearly independent for each k ∈ ℤ.

The linear independence of the vectors D(l-j)Qj

k+j in lemma 5 implies that we can solve the following linear systems of equations on the vectors k:

The vectors

k obtained from (54) define steps sk by (40):

sk = Xkk for X = D(l)Q

and we claim that the points

satisfy

and lead to points xk = Xk

k such that sk = xk+1 - xk. In fact, using (56) it is straightforward to deduce that xk+1 - xk = Xk+1
k+1 - Xk
k = Xk
k = sk. To verify (57), notice that (56) yields

where

Now, since

k+11 = k, equation (54) implies that k+11 = k and

Combining this equation, (55) and (59) we conclude that D = 0 and equation (58) and the identity

j-66 = j lead to (57).

In the proof of lemma 5 we show that the

k can be chosen so that the vectors k defined by (54) satisfy the linear independence requirements in the definition of seed. The k and k above, d = (0,0,0,1,1,3),

are also compatible with this definition. As a consequence, n = 6, p = 66, d, Q

k, k, k and k define a seed S. This seed leads to a dandelion (S) with steps sk and gradients gk compatible with the BFGS method with the matrices Bk in (45) and step sizes ak in (50). Equations (34)-(37) and (54) imply that the function values and gradients associated to (S) satisfy

Lemma 4 shows that the corresponding iterates xk satisfy the first Wolfe condition for 0 < σ < 1 − λ³. Moreover, almost all choices of the vectors in the proof of lemma 5 lead to lines ℓk as in the hypothesis of theorem 4 such that ℓr ∩ ℓk = ∅ for r + 1 ≤ k ≤ r + 65 and D′(0)sk+1 ≠ 0. Equation (60) shows that skᵀhksk > 0 and skᵀhk+1sk > 0. As a result, equation (61) and theorem 4 yield an objective function F ∈ LC2(ℝ⁶) such that the iterates xk generated by applying the BFGS method to F with x0 = 0 and the Bk above satisfy skᵀ∇²F(xk + wsk)sk > 0 for k large and w ∈ ℝ.

This completes the presentation of an example showing that there is not enough strength in the definition of the BFGS method, the exact line search condition and the Wolfe conditions to prevent the cyclic behavior in figures 1 and 2, even when the objective function is strictly convex along the search lines and has compact level sets.

You may now be wondering if we could not find a simpler example, in dimension less than 6. Unfortunately, we were not able to find such an example because the eigenvalue problem (52) does not seem to have real solutions ρk ≠ 0 for n < 6. Notice that this is a purely algebraic issue, which has nothing to do with nonlinear programming. Similarly, the minimum dimension at which we could build counterexamples with the techniques described here for other methods could be bigger or smaller than 6, depending on the particular symmetries of the method and the existence of solutions for the corresponding algebraic problems.

8 Appendix

We now prove the results stated in the previous sections. Our main tool is the following corollary of Whitney's extension theorem:

Lemma 6. Let E be a bounded subset of ℝⁿ and suppose F: E → ℝ, G: E → ℝⁿ and H: E → 𝒮ⁿ are functions with domain E and M is a constant. If

then there exists F̃ ∈ Lip2(ℝⁿ) such that F̃(x) = F(x), ∇F̃(x) = G(x) and ∇²F̃(x) = H(x) for x ∈ E. Moreover, there exist constants C and R such that if ||x|| > R then ∇²F̃(x) is positive definite and ||∇²F̃(x)⁻¹|| < C.

Whitney's theorem is stated in different levels of generality in the pure mathematics literature. The most complete and general approach is due to C. Fefferman [7, 8] but, unfortunately, it is stated in a language that is too abstract for the average nonlinear programming researcher. In page 48 of [10], L. Hörmander presents a more concrete version of the theorem for functions in Cm. This version of the theorem is stated using n-tuples (multi-indices) to denote partial derivatives, i.e., if α = (α1,...,αn) ∈ ℕⁿ, |α| = ∑ αi and f is a function with partial derivatives of order |α| then ∂αf(x) is defined as ∂^|α| f(x) / (∂x1^α1 ⋯ ∂xn^αn).

Moreover, α! = ∏ αi! and xα = ∏ xi^αi. Using this notation, the arguments used to prove theorem 2.3.6 in [10] can be adapted to prove the following version of Whitney's theorem:

Theorem 5. Let E be a bounded set in ℝⁿ and consider, for each α ∈ ℕⁿ with |α| ≤ m, a Lipschitz continuous function uα: E → ℝ. If there exists a constant M such that

for all x, y ∈ E and α ∈ ℕⁿ with |α| ≤ m then there exists f ∈ Lipm(ℝⁿ) such that ∂αf(x) = uα(x) for all α ∈ ℕⁿ with |α| ≤ m and x ∈ E. Moreover, there exists a constant C such that |∂αf(x)| ≤ C for all x ∈ ℝⁿ and α ∈ ℕⁿ with |α| ≤ m.
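For the reader's convenience, the compatibility condition usually required in this setting (our transcription of the classical Whitney condition for functions with Lipschitz continuous mth derivatives, to be checked against theorem 2.3.6 in [10]) reads:
\[
  \Bigl|\, u_{\alpha}(x) \;-\; \sum_{|\beta| \le m - |\alpha|}
  \frac{u_{\alpha+\beta}(y)}{\beta !}\,(x-y)^{\beta} \Bigr|
  \;\le\; M\, \|x-y\|^{\,m-|\alpha|+1}
  \qquad \text{for all } x, y \in E,\ |\alpha| \le m .
\]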

This theorem allows us to extend much of the discussion in this work to Lipm(ℝⁿ) for m > 2. However, for m > 2 we cannot group the partial derivatives in vectors (gradients) and matrices (Hessians) as we did in the statement of lemma 6. As a consequence, the geometry behind the examples gets blurred by the technicalities as m increases and we decided it would be best to focus on the consequences of lemma 6 instead of exploring more general results like theorem 5.

We can apply Whitney's result to a dandelion with xk = Xk(lk) because if

and d > 0 is small and the intervals {[ak,bk]} are compatible with then the distance between points in the 2 dimensional surface

can be estimated in terms of w and z:

Lemma 7. Given a dandelion with xk = Xk(lk), compatible intervals {[ak, bk]} and k in (65) there exists d > 0 such that if wj,wkÎ [ak,bk], zj, zkÎ [0,d], yj = j(wj,zj), yk = k(wk,zk) and ||yj - yk|| < d then either

and in case (a)

in case (b)

and in case (c)

We also use the next lemmas. After the statement of these lemmas we prove the theorems and the paper ends with the proofs of the lemmas.

Lemma 8. Consider E Ì n, constants K > 1, d > 0 and functions F: E ® , G: E ® n and H: E ® n. If for all x, z Î E there exist m Î and y1, ..., ymÎ E such that, for y0 = x and ym+1 = z,

then all x,y Î E with ||x - y|| < d satisfy (62)-(64) with M = 3 K4.

Lemma 9. If the dandelion (m, n, p, l, xk, fk, gk, hk) and the functions {Fk}, {Gk} and {Hk} are compatible and the equations (26)-(29) hold for k > k0 then there exists M Î such that if j,k > k0 and u,w Î then

for Dh = k(u, lk) -k(w, lk), Dv = k(w, lj) -k(w, lk) and as in (65).

Lemma 10. If f0, f1, g0, g1, h0 and h1 ∈ ℝ are such that g0 < f1 − f0 < g1, h0 > 0 and h1 > 0 then there exists ψ ∈ Lip2(ℝ) such that ψ(0) = f0, ψ(1) = f1, ψ′(0) = g0, ψ′(1) = g1, ψ″(0) = h0, ψ″(1) = h1 and ψ″(w) > 0 for all w ∈ ℝ.

Lemma 11. Given a function ψ ∈ Lip1(ℝ²) such that ψ(i,0) = 0 and ∇ψ(i,0) = 0 for i ∈ {0,1} there exists φ ∈ Lip1(ℝ²) such that

(a) φ(i,z) = 0 and ∇φ(i,z) = 0 for i ∈ {0,1} and z ∈ ℝ.

(b) φ(w,0) = ψ(w,0) and ∇φ(w,0) = ∇ψ(w,0) for w ∈ ℝ.

Lemma 12. Consider l Î (0,1), a function F Î Lip2(2), functions Y1 and Y2 in Lip2([0,1],n) and Y3(z) = Y2(z) - Y1(z). If the vectors (0), (0) and Y3(0) are linearly independent and, for i Î {0,1},

then there exist d > 0, G Î Lip1(2,n) and H Î Lip0(2, n) such that if z Î [0,d] and i Î {0,1} then

for

Proof of Theorem 2. Item 2 in the definition of flower in section 3 implies that there exists δ > 0 such that if ||xk − xj|| < δ then j ≡ k mod p. The periodicity of φk, γk and θk given by (22), item 2 in definition 1 and the bounds (22)-(25) imply that if x and z belong to the set E = {xk} ∪ {ck} and ||x − z|| < δ then there exists k such that m = 1 and y1 = ck satisfy the conditions (70)-(73) in lemma 8. Therefore, lemmas 6 and 8 imply that there exists a function F as required by theorem 2.

Proof of Theorem 3. Consider k0 and d obtained from lemma 7 and

Lemma 7 and the three equations in (26) imply that the expressions

define functions F, G and H with domain E, i.e., if j(wj, lj) = k(wk, lk) then

Lemmas 7 and 9 show that the functions F, G and H above satisfy the hypothesis of lemma 8 and theorem 3 follows from lemma 6.

Proof of Theorem 4. For r > 0, consider the compact convex set

Since

k Ç r Ç = for r + 1 < k < r + p - 1 there exists > 0 such that kÇ r Ç = for the same k and r. For each 0 < r < p there exist ar < 0 and br > 1 such that

r Ç = {Xr(0) + wSr(0), w Î [ar,br]}.

Extending the definition of ak and bk by periodicity, ak = ak mod p and bk = bk mod p , we obtain intervals [ak,bk] compatible with . Combining lemmas 1, 3 and 10 and theorem 3 we obtain k1 and Î Lip2(n) such that if k > k1 then (xk) = fk, Ñ(xk) = gk, Ñ2(xk) = hk and Ñ2(xk + wsk) sk > 0 for w Î [ak,bk].

The points Xk(0) belong to the interior of the compact convex set and there exist ck < 0 and dk > 1 such that ck+p = ck, dk+p = dk and

k Ç = Xk(0) + wSk(0), w Î [ck,dk]

and vectors Uk, Vk such that Uk+p = Uk, Vk+p = Vk, Sk(0) < 0, Sk(0) > 0 and

Let L be a Lipschitz constant for the second derivatives of , µ > 0 such that

and t: ® be the function t(w) = max(w,0)3. The function

and k0 > k1 such that and ,

(xk + bksk - Xk(0) - ckSk(0)) > 0 and (xk + bksk - Xk(0) - dkSk(0)) > 0

for k > k0 are as required by theorem 4. In fact, F coincides with in and if k > k0 and w > bk then

Proof of Lemma 1. Let F be the function obtained by applying theorem 2 to . Lemma 12 applied to the functions = Fk - F, Y1 = Xk and Y2 = Xk+1 yield functions k and k such that Gk(x) = ÑF(x) + k(x) and Hk(x) = Ñ2F(x) + k(x) are as claimed in lemma 1.

Proof of Lemma 2. Equations (34)-(37) show that

where ei Î n is the vector with eii = 1 and eij = 0 for i ¹ j, Item 4b in definition 7 implies that if T = {(i,j) such that di + dj < dn} then

Since Qp = I and k+p = k, the Xk in (34) satisfy item (a) in the definition of dandelion. The relations k+p = k, k+p = k and k+p = k imply that fk, gk and hk accumulate at the limits jk, gk and qk in the last column of (34)-(37) and these limits satisfy (22). We now verify (23)-(25). Equation (85) yields (23). The first equation in (84) leads to

where J = {j | dj > 0}. The second equation in (84) and (85) imply that

for U = {u |du = dn - 1} and equations (87) and (32) show that the gk satisfy (24). Equations (38), (39) and (86) lead to

for L and S in item 4b of definition 7, and (25) follows from (33). Finally, the linear independence requirements in items (b) and (c) of definition 4 follow from item 3 in definition 7.

Proof of Lemma 3. Direct computation using (34)-(37) and (41) show that the function Fk satisfies (30).

Proof of Lemma 4. To prove this lemma, plug (34)-(37) into the expressions in the hypothesis of lemma 4 and compare the results with (42)-(43).

Proof of Lemma 5. The matrix Z in (52) can be written as

where the Zi's are the following 1 × 1 and 2 × 2 blocks:

If we identify the blocks Zi with the complex numbers

then (52) can be interpreted as the equations

in the complex variables

j,k given by

In the case rk+11 = rk and r+11 = r equation (91) is equivalent to

where

j = (j,0, j,1, j,2, j,3, ..., j,7, j,10)t and

Equation (94) suggests that we take r0, r1, ..., r10 such that the corresponding matrices A in (95) are singular. Given such r's we can find vectors j ¹ 0 that satisfy (94) and use them to define the vectors taking real and imaginary parts of (92)-(93). This approach leads to the polynomial equations

on the yr = 1/ρr. To obtain accurate approximations for appropriate ρr we take

y6 = -1.9948, y7 = -0.3737, y8 = -1.2355, y9 = 0.9857, y10 = 0.11717

and apply Newton's method to find the remaining yi. Expression (96) corresponds to a system of six real equations and Newton's method starting with

y0 = 2.04831, y1 = 3.33798, y2 = -1.15867, y3 = 0.300795, y4 = -0.634211, y5 = -2.44761,

converges quickly to a solution of this system. This approach led to rational approximations to the yi at which ||det(A(·, −zj))|| < 10⁻¹⁰⁰⁰ for j = 1, 2, 3, 4 and 5 and at which the Jacobian of the system (96) is well conditioned. A standard argument using Kantorovich's theorem proves the existence of an exact y in a neighborhood of radius 10⁻⁵⁰⁰ of our rational approximation. Using the approximation we dropped the last row in each of the matrices A(y, −zj) and computed highly accurate approximations for the five complex vectors in (94), normalized so that their first entry equals 1. Using (92)-(93) we obtained accurate approximations to the vectors required by lemma 5. Using these approximations we verified that the vectors in (53) are indeed linearly independent and solved the systems (54). Finally, we computed approximations for the quantities in (55)-(56) and verified that the corresponding lines ℓk and ℓr in the hypothesis of theorem 4 are at least 10⁻⁵ apart. Our computations indicate that the exact lines ℓk and ℓr do not cross. We did a rigorous sensitivity analysis of the computations above and it indicated that 500 digits of precision are enough to guarantee that our conclusions apply to the exact values of all the quantities involved.
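The workflow just described (fix some of the variables, solve the remaining equations by Newton's method in very high precision, then certify the root) can be reproduced with standard multiprecision software. The sketch below is generic: the two-variable polynomial system is a placeholder, not the system (96), which is not reproduced here.

from mpmath import mp, mpf, findroot

mp.dps = 500                      # work with 500 significant digits

def placeholder_system(y0, y1):
    # stands in for the real polynomial system obtained from (96)
    return [y0 ** 2 + y1 ** 2 - mpf(4), y0 - y1 - mpf(1)]

root = findroot(placeholder_system, (mpf('1.8'), mpf('0.8')))
print(root)                       # a root accurate to roughly 500 digits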

Proof of Lemma 6. Applying the version of Whitney's theorem in page 6 of [8] to w(t) = t and a properly scaled version of the polynomials

Px(y) = F(x) + G(x)t(y - x) + (y - x)tH(x)(y - x)

we obtain a function Î Lip2(n) and a constant C > 0 such that (x) = F(x), Ñ(x) = ÑF(x) and Ñ2(x) = Ñ2F(x) for all x Î E and ||Ñ2(x)|| < 1/C for all x Î n. Let R > 0 be such that 2||x|| < R for all x Î E and consider a polynomial F such that

The constants C and R and the function defined by

satisfies the requirement of lemma 6.

Proof of Lemma 7. By compatibility of {[ak,bk]} and and periodicity (Xk+p = Xk), there exists > 0 for which the segments = {k(w, 0), w Î [ak,bk]} are such that dist > if r + 1 < k < r + p - 1. If d > 0 is small enough then the surfaces

are such that

The definition of dandelion and the Lipschitz continuity of the derivatives of Xk imply that for d > 0 small if v, z Î [0,d], k Î and a, b, c Î then

We also have that

where Dk(v,z) satisfies the inequality

when v,z Î [0,d]. The linear independence requirements in the definition of dandelion imply that (0) ¹ 0 and (100) implies that if d > 0 is small and v,z Î [0,d] then

We now show that d > 0 for which the expressions (97)-(101) above are valid fulfils the requirements in lemma 7. Given yj and yk as in the hypothesis of lemma 7 there exists i Î [k,k - p) such that i º j mod p Since ||yj - yk|| < d and i = j the surfaces and are at most d apart and (97) leaves only the three possibilities: (a) i = k, (b) i = k + 1 or (c) i = k + p - 1. This corresponds to (66) and to complete this proof we now verify the bounds (67)-(69). In case (a) Xj = Xk and Sj = Sk and the definition of yj and yk in the hypothesis and (100) lead to

The bounds (98) and (101) and the abbreviation (wj - wk)Sk(zk) = G yield

and (67) follows from the equations G =

j(wj,zk) - yk and

yj - j(wj,zk) = (1 - wj) (Xk(zj) - Xk(zk)) + wj(Xk+1(zj) - Xk+1(zk)).

In case (b) Xj = Xk+1 and Sj = Sk+1 and (100) lead to

yj - yk = wj Sk+1(zj) + (zj - zk) X'j(zk) + (1 - wk)Sk(zk) + Dj(zk,zj)

and using (99) and (101) we deduce that

||yj - yk|| > 3d ( ||wj Sj(zj)|| + || (zj - zk) X'j(zk)|| + || (1 - wk)Sk(zk)|| )

-|| Dj(zk,zj)|| > d( ||wj Sj(zj)|| + ||Xj(zj) - Xj(zk)|| + || (1 - wk)Sk(zk)|| )

and (68) follows at once. Finally, case (c) is analogous to case (b).

Proof of Lemma 8. Given x,z Î E, let y1,...,ym be as in the hypothesis of lemma 8 and define

y0 = x, ym+1 = z and fj = F(yj), gj = G(yj) and hj = H(yj).

The bounds (71) and (70) yield

and (62) is obtained by taking j = m + 1 in (102). The identity

and the bounds (70), (72) and (102) and imply that

Taking j = m + 1 we deduce that x and z satisfy (63). Finally, notice that

for D = (yi + yi-1 - 2y0)th0(yi - yi-1). The last terms in (104) cancel because

Thus, if we take j = m + 1 then (70), (103) and (104) yield (64) for M = 3K4.

Proof of Lemma 9. The bounds (74) and (77) follow from the Lipschitz continuity of Hk. Equation (75) can be derived from the first equation in (29),

and the Lipschitz continuity of the first derivatives of Gk. Equation (76) is a consequence of the first equation in (27) and (29), (105) and the Lipschitz continuity of the second derivatives of Fk. The bounds (78) and (79) are clearly satisfied if j = k and from now on we assume that j ¹ k. In this case

and Xk(lj) = Xk(lk) + (0)(lj - lk) + (0) (l2j - l2k) + µj,k, where µj,k = (X''(z) - X''(0)) dzdx satisfies

for a Lipschitz constant L for . The bound (106) and Xk Î Lip2([0,1],n) lead to

for

The conditions Fk Î Lip2(2) and Gk Î Lip1(2,n) imply that

The bound (78) follows from the second equation in (29), (107) and (109). Finally, the bound (79) can be deduced from the second equation in (27), equations (28), (106), (107), equation (108) with l = j and l = k and equation (110).

Proof of Lemma 10. Given > 0, consider the function

where (w) is the piecewise cubic given by

with

i = gi - f1 + f0 and i = hi/2 for i Î {0,1}. The function belongs to Lip2() if and only if it has continuous second order derivatives at w = 2 and w = 1 - 2. This condition leads to a linear system of six equations on the six variables and . Solving this system we obtain

The second derivative of is a piecewise linear function with values

at the nodes w = 0, w = , w = 2, w = 1 - 2, w = 1 - and w = 1. The hypothesis implies that 0 < 0 and 1 > 0 and (111) shows that > 0 and < 0 if > 0 is small. Therefore, (112) implies that ''(w) > 0 for all w if is small enough.

Proof of Lemma 11. Applying Whitney's theorem to the set

E = {(0,y) | |y| < 3} È {(1,y) | |y| < 3} È {(x,0) | |x| < 3 } Ì 2

and the functions F: E ® and G : E ® 2 given by

F(w,0) = y(w,0), F(i,z) = 0, G(w,0) = Ñy(w,0), G(i,z) = 0,

for i = 0,1 we obtain a function Î Lip1(2) such that

(w,0) = y(w,0), Ñ(w,0) = Ñy(w,0), (i,z) = 0, Ñ(i,z) = 0

for i = 0,1 and |w|,|z| < 3. Let t : ® be a C¥ function such that t(x) = 1 for |x| < 2 and t(x) = 0 for |x| > 3. The function

f(w,z) = t(z) (t(w)(w,z) + (1 - t(w))y(w,z))

satisfies items (a) and (b) in lemma 11.

Proof of Lemma 12. Let Z1, Z2 and Z3 be the functions defined by

The vectors Z1(0), Z2(0) and Z3(0) are linearly independent. Therefore, there exist W1, W2, W3Î n such that

The implicit function theorem guarantees the existence of d > 0 and functions aij Î Lip1([0,d]), defined for 1 < i,j < 3, such that

and the vectors

are such that

for z Î [0,d]. Lemma 11 applied to y = yields f Î Lip1(2) such that

for w Î and i Î {0,1}. We now show that any G Î Lip1(2,n) such that

for z Î [0,d] satisfies (81) and (82). Equations (80) and (117) imply that G(i,z) = 0 and second and third equations in (81) follows from (114)-(118) and the definition of S(z) and V(w). To verify (82), notice that (117) and (118) lead to

Equation (116) yields (0)t Zj(0) + Ai(0)t (0) = 0 and (114)-(116) imply that

If j = 3 then (0) = (0) = (0) - (0) = Z2(0) -Z1(0) and (120) leads to

Reminding that (0) = (0) and using (116) and (120)-(121) we obtain

Equations (117) and (118) show that

and (82) follows from (116), (118), (122)-(124) and the fact that

To complete this proof we define H as

for

Equations (80) and (117) imply that H(i,z) = 0 for i Î {0,1} and (118) and (116) imply the second equation in (83). Using (117), (126) and (127)-(130) we get

Equations (113), (119) and (122)-(124) show that

Finally, equations (113) shows that the right hand side of (131) is the expansion of ¶Gz (w,0) on the basis {W1,W2,W3} of a tri-dimensional subspace that contains ¶Gz (w,0) and we have shown (83).

Received: 15/IV/06. Accepted: 15/I/07.

#699/06.

  • [1] P. Absil, R. Mahony and B. Andrews, Convergence of the iterates of descent methods for analytic functions. Siam Journal of Optimization, 16(2) (2005), 531-547.
  • [2] D. Bertsekas, Nonlinear Programming, second edition, second printing, Athena Scientific, Belmont, Massachusetts, 2003.
  • [3] B. Cantwell, Introduction to Symmetry Analysis, Cambridge University Press, Cambridge UK, (2002).
  • [4] A. Cauchy, Méthodes générales pour la résolution des systémes d'équations simultanées. C.R. Acad. Sci. Par., 25 (1847), 536-538.
  • [5] H. Curry, The Method of steepest descent for non-linear minimization problems. Quart. Appl. Math., 2 (1944), 258-261.
  • [6] Y. Dai, Convergence properties of the BFGS Algorithm. SIAM J. Optim., 13(3) (2002), 693-701.
  • [7] C. Fefferman, A generalized sharp Whitney theorem for jets. Revista Matemática Iberoamericana, 21(2) (2005).
  • [8] C. Fefferman, Extension of C^{m,ω}-Smooth Functions by Linear Operators, manuscript available at: www.math.princeton.edu/facultypapers/Fefferman. To appear in Revista Matemática Iberoamericana.
  • [9] C. Gonzaga, Two Facts on the convergence of the Cauchy Algorithm. Journal of Optimization Theory and Applications, 107(3) (2000), 591-600.
  • [10] L. Hörmander, The Analysis of Linear Partial Differential Operators I Springer Verlag, Heidelberg, 1983.
  • [11] J. Nocedal, Theory of algorithms for unconstrained optimization. Acta numerica, 1 (1992), 199-242.
  • [12] J. Nocedal and S. Wright, Numerical Optimization Springer series in operations research, Springer, 1999.
  • [13] W.F. Mascarenhas, The BFGS method with exact line searches fails for non-convex objective functions. Math. Prog. Ser B., 99 (2004), 49-61.
  • [14] W.F. Mascarenhas, Newton's iterates can converge to non-stationary points. Manuscript submitted to Mathematical Programming. Preliminary version available online at: www.ime.usp.br/~walterfm/pub/newtonFails.pdf
  • [15] M. Powell: Nonconvex minimization calculations and the conjugate gradient method. In: D.F. Griffiths, ed., Numerical Analysis, Lecture Notes in Mathematics 1066, Springer Verlag, Berlin, (1984), 122-141.
  • [16] M. Powell, Convergence Properties of Algorithms for Nonlinear Optimization SIAM Review, 28(4) (1986), 487-500.
  • [17] H. Whitney, Analytic extensions of differentiable functions defined in closed sets. Trans. Amer. Math. Soc., 36 (1934), 63-89.
  • 1 The matrices hk are not positive definite and in practice one would take another search direction if, for example, this fact was detected during a Cholesky factorization of hk. However, to keep the algebra as simple as possible, in this work we do not enforce the condition that the Hessians ∇²f(xk) are s.p.d. In [14] we show that Newton's method may fail even if ∇²f(xk) is s.p.d. for all k and the Wolfe conditions are satisfied.
  • 2 D′(λ) here is the derivative of D(λ) with respect to λ.
  • 3 By strict convexity along the search line we mean that the directional second derivatives skᵀ∇²F(xk + wsk)sk are positive.