
Computational & Applied Mathematics

On-line version ISSN 1807-0302

Comput. Appl. Math. vol.28 no.1 São Carlos  2009

 

New versions of the Hestenes-Stiefel nonlinear conjugate gradient method based on the secant condition for optimization

 

 

Li Zhang

College of Mathematics and Computational Science, Changsha University of Science and Technology, Changsha 410076, China, E-mail: mathlizhang@126.com

 

 


ABSTRACT

Based on the secant condition often satisfied by quasi-Newton methods, two new versions of the Hestenes-Stiefel (HS) nonlinear conjugate gradient method are proposed, which are descent methods even with inexact line searches. The search directions of the proposed methods have the form $d_k = -\theta_k g_k + \beta_k^{HS} d_{k-1}$ or $d_k = -g_k + \beta_k^{HS} d_{k-1} + \theta_k y_{k-1}$. When exact line searches are used, the proposed methods reduce to the standard HS method. Convergence properties of the proposed methods are discussed. These results are also extended to some other conjugate gradient methods such as the Polak-Ribière-Polyak (PRP) method. Numerical results are reported.
Mathematical subject classification: 90C30, 65K05.

Keywords: HS method, descent direction, global convergence.


 

 

1 Introduction

Assume that $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function whose gradient is denoted by $g$. The problem considered in this paper is

$$\min_{x \in \mathbb{R}^n} f(x). \tag{1.1}$$

The iterates for solving (1.1) are given by

$$x_{k+1} = x_k + \alpha_k d_k, \tag{1.2}$$

where the stepsize $\alpha_k$ is positive and computed by some line search, and $d_k$ is the search direction.

Conjugate gradient methods are very efficient iterative methods for solving (1.1), especially when $n$ is large. The search direction has the following form:

$$d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1}, & k > 0, \end{cases} \tag{1.3}$$

where $\beta_k$ is a parameter. Some well-known conjugate gradient methods include the Polak-Ribière-Polyak (PRP) method [17, 18], the Hestenes-Stiefel (HS) method [13], the Liu-Storey (LS) method [15], the Fletcher-Reeves (FR) method [8], the Dai-Yuan (DY) method [5] and the conjugate descent (CD) method [9]. In this paper, we are interested in the HS method, in which $\beta_k$ is defined by

$$\beta_k^{HS} = \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, \tag{1.4}$$

where $y_{k-1} = g_k - g_{k-1}$. Throughout the paper, we denote $s_{k-1} = x_k - x_{k-1} = \alpha_{k-1} d_{k-1}$, and $\|\cdot\|$ stands for the Euclidean norm. Convergence properties of conjugate gradient methods can be found in the book [6], the survey paper [12] and references therein.
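As a minimal sketch (not the author's code), the HS update can be written with NumPy as follows. It assumes the standard HS formula $\beta_k^{HS} = g_k^T y_{k-1} / (d_{k-1}^T y_{k-1})$; the function names `beta_hs` and `hs_direction` and the toy vectors are illustrative only.

```python
import numpy as np

def beta_hs(g_new, g_old, d_old):
    """HS parameter g_k^T y_{k-1} / (d_{k-1}^T y_{k-1}); assumes a nonzero denominator."""
    y = g_new - g_old                       # y_{k-1} = g_k - g_{k-1}
    return float(g_new @ y) / float(d_old @ y)

def hs_direction(g_new, g_old, d_old):
    """Conjugate gradient direction d_k = -g_k + beta_k^{HS} d_{k-1}."""
    return -g_new + beta_hs(g_new, g_old, d_old) * d_old

# Toy vectors (illustrative values only).
g_old = np.array([1.0, 2.0])
d_old = -g_old                              # d_0 = -g_0
g_new = np.array([0.5, -1.0])
d_new = hs_direction(g_new, g_old, d_old)
conjugacy = float(d_new @ (g_new - g_old))  # d_k^T y_{k-1}, zero up to rounding
```

By construction $d_k^T y_{k-1} = -g_k^T y_{k-1} + \beta_k^{HS} d_{k-1}^T y_{k-1} = 0$, so `conjugacy` vanishes for any input with a nonzero denominator.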

The HS method behaves like the PRP method in practical computation and is generally regarded as one of the most efficient conjugate gradient methods. An important feature of the HS method is that it satisfies the conjugacy condition

$$d_k^T y_{k-1} = 0, \tag{1.5}$$

which is independent of the objective function and line search. However, Dai and Liao [4] pointed out that in the case $g_k^T d_{k-1} \neq 0$, the conjugacy condition (1.5) may have some disadvantages (for instance, see [21]). In order to construct a better formula for $\beta_k$, Dai and Liao proposed a new conjugacy condition and a new conjugate gradient method called the Dai-Liao (DL) method, given by

$$\beta_k^{DL} = \frac{g_k^T (y_{k-1} - t s_{k-1})}{d_{k-1}^T y_{k-1}}, \tag{1.6}$$

with a parameter $t \in [0, \infty)$. Based on the idea of the DL method, Hager and Zhang [11] proposed a descent conjugate gradient method (HZ).
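The following sketch illustrates the DL parameter numerically. It assumes the usual Dai-Liao formula $\beta_k^{DL} = g_k^T (y_{k-1} - t s_{k-1}) / (d_{k-1}^T y_{k-1})$ and checks that the resulting direction satisfies the Dai-Liao conjugacy condition $d_k^T y_{k-1} = -t\, g_k^T s_{k-1}$. The name `beta_dl` and all numerical values are hypothetical.

```python
import numpy as np

def beta_dl(g_new, g_old, d_old, alpha_old, t):
    """Dai-Liao parameter g_k^T (y_{k-1} - t s_{k-1}) / (d_{k-1}^T y_{k-1})."""
    y = g_new - g_old                       # y_{k-1}
    s = alpha_old * d_old                   # s_{k-1} = alpha_{k-1} d_{k-1}
    return float(g_new @ (y - t * s)) / float(d_old @ y)

g_old = np.array([1.0, 2.0])
d_old = -g_old
g_new = np.array([0.5, -1.0])
alpha_old, t = 0.3, 0.1                     # illustrative values only

d_new = -g_new + beta_dl(g_new, g_old, d_old, alpha_old, t) * d_old
y, s = g_new - g_old, alpha_old * d_old
lhs = float(d_new @ y)                      # d_k^T y_{k-1}
rhs = -t * float(g_new @ s)                 # -t g_k^T s_{k-1}
```

With `t = 0` the function returns the HS parameter, which matches the remark that the DL method generalizes HS.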

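Besides conjugate gradient methods, the following gradient type methods

$$d_k = \begin{cases} -g_k, & k = 0, \\ -\theta_k g_k + \beta_k d_{k-1}, & k > 0, \end{cases} \tag{1.7}$$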

have also been studied extensively by many authors. Here θk and βk are two parameters. Clearly, if θk = 1, the methods (1.7) become conjugate gradient methods (1.3). When

the methods (1.7) reduce to spectral conjugate gradient methods [2] and scaled conjugate gradient methods [1]. Yuan and Stoer [21] proposed a subspace method to compute the parameters $\theta_k$ and $\beta_k$ which solve the following subproblem

$$\min_{d \in \Omega_k} \; g_k^T d + \frac{1}{2} d^T B_k d,$$

where $\Omega_k = \mathrm{Span}\{g_k, d_{k-1}\}$ and $B_k$ is a suitable quasi-Newton matrix such as the memoryless BFGS update matrix [19]. Zhang et al. [23] proposed a modified FR method where the parameters in (1.7) are given by

$$\theta_k = \frac{d_{k-1}^T y_{k-1}}{\|g_{k-1}\|^2}, \qquad \beta_k = \beta_k^{FR} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2}.$$

This method satisfies $g_k^T d_k = -\|g_k\|^2$, and this property depends neither on the line search used nor on the convexity of the objective function. Moreover, this method converges globally for nonconvex functions with the Armijo or Wolfe line search.
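The descent identity of the modified FR method can be checked numerically. The sketch below assumes the parameter choice $\theta_k = d_{k-1}^T y_{k-1} / \|g_{k-1}\|^2$ and $\beta_k = \|g_k\|^2 / \|g_{k-1}\|^2$ (the form used in [23], stated here as an assumption) and starts from $d_0 = -g_0$. The "gradients" are arbitrary vectors, since the identity does not depend on any particular objective function.

```python
import numpy as np

# Arbitrary vectors standing in for successive gradients (illustrative only).
gradients = [np.array(v) for v in (
    [1.0, 2.0, -1.0], [0.5, -1.0, 2.0], [-0.3, 0.4, 0.1], [0.2, 0.1, -0.6])]

g_prev = gradients[0]
d_prev = -g_prev                            # d_0 = -g_0
residuals = []
for g in gradients[1:]:
    y = g - g_prev                          # y_{k-1}
    theta = (d_prev @ y) / (g_prev @ g_prev)
    beta_fr = (g @ g) / (g_prev @ g_prev)   # Fletcher-Reeves parameter
    d = -theta * g + beta_fr * d_prev       # direction of the form (1.7)
    residuals.append(abs(g @ d + g @ g))    # |g_k^T d_k + ||g_k||^2|
    g_prev, d_prev = g, d
```

The identity follows by induction: $g_k^T d_k = \|g_k\|^2\, g_{k-1}^T d_{k-1} / \|g_{k-1}\|^2$, which equals $-\|g_k\|^2$ once the previous direction satisfies the same relation.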

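Recently, based on the direction generated by the memoryless BFGS update matrix [19], Zhang et al. [22, 24] proposed the following three-term conjugate gradient type method,

$$d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1} + \theta_k y_{k-1}, & k > 0, \end{cases} \tag{1.8}$$

where $\theta_k$ and $\beta_k$ are two parameters. If the parameters in (1.8) are given by

$$\beta_k = \beta_k^{PRP} = \frac{g_k^T y_{k-1}}{\|g_{k-1}\|^2}, \qquad \theta_k = -\frac{g_k^T d_{k-1}}{\|g_{k-1}\|^2},$$

then this method becomes the modified PRP method [22]. If the parameters in (1.8) are chosen as

$$\beta_k = \beta_k^{HS}, \qquad \theta_k = -\frac{g_k^T d_{k-1}}{d_{k-1}^T y_{k-1}},$$

then we get the three-term HS method [24]. Both methods still retain the relation $g_k^T d_k = -\|g_k\|^2$ and performed well in practical computations.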
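A short numerical check of the three-term identity, offered as a sketch rather than as the implementation of [22, 24]. It assumes the direction $d_k = -g_k + \beta_k d_{k-1} + \theta_k y_{k-1}$ with $\theta_k = -g_k^T d_{k-1} / D$ and $\beta_k = g_k^T y_{k-1} / D$, where $D = \|g_{k-1}\|^2$ for the PRP-type choice and $D = d_{k-1}^T y_{k-1}$ for the HS-type choice. The vectors are arbitrary.

```python
import numpy as np

g_prev = np.array([1.0, 2.0, -1.0])         # illustrative vectors, not from the paper
d_prev = np.array([-1.0, -1.5, 0.5])
g = np.array([0.5, -1.0, 2.0])
y = g - g_prev                              # y_{k-1}

def three_term(beta, theta):
    """Direction d_k = -g_k + beta_k d_{k-1} + theta_k y_{k-1}."""
    return -g + beta * d_prev + theta * y

# PRP-type choice: both parameters share the denominator ||g_{k-1}||^2.
den_prp = g_prev @ g_prev
d_prp = three_term((g @ y) / den_prp, -(g @ d_prev) / den_prp)

# HS-type choice: both parameters share the denominator d_{k-1}^T y_{k-1}.
den_hs = d_prev @ y
d_hs = three_term((g @ y) / den_hs, -(g @ d_prev) / den_hs)

res_prp = float(g @ d_prp + g @ g)          # g_k^T d_k + ||g_k||^2
res_hs = float(g @ d_hs + g @ g)
```

Unlike the two-term case, no induction is needed here: the $\beta_k$ and $\theta_k$ terms cancel exactly in $g_k^T d_k$, whatever the previous direction was.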

In this paper, we are concerned with the methods (1.7) and (1.8) with the parameter $\beta_k = \beta_k^{HS}$. We then try to construct a new $\theta_k$ by using the idea of the DL method [4].

This paper is organized as follows. In Section 2, we present new formulas for θk and corresponding algorithms. In Section 3, we analyze global convergence properties of the proposed methods with some inexact line searches. In Section 4, we extend the results of Section 2 and Section 3 to other conjugate gradient methods. In Section 5, we report numerical comparisons with existing conjugate gradient methods by using problems in the CUTE library [3].

 

2 New formula for θk and algorithms

In this section, we first describe the following two-term HS conjugate gradient type method,

$$d_k = -\theta_k g_k + \beta_k^{HS} d_{k-1}, \tag{2.1}$$

where, for convenience, we write θk = 1 +

In order to introduce our method, let us briefly recall the conjugacy condition proposed by Dai and Liao [4]. Linear conjugate gradient methods generate search directions such that the conjugacy condition holds, namely,

$$d_i^T Q d_j = 0, \quad \forall\, i \neq j, \tag{2.2}$$

where $Q$ is the symmetric and positive definite Hessian matrix of the quadratic objective function $f(x)$. For general nonlinear functions, it follows from the mean value theorem that there exists some $\tau \in (0, 1)$ such that

$$d_k^T y_{k-1} = \alpha_{k-1}\, d_k^T \nabla^2 f(x_{k-1} + \tau \alpha_{k-1} d_{k-1})\, d_{k-1}.$$

Therefore it is reasonable to replace (2.2) by the following conjugacy condition:

$$d_k^T y_{k-1} = 0. \tag{2.3}$$

Dai and Liao [4] used the secant condition of quasi-Newton methods, that is,

$$H_k y_{k-1} = s_{k-1}, \tag{2.4}$$

where $H_k$ is an approximation to the inverse Hessian. For quasi-Newton methods, the search direction $d_k$ can be calculated in the form

$$d_k = -H_k g_k. \tag{2.5}$$

By the use of (2.4) and (2.5), we get that

$$d_k^T y_{k-1} = -(H_k g_k)^T y_{k-1} = -g_k^T (H_k y_{k-1}) = -g_k^T s_{k-1}.$$

The above relation implies that (2.3) holds if the line search is exact, since in this case $g_k^T s_{k-1} = 0$. However, practical numerical algorithms normally adopt inexact line searches instead of exact ones. For this reason, Dai and Liao replaced the above conjugacy condition by

$$d_k^T y_{k-1} = -t\, g_k^T s_{k-1}, \tag{2.6}$$

where $t$ is a scalar. If we substitute (1.3) into (2.6), we get the formula for $\beta_k^{DL}$ in (1.6).
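For completeness, the substitution mentioned above can be worked out in two lines. The sketch assumes the standard forms of the two ingredients: the direction $d_k = -g_k + \beta_k d_{k-1}$ and the Dai-Liao condition $d_k^T y_{k-1} = -t\, g_k^T s_{k-1}$, with $d_{k-1}^T y_{k-1} \neq 0$.

```latex
\begin{align*}
d_k^T y_{k-1}
  &= -g_k^T y_{k-1} + \beta_k\, d_{k-1}^T y_{k-1} = -t\, g_k^T s_{k-1} \\
\Longrightarrow \quad
\beta_k
  &= \frac{g_k^T y_{k-1} - t\, g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}
   = \beta_k^{HS} - t\, \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}} .
\end{align*}
```

Setting $t = 0$ recovers the HS parameter, and with an exact line search the correction term vanishes because $g_k^T s_{k-1} = 0$.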

In order to get the formula for θk in our method, substituting (2.1) into (2.6), we have

Set

where h is a parameter. We get from the above two equalities that

Now let

with ρ ∈ [0, 1]. Then, we have

For convenience, we summarize the above method as the following algorithm which we call the two-term HS method.

Algorithm 2.1 (two-term HS Method):

Step 0: Given the constant $\rho \in [0, 1]$, choose an initial point $x_0 \in \mathbb{R}^n$. Let $k := 0$.

Step 1: Compute dk by

where

Step 2: Determine αk by some line search.

Step 3: Let the next iterate be xk+1 = xk + αkdk.

Step 4: Let $k := k + 1$. Go to Step 1.

By the same argument as for Algorithm 2.1, we obtain the following three-term HS method. In the rest of this paper, we only give the direction in each algorithm, since the other steps are the same as in Algorithm 2.1 and conjugate gradient methods are mainly determined by their search directions.

Algorithm 2.2 (three-term HS Method):

where

Remark 2.1. It is interesting to note that when $\rho = 0$ in (2.7)-(2.8) or (2.9)-(2.10), we have

$$g_k^T d_k = -\|g_k\|^2,$$

which is independent of any line search and convexity of the objective function. In this case, Algorithm 2.2 reduces to the three-term HS method [24].

Remark 2.2. If an exact line search is used, it is easy to see that Algorithm 2.1 and Algorithm 2.2 reduce to the standard HS method.
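The iteration of Algorithm 2.1 can be sketched for the special case ρ = 0 only. Reading Remark 2.1 as the identity $g_k^T d_k = -\|g_k\|^2$ (an assumption here), a direction of the form $d_k = -\theta_k g_k + \beta_k^{HS} d_{k-1}$ must use $\theta_k = 1 + \beta_k^{HS}\, g_k^T d_{k-1} / \|g_k\|^2$. The general formula for ρ > 0 is not reproduced. The quadratic test problem, the damped step and the name `two_term_hs_rho0` are hypothetical choices made for illustration, not the paper's experimental setup.

```python
import numpy as np

Q = np.diag([1.0, 4.0, 9.0])                # assumed test problem: f(x) = 0.5 x^T Q x - b^T x
b = np.ones(3)

def f(x):
    return 0.5 * x @ Q @ x - b @ x

def grad(x):
    return Q @ x - b

def two_term_hs_rho0(x, damping=0.5, max_iter=100, tol=1e-6):
    """Two-term HS iteration for rho = 0 with a deliberately inexact step."""
    g = grad(x)
    d = -g                                   # d_0 = -g_0
    f_values, residuals = [f(x)], []
    for _ in range(max_iter):
        residuals.append(abs(g @ d + g @ g)) # |g_k^T d_k + ||g_k||^2|
        # Inexact step: a fraction of the exact minimizer along d (quadratics only).
        alpha = damping * (-(g @ d) / (d @ Q @ d))
        x = x + alpha * d
        f_values.append(f(x))
        g_new = grad(x)
        if np.linalg.norm(g_new, np.inf) < tol:
            break
        y = g_new - g                        # y_{k-1}
        beta = (g_new @ y) / (d @ y)         # HS parameter
        theta = 1.0 + beta * (g_new @ d) / (g_new @ g_new)
        d = -theta * g_new + beta * d
        g = g_new
    return x, f_values, residuals

x_final, f_values, residuals = two_term_hs_rho0(np.zeros(3))
```

With `damping=1.0` the step is exact, so $g_k^T d_{k-1} = 0$, $\theta_k = 1$, and the iteration coincides with the standard HS method, in line with Remark 2.2. With `damping=0.5` the step is inexact, yet the recorded residuals stay at rounding level and the function values decrease monotonically.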

 

3 Convergence properties

In this section, we only analyze convergence properties of Algorithm 2.1. The corresponding results for Algorithm 2.2 can be obtained by the same argument. In the global convergence analysis of many iterative methods, the following assumption is often needed.

Assumption A.

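The level set $\Omega = \{x \in \mathbb{R}^n \mid f(x) \le f(x_0)\}$ is bounded.

In some neighborhood $N$ of $\Omega$, $f$ is continuously differentiable and its gradient is Lipschitz continuous, namely, there exists a constant $L > 0$ such that

$$\|g(x) - g(y)\| \le L \|x - y\|, \quad \forall\, x, y \in N. \tag{3.1}$$

Clearly, Assumption A implies that there exists a constant $\gamma$ such that

$$\|g(x)\| \le \gamma, \quad \forall\, x \in \Omega. \tag{3.2}$$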

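In order to ensure global convergence of Algorithm 2.1, we need some line search to compute the stepsize $\alpha_k$. The Wolfe line search consists of finding $\alpha_k$ satisfying

$$\begin{aligned} f(x_k + \alpha_k d_k) &\le f(x_k) + \delta \alpha_k g_k^T d_k, \\ g(x_k + \alpha_k d_k)^T d_k &\ge \sigma g_k^T d_k. \end{aligned} \tag{3.3}$$

The strong Wolfe line search consists of finding $\alpha_k$ such that

$$\begin{aligned} f(x_k + \alpha_k d_k) &\le f(x_k) + \delta \alpha_k g_k^T d_k, \\ |g(x_k + \alpha_k d_k)^T d_k| &\le -\sigma g_k^T d_k, \end{aligned} \tag{3.4}$$

where $0 < \delta < \sigma < 1$ are constants.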
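The two line search criteria can be checked for a trial step with a few lines of code. This is a sketch under the usual textbook statement of the conditions: sufficient decrease $f(x_k + \alpha d_k) \le f(x_k) + \delta \alpha g_k^T d_k$, curvature $g(x_k + \alpha d_k)^T d_k \ge \sigma g_k^T d_k$, and the strong variant $|g(x_k + \alpha d_k)^T d_k| \le -\sigma g_k^T d_k$. The helper name `wolfe_flags` and the default values `delta=1e-4`, `sigma=0.9` are assumptions; the text only requires 0 < δ < σ < 1.

```python
import numpy as np

def wolfe_flags(f, grad, x, d, alpha, delta=1e-4, sigma=0.9):
    """Return (sufficient_decrease, wolfe_curvature, strong_curvature) for step alpha."""
    g0_d = float(grad(x) @ d)                # g_k^T d_k, negative for a descent direction
    x_new = x + alpha * d
    g1_d = float(grad(x_new) @ d)            # g(x_k + alpha d_k)^T d_k
    decrease = bool(f(x_new) <= f(x) + delta * alpha * g0_d)
    curvature = bool(g1_d >= sigma * g0_d)   # second Wolfe inequality
    strong = bool(abs(g1_d) <= -sigma * g0_d)  # second strong Wolfe inequality
    return decrease, curvature, strong

# Toy problem f(x) = 0.5 ||x||^2 (illustrative only).
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x0 = np.array([1.0, 0.0])
d0 = -grad(x0)

flags_full = wolfe_flags(f, grad, x0, d0, alpha=1.0)    # exact minimizer along d0
flags_long = wolfe_flags(f, grad, x0, d0, alpha=1.95)   # overshoots the minimizer
flags_short = wolfe_flags(f, grad, x0, d0, alpha=0.01)  # too short for the curvature test
```

The overshooting step passes the Wolfe test but fails the strong Wolfe test, which shows how the strong version additionally excludes steps that land where the directional derivative is large and positive.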

The following lemma, called the Zoutendijk condition, is often used to prove global convergence of conjugate gradient methods. It was originally given by Zoutendijk [25] and Wolfe [20].

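Lemma 3.1. Let Assumption A hold, let $\{x_k\}$ be generated by (1.2), and let $d_k$ satisfy $g_k^T d_k < 0$. If $\alpha_k$ satisfies the Wolfe conditions (3.3) or the strong Wolfe conditions (3.4), then we have

$$\sum_{k=0}^{\infty} \frac{(g_k^T d_k)^2}{\|d_k\|^2} < +\infty. \tag{3.5}$$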

In the global convergence analysis for many methods, the sufficient descent condition plays an important role. The following result shows that Algorithm 2.1 produces sufficient descent directions.

Lemma 3.2. Let $\{x_k\}$ and $\{d_k\}$ be generated by Algorithm 2.1, and let $\alpha_k$ be obtained by the Wolfe line search (3.3). If $\rho \in [0,1)$, then we have

Moreover, if $\rho = 1$, then $g_k^T d_k < 0$.

Proof. We have from (2.7) and the definition of $\beta_k^{HS}$ in (1.4) that

which implies that

Since

and the Wolfe line search implies

from (3.8), we have (3.6) by induction. The proof is then finished.

By the use of the first equality in (3.8) and the second inequality in the strong Wolfe line search (3.4), we have the following result.

Lemma 3.3. Let {xk} and {dk} be generated by Algorithm 2.1, and let αk be obtained by the strong Wolfe line search (3.4) with σ < . If ρ = 1, then we have

The following theorem establishes global convergence of Algorithm 2.1 for strongly convex functions.

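Theorem 3.4. Suppose Assumption A holds and $f$ is strongly convex on $N$, that is, there exists a constant $\mu > 0$ such that

$$(g(x) - g(y))^T (x - y) \ge \mu \|x - y\|^2, \quad \forall\, x, y \in N. \tag{3.10}$$

If $\rho \in [0,1)$, then the sequence $\{x_k\}$ generated by Algorithm 2.1 with the Wolfe line search (3.3) satisfies $\lim_{k \to \infty} \|g_k\| = 0$.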

Proof. It follows from (3.10) and (3.1) that

Now we begin to estimate θk and βk in (2.7). It follows from (1.4), (3.1) and (3.11) that

It follows from (2.8), (3.12) and the second inequality in (3.3) that

The above inequality together with (3.12) implies that

We have from (3.6) and (3.5) that

It follows from the above inequality and (3.13) that

which means $\lim_{k \to \infty} \|g_k\| = 0$. The proof is then completed.

By Lemma 3.3 and the same argument as in the above theorem, we have the following corollary.

Corollary 3.5. Suppose Assumption A holds and $f$ is strongly convex on $N$. If $\rho = 1$ and $\alpha_k$ is determined by the strong Wolfe line search (3.4) with $\sigma <$ , then the sequence $\{x_k\}$ generated by Algorithm 2.1 satisfies $\lim_{k \to \infty} \|g_k\| = 0$.

In order to ensure global convergence of Algorithm 2.1 for nonconvex functions, we adopt the idea of the MBFGS method proposed by Li and Fukushima [14] and modify Algorithm 2.1, replacing yk-1 in (2.7) by

where ε1 is a small positive constant. For convenience, we present this modified version as the following algorithm.

Algorithm 3.1 (modified two-term HS Method):

where

An important property of zk-1 is that, when the Wolfe line search is used, it satisfies

This inequality is the same as (3.11) and plays the same role in the proof of global convergence of Algorithm 3.1 for nonconvex functions. By the use of (3.17) and the same arguments as in Theorem 3.4, we have the following strong global convergence result for Algorithm 3.1 for nonconvex objective functions.

Theorem 3.6. Suppose Assumption A holds. Then the sequence $\{x_k\}$ generated by Algorithm 3.1 with the Wolfe line search (3.3) satisfies $\lim_{k \to \infty} \|g_k\| = 0$.

Another technique to guarantee global convergence of conjugate gradient methods for general nonlinear functions is to restrict $\beta_k$ to be nonnegative, as in the PRP+ and HS+ methods [10]. In fact, if we replace $\beta_k^{HS}$ in Algorithm 2.1 by $\beta_k^{HS+} = \max\{0, \beta_k^{HS}\}$, we have the following algorithm, which we call the two-term HS+ method.

Algorithm 3.2 (two-term HS+ Method):

where

Using the same argument as Theorem 4.3 in [10], we can get the following global convergence result. Here we omit its proof.

Theorem 3.7. Suppose Assumption A holds. Then the sequence $\{x_k\}$ generated by Algorithm 3.2 with the strong Wolfe line search (3.4) satisfies

$$\liminf_{k \to \infty} \|g_k\| = 0.$$

 

4 Applications

In this section, we extend the results on new versions of the HS method in Sections 2 and 3 to some well-known conjugate gradient methods. For instance, if we replace the term $d_{k-1}^T y_{k-1}$ in Step 1 of Algorithm 2.1 and Algorithm 2.2 by $\|g_{k-1}\|^2$ or $-g_{k-1}^T d_{k-1}$, we get new versions of the PRP and LS methods, respectively.
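The relation between the three parameters can be made concrete with a small sketch. It assumes the standard definitions, in which HS, PRP and LS share the numerator $g_k^T y_{k-1}$ and differ only in the denominator: $d_{k-1}^T y_{k-1}$, $\|g_{k-1}\|^2$ and $-g_{k-1}^T d_{k-1}$, respectively. The function name `cg_beta` and the vectors are illustrative.

```python
import numpy as np

def cg_beta(kind, g_new, g_old, d_old):
    """HS, PRP and LS parameters: same numerator g_k^T y_{k-1}, different denominators."""
    y = g_new - g_old                        # y_{k-1}
    denominators = {
        "HS": d_old @ y,                     # d_{k-1}^T y_{k-1}
        "PRP": g_old @ g_old,                # ||g_{k-1}||^2
        "LS": -(g_old @ d_old),              # -g_{k-1}^T d_{k-1}
    }
    return float(g_new @ y) / float(denominators[kind])

g_old = np.array([1.0, 2.0])
d_old = -g_old                               # steepest descent start
g_inexact = np.array([0.5, -1.0])            # g_k^T d_{k-1} != 0
g_exact = np.array([2.0, -1.0])              # g_k^T d_{k-1} = 0, as after an exact line search

kinds = ("HS", "PRP", "LS")
betas_inexact = {k: cg_beta(k, g_inexact, g_old, d_old) for k in kinds}
betas_exact = {k: cg_beta(k, g_exact, g_old, d_old) for k in kinds}
```

In this toy case PRP and LS agree because $d_{k-1} = -g_{k-1}$, and all three agree when $g_k^T d_{k-1} = 0$, since then $d_{k-1}^T y_{k-1} = -g_{k-1}^T d_{k-1}$.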

Algorithm 4.1 (two-term PRP Method):

where

Algorithm 4.2 (three-term PRP Method):

where

Algorithm 4.3 (two-term LS Method):

where

Algorithm 4.4 (three-term LS Method):

where

Algorithm 4.5 (two-term FR Method):

where

Remark 4.1. When ρ = 0 in the above algorithms, the search direction satisfies the sufficient descent condition $g_k^T d_k = -\|g_k\|^2$, which is also independent of any line search and convexity of the objective function. Moreover, in this case, Algorithms 4.2 and 4.3 are identical and reduce to the modified PRP method [22], and Algorithm 4.5 becomes the modified FR method [23]. It is clear that these methods reduce to the corresponding conjugate gradient methods if an exact line search is used.

Remark 4.2. Global convergence properties of these algorithms are similar to those of Algorithm 2.1 or Algorithm 2.2. Here we only analyze Algorithm 4.1 and Algorithm 4.5.

The next result shows that the direction generated by Algorithm 4.1 or Algorithm 4.5 satisfies the sufficient descent condition if the strong Wolfe line search (3.4) is used.

Lemma 4.1. Let {xk} and {dk} be generated by Algorithm 4.1 or Algorithm 4.5 with the strong Wolfe line search (3.4). If ρ < , then for all k, we have that

Proof. We prove (4.5) by induction. Since $g_0^T d_0 = -\|g_0\|^2$, the relation (4.5) holds with $k = 0$. It follows from (4.1) or (4.3) that

The above equality with the second inequality in (3.4) implies that

Repeating the same process, we have that

which shows that (4.5) holds. The proof is then completed.

Remark 4.3. If we replace $\beta_k^{PRP}$ in Algorithm 4.1 by $\beta_k^{PRP+} = \max\{0, \beta_k^{PRP}\}$, then this restricted algorithm converges globally for nonconvex functions by Lemma 4.1 and the argument of Theorem 4.3 in [10].

The next result is based on the work of [23]. Here we also omit its proof.

Theorem 4.2. Suppose Assumption A holds. Let $\{x_k\}$ and $\{d_k\}$ be generated by Algorithm 4.5 with the strong Wolfe line search (3.4). If $\rho <$ , then we have $\liminf_{k \to \infty} \|g_k\| = 0$.

 

5 Numerical results

In this section, we compare the performance of the proposed methods with those of the PRP+ method developed by Gilbert and Nocedal [10], and the CG_DESCENT method proposed by Hager and Zhang [11].

The PRP+ code was obtained from Nocedal's web page at http://www.ece.northwestern.edu/~nocedal/software.html, and the CG_DESCENT code from Hager's web page at http://www.math.ufl.edu/~hager/. The PRP+ code is coauthored by Liu, Nocedal and Waltz, and the CG_DESCENT code is coauthored by Hager and Zhang. The test problems are unconstrained problems in the CUTE library [3].

We stop the iteration if the inequality $\|g(x_k)\| < 10^{-6}$ is satisfied. All codes were written in Fortran and run on a PC with a 2.66 GHz CPU, 1 GB of RAM and the Linux operating system. Tables 1 and 2 list all numerical results. For convenience, we give the meanings of these methods in the tables.

 

 

 

 

  • "cg-descent" stands for the CG-DESCENT method with the approximate Wolfe line search [11]. Here we use the Source code Fortran 77 Version 1.4 (November 14, 2005) on Hager's web page and default parameters there;

  • "Algorithm 2.1" is Algorithm 2.1 with ρ = 1 and the same line search as "cg-descent";

  • "prp+" means the PRP+ method with the strong Wolfe line search proposed by Moré and Thuente [16].

In order to get relatively better ρ values in Algorithm 2.1, we choose 10 complex problems to test Algorithm 2.1 with different ρ values. Table 1 lists these numerical results, where "problem", "iter", "fn", "gn", "time", "||g(x)||" and "f(x)" mean the name of the test problem, the total number of iterations, the total number of function evaluations, the total number of gradient evaluations, the CPU time in seconds, the infinity norm of the final value of the gradient and the final value of the function at the final point, respectively.

In Table 1, we see that Algorithm 2.1 with ρ = 1 performed best. Moreover, we also compared Algorithm 2.1 with the other algorithms in the previous sections, and the numerical results showed that they performed similarly. So in this section, we only list the numerical results for Algorithm 2.1 with ρ = 1 and for the "cg-descent" and "prp+" methods. These results are reported in Table 2, where "-1" means the method failed.

Figures 1-4 show the performance of the above methods relative to CPU time, the number of iterations, the number of function evaluations and the number of gradient evaluations, respectively, which were evaluated using the profiles of Dolan and Moré [7]. For example, a performance profile with respect to CPU time means that, for each method, we plot the fraction P of problems for which the method is within a factor τ of the best time. The left side of the figure gives the percentage of the test problems for which a method is the fastest; the right side gives the percentage of the test problems that are successfully solved by each of the methods. The top curve is the method that solved the most problems in a time that was within a factor τ of the best time.
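The profile described above is straightforward to compute from a table of timings. The following is a minimal sketch in the spirit of Dolan and Moré [7], not the script used for Figures 1-4. The function name `performance_profile`, the convention of marking failures with `np.inf`, and the toy timings are all assumptions, and the sketch presumes that every problem is solved by at least one method.

```python
import numpy as np

def performance_profile(times, taus):
    """Fraction of problems each solver handles within a factor tau of the best.

    times: array of shape (n_problems, n_solvers); np.inf marks a failure.
    Returns an array P with P[i, s] = fraction for solver s at factor taus[i].
    """
    times = np.asarray(times, dtype=float)
    best = times.min(axis=1, keepdims=True)  # best time on each problem
    ratios = times / best                    # performance ratios
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

# Toy timings for 3 problems and 3 solvers (made-up numbers, not from Table 2).
times = np.array([[1.0, 2.0, np.inf],
                  [3.0, 3.0, 6.0],
                  [4.0, 2.0, 2.0]])
profile = performance_profile(times, taus=[1.0, 2.0, 4.0])
```

The row for τ = 1 gives the fraction of problems on which each solver is fastest (ties count for every tied solver), and the values for large τ approach the fraction of problems each solver eventually solves.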

 

 

 

 

 

 

 

 

Figure 1 shows that "Algorithm 2.1" performed slightly better than the "cg-descent" method did for the test problems. It outperforms the "cg-descent" and "prp+" methods on about 59% (71 out of 120) of the test problems. "Algorithm 2.1" and the "cg-descent" method ultimately solve 100% of the test problems. The "prp+" method performed worst since it only solves 77% of the test problems successfully. But Figure 2 shows that "prp+" has the best performance with respect to the number of iterations since it solves about 46% of the problems with the smallest number of iterations. We can see from Figures 3 and 4 that "Algorithm 2.1" has the best performance with respect to the number of function and gradient evaluations since it corresponds to the top curve.

 

6 Conclusions

We have proposed some new versions of the HS method based on the secant condition, which can generate sufficient descent directions with inexact line searches. Moreover, we proved that the proposed HS methods converge globally for strongly convex functions. Two modified schemes are introduced and proved to be globally convergent for general nonconvex functions. These results are also extended to some other conjugate gradient methods. Some results of the paper extend some work of the references [22, 23, 24]. The performance profiles showed that the proposed methods are also efficient for problems from the CUTE library.

Acknowledgement. This work was supported by the NSF foundation (10701018) of China.

 

REFERENCES

[1] N. Andrei, Scaled conjugate gradient algorithms for unconstrained optimization. Comput. Optim. Appl., 38 (2007), 401-416.

[2] E. Birgin and J.M. Martínez, A spectral conjugate gradient method for unconstrained optimization. Appl. Math. Optim., 43 (2001), 117-128.

[3] K.E. Bongartz, A.R. Conn, N.I.M. Gould and P.L. Toint, CUTE: constrained and unconstrained testing environments. ACM Trans. Math. Softw., 21 (1995), 123-160.

[4] Y.H. Dai and L.Z. Liao, New conjugate conditions and related nonlinear conjugate gradient methods. Appl. Math. Optim., 43 (2001), 87-101.

[5] Y.H. Dai and Y. Yuan, A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim., 10 (1999), 177-182.

[6] Y.H. Dai and Y. Yuan, Nonlinear Conjugate Gradient Methods. Shanghai Science and Technology Publisher, Shanghai (2000).

[7] E.D. Dolan and J.J. Moré, Benchmarking optimization software with performance profiles. Math. Program., 91 (2002), 201-213.

[8] R. Fletcher and C. Reeves, Function minimization by conjugate gradients. Comput. J., 7 (1964), 149-154.

[9] R. Fletcher, Practical Methods of Optimization, Vol I: Unconstrained Optimization. John Wiley & Sons, New York (1987).

[10] J.C. Gilbert and J. Nocedal, Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim., 2 (1992), 21-42.

[11] W.W. Hager and H. Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM J. Optim., 16 (2005), 170-192.

[12] W.W. Hager and H. Zhang, A survey of nonlinear conjugate gradient methods. Pacific J. Optim., 2 (2006), 35-58.

[13] M.R. Hestenes and E.L. Stiefel, Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards Section B, 49 (1952), 409-432.

[14] D. Li and M. Fukushima, A modified BFGS method and its global convergence in nonconvex minimization. J. Comput. Appl. Math., 129 (2001), 15-35.

[15] Y.L. Liu and C.S. Storey, Efficient generalized conjugate gradient algorithms, Part 1: Theory. J. Optim. Theory Appl., 69 (1991), 129-137.

[16] J.J. Moré and D.J. Thuente, Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw., 20 (1994), 286-307.

[17] E. Polak and G. Ribière, Note sur la convergence des méthodes de directions conjuguées. Rev. Française Informat. Recherche Opérationnelle, 16 (1969), 35-43.

[18] B.T. Polyak, The conjugate gradient method in extreme problems. USSR Comp. Math. Math. Phys., 9 (1969), 94-112.

[19] D.F. Shanno, Conjugate gradient methods with inexact searches. Math. Oper. Res., 3 (1978), 244-256.

[20] P. Wolfe, Convergence conditions for ascent methods. SIAM Rev., 11 (1969), 226-235.

[21] Y. Yuan and J. Stoer, A subspace study on conjugate algorithms. ZAMM Z. Angew. Math. Mech., 75 (1995), 69-77.

[22] L. Zhang, W. Zhou and D. Li, A descent modified Polak-Ribière-Polyak conjugate gradient method and its global convergence. IMA J. Numer. Anal., 26 (2006), 629-640.

[23] L. Zhang, W. Zhou and D. Li, Global convergence of a modified Fletcher-Reeves conjugate gradient method with Armijo-type line search. Numer. Math., 104 (2006), 561-572.

[24] L. Zhang, W. Zhou and D. Li, Some descent three-term conjugate gradient methods and their global convergence. Optim. Methods Softw., 22 (2007), 697-711.

[25] G. Zoutendijk, Nonlinear Programming, Computational methods, Integer and Nonlinear Programming. J. Abadie (ed.), North-Holland, Amsterdam (1970), 37-86.

 

 

Received: 08/II/09. Accepted: 15/II/09.

 

 

#CAM-53/09.