Computational & Applied Mathematics
Print version ISSN 2238-3603
Online version ISSN 1807-0302
Comput. Appl. Math. vol.28 no.1 São Carlos 2009
New versions of the Hestenes-Stiefel nonlinear conjugate gradient method based on the secant condition for optimization
Li Zhang
College of Mathematics and Computational Science, Changsha University of Science and Technology, Changsha 410076, China, Email: mathlizhang@126.com
ABSTRACT
Based on the secant condition often satisfied by quasi-Newton methods, two new versions of the Hestenes-Stiefel (HS) nonlinear conjugate gradient method are proposed, which are descent methods even with inexact line searches. The search directions of the proposed methods have the form d_{k} = -θ_{k}g_{k} + β_{k}^{HS}d_{k-1}, or d_{k} = -g_{k} + β_{k}^{HS}d_{k-1} + θ_{k}y_{k-1}. When exact line searches are used, the proposed methods reduce to the standard HS method. Convergence properties of the proposed methods are discussed. These results are also extended to some other conjugate gradient methods such as the Polak-Ribière-Polyak (PRP) method. Numerical results are reported.
Mathematical subject classification: 90C30, 65K05.
Keywords: HS method, descent direction, global convergence.
1 Introduction
Assume that f: R^{n} → R is a continuously differentiable function whose gradient is denoted by g. The problem considered in this paper is
min f(x), x ∈ R^{n}. (1.1)
The iterates for solving (1.1) are given by
x_{k+1} = x_{k} + α_{k}d_{k}, (1.2)
where the stepsize α_{k} is positive and computed by some line search, and d_{k} is the search direction.
Conjugate gradient methods are very efficient iterative methods for solving (1.1), especially when n is large. The search direction has the following form:
d_{k} = -g_{k} if k = 0, and d_{k} = -g_{k} + β_{k}d_{k-1} if k > 0, (1.3)
where β_{k} is a parameter. Some well-known conjugate gradient methods include the Polak-Ribière-Polyak (PRP) method [17, 18], the Hestenes-Stiefel (HS) method [13], the Liu-Storey (LS) method [15], the Fletcher-Reeves (FR) method [8], the Dai-Yuan (DY) method [5] and the conjugate descent (CD) method [9]. In this paper, we are interested in the HS method, in which β_{k} is defined by
β_{k}^{HS} = g_{k}^{T}y_{k-1} / d_{k-1}^{T}y_{k-1}, (1.4)
where y_{k-1} = g_{k} - g_{k-1}. Throughout the paper, we denote s_{k-1} = x_{k} - x_{k-1} = α_{k-1}d_{k-1}, and ‖·‖ stands for the Euclidean norm. Convergence properties of conjugate gradient methods can be found in the book [6], the survey paper [12] and references therein.
The HS method behaves like the PRP method in practical computation and is generally regarded as one of the most efficient conjugate gradient methods. An important feature of the HS method is that it satisfies the conjugacy condition
d_{k}^{T}y_{k-1} = 0, (1.5)
which is independent of the objective function and line search. However, Dai and Liao [4] pointed out that in the case g_{k}^{T}d_{k-1} ≠ 0, the conjugacy condition (1.5) may have some disadvantages (for instance, see [21]). In order to construct a better formula for β_{k}, Dai and Liao proposed a new conjugacy condition and a new conjugate gradient method called the Dai-Liao (DL) method, given by
β_{k}^{DL} = g_{k}^{T}(y_{k-1} - ts_{k-1}) / d_{k-1}^{T}y_{k-1}, (1.6)
with a parameter t ∈ [0, ∞). Based on the idea of the DL method, Hager and Zhang [11] proposed a descent conjugate gradient method (HZ).
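As a small numerical illustration of the HS and DL coefficients, the following sketch assumes the standard textbook forms β^{HS} = g_{k}^{T}y_{k-1} / d_{k-1}^{T}y_{k-1} and β^{DL} = g_{k}^{T}(y_{k-1} - ts_{k-1}) / d_{k-1}^{T}y_{k-1}. The function names and the sample vectors are hypothetical and chosen only for illustration. It checks that the HS direction satisfies d_{k}^{T}y_{k-1} = 0, whereas the DL direction satisfies d_{k}^{T}y_{k-1} = -tg_{k}^{T}s_{k-1}.

```python
import numpy as np

def beta_hs(g, y, d_prev):
    """Hestenes-Stiefel coefficient (standard textbook form, assumed here)."""
    return float(g @ y) / float(d_prev @ y)

def beta_dl(g, y, s, d_prev, t):
    """Dai-Liao coefficient with parameter t >= 0; t = 0 recovers beta_hs."""
    return float(g @ (y - t * s)) / float(d_prev @ y)

# Hypothetical data for a single iteration (not taken from the paper).
g_prev = np.array([0.5, 1.0, -1.0])   # previous gradient g_{k-1}
g = np.array([1.0, -2.0, 0.5])        # current gradient g_k
d_prev = -g_prev                      # previous direction d_{k-1}
alpha_prev = 0.1                      # previous stepsize
s = alpha_prev * d_prev               # s_{k-1} = alpha_{k-1} d_{k-1}
y = g - g_prev                        # y_{k-1} = g_k - g_{k-1}
t = 0.5

d_hs = -g + beta_hs(g, y, d_prev) * d_prev
d_dl = -g + beta_dl(g, y, s, d_prev, t) * d_prev

hs_conjugacy = float(d_hs @ y)                    # expected to vanish up to rounding
dl_residual = float(d_dl @ y) + t * float(g @ s)  # expected to vanish up to rounding
```

Both quantities vanish up to rounding error regardless of the line search, because the identities follow algebraically from the definitions of the coefficients.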
Besides conjugate gradient methods, the following gradient type methods
d_{k} = -g_{k} if k = 0, and d_{k} = -θ_{k}g_{k} + β_{k}d_{k-1} if k > 0, (1.7)
have also been studied extensively by many authors. Here θ_{k} and β_{k} are two parameters. Clearly, if θ_{k} = 1, the methods (1.7) become conjugate gradient methods (1.3). When
the methods (1.7) reduce to spectral conjugate gradient methods [2] and scaled conjugate gradient methods [1]. Yuan and Stoer [21] proposed a subspace method to compute the parameters θ_{k} and β_{k} which solve the following subproblem
where Ω_{k} = Span{g_{k}, d_{k-1}} and B_{k} is a suitable quasi-Newton matrix such as the memoryless BFGS update matrix [19]. Zhang et al. [23] proposed a modified FR method where the parameters in (1.7) are given by
This method satisfies g_{k}^{T}d_{k} = -‖g_{k}‖^{2} and this property depends neither on the line search used, nor on the convexity of the objective function. Moreover, this method converges globally for nonconvex functions with Armijo or Wolfe line search.
Recently, based on the direction generated by the memoryless BFGS update matrix [19], Zhang et al. [22, 24] proposed the following three-term conjugate gradient type method,
where θ_{k} and β_{k} are two parameters. If the parameters in (1.8) are given by
then this method becomes the modified PRP method [22]. If the parameters in (1.8) are chosen as
then we get the three-term HS method [24]. Both methods still retain the relation g_{k}^{T}d_{k} = -‖g_{k}‖^{2} and performed well in practical computations.
In this paper, we are concerned with the methods (1.7) and (1.8) with the parameter β_{k} = β_{k}^{HS}. We then construct new formulas for θ_{k} by using the idea of the DL method [4].
This paper is organized as follows. In Section 2, we present new formulas for θ_{k} and corresponding algorithms. In Section 3, we analyze global convergence properties of the proposed methods with some inexact line searches. In Section 4, we extend the results of Section 2 and Section 3 to other conjugate gradient methods. In Section 5, we report numerical comparisons with existing conjugate gradient methods by using problems in the CUTE library [3].
2 New formula for θ_{k} and algorithms
In this section, we first describe the following two-term HS conjugate gradient type method,
where, for convenience, we write θ_{k} = 1 +
In order to introduce our method, let us briefly recall the conjugacy condition proposed by Dai and Liao [4]. Linear conjugate gradient methods generate search directions such that the conjugacy condition holds, namely,
d_{i}^{T}Qd_{j} = 0, ∀ i ≠ j, (2.2)
where Q is the symmetric and positive definite Hessian matrix of the quadratic objective function f(x). For general nonlinear functions, it follows from the mean value theorem that there exists some τ ∈ (0, 1) such that
d_{k}^{T}y_{k-1} = α_{k-1}d_{k}^{T}∇^{2}f(x_{k-1} + τα_{k-1}d_{k-1})d_{k-1}.
Therefore it is reasonable to replace (2.2) by the following conjugacy condition:
d_{k}^{T}y_{k-1} = 0. (2.3)
Dai and Liao [4] used the secant condition of quasi-Newton methods, that is,
H_{k}y_{k-1} = s_{k-1}, (2.4)
where H_{k} is an approximation to the inverse Hessian. For quasi-Newton methods, the search direction d_{k} can be calculated in the form
d_{k} = -H_{k}g_{k}. (2.5)
By the use of (2.4) and (2.5), we get that
d_{k}^{T}y_{k-1} = -(H_{k}g_{k})^{T}y_{k-1} = -g_{k}^{T}(H_{k}y_{k-1}) = -g_{k}^{T}s_{k-1}.
The above relation implies that (2.3) holds if the line search is exact since in this case g_{k}^{T}s_{k-1} = 0. However, practical numerical algorithms normally adopt inexact line searches instead of exact ones. For this reason, Dai and Liao replaced the above conjugacy condition by
d_{k}^{T}y_{k-1} = -tg_{k}^{T}s_{k-1}, (2.6)
where t is a scalar. If we substitute (1.3) into (2.6), we get the formula for β_{k}^{DL} in (1.6).
In order to get the formula for θ_{k} in our method, substituting (2.1) into (2.6), we have
Set
where h is a parameter. We get from the above two equalities that
Now let
with ρ ∈ [0, 1]. Then, we have
For convenience, we summarize the above method as the following algorithm which we call the two-term HS method.
Algorithm 2.1 (two-term HS Method):
Step 0: Given the constant ρ ∈ [0, 1], choose an initial point x_{0} ∈ R^{n}. Let k := 0.
Step 1: Compute d_{k} by
where
Step 2: Determine α_{k} by some line search.
Step 3: Let the next iterate be x_{k+1} = x_{k} + α_{k}d_{k}.
Step 4: Let k := k + 1. Go to Step 1.
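The steps of Algorithm 2.1 can be sketched as a short loop. Because the formula for θ_{k} is not reproduced here, the sketch below takes θ_{k} as an optional user-supplied callable (a hypothetical interface) and falls back to θ_{k} = 1, which gives the standard HS direction. The function name `cg_two_term`, the test matrix and the exact line search for a quadratic are assumptions made for illustration only.

```python
import numpy as np

def cg_two_term(grad, x0, line_search, theta=None, tol=1e-10, max_iter=100):
    """Iterate x_{k+1} = x_k + alpha_k d_k with d_k = -theta_k g_k + beta_k^{HS} d_{k-1}.

    theta is a hypothetical hook: theta(g_new, g_old, d_old, beta) -> float.
    With theta=None the loop uses theta_k = 1 (standard HS method).
    """
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                   # d_0 = -g_0
    for k in range(max_iter):
        if np.linalg.norm(g, np.inf) < tol:
            return x, k
        alpha = line_search(x, d, g)         # Step 2: stepsize from some line search
        x = x + alpha * d                    # Step 3: next iterate
        g_new = grad(x)
        if np.linalg.norm(g_new, np.inf) < tol:
            return x, k + 1                  # stop before forming a 0/0 quotient
        y = g_new - g
        beta = float(g_new @ y) / float(d @ y)
        th = 1.0 if theta is None else theta(g_new, g, d, beta)
        d = -th * g_new + beta * d           # Step 1 for the next iteration
        g = g_new
    return x, max_iter

# Illustration on a strictly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# where the exact stepsize along d is available in closed form.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
grad = lambda x: A @ x - b
exact_step = lambda x, d, g: -float(g @ d) / float(d @ A @ d)

x_star, iters = cg_two_term(grad, np.zeros(3), exact_step)
```

With exact line searches on a quadratic, the loop coincides with the linear conjugate gradient method, so it is expected to terminate after at most n = 3 steps up to rounding.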
By the same argument as for Algorithm 2.1, we can get the following three-term HS method. In the rest of this paper, we only give the direction in each algorithm; the other steps are the same as in Algorithm 2.1, since conjugate gradient methods are mainly determined by their search directions.
Algorithm 2.2 (three-term HS Method):
where
Remark 2.1. It is interesting to note that when ρ = 0 in (2.7)-(2.8) or (2.9)-(2.10), we have
g_{k}^{T}d_{k} = -‖g_{k}‖^{2},
which is independent of any line search and convexity of the objective function. In this case, Algorithm 2.2 reduces to the three-term HS method [24].
Remark 2.2. If exact line search is used, it is easy to see that Algorithm 2.1 and Algorithm 2.2 reduce to the standard HS method.
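The descent identity g_{k}^{T}d_{k} = -‖g_{k}‖^{2} associated with ρ = 0 can be checked numerically. The sketch below does not use the paper's formula for θ_{k}; instead it assumes a θ_{k} obtained by imposing that identity directly on d_{k} = -θ_{k}g_{k} + β_{k}^{HS}d_{k-1}, which gives θ_{k} = 1 + β_{k}^{HS}g_{k}^{T}d_{k-1} / ‖g_{k}‖^{2}. The function name and the random test vectors are hypothetical.

```python
import numpy as np

def two_term_direction_rho0(g, g_prev, d_prev):
    """Two-term direction with theta chosen so that g^T d = -||g||^2.

    Assumption: theta = 1 + beta_HS * (g^T d_prev) / ||g||^2, derived from the
    stated descent identity rather than copied from the paper.
    """
    y = g - g_prev
    beta = float(g @ y) / float(d_prev @ y)
    theta = 1.0 + beta * float(g @ d_prev) / float(g @ g)
    return -theta * g + beta * d_prev

# Arbitrary vectors: no line search or convexity assumption is involved.
rng = np.random.default_rng(0)
g, g_prev, d_prev = rng.standard_normal((3, 5))
d = two_term_direction_rho0(g, g_prev, d_prev)

lhs = float(g @ d)       # g_k^T d_k
rhs = -float(g @ g)      # -||g_k||^2
```

If the line search is exact, then g_{k}^{T}d_{k-1} = 0 and this θ_{k} equals 1, so the direction coincides with the standard HS direction, in line with Remark 2.2.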
3 Convergence properties
In this section, we only analyze convergence properties of Algorithm 2.1. The corresponding results for Algorithm 2.2 can be obtained by using the same argument as for Algorithm 2.1. In the global convergence analysis of many iterative methods, the following assumption is often needed.
Assumption A.
The level set Ω = {x ∈ R^{n} | f(x) ≤ f(x_{0})} is bounded.
In some neighborhood N of Ω, f is continuously differentiable and its gradient is Lipschitz continuous, namely, there exists a constant L > 0 such that
‖g(x) - g(y)‖ ≤ L‖x - y‖, ∀ x, y ∈ N. (3.1)
Clearly, Assumption A implies that there exists a constant γ such that
‖g(x)‖ ≤ γ, ∀ x ∈ Ω. (3.2)
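In order to ensure global convergence of Algorithm 2.1, we need some line search to compute the stepsize α_{k}. The Wolfe line search consists of finding α_{k} satisfying
f(x_{k} + α_{k}d_{k}) ≤ f(x_{k}) + δα_{k}g_{k}^{T}d_{k}, g(x_{k} + α_{k}d_{k})^{T}d_{k} ≥ σg_{k}^{T}d_{k}. (3.3)
The strong Wolfe line search consists of finding α_{k} such that
f(x_{k} + α_{k}d_{k}) ≤ f(x_{k}) + δα_{k}g_{k}^{T}d_{k}, |g(x_{k} + α_{k}d_{k})^{T}d_{k}| ≤ -σg_{k}^{T}d_{k}, (3.4)
where 0 < δ < σ < 1 are constants.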
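To make the two sets of conditions concrete, the sketch below tests a trial stepsize against the standard Wolfe and strong Wolfe inequalities (sufficient decrease plus the respective curvature condition). The function name `wolfe_flags`, the default values δ = 10^{-4} and σ = 0.1, and the one-dimensional test problem are assumptions made for illustration.

```python
import numpy as np

def wolfe_flags(f, grad, x, d, alpha, delta=1e-4, sigma=0.1):
    """Return (wolfe_ok, strong_wolfe_ok) for the trial stepsize alpha.

    Assumes d is a descent direction, i.e. grad(x)^T d < 0, and 0 < delta < sigma < 1.
    """
    g_d = float(grad(x) @ d)                    # g_k^T d_k
    x_new = x + alpha * d
    new_g_d = float(grad(x_new) @ d)            # g(x_k + alpha d_k)^T d_k
    decrease = f(x_new) <= f(x) + delta * alpha * g_d
    weak = decrease and new_g_d >= sigma * g_d
    strong = decrease and abs(new_g_d) <= -sigma * g_d
    return bool(weak), bool(strong)

# Hypothetical test problem: f(x) = 0.5 ||x||^2 with the steepest descent direction.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x = np.array([1.0, 0.0])
d = -grad(x)

results = {alpha: wolfe_flags(f, grad, x, d, alpha) for alpha in (0.01, 1.0, 1.9, 2.0)}
```

On this problem, α = 0.01 is too short for either curvature condition, α = 1 satisfies both sets of conditions, α = 1.9 satisfies the Wolfe conditions but overshoots too far for the strong Wolfe conditions, and α = 2 fails the sufficient decrease condition.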
The following lemma, called the Zoutendijk condition, is often used to prove global convergence of conjugate gradient methods. It was originally given by Zoutendijk [25] and Wolfe [20].
Lemma 3.1. Let Assumption A hold, {x_{k}} be generated by (1.2) and d_{k} satisfy g_{k}^{T}d_{k} < 0. If α_{k} satisfies the Wolfe condition (3.3) or the strong Wolfe condition (3.4), then we have
Σ_{k ≥ 0} (g_{k}^{T}d_{k})^{2} / ‖d_{k}‖^{2} < ∞. (3.5)
In the global convergence analysis for many methods, the sufficient descent condition plays an important role. The following result shows that Algorithm 2.1 produces sufficient descent directions.
Lemma 3.2. Let {x_{k}} and {d_{k}} be generated by Algorithm 2.1, and let α_{k} be obtained by the Wolfe line search (3.3). If ρ ∈ [0,1), then we have
Moreover, if ρ = 1, then g_{k}^{T}d_{k} < 0.
Proof. We have from (2.7) and the definition of β_{k}^{HS} in (1.4) that
which implies that
Since
and the Wolfe line search implies
from (3.8), we have (3.6) by induction. The proof is then finished.
By the use of the first equality in (3.8) and the second inequality in the strong Wolfe line search (3.4), we have the following result.
Lemma 3.3. Let {x_{k}} and {d_{k}} be generated by Algorithm 2.1, and let α_{k} be obtained by the strong Wolfe line search (3.4) with σ < . If ρ = 1, then we have
The following theorem establishes global convergence of Algorithm 2.1 for strongly convex functions.
Theorem 3.4. Suppose Assumption A holds and f is strongly convex on N, that is, there exists a constant µ > 0 such that
(g(x) - g(y))^{T}(x - y) ≥ µ‖x - y‖^{2}, ∀ x, y ∈ N. (3.10)
If ρ ∈ [0,1), then the sequence {x_{k}} generated by Algorithm 2.1 with the Wolfe line search (3.3) satisfies lim_{k→∞} ‖g_{k}‖ = 0.
Proof. It follows from (3.10) and (3.1) that
Now we begin to estimate θ_{k} and β_{k} in (2.7). It follows from (1.4), (3.1) and (3.11) that
It follows from (2.8), (3.12) and the second inequality in (3.3) that
The above inequality together with (3.12) implies that
We have from (3.6) and (3.5) that
It follows from the above inequality and (3.13) that
which means lim_{k→∞} ‖g_{k}‖ = 0. The proof is then completed.
By Lemma 3.3 and the same argument as in the above theorem, we have the following corollary.
Corollary 3.5. Suppose Assumption A holds and f is strongly convex on N. If ρ = 1 and α_{k} is determined by the strong Wolfe line search (3.4) with σ < , then the sequence {x_{k}} generated by Algorithm 2.1 satisfies lim_{k→∞} ‖g_{k}‖ = 0.
In order to ensure global convergence of Algorithm 2.1 for nonconvex functions, we adopt the idea of the MBFGS method proposed by Li and Fukushima [14] and modify Algorithm 2.1, replacing y_{k-1} in (2.7) by
where ε_{1} is a small positive constant. For convenience, we present this modified version as the following algorithm.
Algorithm 3.1 (modified two-term HS Method):
where
An important property of z_{k-1} is that, when the Wolfe line search is used, it satisfies
This inequality is the same as (3.11) and plays the same role in the proof of global convergence of Algorithm 3.1 for nonconvex functions. By the use of (3.17) and the same arguments as in Theorem 3.4, we have the following strong global convergence result for Algorithm 3.1 for nonconvex objective functions.
Theorem 3.6. Suppose Assumption A holds. Then the sequence {x_{k}} generated by Algorithm 3.1 with the Wolfe line search (3.3) satisfies lim_{k→∞} ‖g_{k}‖ = 0.
Another technique to guarantee global convergence of conjugate gradient methods for general nonlinear functions is to restrict β_{k} to be nonnegative, as in the PRP+ and HS+ methods [10]. In fact, if we replace β_{k}^{HS} in Algorithm 2.1 by β_{k}^{HS+} = max{0, β_{k}^{HS}}, we have the following algorithm which we call the two-term HS+ Method.
Algorithm 3.2 (two-term HS+ Method):
where
Using the same argument as Theorem 4.3 in [10], we can get the following global convergence result. Here we omit its proof.
Theorem 3.7. Suppose Assumption A holds. Then the sequence {x_{k}} generated by Algorithm 3.2 with the strong Wolfe line search (3.4) satisfies liminf_{k→∞} ‖g_{k}‖ = 0.
4 Applications
In this section, we extend the results on new versions of the HS method in Sections 2 and 3 to some well-known conjugate gradient methods. For instance, if we replace the term d_{k-1}^{T}y_{k-1} in Step 1 of Algorithm 2.1 and Algorithm 2.2 by ‖g_{k-1}‖^{2} or -g_{k-1}^{T}d_{k-1}, we get new versions of the PRP and LS methods, respectively.
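The HS, PRP and LS coefficients share the numerator g_{k}^{T}y_{k-1} and differ only in the denominator. The sketch below assumes the standard textbook forms of the three coefficients; the function names and the sample vectors are hypothetical. The data are chosen so that d_{k-1} = -g_{k-1} and g_{k}^{T}d_{k-1} = 0, which mimics an exact line search, and in that situation the three denominators coincide.

```python
import numpy as np

def beta_hs(g, g_prev, d_prev):
    """Hestenes-Stiefel: denominator d_{k-1}^T y_{k-1}."""
    y = g - g_prev
    return float(g @ y) / float(d_prev @ y)

def beta_prp(g, g_prev):
    """Polak-Ribiere-Polyak: denominator ||g_{k-1}||^2."""
    y = g - g_prev
    return float(g @ y) / float(g_prev @ g_prev)

def beta_ls(g, g_prev, d_prev):
    """Liu-Storey: denominator -g_{k-1}^T d_{k-1}."""
    y = g - g_prev
    return float(g @ y) / (-float(g_prev @ d_prev))

# Hypothetical data with d_{k-1} = -g_{k-1} and g_k orthogonal to d_{k-1}.
g_prev = np.array([1.0, 2.0, 0.0])
d_prev = -g_prev
g = np.array([2.0, -1.0, 3.0])

betas = (beta_hs(g, g_prev, d_prev), beta_prp(g, g_prev), beta_ls(g, g_prev, d_prev))
```

With inexact line searches the three values generally differ, which is why each choice of denominator leads to a separate algorithm.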
Algorithm 4.1 (two-term PRP Method):
where
Algorithm 4.2 (three-term PRP Method):
where
Algorithm 4.3 (two-term LS Method):
where
Algorithm 4.4 (three-term LS Method):
where
Algorithm 4.5 (two-term FR Method):
where
Remark 4.1. When ρ = 0 in the above algorithms, the search direction satisfies the sufficient descent condition g_{k}^{T}d_{k} = -‖g_{k}‖^{2}, which is also independent of any line search and convexity of the objective function. Moreover in this case, Algorithms 4.2 and 4.3 are identical and reduce to the modified PRP method [22], and Algorithm 4.5 becomes the modified FR method [23]. It is clear that these methods reduce to the corresponding conjugate gradient methods if exact line search is used.
Remark 4.2. Global convergence properties of these algorithms are similar to those of Algorithm 2.1 or Algorithm 2.2. Here we only analyze Algorithm 4.1 and Algorithm 4.5.
The next result shows that the direction generated by Algorithm 4.1 or Algorithm 4.5 satisfies the sufficient descent condition if the strong Wolfe line search (3.4) is used.
Lemma 4.1. Let {x_{k}} and {d_{k}} be generated by Algorithm 4.1 or Algorithm 4.5 with the strong Wolfe line search (3.4). If ρ < , then for all k, we have that
Proof. We prove (4.5) by induction. Since g_{0}^{T}d_{0} = -‖g_{0}‖^{2}, the relation (4.5) holds with k = 0. It follows from (4.1) or (4.3) that
The above equality with the second inequality in (3.4) implies that
Repeating the same process, we have that
which shows that (4.5) holds. The proof is then completed.
Remark 4.3. If we replace β_{k}^{PRP} in Algorithm 4.1 by β_{k}^{PRP+} = max{0, β_{k}^{PRP}}, then this restricted algorithm converges globally for nonconvex functions by Lemma 4.1 and the argument of Theorem 4.3 in [10].
The next result is based on the work of [23]. Here we also omit its proof.
Theorem 4.2. Suppose Assumption A holds. Let {x_{k}} and {d_{k}} be generated by Algorithm 4.5 with the strong Wolfe line search (3.4). If ρ < , then we have liminf_{k→∞} ‖g_{k}‖ = 0.
5 Numerical results
In this section, we compare the performance of the proposed methods with those of the PRP+ method developed by Gilbert and Nocedal [10], and the CG_DESCENT method proposed by Hager and Zhang [11].
The PRP+ code was obtained from Nocedal's web page at http://www.ece.northwestern.edu/~nocedal/software.html, and the CG_DESCENT code from Hager's web page at http://www.math.ufl.edu/~hager/. The PRP+ code is coauthored by Liu, Nocedal and Waltz, and the CG_DESCENT code is coauthored by Hager and Zhang. The test problems are unconstrained problems in the CUTE library [3].
We stop the iteration if the inequality ‖g(x_{k})‖_{∞} < 10^{-6} is satisfied. All codes were written in Fortran and run on a PC with a 2.66 GHz CPU, 1 GB of RAM and the Linux operating system. Tables 1 and 2 list all numerical results. For convenience, we give the meanings of these methods in the tables.

"cgdescent" stands for the CG_DESCENT method with the approximate Wolfe line search [11]. Here we use the Fortran 77 source code, Version 1.4 (November 14, 2005), from Hager's web page with the default parameters there;

"Algorithm 2.1" is Algorithm 2.1 with ρ = 1 and the same line search as "cgdescent";

"prp+" means the PRP+ method with the strong Wolfe line search proposed by Moré and Thuente [16].
In order to get relatively better ρ values in Algorithm 2.1, we choose 10 complex problems to test Algorithm 2.1 with different ρ values. Table 1 lists these numerical results, where "problem", "iter", "fn", "gn", "time", "‖g(x)‖_{∞}" and "f(x)" mean the name of the test problem, the total number of iterations, the total number of function evaluations, the total number of gradient evaluations, the CPU time in seconds, the infinity norm of the gradient at the final point and the value of the function at the final point, respectively.
In Table 1, we see that Algorithm 2.1 with ρ = 1 performed best. Moreover, we also compared Algorithm 2.1 with the other algorithms in the previous sections, and the numerical results showed that they performed similarly. So in this section, we only list the numerical results for Algorithm 2.1 with ρ = 1 and for the "cgdescent" and "prp+" methods. These results are reported in Table 2, where "-1" means the method failed.
Figures 1-4 show the performance of the above methods relative to CPU time, the number of iterations, the number of function evaluations and the number of gradient evaluations, respectively, which were evaluated using the profiles of Dolan and Moré [7]. For example, the performance profile with respect to CPU time means that, for each method, we plot the fraction P of problems for which the method is within a factor τ of the best time. The left side of the figure gives the percentage of the test problems for which a method is the fastest; the right side gives the percentage of the test problems that are successfully solved by each of the methods. The top curve is the method that solved the most problems in a time that was within a factor τ of the best time.
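The construction of such a profile can be sketched in a few lines. The function below follows the usual Dolan-Moré recipe (performance ratio of each solver to the best solver on each problem, then the fraction of problems with ratio at most τ). The function name and the small cost table are hypothetical and are not data from this paper; failures are encoded as infinite cost.

```python
import numpy as np

def performance_profile(costs, taus):
    """Fraction of problems each solver handles within a factor tau of the best.

    costs: array of shape (n_problems, n_solvers) with positive entries,
           np.inf marking a failure. Assumes every problem is solved by at
           least one solver, so the row-wise minimum is finite.
    Returns an array of shape (len(taus), n_solvers).
    """
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=1, keepdims=True)      # best cost per problem
    ratios = costs / best                        # performance ratios, >= 1
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

# Hypothetical CPU times for 3 problems and 2 solvers (solver 2 fails on problem 3).
costs = [[1.0, 2.0],
         [3.0, 1.5],
         [2.0, np.inf]]
profile = performance_profile(costs, taus=[1.0, 2.0, 4.0])
```

The row for τ = 1 gives the share of problems on which each solver is the fastest, and the rows for large τ approach the share of problems each solver eventually solves, which matches the reading of the left and right sides of the figures described above.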
Figure 1 shows that "Algorithm 2.1" performed slightly better than the "cgdescent" method did for the test problems. It outperforms the "cgdescent" and "prp+" methods on about 59% (71 out of 120) of the test problems. "Algorithm 2.1" and the "cgdescent" method ultimately solve 100% of the test problems. The "prp+" method performed worst since it only solves 77% of the test problems successfully. However, Figure 2 shows that "prp+" has the best performance with respect to the number of iterations since it solves about 46% of the problems with the smallest number of iterations. We can see from Figures 3 and 4 that "Algorithm 2.1" has the best performance with respect to the number of function and gradient evaluations since it corresponds to the top curve.
6 Conclusions
We have proposed some new versions of the HS method based on the secant condition, which can generate sufficient descent directions with inexact line searches. Moreover, we proved that the proposed HS methods converge globally for strongly convex functions. Two modified schemes are introduced and proved to be globally convergent for general nonconvex functions. These results are also extended to some other conjugate gradient methods. Some results of the paper extend some work of the references [22, 23, 24]. The performance profiles showed that the proposed methods are also efficient for problems from the CUTE library.
Acknowledgement. This work was supported by the NSF foundation (10701018) of China.
REFERENCES
[1] N. Andrei, Scaled conjugate gradient algorithms for unconstrained optimization. Comput. Optim. Appl., 38 (2007), 401-416.
[2] E. Birgin and J.M. Martínez, A spectral conjugate gradient method for unconstrained optimization. Appl. Math. Optim., 43 (2001), 117-128.
[3] K.E. Bongartz, A.R. Conn, N.I.M. Gould and P.L. Toint, CUTE: constrained and unconstrained testing environments. ACM Trans. Math. Softw., 21 (1995), 123-160.
[4] Y.H. Dai and L.Z. Liao, New conjugate conditions and related nonlinear conjugate gradient methods. Appl. Math. Optim., 43 (2001), 87-101.
[5] Y.H. Dai and Y. Yuan, A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim., 10 (1999), 177-182.
[6] Y.H. Dai and Y. Yuan, Nonlinear Conjugate Gradient Methods. Shanghai Science and Technology Publisher, Shanghai (2000).
[7] E.D. Dolan and J.J. Moré, Benchmarking optimization software with performance profiles. Math. Program., 91 (2002), 201-213.
[8] R. Fletcher and C. Reeves, Function minimization by conjugate gradients. Comput. J., 7 (1964), 149-154.
[9] R. Fletcher, Practical Methods of Optimization, Vol I: Unconstrained Optimization. John Wiley & Sons, New York (1987).
[10] J.C. Gilbert and J. Nocedal, Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim., 2 (1992), 21-42.
[11] W.W. Hager and H. Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM J. Optim., 16 (2005), 170-192.
[12] W.W. Hager and H. Zhang, A survey of nonlinear conjugate gradient methods. Pacific J. Optim., 2 (2006), 35-58.
[13] M.R. Hestenes and E.L. Stiefel, Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards Section B, 49 (1952), 409-432.
[14] D. Li and M. Fukushima, A modified BFGS method and its global convergence in nonconvex minimization. J. Comput. Appl. Math., 129 (2001), 15-35.
[15] Y.L. Liu and C.S. Storey, Efficient generalized conjugate gradient algorithms, Part 1: Theory. J. Optim. Theory Appl., 69 (1991), 129-137.
[16] J.J. Moré and D.J. Thuente, Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw., 20 (1994), 286-307.
[17] B. Polak and G. Ribière, Note sur la convergence des méthodes de directions conjuguées. Rev. Française Informat Recherche Operationelle, 16 (1969), 35-43.
[18] B.T. Polyak, The conjugate gradient method in extreme problems. USSR Comp. Math. Math. Phys., 9 (1969), 94-112.
[19] D.F. Shanno, Conjugate gradient methods with inexact searches. Math. Oper. Res., 3 (1978), 244-256.
[20] P. Wolfe, Convergence conditions for ascent methods. SIAM Rev., 11 (1969), 226-235.
[21] Y. Yuan and J. Stoer, A subspace study on conjugate algorithms. ZAMM Z. Angew. Math. Mech., 75 (1995), 69-77.
[22] L. Zhang, W. Zhou and D. Li, A descent modified Polak-Ribière-Polyak conjugate gradient method and its global convergence. IMA J. Numer. Anal., 26 (2006), 629-640.
[23] L. Zhang, W. Zhou and D. Li, Global convergence of a modified Fletcher-Reeves conjugate gradient method with Armijo-type line search. Numer. Math., 104 (2006), 561-572.
[24] L. Zhang, W. Zhou and D. Li, Some descent three-term conjugate gradient methods and their global convergence. Optim. Methods Softw., 22 (2007), 697-711.
[25] G. Zoutendijk, Nonlinear Programming, Computational methods, Integer and Nonlinear Programming. J. Abadie (ed.), North-Holland, Amsterdam (1970), 37-86.
Received: 08/II/09. Accepted: 15/II/09.
#CAM53/09.