STOCHASTIC GRADIENT METHODS FOR UNCONSTRAINED OPTIMIZATION

Krejić, Nataša; Jerinkić, Nataša Krklec

doi:10.1590/0101-7438.2014.034.03.0373

Abstract

This papers presents an overview of gradient based methods for minimization of noisy functions. It is assumed that the objective functions is either given with error terms of stochastic nature or given as the mathematical expectation. Such problems arise in the context of simulation based optimization. The focus of this presentation is on the gradient based Stochastic Approximation and Sample Average Approximation methods. The concept of stochastic gradient approximation of the true gradient can be successfully extended to deterministic problems. Methods of this kind are presented for the data fitting and machine learning problems.

unconstrained optimization; stochastic gradient; stochastic approximation; sample average approximation

1 INTRODUCTION

Stochastic optimization problems appear in all areas or engineering, physical and social sciences. Typical applications are model fitting, parameter estimation, experimental design, performance evaluation etc. The models we are considering here can be written in the form

where f: ℝn → ℝ is either observed with noise or is defined as the mathematical expectation. In fact the objective function depends on a vector of random variables ξ from some probability space that might be know or unknown, depending on application. Thus the exact evaluation of f (x) is impossible to evaluate and it is necessary to use simulation to estimate the objective function value.

The feasible set Ω can be defined by constraints of different types - simple box constraints, deterministic constraints, chance constraints, constraints in the form of mathematical expectation etc. In this paper we consider only the case Ω = ℝn . More precisely, we consider only two types of stochastic problems. The first type are problems with random objective function,

where ξ represents the noise (or randomness) and x is the decision variable. Such models appear when the decision has to be taken before the full information about the problem parameters is known. The lack of information in that case is represented by random vector ξ . The second type of problems are the problems with the mathematical expectation as the objective function

Although the noise is technically removed in the above problem, it is rather hard to solve it as the expectation is hard or even impossible to state analytically even if the distribution of ξ is known.

Methods for solving stochastic optimization problems are combination of ideas from numerical optimization and statistics. Thus the class of popular methods include simulation-based methods, direct methods for stochastic search, annealing type algorithms, genetic algorithms, methods of reinforced learning, statistical methods and many others, ^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^{], [}⁷[7] BIRGE JR & LOUVEAUX F. 1997. Introduction to Stochastic Programming, Springer.^]. Among all of them we restrict our attention here on two methods typically used in simulation based optimization: Stochastic Approximation, SA, and Sample Average Approximation, SAA.

Stochastic Approximation methods are introduced in the seminal paper of Robbins and Monro, ^[³⁰[30] ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407.^] and remain a popular choice for solving stochastic optimization problems. They relay mainly on noisy gradient evaluations and depend heavily on the choice of steplength sequence. The choice of this sequence is the subject of many research efforts as well as other techniques for accelerating the convergence of SA methods. Sample average Approximation methods can be seen as an alternative to SA methods. In this approach a sample from the underlying distribution is used to construct a deterministic sample average problem which can be solved by optimization methods. However the sample used for the SAA approximation very often needs to be large and a naive application of standard nonlinear optimization techniques is not feasible. Therefore there has been extensive research in variable sample size methods that reduce the cost of SAA.

Both SA and SAA methods are considered here in the framework of gradient-related optimization (gradient methods, subgradient methods, second order quasi Newton methods) as well as in the derivative-free framework. This survey will largely deal with gradient methods for stochastic optimization and an interested reader can look at Spall ^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^] for an overview of other methods. This paper is organized as follows. In Section 2 we discuss the SA method and its modifications. Section 3 contains results for the SAA methods and unconstrained problems with the mathematical expectation objective function. Two important deterministic problems that are rather similar to SAA problem are discussed in Section 4 as well as methods for obtaining their solution that rely on stochastic gradients. Finally, some conclusion and research perspectives are presented in Section 5.

2 STOCHASTIC APPROXIMATION METHODS

There have been countless applications of Stochastic Approximation (SA) method since the work of Robbins and Monro ^[³⁰[30] ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407.^]. In this section we give an overview of its main properties and some of its generalizations. The problem we consider is

assuming that only the noisy measurements (x) of the function and its gradient ĝ(x) are available. Let us start by considering the SA algorithm for solving systems of nonlinear equations as it was defined originally in ^[³⁰[30] ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407.^]. The convergence theory presented here relays on imposition of statistical conditions on the objective function and the noise. Convergence analysis can be conducted throughout differential equations as well, see ^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^] and ^[²⁵[25] KUSHNER H. 2010. Stochastic Approximation: A survey. Computational Statistics, 2(1):87-96.^] for further references.

Consider the system of nonlinear equations

with g(x) being the gradient of f (x). Suppose that only the measurements with noise that depends on the iteration as well as on the decision variable x

are available. Then the SA is defined by

The sequence of step sizes {ak}k∈N is also called the gain sequence and it has dominant influence on the convergence.

Let {xk} be a sequence generated by an SA method. Denote by Fk the σ-algebra generated by x0, x1,..., xk. If the problem has an unique solution x* the set of assumptions that ensures the convergence of an SA method is the following.

S1. The gain sequence satisfies:

S2.For some symmetric, positive definite matrix B and for every η ∈ (0,1),

S3. For all x and k,

S4.There exists a constant c> 0 such that for all x and k,

||g(x)||² + E [ ||ξk(x)||²| Fk ] ≤ c (1 +||x||²).

The first assumption which implies that the step sizes converge to zero is standard in stochastic algorithms, see ^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^]. The second condition, is imposed in order to avoid inefficiently small step sizes. On the other hand, the summability condition on a_k ² ensures stability. Its role is to decrease the influence of the noise when the iterates come into a region around the solution. An example of a sequence that satisfies the first assumption is

where α ∈ (0.5, 1) and a is some positive constant. The condition of zero mean is also standard in stochastic optimization. Its implication is that ĝ_k(x) is an unbiased estimator of g(x). Notice that under assumption S3, the condition in S4 is equal to

E [ĝk(x)||²] ≤ c (1 +||x|| ²).

Therefore, the mean of ||ĝ_k(x)||² can not grow faster than a quadratic function of x. Under these assumptions, the almost sure convergence of the SA algorithm can be established. The convergence in mean square i.e. E[|x_k-x|)²] → 0 as k → ∞ was proved in ^[³⁰[30] ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407.^] and the theorem bellow states a stronger result, the almost sure convergence.

Theorem 2.1.^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^] Consider the SA algorithm defined by (6). Suppose that assumptions S1 -S4 hold and that x* is a unique solution of the system (4). Then x_k converges almost surely to x*.

Closely related and more general result is proved in Bertsekas, Tsitsiklis ^[⁶[6] BERTSEKAS DP & TSITSIKLIS JN. 2000. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optimization, 10(2):627-642.^] where the gradient related method of the form

is considered. Here ξk is either stochastic or deterministic error, a_k is a sequence of diminishing step sizes that satisfy assumption S1 and s_k is a descent direction. In this context, the direction s_k is not necessarily the gradient but it is gradient-related. The convergence is stated in the following theorem.

Theorem 2.2.^[⁶[6] BERTSEKAS DP & TSITSIKLIS JN. 2000. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optimization, 10(2):627-642.^] Let {x_k} be a sequence generated by (8), where s_k is a descent direction. Assume that S1 and S3 -S4 hold and that there exist positive scalars c₁ and c₂ such that

c₁||∇f (x_k)||² ≤ - ∇ f (x_k) T sk, ||s_k|| ≤ c ₂(1 + ||∇ f (x_k)||).

Then, either f (xk) →-∞ or else f (x_k) converges to a finite value and lim_k→∞ ∇ f (x) = 0. Furthermore, every limit of {x_k} is a stationary point of f.

The gain sequence is the key element of the SA method. It has impact on stability as well as on convergence rate. Under some regularity conditions, Fabian ^[¹³[13] FABIAN V. 1968. On Asymptotic Normality in Stochastic Optimization. The Annals of Mathematical Statistics, 39:1327-1332.^], the asymptotic normality of x_k is obtained. More precisely,

where →d denotes the convergence in distribution, α refers to (7) and is the covariance matrix that depends on the gain sequence and on the Hessian of f . Therefore, the iterate x_k approximately has normal distribution N (x*, k -α ◹) for large k. Due to assumption S1, the maximal convergence rate is obtained for α = 1. However, this reasoning is based on the asymptotic result. Since the algorithms are finite in practice, it is often desirable to set α < 1 because α = 1 yields smaller steps. moreover, if we want to minimize ||◹||, the ideal sequence would be

where H (x) denotes the Hessian matrix of f , Benveniste et al. ^[⁵[5] BENVENISTE A, METIVIER M & PRIOURET P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York.^]. Even though this result is purely theoretical, sometimes the Hessian at x* can be approximated by H (x_k) and that way one can enhance the rate of convergence.

Two main drawbacks of the SA method are slow convergence and the fact that the convergence theory applies only if the solution of (4) is unique i.e. only if f has an unique minimizer. Thus several generalizations are developed to address these two issues. One can easily see from (7) that the gain coefficients increase with the increase of a. On the other hand a large a might have negative influence on stability. Therefore several generalizations of the gain coefficients are considered in the literature. One possibility, Spall ^[³⁵[35] SPALL JC 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817-823.^], is to introduce the so called stability constant A > 0, and obtain

Now, the values of a and A can be chosen together to ensure effective practical performance of the algorithm, allowing for larger a and thus producing larger step sizes in latter iterations, when the effect of A is small, and avoiding instability in early iterations. The empirically recommended value of A is at most 10% of the iterations allowed or expected during the optimization process, for more details see ^[³⁵[35] SPALL JC 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817-823.^].

Several generalizations of the SA method are based on adaptive step sizes that try to adjust the step size at each iteration to the progress achieved in the previous iterations. The first attempt of this kind has been made in Kesten ^[²⁰[20] KESTEN H. 1958. Accelerated stochastic approximation. Ann. Math. Statist. 29:41-59.^] for one dimensional problems. The main idea is to monitor the changes in the sign of x_k+1-x_k. If the sign of the difference between two consecutive iterations starts to change frequently we are probably in the domain of noise and therefore small steeps are needed to avoid oscillations. This idea is generalized in Delyon, Juditsky ^[¹¹[11] DELYON B & JUDITSKY A. 1993. Accelerated stochastic approximation. SIAM J. Optimization 3(4):868-881.^] for multidimensional problems. The gain coefficients are defined as

where I is the identification function defined as I (t) = 1 if t < 0 and I (t) = 0 if (t) ≥ 0. The method is accelerating the convergence of SA and its almost sure convergence is proved under the standard assumptions.

The idea of sign changes is further developed in Xu, Dai ^[³⁸[38] XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846.^]. It is shown that the sequence {s_k / k} stands for the change frequency of the sign of ĝ_kT ₊₁ ĝk in some sense. The assumption in ^[³⁸[38] XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846.^] is that the noise ξ_k is state independent. The theoretical analysis shows that in that case s_k/ k converges to P(ξ₁ T ξ₂ < 0) in the mean square sense. Based on that result, a switching algorithm that uses the switching parameter t_k,defined as

is proposed. Then the gain coefficients are defined as

where 0.5 ≤ α < β ≤1, v is a small positive constant and a, A are the constants from assumption S1. To prove the convergence of the switching algorithm (9)-(10) one additional assumption is introduced in ^[³⁸[38] XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846.^].

S5. G(x) =g(x)-x is a weighted maximum norm pseudo-contraction operator. That is for all x ∈ ℝn there exists a positive vector ω and some x* ∈ ℝn such that

||G(x)-x*||ω ≤β||x -x*||w,

where β ∈(0,1) and ||·||ω is defined as lxlw = max{x(i)ω(i)^-1 , i =1,...,n}where x(i) and ω(i) are the ith components of x and ω respectively.

Theorem 2.3.^[³⁸[38] XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846.^] Suppose that assumptions S1 -S2 and S5 hold. Then for {x_k }generated through (9)-(10) we have x_k →x* as k →∞ with probability 1.

If the objective function is given in the form of mathematical expectation the adaptive step length sequence can be determined as proposed in Yousefian et al. ^[³⁶[36] YOUSEFIAN F, NEDIC A & SHANBHAG UV. 2012. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67.^]. For the problem

the following assumptions are stated.

S6. The function F(·,ξ) is convex on a closed and convex set D ⊂ ℝn for every ξ ∈ Q, and the expected value E (F(x,ξ)) is finite for every x ∈D.

S7. The errors ξ_k in the noisy gradient ĝ_k are such that for some µ > 0,

A self-adaptive scheme is based on the error minimization and the convergence result is as follows.

Theorem 2.4.^[³⁶[36] YOUSEFIAN F, NEDIC A & SHANBHAG UV. 2012. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67.^] Let assumptions S6 and S7 hold. Let the function f be differentiable over the set D with Lipschitz gradient and assume that the optimal set of problem (11) is nonempty. Assume that the step size sequence {ak}is generated through the following self-adaptive scheme

a_k =a_k-1(1 -ca_k-1) for all k ≥1,

where c > 0 is a scalar and the initial step size is such that 0 < a ₀ < 1 / c. Then the sequence {x_k }converges almost surely to a random point that belongs to the optimal set.

An important choice for the gain sequence is a constant sequence. Although such sequences do not satisfy assumption S1 and almost sure convergence to solution can not be obtained, it can be shown that a constant step size can conduct the iterations to a region that contains the solution. This result initiated development of a cascading steplength SA scheme in ^[³⁶[36] YOUSEFIAN F, NEDIC A & SHANBHAG UV. 2012. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67.^] where a fixed step size is used until some neighborhood of the solution is reached. After that, in order to come closer to the solution, the step size is decreased and again the fixed step size is used until the ring around the solution is sufficiently tighten up. That way, the sequence of iterates is guided towards the solution.

A hybrid method which combines the SA gain coefficients and the step sizes obtained from the inexact Armijo line search under the assumptions valid for the SA method is considered in Krejić et al. ^[²⁴[24] KREJIĆ N, LUŽANIN Z & STOJKOVSKA I. 2013. A gradient method for unconstrained optimization in noisy environment. Applied Numerical Mathematics, 70:1-21.^]. The method takes the advantages of both approaches, safe convergence of SA method and fast progress obtained by line search if the current iterate if far away from the solution (where the SA steps would be unnecessarily small). The step size is defined according to the following rule. For a given C > 0 and a sequence {a_k}that satisfies assumption S1 we define

where β_k is obtained from the Armijo inequality

After that the new iteration is obtained as

The existence of C such that the gain coefficient (12) is well defined as well as the convergence of the sequence generated by (13) is proved in ^[²⁴[24] KREJIĆ N, LUŽANIN Z & STOJKOVSKA I. 2013. A gradient method for unconstrained optimization in noisy environment. Applied Numerical Mathematics, 70:1-21.^] under one additional assumption.

S8.Observation noise is bounded and there exists a positive constant M such that ||ξ_k(x)|| ≤ M a.s. for all k and x.

Theorem 2.5.^[²⁴[24] KREJIĆ N, LUŽANIN Z & STOJKOVSKA I. 2013. A gradient method for unconstrained optimization in noisy environment. Applied Numerical Mathematics, 70:1-21.^]Assume that Assumptions S1-S4 and S8 hold, the gradient g is Lipschitz continuous with the constant L, and the Hessian matrix H (x*) exists and is nonsingular. Let

where

Let {x_k}be an infinite sequence generated by (13). Then x_k → x* a.s.

Many important issues regarding the convergence of the SA methods are not mentioned so far. One effective possibility to speed up the convergence is to apply the averaging to the sequence of gradient estimations ĝ(x_k)as suggested in Andradottir ^[¹[1] ANDRADOTTIR S. 1996. A Scaled Stochastic Approximation Algorithm. Management Science, 42(4):475-498.^]. It is shown that the rate of convergence could be significantly better than the rate of SA if two conditionally independent gradient estimations are generated and the new iteration is obtained using a scaled linear combination of the two gradients with the gain coefficient. More details on this procedure are available in ^[¹[1] ANDRADOTTIR S. 1996. A Scaled Stochastic Approximation Algorithm. Management Science, 42(4):475-498.^]. Let us also mention a robust SA scheme that determines an optimal constant step length based on minimization of the theoretical error for a pre-specified number of steps ^[²⁷[27] NEMIROVSKI A, JUDITSKY A, LAN G & SHAPIRO A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609.^].

We have assumed in the above discussion that the noisy gradient values are available. This is the case for example if the analytical expression of F in (11) is available. In this case, under certain assumption we can interchange the expectation and derivative and thus a sample average approximation of the gradient can be calculated. It is important to be able to use a sample gradient estimation with relatively modest sample size as calculation of the sample gradient is in general expensive for large samples. However it is safe to claim that the analytical expression for the gradient calculation is not available in many cases and thus the only input data we have are (possibly noisy) function values. Thus the gradient approximation with finite differences appears to be a natural choice in many applications. The first method of this kind is due to Keifer, Wolfowitz, ^[²¹[21] KIEFER J & WOLFOWITZ J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462-466.^]. Many generalizations and extensions are later considered in the literature, see Fu ^[¹⁵[15] FU MC. 2006. Gradient Estimation, HENDERSON SG & NELSON BL. (Eds.). Handbook in OR & MS, 13:575-616.^] for example. Among many methods of this kind the Stochastic Perturbation method is particularly efficient as it uses one two function values to obtain a good gradient approximation, see ^[³⁵[35] SPALL JC 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817-823.^] for implementation details.

The questions of stopping criteria, global convergence, search directions which are not gradient related, and other important questions are beyond the scope of this paper. An interested reader might look at Spall ^[³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^], Shapiro et al. ^[³²[32] SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM.^] for guidance on these issues and relevant literature.

3 SAMPLE AVERAGE APPROXIMATION

Sample Average Approximation (SAA) is a widely used technique for approaching the problems of the form

The basic idea is to approximate the objective function f (x) with the sample mean

where N is the size of a sample represented by i.i.d. random vectors ξ1,...,ξN . Under the standard assumption such as finite variance of F(x,ξ) the (strong) Law of Large Numbers implies that N(x) converges to f (x) almost surely. Moreover, if F(x,ξ)is dominated by an integrable function, then the uniform almost sure convergence of N(x) on the compact subsets of ℝn is obtained.

Within the SAA framework the original problem (14) is replaced by the approximate problem

and thus the key question is the relationship between their respective solutions as N tends to infinity. Denote by X* the set of optimal solutions of the problem (14) and let f* be the optimal value of the objective function. Furthermore, denote by *_N and the set of optimal solutions and the corresponding optimal values, respectively, of the problem (16). Then, the following result holds.

Theorem 3.1. (³²[32] SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM.) Suppose that there exists a compact set C ⊂ ℝn such that X* is nonempty and X*⊂ C. Assume that the function f is finite valued and continuous on C and that _N converges to f almost surely, uniformly on C. Also, suppose that for N large enough the set is nonempty and ⊂ C. Then → f* and the distance between sets _Nand X* tends to zero almost surely as N → ∞.

Let _N be an approximate solution of the problem (14). Clearly _N ( _N) can be calculated for a given sample. The Central Limit Theorem can be used to obtain an approximation of the error bound c_N ( _N) such that the inequality holds with some high probability δ ∈(0,1). For example, using the sample variance the following error bound is obtained

with z being the quantile of the standard normal distribution. The error bound is directly proportional to the variance of the estimator Var( _N( _N)). Therefore, in order to provide a tight bound one can consider some techniques for reducing variance such as the quasi-Monte Carlo or Latin hypercube sampling ^[³²[32] SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM.^]. However, these techniques tend to deteriorate the i.i.d. assumption. This issue is addressed further on in this section.

The gap where x* is a solution of the original problem can be estimated as well. Clearly g( _N) ≥ 0. To obtain an upper bound suppose that M independent samples of size N are available, i.e. we have i.i.d. sample ξ₁m,...,ξ_N^m, m = 1,...,M.

Denote by the relevant (nearly) optimal values and define and Then, the upper bound estimator for the gap g( _N) is

where N' is some large enough sample size and t_M-1,δ is the quantile of Student's distribution with M -1 degrees of freedom. It should be mentioned that the sample size bounds such that the solutions of an approximate problem are nearly optimal for the true problem with some high probability are mainly too conservative for practical applications in general. For further references on this topic, see ^[³²[32] SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM.^] for instance.

Recall that almost sure convergence _N (x)towards f (x)is achieved if the sample is i.i.d. under standard assumptions. However, if the sample is not i.i.d. the almost sure convergence of is achievable only if the sample size N which defines the SAA problem grows at the certain rate, Homem-de-Mello ^[¹⁹[19] HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133.^]. The analysis presented in ^[¹⁹[19] HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133.^] allows for biased estimators _N (x) if _N (x) is at least asymptotically unbiased. Let us first assume that the sample ξ₁k,...,ξk_Nk generated at the iteration k is independent of the sample at the previous iteration for every k. The following assumptions are needed.

R1. For each x, there exists M(x)> 0 such that sup_i,kF(x,ξ_ik) ≤ M(x) with probability 1.

R2. For each x, we have that lim_k→∞E(_N (x)) = f (x).

Theorem 3.2.⁽¹⁹[19] HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133.⁾ Suppose that assumptions R1-R2 hold and that the sample size sequence {Nk }satisfies ∑^∞ _k=1αNk < ∞for all α ∈(0,1). Then (x) converges to f (x) almost surely.

For example, Nk ≥ √k satisfies the previously stated summability condition.

The rate of convergence is also addressed in ^[¹⁹[19] HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133.^], i.e. the error bounds for | (x)- f (x)| are developed. In the case where N_k ≥c₁k^ρ for c₁ > 0 and ρ > 2 it can be proved, under some additional assumptions, that for every k sufficiently large the following inequality holds almost surely

Moreover, if the sample is cumulative, the corresponding error bound is

where C is some positive constant.

The above analysis provides a justification for the SAA approximation as well as a guidance for choosing N in (16). Thus from now on we concentrate on gradient methods for solving (16). Several papers exploit ideas from deterministic optimization. Generally speaking we are interested in solving the SAA problem for some finite, possibly very large N as well as obtaining asymptotic results i.e. the results that cover the case N →∞ even if in practical applications one deals with a finite value of N. A naive application of an optimization solver to (16) is very often prohibitively costly if N is large due to the cost of calculating _N(x) and its gradient. Thus there is a vast literature dealing with variable sample scheme for solving (16).

Two main approaches can be distinguished. In the first approach the objective function _N is replaced with (x) at each iteration k and the iterative procedure is essentially a two step procedure of the following form. Given the current approximation x_k and the sample size N_k one has to find s_k such that the value of (x_k + s_k) is decreased. After that we set x_k+1 =x_k + s_k and choose a new sample size N_k+1. The key ingredient of this procedure is the choice of N_k+1.

The schedule of sample sizes {N_k} should be defined in such way that either N_k =N for k large enough or N_k →∞ if one is interested in asymptotic properties.

The second approach, often called the diagonalization scheme or the surface response method, is again a two step procedure. It consists of a sequence of SAA problems with different sample sizes that are approximately solved. So for the current x_k and N_k the problem (16) with N = N_k is approximately solved (within an inner loop) for starting with x_k as the initial approximation. After that we set x_k+1 = N_kand choose the new sample size N_k+1. Two important points in this procedure are the choice of Nk+1 and the precision in solving each of the optimization problems min

Let us now look into algorithms of the first kind. Keeping in mind that min is just an approximation of the original problem and that the cost of each iteration depends on N_k,it is rather intuitive to start the optimization procedure with smaller samples and gradually increase the sample size N_k as the solution is approached. Thus the most common schedule sequence would be an increasing sequence N₀,N₁,.... The convergence theory for this kind of reasoning is introduced in Wardi, ^[⁴⁰[40] WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417.^] where an Armijo type line search method is combined with SAA approach. In order to solve the problem of type (14), the iterative sequence is generated as

where N_k is the sample size used at iteration k and α_k is the largest number in (0,1] satisfying the inequality

The method is convergent with zero upper density ^[⁴⁰[40] WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417.^], assuming that N_k →∞. More precisely, the following statement is proved.

Theorem 3.3.^[⁴⁰[40] WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417.^] Assume that the function f is given by (14) and that F is twice continuously differentiable on ℝn for every ξ. Furthermore assume that for every compact set D ⊂ℝⁿ, there exists K > 0 such that for every x ∈ D and every ξ.

If N_k →∞ then the sequence {xk}given by (19) converges with zero upper density on compact sets.

An extension of the above work is presented in Yan, Mukai ^[³⁷[37] YAN D & MUKAI H. 1993. Optimization Algorithm with Probabilistic Estimation. Journal of Optimization Theory and Applications, 79(2):345-371.^] where the adaptive precision is proposed i.e. the sequence {Nk}_k∈ℕ is not determined in advance as in ^[⁴⁰[40] WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417.^] but it is adapted during the iterative procedure. Nevertheless the sample size has to satisfy N_k →∞. The convergence result is slightly stronger as the convergence with probability 1 is proved under the set of appropriate assumptions. The more general result that applies to both gradient and subgradient methods is obtained in Shapiro, Wardi ^[³³[33] SHAPIRO A & WARDI Y. 1996. Convergence Analysis of Stochastic Algorithms. Mathematics of Operations Research 21(3):615-628.^] where the convergence with probability 1 is proved for sample average gradient and subgradient methods assuming that the sample size tends to infinity.

In practical applications, the sample size is finite. So, let us now suppose that N_max is the sample size which makes good approximation of the original objective function. Very often we assume that the sample is generated at the beginning of the process which justifies considering the SAA objective function as deterministic. In this case one wishes again to decrease the cost of the optimization process by decreasing the number of function and gradient evaluations. Let us now look closer at the possible schedule of N_k. Clearly the sample size should be equal to Nmax at the final stages of the optimization procedure to ensure that the problem (16) with N =N_max is solved. Thus one can consider even some heuristic schedule ^[¹⁴[14] FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405.^] to generate a non-decreasing sequence {N_k}which eventually becomes stationary with N_k =N_max for k large enough. For example, a simple way to define such sequence could be to increase Nk by a fixed number every K iterations.

The problem of scheduling can be approached from a different perspective in the following manner. Instead of constantly increasing the sample size one could monitor the progress in decreasing the (approximate) objective function and choose the next sample size according to that progress. One algorithm of this kind is presented in Deng, Ferris ^[¹²[12] DENG G & FERRIS MC. 2009. Variable-Number Sample Path Optimization. Mathematical Programming 117(1-2):81-109.^] where the Bayes risk is used to decide the scheduling sequence within a trust region method. Another class of results in the framework of trust region methods is presented in Bastin ^[²[2] BASTIN F. 2004. Trust-Region Algorithms for Nonlinear Stochastic Programming and Mixed Logit Models. PhD thesis, University of Namur, Belgium.^] and Bastin et al. ^[³[3] BASTIN F, CIRILLO C & TOINT PL. 2006. An adaptive Monte Carlo algorithm for computing mixed logit estimators. Computational Management Science 3(1):55-79.^{], [}⁴[4] BASTIN F, CIRILLO C & TOINT PL. 2006. Convergence theory for nonconvex stochastic programming with an application to mixed logit. Math. Program., Ser. B 108:207-234.^]. The key point of the approach considered in ^[²[2] BASTIN F. 2004. Trust-Region Algorithms for Nonlinear Stochastic Programming and Mixed Logit Models. PhD thesis, University of Namur, Belgium.^,³[3] BASTIN F, CIRILLO C & TOINT PL. 2006. An adaptive Monte Carlo algorithm for computing mixed logit estimators. Computational Management Science 3(1):55-79.^,⁴[4] BASTIN F, CIRILLO C & TOINT PL. 2006. Convergence theory for nonconvex stochastic programming with an application to mixed logit. Math. Program., Ser. B 108:207-234.^] is that the sample sizes might oscillate during the iterative process, i.e. {Nk}is not necessarily non-decreasing at the initial stages of the iterative process. Eventually N_k =N _max is reached and (16) is solved, but very often with smaller costs if compared with an increasing scheduling. The efficiency of this approach comes from the fact that there is a balance between the precision of the objective function approximation and the progress towards the solution. The same idea is further developed for the line search methods in Krejić, Krklec ^[²²[22] KREJIĆ N & KRKLEC N. 2013. Line search methods with variable sample size for unconstrained optimization. Journal of Computational and Applied Mathematics, 245:213-231.^] as follows.

Let us assume that the gradient of ∇F is available and that the search direction p_k satisfies . The armijo rule with η ∈ (0, 1) is applied to find α_k such that x_k+1 = x_k +α_k p_k where

The sample size is updated as follows. First, the candidate sample size N +_k is determined by comparing the measure of decrease in the objective function dm_k = -α_k(∇(x_k))T p_k and the so called lack of precision εN_δ(x) defined by (17). The main idea is to find the sample size N⁺_k such that dm_k ≈ (xk). The reasoning behind this idea is the following. If the decrease measure dm_k is greater than the lack of precision (x^k), the current approximation is probably far away from the solution. In that case, there is no need to impose high precision and therefore the sample size is decreased if possible. The candidate sample size does not exceed N_max,but there is also the lower bound, i.e. N_kmin ≤ N_k⁺ ≤ N_max. This lower bound is increased only if N_k+1 > N_k and there is not enough progress concerning the function . After finding the candidate sample size, a safeguard check is performed in order to prohibit the decrease of the sample size which might be unproductive. More precisely, if N_k+ < N_k the following parameter is calculated

If ρ_k is relatively small, then it is presumed that these two model functions are too different and thus there is no gain in decreasing the sample size. So, N_k+1 = N_k. In all the other cases, the decrease is accepted and N_k+1 = N_k ⁺ . The convergence analysis relays on the following important result which states that after some finite number of iterations, the objective function becomes and (16) is eventually solved.

Theorem 3.4.^[²²[22] KREJIĆ N & KRKLEC N. 2013. Line search methods with variable sample size for unconstrained optimization. Journal of Computational and Applied Mathematics, 245:213-231.^] Suppose that F(·,ξ) is continuously differentiable and bounded from below for every ξ. Furthermore, suppose that there exist a positive constant κ and number n₁ ∈ ℕ such that (x_k) ≥ κ for every k ≥ n ₁ . Then there exists q ∈ ℕ such that N_k = N_max for every k ≥ q.

Let us now present some results for the second type of methods, the so called diagonalization methods described above. One possibility to determine the sample sizes in the sequence of optimization problems to be solved is presented in Royset ^[³¹[31] ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321.^] where an optimality function is used to determine when to switch on to a larger sample size. The optimality function is defined by mapping θ : ℝn → (-∞,0] which, under standard conditions, satisfies θ(x) = 0 if and only if x is a solution in some sense. For unconstrained problem and its SAA approximation is given by . Under the set of standard assumptions, almost sure convergence of θ_N (x) towards θ(x) is stated together with asymptotic normality.

Denote by the iterate obtained after a finite number of iterations of an algorithm applied on the SAA problem with sample size N_k where is the initial point. The point i s an approximate solution of (16) with N = N_k and it is assumed that the optimization algorithm used to determine that point is successful in the following sense.

R3. For any N_k every accumulation point of the sequence generated by the optimization method for solving

satisfies θ_Nk ( ) = 0 almost surely.

The algorithm proposed in ^[³¹[31] ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321.^] increases the sample size when

θ_Nk (x_k) ≥ - δ₁Δ (N_k ),

where δ₁ is some positive constant and Δ is a function that maps ℕ into (0,∞) and satisfies lim_N→∞ Δ ( N) = 0. The sample size is assumed to be strictly increasing and unbounded, but the exact dynamics of increasing is not specified. The convergence of the algorithm is proved under one additional assumption.

R4. On any given set S ⊂ ℝn, the function F(·,ξ) is continuously differentiable and F(·,ξ) and ||∇_x F(·,ξ)|| are dominated by an integrable function.

Theorem 3.5.^[³¹[31] ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321.^] Suppose that the assumptions R3-R4 are satisfied and that the sequence of iterates generated by the algorithm proposed in ^[³¹[31] ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321.^] is bounded. Then, every accumulation point of that sequence satisfies θ( ) = 0 almost surely.

The relation between the sample size and the error tolerance for each of the optimization problems solved within the diagonalization methods is considered in Pasupathy ^[²⁸[28] PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901.^]. The error tolerance here is a small number ε_k which almost surely satisfies || - || ≤ ε_k where and represent the approximate and the true (unique) solution of the corresponding SAA problem, respectively. A measure of effectiveness is defined as q_k =|| - ||²W_k where W_k represents the number of simulation calls needed to obtain the approximate solution . Since the almost sure convergence is analyzed, it is assumed that N_k →∞ and ε_k → 0. It is proved that the measure of effectiveness is bounded in a stochastic sense if the following three conditions hold.

R5. If the numerical procedure used to solve SAA problems exhibits linear convergence, we as sume that . If the numerical procedure used to solve SAA problems exhibits polynomial convergence of order p > 1, we assume .

R6. lim sup_{k
→∞}(∑^k_j=1 N_j)ε_k² < ∞.

R7. lim sup_k→∞(∑k_j=1N_j )N_k ^-1 < ∞.

If any of the above conditions is violated, then qk tends to infinity in probability. The key point of analysis in ^[²⁸[28] PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901.^] is that the error tolerance should not be decreased faster than the sample size is increased. The dynamics of change depends on the convergence rate of numerical procedures used to solve the SAA problems. Moreover, the mean squared error analysis implies the choice of εk and N_k such that

In order to further specify the choice of the optimal sequence of sample sizes, the following theorem is stated.

Theorem 3.6.^[²⁸[28] PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901.^] Suppose that (21) holds together with the assumptions R5-R7. If the numerical procedure used to solve SAA problems exhibits linear convergence, then

If the numerical procedure used to solve SAA problems exhibits polynomial convergence of order p > 1,then lim sup_k→∞N_k/N p_k-1< ∞.

More specific recommendations are given for linear, sublinear and polynomial rates in ^[²⁸[28] PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901.^]. For example, if the applied algorithm is linearly convergent, then the linear growth of a sample size is recommended, i.e. it can be set N_k+1 =1.1Nk l for example. Also, in that case, exponential or polynomial growth of order p > 1 are not recommended. However, if the polynomial rate of convergence of order p > 1 is achieved, then we can set N_k+1 =⌈ N ^1.1 ⌉ or N_k+1 = ⌈ e Nk 1.1 ⌉ for instance. Furthermore, it is implied that the error tolerance sequence should be of the form K / √N_k where K is some positive constant.

The diagonalization methods are defined for a finite N as well. One possibility is presented in Polak, Royset ^[²⁹[29] POLAK E & ROYSET JO. 2008. Eficient sample sizes in stochastic nonlinear programing. Journal of Computational and Applied Mathematics 217(2):301-310.^] where the focus is on finite sample size N although the almost sure convergence is addressed. The idea is to approximately solve the sequence of SAA problems with N =N_k, k =1,..., s applying n_k iterations at every stage k. The sample size is nondecreasing and the sample is assumed to be cumulative. The method consists of three phases. The first phase provides the estimates of relevant parameters such as the sample variance. In the second phase, the scheduling sequence is obtained. Finally, the sequence of the SAA problems is solved in the last phase.

An additional optimization problem is formulated and solved in the second phase in order to find the number s of the SAA problem to be solved, the sample sizes N_k,k =1,..., s and the number of iterations n_k that are applied to solve the corresponding SAA problem. The objective function of this additional problem is the overall cost ∑^s _k=1n_kw(N_k ), where w(N) is the estimated cost of one iteration of the algorithm applied on the function _N. For example w(N) = N. The constraint for this problem is motivated by the stopping criterion f (x) - f* ≤ ε ( f (x ₀) - f* ) where f* is the optimal value of the objective function. More precisely, the cost-to-go is defined as e_k = f (x k_nk) - f* where x k_nk is the last iteration at the stage k. Furthermore, the upper bound estimate for e_s is determined as follows. Let ∆(N) be the function defined as in Royset ^[³¹[31] ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321.^].

R8. There exists strictly decreasing function ∆(N) : ℕ →(0, ∞) such that lim_N→∞ ∆(N) =0 and | (x) - f (x)| ≤ ∆(Nk) holds almost surely.

One may use a bound like (18), but it is usually too conservative for practical implementations. Therefore, ∆(N) is estimated with the confidence interval bound of the form (17) where the variance is estimated in the first stage. The following bound is derived

where l_k (s) represents the remaining number of iterations after the stage k and θ defines the rate of convergence of the deterministic method applied on SAA. The initial cost-to-go from e ₀ = f (x 1 ₀) - f* is also estimated in the first phase. Finally, the efficient strategy is obtained as the solution of the following problem

In order to prove the asymptotic result, the following assumption regarding the optimization method used at each stage is imposed.

R9. The numerical procedure used to solve SAA problems almost surely exhibits linear rate of convergence with parameter θ ∈ (0, 1).

Theorem 3.7.^[²⁹[29] POLAK E & ROYSET JO. 2008. Eficient sample sizes in stochastic nonlinear programing. Journal of Computational and Applied Mathematics 217(2):301-310.^] Suppose that the assumptions R8-R9 hold and that the sample size sequence tends to infinity. Then lim_s→∞e_s = 0 almost surely.

4 APPLICATIONS TO DETERMINISTIC PROBLEMS

A number of important deterministic problems can be written in the form of

where f_i (x) are given functions and N is a large integer. For example, least squares and maximum likelihood problems are of this form. The objective function in (22) and its gradient are generally expensive to compute if N is large. On the other hand for a given sample realization ξ1,...,ξ_N and f_i (x) = F(x,ξi ) the SAA problems discussed in Section 3 are the same as (22). Therefore the SAA methods that deal with finite N can be used for solving the deterministic problems specified in (22). The main idea of this approach is to use the same reasoning as in the variable sample schemes to decrease the cost of calculating the objective function and its gradient i.e. to approximate the function and the gradient with and ∇. One application of a variable sample method to the data fitting problem is presented in Krejić, Krklec Jerinkić, ^[²³[23] KREJIĆ N & KRKLEC JERINKIĆ N. 2014. Nonmonotone line search methods with variable sample size. Numerical Algorithms, 8:1-29.^]. In this section we consider two important problems, data fitting and machine learning, and methods for their solutions that use stochastic gradient approximation in the sense of approximate gradient as explained above.

The data fitting problem of the form (22) is considered in Friedlander, Schmidt ^[¹⁴[14] FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405.^]. The problem is solved a by quasi Newton method but the gradient approximation of the SAA type is used. Let the gradient estimation at the current iteration k be given as g_k =∇ f (x_k) + ϵ_k where ϵ is the error term. T he following assumptions are stated.

P1. The functions f₁,..., f_N are continuously differentiable and the function f is strongly convex with parameter µ. Also, the gradient ∇ f is Lipschitz continuous with parameter L.

P2. There are constants β₁ ≥ 0 and β₂ ≥ 1 such that ||∇ f_i (x)||² ≤ β₁ + β₂ ||∇ f (x) ||² for all x and i = 1,...,N.

The algorithm can be considered as an increasing sample size method where the sample size is bounded with N. The main issue in ^[¹⁴[14] FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405.^] is the rate of convergence and the convergence analysis is done with the assumption of a constant step size. More precisely

Two approaches are considered: deterministic and stochastic sampling. The deterministic sampling assumes that if the sample size is N_k then the gradients to be evaluated ∇ f_i (x_k ), i = 1,...,N_k, are determined in advance. For example, the first N_k functions are used to obtain the gradient approximation g_k = ∑Nk_i=1∇ f_i (x_k ). On the other hand, stochastic sampling assumes that the gradients ∇ f_i (x), i = 1,..., N_k, to be evaluated are chosen randomly. We state the relevant results considering R-linear rate of convergence. In the case of deterministic gradient, q-linear convergence is also attained but under stronger conditions on the increase of the sample size.

Theorem 4.1.^[¹⁴[14] FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405.^] Suppose that the assumptions P1-P2 hold and that (N -Nk)/N = O (γ k/2) for some γ ∈ (0, 1). Then for any ε 0, σ = max{γ, 1 -µ/L} + ε and every k, in deterministic case we obtain

f (x ) - f (x*) = ( f (x ₀) - f (x*)) O ((1 - µ/L + ε)k) + O(σk).

Moreover, if (N -N_k)/(N_k N) = O(γ k), in stochastic case we obtain

E [f (xk ) - f (x*)] = (f (x₀) - f (x*)) O ((1 - µ/L +ε)) + O(σk ).

Machine learning applications which usually assume large number of training points can also be viewed as problems of the form (22). Methods for solving such problems are the subject of Byrd et al. ^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^] and Byrd et al. ^[⁹[9] BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.^]. The main idea in ^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^] is to create methods which use the second order derivative information but with cost comparable to the steepest descent method. The focus is on using a cheep Hessian approximations ∇² (x_k) where S_k is a number of training points, i.e. the Hessian-related sample size at iteration k. More precisely, matrix-free conjugate gradient method is applied in order to obtain the search direction p_k as an approximate solution of the system

Here, N_k is a sample size related to the gradient and the function approximation. This can be considered as an inexact Newton method. In the relevant examples, p_k is guaranteed to be a descent search direction and therefore the Armijo line search is applied. The proposed method (named S-Newton) does not specify the dynamic of changing the sample sizes. It only requires that the variable sample strategy is used and S_k < N_k. The analysis is conducted for the full gradient case.

Theorem 4.2.^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^] Suppose that the function is twice continuously differentiable and uniformly convex and that there exists a constant γ > 0 such that x T ∇²(x_k)x ≥ γ ||x|| for every k and x. Then the sequence generated by the S-Newton method with N_k = N satisfies lim_k→∞||∇(x_k)||= 0.

The same result can be obtained for the so called SLM method which uses matrix-free limited memory BFGS method. In that case, the conjugate gradient method is used to obtain the search direction where the sub-sampled Hessian approximation ∇² (xk) is used for the initial matrixvector product at every iteration. The line search uses Wolfe conditions for choosing the suitable step size.

The dynamic of increasing the sample size in the machine learning problem is addressed in ^[⁹[9] BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.^]. The main idea is to estimate the sample size which makes the search direction p_k descent for the objective function _N without evaluating the true gradient ∇ f_N (x). The approximation of the negative gradient p_k = -∇ (x_k) is a descend direction if for some θ ∈^[0^,¹[1] ANDRADOTTIR S. 1996. A Scaled Stochastic Approximation Algorithm. Management Science, 42(4):475-498.^], the following inequality holds

Since and N is large, inequality (23) is approximated by

where is a sample variance related to the chosen sample of the size N_k . The algorithm for the sample size schedule proposed in ^[⁹[9] BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.^] can be described as follows. After finding the step size αk such that (x_k + α_k p_k) < (x_k) and setting x_k+1 = x_k + α_k p_k, a new sample of the same size N_k is chosen. If inequality (24) holds for the new sample, the sample size remains unchanged, i.e. N_k+1 = N_k. Otherwise, the sample is augmented and the new sample size is determined by

In order to conduct the complexity analysis, the constant step size is considered and the q-linear convergence rate is analyzed.

Theorem 4.3.^[⁹[9] BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.^] Suppose that the function _N is twice continuously differentiable and x* is a solution of the problem (22) with _N(x*) = 0. Furthermore, assume that there are constants 0 < λ < L such that λ ||h||² ≤ h T ∇² (x)h ≤ L ||h||² for all x and h. Let the sequence of iterates be generated by x_k+1 = x_k -α ∇ (x_k) where α = (1 -θ) / L and θ ∈ (0,1). If the condition (23) is satisfied at iteration k, then

where β = . Moreover, if (23) holds for every k, then

The schedule {N } for the gradient estimations is extended to the second order approximations to define a Newton- method, the S-Newton method defined in ^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^]. This method uses the updating of N_k described above while the Hessian-related sample size S_k follows the dynamic of N_k. More precisely, S_k = RN_k where R is some positive number substantially smaller than 1. Also, the sample used for the Hessian approximation is assumed to be a subset of the sample used for the gradient and the function approximations. The stopping criterion for the conjugate gradient method used for obtaining the search direction is more complex than in ^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^] since it is related to the sample size. Wolfe conditions are imposed to obtain a suitable step size.

5 CONCLUSIONS

In this survey we considered unconstrained problems with the stochastic or expectation objective function. We focused our attention on two specific classes of gradient-related methods: Stohastic Approximation and Sample Average Approximation and many other important approaches are left out for the sake of brevity. An interested reader should consults ^[³²[32] SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM.^,³⁴[34] SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.^] for the initial guidance into stochastic optimization problems. Several natural extensions are easily incorporated in the framework considered in this paper, for example search directions with second order information which usually yield faster convergence, but also require additional cost, ^[⁸[8] BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.^,⁹[9] BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.^,¹⁰[10] BYRD RH, HANSEN SL, NOCEDAL J & SINGER Y. 2014. A Stochastic Quasi-Newton Method for Large-Scale Optimization. Technical report, arXiv:1401.7020 [math.OC].^]. In order to decrease the linear algebra costs, one can consider preconditioners as their construction might be a nontrivial issue due to the presence of random variable. On the other hand, given that there are a lot of problems yielding only input-output information, an interesting approach within SA and SAA frameworks is based on zero order information, ^[¹⁷[17] GHADIMI S & LAN G 2013. Stochastic first and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368.^]. Constrained problems are always of great interest and some recent research of penalty methods within SA methodology is presented in ^[³⁹[39] WANG X, MA S & YUAN Y. Penalty Methods with Stochastic Approximation for Stochastic Nonlinear Programming. Technical report arXiv:1312.2690 [math.OC].^]. Projection and filter methods with variable sample size might be a valuable topic of future research. Among the others, chance constrained problems are especially challenging. In all the considered methods, deriving some complexity bounds can be of great interest from practical point of view.

*
Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

REFERENCES

^[1]
ANDRADOTTIR S. 1996. A Scaled Stochastic Approximation Algorithm. Management Science, 42(4):475-498.
^[2]
BASTIN F. 2004. Trust-Region Algorithms for Nonlinear Stochastic Programming and Mixed Logit Models. PhD thesis, University of Namur, Belgium
^[3]
BASTIN F, CIRILLO C & TOINT PL. 2006. An adaptive Monte Carlo algorithm for computing mixed logit estimators. Computational Management Science 3(1):55-79.
^[4]
BASTIN F, CIRILLO C & TOINT PL. 2006. Convergence theory for nonconvex stochastic programming with an application to mixed logit. Math. Program., Ser. B 108:207-234.
^[5]
BENVENISTE A, METIVIER M & PRIOURET P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York
^[6]
BERTSEKAS DP & TSITSIKLIS JN. 2000. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optimization, 10(2):627-642.
^[7]
BIRGE JR & LOUVEAUX F. 1997. Introduction to Stochastic Programming, Springer.
^[8]
BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995.
^[9]
BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155.
^[10]
BYRD RH, HANSEN SL, NOCEDAL J & SINGER Y. 2014. A Stochastic Quasi-Newton Method for Large-Scale Optimization. Technical report, arXiv:1401.7020 [math.OC].
^[11]
DELYON B & JUDITSKY A. 1993. Accelerated stochastic approximation. SIAM J. Optimization 3(4):868-881.
^[12]
DENG G & FERRIS MC. 2009. Variable-Number Sample Path Optimization. Mathematical Programming 117(1-2):81-109.
^[13]
FABIAN V. 1968. On Asymptotic Normality in Stochastic Optimization. The Annals of Mathematical Statistics, 39:1327-1332.
^[14]
FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405.
^[15]
FU MC. 2006. Gradient Estimation, HENDERSON SG & NELSON BL. (Eds.). Handbook in OR & MS, 13:575-616.
^[16]
GHADIMI S & LAN G & ZHANG H. 2014. Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Technical report arXiv:1308.6594 [math.OC].
^[17]
GHADIMI S & LAN G 2013. Stochastic first and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368.
^[18]
HOMEM-DE-MELLO T. 2008. On rates of convergence for stochastic optimization problem under non-independent and identocally distributed sampling. SIAM J. Optim. 19(2):524-551.
^[19]
HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133.
^[20]
KESTEN H. 1958. Accelerated stochastic approximation. Ann. Math. Statist. 29:41-59.
^[21]
KIEFER J & WOLFOWITZ J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462-466.
^[22]
KREJIĆ N & KRKLEC N. 2013. Line search methods with variable sample size for unconstrained optimization. Journal of Computational and Applied Mathematics, 245:213-231.
^[23]
KREJIĆ N & KRKLEC JERINKIĆ N. 2014. Nonmonotone line search methods with variable sample size. Numerical Algorithms, 8:1-29.
^[24]
KREJIĆ N, LUŽANIN Z & STOJKOVSKA I. 2013. A gradient method for unconstrained optimization in noisy environment. Applied Numerical Mathematics, 70:1-21.
^[25]
KUSHNER H. 2010. Stochastic Approximation: A survey. Computational Statistics, 2(1):87-96.
^[26]
MARTI K. 2005. Stochastic Optimization Methods, Springer
^[27]
NEMIROVSKI A, JUDITSKY A, LAN G & SHAPIRO A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609.
^[28]
PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901.
^[29]
POLAK E & ROYSET JO. 2008. Eficient sample sizes in stochastic nonlinear programing. Journal of Computational and Applied Mathematics 217(2):301-310.
^[30]
ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407.
^[31]
ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program 135(1-2):293-321.
^[32]
SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM
^[33]
SHAPIRO A & WARDI Y. 1996. Convergence Analysis of Stochastic Algorithms. Mathematics of Operations Research 21(3):615-628.
^[34]
SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey.
^[35]
SPALL JC 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817-823.
^[36]
YOUSEFIAN F, NEDIC A & SHANBHAG UV. 2012. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67.
^[37]
YAN D & MUKAI H. 1993. Optimization Algorithm with Probabilistic Estimation. Journal of Optimization Theory and Applications, 79(2):345-371.
^[38]
XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846.
^[39]
WANG X, MA S & YUAN Y. Penalty Methods with Stochastic Approximation for Stochastic Nonlinear Programming. Technical report arXiv:1312.2690 [math.OC].
^[40]
WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417.

Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

Publication Dates

Publication in this collection
Sep-Dec 2014

History

Received
11 Sept 2013
Accepted
13 Dec 2013

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[1] *
Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

Brasil

Brasil

STOCHASTIC GRADIENT METHODS FOR UNCONSTRAINED OPTIMIZATION^* * Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030. Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

Abstract

1 INTRODUCTION

2 STOCHASTIC APPROXIMATION METHODS

3 SAMPLE AVERAGE APPROXIMATION

4 APPLICATIONS TO DETERMINISTIC PROBLEMS

5 CONCLUSIONS

REFERENCES

Publication Dates

History

Brasil

Brasil

STOCHASTIC GRADIENT METHODS FOR UNCONSTRAINED OPTIMIZATION* * Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030. Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

Abstract

1 INTRODUCTION

2 STOCHASTIC APPROXIMATION METHODS

3 SAMPLE AVERAGE APPROXIMATION

4 APPLICATIONS TO DETERMINISTIC PROBLEMS

5 CONCLUSIONS

REFERENCES

Publication Dates

History

STOCHASTIC GRADIENT METHODS FOR UNCONSTRAINED OPTIMIZATION^* * Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030. Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.