## Print version ISSN 0101-7438

### Pesqui. Oper. vol.34 no.3 Rio de Janeiro Sept./Dec. 2014

#### http://dx.doi.org/10.1590/0101-7438.2014.034.03.0373

Articles

STOCHASTIC GRADIENT METHODS FOR UNCONSTRAINED OPTIMIZATION*

1Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia. E-mails: natasak@uns.ac.rs; natasa.krklec@dmi.uns.ac.rs

ABSTRACT

This papers presents an overview of gradient based methods for minimization of noisy functions. It is assumed that the objective functions is either given with error terms of stochastic nature or given as the mathematical expectation. Such problems arise in the context of simulation based optimization. The focus of this presentation is on the gradient based Stochastic Approximation and Sample Average Approximation methods. The concept of stochastic gradient approximation of the true gradient can be successfully extended to deterministic problems. Methods of this kind are presented for the data fitting and machine learning problems.

Key words: unconstrained optimization; stochastic gradient; stochastic approximation; sample average approximation

1 INTRODUCTION

Stochastic optimization problems appear in all areas or engineering, physical and social sciences. Typical applications are model fitting, parameter estimation, experimental design, performance evaluation etc. The models we are considering here can be written in the form

where f: ℝn → ℝ is either observed with noise or is defined as the mathematical expectation. In fact the objective function depends on a vector of random variables ξ from some probability space that might be know or unknown, depending on application. Thus the exact evaluation of f (x) is impossible to evaluate and it is necessary to use simulation to estimate the objective function value.

The feasible set Ω can be defined by constraints of different types - simple box constraints, deterministic constraints, chance constraints, constraints in the form of mathematical expectation etc. In this paper we consider only the case Ω = ℝn . More precisely, we consider only two types of stochastic problems. The first type are problems with random objective function,

where ξ represents the noise (or randomness) and x is the decision variable. Such models appear when the decision has to be taken before the full information about the problem parameters is known. The lack of information in that case is represented by random vector ξ . The second type of problems are the problems with the mathematical expectation as the objective function

Although the noise is technically removed in the above problem, it is rather hard to solve it as the expectation is hard or even impossible to state analytically even if the distribution of ξ is known.

Methods for solving stochastic optimization problems are combination of ideas from numerical optimization and statistics. Thus the class of popular methods include simulation-based methods, direct methods for stochastic search, annealing type algorithms, genetic algorithms, methods of reinforced learning, statistical methods and many others, [34], [7]. Among all of them we restrict our attention here on two methods typically used in simulation based optimization: Stochastic Approximation, SA, and Sample Average Approximation, SAA.

Stochastic Approximation methods are introduced in the seminal paper of Robbins and Monro, [30] and remain a popular choice for solving stochastic optimization problems. They relay mainly on noisy gradient evaluations and depend heavily on the choice of steplength sequence. The choice of this sequence is the subject of many research efforts as well as other techniques for accelerating the convergence of SA methods. Sample average Approximation methods can be seen as an alternative to SA methods. In this approach a sample from the underlying distribution is used to construct a deterministic sample average problem which can be solved by optimization methods. However the sample used for the SAA approximation very often needs to be large and a naive application of standard nonlinear optimization techniques is not feasible. Therefore there has been extensive research in variable sample size methods that reduce the cost of SAA.

Both SA and SAA methods are considered here in the framework of gradient-related optimization (gradient methods, subgradient methods, second order quasi Newton methods) as well as in the derivative-free framework. This survey will largely deal with gradient methods for stochastic optimization and an interested reader can look at Spall [34] for an overview of other methods. This paper is organized as follows. In Section 2 we discuss the SA method and its modifications. Section 3 contains results for the SAA methods and unconstrained problems with the mathematical expectation objective function. Two important deterministic problems that are rather similar to SAA problem are discussed in Section 4 as well as methods for obtaining their solution that rely on stochastic gradients. Finally, some conclusion and research perspectives are presented in Section 5.

2 STOCHASTIC APPROXIMATION METHODS

There have been countless applications of Stochastic Approximation (SA) method since the work of Robbins and Monro [30]. In this section we give an overview of its main properties and some of its generalizations. The problem we consider is

assuming that only the noisy measurements (x) of the function and its gradient ĝ(x) are available. Let us start by considering the SA algorithm for solving systems of nonlinear equations as it was defined originally in [30]. The convergence theory presented here relays on imposition of statistical conditions on the objective function and the noise. Convergence analysis can be conducted throughout differential equations as well, see [34] and [25] for further references.

Consider the system of nonlinear equations

with g(x) being the gradient of f (x). Suppose that only the measurements with noise that depends on the iteration as well as on the decision variable x

are available. Then the SA is defined by

The sequence of step sizes {ak}k∈N is also called the gain sequence and it has dominant influence on the convergence.

Let {xk} be a sequence generated by an SA method. Denote by Fk the σ-algebra generated by x0, x1,..., xk. If the problem has an unique solution x* the set of assumptions that ensures the convergence of an SA method is the following.

S1. The gain sequence satisfies:

S2. For some symmetric, positive definite matrix B and for every η ∈ (0,1),

S3. For all x and k,

S4. There exists a constant c> 0 such that for all x and k,

||g(x)||2 + E [ ||ξk(x)||2| Fk ] ≤ c (1 +||x||2).

The first assumption which implies that the step sizes converge to zero is standard in stochastic algorithms, see [34]. The second condition, is imposed in order to avoid inefficiently small step sizes. On the other hand, the summability condition on ak 2 ensures stability. Its role is to decrease the influence of the noise when the iterates come into a region around the solution. An example of a sequence that satisfies the first assumption is

where α ∈ (0.5, 1) and a is some positive constant. The condition of zero mean is also standard in stochastic optimization. Its implication is that ĝk(x) is an unbiased estimator of g(x). Notice that under assumption S3, the condition in S4 is equal to

E [ĝk(x)||2 ] ≤ c (1 +||x|| 2).

Therefore, the mean of ||ĝk(x)||2 can not grow faster than a quadratic function of x. Under these assumptions, the almost sure convergence of the SA algorithm can be established. The convergence in mean square i.e. E[|xk-x|)2] → 0 as k → ∞ was proved in [30] and the theorem bellow states a stronger result, the almost sure convergence.

Theorem 2.1. [34] Consider the SA algorithm defined by (6). Suppose that assumptions S1 -S4 hold and that x* is a unique solution of the system (4). Then xk converges almost surely to x*.

Closely related and more general result is proved in Bertsekas, Tsitsiklis [6] where the gradient related method of the form

is considered. Here ξk is either stochastic or deterministic error, ak is a sequence of diminishing step sizes that satisfy assumption S1 and sk is a descent direction. In this context, the direction sk is not necessarily the gradient but it is gradient-related. The convergence is stated in the following theorem.

Theorem 2.2. [6] Let {xk} be a sequence generated by (8), where sk is a descent direction. Assume that S1 and S3 -S4 hold and that there exist positive scalars c1 and c2 such that

c 1||∇f (xk)||2 ≤ - ∇ f (xk) T sk, ||sk|| ≤ c 2 (1 + ||∇ f (xk)||).

Then, either f (xk) →-∞ or else f (xk) converges to a finite value and limk→∞f (x) = 0. Furthermore, every limit of {xk} is a stationary point of f.

The gain sequence is the key element of the SA method. It has impact on stability as well as on convergence rate. Under some regularity conditions, Fabian [13], the asymptotic normality of xk is obtained. More precisely,

where →d denotes the convergence in distribution, α refers to (7) and is the covariance matrix that depends on the gain sequence and on the Hessian of f . Therefore, the iterate xk approximately has normal distribution N (x*, k -α ◹) for large k. Due to assumption S1, the maximal convergence rate is obtained for α = 1. However, this reasoning is based on the asymptotic result. Since the algorithms are finite in practice, it is often desirable to set α < 1 because α = 1 yields smaller steps. moreover, if we want to minimize ||◹||, the ideal sequence would be

where H (x) denotes the Hessian matrix of f , Benveniste et al. [5]. Even though this result is purely theoretical, sometimes the Hessian at x* can be approximated by H (xk) and that way one can enhance the rate of convergence.

Two main drawbacks of the SA method are slow convergence and the fact that the convergence theory applies only if the solution of (4) is unique i.e. only if f has an unique minimizer. Thus several generalizations are developed to address these two issues. One can easily see from (7) that the gain coefficients increase with the increase of a. On the other hand a large a might have negative influence on stability. Therefore several generalizations of the gain coefficients are considered in the literature. One possibility, Spall [35], is to introduce the so called stability constant A > 0, and obtain

Now, the values of a and A can be chosen together to ensure effective practical performance of the algorithm, allowing for larger a and thus producing larger step sizes in latter iterations, when the effect of A is small, and avoiding instability in early iterations. The empirically recommended value of A is at most 10% of the iterations allowed or expected during the optimization process, for more details see [35].

Several generalizations of the SA method are based on adaptive step sizes that try to adjust the step size at each iteration to the progress achieved in the previous iterations. The first attempt of this kind has been made in Kesten [20] for one dimensional problems. The main idea is to monitor the changes in the sign of xk+1-xk. If the sign of the difference between two consecutive iterations starts to change frequently we are probably in the domain of noise and therefore small steeps are needed to avoid oscillations. This idea is generalized in Delyon, Juditsky [11] for multidimensional problems. The gain coefficients are defined as

where I is the identification function defined as I (t) = 1 if t < 0 and I (t) = 0 if (t) ≥ 0. The method is accelerating the convergence of SA and its almost sure convergence is proved under the standard assumptions.

The idea of sign changes is further developed in Xu, Dai [38]. It is shown that the sequence {sk / k} stands for the change frequency of the sign of ĝkT +1 ĝk in some sense. The assumption in [38] is that the noise ξk is state independent. The theoretical analysis shows that in that case sk/ k converges to P1 T ξ2 < 0) in the mean square sense. Based on that result, a switching algorithm that uses the switching parameter tk,defined as

is proposed. Then the gain coefficients are defined as

where 0.5 ≤ α < β ≤1, v is a small positive constant and a, A are the constants from assumption S1. To prove the convergence of the switching algorithm (9)-(10) one additional assumption is introduced in [38].

S5. G(x) =g(x)-x is a weighted maximum norm pseudo-contraction operator. That is for all x ∈ ℝn there exists a positive vector ω and some x* ∈ ℝn such that

||G(x)-x*||ω ≤β||x -x*||w,

where β ∈(0,1) and ||·||ω is defined as lxlw = max{x(i)ω(i)-1 , i =1,...,n}where x(i) and ω(i) are the ith components of x and ω respectively.

Theorem 2.3. [38] Suppose that assumptions S1 -S2 and S5 hold. Then for {xk }generated through (9)-(10) we have xkx* as k →∞ with probability 1.

If the objective function is given in the form of mathematical expectation the adaptive step length sequence can be determined as proposed in Yousefian et al. [36]. For the problem

the following assumptions are stated.

S6. The function F(·,ξ) is convex on a closed and convex set D ⊂ ℝn for every ξ ∈ Q, and the expected value E (F(x,ξ)) is finite for every xD.

S7. The errors ξk in the noisy gradient ĝk are such that for some µ > 0,

A self-adaptive scheme is based on the error minimization and the convergence result is as follows.

Theorem 2.4. [36] Let assumptions S6 and S7 hold. Let the function f be differentiable over the set D with Lipschitz gradient and assume that the optimal set of problem (11) is nonempty. Assume that the step size sequence {ak}is generated through the following self-adaptive scheme

ak =ak-1(1 -cak-1) for all k ≥1,

where c > 0 is a scalar and the initial step size is such that 0 < a 0 < 1 / c. Then the sequence {xk }converges almost surely to a random point that belongs to the optimal set.

An important choice for the gain sequence is a constant sequence. Although such sequences do not satisfy assumption S1 and almost sure convergence to solution can not be obtained, it can be shown that a constant step size can conduct the iterations to a region that contains the solution. This result initiated development of a cascading steplength SA scheme in [36] where a fixed step size is used until some neighborhood of the solution is reached. After that, in order to come closer to the solution, the step size is decreased and again the fixed step size is used until the ring around the solution is sufficiently tighten up. That way, the sequence of iterates is guided towards the solution.

A hybrid method which combines the SA gain coefficients and the step sizes obtained from the inexact Armijo line search under the assumptions valid for the SA method is considered in Krejić et al. [24]. The method takes the advantages of both approaches, safe convergence of SA method and fast progress obtained by line search if the current iterate if far away from the solution (where the SA steps would be unnecessarily small). The step size is defined according to the following rule. For a given C > 0 and a sequence {ak}that satisfies assumption S1 we define

where βk is obtained from the Armijo inequality

After that the new iteration is obtained as

The existence of C such that the gain coefficient (12) is well defined as well as the convergence of the sequence generated by (13) is proved in [24] under one additional assumption.

S8. Observation noise is bounded and there exists a positive constant M such that ||ξk (x)|| ≤ M a.s. for all k and x.

Theorem 2.5. [24] Assume that Assumptions S1-S4 and S8 hold, the gradient g is Lipschitz continuous with the constant L, and the Hessian matrix H (x*) exists and is nonsingular. Let

where

Let {xk}be an infinite sequence generated by (13). Then xkx* a.s.

Many important issues regarding the convergence of the SA methods are not mentioned so far. One effective possibility to speed up the convergence is to apply the averaging to the sequence of gradient estimations ĝ(xk)as suggested in Andradottir [1]. It is shown that the rate of convergence could be significantly better than the rate of SA if two conditionally independent gradient estimations are generated and the new iteration is obtained using a scaled linear combination of the two gradients with the gain coefficient. More details on this procedure are available in [1]. Let us also mention a robust SA scheme that determines an optimal constant step length based on minimization of the theoretical error for a pre-specified number of steps [27].

We have assumed in the above discussion that the noisy gradient values are available. This is the case for example if the analytical expression of F in (11) is available. In this case, under certain assumption we can interchange the expectation and derivative and thus a sample average approximation of the gradient can be calculated. It is important to be able to use a sample gradient estimation with relatively modest sample size as calculation of the sample gradient is in general expensive for large samples. However it is safe to claim that the analytical expression for the gradient calculation is not available in many cases and thus the only input data we have are (possibly noisy) function values. Thus the gradient approximation with finite differences appears to be a natural choice in many applications. The first method of this kind is due to Keifer, Wolfowitz, [21]. Many generalizations and extensions are later considered in the literature, see Fu [15] for example. Among many methods of this kind the Stochastic Perturbation method is particularly efficient as it uses one two function values to obtain a good gradient approximation, see [35] for implementation details.

The questions of stopping criteria, global convergence, search directions which are not gradient related, and other important questions are beyond the scope of this paper. An interested reader might look at Spall [34], Shapiro et al. [32] for guidance on these issues and relevant literature.

3 SAMPLE AVERAGE APPROXIMATION

Sample Average Approximation (SAA) is a widely used technique for approaching the problems of the form

The basic idea is to approximate the objective function f (x) with the sample mean

where N is the size of a sample represented by i.i.d. random vectors ξ1,...,ξN . Under the standard assumption such as finite variance of F(x,ξ) the (strong) Law of Large Numbers implies that N(x) converges to f (x) almost surely. Moreover, if F(x,ξ)is dominated by an integrable function, then the uniform almost sure convergence of N(x) on the compact subsets of ℝn is obtained.

Within the SAA framework the original problem (14) is replaced by the approximate problem

and thus the key question is the relationship between their respective solutions as N tends to infinity. Denote by X* the set of optimal solutions of the problem (14) and let f* be the optimal value of the objective function. Furthermore, denote by *N and the set of optimal solutions and the corresponding optimal values, respectively, of the problem (16). Then, the following result holds.

Theorem 3.1. (32) Suppose that there exists a compact set C ⊂ ℝn such that X* is nonempty and X*⊂ C. Assume that the function f is finite valued and continuous on C and that N converges to f almost surely, uniformly on C. Also, suppose that for N large enough the set is nonempty and C. Then f* and the distance between sets N and X* tends to zero almost surely as N → ∞.

Let N be an approximate solution of the problem (14). Clearly N ( N) can be calculated for a given sample. The Central Limit Theorem can be used to obtain an approximation of the error bound cN ( N) such that the inequality holds with some high probability δ ∈(0,1). For example, using the sample variance the following error bound is obtained

with z being the quantile of the standard normal distribution. The error bound is directly proportional to the variance of the estimator Var( N( N)). Therefore, in order to provide a tight bound one can consider some techniques for reducing variance such as the quasi-Monte Carlo or Latin hypercube sampling [32]. However, these techniques tend to deteriorate the i.i.d. assumption. This issue is addressed further on in this section.

The gap where x* is a solution of the original problem can be estimated as well. Clearly g( N) ≥ 0. To obtain an upper bound suppose that M independent samples of size N are available, i.e. we have i.i.d. sample ξ1m,...,ξNm, m = 1,...,M.

Denote by the relevant (nearly) optimal values and define and Then, the upper bound estimator for the gap g( N) is

where N' is some large enough sample size and tM-1,δ is the quantile of Student's distribution with M -1 degrees of freedom. It should be mentioned that the sample size bounds such that the solutions of an approximate problem are nearly optimal for the true problem with some high probability are mainly too conservative for practical applications in general. For further references on this topic, see [32] for instance.

Recall that almost sure convergence N (x)towards f (x)is achieved if the sample is i.i.d. under standard assumptions. However, if the sample is not i.i.d. the almost sure convergence of is achievable only if the sample size N which defines the SAA problem grows at the certain rate, Homem-de-Mello [19]. The analysis presented in [19] allows for biased estimators N (x) if N (x) is at least asymptotically unbiased. Let us first assume that the sample ξ1k,...,ξkNk generated at the iteration k is independent of the sample at the previous iteration for every k. The following assumptions are needed.

R1. For each x, there exists M(x)> 0 such that supi,kF(xik) ≤ M(x) with probability 1.

R2. For each x, we have that limk→∞E(N (x)) = f (x).

Theorem 3.2. (19) Suppose that assumptions R1-R2 hold and that the sample size sequence {Nk }satisfies k=1 αNk < ∞for all α ∈(0,1). Then (x) converges to f (x) almost surely.

For example, Nk ≥ √k satisfies the previously stated summability condition.

The rate of convergence is also addressed in [19], i.e. the error bounds for | (x)- f (x)| are developed. In the case where Nkc1kρ for c1 > 0 and ρ > 2 it can be proved, under some additional assumptions, that for every k sufficiently large the following inequality holds almost surely

Moreover, if the sample is cumulative, the corresponding error bound is

where C is some positive constant.

The above analysis provides a justification for the SAA approximation as well as a guidance for choosing N in (16). Thus from now on we concentrate on gradient methods for solving (16). Several papers exploit ideas from deterministic optimization. Generally speaking we are interested in solving the SAA problem for some finite, possibly very large N as well as obtaining asymptotic results i.e. the results that cover the case N →∞ even if in practical applications one deals with a finite value of N. A naive application of an optimization solver to (16) is very often prohibitively costly if N is large due to the cost of calculating N(x) and its gradient. Thus there is a vast literature dealing with variable sample scheme for solving (16).

Two main approaches can be distinguished. In the first approach the objective function N is replaced with (x) at each iteration k and the iterative procedure is essentially a two step procedure of the following form. Given the current approximation xk and the sample size Nk one has to find sk such that the value of (xk + sk) is decreased. After that we set xk+1 =xk + sk and choose a new sample size Nk+1. The key ingredient of this procedure is the choice of Nk+1.

The schedule of sample sizes {Nk} should be defined in such way that either Nk =N for k large enough or Nk →∞ if one is interested in asymptotic properties.

The second approach, often called the diagonalization scheme or the surface response method, is again a two step procedure. It consists of a sequence of SAA problems with different sample sizes that are approximately solved. So for the current xk and Nk the problem (16) with N = Nk is approximately solved (within an inner loop) for starting with xk as the initial approximation. After that we set xk+1 = Nk and choose the new sample size Nk+1. Two important points in this procedure are the choice of Nk+1 and the precision in solving each of the optimization problems min

Let us now look into algorithms of the first kind. Keeping in mind that min is just an approximation of the original problem and that the cost of each iteration depends on Nk,it is rather intuitive to start the optimization procedure with smaller samples and gradually increase the sample size Nk as the solution is approached. Thus the most common schedule sequence would be an increasing sequence N0,N1,.... The convergence theory for this kind of reasoning is introduced in Wardi, [40] where an Armijo type line search method is combined with SAA approach. In order to solve the problem of type (14), the iterative sequence is generated as

where Nk is the sample size used at iteration k and αk is the largest number in (0,1] satisfying the inequality

The method is convergent with zero upper density [40], assuming that Nk →∞. More precisely, the following statement is proved.

Theorem 3.3. [40] Assume that the function f is given by (14) and that F is twice continuously differentiable on ℝn for every ξ. Furthermore assume that for every compact set D ⊂ℝn, there exists K > 0 such that for every xD and every ξ.

If Nk →∞ then the sequence {xk}given by (19) converges with zero upper density on compact sets.

An extension of the above work is presented in Yan, Mukai [37] where the adaptive precision is proposed i.e. the sequence {Nk}k∈ℕ is not determined in advance as in [40] but it is adapted during the iterative procedure. Nevertheless the sample size has to satisfy Nk →∞. The convergence result is slightly stronger as the convergence with probability 1 is proved under the set of appropriate assumptions. The more general result that applies to both gradient and subgradient methods is obtained in Shapiro, Wardi [33] where the convergence with probability 1 is proved for sample average gradient and subgradient methods assuming that the sample size tends to infinity.

In practical applications, the sample size is finite. So, let us now suppose that Nmax is the sample size which makes good approximation of the original objective function. Very often we assume that the sample is generated at the beginning of the process which justifies considering the SAA objective function as deterministic. In this case one wishes again to decrease the cost of the optimization process by decreasing the number of function and gradient evaluations. Let us now look closer at the possible schedule of Nk. Clearly the sample size should be equal to Nmax at the final stages of the optimization procedure to ensure that the problem (16) with N =Nmax is solved. Thus one can consider even some heuristic schedule [14] to generate a non-decreasing sequence {Nk}which eventually becomes stationary with Nk =Nmax for k large enough. For example, a simple way to define such sequence could be to increase Nk by a fixed number every K iterations.

The problem of scheduling can be approached from a different perspective in the following manner. Instead of constantly increasing the sample size one could monitor the progress in decreasing the (approximate) objective function and choose the next sample size according to that progress. One algorithm of this kind is presented in Deng, Ferris [12] where the Bayes risk is used to decide the scheduling sequence within a trust region method. Another class of results in the framework of trust region methods is presented in Bastin [2] and Bastin et al. [3], [4]. The key point of the approach considered in [2,3,4] is that the sample sizes might oscillate during the iterative process, i.e. {Nk}is not necessarily non-decreasing at the initial stages of the iterative process. Eventually Nk =N max is reached and (16) is solved, but very often with smaller costs if compared with an increasing scheduling. The efficiency of this approach comes from the fact that there is a balance between the precision of the objective function approximation and the progress towards the solution. The same idea is further developed for the line search methods in Krejić, Krklec [22] as follows.

Let us assume that the gradient of ∇F is available and that the search direction pk satisfies . The armijo rule with η ∈ (0, 1) is applied to find αk such that xk+1 = xkk pk where

The sample size is updated as follows. First, the candidate sample size N +k is determined by comparing the measure of decrease in the objective function dmk = -αk(∇(xk))T pk and the so called lack of precision εNδ(x) defined by (17). The main idea is to find the sample size N+k such that dmk (xk). The reasoning behind this idea is the following. If the decrease measure dmk is greater than the lack of precision (xk), the current approximation is probably far away from the solution. In that case, there is no need to impose high precision and therefore the sample size is decreased if possible. The candidate sample size does not exceed Nmax,but there is also the lower bound, i.e. NkminNk+Nmax. This lower bound is increased only if Nk+1 > Nk and there is not enough progress concerning the function . After finding the candidate sample size, a safeguard check is performed in order to prohibit the decrease of the sample size which might be unproductive. More precisely, if Nk+ < Nk the following parameter is calculated

If ρk is relatively small, then it is presumed that these two model functions are too different and thus there is no gain in decreasing the sample size. So, Nk+1 = Nk. In all the other cases, the decrease is accepted and Nk+1 = Nk + . The convergence analysis relays on the following important result which states that after some finite number of iterations, the objective function becomes and (16) is eventually solved.

Theorem 3.4. [22] Suppose that F(·,ξ) is continuously differentiable and bounded from below for every ξ. Furthermore, suppose that there exist a positive constant κ and number n1 ∈ ℕ such that (xk) ≥ κ for every kn 1 . Then there exists q ∈ ℕ such that Nk = Nmax for every kq.

Let us now present some results for the second type of methods, the so called diagonalization methods described above. One possibility to determine the sample sizes in the sequence of optimization problems to be solved is presented in Royset [31] where an optimality function is used to determine when to switch on to a larger sample size. The optimality function is defined by mapping θ : ℝn → (-∞,0] which, under standard conditions, satisfies θ(x) = 0 if and only if x is a solution in some sense. For unconstrained problem and its SAA approximation is given by . Under the set of standard assumptions, almost sure convergence of θN (x) towards θ(x) is stated together with asymptotic normality.

Denote by the iterate obtained after a finite number of iterations of an algorithm applied on the SAA problem with sample size Nk where is the initial point. The point i s an approximate solution of (16) with N = Nk and it is assumed that the optimization algorithm used to determine that point is successful in the following sense.

R3. For any Nk every accumulation point of the sequence generated by the optimization method for solving

satisfies θNk ( ) = 0 almost surely.

The algorithm proposed in [31] increases the sample size when

θNk (xk) ≥ - δ1 Δ (Nk ),

where δ1 is some positive constant and Δ is a function that maps ℕ into (0,∞) and satisfies limN→∞ Δ ( N) = 0. The sample size is assumed to be strictly increasing and unbounded, but the exact dynamics of increasing is not specified. The convergence of the algorithm is proved under one additional assumption.

R4. On any given set S ⊂ ℝn, the function F(·,ξ) is continuously differentiable and F(·,ξ) and ||∇x F(·,ξ)|| are dominated by an integrable function.

Theorem 3.5. [31] Suppose that the assumptions R3-R4 are satisfied and that the sequence of iterates generated by the algorithm proposed in [31] is bounded. Then, every accumulation point of that sequence satisfies θ( ) = 0 almost surely.

The relation between the sample size and the error tolerance for each of the optimization problems solved within the diagonalization methods is considered in Pasupathy [28]. The error tolerance here is a small number εk which almost surely satisfies || - || ≤ εk where and represent the approximate and the true (unique) solution of the corresponding SAA problem, respectively. A measure of effectiveness is defined as qk =|| - ||2Wk where Wk represents the number of simulation calls needed to obtain the approximate solution . Since the almost sure convergence is analyzed, it is assumed that Nk →∞ and εk → 0. It is proved that the measure of effectiveness is bounded in a stochastic sense if the following three conditions hold.

R5. If the numerical procedure used to solve SAA problems exhibits linear convergence, we as sume that . If the numerical procedure used to solve SAA problems exhibits polynomial convergence of order p > 1, we assume .

R6. lim supk →∞(∑kj=1 Njk2 < ∞.

R7. lim supk→∞(∑kj=1 Nj )Nk -1 < ∞.

If any of the above conditions is violated, then qk tends to infinity in probability. The key point of analysis in [28] is that the error tolerance should not be decreased faster than the sample size is increased. The dynamics of change depends on the convergence rate of numerical procedures used to solve the SAA problems. Moreover, the mean squared error analysis implies the choice of εk and Nk such that

In order to further specify the choice of the optimal sequence of sample sizes, the following theorem is stated.

Theorem 3.6. [28] Suppose that (21) holds together with the assumptions R5-R7. If the numerical procedure used to solve SAA problems exhibits linear convergence, then

If the numerical procedure used to solve SAA problems exhibits polynomial convergence of order p > 1,then lim supk→∞Nk/N pk-1< ∞.

More specific recommendations are given for linear, sublinear and polynomial rates in [28]. For example, if the applied algorithm is linearly convergent, then the linear growth of a sample size is recommended, i.e. it can be set Nk+1 =1.1Nk l for example. Also, in that case, exponential or polynomial growth of order p > 1 are not recommended. However, if the polynomial rate of convergence of order p > 1 is achieved, then we can set Nk+1 =⌈ N 1.1 ⌉ or Nk+1 = ⌈ e Nk 1.1 ⌉ for instance. Furthermore, it is implied that the error tolerance sequence should be of the form K / √Nk where K is some positive constant.

The diagonalization methods are defined for a finite N as well. One possibility is presented in Polak, Royset [29] where the focus is on finite sample size N although the almost sure convergence is addressed. The idea is to approximately solve the sequence of SAA problems with N =Nk, k =1,..., s applying nk iterations at every stage k. The sample size is nondecreasing and the sample is assumed to be cumulative. The method consists of three phases. The first phase provides the estimates of relevant parameters such as the sample variance. In the second phase, the scheduling sequence is obtained. Finally, the sequence of the SAA problems is solved in the last phase.

An additional optimization problem is formulated and solved in the second phase in order to find the number s of the SAA problem to be solved, the sample sizes Nk,k =1,..., s and the number of iterations nk that are applied to solve the corresponding SAA problem. The objective function of this additional problem is the overall cost ∑s k=1 nkw(Nk ), where w(N) is the estimated cost of one iteration of the algorithm applied on the function N. For example w(N) = N. The constraint for this problem is motivated by the stopping criterion f (x) - f* ≤ ε ( f (x 0) - f* ) where f* is the optimal value of the objective function. More precisely, the cost-to-go is defined as ek = f (x knk) - f* where x knk is the last iteration at the stage k. Furthermore, the upper bound estimate for es is determined as follows. Let ∆(N) be the function defined as in Royset [31].

R8. There exists strictly decreasing function ∆(N) : ℕ →(0, ∞) such that limN→∞ ∆(N) =0 and | (x) - f (x)| ≤ ∆(Nk) holds almost surely.

One may use a bound like (18), but it is usually too conservative for practical implementations. Therefore, ∆(N) is estimated with the confidence interval bound of the form (17) where the variance is estimated in the first stage. The following bound is derived

where lk (s) represents the remaining number of iterations after the stage k and θ defines the rate of convergence of the deterministic method applied on SAA. The initial cost-to-go from e 0 = f (x 1 0 ) - f* is also estimated in the first phase. Finally, the efficient strategy is obtained as the solution of the following problem

In order to prove the asymptotic result, the following assumption regarding the optimization method used at each stage is imposed.

R9. The numerical procedure used to solve SAA problems almost surely exhibits linear rate of convergence with parameter θ ∈ (0, 1).

Theorem 3.7. [29] Suppose that the assumptions R8-R9 hold and that the sample size sequence tends to infinity. Then lims→∞ es = 0 almost surely.

4 APPLICATIONS TO DETERMINISTIC PROBLEMS

A number of important deterministic problems can be written in the form of

where fi (x) are given functions and N is a large integer. For example, least squares and maximum likelihood problems are of this form. The objective function in (22) and its gradient are generally expensive to compute if N is large. On the other hand for a given sample realization ξ1,...,ξN and fi (x) = F(xi ) the SAA problems discussed in Section 3 are the same as (22). Therefore the SAA methods that deal with finite N can be used for solving the deterministic problems specified in (22). The main idea of this approach is to use the same reasoning as in the variable sample schemes to decrease the cost of calculating the objective function and its gradient i.e. to approximate the function and the gradient with and ∇. One application of a variable sample method to the data fitting problem is presented in Krejić, Krklec Jerinkić, [23]. In this section we consider two important problems, data fitting and machine learning, and methods for their solutions that use stochastic gradient approximation in the sense of approximate gradient as explained above.

The data fitting problem of the form (22) is considered in Friedlander, Schmidt [14]. The problem is solved a by quasi Newton method but the gradient approximation of the SAA type is used. Let the gradient estimation at the current iteration k be given as gk =∇ f (xk) + ϵk where ϵ is the error term. T he following assumptions are stated.

P1. The functions f1,..., fN are continuously differentiable and the function f is strongly convex with parameter µ. Also, the gradient ∇ f is Lipschitz continuous with parameter L.

P2. There are constants β1 0 and β2 1 such that ||∇ fi (x)||2β1 + β2 ||∇ f (x) ||2 for all x and i = 1,...,N.

The algorithm can be considered as an increasing sample size method where the sample size is bounded with N. The main issue in [14] is the rate of convergence and the convergence analysis is done with the assumption of a constant step size. More precisely

Two approaches are considered: deterministic and stochastic sampling. The deterministic sampling assumes that if the sample size is Nk then the gradients to be evaluated ∇ fi (xk ), i = 1,...,Nk, are determined in advance. For example, the first Nk functions are used to obtain the gradient approximation gk = Nki=1 fi (xk ). On the other hand, stochastic sampling assumes that the gradients ∇ fi (x), i = 1,..., Nk, to be evaluated are chosen randomly. We state the relevant results considering R-linear rate of convergence. In the case of deterministic gradient, q-linear convergence is also attained but under stronger conditions on the increase of the sample size.

Theorem 4.1. [14] Suppose that the assumptions P1-P2 hold and that (N -Nk)/N = O (γ k/2) for some γ ∈ (0, 1). Then for any ε 0, σ = max{γ, 1 -µ/L} + ε and every k, in deterministic case we obtain

f (x ) - f (x*) = ( f (x 0) - f (x*)) O ((1 - µ/L + ε)k) + O(σk).

Moreover, if (N -Nk)/(Nk N) = O(γ k), in stochastic case we obtain

E [f (xk ) - f (x*)] = (f (x0) - f (x*)) O ((1 - µ/L +ε)) + O(σk ).

Machine learning applications which usually assume large number of training points can also be viewed as problems of the form (22). Methods for solving such problems are the subject of Byrd et al. [8] and Byrd et al. [9]. The main idea in [8] is to create methods which use the second order derivative information but with cost comparable to the steepest descent method. The focus is on using a cheep Hessian approximations ∇2 (xk) where Sk is a number of training points, i.e. the Hessian-related sample size at iteration k. More precisely, matrix-free conjugate gradient method is applied in order to obtain the search direction pk as an approximate solution of the system

Here, Nk is a sample size related to the gradient and the function approximation. This can be considered as an inexact Newton method. In the relevant examples, pk is guaranteed to be a descent search direction and therefore the Armijo line search is applied. The proposed method (named S-Newton) does not specify the dynamic of changing the sample sizes. It only requires that the variable sample strategy is used and Sk < Nk. The analysis is conducted for the full gradient case.

Theorem 4.2. [8] Suppose that the function is twice continuously differentiable and uniformly convex and that there exists a constant γ > 0 such that x T2(xk)xγ ||x|| for every k and x. Then the sequence generated by the S-Newton method with Nk = N satisfies limk→∞||∇(xk)||= 0.

The same result can be obtained for the so called SLM method which uses matrix-free limited memory BFGS method. In that case, the conjugate gradient method is used to obtain the search direction where the sub-sampled Hessian approximation ∇2 (xk) is used for the initial matrixvector product at every iteration. The line search uses Wolfe conditions for choosing the suitable step size.

The dynamic of increasing the sample size in the machine learning problem is addressed in [9]. The main idea is to estimate the sample size which makes the search direction pk descent for the objective function N without evaluating the true gradient ∇ fN (x). The approximation of the negative gradient pk = -∇ (xk) is a descend direction if for some θ ∈[0,1], the following inequality holds

Since and N is large, inequality (23) is approximated by

where is a sample variance related to the chosen sample of the size Nk . The algorithm for the sample size schedule proposed in [9] can be described as follows. After finding the step size αk such that (xk + αk pk) < (xk) and setting xk+1 = xk + αk pk, a new sample of the same size Nk is chosen. If inequality (24) holds for the new sample, the sample size remains unchanged, i.e. Nk+1 = Nk. Otherwise, the sample is augmented and the new sample size is determined by

In order to conduct the complexity analysis, the constant step size is considered and the q-linear convergence rate is analyzed.

Theorem 4.3. [9] Suppose that the function N is twice continuously differentiable and x* is a solution of the problem (22) with N(x*) = 0. Furthermore, assume that there are constants 0 < λ < L such that λ ||h||2h T2 (x)hL ||h||2 for all x and h. Let the sequence of iterates be generated by xk+1 = xk -α (xk) where α = (1 -θ) / L and θ ∈ (0,1). If the condition (23) is satisfied at iteration k, then

where β = . Moreover, if (23) holds for every k, then

The schedule {N } for the gradient estimations is extended to the second order approximations to define a Newton- method, the S-Newton method defined in [8]. This method uses the updating of Nk described above while the Hessian-related sample size Sk follows the dynamic of Nk. More precisely, Sk = RNk where R is some positive number substantially smaller than 1. Also, the sample used for the Hessian approximation is assumed to be a subset of the sample used for the gradient and the function approximations. The stopping criterion for the conjugate gradient method used for obtaining the search direction is more complex than in [8] since it is related to the sample size. Wolfe conditions are imposed to obtain a suitable step size.

5 CONCLUSIONS

In this survey we considered unconstrained problems with the stochastic or expectation objective function. We focused our attention on two specific classes of gradient-related methods: Stohastic Approximation and Sample Average Approximation and many other important approaches are left out for the sake of brevity. An interested reader should consults [32,34] for the initial guidance into stochastic optimization problems. Several natural extensions are easily incorporated in the framework considered in this paper, for example search directions with second order information which usually yield faster convergence, but also require additional cost, [8,9,10]. In order to decrease the linear algebra costs, one can consider preconditioners as their construction might be a nontrivial issue due to the presence of random variable. On the other hand, given that there are a lot of problems yielding only input-output information, an interesting approach within SA and SAA frameworks is based on zero order information, [17]. Constrained problems are always of great interest and some recent research of penalty methods within SA methodology is presented in [39]. Projection and filter methods with variable sample size might be a valuable topic of future research. Among the others, chance constrained problems are especially challenging. In all the considered methods, deriving some complexity bounds can be of great interest from practical point of view.

*Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

REFERENCES

ANDRADOTTIR S. 1996. A Scaled Stochastic Approximation Algorithm. Management Science, 42(4):475-498. [ Links ]

BASTIN F. 2004. Trust-Region Algorithms for Nonlinear Stochastic Programming and Mixed Logit Models. PhD thesis, University of Namur, Belgium. [ Links ]

BASTIN F, CIRILLO C & TOINT PL. 2006. An adaptive Monte Carlo algorithm for computing mixed logit estimators. Computational Management Science 3(1):55-79. [ Links ]

BASTIN F, CIRILLO C & TOINT PL. 2006. Convergence theory for nonconvex stochastic programming with an application to mixed logit. Math. Program., Ser. B 108:207-234. [ Links ]

BENVENISTE A, METIVIER M & PRIOURET P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York. [ Links ]

BERTSEKAS DP & TSITSIKLIS JN. 2000. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optimization, 10(2):627-642. [ Links ]

BIRGE JR & LOUVEAUX F. 1997. Introduction to Stochastic Programming, Springer. [ Links ]

BYRD RH, CHIN GM, NEVEITT W & NOCEDAL J. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM J. Optim., 21(3):977-995. [ Links ]

BYRD RH, CHIN GM, NOCEDAL J & WU Y. 2012. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155. [ Links ]

BYRD RH, HANSEN SL, NOCEDAL J & SINGER Y. 2014. A Stochastic Quasi-Newton Method for Large-Scale Optimization. Technical report, arXiv:1401.7020 [math.OC]. [ Links ]

DELYON B & JUDITSKY A. 1993. Accelerated stochastic approximation. SIAM J. Optimization 3(4):868-881. [ Links ]

DENG G & FERRIS MC. 2009. Variable-Number Sample Path Optimization. Mathematical Programming 117(1-2):81-109. [ Links ]

FABIAN V. 1968. On Asymptotic Normality in Stochastic Optimization. The Annals of Mathematical Statistics, 39:1327-1332. [ Links ]

FRIEDLANDER MP & SCHMIDT M. 2012. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3):1380-1405. [ Links ]

FU MC. 2006. Gradient Estimation, HENDERSON SG & NELSON BL. (Eds.). Handbook in OR & MS, 13:575-616. [ Links ]

GHADIMI S & LAN G & ZHANG H. 2014. Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Technical report arXiv:1308.6594 [math.OC]. [ Links ]

GHADIMI S & LAN G 2013. Stochastic first and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368. [ Links ]

HOMEM-DE-MELLO T. 2008. On rates of convergence for stochastic optimization problem under non-independent and identocally distributed sampling. SIAM J. Optim. 19(2):524-551. [ Links ]

HOMEM-DE-MELLO T 2003. Variable-Sample Methods for Stochastic Optimization. ACM Transactions on Modeling and Computer Simulation, 13(2):108-133. [ Links ]

KESTEN H. 1958. Accelerated stochastic approximation. Ann. Math. Statist. 29:41-59. [ Links ]

KIEFER J & WOLFOWITZ J. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23(3): 462-466. [ Links ]

KREJIĆ N & KRKLEC N. 2013. Line search methods with variable sample size for unconstrained optimization. Journal of Computational and Applied Mathematics, 245:213-231. [ Links ]

KREJIĆ N & KRKLEC JERINKIĆ N. 2014. Nonmonotone line search methods with variable sample size. Numerical Algorithms, 8:1-29. [ Links ]

KREJIĆ N, LUŽANIN Z & STOJKOVSKA I. 2013. A gradient method for unconstrained optimization in noisy environment. Applied Numerical Mathematics, 70:1-21. [ Links ]

KUSHNER H. 2010. Stochastic Approximation: A survey. Computational Statistics, 2(1):87-96. [ Links ]

MARTI K. 2005. Stochastic Optimization Methods, Springer. [ Links ]

NEMIROVSKI A, JUDITSKY A, LAN G & SHAPIRO A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609. [ Links ]

PASUPATHY R. 2010. On Choosing Parameters in Retrospective-Approximation Algorithms for Stochastic Root Finding and Simulation Optimization. Operations Research, 58(4):889-901. [ Links ]

POLAK E & ROYSET JO. 2008. Eficient sample sizes in stochastic nonlinear programing. Journal of Computational and Applied Mathematics 217(2):301-310. [ Links ]

ROBBINS H & MONRO S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3):400-407. [ Links ]

ROYSET JO. 2012. Optimality functions in stochastic programming. Math. Program. 135(1-2):293-321. [ Links ]

SHAPIRO A, DENTCHEVA D & RUSZCZYNSKI A. 2009. Lectures on Stochastic Programming. SIAM. [ Links ]

SHAPIRO A & WARDI Y. 1996. Convergence Analysis of Stochastic Algorithms. Mathematics of Operations Research 21(3):615-628. [ Links ]

SPALL JC. 2003. Introduction to Stochastic Search and Optimization. Wiley-Interscience serises in discrete mathematics, New Jersey. [ Links ]

SPALL JC 1998. Implementation of the Simultaneous Perturbation Algorithm for Stochastic Optimization. IEEE Transactions on Aerospace and Electronic Systems, 34(3):817-823. [ Links ]

YOUSEFIAN F, NEDIC A & SHANBHAG UV. 2012. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67. [ Links ]

YAN D & MUKAI H. 1993. Optimization Algorithm with Probabilistic Estimation. Journal of Optimization Theory and Applications, 79(2):345-371. [ Links ]

XU Z & DAI YD. 2012. New stochastic approximation algorithms with adaptive step sizes. Optimization Letters, 6:1831-1846. [ Links ]

WANG X, MA S & YUAN Y. Penalty Methods with Stochastic Approximation for Stochastic Nonlinear Programming. Technical report arXiv:1312.2690 [math.OC]. [ Links ]

WARDI Y 1990. Stochastic Algorithms with Armijo Stepsizes for Minimization of Functions. Journal of Optimization Theory and Applications 64(2):399-417. [ Links ]

Research supported by Serbian Ministry of Education Science and Technological Development, grant no. 174030.

Received: September 11, 2013; Accepted: December 13, 2013

**Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.