1 INTRODUCTION

Stochastic optimization problems appear in all areas of engineering and the physical and social sciences. Typical applications are model fitting, parameter estimation, experimental design, performance evaluation etc. The models we are considering here can be written in the form

min_{x∈Ω} f(x),

where f : ℝ^{n} → ℝ is either observed with noise or is defined as a mathematical expectation. In fact the objective function depends on a vector of random variables ξ from some probability space that might be known or unknown, depending on the application. Thus the exact value of f(x) cannot be computed, and it is necessary to use simulation to estimate the objective function value.

The feasible set Ω can be defined by constraints of different types: simple box constraints, deterministic constraints, chance constraints, constraints in the form of mathematical expectations etc. In this paper we consider only the case Ω = ℝ^{n}. More precisely, we consider only two types of stochastic problems. The first type are problems with a random objective function,

min_{x} f(x, ξ),

where ξ represents the noise (or randomness) and x is the decision variable. Such models appear when the decision has to be taken before full information about the problem parameters is known. The lack of information in that case is represented by the random vector ξ. The second type of problems are the problems with a mathematical expectation as the objective function,

min_{x} f(x) = E[F(x, ξ)].

Although the noise is technically removed in the above problem, it is still rather hard to solve, as the expectation is hard or even impossible to state analytically even if the distribution of ξ is known.

Methods for solving stochastic optimization problems combine ideas from numerical optimization and statistics. Thus the class of popular methods includes simulation-based methods, direct methods for stochastic search, annealing-type algorithms, genetic algorithms, reinforcement learning methods, statistical methods and many others, [34], [7]. Among all of them we restrict our attention here to two methods typically used in simulation-based optimization: Stochastic Approximation, SA, and Sample Average Approximation, SAA.

Stochastic Approximation methods were introduced in the seminal paper of Robbins and Monro, [30], and remain a popular choice for solving stochastic optimization problems. They rely mainly on noisy gradient evaluations and depend heavily on the choice of the steplength sequence. The choice of this sequence has been the subject of many research efforts, as have other techniques for accelerating the convergence of SA methods. Sample Average Approximation methods can be seen as an alternative to SA methods. In this approach a sample from the underlying distribution is used to construct a deterministic sample average problem which can be solved by optimization methods. However, the sample used for the SAA approximation very often needs to be large, and a naive application of standard nonlinear optimization techniques is not feasible. Therefore there has been extensive research into variable sample size methods that reduce the cost of SAA.

Both SA and SAA methods are considered here in the framework of gradient-related optimization (gradient methods, subgradient methods, second-order quasi-Newton methods) as well as in the derivative-free framework. This survey deals largely with gradient methods for stochastic optimization; an interested reader can look at Spall [34] for an overview of other methods. This paper is organized as follows. In Section 2 we discuss the SA method and its modifications. Section 3 contains results for the SAA methods and unconstrained problems with a mathematical expectation objective function. Two important deterministic problems that are rather similar to the SAA problem are discussed in Section 4, as well as methods for obtaining their solution that rely on stochastic gradients. Finally, some conclusions and research perspectives are presented in Section 5.

2 STOCHASTIC APPROXIMATION METHODS

There have been countless applications of the Stochastic Approximation (SA) method since the work of Robbins and Monro [30]. In this section we give an overview of its main properties and some of its generalizations. The problem we consider is

min_{x∈ℝ^{n}} f(x),

assuming that only noisy measurements f̂(x) of the function and ĝ(x) of its gradient are available. Let us start by considering the SA algorithm for solving systems of nonlinear equations, as it was defined originally in [30]. The convergence theory presented here relies on imposing statistical conditions on the objective function and the noise. Convergence analysis can be conducted through differential equations as well; see [34] and [25] for further references.

Consider the system of nonlinear equations

g(x) = 0, (4)

with g(x) being the gradient of f(x). Suppose that only measurements corrupted by a noise that depends on the iteration as well as on the decision variable x,

ĝ_k(x) = g(x) + ξ_k(x), (5)

are available. Then the SA method is defined by

x_{k+1} = x_k − a_k ĝ_k(x_k). (6)

The sequence of step sizes {a_k}_{k∈ℕ} is also called the gain sequence, and it has a dominant influence on the convergence.
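The Robbins-Monro recursion x_{k+1} = x_k − a_k ĝ_k(x_k) with a gain of the form a/(k+1)^α can be sketched in a few lines. This is an illustrative toy, not code from the literature: the quadratic test function, noise level and all parameter values below are arbitrary choices.

```python
import random

def sa_minimize(grad, x0, a=1.0, alpha=0.7, noise=0.1, iters=2000, seed=0):
    """Robbins-Monro iteration x_{k+1} = x_k - a_k * g_hat(x_k) with the
    diminishing gain a_k = a / (k + 1)**alpha and additive zero-mean noise."""
    rng = random.Random(seed)
    x = x0
    for k in range(iters):
        a_k = a / (k + 1) ** alpha               # gain sequence
        g_hat = grad(x) + rng.gauss(0.0, noise)  # noisy gradient measurement
        x -= a_k * g_hat
    return x

# Toy problem: f(x) = (x - 3)^2 / 2, so g(x) = x - 3 and x* = 3.
x_star = sa_minimize(lambda x: x - 3.0, x0=0.0)
```

The diminishing gain is what averages the noise out: with a constant step the iterates would keep fluctuating around the solution.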

Let {x_k} be a sequence generated by an SA method. Denote by F_k the σ-algebra generated by x_0, x_1, ..., x_k. If the problem has a unique solution x*, the following set of assumptions ensures the convergence of an SA method.

**S1.** The gain sequence satisfies:

a_k > 0,  ∑_{k=1}^{∞} a_k = ∞  and  ∑_{k=1}^{∞} a_k^{2} < ∞.

**S2.** For some symmetric, positive definite matrix B and for every η ∈ (0,1),

inf_{η ≤ ||x−x*|| ≤ 1/η} (x − x*)^{T} B g(x) > 0.

**S3.** For all x and k, E[ξ_k(x) | F_k] = 0.

**S4.** There exists a constant c > 0 such that for all x and k,

||g(x)||^{2} + E[||ξ_k(x)||^{2} | F_k] ≤ c(1 + ||x||^{2}).

The first assumption, which implies that the step sizes converge to zero, is standard in stochastic algorithms, see [34]. The condition ∑_{k} a_k = ∞ is imposed in order to avoid inefficiently small step sizes. On the other hand, the summability condition on a_k^{2} ensures stability. Its role is to decrease the influence of the noise when the iterates come into a region around the solution. An example of a sequence that satisfies the first assumption is

a_k = a/(k + 1)^{α}, (7)

where α ∈ (0.5, 1] and a is some positive constant. The zero-mean condition S3 is also standard in stochastic optimization. Its implication is that ĝ_k(x) is an unbiased estimator of g(x). Notice that under assumption S3, the condition in S4 is equivalent to

E[||ĝ_k(x)||^{2} | F_k] ≤ c(1 + ||x||^{2}).

Therefore, the mean of ||ĝ_k(x)||^{2} cannot grow faster than a quadratic function of x. Under these assumptions, the almost sure convergence of the SA algorithm can be established. The convergence in mean square, i.e. E[||x_k − x*||^{2}] → 0 as k → ∞, was proved in [30]; the theorem below states a stronger result, the almost sure convergence.

**Theorem 2.1.** [34] Consider the SA algorithm defined by (6). Suppose that assumptions S1-S4 hold and that x* is the unique solution of the system (4). Then x_k converges almost surely to x*.

A closely related and more general result is proved in Bertsekas, Tsitsiklis [6], where the gradient-related method of the form

x_{k+1} = x_k + a_k(s_k + ξ_k) (8)

is considered. Here ξ_k is either a stochastic or a deterministic error, a_k is a sequence of diminishing step sizes that satisfies assumption S1, and s_k is a descent direction. In this context, the direction s_k is not necessarily the gradient, but it is gradient-related. The convergence is stated in the following theorem.

**Theorem 2.2.** [6] Let {x_k} be a sequence generated by (8), where s_k is a descent direction. Assume that S1 and S3-S4 hold and that there exist positive scalars c_1 and c_2 such that

c_1 ||∇f(x_k)||^{2} ≤ −∇f(x_k)^{T} s_k,  ||s_k|| ≤ c_2 (1 + ||∇f(x_k)||).

Then either f(x_k) → −∞, or else f(x_k) converges to a finite value and lim_{k→∞} ∇f(x_k) = 0. Furthermore, every limit point of {x_k} is a stationary point of f.

The gain sequence is the key element of the SA method. It has an impact on stability as well as on the convergence rate. Under some regularity conditions, Fabian [13], the asymptotic normality of x_k is obtained. More precisely,

k^{α/2}(x_k − x*) →^{d} N(0, Σ),

where →^{d} denotes convergence in distribution, α refers to (7) and Σ is the covariance matrix that depends on the gain sequence and on the Hessian of f. Therefore, the iterate x_k approximately has the normal distribution N(x*, k^{−α} Σ) for large k. Due to assumption S1, the maximal convergence rate is obtained for α = 1. However, this reasoning is based on an asymptotic result. Since the algorithms are finite in practice, it is often desirable to set α < 1 because α = 1 yields smaller steps. Moreover, if we want to minimize ||Σ||, the ideal sequence would be

a_k = (k + 1)^{−1} H(x*)^{−1},

where H(x) denotes the Hessian matrix of f, Benveniste et al. [5]. Even though this result is purely theoretical, the Hessian at x* can sometimes be approximated by H(x_k), and in that way one can enhance the rate of convergence.

Two main drawbacks of the SA method are slow convergence and the fact that the convergence theory applies only if the solution of (4) is unique, i.e. only if f has a unique minimizer. Thus several generalizations have been developed to address these two issues. One can easily see from (7) that the gain coefficients increase with the increase of a. On the other hand, a large a might have a negative influence on stability. Therefore several generalizations of the gain coefficients are considered in the literature. One possibility, Spall [35], is to introduce the so-called stability constant A > 0 and obtain

a_k = a/(k + 1 + A)^{α}.

Now the values of a and A can be chosen together to ensure effective practical performance of the algorithm, allowing for a larger a, and thus producing larger step sizes in later iterations when the effect of A is small, while avoiding instability in early iterations. The empirically recommended value of A is at most 10% of the number of iterations allowed or expected during the optimization process; for more details see [35].
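A quick numerical illustration of this effect (all values below are arbitrary examples; α = 0.602 is merely one commonly used exponent, not a prescription from the text): the stability constant damps the first steps while leaving the tail of the gain sequence essentially unchanged.

```python
def gain(k, a=0.5, alpha=0.602, A=0.0):
    """Generalized gain a_k = a / (k + 1 + A)**alpha; A > 0 is the
    stability constant (all numbers here are illustrative)."""
    return a / (k + 1 + A) ** alpha

early_plain = gain(0)                  # first step, no stabilization
early_stable = gain(0, A=100.0)        # A damps the early steps ...
late_plain = gain(10_000)
late_stable = gain(10_000, A=100.0)    # ... but barely matters later
```

This is why a and A are tuned jointly: A permits a larger a without the large early steps that would otherwise destabilize the iteration.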

Several generalizations of the SA method are based on adaptive step sizes that try to adjust the step size at each iteration to the progress achieved in the previous iterations. The first attempt of this kind was made in Kesten [20] for one-dimensional problems. The main idea is to monitor the changes in the sign of x_{k+1} − x_k. If the sign of the difference between two consecutive iterates starts to change frequently, we are probably in the domain of noise, and therefore small steps are needed to avoid oscillations. This idea is generalized in Delyon, Juditsky [11] for multidimensional problems. The gain coefficients are defined through the identification function I, given by I(t) = 1 if t < 0 and I(t) = 0 if t ≥ 0, which detects sign changes of successive gradient estimates. The method accelerates the convergence of SA, and its almost sure convergence is proved under the standard assumptions.
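Kesten's one-dimensional rule can be sketched as follows. This is a schematic reading of the idea (advance the gain index only after a sign change of the iterate differences); the test problem, noise level and constants are made up for illustration.

```python
import random

def kesten_sa(grad, x0, a=1.0, noise=0.5, iters=3000, seed=1):
    """1D SA with a Kesten-style rule: the gain shrinks only when the
    iterate differences change sign, i.e. only when we appear to be
    oscillating inside the noise region around the solution."""
    rng = random.Random(seed)
    x, prev_diff, n_changes = x0, 0.0, 0
    for _ in range(iters):
        a_k = a / (1 + n_changes)
        diff = -a_k * (grad(x) + rng.gauss(0.0, noise))
        if diff * prev_diff < 0:     # sign change -> probably noise-dominated
            n_changes += 1
        x, prev_diff = x + diff, diff
    return x

# Toy problem: g(x) = x - 2, so the solution is x* = 2.
x_kesten = kesten_sa(lambda x: x - 2.0, x0=10.0)
```

Far from the solution the gradient sign is stable, so the step stays large; near the solution the noise flips the sign often and the gain decays, mimicking the diminishing gain of plain SA only where it is needed.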

The idea of sign changes is further developed in Xu, Dai [38]. It is shown that the sequence {s_k/k} represents, in a certain sense, the frequency of sign changes of ĝ_{k+1}^{T} ĝ_k. The assumption in [38] is that the noise ξ_k is state independent. The theoretical analysis shows that in that case s_k/k converges to P(ξ_1^{T} ξ_2 < 0) in the mean square sense. Based on that result, a switching algorithm that uses a switching parameter t_k, defined as

is proposed. Then the gain coefficients are defined as

where 0.5 ≤ α < β ≤ 1, v is a small positive constant, and a and A are the constants from the gain sequence. To prove the convergence of the switching algorithm (9)-(10), one additional assumption is introduced in [38].

**S5.** G(x) = g(x) − x is a weighted maximum norm pseudo-contraction operator. That is, for all x ∈ ℝ^{n} there exist a positive vector ω and some x* ∈ ℝ^{n} such that

||G(x) − x*||_ω ≤ β ||x − x*||_ω,

where β ∈ (0,1) and ||·||_ω is defined as ||x||_ω = max{ |x(i)| ω(i)^{−1}, i = 1, ..., n }, where x(i) and ω(i) are the ith components of x and ω, respectively.

**Theorem 2.3.** [38] Suppose that assumptions S1-S2 and S5 hold. Then for {x_k} generated through (9)-(10) we have x_k → x* as k → ∞ with probability 1.

If the objective function is given in the form of a mathematical expectation, the adaptive step length sequence can be determined as proposed in Yousefian et al. [36]. For the problem

min_{x∈D} f(x) = E[F(x, ξ)], (11)

the following assumptions are stated.

**S6.** The function F(·, ξ) is convex on a closed and convex set D ⊂ ℝ^{n} for every ξ ∈ Q, and the expected value E(F(x, ξ)) is finite for every x ∈ D.

**S7.** The errors ξ_k in the noisy gradient ĝ_k are such that for some µ > 0,

E[||ξ_k||^{2} | F_k] ≤ µ^{2} a.s. for all k.

A self-adaptive scheme is based on minimizing the error, and the convergence result is as follows.

**Theorem 2.4.** [36] Let assumptions S6 and S7 hold. Let the function f be differentiable over the set D with Lipschitz continuous gradient, and assume that the optimal set of problem (11) is nonempty. Assume that the step size sequence {a_k} is generated through the following self-adaptive scheme:

a_k = a_{k−1}(1 − c a_{k−1}) for all k ≥ 1,

where c > 0 is a scalar and the initial step size is such that 0 < a_0 < 1/c. Then the sequence {x_k} converges almost surely to a random point that belongs to the optimal set.

An important choice for the gain sequence is a constant sequence. Although such sequences do not satisfy assumption S1 and almost sure convergence to the solution cannot be obtained, it can be shown that a constant step size can conduct the iterates to a region that contains the solution. This result initiated the development of a cascading steplength SA scheme in [36], where a fixed step size is used until some neighborhood of the solution is reached. After that, in order to come closer to the solution, the step size is decreased and a fixed step size is again used until the ring around the solution is sufficiently tight. In this way, the sequence of iterates is guided towards the solution.
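A minimal sketch of such a cascading scheme; a fixed phase length stands in here for the (theoretically justified) criterion for deciding that a neighborhood of the solution has been reached, and all constants are illustrative.

```python
import random

def cascading_sa(grad, x0, step0=0.5, phases=5, phase_len=400, noise=0.3, seed=2):
    """Cascading-steplength sketch: run SA with a CONSTANT step for a
    phase (proxy for 'a neighborhood was reached'), then shrink the
    step and continue from the current iterate."""
    rng = random.Random(seed)
    x, step = x0, step0
    for _ in range(phases):
        for _ in range(phase_len):
            x -= step * (grad(x) + rng.gauss(0.0, noise))
        step *= 0.5           # tighter ring around the solution
    return x

# Toy problem: g(x) = x - 1, so the solution is x* = 1.
x_casc = cascading_sa(lambda x: x - 1.0, x0=8.0)
```

Each constant-step phase converges quickly to a noise ball whose radius scales with the step size; halving the step halves (roughly) the radius of that ball.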

A hybrid method which combines the SA gain coefficients and the step sizes obtained from an inexact Armijo line search, under the assumptions valid for the SA method, is considered in Krejić et al. [24]. The method takes the advantages of both approaches: the safe convergence of the SA method, and the fast progress obtained by the line search when the current iterate is far away from the solution (where the SA steps would be unnecessarily small). The step size is defined according to the following rule. For a given C > 0 and a sequence {a_k} that satisfies assumption S1 we define

where β_k is obtained from the Armijo inequality

After that the new iterate is obtained as

The existence of C such that the gain coefficient (12) is well defined, as well as the convergence of the sequence generated by (13), is proved in [24] under one additional assumption.

**S8.** The observation noise is bounded: there exists a positive constant M such that ||ξ_k(x)|| ≤ M a.s. for all k and x.

**Theorem 2.5.** [24] Assume that assumptions S1-S4 and S8 hold, the gradient g is Lipschitz continuous with constant L, and the Hessian matrix H(x*) exists and is nonsingular. Let

where

Let {x_k} be an infinite sequence generated by (13). Then x_k → x* a.s.

Many important issues regarding the convergence of SA methods have not been mentioned so far. One effective possibility for speeding up the convergence is to apply averaging to the sequence of gradient estimations ĝ(x_k), as suggested in Andradóttir [1]. It is shown that the rate of convergence can be significantly better than the rate of SA if two conditionally independent gradient estimations are generated and the new iterate is obtained using a scaled linear combination of the two gradients with the gain coefficient. More details on this procedure are available in [1]. Let us also mention a robust SA scheme that determines an optimal constant step length based on minimization of the theoretical error for a pre-specified number of steps, [27].

We have assumed in the above discussion that noisy gradient values are available. This is the case, for example, if the analytical expression of F in (11) is available. In this case, under certain assumptions, we can interchange the expectation and the derivative, and thus a sample average approximation of the gradient can be calculated. It is important to be able to use a sample gradient estimate with a relatively modest sample size, as calculation of the sample gradient is in general expensive for large samples. However, it is safe to claim that the analytical expression for the gradient is not available in many cases, and thus the only input data we have are (possibly noisy) function values. Thus gradient approximation by finite differences appears to be a natural choice in many applications. The first method of this kind is due to Kiefer, Wolfowitz, [21]. Many generalizations and extensions were later considered in the literature, see Fu [15] for example. Among the many methods of this kind, the Simultaneous Perturbation method is particularly efficient, as it uses only two function values to obtain a good gradient approximation; see [35] for implementation details.
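The two-evaluation gradient estimate at the heart of simultaneous perturbation can be sketched as below. The random ±1 perturbations and the step c are the usual ingredients; the quadratic test function is our own. A single estimate is noisy (unbiased only on average over the random direction), so the usage example averages many of them.

```python
import random

def spsa_gradient(f, x, c=0.1, seed=None):
    """Simultaneous-perturbation estimate: one random +-1 direction delta
    and only TWO function evaluations approximate the whole gradient."""
    rng = random.Random(seed)
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = (f(x_plus) - f(x_minus)) / (2.0 * c)
    return [diff / di for di in delta]

# Quadratic test function; the true gradient at (3, 3) is (4, 4).
f = lambda x: sum((xi - 1.0) ** 2 for xi in x)
estimates = [spsa_gradient(f, [3.0, 3.0], seed=s) for s in range(2000)]
g_avg = [sum(e[i] for e in estimates) / len(estimates) for i in range(2)]
```

Note that the cost of one estimate is two function evaluations regardless of the dimension n, whereas coordinate-wise finite differences need 2n evaluations.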

The questions of stopping criteria, global convergence, search directions which are not gradient related, and other important issues are beyond the scope of this paper. An interested reader might look at Spall [34] and Shapiro et al. [32] for guidance on these issues and the relevant literature.

3 SAMPLE AVERAGE APPROXIMATION

Sample Average Approximation (SAA) is a widely used technique for approaching problems of the form

min_{x} f(x) = E[F(x, ξ)]. (14)

The basic idea is to approximate the objective function f(x) with the sample mean

f̂_N(x) = (1/N) ∑_{i=1}^{N} F(x, ξ_i), (15)

where N is the size of a sample represented by the i.i.d. random vectors ξ_1, ..., ξ_N. Under standard assumptions, such as finite variance of F(x, ξ), the (strong) Law of Large Numbers implies that f̂_N(x) converges to f(x) almost surely. Moreover, if F(x, ξ) is dominated by an integrable function, then the uniform almost sure convergence of f̂_N(x) on compact subsets of ℝ^{n} is obtained.
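A tiny self-contained illustration of the sample average construction and the Law of Large Numbers at work; the integrand F and the distribution of ξ below are chosen purely for convenience.

```python
import random

def F(x, xi):
    """Illustrative integrand: F(x, xi) = (x - xi)^2."""
    return (x - xi) ** 2

def f_hat(x, xis):
    """SAA objective: the sample mean of F(x, xi) over a fixed sample."""
    return sum(F(x, xi) for xi in xis) / len(xis)

# With xi ~ N(1, 1) we get f(x) = E[F(x, xi)] = (x - 1)^2 + 1, so f(2) = 2.
rng = random.Random(0)
sample = [rng.gauss(1.0, 1.0) for _ in range(100_000)]
approx = f_hat(2.0, sample)
```

Once the sample is drawn and fixed, f̂_N is an ordinary deterministic function of x, which is exactly what makes standard optimization machinery applicable.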

Within the SAA framework the original problem (14) is replaced by the approximate problem

min_{x} f̂_N(x), (16)

and thus the key question is the relationship between their respective solutions as N tends to infinity. Denote by X* the set of optimal solutions of problem (14) and let f* be the optimal value of the objective function. Furthermore, denote by X̂_N and f̂*_N the set of optimal solutions and the corresponding optimal value, respectively, of problem (16). Then the following result holds.

**Theorem 3.1.** [32] Suppose that there exists a compact set C ⊂ ℝ^{n} such that X* is nonempty and X* ⊂ C. Assume that the function f is finite valued and continuous on C and that f̂_N converges to f almost surely, uniformly on C. Also, suppose that for N large enough the set X̂_N is nonempty and X̂_N ⊂ C. Then f̂*_N → f* and the distance between the sets X̂_N and X* tends to zero almost surely as N → ∞.

Let x̂_N be an approximate solution of problem (14). Clearly f̂_N(x̂_N) can be calculated for a given sample. The Central Limit Theorem can be used to obtain an error bound ε_δ^{N}(x̂_N) such that f(x̂_N) ≤ f̂_N(x̂_N) + ε_δ^{N}(x̂_N) holds with some high probability δ ∈ (0,1). For example, using the sample variance σ̂_N^{2}(x̂_N), the following error bound is obtained

ε_δ^{N}(x) = z_δ σ̂_N(x)/√N, (17)

with z_δ being the corresponding quantile of the standard normal distribution. The error bound is directly proportional to the variance of the estimator, Var(f̂_N(x̂_N)). Therefore, in order to provide a tight bound one can consider variance reduction techniques such as quasi-Monte Carlo or Latin hypercube sampling [32]. However, these techniques tend to violate the i.i.d. assumption. This issue is addressed further on in this section.
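A CLT-style bound of this kind is easy to compute from a sample; the z·σ̂/√N form below is the textbook construction, applied to a toy F of our choosing.

```python
import math
import random
from statistics import NormalDist

def saa_estimate_with_bound(x, sample, F, delta=0.95):
    """Returns the sample mean of F(x, .) and the one-sided CLT half-width
    z_delta * s_N / sqrt(N), where s_N is the sample standard deviation."""
    vals = [F(x, xi) for xi in sample]
    n = len(vals)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    z = NormalDist().inv_cdf(delta)
    return mean, z * math.sqrt(var / n)

F = lambda x, xi: (x - xi) ** 2        # f(0) = E[xi^2] = 1 for xi ~ N(0, 1)
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
mean, bound = saa_estimate_with_bound(0.0, sample, F)
```

The 1/√N decay of the bound is what drives the variance reduction discussion above: shrinking the sample variance is the only other lever available.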

The gap g(x̂_N) = f(x̂_N) − f*, where x* is a solution of the original problem, can be estimated as well. Clearly g(x̂_N) ≥ 0. To obtain an upper bound, suppose that M independent samples of size N are available, i.e. we have the i.i.d. samples ξ_1^{m}, ..., ξ_N^{m}, m = 1, ..., M. Denote by f̂*_{N,m} the corresponding (nearly) optimal values, and define their average and sample variance. Then the upper bound estimator for the gap g(x̂_N) is

where N' is some large enough sample size and t_{M−1,δ} is the quantile of Student's distribution with M − 1 degrees of freedom. It should be mentioned that sample size bounds ensuring that the solutions of an approximate problem are nearly optimal for the true problem with some high probability are, in general, too conservative for practical applications. For further references on this topic see [32], for instance.

Recall that almost sure convergence of f̂_N(x) towards f(x) is achieved, under standard assumptions, if the sample is i.i.d. However, if the sample is not i.i.d., the almost sure convergence of f̂_{N_k}(x) is achievable only if the sample size N_k which defines the SAA problem grows at a certain rate, Homem-de-Mello [19]. The analysis presented in [19] allows for biased estimators f̂_N(x), provided f̂_N(x) is at least asymptotically unbiased. Let us first assume that the sample ξ_1^{k}, ..., ξ_{N_k}^{k} generated at iteration k is independent of the sample at the previous iteration, for every k. The following assumptions are needed.

**R1.** For each x, there exists M(x) > 0 such that sup_{i,k} F(x, ξ_i^{k}) ≤ M(x) with probability 1.

**R2.** For each x, we have lim_{k→∞} E(f̂_{N_k}(x)) = f(x).

**Theorem 3.2.** [19] Suppose that assumptions R1-R2 hold and that the sample size sequence {N_k} satisfies ∑_{k=1}^{∞} α^{N_k} < ∞ for all α ∈ (0,1). Then f̂_{N_k}(x) converges to f(x) almost surely.

For example, N_k ≥ √k satisfies the previously stated summability condition.

The rate of convergence is also addressed in [19], i.e. error bounds for |f̂_{N_k}(x) − f(x)| are developed. In the case where N_k ≥ c_1 k^{ρ} for some c_1 > 0 and ρ > 2, it can be proved, under some additional assumptions, that for every k sufficiently large the following inequality holds almost surely

Moreover, if the sample is cumulative, the corresponding error bound is

where C is some positive constant.

The above analysis provides a justification for the SAA approximation, as well as guidance for choosing N in (16). From now on we therefore concentrate on gradient methods for solving (16). Several papers exploit ideas from deterministic optimization. Generally speaking, we are interested in solving the SAA problem for some finite, possibly very large N, as well as in obtaining asymptotic results, i.e. results that cover the case N → ∞, even if in practical applications one deals with a finite value of N. A naive application of an optimization solver to (16) is very often prohibitively costly if N is large, due to the cost of calculating f̂_N(x) and its gradient. Thus there is a vast literature dealing with variable sample size schemes for solving (16).

Two main approaches can be distinguished. In the first approach, the objective function is replaced with f̂_{N_k}(x) at each iteration k, and the iterative procedure is essentially a two-step procedure of the following form. Given the current approximation x_k and the sample size N_k, one has to find a step s_k such that the value of f̂_{N_k}(x_k + s_k) is decreased. After that we set x_{k+1} = x_k + s_k and choose a new sample size N_{k+1}. The key ingredient of this procedure is the choice of N_{k+1}.

The schedule of sample sizes {N_k} should be defined in such a way that either N_k = N for k large enough, or N_k → ∞ if one is interested in asymptotic properties.

The second approach, often called the diagonalization scheme or the surface response method, is again a two-step procedure. It consists of a sequence of SAA problems with different sample sizes that are approximately solved. For the current x_k and N_k, the problem (16) with N = N_k is approximately solved (within an inner loop), starting with x_k as the initial approximation. After that we set x_{k+1} to the obtained approximate solution and choose the new sample size N_{k+1}. Two important points in this procedure are the choice of N_{k+1} and the precision with which each of the optimization problems min_x f̂_{N_k}(x) is solved.
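The diagonalization loop can be sketched as follows, with plain gradient descent as the inner solver and an arbitrary increasing schedule of sample sizes; the test problem and all constants are illustrative, not taken from any of the cited works.

```python
import random

def solve_saa(x0, sample, lr=0.1, iters=200):
    """Inner loop: plain gradient descent on f_hat(x) = mean((x - xi)^2),
    whose exact minimizer is the sample mean."""
    x = x0
    for _ in range(iters):
        g = sum(2.0 * (x - xi) for xi in sample) / len(sample)
        x -= lr * g
    return x

def diagonalization(x0, sizes=(50, 500, 5000), seed=3):
    """Outer loop: a sequence of SAA problems with growing sample size,
    each warm-started from the previous approximate solution."""
    rng = random.Random(seed)
    x = x0
    for n in sizes:
        sample = [rng.gauss(4.0, 1.0) for _ in range(n)]  # xi ~ N(4, 1)
        x = solve_saa(x, sample)       # approximate solution for N = n
    return x

x_diag = diagonalization(0.0)          # true solution is x* = 4
```

The early, cheap problems do most of the travelling; the final, expensive problem only refines an already good starting point.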

Let us now look into algorithms of the first kind. Keeping in mind that min_x f̂_{N_k}(x) is just an approximation of the original problem and that the cost of each iteration depends on N_k, it is rather intuitive to start the optimization procedure with smaller samples and gradually increase the sample size N_k as the solution is approached. Thus the most common schedule would be an increasing sequence N_0, N_1, .... The convergence theory for this kind of reasoning is introduced in Wardi [40], where an Armijo-type line search method is combined with the SAA approach. In order to solve a problem of type (14), the iterative sequence is generated as

x_{k+1} = x_k − α_k ∇f̂_{N_k}(x_k), (19)

where N_k is the sample size used at iteration k and α_k is the largest number in (0,1] satisfying the Armijo-type inequality

f̂_{N_k}(x_k − α_k ∇f̂_{N_k}(x_k)) ≤ f̂_{N_k}(x_k) − η α_k ||∇f̂_{N_k}(x_k)||^{2}.
_{k}The method is convergent with zero upper density ^{[}^{40}^{]}, assuming that *N _{k}* →∞.
More precisely, the following statement is proved.

**Theorem 3.3.**
^{[}^{40}^{]} Assume that the
function *f* is given by (14) and that *F* is twice
continuously differentiable on ℝ*n* for every ξ. Furthermore assume that
for every compact set D ⊂ℝ^{n}, there exists* K > 0* such that
for every *x* ∈ *D* and every ξ.

If *N _{k}* →∞

*then the sequence {xk}*given by (19) converges with zero upper density on compact sets.
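The interplay of a backtracking search (one concrete way to pick the largest admissible α_k of the form 2^{-j}) with a growing sample size can be sketched deterministically. The "sample" below is a synthetic grid rather than random draws, purely so the run is reproducible; everything else is an illustrative stand-in, not Wardi's algorithm itself.

```python
def armijo_step(f, g, x, eta=1e-4):
    """Largest alpha of the form 2^-j in (0, 1] with
    f(x - alpha*g) <= f(x) - eta * alpha * g^2 (one-dimensional case)."""
    alpha, fx, g2 = 1.0, f(x), g * g
    while alpha > 1e-8 and f(x - alpha * g) > fx - eta * alpha * g2:
        alpha *= 0.5
    return alpha

def run(x0, schedule):
    x = x0
    for N in schedule:                      # sample size grows along the run
        sample = [i / N for i in range(N)]  # synthetic 'sample', mean -> 1/2
        f = lambda y: sum((y - xi) ** 2 for xi in sample) / N
        g = sum(2.0 * (x - xi) for xi in sample) / N
        x -= armijo_step(f, g, x) * g       # gradient step with Armijo alpha_k
    return x

x_ls = run(5.0, schedule=[10, 100, 1000, 10_000])
```

Each outer step performs the line search on the current cheap approximation f̂_{N_k}, and only the last, largest sample needs to be touched near the solution.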

An extension of the above work is presented in Yan, Mukai [37], where an adaptive precision is proposed, i.e. the sequence {N_k}_{k∈ℕ} is not determined in advance as in [40] but is adapted during the iterative procedure. Nevertheless, the sample size has to satisfy N_k → ∞. The convergence result is slightly stronger, as convergence with probability 1 is proved under a set of appropriate assumptions. A more general result that applies to both gradient and subgradient methods is obtained in Shapiro, Wardi [33], where convergence with probability 1 is proved for sample average gradient and subgradient methods, assuming that the sample size tends to infinity.

In practical applications, the sample size is finite. So let us now suppose that N_max is a sample size which makes f̂_{N_max} a good approximation of the original objective function. Very often we assume that the sample is generated at the beginning of the process, which justifies considering the SAA objective function as deterministic. In this case one again wishes to decrease the cost of the optimization process by decreasing the number of function and gradient evaluations. Let us now look closer at the possible schedules of N_k. Clearly, the sample size should be equal to N_max at the final stages of the optimization procedure, to ensure that problem (16) with N = N_max is solved. Thus one can consider even some heuristic schedule [14] to generate a non-decreasing sequence {N_k} which eventually becomes stationary with N_k = N_max for k large enough. For example, a simple way to define such a sequence could be to increase N_k by a fixed number every K iterations.

The problem of scheduling can be approached from a different perspective in the following manner. Instead of constantly increasing the sample size, one can monitor the progress in decreasing the (approximate) objective function and choose the next sample size according to that progress. One algorithm of this kind is presented in Deng, Ferris [12], where the Bayes risk is used to decide the scheduling sequence within a trust region method. Another class of results in the framework of trust region methods is presented in Bastin [2] and Bastin et al. [3], [4]. The key point of the approach considered in [2, 3, 4] is that the sample sizes might oscillate during the iterative process, i.e. {N_k} is not necessarily non-decreasing in the initial stages of the iterative process. Eventually N_k = N_max is reached and (16) is solved, but very often at a smaller cost compared with an increasing schedule. The efficiency of this approach comes from the balance between the precision of the objective function approximation and the progress towards the solution. The same idea is further developed for line search methods in Krejić, Krklec [22] as follows.

Let us assume that the gradient ∇F is available and that the search direction p_k satisfies p_k^{T} ∇f̂_{N_k}(x_k) < 0. The Armijo rule with η ∈ (0,1) is applied to find α_k such that x_{k+1} = x_k + α_k p_k, where

f̂_{N_k}(x_k + α_k p_k) ≤ f̂_{N_k}(x_k) + η α_k (∇f̂_{N_k}(x_k))^{T} p_k.

The sample size is updated as follows. First, the candidate sample size N_k^{+} is determined by comparing the measure of decrease in the objective function, dm_k = −α_k (∇f̂_{N_k}(x_k))^{T} p_k, and the so-called lack of precision ε_δ^{N}(x) defined by (17). The main idea is to find a sample size N_k^{+} such that dm_k ≈ ε_δ^{N_k^{+}}(x_k). The reasoning behind this idea is the following. If the decrease measure dm_k is greater than the lack of precision ε_δ^{N_k}(x_k), the current approximation is probably far away from the solution. In that case, there is no need to impose high precision, and therefore the sample size is decreased if possible. The candidate sample size does not exceed N_max, but there is also a lower bound, i.e. N_k^{min} ≤ N_k^{+} ≤ N_max. This lower bound is increased only if N_{k+1} > N_k and there is not enough progress concerning the function f̂_{N_{k+1}}. After finding the candidate sample size, a safeguard check is performed in order to prohibit a decrease of the sample size which might be unproductive. More precisely, if N_k^{+} < N_k, the following parameter is calculated

If ρ_k is relatively small, then it is presumed that these two model functions are too different and thus there is no gain in decreasing the sample size, so N_{k+1} = N_k. In all other cases the decrease is accepted and N_{k+1} = N_k^{+}. The convergence analysis relies on the following important result, which states that after a finite number of iterations the objective function becomes f̂_{N_max} and (16) is eventually solved.
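The sample-size update driven by comparing the decrease measure with the lack of precision can be caricatured as follows; the 1/√N precision model, the bounds and the update rule below are simplified stand-ins, not the exact quantities from [22].

```python
def next_sample_size(dm, precision, N_k, N_min=100, N_max=10_000, step=100):
    """Move the candidate sample size until the decrease measure dm roughly
    matches the lack of precision, keeping N_min <= N <= N_max."""
    N = N_k
    while precision(N) > dm and N < N_max:
        N += step      # too noisy for the progress made -> more samples
    while N > N_min and precision(N - step) < dm:
        N -= step      # more precision than needed -> save simulation calls
    return max(N_min, min(N, N_max))

precision = lambda N: 1.0 / N ** 0.5    # stylized stand-in for eps(N)

N_far = next_sample_size(dm=0.5, precision=precision, N_k=5000)     # far away
N_near = next_sample_size(dm=0.005, precision=precision, N_k=5000)  # near end
```

Large progress far from the solution pushes the sample size down to its floor; small progress near the solution pushes it up towards N_max, which is exactly the oscillating behavior described above.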

**Theorem 3.4.** [22] Suppose that F(·, ξ) is continuously differentiable and bounded from below for every ξ. Furthermore, suppose that there exist a positive constant κ and a number n_1 ∈ ℕ such that ε_δ^{N_k}(x_k) ≥ κ for every k ≥ n_1. Then there exists q ∈ ℕ such that N_k = N_max for every k ≥ q.

Let us now present some results for the second type of methods, the so-called diagonalization methods described above. One possibility for determining the sample sizes in the sequence of optimization problems to be solved is presented in Royset [31], where an optimality function is used to determine when to switch to a larger sample size. The optimality function is defined by a mapping θ : ℝ^{n} → (−∞, 0] which, under standard conditions, satisfies θ(x) = 0 if and only if x is a solution in some sense. For the unconstrained problem (14), the optimality function can be taken as θ(x) = −½||∇f(x)||^{2}, with SAA approximation θ_N(x) = −½||∇f̂_N(x)||^{2}. Under a set of standard assumptions, almost sure convergence of θ_N(x) towards θ(x) is established, together with asymptotic normality.

Consider the iterate obtained after a finite number of iterations of an algorithm applied to the SAA problem with sample size N_{k}, started from a given initial point. This point is an approximate solution of (16) with *N* = N_{k}, and it is assumed that the optimization algorithm used to determine it is successful in the following sense.

**R3.** For any N_{k}, every accumulation point x̄ of the sequence generated by the optimization method for solving the corresponding SAA problem satisfies θ_{N_{k}}(x̄) = 0 almost surely.

The algorithm proposed in ^{[}^{31}^{]} increases the sample size when

θ_{N_{k}}(*x*) ≥ -δ_{1} Δ(N_{k}),

where δ_{1} is some positive constant and Δ is a function that maps ℕ into (0, ∞) and satisfies lim_{N→∞} Δ(*N*) = 0. The sample size is assumed to be strictly increasing and unbounded, but the exact dynamics of the increase is not specified. The convergence of the algorithm is proved under one additional assumption.
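The switching test can be sketched in a few lines; Δ(N) = 1/√N and δ_1 = 1 are illustrative choices satisfying the stated requirements, not values fixed in ^{[}^{31}^{]}:

```python
import math

def should_increase_sample(theta_val, N_k, delta_1=1.0):
    """Return True when the test theta_{N_k}(x) >= -delta_1 * Delta(N_k)
    signals that it is time to switch to a larger sample size.

    Delta(N) = 1/sqrt(N) is an illustrative choice: it maps N into
    (0, inf) and tends to zero as N grows, as required above.
    """
    delta_N = 1.0 / math.sqrt(N_k)
    return theta_val >= -delta_1 * delta_N
```

With N_k = 100 the threshold is -0.1, so a near-zero optimality value such as -0.01 triggers the increase while -0.5 does not.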

**R4.** On any given set *S* ⊂ ℝ*n*, the function
*F*(·,ξ) is continuously differentiable and *F*(·,ξ)
and ||∇* _{x} F*(·,ξ)|| are dominated by an integrable
function.

**Theorem 3.5.**
^{[}^{31}^{]} *Suppose that the assumptions R3-R4 are satisfied and that the sequence of iterates generated by the algorithm proposed in* ^{[}^{31}^{]} *is bounded. Then, every accumulation point x̄ of that sequence satisfies θ(x̄) = 0 almost surely.*

The relation between the sample size and the error tolerance for each of the optimization problems solved within the diagonalization methods is considered in Pasupathy ^{[}^{28}^{]}. The error tolerance here is a small number ε_{k} which almost surely satisfies ||x_{k} - x_{k}*|| ≤ ε_{k}, where x_{k} and x_{k}* represent the approximate and the true (unique) solution of the corresponding SAA problem, respectively. A measure of effectiveness is defined as q_{k} = ||x_{k} - x_{k}*||^{2} W_{k}, where W_{k} represents the number of simulation calls needed to obtain the approximate solution x_{k}. Since almost sure convergence is analyzed, it is assumed that N_{k} → ∞ and ε_{k} → 0. It is proved that the measure of effectiveness is bounded in a stochastic sense if the following three conditions hold.

**R5.** If the numerical procedure used to solve the SAA problems exhibits linear convergence, we assume that the error tolerance sequence {ε_{k}} decreases at most at a corresponding linear rate; if it exhibits polynomial convergence of order p > 1, we assume the analogous polynomial condition on {ε_{k}}.

**R6.** lim sup_{k→∞} (∑_{j=1}^{k} N_{j}) ε_{k}^{2} < ∞.

**R7.** lim sup_{k→∞} (∑_{j=1}^{k} N_{j}) N_{k}^{-1} < ∞.

If any of the above conditions is violated, then q_{k} tends to infinity in probability. The key point of the analysis in ^{[}^{28}^{]} is that the error tolerance should not be decreased faster than the sample size is increased. The dynamics of change depends on the convergence rate of the numerical procedures used to solve the SAA problems. Moreover, the mean squared error analysis implies the choice of ε_{k} and N_{k} such that

In order to further specify the choice of the optimal sequence of sample sizes, the following theorem is stated.

**Theorem 3.6.**
^{[}^{28}^{]} *Suppose that (21) holds together with the assumptions R5-R7. If the numerical procedure used to solve the SAA problems exhibits linear convergence, then*

*If the numerical procedure used to solve the SAA problems exhibits polynomial convergence of order p > 1, then* lim sup_{k→∞} N_{k}/N_{k-1}^{p} < ∞.

More specific recommendations are given for linear, sublinear and polynomial rates in ^{[}^{28}^{]}. For example, if the applied algorithm is linearly convergent, then linear growth of the sample size is recommended, i.e. one can set N_{k+1} = ⌈1.1 N_{k}⌉ for example. Also, in that case, exponential or polynomial growth of order p > 1 is not recommended. However, if a polynomial rate of convergence of order p > 1 is achieved, then we can set N_{k+1} = ⌈N_{k}^{1.1}⌉ or N_{k+1} = ⌈e^{N_{k}^{1.1}}⌉ for instance. Furthermore, it is implied that the error tolerance sequence should be of the form K/√N_{k}, where K is some positive constant.

The diagonalization methods are defined for a finite *N* as well. One possibility is presented in Polak, Royset ^{[}^{29}^{]} where the focus is on a finite sample size *N* although the almost sure convergence is addressed. The idea is to approximately solve the sequence of SAA problems with *N* = N_{k}, k = 1,...,s, applying n_{k} iterations at every stage k. The sample size is nondecreasing and the sample is assumed to be cumulative. The method consists of three phases. The first phase provides the estimates of relevant parameters such as the sample variance. In the second phase, the scheduling sequence is obtained. Finally, the sequence of the SAA problems is solved in the last phase.

An additional optimization problem is formulated and solved in the second phase in order to find the number s of SAA problems to be solved, the sample sizes N_{k}, k = 1,...,s, and the numbers of iterations n_{k} applied to solve the corresponding SAA problems. The objective function of this additional problem is the overall cost ∑_{k=1}^{s} n_{k} w(N_{k}), where w(N) is the estimated cost of one iteration of the algorithm applied on the function f_{N}. For example, w(N) = N. The constraint for this problem is motivated by the stopping criterion f(x) - f* ≤ ε(f(x_{0}) - f*), where f* is the optimal value of the objective function. More precisely, the cost-to-go is defined as e_{k} = f(x_{n_{k}}^{k}) - f*, where x_{n_{k}}^{k} is the last iterate at stage k. Furthermore, the upper bound estimate for e_{s} is determined as follows. Let Δ(N) be the function defined as in Royset ^{[}^{31}^{]}.

**R8.** *There exists a strictly decreasing function* Δ(*N*) : ℕ → (0, ∞) *such that* lim_{N→∞} Δ(*N*) = 0 *and* |f_{N_{k}}(*x*) - f(*x*)| ≤ Δ(N_{k}) *holds almost surely.*

One may use a bound like (18), but it is usually too conservative for practical implementations. Therefore, Δ(*N*) is estimated with the confidence interval bound of the form (17), where the variance is estimated in the first stage. The following bound is derived

where l_{k}(s) represents the remaining number of iterations after the stage k and θ defines the rate of convergence of the deterministic method applied on the SAA problem. The initial cost-to-go e_{0} = f(x_{0}^{1}) - f* is also estimated in the first phase. Finally, the efficient strategy is obtained as the solution of the following problem
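A minimal sketch of such a confidence interval estimate of Δ(N), assuming the sample standard deviation has already been estimated in the first phase and using the normal quantile 1.96 for an assumed 95% level:

```python
import math

def delta_estimate(sigma_hat, N, z=1.96):
    """Confidence-interval bound Delta(N) ~ z * sigma_hat / sqrt(N).

    Strictly decreasing in N and tending to zero, as assumption R8
    requires; sigma_hat is the sample standard deviation estimated in
    the first phase, z an assumed normal quantile (1.96 for 95%).
    """
    return z * sigma_hat / math.sqrt(N)
```

Quadrupling the sample size halves the estimated bound, which is the usual Monte Carlo 1/√N behaviour.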

In order to prove the asymptotic result, the following assumption regarding the optimization method used at each stage is imposed.

**R9.** The numerical procedure used to solve the SAA problems almost surely exhibits a linear rate of convergence with parameter θ ∈ (0, 1).

**Theorem 3.7.**
^{[}^{29}^{]} Suppose that the
assumptions *R8-R9* hold and that the sample size sequence tends to
infinity. Then lim* _{s→∞ }e_{s}* = 0 almost surely.

4 APPLICATIONS TO DETERMINISTIC PROBLEMS

A number of important deterministic problems can be written in the form

min_{x∈ℝ^{n}} f(x) = (1/N) ∑_{i=1}^{N} f_{i}(x),    (22)

where the f_{i}(x) are given functions and N is a large integer. For example, least squares and maximum likelihood problems are of this form. The objective function in (22) and its gradient are generally expensive to compute if N is large. On the other hand, for a given sample realization ξ_{1},...,ξ_{N} and f_{i}(x) = F(x, ξ_{i}), the SAA problems discussed in Section 3 are the same as (22). Therefore the SAA methods that deal with finite N can be used for solving the deterministic problems specified in (22). The main idea of this approach is to use the same reasoning as in the variable sample schemes to decrease the cost of calculating the objective function and its gradient, i.e. to approximate the function and the gradient with f_{N_{k}} and ∇f_{N_{k}}. One application of a variable sample method to the data fitting problem is presented in Krejić, Krklec Jerinkić ^{[}^{23}^{]}. In this section we consider two important problems, data fitting and machine learning, and methods for their solution that use stochastic gradient approximation in the sense of approximate gradients as explained above.
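For illustration, the finite-sum structure (22) and its subsampled approximations can be set up as follows for a least-squares fit; the synthetic data, the use of the first N_k terms, and the averaging convention f = (1/N)∑ f_i are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 5                       # many terms, few variables
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)

# f_i(x) = (a_i^T x - b_i)^2, so f(x) = (1/N) sum_i f_i(x) is a
# least-squares objective of the finite-sum form (22).
def f_sub(x, Nk):
    """Subsampled objective f_{N_k}: average of the first N_k terms."""
    r = A[:Nk] @ x - b[:Nk]
    return r @ r / Nk

def grad_sub(x, Nk):
    """Subsampled gradient: costs O(Nk * n) instead of O(N * n)."""
    r = A[:Nk] @ x - b[:Nk]
    return 2.0 * (A[:Nk].T @ r) / Nk
```

With Nk = N the two functions coincide with the full objective and gradient, which is exactly the regime the increasing sample size methods reach in the limit.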

The data fitting problem of the form (22) is considered in Friedlander, Schmidt ^{[}^{14}^{]}. The problem is solved by a quasi-Newton method, but a gradient approximation of the SAA type is used. Let the gradient estimation at the current iteration k be given as g_{k} = ∇f(x_{k}) + ϵ_{k}, where ϵ_{k} is the error term. The following assumptions are stated.

**P1.** The functions f_{1},...,f_{N} are continuously differentiable and the function f is strongly convex with parameter µ. Also, the gradient ∇f is Lipschitz continuous with parameter L.

**P2.** There are constants β_{1} ≥ 0 and β_{2} ≥ 1 such that ||∇f_{i}(x)||^{2} ≤ β_{1}||∇f(x)||^{2} + β_{2} for all x and i = 1,...,N.

The algorithm can be considered as an increasing sample size method where the sample size is bounded by N. The main issue in ^{[}^{14}^{]} is the rate of convergence, and the convergence analysis is carried out under the assumption of a constant step size. More precisely, the iteration is of the form x_{k+1} = x_{k} - (1/L) g_{k}.

Two approaches are considered: deterministic and stochastic sampling. Deterministic sampling assumes that if the sample size is N_{k}, then the gradients to be evaluated, ∇f_{i}(x_{k}), i = 1,...,N_{k}, are determined in advance. For example, the first N_{k} functions are used to obtain the gradient approximation g_{k} = (1/N_{k}) ∑_{i=1}^{N_{k}} ∇f_{i}(x_{k}). On the other hand, stochastic sampling assumes that the gradients ∇f_{i}(x), i = 1,...,N_{k}, to be evaluated are chosen randomly. We state the relevant results considering the R-linear rate of convergence. In the case of the deterministic gradient, q-linear convergence is also attained, but under stronger conditions on the increase of the sample size.
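The two sampling strategies can be sketched as follows, with the component gradients passed as rows of an array (names and the averaging convention are illustrative):

```python
import numpy as np

def gradient_estimate(component_grads, Nk, mode="deterministic", rng=None):
    """Build g_k from N_k of the N component gradients.

    component_grads: (N, n) array whose i-th row is grad f_i(x_k).
    'deterministic' uses the first N_k rows (chosen in advance, as in
    the text); 'stochastic' draws N_k rows at random without replacement.
    """
    if mode == "deterministic":
        rows = component_grads[:Nk]
    else:
        rng = rng or np.random.default_rng()
        idx = rng.choice(component_grads.shape[0], size=Nk, replace=False)
        rows = component_grads[idx]
    return rows.mean(axis=0)
```

Both modes return the full gradient once Nk reaches N; they differ only in which subsample, and hence which error term ϵ_k, is realized along the way.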

**Theorem 4.1.**
^{[}^{14}^{]} *Suppose that the assumptions P1-P2 hold and that* (N - N_{k})/N = O(γ^{k/2}) *for some* γ ∈ (0, 1). *Then for any* ε > 0, σ = max{γ, 1 - µ/L} + ε *and every k, in the deterministic case we obtain*

f(x_{k}) - f(x*) = (f(x_{0}) - f(x*)) O((1 - µ/L + ε)^{k}) + O(σ^{k}).

*Moreover, if* (N - N_{k})/(N_{k}N) = O(γ^{k}), *in the stochastic case we obtain*

E[f(x_{k}) - f(x*)] = (f(x_{0}) - f(x*)) O((1 - µ/L + ε)^{k}) + O(σ^{k}).

Machine learning applications, which usually involve a large number of training points, can also be viewed as problems of the form (22). Methods for solving such problems are the subject of Byrd et al. ^{[}^{8}^{]} and Byrd et al. ^{[}^{9}^{]}. The main idea in ^{[}^{8}^{]} is to create methods which use second order derivative information but with cost comparable to the steepest descent method. The focus is on using a cheap Hessian approximation ∇^{2}f_{S_{k}}(x_{k}), where S_{k} is the number of training points used for it, i.e. the Hessian-related sample size at iteration k. More precisely, a matrix-free conjugate gradient method is applied in order to obtain the search direction p_{k} as an approximate solution of the system

∇^{2}f_{S_{k}}(x_{k}) p = -∇f_{N_{k}}(x_{k}).

Here, N_{k} is the sample size related to the gradient and the function approximation. This can be considered as an inexact Newton method. In the relevant examples, p_{k} is guaranteed to be a descent direction and therefore the Armijo line search is applied. The proposed method (named S-Newton) does not specify the dynamics of changing the sample sizes. It only requires that the variable sample strategy is used and S_{k} < N_{k}. The analysis is conducted for the full gradient case.

**Theorem 4.2.**
^{[}^{8}^{]} *Suppose that the function f_{N} is twice continuously differentiable and uniformly convex and that there exists a constant γ > 0 such that* x^{T}∇^{2}f_{S_{k}}(x_{k})x ≥ γ||x||^{2} *for every k and x. Then the sequence generated by the S-Newton method with N_{k} = N satisfies* lim_{k→∞} ||∇f_{N}(x_{k})|| = 0.
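The matrix-free computation of a search direction of this kind can be sketched with the conjugate gradient method applied to H p = -g, where the subsampled Hessian enters only through Hessian-vector products; the explicit 2×2 matrix below merely stands in for such a product:

```python
import numpy as np

def newton_direction_cg(hess_vec, grad, max_iters=50, tol=1e-8):
    """Approximately solve H p = -grad by conjugate gradients, touching
    H only through the product hess_vec(v) = H v (H never formed)."""
    p = np.zeros_like(grad, dtype=float)
    r = -np.asarray(grad, dtype=float)   # residual of H p = -grad at p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs) < tol:
            break
        Hd = hess_vec(d)
        alpha = rs / (d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# Toy check: an explicit SPD matrix stands in for the subsampled Hessian.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, 1.0])
p = newton_direction_cg(lambda v: H @ v, g)
```

For an SPD system CG terminates in at most n iterations, and the resulting p satisfies p·g < 0, i.e. it is a descent direction.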

The same result can be obtained for the so called SLM method, which uses a matrix-free limited memory BFGS method. In that case, the conjugate gradient method is used to obtain the search direction, where the sub-sampled Hessian approximation ∇^{2}f_{S_{k}}(x_{k}) is used for the initial matrix-vector product at every iteration. The line search uses Wolfe conditions for choosing a suitable step size.

The dynamics of increasing the sample size in the machine learning problem is addressed in ^{[}^{9}^{]}. The main idea is to estimate a sample size which makes the search direction p_{k} descent for the objective function f_{N} without evaluating the true gradient ∇f_{N}(x). The approximation of the negative gradient, p_{k} = -∇f_{N_{k}}(x_{k}), is a descent direction if, for some θ ∈ [0, 1], the following inequality holds

||∇f_{N_{k}}(x_{k}) - ∇f_{N}(x_{k})|| ≤ θ ||∇f_{N_{k}}(x_{k})||.    (23)

Since the true gradient is unavailable and N is large, inequality (23) is approximated by a condition (24) in which a sample variance over the chosen sample of size N_{k} replaces the unknown error term. The algorithm for the sample size schedule proposed in ^{[}^{9}^{]} can be described as follows. After finding a step size α_{k} such that f_{N_{k}}(x_{k} + α_{k}p_{k}) < f_{N_{k}}(x_{k}) and setting x_{k+1} = x_{k} + α_{k}p_{k}, a new sample of the same size N_{k} is chosen. If inequality (24) holds for the new sample, the sample size remains unchanged, i.e. N_{k+1} = N_{k}. Otherwise, the sample is augmented and the new sample size N_{k+1} is determined by the variance-based formula from ^{[}^{9}^{]}. In order to conduct the complexity analysis, a constant step size is considered and the q-linear convergence rate is analyzed.
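A variance test of this kind and the corresponding augmentation rule can be sketched as follows; the exact form of (24) and of the new sample size is given in ^{[}^{9}^{]}, so the particular shapes used here (the 1-norm of the componentwise sample variance against θ²||g_k||²) should be read as an assumption:

```python
import numpy as np

def descent_test(component_grads, theta=0.5):
    """Approximate check that -g_k is descent for f_N: the estimated
    variance of the subsampled gradient must be small relative to
    ||g_k||^2 (an assumed shape for condition (24))."""
    Nk = component_grads.shape[0]
    g = component_grads.mean(axis=0)
    var = component_grads.var(axis=0, ddof=1)   # componentwise sample variance
    return var.sum() / Nk <= theta ** 2 * (g @ g)

def augmented_sample_size(component_grads, theta=0.5):
    """Smallest sample size for which the test above would hold,
    keeping the current variance and gradient estimates fixed."""
    g = component_grads.mean(axis=0)
    var = component_grads.var(axis=0, ddof=1)
    return int(np.ceil(var.sum() / (theta ** 2 * (g @ g))))
```

When the component gradients agree, the test passes for any sample size; when they are noisy, the augmentation rule scales the sample roughly with the variance-to-signal ratio.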

**Theorem 4.3.**
^{[}^{9}^{]} *Suppose that the function f_{N} is twice continuously differentiable and x* is a solution of the problem (22) with* ∇f_{N}(x*) = 0. *Furthermore, assume that there are constants* 0 < λ < L *such that* λ||h||^{2} ≤ h^{T}∇^{2}f_{N}(x)h ≤ L||h||^{2} *for all x and h. Let the sequence of iterates be generated by* x_{k+1} = x_{k} - α∇f_{N_{k}}(x_{k}), *where* α = (1 - θ)/L *and* θ ∈ (0, 1). *If the condition (23) is satisfied at iteration k, then*

*where* β *is a constant given in* ^{[}^{9}^{]}. *Moreover, if (23) holds for every k, then*

The schedule {N_{k}} for the gradient estimations is extended to the second order approximations to define a Newton-type method, the S-Newton method defined in ^{[}^{8}^{]}. This method uses the updating of N_{k} described above, while the Hessian-related sample size S_{k} follows the dynamics of N_{k}. More precisely, S_{k} = RN_{k}, where R is some positive number substantially smaller than 1. Also, the sample used for the Hessian approximation is assumed to be a subset of the sample used for the gradient and the function approximations. The stopping criterion for the conjugate gradient method used for obtaining the search direction is more complex than in ^{[}^{8}^{]} since it is related to the sample size. Wolfe conditions are imposed to obtain a suitable step size.

5 CONCLUSIONS

In this survey we considered unconstrained problems with a stochastic or expectation objective function. We focused our attention on two specific classes of gradient-related methods, Stochastic Approximation and Sample Average Approximation; many other important approaches are left out for the sake of brevity. An interested reader should consult ^{[}^{32}^{,}^{34}^{]} for initial guidance into stochastic optimization problems. Several natural extensions are easily incorporated in the framework considered in this paper, for example search directions with second order information, which usually yield faster convergence but also require additional cost, ^{[}^{8}^{,}^{9}^{,}^{10}^{]}. In order to decrease the linear algebra costs, one can consider preconditioners, although their construction might be a nontrivial issue due to the presence of random variables. On the other hand, given that there are a lot of problems yielding only input-output information, an interesting approach within the SA and SAA frameworks is based on zero order information, ^{[}^{17}^{]}. Constrained problems are always of great interest and some recent research on penalty methods within the SA methodology is presented in ^{[}^{39}^{]}. Projection and filter methods with variable sample size might be a valuable topic of future research. Among others, chance constrained problems are especially challenging. In all the considered methods, deriving complexity bounds can be of great interest from a practical point of view.