STOCHASTIC KNAPSACK PROBLEM: APPLICATION TO TRANSPORTATION PROBLEMS

In this paper, we study the stochastic knapsack problem with expectation constraint. We solve the relaxed version of this problem using a stochastic gradient algorithm in order to provide upper bounds for a branch-and-bound framework. Two approaches to estimate the needed gradients are studied: one based on Integration by Parts and one using Finite Differences. The Finite Differences method is a robust and simple approach that gives efficient results despite the fact that the estimated gradients are biased, while Integration by Parts rests on a more theoretical analysis and widens the field of applications. Numerical results on a dataset from the literature as well as on a set of randomly generated instances are given.


INTRODUCTION
The deterministic knapsack problem is a well known and well studied NP-hard combinatorial optimization problem. It consists in filling a knapsack with items out of a given set such that the weight capacity of the knapsack is respected and the total reward maximized. Several variants of the knapsack problem have been considered for many practical problems, among them network design problems. In telecommunication network design, the knapsack problem is often used as a subproblem for modelling the capacity of the edges (see Kellerer et al., 2004). It is now generally admitted that uncertainty is an inherent property of many practical problems. The main reason is that the use of average parameters could lead to severe service quality deterioration, whereas the use of extreme values could result in costly conservative solutions. In telecommunication networks, the main sources of uncertainty are the traffic demands and the costs; additional sources could be network and equipment failures. There are two main approaches to deal with uncertainty, namely robust optimization (Ben-Tal & Nemirowski, 2008) and stochastic optimization (Birge & Louveaux, 1997). Demand uncertainty in telecommunication network design problems leads to several stochastic models, among them the stochastic knapsack problem. The latter was used either with chance constraints (Kosuch & Lisser, 2010) or within two-stage stochastic knapsack problems, both for the single knapsack formulation and for the multiple knapsack one (Tönissen et al., 2016, 2017). The stochastic knapsack problem with random item sizes has met a large interest in the literature in the last decade. There are at least two ways to deal with a possible overload. When the overload is not tolerated, Bhalgat & Khanna (2011) proposed to remove the last inserted item, while in Chen & Ross (2014) the authors proposed that the knapsack returns zero when it overflows. In the case where the overflow is accepted, the knapsack constraint is replaced by a probabilistic constraint (see Goel & Indyk, 1999; Kosuch & Lisser, 2010).
In stochastic optimization there are three major approaches to deal with stochasticity: probabilistic constraints, two- or multi-stage settings, and on-line or dynamic modeling. In the first case, all decisions are made before the random parameters are revealed. The objective function and/or the constraints are thus formulated using expectations and probability measures. In the second case, recourse actions occur once the random parameters are known. These recourse actions are formulated as a penalty that has to be paid in case of a violated constraint, or are used to correct the decisions made in previous stages. In the last approach, on-line problems are solved dynamically. More precisely, the algorithm implemented to solve the problem is provided with further information during its execution and reacts to this new information as well as to previous decisions. The solution of an on-line problem is a decision policy, while in the first two cases the current decisions themselves are computed.
In this paper we focus on a particular variant of the single-stage stochastic knapsack problem with random weights: the expectation constrained knapsack problem. The paper is organized as follows. In section 2 the mathematical formulation of the problem is presented and discussed. In section 3 we present the stochastic gradient algorithm used to solve the corresponding relaxed problem, together with two methods to estimate the needed gradients. In contrast to the Approximation by Convolution method studied in Kosuch & Lisser (2010), which gives an approximation of the gradient, the Integration by Parts method used in this article replaces the expected function by another one with exactly the same expectation. In section 4 the convergence of the stochastic gradient algorithm is analyzed from a theoretical as well as a numerical point of view, and test results concerning the solution of the relaxed problem are presented. In section 6 we use these methods to solve the combinatorial problem described in Cohn & Barnhart (1998) using the branch-and-bound framework presented in Kosuch & Lisser (2010). Numerical results on a set of randomly generated data are given and analyzed.

MATHEMATICAL FORMULATIONS
We consider a stochastic knapsack problem of the following form: given a set of n items, we want to choose items without knowing the exact value of their weights. We therefore handle the weights as random variables and assume that the weight χ_i of item i is independently normally distributed with mean μ_i > 0 and standard deviation σ_i. Furthermore, each item has a fixed reward per weight unit r_i > 0. The choice of a reward per weight unit can be motivated by the fact that the value of an item often depends on its weight, which we do not know in advance¹. We denote by χ, μ, σ and r the corresponding n-dimensional vectors. The aim is to maximize the expected total gain E[∑_{i=1}^n r_i χ_i x_i]. Our knapsack has a fixed weight capacity c > 0, but due to the stochastic nature of the weights we need to define precisely what respecting the capacity means. In this paper, we solve the following expectation constrained knapsack problem (ECKP):

max_{x ∈ {0,1}^n}  E[∑_{i=1}^n r_i χ_i x_i]   (1)
s.t.  E[H_{R+}(c − g(x, χ))] ≥ p,   (2)

where E[·] denotes the expectation, g(x, χ) := ∑_{i=1}^n χ_i x_i is the total weight of the chosen items, H_{R+}(·) denotes the indicator function of the positive real interval, i.e. H_{R+}(y) = 1 if y ∈ R+ and 0 otherwise, and p ∈ (0.5, 1) is the prescribed probability.
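To make the formulation concrete, here is a minimal sketch with invented instance data (the values of μ, σ, r and c below are illustrative only, not taken from the paper). By linearity of expectation, the objective E[∑ r_i χ_i x_i] reduces to ∑ r_i μ_i x_i.

```python
import random

# Hypothetical 3-item instance (all numbers invented for illustration).
mu    = [4.0, 6.0, 5.0]   # mean weights mu_i > 0
sigma = [0.5, 1.0, 0.8]   # standard deviations sigma_i
r     = [2.0, 1.5, 1.0]   # rewards per weight unit r_i > 0
c     = 12.0              # knapsack capacity

def expected_gain(x):
    # By linearity, E[sum_i r_i chi_i x_i] = sum_i r_i mu_i x_i.
    return sum(ri * mi * xi for ri, mi, xi in zip(r, mu, x))

def total_weight(x, chi):
    # g(x, chi) = sum_i chi_i x_i, the (random) total weight.
    return sum(ci * xi for ci, xi in zip(chi, x))

def sample_weights(rng):
    # One realization of the independent normal weight vector chi.
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

x = [1, 1, 0]
print(expected_gain(x))   # 2.0*4.0 + 1.5*6.0 = 17.0
```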
The choice of p is a decision parameter that restricts the percentage of cases in which the capacity may be exceeded; p is not related to the amount of overweight. Constraint (2) can be equivalently reformulated as the following chance constraint:

P{g(x, χ) ≤ c} ≥ p.   (3)

Without loss of generality, we assume that E[H_{R+}(c − χ_i)] ≥ p for all i ∈ {1, . . ., n}: any item that does not satisfy this condition could be excluded from the beginning. It follows that the optimal solution vector x* has at least one non-zero component.
We call J the objective function of the above maximization problem, defined by J(x) = E[∑_{i=1}^n r_i χ_i x_i], and we write j(x, χ) = ∑_{i=1}^n r_i χ_i x_i. Furthermore, we refer to the function on the left-hand side of constraint (2) as G, i.e. G(x) := E[H_{R+}(c − g(x, χ))], and to the function inside the expectation as θ, i.e. θ(x, χ) := H_{R+}(c − g(x, χ)).
Throughout this paper, we assume that the weights are normally distributed. Normally distributed weights have the convenient property that any linear combination of them is still normally distributed. This property simplifies the calculations but is not an essential condition. To some extent, in the case of normal distributions it is even possible to compute the gradient of the expectation function directly. Assuming normal distributions is, to a large degree, just a practical framework for producing numerical results and theoretical formulations. Whenever we use normal distribution properties, we explain how to go beyond this special case.
Normally distributed random variables can have negative realizations, and assuming normally distributed weights might thus seem contradictory to the fact that item weights are always strictly positive. However, in most real-life applications the standard deviation is several times smaller than the mean value of the unknown parameter. In this case, the probability of negative weights becomes negligible.

¹ All the methods used and results presented in this article remain valid in the case where the rewards are deterministic, or random but independent of the weights. We do not use the fact that the rewards (per item) are normally distributed; any other distribution could therefore be considered.
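The claim that negative realizations are negligible is easy to quantify. Assuming, for illustration, a mean equal to four standard deviations (the exact ratio is an assumption of this example), the probability of a negative weight is Φ(−4):

```python
from math import erf, sqrt

def normal_cdf(t):
    # CDF of the standard normal distribution, via the error function.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

# If mu_i = 4 * sigma_i, then P(chi_i < 0) = Phi(-mu_i / sigma_i) = Phi(-4).
p_negative = normal_cdf(-4.0)
print(p_negative)   # roughly 3e-05, i.e. negligible
```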

PROBLEM SOLVING METHOD
Due to its combinatorial nature, ECKP can be solved using a branch-and-bound framework as presented in Kosuch & Lisser (2010). To obtain upper bounds, the authors propose to solve the corresponding continuous optimization problem. A stochastic gradient algorithm can be used to solve this problem. This method needs to estimate the gradient of the function inside the expectation constraint, which is an indicator function H_{R+}(·). In Kosuch & Lisser (2010), this computation was done using Approximation by Convolution. In this paper, we study two different approaches: the first is an unbiased estimator based on Integration by Parts (hereafter called IP-method). The second is a Finite Differences estimator (FD-method) presented in Andrieu et al. (2007). Like the Approximation by Convolution method, the FD-method provides a biased estimator of the gradient.
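The bias of the Finite Differences estimator can be observed on a one-item example (the numbers below are invented): for a single normally distributed weight, G(x) = E[θ(x, χ)] = Φ((c/x − μ)/σ) is known in closed form, so the Monte Carlo average of the central difference ratio can be compared with the exact derivative. The averaged estimate matches the central difference of G, not G′(x) itself; the gap between the two is the bias for a fixed δ.

```python
import random
from math import erf, sqrt, pi, exp

def Phi(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

mu, sigma, c = 1.0, 0.25, 1.0          # invented one-item data
x, delta, N = 0.8, 0.05, 200_000

def theta(x, chi):                      # indicator H_{R+}(c - chi*x)
    return 1.0 if c - chi * x >= 0 else 0.0

rng = random.Random(42)
fd_sum = 0.0
for _ in range(N):                      # common random numbers for +/- delta
    chi = rng.gauss(mu, sigma)
    fd_sum += (theta(x + delta, chi) - theta(x - delta, chi)) / (2 * delta)
fd_avg = fd_sum / N

def G(x):                               # exact: P{chi*x <= c} = Phi((c/x - mu)/sigma)
    return Phi((c / x - mu) / sigma)

central = (G(x + delta) - G(x - delta)) / (2 * delta)   # what FD estimates
z = (c / x - mu) / sigma
exact = exp(-z * z / 2) / sqrt(2 * pi) * (-c / x**2) / sigma  # true G'(x)

print(fd_avg, central, exact)
```

With these invented numbers the averaged FD estimate is close to the central difference of G (about −1.49), while the true derivative is about −1.51; the difference is the deterministic bias introduced by δ.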
Instead of replacing {0, 1}^n by [0, 1]^n when relaxing ECKP, the theoretical analysis will compel us to restrict the relaxation to the complement in [0, 1]^n of a neighborhood of 0. Considering that an empty knapsack is not an optimal solution, it follows that the optimal solution vector of the continuous problem contains at least one component x_κ with x_κ ≥ 1/n. We are thus allowed to replace [0, 1]^n by X_cont := {x ∈ [0, 1]^n : ||x||_∞ ≥ 1/n}. Accordingly, we obtain the following feasible set of the relaxed ECKP: X_cont^ad = {x ∈ X_cont : G(x) ≥ p}.

The stochastic gradient type algorithm
To solve the relaxed version of ECKP, we use a stochastic gradient type algorithm. Applying a gradient method is promising as the objective function is concave and, in addition, constraint (2) defines a convex feasible set due to the assumption that the weights are independently normally distributed.
We propose to solve ECKP with a Stochastic Arrow-Hurwicz algorithm (Culioli & Cohen, 1995) (hereafter called SAH-algorithm; see Algorithm 3.1) that uses Lagrange multipliers to deal with the probability constraint.
Let L(x, λ) := E[j(x, χ)] + λ(E[θ(x, χ)] − p) denote the Lagrangian function associated with ECKP.
1. Choose x^0 ∈ X_cont^ad and λ^0 ∈ [0, ∞) as well as two σ-sequences (ε^k)_k and (ρ^k)_k.
2. At iteration k, draw an independent sample χ^k of χ and update x and λ as follows:

x^{k+1} = Π_{X_cont}( x^k + ε^k [∇_x j(x^k, χ^k) + λ^k ∇_x θ(x^k, χ^k)] ),
λ^{k+1} = Π_{[0,∞)}( λ^k + ρ^k [p − θ(x^k, χ^k)] ),

where Π denotes the projection onto the corresponding set. Note that in the deterministic form of the Arrow-Hurwicz algorithm we could use the gradients of the Lagrangian itself. Here, however, the Lagrangian is an expectation function and its gradient is thus difficult to compute. By drawing independent samples of the random variables at each iteration of the algorithm, the expectation of the gradient is approximated (see Culioli & Cohen, 1995). At each iteration of the algorithm, we need to calculate the gradient of θ at (x^k, χ^k). However, θ is essentially an indicator function and therefore has no useful derivative. In the following subsection, we present two ways to bypass this difficulty: either by approximation using Finite Differences or by reformulation of the constraint function using Integration by Parts.
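The iteration above can be sketched as follows. This is a minimal illustration, not the paper's C implementation: the two-item instance, the σ-sequences of type const/k, and the crude finite-difference surrogate for ∇_x θ are all invented for the example.

```python
import random

def project_box(x):
    # Projection onto the relaxed box [0,1]^n.
    return [min(1.0, max(0.0, xi)) for xi in x]

def sah(grad_j, grad_theta, theta, sample, x0, p, iters=500, seed=1):
    # Stochastic Arrow-Hurwicz iteration:
    #   x^{k+1}   = proj( x^k + eps_k (grad_x j + lam^k grad_x theta) )
    #   lam^{k+1} = max(0, lam^k + rho_k (p - theta(x^k, chi^k)))
    # with sigma-sequences eps_k, rho_k of type const/k.
    rng = random.Random(seed)
    x, lam = list(x0), 0.0
    for k in range(1, iters + 1):
        eps, rho = 0.5 / k, 0.5 / k
        chi = sample(rng)
        gj = grad_j(x, chi)
        gt = grad_theta(x, chi)
        th = theta(x, chi)
        x = project_box([xi + eps * (a + lam * b)
                         for xi, a, b in zip(x, gj, gt)])
        lam = max(0.0, lam + rho * (p - th))
    return x, lam

# Invented two-item demo data.
mu, sig, r, c, p = [4.0, 6.0], [0.1, 0.1], [2.0, 1.5], 9.2, 0.9

def sample(rng):
    return [rng.gauss(m, s) for m, s in zip(mu, sig)]

def theta(x, chi):   # H_{R+}(c - g(x, chi))
    return 1.0 if c - sum(ci * xi for ci, xi in zip(chi, x)) >= 0 else 0.0

def grad_j(x, chi):  # gradient of j(x, chi) = sum_i r_i chi_i x_i
    return [ri * ci for ri, ci in zip(r, chi)]

def grad_theta(x, chi, delta=0.2):   # crude central finite differences
    g = []
    for h in range(len(x)):
        xp, xm = x[:], x[:]
        xp[h] += delta
        xm[h] -= delta
        g.append((theta(xp, chi) - theta(xm, chi)) / (2 * delta))
    return g

x_star, lam_star = sah(grad_j, grad_theta, theta, sample, [0.5, 0.5], p)
print(x_star, lam_star)
```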

Computation of the gradient of θ
The first method (FD-method) consists in approximating the h-th component of the gradient of θ by the central difference ratio

(θ(x + δν^h, χ) − θ(x − δν^h, χ)) / (2δ),

where δ > 0 and ν^h ∈ {0, 1}^n is such that ν^h_h = 1 and ν^h_i = 0 for i ≠ h. The second method (IP-method) uses Integration by Parts to reformulate E[θ(x, χ)], i.e. to obtain a differentiable function θ̃ with the same expectation; the idea is then to use the gradient of θ̃ in the SAH-algorithm.
Andrieu et al. (2007) showed how to compute the gradient of an expected indicator function using Integration by Parts (see Theorem 5.5 in Andrieu, 2004). We state and prove their theorem for the case of ECKP. The variables and functions used in this proposition are defined in section 2.

Proposition 1. Let x ∈ X_cont^ad and let κ = κ(x) ∈ {1, . . ., n} be defined such that x_κ = ||x||_∞ ≥ 1/n. Then, using Integration by Parts,

E[θ(x, χ)] = E[θ̃(x, χ)]   with   θ̃(x, χ) := ((μ_κ − χ_κ)/(x_κ σ_κ²)) · max(0, c − g(x, χ)).

Proof. Let φ denote the density function of the random vector χ = (χ_1, . . ., χ_n). In our case the random variables are independently distributed, so with φ_i being the density function of χ_i, φ factorizes as φ(χ) = ∏_i φ_i(χ_i). Fix the components χ_h, h ≠ κ, and consider the integral over t := χ_κ. Since φ_κ′(t) = ((μ_κ − t)/σ_κ²) φ_κ(t), and since t ↦ max(0, c − g(x, χ)) has derivative −x_κ H_{R+}(c − g(x, χ)) almost everywhere, Integration by Parts over χ_κ gives

∫ θ̃ φ_κ(t) dt = (1/x_κ) ∫ max(0, c − g(x, χ)) φ_κ′(t) dt = ∫ H_{R+}(c − g(x, χ)) φ_κ(t) dt,

where the boundary term vanishes because φ_κ decays exponentially while max(0, c − g(x, χ)) grows at most linearly in t. Integrating over the remaining components yields the result.

It remains to justify exchanging differentiation and integration when the gradient of θ̃ is taken inside the expectation. This is a classical result under a uniform bound on the gradient of the integrand; in our case such a bound follows from the assumption x_κ ≥ 1/n, since the factor 1/x_κ is then bounded by n.

Pesquisa Operacional, Vol. 37(3), 2017
For the other components h ≠ κ of the gradient, differentiating θ̃ with respect to x_h simply multiplies the factor (μ_κ − χ_κ)/(x_κ σ_κ²) by −χ_h H_{R+}(c − g(x, χ)); for the κ-th component an additional term stemming from the factor 1/x_κ appears, and the same bound holds. Note that θ̃(·, χ) is differentiable except where c − g(x, χ) = 0.
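The reformulation can be checked numerically. Under the sign convention stated in Proposition 1 (θ̃(x, χ) = ((μ_κ − χ_κ)/(x_κ σ_κ²)) · max(0, c − g(x, χ)), with κ the index of the largest component of x), a Monte Carlo experiment on an invented two-item instance gives the same average for θ and θ̃:

```python
import random

mu, sigma = [1.0, 1.0], [0.3, 0.3]     # invented instance data
x, c = [0.8, 0.5], 1.5
kappa = max(range(len(x)), key=lambda i: x[i])   # x_kappa = ||x||_inf

def g(x, chi):
    return sum(ci * xi for ci, xi in zip(chi, x))

def theta(x, chi):                      # H_{R+}(c - g(x, chi))
    return 1.0 if c - g(x, chi) >= 0 else 0.0

def theta_tilde(x, chi):                # integrated-by-parts version
    w = (mu[kappa] - chi[kappa]) / (x[kappa] * sigma[kappa] ** 2)
    return w * max(0.0, c - g(x, chi))

rng = random.Random(7)
N, s1, s2 = 200_000, 0.0, 0.0
for _ in range(N):
    chi = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    s1 += theta(x, chi)
    s2 += theta_tilde(x, chi)
print(s1 / N, s2 / N)   # the two averages agree up to Monte Carlo noise
```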

Theoretical convergence
Culioli & Cohen (1995) showed how to obtain convergence of the SAH-algorithm in the case where the objective function J is convex even though j is not (for a global survey see Carpentier, 2010). More precisely, they adapt the classical SAH-algorithm to the case where the constraint is given in expectation by considering a subgradient in both the primal variable x and the dual variable λ instead of in λ only. The price to pay is a set of technical assumptions: in particular, both gradients have to be linearly bounded. The way we transform the expected function in the constraint allows us to check all the assumptions, with some limitations explained below.
h2: The associated Lagrangian admits a saddle point (x̄, λ̄) and J is strictly convex.
h8: There exist γ, δ > 0 such that … Then the sequence (x^k, λ^k) is bounded and x^k converges weakly towards x̄.
We now check these assumptions for our sample set and situation, focusing on the IP-method. h1: Using the IP-method, we obtain the gradient of θ̃ given above. The fact that H_{R+}(c − g(x, χ)) = 1 only for c − g(x, χ) ≥ 0 restricts the relevant values of χ to a compact domain, and h1 is checked.
h2: We suppose that the Lagrangian function admits a saddle point. Nonetheless, we observe that strict convexity of J is a sufficient condition for the stability of the Lagrangian function.
In our case, the function J is linear.
h3: For all χ ∈ R^n, θ̃(·, χ) is locally Lipschitz continuous, as can be seen from its expression.
h5: As we assume the item weights to be normally distributed, G(x) = P{g(x, χ) ≤ c} is Lipschitz continuous. Let F be the cumulative distribution function of the standard normal distribution. Then we have

G(x) = F( (c − ∑_{i=1}^n μ_i x_i) / √(∑_{i=1}^n σ_i² x_i²) ).

It is easy to see that G is continuously differentiable on X_cont (the denominator is bounded away from zero there because ||x||_∞ ≥ 1/n). In addition, we have 0 ≤ G(x) ≤ 1 for all x ∈ R^n, and as X_cont is a compact set it follows that G is Lipschitz continuous on X_cont.
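The closed form for G under the independent-normal assumption can be sketched directly (the instance numbers in the usage lines are invented):

```python
from math import erf, sqrt

def normal_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def G(x, mu, sigma, c):
    # For independent normal weights,
    #   g(x, chi) ~ N( sum_i mu_i x_i , sum_i sigma_i^2 x_i^2 ),
    # hence P{g(x, chi) <= c} = F((c - sum mu_i x_i) / sqrt(sum sigma_i^2 x_i^2)).
    m = sum(mi * xi for mi, xi in zip(mu, x))
    s = sqrt(sum((si * xi) ** 2 for si, xi in zip(sigma, x)))
    return normal_cdf((c - m) / s)

# With the mean load exactly at capacity, G = 0.5; at half load, G is near 1.
print(G([1.0], [1.0], [0.25], 1.0))   # 0.5
print(G([0.5], [1.0], [0.25], 1.0))   # close to 1
```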
We do not actually need the normal distribution to establish this assumption: since we replaced θ by θ̃, we get |θ̃(x, χ) − θ̃(y, χ)| ≤ τ ||x − y||, where τ is the bound on ||∇_x θ̃|| given in h1 on X_cont. Integrating this inequality then yields the corresponding bound on G.
The question of the concavity of G is more complex and depends on the sample values of the objects.
There are two different questions to answer: is the constraint function concave on the whole hypercube, and if not, is it concave on a subset of the hypercube? The answer to the first question is definitely no. For further details concerning chance constraints, we refer to Prekopa (1995).
The second question depends on the initial parameters: as a first step we observe that, in the case of a one-dimensional vector x, G is concave for x < x̄, where x̄ depends on the initial conditions.
In the case of a higher dimensional problem, the situation becomes more complex. Although there exists a domain for x near 0 on which G is concave (see Fig. 1 for the values of G observed for a two-dimensional vector), the boundaries of this domain still depend on the characteristics of the sample set.
A practical check combined with a theoretical analysis of the concavity is then possible. We compute the eigenvalues of the Hessian matrix H(x̄) at a point x̄ of X_cont close to 0 and find them all negative. Since the determinant of the Hessian matrix is a continuous function of x, it follows that there exists a topologically connected component C ⊆ X_cont with x̄ ∈ C such that det H(x) ≠ 0 for all x ∈ C. We consider the boundary C̄ − C of this set and define p̄ = min{G(x) : x ∈ C̄ − C}. To summarize the above observations, we formulate the following proposition:

Proposition 2. Let χ_1, . . ., χ_n be a fixed set of independently normally distributed random variables and let c be a fixed capacity such that, for a vector x̄ with all components near 0, the Hessian matrix of G has all eigenvalues strictly negative. Then there exists p̄ = p̄(χ_1, . . ., χ_n, c) such that G is concave on the set {x ∈ X_cont | G(x) ≥ p̄}.
In other words, we establish that for a given instance there exists a probability level above which the corresponding Lagrangian relaxation is concave on the feasible set.
h6: j is linear in each component of x; h6 is thus satisfied.
h7: Condition h7 is satisfied for all σ-sequences of the type const/k with const > 0.

Convergence of the Stochastic Arrow-Hurwicz Algorithm involving FD-method
With the SAH-algorithm, a stopping criterion cannot be based on observing a decreasing rate of variation of the objective value. The classical approach is to set a maximum number of iterations. We fixed the number of iterations to 500 for the FD-method, based on several tests in which only slight variations could be observed after around 300 iterations (for numerical results see section 5 and Table 2).

Enhancements in problem formulation when using the Stochastic Arrow-Hurwicz Algorithm involving IP-method
Some improvements in our implementation are worth mentioning. During our first numerical tests, we noticed that the SAH-algorithm involving the IP-method was inefficient and in some cases did not even converge (see Fig. 2). We therefore analyzed and modified the implementation of the IP-method. The first enhancement concerns the formulation of ECKP. The expectation constraint states E[θ(x, χ)] ≥ p. We thus get the Lagrangian L(x, λ) = E[j(x, χ)] + λ(E[θ(x, χ)] − p). Using the IP-method, we rewrite this Lagrangian by replacing θ with θ̃. Let us denote by l the function inside the expectation of the resulting Lagrangian (4), i.e. l(x, χ, λ) = j(x, χ) + λ(θ̃(x, χ) − p).
Remark that the term that multiplies λ is zero whenever the capacity constraint is not satisfied.
In these cases all the components of the current x k are incremented (as all the components of (r 1 χ 1 , . . ., r n χ n ) T are positive) although at least one component should be decremented in order to better fit the capacity.
The constraint in expectation can be equivalently reformulated as E[H_{R+}(g(x, χ) − c)] ≤ 1 − p. With this reformulation, the term of the h-th component of the gradient of l that multiplies λ is zero whenever the capacity constraint is satisfied. In that case the components of x^k are incremented by the components of the positive vector (r_1 χ^k_1, . . ., r_n χ^k_n)^T (multiplied by the corresponding step size). When the capacity constraint is not satisfied, the term with coefficient λ is subtracted from this vector in order to correct x^k. This term is positive whenever (χ^k_κ − μ_κ) > 0, and the Lagrange multiplier λ then plays its assigned role of a penalty factor in case the constraint is violated. A second improvement can be obtained by a specific choice of the component used in the Integration by Parts. Consider the h-th component of the gradient of the Lagrangian function obtained by the IP-method after reformulation. In case of an overload, we expect this gradient to be negative for some indices h in order to decrease the total weight of the knapsack. However, for h ≠ κ and H_{R+}(g(x, χ) − c) = 1, the term that multiplies λ is positive if and only if (χ_κ − μ_κ) > 0, which is not always the case. When this event does not occur, the gradient is strictly positive and all components of x with index different from κ are incremented despite the overload. Due to this observation, we propose the following improved choice of the index κ: instead of just choosing κ in the k-th iteration such that x^k_κ = ||x^k||_∞ (see Proposition 1), we choose, whenever possible, an index κ with χ^k_κ > μ_κ; if no such index exists, we choose κ as before, i.e. such that x^k_κ = ||x^k||_∞. With this modification we were able to significantly improve the convergence of the SAH-algorithm involving the IP-method (see Fig. 3) and could reduce the maximum number of iterations to 1000 with effective results.
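The improved index selection can be sketched as follows. This is a hypothetical reading of the rule inferred from the discussion above, not the paper's exact implementation: prefer an index whose sampled weight exceeds its mean (so the λ-term keeps the sign of a penalty), breaking ties by the largest component of x, and fall back to the original ||x||_∞ rule otherwise.

```python
def choose_kappa(x, chi, mu):
    # Candidate indices: sampled weight above its mean, so the
    # lambda-term of the gradient acts as a penalty under overload.
    candidates = [h for h in range(len(x)) if chi[h] > mu[h]]
    # Among candidates take the largest component of x; if there is
    # no candidate, fall back to kappa with x_kappa = ||x||_inf.
    pool = candidates if candidates else list(range(len(x)))
    return max(pool, key=lambda h: x[h])

print(choose_kappa([0.9, 0.2, 0.5], [1.0, 2.0, 1.5], [1.2, 1.1, 1.0]))  # 2
print(choose_kappa([0.9, 0.2], [0.5, 0.5], [1.0, 1.0]))                 # 0 (fallback)
```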

NUMERICAL RESULTS
All numerical tests have been carried out on an Intel PC with 2 GB RAM. The SAH-algorithms as well as the branch-and-bound framework have been implemented in C. The random numbers needed for the computations of the stochastic gradient algorithm were generated in advance using the GSL C-library and stored in a file.
We tested the SAH-algorithm on the instance used in Cohn & Barnhart (1998) and on randomly generated instances. First of all, we remark that the optimal solution values of the SAH-algorithm involving the FD-method are comparable to those produced by the IP-method. Some fluctuations occur and can be interpreted in terms of implementation choices, such as the choice of the two σ-sequences for the SAH-algorithm. Choosing the right parametrization for these sequences has an important influence on the convergence of the algorithm and on the best solution found.
In terms of running time, both methods outperform the SOCP-algorithm on small and medium size instances. For higher dimensional problems, this is still true for the FD-method. However, the SAH-algorithm involving the IP-method needs approximately twice the time of the FD-method. This is of course due to the total number of iterations, which we fixed at 500 when using the FD-method, while 1000 iterations are needed with the IP-method in order to obtain equally good solutions. For higher dimensional instances, the Mosek interior point method therefore needs less CPU time than the IP-method.

SOLVING THE (COMBINATORIAL) ECKP -NUMERICAL RESULTS
The combinatorial problem has been solved using a branch-and-bound algorithm as described in Cohn & Barnhart (1998) and Kosuch & Lisser (2010).The algorithm has been tested on the instances previously used for the tests of S AH -algorithm.
We stored the random numbers needed for the test runs of the SAH-algorithm in a batch file. As the total number of runs during the branch-and-bound algorithm is unknown and the number of random numbers needed for all those runs is generally very high, we only stored random numbers for a limited number of runs. Before running the SAH-algorithm, we randomly chose one of the stored instances of random numbers. Note that, as the runs of the SAH-algorithm are independent, one stored instance of random numbers would theoretically be sufficient.
The results given in Table 3 indicate that the branch-and-bound algorithm using the SOCP-program for upper bounds considers more nodes than the one using the SAH-algorithm. This is not due to a better choice of the upper bounds in the latter method, as in both algorithms the upper bounds are supposed to be the same (i.e. the optima of the corresponding relaxed problems). However, as the best solution found by the approximate SAH-algorithm might be slightly smaller than the solution value of the relaxed problem, more branches are pruned than with the primal-dual SOCP-algorithm. Note that this could theoretically also cause the pruning of a subtree that contains the optimal solution. Similarly, the SAH-algorithm involving the IP-method considers a greater average number of nodes than when using the FD-method. This implies that the solutions of the relaxed subproblems produced by the FD-method are not as good as those obtained with the IP-method. As both methods perform equally well on the relaxed overall problems (see section 5), we conclude that the FD-method is less robust: instead of choosing particular σ-sequences for each instance, or even for each subproblem solved during the branch-and-bound algorithm, we fixed one parametrization for each dimension. However, the subproblems solved during the branch-and-bound algorithm are mostly lower dimensional problems, and the SAH-algorithm using the fixed σ-sequences performs worse on these subproblems. This seems to be especially the case when using the FD-method.
If we only allow an average computing time of 1h, the SOCP-algorithm can only be used up to dimension 50. In contrast, when using the FD-method and allowing 500 iterations in the SAH-algorithm, we are able to solve problems up to dimension 250 within an average CPU time of 1h.

CONCLUSION
In this paper, we study and analyze two methods for computing the gradient needed to solve the Stochastic Knapsack Problem with expectation constraint. Beyond comparing the efficiency of the two methods, it is interesting to analyze what can be concluded from using them. On the one hand, the FD-method is very simple, robust, efficient and fast, but presents some disadvantages, such as computing only an approximate value of the gradient of the constraint. On the other hand, the IP-method introduces no approximation in the computation, but is more complex to handle correctly. Nonetheless, this second method opens more opportunities for investigation. First of all, working with an approximate value of the gradient does not appear to be a source of trouble here, but it remains a first order approximation. Secondly, we keep in mind that when using Integration by Parts we always have to choose one factor of a product to integrate and one to differentiate, and the way we choose these factors is not fixed. This allows, for instance, checking the boundedness conditions on θ̃. In the future, we will examine ways of choosing the different parts of the Integration by Parts, and will also extend our approach to other stochastic combinatorial problems.

Figure 2 - Initial divergence of the Stochastic Arrow-Hurwicz algorithm solving the continuous ECKP: FD-method (bold curve) versus initial IP-method.

Table 1 - Numerical results for the continuous ECKP.
*Exceeding of the available memory space.
The tests were run on the instance of Cohn & Barnhart (1998) as well as on 50 randomly generated instances for each dimension; the data given in Tables 2 and 3 are average values over these instances. As noticed at the end of section 2, the choice of normally distributed weights implies that weights can theoretically take negative values. The test instances were generated in such a way that the variance/mean ratio is below 1/4 (see Cohn & Barnhart, 1998, or Kosuch & Lisser, 2010, for details). This implies a very low probability of realizing negative weights: although a high number of scenarios were generated (either 500 or 1000 for each run of the SAH-algorithm), we obtained no negative weight realization. In Table 2, the numerical results of the SAH-algorithm involving the FD- or IP-method are compared with those obtained using a Second Order Cone Programming (SOCP) solver: as mentioned, ECKP can be equivalently reformulated as a chance constrained knapsack problem that, in turn, can be reformulated as a deterministic equivalent SOCP problem (for details see Boyd et al., 1998, or Kosuch & Lisser, 2010). To solve the SOCP problem we used the SOCP solver by MOSEK, which applies an interior point method (MOSEK, 2009), called from C. The gaps given in Tables 2 and 3 are the relative gaps between the MOSEK solution and the approximate solution obtained with the stochastic gradient algorithm.

Table 2 - Numerical results for the continuous ECKP.

Table 3 - Numerical results for the (combinatorial) ECKP.