METAHEURISTICS EVALUATION: A PROPOSAL FOR A MULTICRITERIA METHODOLOGY

In this work we propose a multicriteria evaluation scheme for heuristic algorithms based on the classic Condorcet ranking technique. Weights are associated with the ranking of an algorithm within the set under comparison. We use five criteria and a function on the set of natural numbers to create a ranking. The comparison discussed here involves three well-known combinatorial optimization problems: the Traveling Salesperson Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Quadratic Assignment Problem (QAP). The tested instances come from public libraries. Each algorithm was used with essentially the same structure, the same local search was applied and the initial solutions were similarly built. It is important to note that this work makes no proposals involving the algorithms themselves: the results for the three problems are shown only to illustrate the operation of the evaluation technique. Four metaheuristics, GRASP, Tabu Search, ILS and VNS, are therefore used only for the comparisons.


Heuristics evaluation in the literature
This work is dedicated to a proposal of a multicriteria evaluation scheme for heuristic algorithms, which we call the Weight Ordering Method (WOM). It involves an application of the Condorcet ranking technique, presented in Item 1.2. The initial discussion of WOM is the object of Item 1.3. Sections 2 and 3 present, respectively, quick explanations of the three problems and the four metaheuristics used in the tests. The use of the evaluation technique is detailed in Section 4 with the aid of an example. Section 5 presents the results of the comparison among the metaheuristics when used with the three problems. The conclusions are presented in Section 6.
The use of metaheuristics to find good-quality solutions for discrete optimization problems has the double advantage of working with algorithms based on already known models and of the efficiency of the methods themselves. This is very important when dealing with problems that have an exponential number of feasible solutions. When there are many techniques available, it is clearly important to evaluate their efficiency with respect to a given problem. A number of direct evaluation techniques, both deterministic and probabilistic, are commonly used; for instance, the observations of Aiex et al. [2, 3] lead to the hypothesis that the iteration processing times of heuristics based on local search, aiming at a result with a given target, follow an exponential distribution. The use of instance collections, available on the Internet for many problems, appears there and in a number of other works as an efficient way to evaluate metaheuristics and compare their efficiency when dealing with a variety of situations.
Tuning the parameters of an algorithm for improved efficiency is also a task that benefits from a system of assessment. Averages of execution times and of final solution values, with their standard deviations, are often used. The normal distribution is commonly assumed on these occasions, an option criticized by Taillard et al. [23] as a hypothesis not always verified: for example, if there are many global optima, the distribution will have a truncated tail, since it is impossible to go beyond the optimum.
There are in the literature several techniques for this purpose; following this reference, the most common are:
1. When dealing with optimization, a set of problem instances is solved with the methods that should be compared, calculating the mean and standard deviation (possibly also other measures such as median, minimum, maximum, etc.) of the values obtained in a series of algorithm runs.
2. In the context of exact problem solving, the computational effort required to obtain the best solution is measured, and its mean, standard deviation and so on are calculated.
3. The maximum computational effort is fixed, as well as a goal to reach, counting the number of times each method achieves the goal within the allowed computational time.
In practice, the measures computed by the first and second techniques are often very primitive, and it is common to calculate only averages, which are insufficient to assert a statistical advantage of one solution method over another.
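The third technique above can be sketched in a few lines. The function and data below are invented for illustration: given a target value and a time budget, we simply count the runs of each method that reach the target in time.

```python
# Hypothetical illustration of technique 3: fix a time budget and a target
# value, then count how often each method reaches the target in time.
# All names and run data below are made up for this sketch.

def success_count(run_results, target, time_limit):
    """Count runs that reached `target` (minimization) within `time_limit`.

    `run_results` is a list of (best_value, time_to_best) pairs,
    one per independent run of the method.
    """
    return sum(1 for value, t in run_results
               if value <= target and t <= time_limit)

# Two fictitious methods, ten runs each: (best value found, seconds used).
method_a = [(100, 12), (101, 50), (100, 580), (100, 33), (102, 600),
            (100, 45), (100, 70), (103, 600), (100, 9), (100, 210)]
method_b = [(100, 500), (104, 600), (100, 590), (105, 600), (100, 300),
            (106, 600), (100, 480), (100, 550), (107, 600), (108, 600)]

print(success_count(method_a, target=100, time_limit=600))  # 7
print(success_count(method_b, target=100, time_limit=600))  # 5
```

With the same budget, the hit counts (here 7 versus 5) give a direct, if coarse, basis for comparing the two methods.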

The Condorcet technique
This work proposes a multicriteria evaluation scheme based on the classic Condorcet ranking technique [1, 4, 18]. This technique allows us to substitute orders for concept values, which are subsequently submitted to a pairwise comparison. The initial concepts can be either qualitative or quantitative. For instance, a blind test for wine quality evaluation could involve a group of tasters, each one giving a ranking for a set of similar products by considering mouth sensations, bouquet, color and so on. When applied to algorithm evaluation, we can run the algorithms on a given set of instances and consider value rankings for different criteria, such as final value averages, processing times and so on. These results can be presented as a matrix in which we are able to evaluate coherence and inconsistency levels in order to arrive at a decision concerning the quality of the studied options (see Item 4.3).
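The pairwise-comparison idea behind the Condorcet technique can be sketched as follows. The wine-tasting names and rankings are invented; for every pair of options we count how many "voters" (tasters, or criteria) rank one above the other.

```python
from itertools import combinations

# Minimal sketch of Condorcet-style pairwise comparison (data invented).
# Each "voter" (taster, criterion, ...) supplies a ranking of the options,
# best first; for each pair we record the net preference.

def pairwise_wins(rankings, options):
    wins = {}
    for a, b in combinations(options, 2):
        a_pref = sum(1 for r in rankings if r.index(a) < r.index(b))
        b_pref = len(rankings) - a_pref
        wins[(a, b)] = a_pref - b_pref  # >0: majority prefers a; <0: prefers b
    return wins

rankings = [["w1", "w2", "w3"],   # taster 1
            ["w1", "w3", "w2"],   # taster 2
            ["w2", "w1", "w3"]]   # taster 3
print(pairwise_wins(rankings, ["w1", "w2", "w3"]))
```

Here "w1" beats both other wines by majority, which is the kind of pairwise information the matrix of Item 4.3 summarizes.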

The Weight Evaluation Method (WOM)
In this method we begin with a Condorcet-type ranking matrix. We use weights associated with the ranking of the algorithms under comparison with respect to instances of a given problem. In this work, we define five evaluation criteria (see Item 4.2 below) and we apply a function on the set of natural numbers to the ranking given by each criterion. The valuation is defined such that better results are associated with lower criterion values.
In this work, we cross three combinatorial optimization problems, the Traveling Salesperson Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Quadratic Assignment Problem (QAP), against four different metaheuristics: the Greedy Randomized Adaptive Search Procedure (GRASP), the Iterated Local Search (ILS), the Tabu Search (TS) and the Variable Neighborhood Search (VNS). To do that, we took instance collections of each problem from public libraries and made ten independent runs with each one, using an execution time limit of 600 seconds. We looked for solutions with Optimal or Best-Known Values (OBKV), according to the most recent information available on the Internet.
The algorithms were programmed in the C language and ran on a Linux platform. In order to allow for a very basic comparison, each algorithm was used with essentially the same structure, the same local search was used in every case, and the initial solutions were similarly built. The only differences are the specific characteristics of each problem: the problem constraints and the objective function calculation. We adopted this option owing to the number of improvements already existing in the literature, since the objective of the paper is to use the problem-algorithm crossing to show the functioning of the method, not to propose any algorithm improvement.
The use of a sorting function facilitates the visualization of orders. The sum of the values obtained with each algorithm for each problem makes the comparison between the algorithms and their sensitivities to each problem easier. It also allows us to evaluate the overall performance of an algorithm.

TEST PROBLEMS USED IN THE STUDY
The problems used in this study and presented below are widely known by the scientific community and often used as benchmarks for the validation of new algorithms, owing to their algorithmic complexity [8,21].

Traveling Salesperson Problem (TSP)
In simple terms, the Traveling Salesperson Problem (TSP) can be viewed as a list of cities and their pairwise distances, where the task is to leave the origin and follow the shortest possible circuit that visits each city exactly once before returning to the origin. It was formulated as a mathematical problem by Karl Menger [17]. The TSP is one of the most intensively studied combinatorial optimization problems. It is of great importance from a practical as well as a theoretical point of view, given its relationship to other combinatorial optimization problems, and it is used as a benchmark for many optimization methods. Even though it is computationally difficult (NP-hard), a large number of exact methods and heuristics have been applied to it, so that instances with tens of thousands of cities can be solved.
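The objective being minimized can be stated in a few lines. The distance matrix below is invented; the function returns the length of the closed circuit defined by a visiting order.

```python
# Minimal sketch of the TSP objective with invented data: the length of
# a tour visiting every city once and returning to the origin.

def tour_length(dist, tour):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]
print(tour_length(dist, [0, 1, 2]))  # 1 + 2 + 4 = 7
```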

Capacitated Vehicle Routing Problem (CVRP)
Since the work published by Dantzig & Ramser in 1959 [5], many papers related to the Vehicle Routing Problem (VRP) have appeared in the literature. Some studies address different variants, such as more than one depot, delivery time limits, different types of vehicles, delivery and collection of products, among others. In this work we make use of its classical modeling, which is to serve a set of customers with a fleet of vehicles of the same capacity. Each vehicle departs from a depot, and the sum of the demands of the customers it serves cannot exceed the vehicle capacity.

Quadratic Assignment Problem (QAP)
Consider the problem of allocating pairs of activities to pairs of locations, taking into account the costs of travel distances between locations and some flow units conveniently defined between activities. The Quadratic Assignment Problem (QAP), proposed by Koopmans & Beckmann [14], is the problem of finding a minimum-cost allocation of activities to locations, where costs are determined by the sum of the distance-flow products.
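The sum of distance-flow products can be written directly. The matrices below are invented; p is a permutation placing activity i at location p[i].

```python
# Minimal sketch of the QAP objective (data invented): given a distance
# matrix between locations and a flow matrix between activities, the cost
# of an assignment p is sum_{i,j} flow[i][j] * dist[p[i]][p[j]].

def qap_cost(dist, flow, p):
    n = len(p)
    return sum(flow[i][j] * dist[p[i]][p[j]]
               for i in range(n) for j in range(n))

dist = [[0, 2, 3],
        [2, 0, 1],
        [3, 1, 0]]
flow = [[0, 5, 2],
        [5, 0, 3],
        [2, 3, 0]]
print(qap_cost(dist, flow, [0, 1, 2]))  # 38
```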

METAHEURISTICS USED IN THE STUDY
The implementations used here for the metaheuristics vary greatly in efficiency. This was deemed appropriate to facilitate the observation of how WOM works.

Tabu Search
Tabu Search was introduced by Fred Glover [9, 10] for integer programming problems and later refined by Taillard [22]. This metaheuristic is based on the establishment of restrictions that effectively guide a heuristic search in exploring the solution space, trying to prevent the search from returning to previously visited solutions. These restrictions work in different ways, such as excluding certain alternatives from the search by classifying them as temporarily forbidden (tabu), or modifying ratings and selection probabilities, designating them as aspiration criteria.

GRASP
GRASP, the Greedy Randomized Adaptive Search Procedure, proposed by Feo & Resende [7], can be seen as a metaheuristic which combines the good characteristics of purely random algorithms and purely greedy processes in its construction phase. It is a multistart iterative process in which each iteration consists of two phases: the construction phase, where a feasible solution is built, and the local search phase, where a local optimum is found in the vicinity of the initial solution and, if necessary, the best solution found so far is updated.

ILS
ILS, Iterated Local Search, proposed by Lourenço et al. [15], is a simple method that iteratively applies local search to perturbations of the current solution, leading to a random walk in the space of local optima. To apply an ILS algorithm, four procedures must be specified: (a) the generation of the initial solution; (b) the perturbation, which generates new starting points for local search; (c) the acceptance criterion, which decides from which solution the search will be continued; and (d) the local search procedure, which defines the space of local optima explored.
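The four procedures assemble into a short loop. The sketch below is a generic ILS skeleton, not the implementation used in this paper; the convex toy problem (minimizing (x − 3)² on the integers) is invented so the behavior is easy to follow.

```python
# Toy illustration of the generic ILS loop built from the four procedures:
# initial solution, perturbation, acceptance criterion, and local search.

def local_search(x, f):
    while True:  # greedy descent on the integer line
        step = min((f(x - 1), x - 1), (f(x), x), (f(x + 1), x + 1))[1]
        if step == x:
            return x
        x = step

def ils(x0, f, perturb, iters=20):
    best = local_search(x0, f)              # (a) initial solution + (d) local search
    for _ in range(iters):
        candidate = local_search(perturb(best), f)  # (b) perturbation
        if f(candidate) <= f(best):         # (c) acceptance criterion
            best = candidate
    return best

f = lambda x: (x - 3) ** 2
print(ils(50, f, perturb=lambda x: x + 5))  # converges to 3
```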

VNS
VNS, Variable Neighborhood Search, was proposed by Hansen & Mladenović [11, 12]. It is based on a systematic neighborhood exchange associated with a random algorithm to determine the starting points of local searches. The basic VNS scheme is very simple and easy to implement. Unlike other metaheuristics based on local search methods, VNS does not follow a trajectory but explores incrementally more or less distant neighborhoods of the current solution, jumping from the current solution to a new one if and only if an improvement occurs. According to the authors, the advantage of using various neighborhoods is that a local optimum with respect to one neighborhood is not necessarily a local optimum with respect to the others: thus, the search should continue downward (or upward) until the current solution is a local minimum (or maximum) for all structures of the pre-selected neighborhoods.

DETAILS ON CONDORCET AND WOM TECHNIQUES
In this section we present in more detail the proposed performance criteria, the application of the Condorcet method and the use of its results to generate the indicators associated with WOM.

Performance criteria for the proposed versions
Following the already cited concern of Taillard et al. [23] about the dependency of heuristic efficiency on instance type, we built a multicriteria evaluation with five comparison criteria.The criteria definitions below are such that lower values represent better results.

a) Number of not-OBKV solutions obtained (nopt)

b) Average relative value distance (avd)
This is the average of the values (obtained value - OBKV)/OBKV, over those tests not reaching the OBKV, expressed as a percentage.

c) Quality index (qual)
This is a tailor-made function used to express performance whether the algorithm reaches the OBKV or not. We used

qual = 1/nSolOtm + nSolNotOtm × AvgError,

where nSolOtm is the number of solutions with the OBKV value, nSolNotOtm is the number of solutions with worse values, and AvgError is the average error percentage of those instances where the OBKV was not reached. The 1/nSolOtm term is disregarded when no OBKV solution is found.
As an example, let us consider that in 10 executions of an instance, 8 produced the OBKV and the other 2 presented an average error of 1.3%. The index value is then I = 0.125 + (2 × 1.3) = 2.725. If the error of those two executions were 21.6%, we would have I = 0.125 + (2 × 21.6) = 43.325. We can see that the index is sensitive to the presence of bad solutions and that its value decreases when the number of OBKV solutions grows.
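The worked example can be reproduced in code. Note that the closed form below (1/nSolOtm plus nSolNotOtm times the average error) is our reading of the text and its example, and should be taken as a reconstruction.

```python
# Sketch of the quality index qual, as reconstructed from the worked
# example in the text (treat the closed form as an assumption).

def qual(n_sol_otm, n_sol_not_otm, avg_error_pct):
    index = n_sol_not_otm * avg_error_pct
    if n_sol_otm > 0:              # the 1/nSolOtm term only applies
        index += 1.0 / n_sol_otm   # when at least one OBKV was reached
    return index

print(qual(8, 2, 1.3))   # 0.125 + (2 * 1.3)  = 2.725
print(qual(8, 2, 21.6))  # 0.125 + (2 * 21.6) = 43.325
```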

d) Average execution time (exec)
This includes only the instances where OBKV was obtained before the maximum execution time (600 seconds).

e) Average stagnation time complement (stag)
This is the average difference between the maximum execution time of 600 seconds and the time of the last improvement in the solution value before the algorithm stops by the maximum time criterion. Whenever the algorithm reaches the OBKV, this value is set to zero. This criterion can be associated with an algorithm's capacity to avoid getting stuck at local optima.
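For a single run, the criterion reduces to a one-line rule, sketched below with invented values.

```python
# Sketch of the stagnation-time complement (stag) for one run, as
# described in the text: the gap between the 600 s limit and the time
# of the last improvement, set to zero when the run reached the OBKV.

def stag(last_improvement_time, reached_obkv, time_limit=600.0):
    return 0.0 if reached_obkv else time_limit - last_improvement_time

print(stag(450.0, reached_obkv=False))  # 150.0
print(stag(450.0, reached_obkv=True))   # 0.0
```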

Some details on the Condorcet technique
As discussed in Item 1.2, this technique can be used to compare the performance of pairs of algorithms with respect to a given instance. Let |W| = w be the number of algorithms to be compared, with a total of z evaluation criteria whose values refer to a given instance, each criterion generating a possibly different order. A criterion-algorithm table is obtained for each instance, where each position contains the algorithm number and its corresponding criterion value. In this table, the algorithm number in each entry corresponds to that of the corresponding column.
We exemplify the method with the QAP instance Tai25a, used among other QAP instances for testing five VNS variations [16] (Table 1).
After this, we sort each line in nondecreasing order of the corresponding criterion values; the algorithm identifiers are carried along in the sorting. After this step, the initial matrix becomes Table 2. Next, we examine each value pair along each line, considering the algorithms which produced the corresponding results. We represent the comparison result by a matrix where each column corresponds to a pair of algorithms and each entry value is k ∈ {−1, 0, +1}.
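One reading of this step, following the sign convention (+1, >); (−1, <); (0, =) stated in a note later in the text, is that the entry for an algorithm pair (i, j) under a criterion is the sign of v_i − v_j. The sketch below uses invented values, not those of Tai25a.

```python
from itertools import combinations

# Sketch of the value-pair comparison step (the Table 3 construction):
# for each criterion and each algorithm pair (i, j), record
# sign(v_i - v_j) in {-1, 0, +1}. Data invented for illustration.

def sign(x):
    return (x > 0) - (x < 0)

def comparison_matrix(criterion_values):
    """criterion_values: dict criterion -> list of values, one per algorithm."""
    n = len(next(iter(criterion_values.values())))
    return {c: {(i + 1, j + 1): sign(v[i] - v[j])
                for i, j in combinations(range(n), 2)}
            for c, v in criterion_values.items()}

values = {"a": [8.0, 9.0, 7.5, 10.0, 8.0],
          "d": [0.0, 5.0, 3.0, 0.0, 0.0]}
m = comparison_matrix(values)
print(m["a"][(2, 4)])  # 9.0 < 10.0  ->  -1
print(m["d"][(1, 4)])  # 0.0 == 0.0 ->   0
```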
The next step is the (Condorcet) distance determination. An expanded matrix is built, where the lines correspond to criteria pairs and the columns to algorithm pairs. Here, we say there is a discrepancy (expressed by a unit entry in Table 4) when the Table 3 entries of the two criteria for a given algorithm pair (e.g., (a,[1,2]) and (b,[1,2])) have opposite signs. The remaining entries of Table 4 are null. The line sums express the distances between the orders given by the criteria, while the column sums correspond to the disagreement among the criteria about each algorithm pair. The last row and column of Table 4 are used for indicator evaluation. If all pairs show very high disagreement, for example over 75%, questioning their validity will be convenient. For this example of the Tai25a instance, only the criteria pair [a, d] shows a higher disagreement (70%). The other pairs have better consistency, which indicates that this criteria set has good evaluation capacity for the algorithms applied to this instance. We can also look at the column sums. It is interesting to observe that column [1,5] indicates no discrepancy, which is to say that Algorithms 1 and 5 are equivalent according to all the criteria used.
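The discrepancy count behind Table 4 can be sketched as follows; the comparison entries below are invented, and an entry of 1 marks a criteria pair ranking an algorithm pair in opposite directions.

```python
from itertools import combinations

# Sketch of the Table 4 discrepancy count with invented data: for each
# pair of criteria and each pair of algorithms, record 1 when the two
# criteria give opposite signs for that algorithm pair, else 0.

def discrepancies(comp):
    """comp: dict criterion -> dict algorithm-pair -> sign in {-1, 0, +1}."""
    out = {}
    for c1, c2 in combinations(sorted(comp), 2):
        for pair in comp[c1]:
            s1, s2 = comp[c1][pair], comp[c2][pair]
            out[((c1, c2), pair)] = 1 if s1 * s2 < 0 else 0
    return out

comp = {"a": {(1, 2): +1, (1, 3): -1},
        "b": {(1, 2): -1, (1, 3): -1}}
d = discrepancies(comp)
print(d[(("a", "b"), (1, 2))])  # opposite signs -> 1
print(d[(("a", "b"), (1, 3))])  # same sign     -> 0
```

Row sums of this structure give the distance between two criteria; column sums show how much the criteria disagree about one algorithm pair.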
The Condorcet method proceeds by calculating the relative errors to be included in Eqn. 4.2 and preparing comparison tables based on those results. The number of comparisons grows as O(w²) for each instance. The final evaluation would have to be done by inspection, since it becomes difficult to establish logical criteria which could be used for computational evaluation. Since the number of alternatives may be large, depending on the value of w, we consider that the Condorcet technique becomes impractical.

The Weight Ordering Method -WOM
The situation we have just described calls for some improvement in the evaluation. It led us to propose a Condorcet-like technique where the comparison can easily be made by calculation, the Weight Ordering Method (WOM). Here we have the advantage of automatically translating the results of the comparisons into numeric values. We do it with the aid of a function designed to be injective for the considered value set: then we can be sure it condenses into numbers the information provided by Table 4 above.
From the ordered array of the Condorcet method (Table 2), we look for equal-valued elements. If they exist, we proceed to a rearrangement that condenses these values into a single entry; otherwise we proceed with Table 2 unchanged. In either case, we obtain Table 5, where equal values in the various entries are condensed into a unique position (e.g., Algorithms 1, 4 and 5, with Criteria a and d); for the instance Tai25a, the result is shown in Table 5. The empty entries are not considered in the ordering (e.g., in line (d), Algorithm 2 in column 4 will be second in order, not fourth, and Algorithm 3 will be third, not fifth).
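This tie-condensing step amounts to a dense ranking, sketched below with invented values shaped like the example for criterion d (three algorithms tied at the best value).

```python
# Sketch of the tie-condensing step of Table 5: equal criterion values
# share one order position, and later positions are renumbered densely
# ("second" means second-best distinct value). Values invented.

def dense_orders(values):
    distinct = sorted(set(values))
    rank = {v: r + 1 for r, v in enumerate(distinct)}
    return [rank[v] for v in values]

# Five algorithms; the first, fourth and fifth tie at the best value.
print(dense_orders([0.0, 9.0, 10.0, 0.0, 0.0]))  # [1, 2, 3, 1, 1]
```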
With these data we are able to create an O_ij-type matrix, similar to that of the Condorcet method (Table 6), where each entry (i, j) contains the number of times Algorithm i appears in order j, for the whole criteria set applied to a given instance (e.g., Table 5 shows that Algorithm 1 obtained one first position (with Criterion d), three third positions (Criteria a, b, c) and one fourth position (Criterion e)).

To quantify the performance of each algorithm, we use this matrix together with a weight function over the obtained set of orders, where a first-rated algorithm receives a greater value than a second-rated one, and so on. The suggested weight function (4.3) for a given algorithm i considers the number w of algorithms, the orders obtained by Algorithm i for a given instance, an exponent basis k and the matrix O = [O_ij] of the instance. This function becomes injective for k sufficiently high; one has to test some values, given the arrangement set obtained. For the example, we found that k = 13/3 guarantees the injective property. Here, a higher value corresponds to a better performance.
It is crucial to observe that we are already working with an ordered set: since the function values reflect the ordering of the multicriteria evaluation for each algorithm, they correspond to the pairwise ordering used by Condorcet method, condensing its results into numeric values which indicate the algorithm performance order according to the proposed criteria.
Table 7 is Table 6 with a new column showing the WtFunc values. We can see that the best global performance was that of Algorithm 3 (279) and the worst that of Algorithm 2 (59). It is important to mention that these results are consistent only within a given situation, since the orderings obtained in two different situations may not be consistent with one another, and it may not be meaningful to add up their respective ratings.
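The exact expression of equation (4.3) is not reproduced in this text, so the sketch below should be read as an assumption: an exponential weight of the kind described, where each j-th place contributes k^(w − j), so first places weigh most and the map becomes injective for k large enough. The order profiles are invented, not those of Table 6.

```python
# Plausible sketch of a WtFunc-style exponential weight (an assumption,
# not the paper's actual equation 4.3): order_counts[j-1] is the number
# of times the algorithm ranked j-th, and each j-th place is worth
# k**(w - j), so earlier places dominate later ones.

def wt_func(order_counts, w, k):
    return sum(n * k ** (w - j) for j, n in enumerate(order_counts, start=1))

w, k = 5, 13 / 3  # the text uses k = 13/3 for its five-algorithm example

# Invented order profiles over five criteria:
strong = [3, 2, 0, 0, 0]  # three firsts, two seconds
weak = [0, 0, 0, 2, 3]    # two fourths, three fifths
print(wt_func(strong, w, k) > wt_func(weak, w, k))  # True
```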

COMPUTATIONAL RESOURCES AND RESULTS
For each problem, we used about 100 test instances, taken from their respective websites: [24] for TSP and CVRP, and [19] for QAP.
All algorithms started from randomly generated initial solutions. We performed a set of ten executions for each instance, each one initialized with a new seed in order to ensure independence. The seeds were randomly selected from the list of prime numbers between 1 and 2,000,000 [6]. The tests were run on a computer with an Intel Core 2 Quad 2.4 GHz processor and 4 GB of RAM, under the Linux operating system (openSUSE distribution).
Table 8 contains the list of instances from the three problems, with the corresponding sizes.
Table 9 shows the values of the weight function associated with the four algorithms on the problems used in the test. We can see that GRASP was the best technique both on the TSP and on the CVRP, while VNS worked more efficiently on the QAP.
It may be noted that no algorithm was better than the others for all three problems. Although behavior differences should be expected from one algorithm-problem pair to another, the results have also been influenced by our use of basic versions, whose detailed descriptions can easily be found in the literature.
A comparison test for WOM was designed with the use of boxplots [20]. The boxplot description followed the pattern used in Table 6, that is, for each problem we built a boxplot set for each criterion, involving all four algorithms. The graphics are shown in Appendix 1.

CONCLUSIONS
The WOM technique allows us to choose the level of detail in an algorithm performance study.
For example, we can check performances by using an isolated instance or a set of instance classes, as in [16]. Comparison between different versions of the same algorithm can be made much more easily than with the Condorcet method (whose output grows with the square of the number of elements and is designed to give results by inspection), since WOM gathers the evaluation results into a single parameter. It is also easily adaptable to the insertion, replacement or removal of a criterion or algorithm under study, allowing for faster scanning and analysis of the results.
Based on the Condorcet method, WOM shows very clearly both algorithm strengths and weaknesses and also allows for an overall comparison in terms of performance ordering. We believe, even with this small example, to have shown its efficiency for comparing and sorting techniques by performance among a much larger number of alternatives.
We think WOM can be very useful in algorithm development, when a researcher has to deal with a number of different but similar algorithm versions, or with several sets of different parameter values for a given algorithm. As with the Condorcet method, the proposed criteria set can be changed or modified according to the research objective.
A comparison with the boxplot analysis (Appendix 1) shows most of its results comparable with those of WOM: the CVRP is the least precise, the TSP matches well and the QAP fairly well.
The stagnation time (stag) has the lowest median for VNS. TS presented the highest stagnation times and the highest median; GRASP was second and ILS third. On the other hand, GRASP had the smallest value spread, followed by ILS, VNS and then TS.
We can say the boxplot comparison matches the WOM results, with VNS easily first, TS and ILS having close results and GRASP certainly worse.
The TSP boxplots are in Figure A1-2 below. The criteria nopt and exec were not effective: since the TSP instances have real-valued data, the algorithms spent all the allowed execution time of 600 seconds, in each of the ten executions per instance, trying to obtain better solutions within an interval of 1% around the OBKV value originally given by the site associated with the problem.
We can observe that GRASP produced low avd and qual values. This behavior allows us to understand its stag behavior as a strong search for better values, most of them falling in the immediate neighborhood of the 1% region around the OBKV. Since GRASP is a multistart method, along this process it would have less chance of getting stuck at local optima.
The same analysis, applied to the other three algorithms, points to less precision. We have to remember that, by the definition of qual, it approaches avd when the number of successful trials goes to zero. Hence the picture given by the two criteria here is very similar, and it indicates that the stagnation time was spent on worse solutions than those found by GRASP. The early stagnation should also indicate the influence of local optima.
Considering this last point, VNS is the most susceptible, and it also presents the highest values for avd and qual, showing the worst performance in this test. GRASP is evidently the most efficient; to decide between TS and ILS for second and third place, it is convenient to consider the somewhat lower avd and qual values of TS. It should then rank second and ILS third.
This result is the same as that obtained by the WOM technique (Table 7).
The CVRP boxplots are in Figure A1-3 below.
The analysis is somewhat similar to that made with the TSP results. There are nevertheless some interesting differences. The CVRP is a more difficult problem than the TSP, and this difficulty is reflected in the differences of avd and qual in this case, which we can observe especially in the very sensitive qual criterion.

(Note to Table 3: the entry value k for an algorithm pair is given by (+1, >); (−1, <); (0, =); e.g., Criterion a gives the second position to Algorithm 2 (value 9.0000) and the fourth one to Algorithm 4 (value 10.0000), hence (a,[2,4]) = −1, while Criterion d gives the first position to Algorithm 1 and the second one to Algorithm 4 (both with value 0.0000), hence (d,[1,4]) = 0.)

Table 1 - Instance Tai25a - The matrix with the values obtained by the algorithms.

Table 3 - Instance Tai25a - The value-pair comparison matrix.

Table 4 - Instance Tai25a - Comparison between pairs (by algorithms and by criteria).

Table 5 - WOM method - Rearrangement of equal values for Tai25a.

Table 9 - Comparison among the four metaheuristics using WOM.