STEPWISE SELECTION OF VARIABLES IN DEA USING CONTRIBUTION LOADS

In this paper, we propose a new methodology for variable selection in Data Envelopment Analysis (DEA). The methodology is based on an internal measure which evaluates the contribution of each variable in the calculation of the efficiency scores of DMUs. In order to apply the proposed method, an algorithm, known as “ADEA”, was developed and implemented in R. Step by step, the algorithm maximizes the load of the variable (input or output) which contribute least to the calculation of the efficiency scores, redistributing the weights of the variables without altering the efficiency scores of the DMUs. Once the weights have been redistributed, if the lower contribution does not reach a previously given critical value, a variable with minimum contribution will be removed from the model and, as a result, the DEA will be solved again. The algorithm will stop when all variables reach a given contribution load to the DEA or until no more variables can be removed. In this way and contrary to what is usual, the algorithm provides a clear stop rule. In both cases, the efficiencies obtained from the DEA will be considered suitable and rightly interpreted in terms of the remaining variables, indicating the load themselves; moreover, the algorithm will provide a sequence of alternative nested models – potential solutions – that could be evaluated according to external criterion. To illustrate the procedure, we have applied the methodology proposed to obtain a research ranking of Spanish public universities. In this case, at each step of the algorithm, the critical value is obtained based on a simulation study.


INTRODUCTION
Data Envelopment Analysis (DEA) is a methodology introduced by Charnes, Cooper and Rodes in [14], that it is used to compare the efficiencies of a set of homogeneous units (DMUs) that produces several outputs from the same set of inputs.This methodology has become very popular in several fields of mathematics and management science, including finance, banking, education, healthcare, . . .For each DMU, the DEA analysis does not only provide an efficiency score, but it also provides a peer set.The peer set can be used to guide those who are involved in the decision making process, leading to an optimal DMUs performance.
DEA is a non parametric method that builds and estimates the technological frontier as the region defined by the efficiency units.For each DMU, the DEA considers two sets of weights, one set for inputs and another for outputs, obtaining the efficiency scores of the DMUs from the optimization of the ratio between a combination of both sets of variables.The weights are selected in the most favourable way for the unit that has been evaluated.Thus, the initial set of variables, inputs and outputs, can lead to different ways of measuring the efficiency, so it is very important that the set is chosen correctly.Sexton [49], Smith [47] and Dyson [18] get different results regarding the sensitivity in the calculation of efficiencies dependent on variable selection.
As the selection of the variables to be included in the analysis is usually made by the decision makers and politicians, the researchers assume, a priori, that the selection is a correct one.This means that, generally, a very small attention is devoted to the selection of variables as shown in Cook & Zhu [17].But taking into account that efficiency is measured using the variables included in the model, the inclusion in it of inappropriate or irrelevant variables may cause bias in the results.Thus, the selection of the set of variables becomes an important task.
As will be described below, the usual methods of selection of variables in DEA mainly try to preserve the efficiency scores of the initial model, so the bias remains hidden for the method.We propose a new measure that can detect such bias in the selection of variables.To resolve the procedures involved in the proposed methodology, we have developed a package in R, called aBenchmarking, which will be available in cran [43].Currently, an interactive online application is available at http://knuth.uca.es/DEA.
In order to apply the proposed methodology, we will obtain the efficiencies of the research activity of Spanish public universities.For this purpose, we will use the same source of data used by the authors of the "Ranking 2013 de investigación de las universidades públicas españolas" [12], popularly known as Granada ranking.This is one of the most important rankings of Spanish universities.The Granada ranking includes two rankings, one of production and one of productivity, taking into account in the latter case the human resources that each university has to obtain its scientific production.The ranking of Granada is elaborated since year 2009 with annual periodicity, see [8,9,10,11,12].One of the objectives that we intend to address in future works, as application of the methodology proposed in this paper, is to compare the productivity results of the Granada ranking with the efficiencies obtained from a DEA model oriented to output.
The rest of this paper is organized as follows.In the next section we review some articles that deal with the problem of variable selection in DEA.The Section 3 presents the variable selection methods based on their intrinsic contribution to the calculation of DMU efficiencies, including theoretical background, algorithms and one step by step example.Section 4 is devoted to the determination of the minimum admissible value to be able to consider a variable as relevant in the model.This is done using a Monte Carlo simulation in which dummy variables have been included in the model.This section also includes the analysis of the aforementioned data set of Spanish universities to obtain a research ranking of Spanish public universities.The final section of the paper is devoted to conclusions and to the presentation of future lines of work.

LITERATURE REVIEW
The usual procedure of dealing with the aforementioned problem is to apply a backward/forward variable selection method, starting with a full model and dropping variables in an stepwise algorithm.Many of these methods are taken from classical statistics procedures.Jenkins & Anderson in [31], use the partial correlation coefficient to preserve a subset of variables that retain most of the original information.Simar & Wilson in [51] use bootstrap methods to include significant variables in a forward selection procedure.Ruggiero in [44] proposes a forward selection method in which, at each step, the entry criterion is based on the correlation of the candidate entry variables with the efficiencies obtained in the current model.Other authors, such as Wagner & Shimshak [56] propose similar stepwise methods but using as a criterion the minimization of the change in average efficiency scores.Norman & Stoker in [40] and Sigala et al.,in [45], have proposed forward procedures taking into account the correlation between the variables not included in the model and the efficiency scores.Ueda & Hoshiai in [54] and Adler & Golany in [2,3], developed, independently, a method based on replacing the original variables with the principal component analysis, removing the effect of the redundancy of information.
A different point of view is proposed in other papers.Pastor et al.,in [42], suggest a forward model based on the marginal impact of a variable (input-output), which is estimated through an efficiency contribution measure (ECM).
Other method for the selection of variables is the one proposed by Lins & Moreira [35] that starts from a model with only one input and output, the most correlated.From here on, they include that variable that causes higher average efficiency in the DEA, regardless of how many DMUs are efficient.Soares de Mello & others [46] use a convex combination of two indicators, to take into account both the average efficiency and the number of DMUs.To give equal importance to the indicators, the coefficients of the combination are the same, unless there are reasons for not being.A variant of this method is proposed by Senra & others [1], since they begin with the combination of variables that present greater value in the previous combination, although those models that have less variables will have a lower efficiency value since it is known that by increasing the number of variables in DEA increases the average efficiency.
Madhanagopal & Chandrasekaran in [38] first sort the variables by their relevance using a genetic algorithm, and then they apply the method proposed by Pastor.Fanchon in [23] suggests a methodology to identify the optimal number of variables, evaluating the contribution of these in the construction of the efficiency frontier.Morita & Avkiran in [37] propose an input/output selection method that uses diagonal layout experiments to find an optimal combination.Sharma & Jin in [52] evaluated the importance of each variable using a Kruskal-Wallis test.
Finally, Sirvent et al., in [48], Adler & Yazhemsky in [6] and Nataraja & Johnson in [39], make a comparative analysis of some of the variable selection techniques proposed in the literature.Table 1 reviews in tabular form the aforementioned papers in a chronological line.

METHOD FOR SELECTING VARIABLES BASED ON CONTRIBUTIONS
Our methodology proposes to establish a process of selection of variables that takes into account the contribution of the variables to the calculation of the efficiencies of the DMUs.This has led us to propose in this paper a normalized internal measure of the contribution.This measure, called load in the following, considers no external information to the procedure, how it happens with the use of regression techniques, principal components, etc.After some technical considerations, the constant return to scale, DEA-CRS, also known as DEA-CCR, model input oriented, consider for each DMU the problem: where the unit 0 is the unit taken into account.
The procedure solves n D linear programs, one for each DMU, and takes the score of each DMU as the optimal value of the program for that unit.Note that this score is the maximum virtual output amount allowed by the model.This approach does not allow consideration of measures and conditions inter-units.
In order to allow the handling of such measures and conditions, a model which considers all DMUs simultaneously can be built.As a first step, replacing in the previous problem the DMU-0 for the DMU-d we have the problem the (P d ) problem as: In order to solve simultaneously all (P d ) for all DMUs, the second step is to merge all the (P d ) problems into one.In order to do that, consider u od the weight of the o-th output in (P d ) problem, and v id the weight of the i-th input in ( P d ), and merging all together, we have: This problem, that solves the DEA model for all DMUs at the same time, has n D × (n I + n O ) non-negative variables and n D × (n D +1) constraints.That means that the dual program is easier to solve than primal, but if we want to handle constraints involving weights it is preferable to stay in primal space.
As the weights are included in the objective function of the optimization problem, the variables with greater weights have a greater influence in the final calculation of the efficiencies of the DMUs.So, to compare or to measure the importance of an input or an output variables in the final DMU rating, the use of the weights may be the first choice.However the weights lack some desirable properties such as being bounded or not subject to variation under changes of scale.To address that question a new measure is introduced in the following subsection.

The load of a variable
Definition 1.For any u and v feasible weights for ( P) consider: ᾱ I i and ᾱ O o will be called the contribution of the i-th input and the contribution of the o-th output respectively.
Notice that for the o-th output ᾱ O o is the ration between the contribution of that output variable to the objective function of (P), this is the part of the objective function that depends of said variable, and the total contribution of all outputs.Analogously, for the i-th input, ᾱ I i is the ratio between the contribution of the i-th input and all inputs.From another point of view, ᾱ O o is the ratio between the virtual output provided by the o-th output and the total virtual output.
One desirable property of any measure is to have a bounded range because it is thus possible to compare the value of the measure with its maximum value.The range of ᾱ-ratios is established in the next property.
Property 1.For any u and v weights feasible for ( P) we have: The proof follows directly from the definitions of ᾱ-ratios.
If all the ratios for input variables were equal then ᾱ I i would be 1/n I .Analogously, for the ratios for output variables ᾱ O o = 1/n O , which means that the ideal values of ratios depend on the number of inputs and outputs.In the next definition, a standardized version of ᾱ-ratios is introduced, correcting such a drawback.Definition 2. For any u and v feasible weights for ( P) consider: From the previous property follows the next one.
Property 2. For any u and v feasible weights for ( P) we have: Now, in the ideal case of equal ratios, all ᾱ I i and ᾱ O o ratios will be 1.In order to understand the meaning of the ratios, note that these are the quotient between the virtual output that comes from each output and the average value of all outputs.Thus, for example, α O 1 = 0.75, means that the contribution of output 1 is 75% of the average value for all outputs.The remaining 25% will go to increase the remaining output ratios.
Usually, (P) has got multiple alternate solutions that, in general, could lead to multiple values of α-ratios.Thus, the next task will be to fix a suitable choice for such ratios.
Following the optimistic approach of DEA methodology, the first potential approach for choosing suitable α-ratios could be to increase all of them.But, as the sum of α-ratios are fixed, if we try to increase one of them, the remaining ratios will decrease.
Thus, another possible way to redistribute all α-ratios is to increase the minimum value of the ratios.The new α-ratios after such redistribution of ratios will be called α-load or simply load.In this way, a low value for the load of a variable means that the contribution of such variable can not be increased without change the efficiency scores.So, the loads are very optimistic measures.This approach leads, when possible, to equal value of all loads, which means that the contribution of each variable is the same.Such choice should be made without changing the main results of DEA, i.e., without changing the efficiency score of each unit.
The next section of the paper is devoted to computing the values of these loads.

Computing the loads
In a natural way, to compute the loads we could consider, for a suitable value of , the next problem But this program to compute simultaneously α-ratios and, u and v weights, (P α ), is a non-linear program.
Let us take one step back, and consider again ( The above property and previous hypothesis allow us to use a two step procedure.In the first step, ( P) is solved and scores computed.In the second step, the maximum value α-ratios are computed.max α s.t.
The solution of ( P α ) gives α value that represents α-loads in such a way that the minimum of input ratios and output ratios are maximized.Such values are constrained to preserve the score of each DMU.So, any solution of ( P α ) is a suitable choice for the α-ratios and will be called the load of a variable as proposed in the following Definition 4. Given an optimal solution of ( P α ), let α I i : the load of i-th input variable, for i from 1 to n I .
α O o : the load of o-th output variable, for o from 1 to n O .
α: the load of the model.
From the optimality condition we can say that the load, α, is well defined, i.e., the value is unique.
The load of a variables in which the minimum is reached is also well established.But in all other cases the values could change, so they have no useful meaning.
If we consider that only input (or output) variables should be dropped from the model, a similar linear program can be considered removing the group of restrictions α I (or α O ) from (P α ).
The model in ( P α ) is input oriented, but the output oriented DEA can be handled in very similar way.

ADEA an stepwise variables selection algorithm
The definition of the loads for input and output variables allows the generation of an alternative DEA methodology, that we have called ADEA, for the selection of variables in DEA model.
Note that, for a given model, this algorithm provides the same efficiency scores than standard DEA, but the weights are not the same.

ADEA Algorithms
The usual methods of selection of variables in DEA mainly deal with efficiency scores and tries to preserve the scores of the initial model.We propose a new algorithm based in the contribution load of variables.At each step the variable with lower value of its load is dropped from the model.The algorithm continues until the load of the model reaches a previous given desired load value or until no more variables can be removed.To do that, in each step of the algorithm, the problem ( P α ) is solved and a variable with minimum load is dropped from the model, until one of the above mentioned condition is reached, see Table 2.
It may seem natural that successive cutoffs loads in the stages of stepwise algorithm are increasing, but this is not true, as evidenced by the example of Tokyo libraries data set in Section 3.3.2.
If after discarding a variable the resulting load is lower than the previous model, then the variables that remain in the model have not increased their load and, therefore, the new model is worse.The previous algorithm can be modified to generate an increasing sequences of loads, see Table 3.
Starting with an small value of the required load, the previous algorithm is applied by increasing the value in each step.

Inputs:
x amounts of inputs required by each DMU y amounts of outputs produced by each DMU Outputs: A sequences of models with increasing loads Steps: 1. x = x, y = y, λ = , > 0, an small value 2. For x as input, y as output and λ apply ADEA stepwise algorithm and output the model If the number of inputs or outputs is not 1 then go to step 2

Tokyo libraries data set
In order illustrate how both algorithms work, consider, as an example, the Tokyo libraries case (involving a set of 23 libraries in Tokyo), which has been used frequently in DEA literature, see [16,56,52,15].The Tokyo data set, has 4 input and 2 output variables.The inputs are: Area, Books, Staff and Populations and outputs are: Registration and Borrowing.
Table 4 shows the loads of the variables after solving ( P α ).The load of the model is 0.41 which is reached at variable Area.This means that the contribution of the variable Area to the efficiency scores is 41% instead of 100%.If we thinks that 41% is not enough, then we can apply the ADEA stepwise algorithm.
Table 5 shows the results of each step of the algorithm.To solve the tie in step 3, the variable that leads to a better model in the next step are selected.If 90% is considered a suitable value for the model load, the variables Area and Registrations should be dropped from the initial model.Notice that the load of the model at step 3 is 0.9, but in step 4, the load goes down to 0.76.This means that the model in step 4 is worse than the model in step 3, because it has lower load and less variables.To avoid that, we can apply the ADEA parametric algorithm that generate a sequence of models with increasing values of loads.Table 6 shows each step of the algorithm.Previously 0.7 has been selected as minimum desired value for the load of the model.But how can we compute such value?In the next section we use a Monte Carlo method to simulate the value of the load after include a dummy variable in the model.These simulated values of the loads allow to us to established such minimum desired value.
But what would have happened if the procedure had been applied with another initial set of variables?To answer this question the procedure has been applied to each model resulting from deleting one variable in the initial model.Tables from 7 to 12 show that, with only one exception, in all models all variables have been deleted in the same sequence.The model without Borrowing, in Table 12, shows differences with the other models, but these differences are expected because the output with more relevance has been eliminated of the model, reason why this model is essentially different to the others.This example suggests that the procedure has some robustness and is not very sensitive to the initial choice of the set of variables.

SELECTING A CRITICAL VALUE FOR LOADS. APPLICATION TO THE ANALYSIS OF RESEARCH EFFICIENCY IN SPANISH UNIVERSITIES
The proposed methodology works dropping variables until some previously given value is reached by the loads.But until now, nothing is said about how such value can be selected.In this section an ad hoc Monte Carlo simulation is made in order to give a suitable value of such parameter.To do this, we will apply a DEA to obtain the research efficiency of the Spanish public universities in 2013.The data set, which has been called Spanish University data set, has been obtained from official sources and includes one input RP : The average number, considering the courses 2103, 2014 and 2015, of permanent research professors from [22].
And seven outputs that measure the research production: JCR : Number of articles published in journals indexed in the JCR.Number of published articles indexed in "Main Collection of Web of Science (WoS)" in 2013 from [55].
RAR : Each permanent professor in Spanish universities can submit every six years his research activity to be evaluate by the a national agency.This is the ratio between the number of positive evaluations and all the six years periods that they could submit to evaluation.Data comes from, CNEAI 2009 report [4], the report of the National Commission for the Evaluation of Research Activity.An output orientation is used to handle this model.The data have been obtained from the same official sources as the Granada ranking for 2013.

RDP
About output variable RAR we must say that it is a ratio and that there are many published works, as example [28,21] disregarding the use of ratios in DEA.Also it must be said that there are many other works that use variables of type ratio as in [5,37,24].In this case, the use of this variable is due to an attempt to reproduce as accurately as possible the original study with which to compare the results of the analysis.

Monte Carlo Simulation of Loads
Generally speaking, if we add a dummy, randomly generated, variable to a model and compute the load.Such load shows how much higher can be the load of a variable without meaning in the model, and a suitable quantile can be taken as lower limit for the load of the variables remaining in the model.But which distribution we can select to such dummy variable?
As a first try, we can consider the normal distribution.If we make a Shapiro-Wilk normality test to the variables in the initial Spanish universities model gets p-values from 10 −7 till 10 −4 except for one variable.A logarithmic transformation seems needed.After making that transformation all the new p-values of Shapiro-Wilk test are higher than 0.18.So a log-normal distribution is a good selection for the distribution of the dummy variable.
Table 13 n-th step of load simulation.
Let l i the load of the model after adding Y as output 9.
i = i + 1 If i < 1000 go to to step 6. 10.Let l 0.9 the 0.9 quantile of l Let l 0.95 the 0.95 quantile of l Table 13 shows how in each simulation two uniformly generated random values are taken from interval [1,40].Let μ min the lower and μ max the higher.In i-th step, from each of the one thousand, an uniformly random generated value μ i are taken from [μ min , μ max ].Analogously, let σ min and σ max the lower and the higher values from two randomly generated values in [1,5].
And let σ i an uniformly random generated value taken from [σ min , σ max ].The dummy variable is generated as Y ∼ exp(N (μ i , σ i )), added to the model and the load of it is calculated.The 0.9 and 0.95 quantiles of load are stored in a database.Taking into account that the loads are invariant by scale changes, the initial intervals chosen run through the values of the parameters that cross the sample values of the variables in the model.
Each simulation is repeated one thousand times, so 1 million of variables are generated and 1 million models are solved.The average value of 0.9 and 0.95 quantiles are 0.52 and 0.53.
According to these, we must drop all variables in the model with load lower than 0.53.Table 14 shows that in first step the load of JCR and STSR are under 0.53.A new simulation was made dropping STSR and JCR variables.And now the 0.9 and 0.95 quantiles are 0.54 and 0.62.But in this case the load of model in step 3, 0.71, is higher than 0.62, so no new simulation is needed.
Summarizing, 0.62 is an upper bound in 95% of cases when introducing a dummy variable in the model, thus 0.62 can be considered, with 95% of confidence, as a minimum value of the load of the variables in the model.Taking into account that the values obtained in the three simulations are around 0.6 and taking into account the meaning of the loads, we recommend this value as a general value for its use in the application of this methodology.

Variable selection
The application of the step-by-step algorithm, shown in Table 14, gives three models: the initial model that we will call IM, the resulting model of eliminating the STSR variable that we will call M1, and the resulting final model of eliminating the variables STSR and JCR that we will call M2.
From a functional point of view, the elimination of the STRS and JCR variables implies obtaining a simpler model that helps to better understand the research productivity of universities.Although all the variables considered explain, to a greater or lesser extent, the research done, contributing a percentage of the total of this research, the variables are more or less related among them.In particular, JCR is closely related to the number of six-years of research, regis-

CONCLUSIONS AND FUTURE RESEARCH
We propose in this paper a new methodology for selecting variables in DEA models based on a measure of the contribution of the variables to the efficiency scores of the DMUs.A Monte Carlo simulation has been used to determine a suitable value for the minimum value that the load of the variables in the model must have.In this way a useful tool for decision makers is provided to test the role of a variable in a DEA model.
A cross-validation of the results obtained with the proposed methodology was carried out through the correlation coefficients between the different models obtained.
We have illustrated our procedure through one classical example in DEA: the Tokyo libraries data set, and using a new data set related to Spanish universities.In both cases, step by step results are provided.
At http://knuth.uca.es/shiny/DEA/ an online interactive application is available.Some data example are ready to load to play with the software.Also, the user can upload its own data set and apply the algorithms proposed in this paper.It is our purpose to prepare a package for R for publication in R public repositories as free software.
As we have discussed above, we will try to compare technologies based on productivity rankings with DEA model, to compare Spanish public universities according to its research results.
The validation of the algorithms proposed through an extensive simulation and its development for other DEA models are some of the future lines of work.

Following the usual notation
in DEA, a set of n D DMUs is considered to be rated.Each DMU uses different amounts of n I inputs to produce n O different outputs.Let x id the amount of the i-th input that uses the d-th DMU for i = 1, 2, . . ., n I and d = 1, 2, . . ., n D , and let y od the amount of the o-th output produced by d-th DMU for o = 1, 2, . . ., n O and d = 1, 2, . . ., n D .

3 .Property 3 .
P d ).The value of the objective function of that program is defined as the score of the d-th DMU, considering the following Definition Let s d , ∀d = 1, . . ., n D the score in DEA of d-th DMU, i.e.: s d = n O o=1 u od y od and under ( P) constraints we have: For any u and v feasible weights for (P), From the ( P) feasibility conditions for u and v, we have n I i=1 v id x id = 1 Pesquisa Operacional, Vol.38(1), 2018Thus, taking the sum over d and rearranging the sum, we have

:
Number of projects awarded in State Research and Development Programs, by the Ministry of Economy and Competitiveness in 2013) from [29, 30].Phd : Number of doctoral thesis from database of doctoral thesis TESEO (Ministry of Education, Culture and Sport) between 2007 and 2011 from [53].STSR : Number of fellowships awarded for researcher training, Ministry of Education, Culture and Sport in 2013 from [25, 26].DE : Number of Doctorate programs with Mention towards Excellence, Ministry of Education Culture and Sport in 2011 from [19, 20].P : Number of patents registered between 2009 and 2013, Database of the Spanish Patent and Trademark Office (OEPM) from [41].

Table 1 -
Main contributions to the selection of variables in DEA.

Table 2 -
ADEA Stepwise Algorithm.For x as input and y as output solve (P d ) 3. If α > λ then stop.Consider x and y as final input and output set.4.Drop from x or y a variable that reach the minimum.Go to Step 2.

Table 4 -
Loads of variables Tokyo libraries data set.

Table 5 -
Steps of ADEA stepwise algorithm.

Table 6 -
Steps of ADEA parametric algorithm.

Table 7 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Area.

Table 8 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Books.

Table 9 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Staff.

Table 10 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Populations.

Table 11 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Registration.

Table 12 -
Steps of ADEA stepwise algorithm for Tokyo libraries without Borrowing.

Table 14 -
Stepwise ADEA for Spanish Universities data set.But if the STFR variable is removed from the model, a new simulation must be ran.The criterion for undoing draws is the same as previously applied.The new value for 0.9 and 0.95 quantiles are 0.55 and 0.57.And again the load of JCR in step 2 is under this values.