1 INTRODUCTION
Data Envelopment Analysis (DEA) is a methodology introduced by Charnes, Cooper and Rodes in ^{14}, that it is used to compare the efficiencies of a set of homogeneous units (DMUs) that produces several outputs from the same set of inputs. This methodology has become very popular in several fields of mathematics and management science, including finance, banking, education, healthcare, ... For each DMU, the DEA analysis does not only provide an efficiency score, but it also provides a peer set. The peer set can be used to guide those who are involved in the decision making process, leading to an optimal DMUs performance.
DEA is a non parametric method that builds and estimates the technological frontier as the region defined by the efficiency units. For each DMU, the DEA considers two sets of weights, one set for inputs and another for outputs, obtaining the efficiency scores of the DMUs from the optimization of the ratio between a combination of both sets of variables. The weights are selected in the most favourable way for the unit that has been evaluated. Thus, the initial set of variables, inputs and outputs, can lead to different ways of measuring the efficiency, so it is very important that the set is chosen correctly. Sexton ^{49}, Smith ^{47} and Dyson ^{18} get different results regarding the sensitivity in the calculation of efficiencies dependent on variable selection.
As the selection of the variables to be included in the analysis is usually made by the decision makers and politicians, the researchers assume, a priori, that the selection is a correct one. This means that, generally, a very small attention is devoted to the selection of variables as shown in Cook & Zhu ^{17}. But taking into account that efficiency is measured using the variables included in the model, the inclusion in it of inappropriate or irrelevant variables may cause bias in the results. Thus, the selection of the set of variables becomes an important task.
As will be described below, the usual methods of selection of variables in DEA mainly try to preserve the efficiency scores of the initial model, so the bias remains hidden for the method. We propose a new measure that can detect such bias in the selection of variables. To resolve the procedures involved in the proposed methodology, we have developed a package in R, called aBenchmarking, which will be available in cran ^{43}. Currently, an interactive online application is available at http://knuth.uca.es/DEA.
In order to apply the proposed methodology, we will obtain the efficiencies of the research activity of Spanish public universities. For this purpose, we will use the same source of data used by the authors of the “Ranking 2013 de investigación de las universidades públicas españolas” ^{12}, popularly known as Granada ranking. This is one of the most important rankings of Spanish universities. The Granada ranking includes two rankings, one of production and one of productivity, taking into account in the latter case the human resources that each university has to obtain its scientific production. The ranking of Granada is elaborated since year 2009 with annual periodicity, see ^{8}^{), (}^{9}^{), (}^{10}^{), (}^{11}^{), (}^{12}. One of the objectives that we intend to address in future works, as application of the methodology proposed in this paper, is to compare the productivity results of the Granada ranking with the efficiencies obtained from a DEA model oriented to output.
The rest of this paper is organized as follows. In the next section we review some articles that deal with the problem of variable selection in DEA. The Section 3 presents the variable selection methods based on their intrinsic contribution to the calculation of DMU efficiencies, including theoretical background, algorithms and one step by step example. Section 4 is devoted to the determination of the minimum admissible value to be able to consider a variable as relevant in the model. This is done using a Monte Carlo simulation in which dummy variables have been included in the model. This section also includes the analysis of the aforementioned data set of Spanish universities to obtain a research ranking of Spanish public universities. The final section of the paper is devoted to conclusions and to the presentation of future lines of work.
2 LITERATURE REVIEW
The usual procedure of dealing with the aforementioned problem is to apply a backward/forward variable selection method, starting with a full model and dropping variables in an stepwise algorithm. Many of these methods are taken from classical statistics procedures. Jenkins & Anderson in ^{31}, use the partial correlation coefficient to preserve a subset of variables that retain most of the original information. Simar & Wilson in ^{51} use bootstrap methods to include significant variables in a forward selection procedure. Ruggiero in ^{44} proposes a forward selection method in which, at each step, the entry criterion is based on the correlation of the candidate entry variables with the efficiencies obtained in the current model. Other authors, such as Wagner & Shimshak ^{56} propose similar stepwise methods but using as a criterion the minimization of the change in average efficiency scores. Norman & Stoker in ^{40} and Sigala et al., in ^{45}, have proposed forward procedures taking into account the correlation between the variables not included in the model and the efficiency scores. Ueda & Hoshiai in ^{54} and Adler & Golany in ^{2}^{), (}^{3}, developed, independently, a method based on replacing the original variables with the principal component analysis, removing the effect of the redundancy of information.
A different point of view is proposed in other papers. Pastor et al., in ^{42}, suggest a forward model based on the marginal impact of a variable (inputoutput), which is estimated through an efficiency contribution measure (ECM).
Other method for the selection of variables is the one proposed by Lins & Moreira ^{35} that starts from a model with only one input and output, the most correlated. From here on, they include that variable that causes higher average efficiency in the DEA, regardless of how many DMUs are efficient. Soares de Mello & others ^{46} use a convex combination of two indicators, to take into account both the average efficiency and the number of DMUs. To give equal importance to the indicators, the coefficients of the combination are the same, unless there are reasons for not being. A variant of this method is proposed by Senra & others ^{1}, since they begin with the combination of variables that present greater value in the previous combination, although those models that have less variables will have a lower efficiency value since it is known that by increasing the number of variables in DEA increases the average efficiency.
Madhanagopal & Chandrasekaran in ^{38} first sort the variables by their relevance using a genetic algorithm, and then they apply the method proposed by Pastor. Fanchon in ^{23} suggests a methodology to identify the optimal number of variables, evaluating the contribution of these in the construction of the efficiency frontier. Morita & Avkiran in ^{37} propose an input/output selection method that uses diagonal layout experiments to find an optimal combination. Sharma & Jin in ^{52} evaluated the importance of each variable using a KruskalWallis test.
Finally, Sirvent et al., in ^{48}, Adler & Yazhemsky in ^{6} and Nataraja & Johnson in ^{39}, make a comparative analysis of some of the variable selection techniques proposed in the literature.
Table 1 reviews in tabular form the aforementioned papers in a chronological line.
3 METHOD FOR SELECTING VARIABLES BASED ON CONTRIBUTIONS
Our methodology proposes to establish a process of selection of variables that takes into account the contribution of the variables to the calculation of the efficiencies of the DMUs. This has led us to propose in this paper a normalized internal measure of the contribution. This measure, called load in the following, considers no external information to the procedure, how it happens with the use of regression techniques, principal components, etc.
Following the usual notation in DEA, a set of n _{ D } DMUs is considered to be rated. Each DMU uses different amounts of n _{ I } inputs to produce n _{ O } different outputs. Let x _{ id } the amount of the ith input that uses the dth DMU for i = 1, 2, ..., n _{ I } and d = 1, 2, ..., n _{ D } , and let y _{ od } the amount of the oth output produced by dth DMU for o = 1, 2, ..., n _{ O } and d = 1, 2, ..., n _{ D } .
After some technical considerations, the constant return to scale, DEACRS, also known as DEACCR, model input oriented, consider for each DMU the problem:
where the unit 0 is the unit taken into account.
The procedure solves n _{ D } linear programs, one for each DMU, and takes the score of each DMU as the optimal value of the program for that unit. Note that this score is the maximum virtual output amount allowed by the model. This approach does not allow consideration of measures and conditions interunits.
In order to allow the handling of such measures and conditions, a model which considers all DMUs simultaneously can be built. As a first step, replacing in the previous problem the DMU0 for the DMUd we have the problem the (P _{ d } ) problem as:
In order to solve simultaneously all (P _{ d } ) for all DMUs, the second step is to merge all the (P _{ d } ) problems into one. In order to do that, consider u _{ od } the weight of the oth output in (P _{ d } ) problem, and v _{ id } the weight of the ith input in (P _{ d } ), and merging all together, we have:
This problem, that solves the DEA model for all DMUs at the same time, has n _{ D } × (n _{ I } + n _{ O } ) nonnegative variables and n _{ D } × (n _{ D } +1) constraints. That means that the dual program is easier to solve than primal, but if we want to handle constraints involving weights it is preferable to stay in primal space.
As the weights are included in the objective function of the optimization problem, the variables with greater weights have a greater influence in the final calculation of the efficiencies of the DMUs. So, to compare or to measure the importance of an input or an output variables in the final DMU rating, the use of the weights may be the first choice. However the weights lack some desirable properties such as being bounded or not subject to variation under changes of scale. To address that question a new measure is introduced in the following subsection.
3.1 The load of a variable
Definition 1. For any u and v feasible weights for (P) consider:
Notice that for the oth output
One desirable property of any measure is to have a bounded range because it is thus possible to compare the value of the measure with its maximum value. The range of
Property 1. For any u and v weights feasible for (P) we have:
The proof follows directly from the definitions of
If all the ratios for input variables were equal then
Definition 2. For any u and v feasible weights for (P) consider:
From the previous property follows the next one.
Property 2. For any u and v feasible weights for (P) we have:
Now, in the ideal case of equal ratios, all
Usually, (P) has got multiple alternate solutions that, in general, could lead to multiple values of
Following the optimistic approach of DEA methodology, the first potential approach for choosing suitable
Thus, another possible way to redistribute all
The next section of the paper is devoted to computing the values of these loads.
3.2 Computing the loads
In a natural way, to compute the loads we could consider, for a suitable value of є, the next problem
(
But this program to compute simultaneously
Let us take one step back, and consider again (P _{ d } ). The value of the objective function of that program is defined as the score of the dth DMU, considering the following
Definition 3. Let s _{ d } , ∀d = 1, ... , n _{ D } the score in DEA of dth DMU, i.e.:
and under (P) constraints we have:
Property 3. For any u and v feasible weights for (P), one has
Proof. From the (P) feasibility conditions for u and v, we have
Thus, taking the sum over d and rearranging the sum, we have
Analogously,
Thus,
The above property and previous hypothesis allow us to use a two step procedure. In the first step, (P) is solved and scores computed. In the second step, the maximum value
The solution of (P
_{
α
} ) gives α value that represents
Definition 4. Given an optimal solution of (P _{ α } ), let
α: the load of the model.
From the optimality condition we can say that the load, α, is well defined, i.e., the value is unique. The load of a variables in which the minimum is reached is also well established. But in all other cases the values could change, so they have no useful meaning.
If we consider that only input (or output) variables should be dropped from the model, a similar linear program can be considered removing the group of restrictions α _{ I } (or α _{ O } ) from (P _{ α } ).
The model in (P _{ α } ) is input oriented, but the output oriented DEA can be handled in very similar way.
3.3 ADEA an stepwise variables selection algorithm
The definition of the loads for input and output variables allows the generation of an alternative DEA methodology, that we have called ADEA, for the selection of variables in DEA model.
Note that, for a given model, this algorithm provides the same efficiency scores than standard DEA, but the weights are not the same.
3.3.1 ADEA Algorithms
The usual methods of selection of variables in DEA mainly deal with efficiency scores and tries to preserve the scores of the initial model. We propose a new algorithm based in the contribution load of variables. At each step the variable with lower value of its load is dropped from the model. The algorithm continues until the load of the model reaches a previous given desired load value or until no more variables can be removed. To do that, in each step of the algorithm, the problem (P _{ α } ) is solved and a variable with minimum load is dropped from the model, until one of the above mentioned condition is reached, see Table 2.
Inputs:  x amounts of inputs required by each DMU  

y amounts of outputs produced by each DMU  
λ∈ [0, 1] the minimun value of load required  
Outputs:  Scores for each DMU  
A reduced DEA model  
Steps:  1. 

2.  For


3.  If α > λ then stop.  
Consider


4.  Drop from


Go to Step 2. 
It may seem natural that successive cutoffs loads in the stages of stepwise algorithm are increasing, but this is not true, as evidenced by the example of Tokyo libraries data set in Section 3.3.2.
If after discarding a variable the resulting load is lower than the previous model, then the variables that remain in the model have not increased their load and, therefore, the new model is worse.
The previous algorithm can be modified to generate an increasing sequences of loads, see Table 3. Starting with an small value of the required load, the previous algorithm is applied by increasing the value in each step.
Inputs:  x amounts of inputs required by each DMU  

y amounts of outputs produced by each DMU  
Outputs:  A sequences of models with increasing loads  
Steps:  1. 

2.  For


apply ADEA stepwise algorithm and output the model  
3.  λ= α + є  
If the number of inputs or outputs is not 1 then go to step 2 
3.3.2 Tokyo libraries data set
In order illustrate how both algorithms work, consider, as an example, the Tokyo libraries case (involving a set of 23 libraries in Tokyo), which has been used frequently in DEA literature, see ^{16}^{), (}^{56}^{), (}^{52}^{), (}^{15}. The Tokyo data set, has 4 input and 2 output variables. The inputs are: Area, Books, Staff and Populations and outputs are: Registration and Borrowing.
Table 4 shows the loads of the variables after solving (P _{ α } ). The load of the model is 0.41 which is reached at variable Area. This means that the contribution of the variable Area to the efficiency scores is 41% instead of 100%. If we thinks that 41% is not enough, then we can apply the ADEA stepwise algorithm.
Inputs  Outputs  

Area  Books  Staff  Populations  Registration  Borrowing  
Load  0.41  1.37  0.98  1.24  0.64  1.36 
Table 5 shows the results of each step of the algorithm. To solve the tie in step 3, the variable that leads to a better model in the next step are selected. If 90% is considered a suitable value for the model load, the variables Area and Registrations should be dropped from the initial model.
Inputs  Outputs  

Step  Area  Books  Staff Populations  Registration  Borrowing  
1  0.41  1.37  0.98  1.24  0.64  1.36 
2  1.26  0.77  0.97  0.61  1.39  
3  1.20  0.90  0.90  1  
4  1.24  0.76  1  
5  1  1 
Notice that the load of the model at step 3 is 0.9, but in step 4, the load goes down to 0.76. This means that the model in step 4 is worse than the model in step 3, because it has lower load and less variables. To avoid that, we can apply the ADEA parametric algorithm that generate a sequence of models with increasing values of loads. Table 6 shows each step of the algorithm.
Inputs  Outputs  

Steps  Area  Books  Staff Populations  Registration  Borrowing  
1  0.41  1.37  0.98  1.24  0.64  1.36 
2  1.26  0.77  0.97  0.61  1.39  
3  1.20  0.90  0.90  1  
4  1  1 
Previously 0.7 has been selected as minimum desired value for the load of the model. But how can we compute such value? In the next section we use a Monte Carlo method to simulate the value of the load after include a dummy variable in the model. These simulated values of the loads allow to us to established such minimum desired value.
But what would have happened if the procedure had been applied with another initial set of variables? To answer this question the procedure has been applied to each model resulting from deleting one variable in the initial model. Tables from 7 to 12 show that, with only one excep tion, in all models all variables have been deleted in the same sequence. The model without Borrowing, in Table 12, shows differences with the other models, but these differences are expected because the output with more relevance has been eliminated of the model, reason why this model is essentially different to the others. This example suggests that the procedure has some robustness and is not very sensitive to the initial choice of the set of variables.
Inputs  Outputs  

Step  Area  Books  Staff  Populations  Registration  Borrowing 
1  1.26  0.77  0.97  0.61  1.39  
2  1.20  0.90  0.90  1  
3  1.24  0.76  1  
4  1  1 
Inputs  Outputs  

Step  Area  Books  Staff  Populations  Registration  Borrowing 
1  0.39  1.51  1.10  0.71  1.29  
2  1.26  0.74  0.70  1.30  
3  1.10  0.90  1  
4  1  1 
Inputs  Outputs  

Step  Area  Books  Staff  Populations  Registration  Borrowing 
1  0.34  1.56  1.10  0.63  1.37  
2  1.21  0.79  0.59  1.41  
3  1.24  0.76  1  
4  1  1 
Inputs  Outputs  

Step  Area  Books  Staff  Populations  Registration  Borrowing 
1  0.19  1.65  1.16  0.68  1.32  
2  1.17  0.83  0.65  1.35  
3  1.28  0.72  1  
4  1  1 
Inputs  Outputs  

Step  Area  Books  Staff  Populations  Registration  Borrowing 
1  0.37  1.53  0.92  1.17  1  
2  1.2  0.90  0.90  1  
3  1.24  0.76  1  
4  1  1 
4 SELECTING A CRITICAL VALUE FOR LOADS. APPLICATION TO THE ANALYSIS OF RESEARCH EFFICIENCY IN SPANISH UNIVERSITIES
The proposed methodology works dropping variables until some previously given value is reached by the loads. But until now, nothing is said about how such value can be selected. In this section an ad hoc Monte Carlo simulation is made in order to give a suitable value of such parameter. To do this, we will apply a DEA to obtain the research efficiency of the Spanish public universities in 2013. The data set, which has been called Spanish University data set, has been obtained from official sources and includes one input
RP: The average number, considering the courses 2103, 2014 and 2015, of permanent research professors from ^{22}.
And seven outputs that measure the research production:
JCR: Number of articles published in journals indexed in the JCR. Number of published articles indexed in “Main Collection of Web of Science (WoS)” in 2013 from ^{55}.
RAR: Each permanent professor in Spanish universities can submit every six years his research activity to be evaluate by the a national agency. This is the ratio between the number of positive evaluations and all the six years periods that they could submit to evaluation. Data comes from, CNEAI 2009 report ^{4}, the report of the National Commission for the Evaluation of Research Activity.
RDP: Number of projects awarded in State Research and Development Programs, by the Ministry of Economy and Competitiveness in 2013) from ^{29}^{), (}^{30}.
Phd: Number of doctoral thesis from database of doctoral thesis TESEO (Ministry of Education, Culture and Sport) between 2007 and 2011 from ^{53}.
STSR: Number of fellowships awarded for researcher training, Ministry of Education, Culture and Sport in 2013 from ^{25}^{), (}^{26}.
DE: Number of Doctorate programs with Mention towards Excellence, Ministry of Education Culture and Sport in 2011 from ^{19}^{), (}^{20}.
P: Number of patents registered between 2009 and 2013, Database of the Spanish Patent and Trademark Office (OEPM) from ^{41}.
An output orientation is used to handle this model. The data have been obtained from the same official sources as the Granada ranking for 2013.
About output variable RAR we must say that it is a ratio and that there are many published works, as example ^{28}^{), (}^{21} disregarding the use of ratios in DEA. Also it must be said that there are many other works that use variables of type ratio as in ^{5}^{), (}^{37}^{), (}^{24}. In this case, the use of this variable is due to an attempt to reproduce as accurately as possible the original study with which to compare the results of the analysis.
4.1 Monte Carlo Simulation of Loads
Generally speaking, if we add a dummy, randomly generated, variable to a model and compute the load. Such load shows how much higher can be the load of a variable without meaning in the model, and a suitable quantile can be taken as lower limit for the load of the variables remaining in the model. But which distribution we can select to such dummy variable?
As a first try, we can consider the normal distribution. If we make a ShapiroWilk normality test to the variables in the initial Spanish universities model gets pvalues from 10^{−7} till 10^{−4} except for one variable. A logarithmic transformation seems needed. After making that transformation all the new pvalues of ShapiroWilk test are higher than 0.18. So a lognormal distribution is a good selection for the distribution of the dummy variable.
Table 13 shows how in each simulation two uniformly generated random values are taken from interval ^{1}^{ ), ( }^{40}. Let μ _{min} the lower and μmax the higher. In ith step, from each of the one thousand, an uniformly random generated value μ _{ i } are taken from [μ _{min} , μ _{max}]. Analogously, let σ _{min} and σ _{max} the lower and the higher values from two randomly generated values in ^{1}^{ ), ( }^{5}.
Inputs:  μ _{ m } = 1, μ _{ M } = 40  

σ _{ m } = 1, σ _{ M } = 5  
Outputs:  l _{0.9}, l _{0.95}  
Steps:  1.  u _{1} = U(μ _{ m } , μ _{ M } ), u _{2} = U(μ _{ m } ,μ _{ M } ) 
2.  μ _{min} = min{u _{1} , u _{2}}, μ _{max} = max{u _{1} , u _{2}}  
3.  s _{1} = U(σ _{ m } , σ _{ M } ), s _{2} = U(σ _{ m } , σ _{ M } )  
4.  σ _{min} = min{s _{1} , s _{2}}, σ _{max} = max{s _{1} , s _{2}}  
5.  i = 1  
6.  μ _{ i } ~ U(μ _{min} , μ _{max}), σ _{ i } ~ U(σ _{min} , σ _{max})  
7.  Y _{ i } ~ exp N (μ _{ i } , σ _{ i } )  
8.  Let l _{ i } the load of the model after adding Y as output  
9.  i = i + 1  
If i < 1000 go to to step 6.  
10.  Let l _{0.9} the 0.9 quantile of l  
Let l _{0.95} the 0.95 quantile of l 
And let σi an uniformly random generated value taken from [σ _{min} , σ _{max}]. The dummy variable is generated as Y ∼ exp(𝒩(μ _{ i } , σ _{ i } )), added to the model and the load of it is calculated. The 0.9 and 0.95 quantiles of load are stored in a database. Taking into account that the loads are invariant by scale changes, the initial intervals chosen run through the values of the parameters that cross the sample values of the variables in the model.
Each simulation is repeated one thousand times, so 1 million of variables are generated and 1 million models are solved. The average value of 0.9 and 0.95 quantiles are 0.52 and 0.53. According to these, we must drop all variables in the model with load lower than 0.53. Table 14 shows that in first step the load of JCR and STSR are under 0.53.
Outputs  

Step  Model  JCR  RAR  RDP  Phd  STRS  DE  P 
1  IM  0.36  1.16  0.55  1.62  0.36  1.26  1.69 
2  M1  0.35  1.04  0.60  1.52  1.10  1.39  
3  M2  0.87  0.71  1.42  0.91  1.08 
But if the STFR variable is removed from the model, a new simulation must be ran. The criterion for undoing draws is the same as previously applied. The new value for 0.9 and 0.95 quantiles are 0.55 and 0.57. And again the load of JCR in step 2 is under this values.
A new simulation was made dropping STSR and JCR variables. And now the 0.9 and 0.95 quantiles are 0.54 and 0.62. But in this case the load of model in step 3, 0.71, is higher than 0.62, so no new simulation is needed.
Summarizing, 0.62 is an upper bound in 95% of cases when introducing a dummy variable in the model, thus 0.62 can be considered, with 95% of confidence, as a minimum value of the load of the variables in the model. Taking into account that the values obtained in the three simulations are around 0.6 and taking into account the meaning of the loads, we recommend this value as a general value for its use in the application of this methodology.
4.2 Variable selection
The application of the stepbystep algorithm, shown in Table 14, gives three models: the initial model that we will call IM, the resulting model of eliminating the STSR variable that we will call M1, and the resulting final model of eliminating the variables STSR and JCR that we will call M2.
From a functional point of view, the elimination of the STRS and JCR variables implies obtaining a simpler model that helps to better understand the research productivity of universities. Although all the variables considered explain, to a greater or lesser extent, the research done, contributing a percentage of the total of this research, the variables are more or less related among them. In particular, JCR is closely related to the number of sixyears of research, registered in RAR, since to obtain a sixyears five articles are needed in JCR journals; for its part, the number of fellowships awarded for researcher training, collected in STSR, depends heavily on the number of research projects, registered in RDP. We can conclude, then, that the final model hardly changes the efficiencies of the model that contains all the variables, but it is simpler and contains less redundant information.
In order to validate that the eliminated variables are indeed nonsignificant, the efficiencies associated with each of the three models have been collected in Table 15. As can be seen in it the differences are very small and the Pearson correlation coefficient between the complete model and the other two models are r _{ I M,M1 } = 0.9993 y r _{ I M,M2 } = 0.9969, while the Spearman coefficients, which quantify the changes in the order, are r _{ I M,M1 } = 0.9983 y r _{ I M,M2 } = 0.9918. Therefore, the elimination of the variables JCR and STSR has a very low impact on the calculation of efficiencies and is fully justified.
University  IM  M1  M2  University  IM  M1  M2 

A Coruña  2.46  2.46  2.46  León  2.13  2.13  2.13 
Alcalá  2.20  2.20  2.20  Lleida  1.59  1.59  1.59 
Alicante  1.50  1.50  1.50  Málaga  1.58  1.71  1.71 
Almería  2.55  2.55  2.55  Mig. Hernán.  1.11  1.11  1.13 
Aut. Barcelona  1.12  1.12  1.12  Murcia  2.40  2.40  2.40 
Aut. Madrid  1.36  1.36  1.36  Oviedo  3.05  3.05  3.05 
Barcelona  1.32  1.32  1.42  Pab. Olavide  1.00  1.00  1.00 
Burgos  1.23  1.23  1.23  País Vasco  1.57  1.57  1.57 
Cádiz  1.38  1.38  1.38  Pol. Cartag.  1.28  1.28  1.28 
Cantabria  1.51  1.51  1.53  Pol. Catalun.  1.00  1.00  1.00 
Carlos III  1.32  1.32  1.32  Pol. Madrid  1.77  1.77  1.77 
Cast. Mancha  2.20  2.20  2.20  Pol. Valencia  1.49  1.58  1.58 
Comp. Madrid  2.14  2.14  2.14  Pomp. Fabra  1.00  1.00  1.00 
Córdoba  1.92  1.96  1.96  Púb. Navarra  1.68  1.68  1.68 
Extremadura  2.70  2.70  2.70  Rey J. Carlos  2.43  2.43  2.43 
Girona  1.73  1.73  1.73  Rov. Virgili  1.00  1.00  1.00 
Granada  1.60  1.79  1.79  Salamanca  1.82  1.82  1.82 
Huelva  1.25  1.25  1.25  Santiago  1.45  1.58  1.58 
Illes Balears  1.43  1.43  1.43  Sevilla  1.61  1.69  1.69 
Jaén  2.09  2.09  2.22  UNED  1.91  1.91  1.91 
Jaume I  1.55  1.55  1.55  Valencia  2.63  2.66  2.66 
La Laguna  3.47  3.47  3.47  Valladolid  2.36  2.54  2.54 
La Rioja  1.14  1.14  1.14  Vigo  1.54  1.54  1.54 
Las Palmas  3.85  3.85  3.85  Zaragoza  1.96  1.96  1.96 
5 CONCLUSIONS AND FUTURE RESEARCH
We propose in this paper a new methodology for selecting variables in DEA models based on a measure of the contribution of the variables to the efficiency scores of the DMUs. A Monte Carlo simulation has been used to determine a suitable value for the minimum value that the load of the variables in the model must have. In this way a useful tool for decision makers is provided to test the role of a variable in a DEA model.
A crossvalidation of the results obtained with the proposed methodology was carried out through the correlation coefficients between the different models obtained.
We have illustrated our procedure through one classical example in DEA: the Tokyo libraries data set, and using a new data set related to Spanish universities. In both cases, step by step results are provided.
At http://knuth.uca.es/shiny/DEA/ an online interactive application is available. Some data example are ready to load to play with the software. Also, the user can upload its own data set and apply the algorithms proposed in this paper. It is our purpose to prepare a package for R for publication in R public repositories as free software.
As we have discussed above, we will try to compare technologies based on productivity rankings with DEA model, to compare Spanish public universities according to its research results.
The validation of the algorithms proposed through an extensive simulation and its development for other DEA models are some of the future lines of work.