On the performance of three indices of agreement : an easy-to-use r-code for calculating the Willmott indices

A key step for any modeling study is to compare modelproduced estimates with observed/reliable data. The original index of agreement (also known as original Willmott index) has been widely used to measure how well model-produced estimates simulate observed data. However, in its original version such index may lead the user to erroneously select a predicting model. Therefore, this study compared the sensibility of the original index of agreement with its two newer versions (modified and refined) and provided an easy-to-use R-code capable of calculating these three indices. First, the sensibility of the indices was evaluated through Monte Carlo Experiments. These controlled simulations considered different sorts of errors (systematic, random and systematic + random) and errors magnitude. By using the R-code, we also carried out a case of study in which the indices are expected to indicate that the AGROMETEOROLOGY Article On the performance of three indices of agreement: an easy-to-use r-code for calculating the Willmott indices Heloisa Ramos Pereira, Monica Cristina Meschiatti, Regina Célia de Matos Pires, Gabriel Constantino Blain* Instituto Agronômico Centro de Ecofisiologia e Biofísica Campinas (SP), Brazil. *Corresponding author: gabriel@iac.sp.gov.br Received: Feb. 15, 2017 – Accepted: May 29, 2017 empirical Thornthwaite’s model produces poor estimates of daily reference evapotranspiration in respect to the standard method Penman-Monteith (FAO56). Our findings indicate that the original index of agreement may indeed erroneously select a predicting model performing poorly. Our results also indicate that the newer versions of this index overcome such problem, producing more rigorous evaluations. Although the refined Willmott index presents the broadest range of possible values, it does not inform the user if a predicting model overestimate or underestimate the simulated data, resulting in no extra information regarding those already provided by the modified version. None of the indices represents the error as linear functions of its magnitude in respect to the observed process.


INTRODUCTION
The great capacity of modern personal computers has enabled the use of complex models to simulate and describe increasing numbers of natural and man-made processes.In this view, numerical models of environmental, hydrological and agro-meteorological systems have grown in number and complexity (Willmott et al. 2012).Accordingly, the evaluation of model performance, i.e., to compare model-produced estimates with observed/reliable values, is a fundamental step for model development and use (Willmott et al. 1985;Willmott et al. 2012).This validation process commonly includes a criteria definition that relies on mathematical measurements of how well modelproduced estimates simulate the observed values (Willmott et al. 1985;Krause et al. 2005;Willmott et al. 2012).Willmott (1981;1982) and Willmott and Wicks (1980) proposed and used an index of agreement -currently referred to as 'the Willmott index' , 'the original d index' or simply 'the d index' (d orig ) -that is intended to be a dimensionless measurement of model accuracy.The d orig is bounded by 0, meaning no agreement, and 1, meaning a perfect fit (Willmott 1984).Authors such as Legates and McCabe (1999) stated that d orig represented a remarkable improvement in respect to the coefficient of determination.In this view, the d orig index has been used in several meteorological, agrometeorological and hydrological studies (e.g.Wu et al. 2005;Meschiatti and Blain 2016).In spite of this widespread use, Willmott et al. (1985) noted that the use of squared differences in its calculation algorithm might result in high values of this index (d orig ≈ 1) even in the presence of large errors.In addition, sums-ofsquares-based measurements vary in response to both variability and central tendency within a set of deviations (Willmott et al. 2009).On such background, Willmott (1984) proposed a modification in the d orig index that replaces the square function by the modulus of the deviations.This modified version is frequently referred to as the modified index of agreement (d mod ).The advantages of d mod over d orig is that errors and differences are given their appropriate weighting factors (e.g.Willmott et al. 1985;Willmott et al. 2009).The d mod may be regarded as a more rigorous method than d orig (Bardin-Camparotto et al. 2013) because when these two indices are applied to the same validation process, d mod tends to approach its maximum value more slowly as the predicted values approach the observed data (Legates and McCabe 1999;Willmott et al. 2012).
In spite of the advantages of d mod over its original version, Willmott et al. (2012) stated that the overall range of d orig and d mod [0:1] is narrow to adequately represent the great variety of forms that predicted values can differ from observed data.Therefore, these authors proposed a new index, referred to as the refined index of agreement (d ref ), that is bounded by -1.0 and 1.0.Willmott et al. (2012) claimed that the d ref is more rationally related to model accuracy than d orig and d mod (Willmott et al. 2012).
Despite the above-mentioned efforts to improve the original version of the d orig index, an overview on the scientific literature indicates that the use and evaluation of d mod and d ref still need to be enhanced.This statement is particularly true for tropical developing countries that use d orig as a tool for crop modeling and other agrometeorological studies.Therefore, in order to motive the use of the two modified versions of the index of agreement, the goals of this study were (i) to evaluate and compare the performance of d orig , d mod and d ref to different sorts of errors (systematic, random and systematic+random) and (ii) to provide an easy-to-use R-code capable of calculating the three indices.

MATERIAL AND METHODS
The performance of the three indices was evaluated (i) by means of controlled Monte Carlo experiments, and (ii) by means of a well-documented case of study in which the empirical Thornthwaite's model is used to estimate daily amounts of reference evapotranspiration (ETo).As it will be further described, the indices are expected to inform the user that this empirical model tends to produce poor estimates of daily ETo values.The Willmott indices (d orig , d mod and d ref ) can be calculated as follow: (1) (2) (3) when where P and O are, respectively, the predicted and observed values and the 95% confidence interval of each index value were estimated through the bootstrap approach, as recommended by Willmott et al. (1985).

Monte Carlo Simulation: Systematic errors
All sets of Monte Carlo simulations were performed using a sample size N = 100.This N value was adopted to avoid the influence of different sample sizes on the outcomes of the simulations.Naturally, the effect of different N values on d orig , d mod and d ref should be addressed in future studies.The first set of simulations evaluated the performance of the three indices in the presence of systematic errors.In order to cover a great range of behaviors commonly found in agrometeorological variables (Blain 2014), the observed values (o i ) were generated using the gamma 2-parameter distribution (expression 4).As highlighted by authors such as Wilks (2011), the gamma distribution can assume several shapes, depending on the value of its parameters.This flexible distribution either can assume the exponential form or can approach the bell-shaped form of the Gaussian distribution (Wilks 2011).Therefore, the shape (α) and scale (β) parameters of this distribution were respectively set to the following values: G(1,30); G(1,50); G(2,30); G(2,50); G(4,30); G(4,50).Within the R-software environment, gammadistributed variables can be generated as follows.
errors".However, instead of systematic deviations, we added normally distributed random errors proportional to the process mean value, as described in expression 6.The change factor assumed the same values as those of the section "Systematic errors" (p and o have the same mean value).o = as.vector(rgamma(N,α,1/β)) (4) The predicted data (p i ) were generated from the observed values adding a systematic error proportional to the mean value of the o i series (mean (o i ); expression 5) so that the magnitude of the systematic deviations, in respect to the mean o i values, were 10%, 20% and 30%.p = as.vector(o+(mean(o)*CF)) (5.1) where CF is the change factor and it was set to the values of 0.1, 0.2 and 0.3.

Monte Carlo Simulation: Random and Systematic errors
In this section, the predicted values were generated from the same observed values described in section Systematic errors" .Both systematic and random deviations were added according to expression 7. The change factor assumed the same magnitudes as those of the two previous sections.All Monte Carlo simulations were performed considering errors equal to or lower than 30% of the process mean value.This limit was adopted based on the assumption that models leading to errors equal to or larger than this threshold would be dismissed without the need to use a measurement of agreement.A simple visual analysis of the estimated values would indicate that the prediction model is performing poorly.

Easy-to-use R-code
In order to motive the use of the Willmott's indices, we developed a computational algorithm by using the R-software, which is a free environment for graphics and statistical computing (www.r-project.org).The code was developed so that practically no previous knowledge about the software is required.Naturally, advanced users can easily modify the code according to their needs.The code is described in Table 1.As a case of study, this code was applied to compare the performance of the empirical Thornthwaite's model (TW) in estimating daily amounts of reference evapotranspiration (ETo) in Campinas, State of São Paulo, in respect to those estimated from physically-based Penman-Monteith method (Allen et al. 1998).The Climatic variability in this location is influenced by monsoon system (Carvalho et al. 2004), in which the wet season occurs during the austral summer associated with the South Atlantic Convergence Zone.In the winter, the high pressing system of the South Atlantic leads to climatically dry conditions.Regarding the performance of the two above-mentioned models, it is well documented that the TW model is not suited for estimating ETo values at daily scale (Carvalho et al. 2011).On the other hand, due to its solid physical fundamentals, the PM model (Allen et al. 1998)

RESULTS AND DISCUSSION
The analysis of the Monte Carlo simulations must first take into account an important difference between d orig and the other two versions of the Willmott's indices.As can be noted from Equations 1, 2 and 3, only the original index (Equation 1) squares the errors (o i -p i ).However, when applied to large errors magnitude, the square function increases the influence of these deviations on the sum-of-squared errors (Willmott 1982;1984;Willmott et al. 1985;Legates and McCabe 1999;Willmott et al. 2012).In practical terms, the result of such feature is that high d orig values may be obtained in the presence of relatively large errors.Considering all Monte Carlo Simulations performed in this study, no d orig value lower than 0.80 was observed.This statement holds for all sorts of errors evaluated in this study (Figures 1 to 4), indicating that relatively high d orig values may be observed in the presence of a prediction model performing poorly (Willmott et al. 1985;Legates and McCabe 1999;Willmott et al. 2012;Bardin-Camparotto et al. 2013).Therefore, the findings of this study support the statement that both d ref and d mod are more rigorous than d orig , and that they should be preferred over their original version.
The results of the Monte Carlo simulation also indicated that errors with the same magnitude but opposite signs (expressions 4.1 against 4.2; 5.1 against 5.2; 6.1 against 6.2) lead both d mod and d orig to assume a unique value.This feature  The outcomes of the Monte Carlo Simulations also indicated that the three indices may assume different values for a particular errors magnitude (Figures 1 to 4) in respect to the process mean value.For instance, considering a gamma distribution with shape parameter equal to 1 and scale parameter equal to 30 [G(1,30)], d orig , d mod and d ref respectively ranged from ~0.95 to ~098, ~0.78 to ~0.85 and ~0.78 to ~0.85 when the systematic errors were set to 20% (Figure 1).This feature implies that none of the three indices can be directly used to evaluate the average error of the predicted/estimated values in respect to the real/ observed process average value.This statement is further supported by the fact that the three indices are affected by the parameters of the distribution from which the observed data were drawn.As previously described, d mod ranged from ~0.78 to ~0.85 when the systematic error were set to 20% and the Gamma distribution G(1,30) were used to generate the observed data.Considering the same errors magnitude, d mod ranged from ~0.63 to ~0.77 when G(4,50) were used to generate the observed data.This latter statement holds for the three indices and all sorts of errors.

Case Study
It is worth mentioning that atmospheric water demand is mainly driven by the following variables: incoming solar radiation, air temperature, wind speed and vapor pressure deficit (Vicente-Serrano et al. 2014; among many others).The PM equation has two components (energetic and aerodynamic) that seek to represent the contribution of all these variables to a given ETo value.In general, the energetic component contribution to an ETo daily amount is higher than the aerodynamic factor   Matsoukas et al. 2011).However, the significance of this latter component to a particular daily ETo value tends to increase in dry regions and/or during dry period/seasons (Matsoukas et al. 2011;McVicar et al. 2012).In other words, a relative decrease in the energetic component tends to be associated with an increase in the significance of the aerodynamic component.Finally, it is also worth emphasizing that the TW model considers only variables related to the energetic component of the atmospheric water demand, i.e., air temperature and those related to the photoperiod.On such theoretical background and considering the climate conditions of Campinas, the level of agreement between ETo daily amounts derived from both TW and PM models is expected to vary over the seasons, reaching its lower level during the austral winter (dry) season.Likewise, a particular index used to evaluate model performance is expected to indicate that the TW model cannot be applied to estimate ETo daily amounts in Campinas-SP.The seasonal variability of the three indices calculated through the R-code is consistent with the above-mentioned theoretical background.In addition, the results depicted in Figure 5 also agree with those of the Monte Carlo simulations that indicated that both d ref and d mod are more rigorous than d orig , and hence they should be preferred over their original version.This statement is particularly true for the summer season when, considering the 95% confidence interval, the results of Figure 5

FINAL REMARKS
Considering the errors magnitude adopted in the Monte Carlo Simulations as well as the case of study, our findings indicate that the original version of the Willmott index may lead the user to erroneously select a predicting model that generates poor estimates.This statement is consistent with previous studies.Our results also indicate that the two newer versions of this index (modified and refined) overcome such problem, leading to more rigorous evaluations of the predicting models.Therefore, they should be preferred over the original version.
Although the refined Willmott index presents the broadest range of possible values [-1.0:1.0], it does not inform the user if a predicting model overestimate or underestimate the simulated data.Therefore, it added no extra information in respect to those already provided by the modified version of the agreement index.Naturally, this statement holds for the simulations and case of study carried out in this paper.None of the indices represents the error as linear functions of its magnitude in respect to the observed process.

Figure 1 .
Figure 1.Performance of the original (d orig black dots), modified (d mod red dots) and refined (d ref blue dots) indices of Agreement subjected to (positive) systematic errors.The observed values were generated from a 2-parameter gamma distribution [G(.)] with shape parameter set to 1 and 4 and scale parameter set to 30 and 50.

Figure 4 .
Figure 4. Performance of the original (d orig black dots), modified (d mod red dots) and refined (d ref blue dots) indices of Agreement subjected to (negative) random+systematic errors.The observed values were generated from a 2-parameter gamma distribution [G(.)] with shape parameter set to 1 and 4 and scale parameter set to 30 and 50.
indicated that d orig may reach values as high as 0.80 in the presence of absolute mean errors larger than 0.80 mm•day -1 .Considering that d orig =1 indicates a perfect model, the original version of the Willmott index may lead the user to erroneously accept that the TW model is (at least) suitable for estimating daily ETo amounts in Campinas during summer seasons.On the other hand, the upper limits (95% confidence interval) of all d mod values remained lower than 0.60 in all seasons.Naturally, the d mod values are consistent with the above-mentioned theoretical background, indicating

Figure 5 .
Figure 5. Original (d orig black dots), modified (d mod red dots) and refined (d ref blue dots) indices of Agreement (left axis).The blue bars (right axis) represent the Absolute Mean Error (AME; a and b) and the ratio between AME and the mean value of the daily process [AME/Mean(o)].The dashed lines represent the 95% confidence interval.
that the TW model cannot be applied to estimate ETo daily amounts in Campinas-SP, regardless the season.For the summer season, d ref presented similar values as those shown by d mod .However, d ref assumed negative values in the winter of 2008.Considering that the Monte Carlo Simulations indicated that a particular d ref does not inform the user if a predicting model overestimate or underestimate the simulated data, this negative value only indicates that the TW model had its worst performance in the winter of 2008.Finally, none of the three indices can be regarded as linear functions of AME/Mean(PM).

Table 1 .
R-code for calculating the Willmott's indices.
is regarded as the standard method for estimating EP amounts at daily scale.Therefore, the Willmott's indices should indicate significant differences between ETo values estimated from these two methods.The Willmott indices were applied to daily ETo values (December 2007 to November 2009; Campinas-SP) within each season -Summer (December to February), Fall (March to May), Winter (June to August) and Spring (September to November).
Willmott et al. (2012)exemplified in Figure4.This is a natural behavior of indices developed to vary between zero and one, using the modulus and/or the square function in their calculation algorithm.On the other hand, considering that d ref was developed to vary between -1.0 and 1.0, one could expect that errors with same magnitude but opposite signs would lead to different d ref values.However, as demonstrated inWillmott et al. (2012), negative values of d ref only indicate large predictive errors.This is the reason why Monte Carlo simulations revealed that the three indices presented similar behaviors in such cases (Figures3 and 4).Therefore, as d mod and d orig , a given d ref does not indicate that a predicting model overestimate or underestimate the simulated data.
ref Figure 2. Performance of the original (d orig black dots), modified (d mod red dots) and refined (d ref blue dots) indices of Agreement subjected to (positive) random errors.The observed values were generated from a 2-parameter gamma distribution [G(.)] with shape parameter set to 1 and 4 and scale parameter set to 30 and 50.Figure 3. Performance of the original (d orig black dots), modified (d mod red dots) and refined (d ref blue dots) indices of Agreement subjected to (positive) random+systematic errors.The observed values were generated from a 2-parameter gamma distribution [G(.)] with shape parameter set to 1 and 4 and scale parameter set to 30 and 50.