Objective functions used as performance metrics for hydrological models: state-of-the-art and critical analysis

Hydrological models (HMs) can be applied for different purposes, and a key step is model calibration using objective functions (OFs) to quantify the agreement between observed and calculated discharges. Fully understanding the OFs is important to properly take advantage of model calibration and interpret the results. This study evaluates 36 OFs proposed in the literature, considering two watersheds with different hydrological regimes. For each watershed, the analysis used a daily simulated streamflow time-series from a distributed hydrological model (MGB-IPH) and ten synthetic daily streamflow time-series generated from the observed and calculated streamflows. The synthetic data were used to evaluate how each metric assesses hypothetical cases that present isolated, well-known error behaviors. Although all NSE-derived (Nash-Sutcliffe efficiency) metrics that use the square of the residuals in their formulation showed higher sensitivity to errors in high flows, those that use daily and monthly averages of flow rates in absolute terms were more stringent than the others in assessing HM performance. Low-flow errors were better evaluated by metrics that use the logarithm of the flow. The constant presence of zero flow rates deteriorates these metrics significantly, with the exception of TRMSE (transformed root mean square error), which did not show this problem. An observed limitation of the formulations of some metrics was that errors of overestimation and underestimation compensate each other. Our results reassert that each metric should be interpreted with the aspects it was proposed for in mind, and that considering a set of metrics simultaneously leads to a broader evaluation of HM ability (e.g. multiobjective model evaluation).
We recommend the use of synthetic time-series such as those proposed in this work as an auxiliary step towards better understanding the evaluation of a calibrated hydrological model for each study case, taking into account model capabilities and observed hydrologic regime characteristics.


INTRODUCTION
Hydrologic models (HMs) are used to represent hydrological processes in order to obtain information for water resources planning and management. They enable a rapid response to several scenarios and assist decision-making processes regarding land use change, climate variability, and water-intensive scenarios, among others, for water resources in a given region (Tucci, 2005;Beven, 2012).
HMs usually need to be calibrated to be useful in solving practical problems. During the calibration process, parameter values are defined to enable the model to closely match the behavior of the real-world system. Model calibration partially compensates for different types of hydrological uncertainties, such as those associated with input data, hydrological processes, mathematical formulation of the hydrologic model, spatiotemporal discretization, and observations (Efstratiadis & Koutsoyiannis, 2010;Beven, 2012).
The response of the hydrological system is commonly represented by observed streamflow time-series. Thus, during calibration of HMs, observed and calculated hydrographs are compared at points along the drainage network. More recently, efforts have been made to combine such analysis together with other hydrological process variables such as evapotranspiration (Herman et al., 2018), soil moisture (Rajib et al., 2016) and surface temperature (Zink et al., 2018). However, comparison between observed (Qo) and calculated (Qc) streamflows predominates as the most widely used approach (Troin et al., 2015;Zhang et al., 2016;Molina-Navarro et al., 2017).
One approach for HM calibration is manual calibration, based on manually changing model parameters and visually comparing observed and calculated hydrographs. This is an intuitive way to judge the quality of fit and is even preferred by many users (Pappenberger & Beven, 2004), being in fact the most widely used approach (Boyle et al., 2000). This procedure uses the experience of the hydrologist to assess several aspects of the similarity between observed and calculated hydrographs, such as peak flows, peak times, rising and recession limbs, drought flows, and flood durations. However, the choice of one among many different parameter sets is subjective, as it depends on personal preferences for giving more weight to peak flow or drought errors (Krause et al., 2005; Garcia et al., 2017), even when a model that represents the overall behavior of the observed hydrographs is intended. Another notable shortcoming is that the manual search for optimal parameters explores the parameter space poorly.
Automatic calibration is a second approach for HM calibration. It uses metrics to mathematically assess the degree of agreement between Qo and Qc. Each metric weights the error between Qo and Qc, considering a specific mathematical formulation that must be minimized or maximized as an objective function (OF) of an optimization problem. Manual calibration could also be performed by manually varying model parameters and evaluating model performance by inspecting such metrics.
Metrics such as the correlation coefficient (r), coefficient of determination (r²), root mean square error (RMSE), and Nash-Sutcliffe efficiency (NSE) are the most widely used (Gupta et al., 2009; Westerberg et al., 2011; Wohling et al., 2013). Coefficients such as r and r² evaluate the collinearity between Qo and Qc, while metrics such as RMSE measure the mean error between Qc and Qo in the flow unit itself. Metrics such as NSE assess the HM performance against a baseline model represented by the mean of all streamflow observations.
As each metric weights the error between Qo and Qc in different ways, its formulation and selection criteria should be considered for the correct interpretation of results. An HM may be applied for different purposes, which means that the ability of an HM to reproduce different aspects of the observed streamflow regime may vary in relevance for a given application (Garcia et al., 2017). For example, an HM developed for estimating water availability in semiarid climate regions should be evaluated for its ability to represent the drought period. On the other hand, an HM for flood warning must be evaluated regarding its capability to simulate high streamflows.
The use of HMs is increasing mainly due to the development of user-friendly interfaces, the integration and automation of data preparation steps within Geographic Information Systems, and the inclusion of automatic calibration modules. All of this speeds up the application of HMs, but it means that less attention and time is dedicated to critical appraisal of the data, evaluation of the calibration process, and analysis of overall HM results. One of the usually neglected steps is ensuring the correct selection of OFs for HM calibration. The calibration process requires other issues to be addressed, such as the size of observed streamflow time-series (e.g. Li et al., 2010;Nelson et al., 2017), the mathematical method of searching for the optimal parameters set (e.g. Bravo et al., 2009), and the computational cost involved (Gutierrez et al., 2019).
In the literature, dozens of metrics are used as OFs for HM calibration (e.g. Legates & McCabe Junior, 1999;Krause et al., 2005;Moriasi et al., 2007;Gupta et al., 2009;Muleta, 2012;Romanowicz et al., 2013;Wohling et al., 2013;Fowler et al., 2018). This large number of alternatives contrasts with the repeated use of a small set of metrics in current HM calibration approaches. Often, such use is made without criteria, which may lead to mistaken conclusions about the HM performance.
This study assesses 36 metrics that have been proposed in the literature for HM calibration by comparing calculated and observed hydrographs.

Each metric selected for this study has a specific formulation that differs from the others and has been explicitly adopted in one or more model calibration applications according to the cited references. Some metrics nevertheless present strong similarities. This review of metrics is one of the contributions of this research. Moreover, the similarities or differences found in our results may help readers to better understand which metrics behave similarly to each other.
In addition, an analysis of each metric is carried out in order to verify how it is influenced by errors in several components of the calculated hydrographs (e.g. errors in the drought season, errors in the rainy season, magnitude of the error). Ten synthetic streamflow time-series were generated and evaluated by each metric, in order to see how each one assesses hypothetical cases that present isolated, well-known error behaviors. Two large-scale Brazilian watersheds with contrasting characteristics (perennial vs intermittent streamflows) form the case study.

METRICS AS OBJECTIVE FUNCTIONS IN HYDROLOGIC MODEL CALIBRATION
A total of 36 metrics commonly used as OFs for HM calibration were identified and selected from an extensive literature review (Table 1). This list cannot be considered exhaustive, and other metrics not included in it have been used in specific analyses during HM calibration (e.g. the Richards-Baker flashiness index; Parker et al., 2019).
Each OF listed in Table 1 is presented with its mathematical formulation and its minimum, maximum and optimal values. The following section discusses the main issues related to each OF, presenting several references for further details.
Metrics r and r² are some of the most commonly used in several scientific areas and evaluate the degree of linear association and dispersion between two datasets (e.g. Qo and Qc).
Table 1. Metrics used to assess the performance of hydrological models.
Columns: Name (Symbology); Mathematical formulation; Min, Max, Optimal; Units; References.
Entries: Linear correlation; Nash-Sutcliffe efficiency (NSE); NSE with calendar day mean calculated on log transformed daily flows (LNSD); High flow (HF); Mean error (ME); Mean square error (MSE); Root mean square error (RMSE); Transformed root mean square error (TRMSE); Modification of RMSE to high flow errors (NHF). References: Gupta et al. (1999); Romanowicz et al. (2013).

NSE is one of the most widespread OFs adopted for HM calibration. Metrics such as NSE assess the HM performance against a baseline model represented by the mean of all streamflow observations. An adaptation of NSE is the NSE calculated with the logarithm of the daily streamflows (LNS). In this way the oversensitivity of NSE to extreme values is reduced and the sensitivity for lower values is increased (Krause et al., 2005). Another adaptation of NSE with this aim is the modified form of the NSE (MNS), computed with the absolute value of the linear difference between Qo and Qc (Krause et al., 2005).
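The three NSE variants described above can be illustrated with a minimal Python sketch. It assumes the usual textbook forms of NSE and MNS; the function names and the small offset `eps` added before taking logarithms (so that zero flows are defined) are illustrative assumptions, not the formulations of Table 1.

```python
import math

def nse(qo, qc):
    """Nash-Sutcliffe efficiency: 1 - squared errors / squared deviations from mean(qo)."""
    mean_qo = sum(qo) / len(qo)
    sse = sum((o - c) ** 2 for o, c in zip(qo, qc))
    ref = sum((o - mean_qo) ** 2 for o in qo)
    return 1.0 - sse / ref

def lns(qo, qc, eps=0.01):
    """NSE on log-transformed flows; eps is an assumed offset so that log(0) is defined."""
    return nse([math.log(o + eps) for o in qo], [math.log(c + eps) for c in qc])

def mns(qo, qc):
    """Modified NSE: absolute differences replace squared differences."""
    mean_qo = sum(qo) / len(qo)
    abs_err = sum(abs(o - c) for o, c in zip(qo, qc))
    ref = sum(abs(o - mean_qo) for o in qo)
    return 1.0 - abs_err / ref

# Toy values (not data from the study): a series that doubles every observation.
qo = [1.0, 2.0, 3.0, 4.0, 5.0]
doubled = [2.0 * q for q in qo]
print(nse(qo, doubled), mns(qo, doubled), lns(qo, doubled))
```

For this toy series the squared formulation penalizes the doubling much more heavily than the absolute or logarithmic ones, which is the behavior the text attributes to NSE.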
Other modifications to NSE are related to alternative benchmark models used instead of the mean of all streamflow observations (e.g. Schaefli & Gupta, 2007; Krause et al., 2005; Muleta, 2012). One of these metrics measures HM performance relative to a reference model given by the interannual calendar day mean (named NSD by Muleta, 2012). Similar to LNS, the LNSD (NSE with calendar day mean calculated on log transformed daily streamflows) was proposed. Following MNS, the MNSD (modified form of NSE with calendar day mean) uses the absolute value of the linear differences. The NSE that uses the calendar monthly mean streamflow as a reference model (NSM) was used for daily HM calibration. Also derived from NSE, the Persistence Index (PI) uses previously observed values as the reference model, which is appropriate in a streamflow forecasting context (Bennett et al., 2013). This index measures the relative magnitude of the residual variance against the variance of errors obtained by a persistence model (Gupta et al., 1999).
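The benchmark idea shared by NSD, NSM and PI can be sketched as a single function that takes the reference series as an argument. This is a hedged illustration: `benchmark_efficiency` and `persistence_index` are hypothetical names, and the one-step lag used for the persistence model is an assumption.

```python
def benchmark_efficiency(qo, qc, qref):
    """1 - SSE(model) / SSE(reference); 0 means no better than the reference."""
    sse_model = sum((o - c) ** 2 for o, c in zip(qo, qc))
    sse_ref = sum((o - r) ** 2 for o, r in zip(qo, qref))
    return 1.0 - sse_model / sse_ref

def persistence_index(qo, qc, lag=1):
    """Reference model = observation `lag` steps earlier (lag=1 is an assumption)."""
    return benchmark_efficiency(qo[lag:], qc[lag:], qo[:-lag])

# Toy values (not study data): two "years" of three days each.
qo = [1.0, 4.0, 2.0, 3.0, 8.0, 2.0]
qc = [1.5, 5.0, 2.0, 2.5, 7.0, 2.0]
# Calendar-day mean reference: average of the observations on the same day of each year.
qref_day = [2.0, 6.0, 2.0, 2.0, 6.0, 2.0]

nsd_like = benchmark_efficiency(qo, qc, qref_day)
pi_value = persistence_index(qo, qc)
print(nsd_like, pi_value)
```

Passing a series of calendar monthly means as `qref` would give an NSM-like score in the same way.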
Table 1. Continued.
Entries: Modification of RMSE to low flow errors (NLF), Fenícia et al. (2007); Sum of squared errors of the logarithm of streamflows (SLOGQ), Hogue et al. (2000); Sum of squared errors of daily streamflows (SSEQ), Wohling et al. (2013); Sum of squared errors of monthly streamflows normalized by basin area (SSEMQ).

The HF (high flow) metric was proposed to evaluate the performance of an HM in reproducing peak streamflow values (Rwetabula et al., 2012). Willmott's index of agreement (D) was proposed to overcome the limitation of r² related to poor HMs that consistently overestimate or underestimate the observations (Muleta, 2012). Another metric that resembles NSE is the Kling-Gupta Efficiency (KGE). This is an adaptation and at the same time a decomposition of NSE, which facilitates the analysis of the relative importance of its different components (correlation, bias, and a variability measure, α) in the context of HM calibration (Gupta et al., 2009). According to Pechlivanidis et al. (2012), the KGE sees the calibration process from a multi-objective optimization perspective. A modification of KGE has also been proposed by Pool et al. (2018), aiming at a non-parametric calibration criterion.
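The decomposition behind KGE can be made concrete with a short sketch. It assumes the commonly cited form of Gupta et al. (2009), in which KGE is one minus the Euclidean distance of (r, α, bias ratio) from the ideal point (1, 1, 1). The name `bias_ratio` is used here for the ratio of means to avoid confusion with the normalized bias β.

```python
import math

def kge_components(qo, qc):
    """Return (r, alpha, bias_ratio): correlation, ratio of std. deviations, ratio of means."""
    n = len(qo)
    mo, mc = sum(qo) / n, sum(qc) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in qo) / n)
    sc = math.sqrt(sum((c - mc) ** 2 for c in qc) / n)
    cov = sum((o - mo) * (c - mc) for o, c in zip(qo, qc)) / n
    return cov / (so * sc), sc / so, mc / mo

def kge(qo, qc):
    """KGE = 1 - distance of the three components from their ideal value of 1."""
    r, alpha, bias_ratio = kge_components(qo, qc)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (bias_ratio - 1) ** 2)

# Toy values (not study data): doubling every observation.
qo = [1.0, 2.0, 3.0, 4.0, 5.0]
doubled = [2.0 * q for q in qo]
print(kge_components(qo, doubled), kge(qo, doubled))
```

For the doubled series the correlation term is perfect, so the whole penalty comes from the variability and bias terms, which is the kind of diagnosis the decomposition is meant to support.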
The normalized bias of flows (β) is the difference between the mean calculated and mean observed flows (Qc and Qo), normalized by the standard deviation of the observed flows (Wohling et al., 2013).
In contrast to the metrics that follow NSE-like formulations, there are metrics based on the direct difference between Qo and Qc, which are therefore referred to as error-type metrics. Examples of this group are mean error (ME), mean absolute error (MAE), mean absolute relative error (MARE), mean square error (MSE), root mean square error (RMSE), and transformed root mean square error (TRMSE). ME is the average of the time-series of errors, so it identifies whether the HM tends to overestimate or underestimate streamflows. However, it does not quantify overestimation and underestimation errors separately, since they offset each other. Although the other metrics in this group do not compensate positive and negative errors as ME does, their values do not indicate whether the HM overestimates or underestimates the observations. MAE quantifies the average of the time-series of absolute values of the errors, while MARE quantifies the average of a time-series of absolute values of the error relative to the observed streamflow. MSE averages the time-series of squared errors, avoiding the error compensation of ME, but making the metric's value difficult to interpret as it is expressed in a different unit (i.e. (m³/s)²). RMSE overcomes this limitation of MSE by taking the square root of MSE. TRMSE uses a Box-Cox transformation of the streamflow to quantify the RMSE. The Box-Cox transformation, in addition to emphasizing low-flow periods, also reduces the impact of heteroscedasticity in the RMSE calculation (Hogue et al., 2000; Kollat et al., 2012).
Other metrics are derived from RMSE, such as RSR (ratio between RMSE and the standard deviation of the streamflow observations; Moriasi et al., 2007), and NHF (modification of RMSE for increasing sensitivity to high-flow errors) and NLF (modification of RMSE for increasing sensitivity to low-flow errors), both presented by Fenícia et al. (2007).
The SLOGQ (sum of squared errors of the logarithm of streamflows) metric is a function selected for the calibration of parameters that influence the hydrograph recessions (Hogue et al., 2000). The SSEQ (sum of squared errors of daily streamflows) and SSEMQ (sum of squared errors of monthly streamflows normalized by basin area) metrics, although they do not calculate averages, are similar to MSE because they represent a sum of squared deviations and are expressed in units different from those of the variable under analysis, which makes interpretation difficult (Wohling et al., 2013).
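A minimal sketch of the error-type metrics follows, with a toy series whose errors alternate in sign. Two points are assumptions rather than the formulations of Table 1: the sign convention (calculated minus observed) and the Box-Cox form z = ((1 + Q)^λ - 1)/λ with λ = 0.3 used for TRMSE.

```python
import math

def errors(qo, qc):
    # Assumed sign convention: a positive error means overestimation.
    return [c - o for o, c in zip(qo, qc)]

def me(qo, qc):
    e = errors(qo, qc)
    return sum(e) / len(e)

def mae(qo, qc):
    e = errors(qo, qc)
    return sum(abs(x) for x in e) / len(e)

def mse(qo, qc):
    e = errors(qo, qc)
    return sum(x * x for x in e) / len(e)

def rmse(qo, qc):
    return math.sqrt(mse(qo, qc))

def box_cox(q, lam=0.3):
    # Assumed transform; the shift by 1 keeps zero flows at zero.
    return ((1.0 + q) ** lam - 1.0) / lam

def trmse(qo, qc, lam=0.3):
    """RMSE computed on Box-Cox transformed flows."""
    return rmse([box_cox(o, lam) for o in qo], [box_cox(c, lam) for c in qc])

# Toy values (not study data): errors of +1, -1, +1, -1.
qo = [1.0, 2.0, 3.0, 4.0]
qc = [2.0, 1.0, 4.0, 3.0]
print(me(qo, qc), mae(qo, qc), mse(qo, qc), rmse(qo, qc), trmse(qo, qc))
```

Here ME is zero although no single value is reproduced correctly, while MAE, MSE and RMSE all report an error of one unit; this is the compensation effect described for ME.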
The discrepancy between peak flow values is quantified by the MAXAE (maximal absolute error), which has the disadvantage of being subject to a time-interval error (Janssen & Heuberger, 1995). Metric DHQMAX (maximum difference in the largest peak flows) uses a timeless relationship to quantify the difference between maximum observed and calculated streamflows. Both metrics are directly related to errors in peak streamflows (Wohling et al., 2013).
ΔV (relative volume error) is usually called Bias and is the mean error between observed and calculated streamflows expressed as a fraction of the average observed streamflows (Rwetabula et al., 2012). It is commonly recommended for quantifying water balance errors (Rientjes et al., 2013) and indicates whether the model is poor in representativeness (Moriasi et al., 2007;Van Liew et al., 2007). VE (volumetric efficiency), on the other hand, evaluates the deviation between observed and calculated hydrographs by measuring the area between them, expressed as a fraction of the average observed streamflows (Criss & Winston, 2008). ROCE (runoff coefficient percent error) metric considers water balance as the average annual runoff coefficient percent error. As presented in Table 1, the sum occurs during years 1 to k of the calibration period, for which an average annual value is then calculated (Kollat et al., 2012).
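The contrast between ΔV and VE can be sketched as follows. The code assumes the usual definitions (signed and absolute differences summed and divided by the observed volume); the function names and the sign convention are illustrative.

```python
def relative_volume_error(qo, qc):
    """Delta V: signed sum of (Qc - Qo) as a fraction of the observed volume."""
    return sum(c - o for o, c in zip(qo, qc)) / sum(qo)

def volumetric_efficiency(qo, qc):
    """VE: 1 minus the absolute area between the hydrographs over the observed volume."""
    return 1.0 - sum(abs(c - o) for o, c in zip(qo, qc)) / sum(qo)

# Toy values (not study data).
qo = [1.0, 2.0, 3.0, 4.0]
alternating = [2.0, 1.0, 4.0, 3.0]   # errors +1, -1, +1, -1
halved = [q / 2.0 for q in qo]       # systematic underestimation

print(relative_volume_error(qo, alternating), volumetric_efficiency(qo, alternating))
print(relative_volume_error(qo, halved), volumetric_efficiency(qo, halved))
```

With alternating errors ΔV is zero while VE is 0.6, so only VE registers the mismatch; with systematic halving both metrics register it.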
Other metrics combine previously presented metrics to measure more than one aspect, such as Y (combined form of NSE and ΔV; Akhtar et al., 2009) and RV (combined form of NSE and MARE, weighted by a parameter ω; Lindström et al., 1997). The best results for RV are obtained with ω equal to 0.1, according to the applications of the HBV hydrologic model by Lindström et al. (1997) and Dakhlaoui et al. (2012).
Finally, SFDCE (slope of the streamflow duration curve) and SDCI (streamflow duration curve index) metrics refer to the comparison between the calculated and observed streamflow duration curves. SFDCE represents the error in simulating the slope of the streamflow duration curve (Westerberg et al., 2011;Kollat et al., 2012). SDCI evaluates the similarity between the observed and calculated streamflow duration curves from the sum of the differences between all the points that define the curves (Tucci, 2005).
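A streamflow duration curve and a simple slope measure can be sketched as below. This is not the SFDCE or SDCI formulation of Table 1: the nearest-rank quantile, the 33% and 66% exceedance bounds, and all function names are assumptions made for illustration.

```python
def flow_duration_curve(q):
    """Return (exceedance probability, flow) pairs with flows in descending order."""
    ranked = sorted(q, reverse=True)
    n = len(ranked)
    return [((i + 1) / (n + 1), flow) for i, flow in enumerate(ranked)]

def flow_at_exceedance(q, p):
    """Flow exceeded a fraction p of the time (nearest rank, no interpolation)."""
    ranked = sorted(q, reverse=True)
    idx = min(len(ranked) - 1, max(0, round(p * (len(ranked) + 1)) - 1))
    return ranked[idx]

def fdc_slope(q, p_hi=0.33, p_lo=0.66):
    """Drop in flow per unit of exceedance probability over the mid-segment (assumed bounds)."""
    return (flow_at_exceedance(q, p_hi) - flow_at_exceedance(q, p_lo)) / (p_lo - p_hi)

# Toy series of 99 values (not study data).
qo = [float(v) for v in range(1, 100)]
shifted = [v + 10.0 for v in qo]    # constant upward shift
doubled = [2.0 * v for v in qo]     # constant scaling

print(fdc_slope(qo), fdc_slope(shifted), fdc_slope(doubled))
```

A constant shift leaves the slope unchanged, whereas scaling changes it, so a slope-based metric cannot detect a uniform offset of the hydrograph.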

METHODOLOGY
The metrics shown in Table 1 were applied to assess the performance of the calculated and synthetic streamflow time-series in the Piancó River and Furnas subcatchments relative to the observed streamflows. This procedure aims to evaluate how metrics are influenced by the quality of the synthetic streamflow time-series, and also to compare metrics from synthetic time-series to metrics from a calculated streamflow time-series obtained from a calibrated HM. Based on the results of this procedure, a critical analysis of each metric was carried out, indicating recommendations for use and limitations.
A four-step procedure was used and is described in the following sections: 1) metrics selection, 2) data collection from two case studies, 3) definition of ten synthetic streamflow time-series, and 4) test analysis and results.
Ferreira et al.

Metrics selection
A total of 36 metrics commonly used as OFs for HM calibration were identified and selected from an extensive literature review, considering issues related to their frequency of use; whether they are modifications, adaptations or combinations of preexistent metrics; or comprise new concepts, as explained in the previous section and summarized in Table 1. It is worth mentioning that we have not used these metrics for model calibration, but rather to provide the evaluation of the output of a hydrologic model previously calibrated and also of hypothetical cases based on synthetic time series, as further detailed.

Case studies
Two case studies were selected based on data availability and the existence of previously calibrated HMs, with distinct hydrological regimes (intermittent or perennial rivers) and drainage area, in order to provide a broader picture regarding the results and findings.
The first case study is a subcatchment of the Piancó (drainage area of 4,603.39 km²), located in the Piancó River basin (Figure 1B) in northeast Brazil. This is a semiarid region with a large number of intermittent rivers. This study used daily time-series of observed streamflow from 1970 to 2011 (42 years) from Felix & Paz (2016), in which the MGB-IPH model (Collischonn et al., 2007) was applied to the subcatchment.
The hydrological regime of the Piancó River is characterized by strong seasonality, with monthly streamflow ranging from 405.49 to 0.09 m³/s in the rainy season (January to May) and typically zero flows in the driest months. The river was dry in 37% of the daily time intervals of the time-series, and the driest year was 1980, with 79% of the days without streamflow. The MGB-IPH model was calibrated and validated by Felix & Paz (2016) for the periods 1970-1990 and 1991-2011, respectively, through an automatic multi-objective calibration procedure (using NSE, LNS and ΔV as OFs), followed by a manual calibration refinement in order to obtain more representative and coherent parameters between different hydrological response units. A total of 11 parameters were calibrated, as detailed in the cited reference.
The second case study covers the Furnas subcatchment, with a drainage area of 51,784.41 km², located in the Grande River Basin in southeastern Brazil, in the Paraná hydrographic region (Figure 1C). This basin is widely used for hydroelectric power generation (Tucci et al., 2008). This subcatchment was modeled by Bravo et al. (2009), who applied the MGB-IPH model using daily streamflow data from 1981 to 2001 (21 years). The hydrological regime in Furnas is strongly seasonal, ranging from 350 m³/s during low flows to over 2,000 m³/s in summer, with flood peaks typically reaching 4,000 m³/s (Bravo et al., 2009). These authors applied the MGB-IPH for the period 1970-1980 during calibration. Validation was carried out for the period 1981-2001. As with the first case study, calibration of the MGB-IPH to the Furnas subcatchment was performed through an automatic multi-objective calibration procedure with the same OFs, considering a total of 10 parameters, as detailed in the cited reference.
For both case studies, the version of the MGB-IPH model that adopts a square-grid discretization was used, as presented in Collischonn et al. (2007). The Piancó subcatchment was divided into 151 cells of approximately 5 x 5 km and the Furnas subcatchment was discretized into 519 cells of roughly 10 x 10 km. This model was selected because it has achieved satisfactory results in several applications in different hydrologic regimes (e.g. Oliveira et al., 2018; Pereira et al., 2014; Paiva et al., 2013; Ribeiro Neto et al., 2006; Tucci et al., 2005) and because previous works by the authors were available. In fact, this study could have been performed considering the outputs of any calibrated hydrologic model.

Synthetic streamflow time-series
Eleven daily streamflow time-series were used in the analysis carried out in each watershed.
One of these time-series is the daily calculated streamflow (Qhid) from the previous studies of Felix & Paz (2016) for the Piancó subcatchment, and Bravo et al. (2009) for the Furnas subcatchment. The Qhid time-series provided the basis for developing the synthetic time-series and also served as a comparison for them, as detailed below.
Ten synthetic daily streamflow time-series were generated based on the calculated and observed values in each watershed, as a result of idealized error behavior in hypothetical cases (Figure 2). The general idea is simple and practical: to analyze how each metric evaluates hypothetical cases that present isolated, well-known error behaviors. We want to assess whether the metric is able to detect the known error or treats the case as a perfect model; whether there is a compensation effect between systematic errors and a perfect match in distinct time periods; how much the metrics penalize each type of well-known error or reward each type of perfect model capability; and how the evaluation of these hypothetical cases compares with that of an actual, typical output of a calibrated hydrologic model. In some cases, these synthetic time-series represent exaggerated systematic errors or model capabilities that do occur when calibrating a hydrological model, but at lower intensity and not isolated from other errors.
For example, the synthetic time-series Qox2 (Qo/2) shows in each time interval a streamflow value equal to twice (half) the observed one. These time-series were proposed to detect how each metric evaluates a hypothetical case that systematically calculates double or half of the discharge in each time step. They are perfect models in terms of predicting the timing of recessions and peak flows, for instance. We also intended to analyze whether each metric evaluates the actual calibrated model as better or worse than these Qox2 and Qo/2 hypothetical cases.
Two other synthetic time-series were based on the use of the Q50 (median of the observed streamflow time-series), which was equal to 0.23 m³/s in the Piancó subcatchment and to 703 m³/s in the Furnas subcatchment. Thus, the synthetic time-series Qo+Q50 (Qo-Q50) shows a streamflow value that is equal to the observed one plus (minus) Q50 in each time interval. If the resulting value of the streamflow for Qo-Q50 was less than zero in a time interval, it was considered as zero. The Qo+Q50 and Qo-Q50 time-series represent hypothetical cases that systematically shift the observed hydrograph up or down, respectively, by a constant value.
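The scaled and shifted synthetic series can be generated in a few lines. The sketch below follows the definitions in the text (doubling, halving, adding or subtracting the median, and setting negative values to zero); the function name and the toy input are illustrative.

```python
import statistics

def scaled_and_shifted_series(qo):
    """Build Qox2, Qo/2, Qo+Q50 and Qo-Q50 from the observed flows qo."""
    q50 = statistics.median(qo)
    return {
        "Qox2": [2.0 * q for q in qo],
        "Qo/2": [q / 2.0 for q in qo],
        "Qo+Q50": [q + q50 for q in qo],
        # Negative results are set to zero, as stated in the text.
        "Qo-Q50": [max(q - q50, 0.0) for q in qo],
    }

# Toy intermittent series (not study data); its median is 0.2.
qo = [0.0, 0.0, 0.2, 1.0, 5.0]
series = scaled_and_shifted_series(qo)
for name, values in series.items():
    print(name, values)
```

On days with zero observed flow, Qo-Q50 equals Qo, so in an intermittent river this series is not a pure constant shift of the hydrograph.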
Two synthetic time-series combine calculated and observed streamflows over different time periods within the year. The wet Qo time-series represents a hypothetical case that perfectly reproduces observed values during the wet period, while maintaining the typical error of a calibrated model during the dry period. Analogously, the dry Qo time-series represents a hypothetical case that perfectly reproduces observed flows during the dry period and presents the typical error during the wet period.
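The construction of these two series can be sketched as a splice between observed and calculated flows. How the wet and dry periods were delimited is not stated here, so the month-based rule below (January to May, the rainy season quoted for the Piancó) and the name `splice_by_season` are assumptions.

```python
def splice_by_season(qo, qhid, months, wet_months):
    """Return (wet_qo, dry_qo): observed flows replace Qhid in one season only."""
    wet_qo = [o if m in wet_months else h for o, h, m in zip(qo, qhid, months)]
    dry_qo = [h if m in wet_months else o for o, h, m in zip(qo, qhid, months)]
    return wet_qo, dry_qo

# Toy values (not study data), one value per listed month.
months = [1, 3, 5, 7, 9, 11]
qo = [10.0, 30.0, 12.0, 1.0, 0.0, 0.5]
qhid = [8.0, 33.0, 15.0, 0.5, 0.2, 0.5]

wet_qo, dry_qo = splice_by_season(qo, qhid, months, wet_months={1, 2, 3, 4, 5})
print(wet_qo)   # observed in the wet months, Qhid elsewhere
print(dry_qo)   # Qhid in the wet months, observed elsewhere
```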
The last four synthetic time-series were based on mean values derived from the observed streamflow time-series. The idea is to have hypothetical cases that conservatively predict streamflow following the historic pattern, using mean values computed in different ways. The synthetic time-series of the mean Qo has the same streamflow value on every day, equal to the mean observed streamflow. The monthly-mean synthetic time-series shows the same streamflow value on each day of a given month, equal to the mean observed streamflow of that specific month. In this way, the streamflow values differ between months within a given year and from one year to another. Finally, any daily streamflow in the synthetic time-series day Qo is equal to the mean observed streamflow derived from the data for that day in all years of the observed time-series. The value for a given calendar day is therefore the same in every year.
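The mean-based series can be sketched with one grouping helper. The choice of grouping keys reflects one reading of the text and is an assumption: (year, month) for the monthly series and (month, day) for the calendar-day series. The helper names are hypothetical.

```python
from datetime import date, timedelta

def mean_series(qo):
    """Constant series equal to the overall mean of qo."""
    m = sum(qo) / len(qo)
    return [m] * len(qo)

def group_mean_series(qo, keys):
    """Replace each value by the mean of all values sharing the same key."""
    totals, counts = {}, {}
    for q, k in zip(qo, keys):
        totals[k] = totals.get(k, 0.0) + q
        counts[k] = counts.get(k, 0) + 1
    return [totals[k] / counts[k] for k in keys]

# Toy record (not study data): 2000 and 2001, flow = month number, plus 1 in 2001.
start = date(2000, 1, 1)
dates = [start + timedelta(days=i) for i in range(731)]
qo = [float(d.month) + (1.0 if d.year == 2001 else 0.0) for d in dates]

overall = mean_series(qo)
monthly = group_mean_series(qo, [(d.year, d.month) for d in dates])
calendar_day = group_mean_series(qo, [(d.month, d.day) for d in dates])
print(monthly[0], monthly[366], calendar_day[0], calendar_day[366])
```

In this toy record the monthly series differs between January 2000 and January 2001, while the calendar-day series gives the same value for 1 January in both years.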

Performance of the synthetic streamflow time-series
Results of the performance assessment of calculated and synthetic time-series by the 36 selected metrics are discussed below (Table 2 and Figure 3). The synthetic streamflow time-series Qox2 represents the output of a hypothetical case that always doubles the observed values. Such a time-series presents perfect linear correlation with Qo and, therefore, the r and r² metrics reached the maximum value, superior to the Qhid performance for both basins, as expected. For the Furnas subcatchment, all other metrics assessed Qox2 performance as inferior to that obtained with Qhid. Due to intermittence and very low streamflows in the Piancó subcatchment, however, metrics that use the logarithm of streamflows (e.g. LNS, LNSD and SLOGQ) assessed Qox2 performance as much better than Qhid, which showed difficulty in representing low streamflow values (Figures 3D, 3G and 3W). Furthermore, the TRMSE metric assessed Qox2 performance as higher than Qhid, as this metric uses a transformation of the streamflows that expands the lower end of the scale and thus gives higher emphasis to recessions (Figure 3U). All other metrics assessed Qox2 performance as inferior to that obtained with Qhid in the Piancó subcatchment (Table 2).
The Qox2 performance in both subcatchments was lower when assessed by the NSE, NSD, NSM, PI, HF, D, KGE, and RSR metrics, which use the square of the residual in their formulation. Similar lower performance results were obtained by error-type metrics, whether they compute squared, absolute, or linear errors, as with ME, MAE, MARE, MSE, RMSE, NHF, NLF, SSEMQ, SSEQ, ROCE, DHQMAX and MAXAE. These metrics are sensitive to systematic overestimation of streamflows, especially during floods, whether or not the river is intermittent. The MARE metric shows higher values in most of the synthetic time-series when compared to the MARE obtained with Qhid. Its values were very high for the Piancó River subcatchment (Figure 3R) due to recurrent zero streamflows. This was also reflected in the RV values, as RV is a MARE-dependent metric.
The synthetic streamflow time-series Qo/2 is similar to Qox2 and represents a streamflow value that is equal to half the observed one, in each time interval. For this reason, the r and r² metrics had the maximum value for Qo/2 in both subcatchments, as expected (Figures 3A and 3B). For Furnas, the MAXAE metric showed a lower value for Qo/2 than for Qhid, meaning the HM outputs are lower than half of the Qhid values in some time intervals during flood periods (Table 2). For the Piancó River subcatchment, the performance results for Qo/2 were quite different from the results for the Furnas subcatchment, except for r and r². The performance of the Qo/2 time-series was assessed as better than the Qhid performance by more than half of the metrics. Among the metrics that did not follow this behavior are ME, ∆V, KGE and metrics based on streamflow duration curves, such as SFDCE and SDCI (Figures 3P, 3AD, 3O, 3AI, and 3AJ). These latter metrics did not perform satisfactorily for both subcatchments.
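The reason r and r² reach their maximum for both Qox2 and Qo/2 is that the Pearson correlation is unchanged by multiplying a series by a positive constant. A minimal check on toy values (not study data):

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

qo = [0.5, 3.0, 12.0, 7.0, 1.0]
qox2 = [2.0 * q for q in qo]
qo_half = [q / 2.0 for q in qo]

for name, qs in (("Qox2", qox2), ("Qo/2", qo_half)):
    r = pearson_r(qo, qs)
    print(name, round(r, 6), round(r * r, 6))
```

Both series score a perfect correlation even though every value is wrong by a factor of two, which is why r and r² alone cannot detect systematic over- or underestimation.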
The performance of two synthetic time-series that increased (Qo+Q50) or decreased (Qo-Q50) by a constant quantity (Q50) the observed streamflow values was assessed. Both time-series  (  showed better performance than Qhid in the Furnas subcatchment when assessed by r, r 2 , MAXAE, DHQMAX, and SFDCE, while α showed a better performance only for Qo+Q50. The result for r and r 2 are the same as the other time-series that present a linear correlation with Qo. For the other mentioned metrics, even with a high Q50 value in the Furnas subcatchment (Q50 = 703 m 3 /s, which is 72% of the average daily flow and 9.4% of the maximum daily flow), this was not enough to cause peak streamflow errors (which are the focus of MAXAE and DHQMAX) greater than those in Qhid time-series. The performance of both time-series was optimal when assessed by the SFDCE metric since the slope of the streamflow duration curves is exactly the same as the observed one ( Figure 3AI). In the case of the Piancó River subcatchment, Q50 is extremely low (Q50 = 0.23 m 3 /s, equivalent to 1.4% of the daily average streamflow and < 0.02% of the daily maximum), making the Qo+Q50 and Qo-Q50 time-series very similar to Qo. Thus, most of the metrics showed superior performance on these synthetic time-series when compared to Qhid, except for the ME, ∆V, and β, while SFDCE and SDCI showed superior performance only for Qo-Q50. The slightly superior performance of Qo-Q50 compared to Qo+Q50 was because of the occurrence of zero streamflow values, meaning that for these days Qo-Q50 = Qo. The effect of having streamflows equal to zero in the observed time-series is also responsible for the slope of the duration curve of these synthetic time-series being different from the observed one. Thus, SFDCE metrics did not reach the ideal value in this subcatchment, as occurred for Furnas subcatchment.
The errors in the low streamflows in the synthetic time-series Qo/2 and Qo-Q50 in the Piancó River subcatchment did not affect the performance assessed by the LNS, LNSD, LOGQ, and TRMSE metrics (Figures 3D, 3G, 3W, and 3U). In the Furnas subcatchment, by contrast, the Qo-Q50 synthetic time-series presents higher errors, which reduced its performance when assessed by those metrics.
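To illustrate why logarithm-based metrics respond to low-flow errors, the sketch below compares NSE with an NSE computed on log-transformed flows. This is a common way of defining a log efficiency and may differ in detail from the LNS formulation used in this study; the small constant `eps`, added so that zero flows can be transformed, is an assumption, and the flow values are hypothetical.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency (1 is a perfect fit)."""
    mean_o = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - sse / sum((o - mean_o) ** 2 for o in obs)

def log_nse(obs, sim, eps=0.01):
    """NSE on log-transformed flows; eps is an assumed offset for zero flows."""
    log_o = [math.log(o + eps) for o in obs]
    log_s = [math.log(s + eps) for s in sim]
    return nse(log_o, log_s)

# Hypothetical flows: the simulation is exact for high flows
# and halves the three lowest flows.
qo = [100.0, 60.0, 20.0, 4.0, 2.0, 1.0]
sim_low_err = [100.0, 60.0, 20.0, 2.0, 1.0, 0.5]

nse_val = nse(qo, sim_low_err)
lns_val = log_nse(qo, sim_low_err)
print(f"NSE = {nse_val:.4f}, log NSE = {lns_val:.4f}")
```

In this example NSE is almost unaffected by the low-flow errors, while the log-based value drops noticeably.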
The synthetic streamflow time-series wet Qo and dry Qo represent the output of a hypothetical case that has no error in the wet (dry) periods and keeps the error of the adjusted HM in the opposite period. Both should therefore be evaluated as having a better performance than Qhid. Indeed, the performance of both time-series was assessed as better than that of Qhid by almost all metrics. Only two metrics assessed dry Qo as having the same performance as Qhid: MAXAE and DHQMAX (Table 2). As these two metrics assess the largest error during floods (wet period), the maximum flood error found in dry Qo was equal to the one in Qhid (Figures 3AB and 3AC).
Several metrics that compensate for positive and negative errors assessed wet Qo and dry Qo as having a lower performance than Qhid in the Piancó River subcatchment. This highlights the negative aspect of such metrics (e.g. ME, ∆V and β): errors of overestimation and underestimation cancel each other out. As a result, a hypothetical case that reproduces the wet season exactly but has errors during the dry period may present lower performance, when assessed by these metrics, than another hypothetical case that also shows errors in the wet period (the same would occur with the wet and dry periods exchanged). For example, the ME of Qhid in the Piancó River subcatchment was -0.14 m³/s, while the ME of wet Qo was -0.80 m³/s and the ME of dry Qo was 0.66 m³/s (note that the sum of the latter two MEs is equal to the ME of Qhid). Since the metric Y combines ∆V with NSE (which does not compensate errors) in its formulation, this effect was not predominant, but it still led Y to assess dry Qo as having lower performance than Qhid. It is important to emphasize that this result is site-dependent, as the adjusted HM could show a distinct error behavior in wet and dry periods (e.g. if only positive or only negative errors occur in both periods, there will be no compensation effect).
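The compensation effect can be reproduced with a small numerical sketch. The flows, the simulated series and the wet/dry split below are hypothetical; the mean error is assumed to be simulated minus observed, and the mean absolute error is included only as an example of a metric that does not compensate errors.

```python
def mean_error(obs, sim):
    """Mean error, assumed here as mean(simulated - observed)."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)

def mean_abs_error(obs, sim):
    """Mean absolute error; positive and negative errors do not cancel."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)

# Hypothetical case: the model overestimates wet flows and underestimates dry flows.
qo = [50.0, 80.0, 60.0, 5.0, 3.0, 2.0]
qhid = [56.0, 83.0, 66.0, 1.0, 0.5, 0.5]
wet = [True, True, True, False, False, False]

# wet Qo: no error in the wet period, model error kept in the dry period.
wet_qo = [o if w else s for o, s, w in zip(qo, qhid, wet)]
# dry Qo: no error in the dry period, model error kept in the wet period.
dry_qo = [s if w else o for o, s, w in zip(qo, qhid, wet)]

me_hid = mean_error(qo, qhid)
me_wet = mean_error(qo, wet_qo)
me_dry = mean_error(qo, dry_qo)
mae_hid = mean_abs_error(qo, qhid)
mae_wet = mean_abs_error(qo, wet_qo)
mae_dry = mean_abs_error(qo, dry_qo)

# Values reported in the text for the Piancó River subcatchment (m3/s).
reported_sum = -0.80 + 0.66

print(f"ME : Qhid {me_hid:.2f}, wet Qo {me_wet:.2f}, dry Qo {me_dry:.2f}")
print(f"MAE: Qhid {mae_hid:.2f}, wet Qo {mae_wet:.2f}, dry Qo {mae_dry:.2f}")
```

In this sketch the ME of Qhid is closer to zero than the ME of either synthetic series, although both are at least as close to the observations as Qhid on every day; the MAE ranks them in the expected order.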
The remaining synthetic time-series present daily streamflow values that are based on temporal averages derived from the observed time-series: Qo (mean streamflow), mean monthly streamflows, mean daily streamflows, and mean monthly streamflows by year, Qo(month, year). For the Furnas subcatchment, which has perennial rivers, these four synthetic time-series were assessed as having lower performance than Qhid by most of the metrics (Table 2). Few metrics assessed the performance of these time-series as better than Qhid: the metrics with an error-compensating effect (e.g. ME, ∆V, and β); DHQMAX, which focuses on a point error; and SDCI, which evaluates the similarity between streamflow duration curves. Thus, as these synthetic time-series are based on average values, errors in high and low values are compensated for, avoiding larger errors in higher streamflows. In addition, the Qo(month, year) time-series showed a better performance than Qhid when assessed by MAXAE, ROCE, and SFDCE, as this time-series presents a lower error in the maximum daily streamflow, in the average annual runoff coefficient, and in the slope of the streamflow duration curve.
For the Piancó River subcatchment, the performance results of the four synthetic streamflow time-series based on mean values of the observed streamflows were partially the same as in the Furnas subcatchment. The error-compensating effect of the ME, ∆V, and β metrics improved the performance of the Qo and Qo(month, year) time-series (mean streamflow and mean monthly streamflow by year) when compared to Qhid, as did the ROCE metric. However, a distinct pattern was observed in the Piancó River subcatchment for the logarithm-based metrics (e.g. LNS, LNSD, and SLOGQ). These metrics assessed the performance of only Qo(month, year) as better than Qhid. This means that the burden of the HM errors in reproducing the dry-period streamflow in the Piancó River subcatchment, with its intermittent rivers, was large enough for LNS, LNSD, and SLOGQ to assess the performance of Qhid as lower than that of a synthetic time-series with mean monthly streamflow by year. However, the Qhid performance assessed by these metrics was higher than that of the time-series based on mean streamflows, mean monthly streamflows, and even mean daily streamflows.
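Average-based series of the kind discussed above can be built by replacing each observation with the mean of its group. The sketch below is an illustration under stated assumptions: the records are hypothetical (year, month, flow) tuples, `group_mean_series` is a helper introduced here, and NSE is used only as an example metric.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency (1 is a perfect fit)."""
    mean_o = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - sse / sum((o - mean_o) ** 2 for o in obs)

def group_mean_series(records, key):
    """Replace each flow by the mean flow of the group given by key(record)."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec[2])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [means[key(rec)] for rec in records]

# Hypothetical daily records: (year, month, observed flow).
records = [
    (2001, 1, 90.0), (2001, 1, 110.0),
    (2001, 7, 8.0), (2001, 7, 12.0),
    (2002, 1, 140.0), (2002, 1, 160.0),
    (2002, 7, 3.0), (2002, 7, 7.0),
]
qo = [rec[2] for rec in records]

q_mean = group_mean_series(records, key=lambda rec: None)  # overall mean
q_month = group_mean_series(records, key=lambda rec: rec[1])  # mean by calendar month
q_month_year = group_mean_series(records, key=lambda rec: (rec[0], rec[1]))  # by month and year

nse_mean = nse(qo, q_mean)
nse_month = nse(qo, q_month)
nse_month_year = nse(qo, q_month_year)
print(f"NSE: mean {nse_mean:.3f}, monthly {nse_month:.3f}, month-year {nse_month_year:.3f}")
```

By construction the overall mean gives NSE = 0, and finer averaging windows cannot lower the NSE, which is why such series are a demanding benchmark for a calibrated model.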

Closure to response of performance metrics
It is well described in the literature that each metric used for hydrologic model calibration was proposed with a focus on one or a few aspects of the comparison between calculated and observed streamflows (e.g. Gupta et al., 1998; Wohling et al., 2013; Pushpalatha et al., 2012; Madsen, 2000). As evidenced by our results, systematic or large errors in aspects that a metric does not target may not be accounted for, or may have no significant effect on its evaluation. Users could therefore misjudge the overall behaviour of their model. For example, the correlation coefficient and the coefficient of determination evaluate the linear correlation of the data. A hypothetical time-series that systematically doubles the discharges is evaluated as perfect by those metrics, while a distributed model carefully calibrated using a state-of-the-art method does not achieve such performance, as expected. This is a classic example in the literature, but we also found other situations that are distinct from those previously discussed.
One instance is the pair of hypothetical time-series that perfectly reproduce the observed flows during the dry or wet period and behave exactly like the calibrated hydrological model in the opposite period. By construction, these hypothetical cases are better than or equal to the hydrological model throughout the year in reproducing observed flows. However, metrics that are practically restricted to assessing wet periods (Maximal absolute error and Maximum difference in the largest peak flows) were not influenced by whether the model was perfect or not during the dry period. More importantly, metrics that compensate positive and negative errors (e.g. mean error, relative volume error, the combined form of NSE and ∆V, and normalised bias of flows) may lead to the judgement that a model that is wrong in both wet and dry periods performs better than one that is wrong in just one of these periods (considering the same behaviour in the other period).
The results obtained with the hypothetical cases that reproduce temporal averages of discharges raised another interesting question: how useful is a calibrated HM that performs worse than simply taking as model prediction the monthly (or other) average discharges computed from the observed time-series of a past period? If we could simply construct such average discharge time-series, why spend time and effort developing hydrological models that perform worse? But is the calibrated HM really worse than those hypothetical time-series? Two issues need to be discussed to answer these questions.
First of all: a better or worse model for what? The purpose for which the model will be used is crucial for properly judging the usefulness of each model. For instance, using the model to estimate and manage water resources availability in dry periods and using it to estimate flood impacts of climate change scenarios require distinct model capabilities as priorities.
This discussion moves the question 'how good is a model?' towards the second point, the issue we addressed in this study: 'how good is a metric to evaluate a model?'. This second question is linked to the first one and concerns the way we evaluate model performance. The intended use of the model should always be kept in mind as a major driver for selecting metrics for model evaluation. In the first case, a model applied to managing water resources availability in dry periods, the reproduction of observed recession flows is crucial and model calibration should therefore focus on this issue. Our results recommend the use of metrics such as NSD, KGE and RV for perennial rivers and LNSD, TRMSE and Y for intermittent ones. In the second case, the estimation of flood impacts using hydrological modeling, model calibration should emphasize the adjustment of peak flows. Metrics such as MAE, RSR, ∆V and SFDCE are recommended, regardless of whether the river is intermittent or not.

CONCLUSION
This study assessed 36 metrics that are frequently used for HM calibration by comparing calculated and observed hydrographs. The analysis used daily streamflow time-series calculated by the MGB-IPH model in previous studies and ten synthetic time-series generated from the calculated and observed values to represent idealized error behavior in hypothetical cases. Two large-scale Brazilian watersheds with contrasting characteristics were adopted as case studies.
This study highlighted that knowing the limitations of, and recommendations for, a metric used as an OF is important for adequately evaluating an HM output in terms of observed flow regime reproduction. It is already known that the parameter values obtained through the calibration process are influenced by the OF selected. As the calculated streamflows depend on the parameter values, the OF must be chosen according to the intended use of the HM. The purpose for which the model will be used is decisive for properly judging the usefulness of each calibrated model.
Our results reassert that each metric should be interpreted specifically in terms of the aspects it was proposed for. In this sense, simultaneously taking into account a set of metrics would lead to a broader evaluation of HM ability. This highlights the advantages of adopting a multiobjective model evaluation that combines metrics assessing distinct aspects.
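As an illustration of evaluating a set of metrics together, the sketch below reports the correlation, a variability ratio, a bias ratio, NSE and KGE for a hypothetical doubled series (Qox2). The KGE expression is the commonly used one, 1 - sqrt((r-1)² + (α-1)² + (β-1)²) with α as the ratio of standard deviations and β as the ratio of means; the definitions of α and β adopted in this study may differ, and the function name `evaluate` and the flow values are assumptions made for the example.

```python
import math

def evaluate(obs, sim):
    """Return several metrics at once so that no single one is read in isolation."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    dev_o = [o - mean_o for o in obs]
    dev_s = [s - mean_s for s in sim]
    sd_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sd_s = math.sqrt(sum(d * d for d in dev_s) / n)
    r = sum(a * b for a, b in zip(dev_o, dev_s)) / (n * sd_o * sd_s)
    alpha = sd_s / sd_o      # variability ratio (assumed definition)
    beta = mean_s / mean_o   # bias ratio (assumed definition)
    nse = 1.0 - sum((o - s) ** 2 for o, s in zip(obs, sim)) / sum(d * d for d in dev_o)
    kge = 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
    return {"r": r, "alpha": alpha, "beta": beta, "NSE": nse, "KGE": kge}

# Hypothetical observed flows and the doubled synthetic series.
qo = [12.0, 30.0, 85.0, 40.0, 18.0, 9.0]
doubled = evaluate(qo, [2.0 * q for q in qo])

for name, value in doubled.items():
    print(f"{name:>5}: {value:.3f}")
```

Read alone, r suggests a perfect result for the doubled series; the remaining metrics in the set expose the systematic overestimation.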
For this, it is important to first understand the actual behaviour of the observed streamflows. This analysis should not be disregarded, as it is crucial for adequately interpreting the metric results of an HM evaluation.
This study supplies a guideline for the choice of OFs, while the use of synthetic time-series such as those proposed in this work could be useful as an auxiliary step towards better understanding the evaluation of a calibrated hydrological model in each study case.