PHYSICAL ANALYSIS OF REGIONALIZED FLOW AS AN AID IN THE IDENTIFICATION OF HYDROLOGICALLY HOMOGENEOUS REGIONS

Regionalization is an important technique for estimating the flow of hydrographic sections with a lack of data. First, it is necessary to identify hydrologically homogeneous regions (HHRs), which are commonly validated via statistical analyses. Because this step is understood to be subjective, studies that contribute to a greater reliability in identifying regions are needed. In this context, the objective was to evaluate the inclusion of a physical analysis of the average regionalized flow rates as an aid to identify HHRs. The groupings were defined on the basis of geographical convenience methods and cluster analysis. For the assessment of regionalized flows, six statistical indices were used with a physical analysis that was performed via a comparison of the runoff coefficient to the spatial distribution of precipitation values. It was concluded that the physical analysis reduced the subjectivity in the identification of HHRs.


INTRODUCTION
The study of flows is the basis for important decisions in the planning and management of water resources (Coxon et al., 2015; Westerberg et al., 2014) and for solving environmental and engineering problems, such as sizing structures for water use and control, economic evaluation of flood protection projects, planning and management of land use, and water quality control, among others (Agarwal et al., 2016). Flow data are restricted to places where stream gauge stations are available (Pruski et al., 2016), and these stations do not always cover the regions of interest (Nruthya & Srinivas, 2015). For these reasons, tools for flow prediction are necessary (Kim et al., 2016).
The regionalization of flows comprises techniques used to overcome the lack of hydrological data in places where they are scarce or nonexistent (Kult et al., 2014; Beskow et al., 2014; Liew & Mittelstet, 2018). Although there is no standard methodology for regionalization in the literature (Razavi & Coulibaly, 2013), regression models are the most widely used and consist of estimating flows based on equations that relate hydrological information of interest to catchment characteristics (Arsenault & Brissette, 2014).
The regionalization of stream gauge indices is based on the premise that basins with a similar climate, geology, topography, vegetation and soils have similar hydrological responses; however, river basins with large drainage areas may have distinct hydrologic behaviors along their hydrography (Smakhtin, 2001). Thus, to ensure greater security and predictive reliability (Nathan & Mcmahon, 1990;Smakhtin, 2001), a step in the study of flow regionalization consists of dividing the studied area into regions with a similar behavior (Hosking & Wallis, 1997), which are known as hydrologically homogeneous regions (HHRs).
In the literature, many methods to identify HHRs are mentioned, such as cluster analysis, residual standard, seasonality indices, classification and regression trees, canonical correlation analysis, entropy and geographical convenience. However, this is a subjective and difficult step, as there is no agreement on an objective technique for the determination of these regions (Hosking & Wallis, 1997), and its validation is usually based on statistical analyses only (Nathan & Mcmahon, 1990).
Engenharia Agrícola, Jaboticabal, v.40, n.3, p.334-343, may/jun. 2020
In addition to defining HHRs, another difficulty with respect to regionalization is that the use of regression models is not recommended beyond the limits of the sample data (Naghettini & Pinto, 2007). In practice, this means that flow regionalization conducted using stream gauge stations associated with large drainage areas is restricted to a small part of the hydrography, making the planning and management of water resources more difficult (Silva Júnior et al., 2003). In this sense, associating the regionalization procedure with an understanding of the physical behavior of the process is important for extracting more information from the available data (Pruski et al., 2013; Pruski et al., 2015).
This study was developed under the assumption that the physical behavior of regionalized flows, together with statistical analysis, may help to identify HHRs, lessening the uncertainties associated with flow regionalization. The aim, therefore, was to evaluate the incorporation of a physical analysis into average regionalized flows as an aid to identify HHRs.

Study area and data used
The study region corresponds to part of the Doce River basin in Minas Gerais (Figure 1), which covers an area of 71724 km² and represents 86% of the total basin area. The pluviometric regime of the Doce River basin is characterized by two quite distinctive periods: a rainy season from October to March, with the highest rates during December, and a dry season from April to September (Louzada et al., 2018).
In this study, we used data from 1975 to 2005 from 38 stream gauge stations and 80 pluviometric stations that belong to the hydro-meteorological network of the Hydrological Information System (Hidroweb) of the Brazilian National Water Agency (ANA) in addition to data from 1961 to 1990 from 14 climatological stations that are part of the station network of the Brazilian National Institute of Meteorology (INMET). The study stations are presented in Figure 1.

HHR identification
To identify HHRs, we used two methods: geographical convenience and cluster analysis. For this paper, identification of an HHR using geographical convenience was performed considering the basin precipitation map together with statistical analyses of regression models and the behavior of regionalized flows to attempt to reduce delimitation mistakes.
For application of cluster analysis, it is necessary to choose the variables, agglomeration method and dissimilarity measure, all of which have a strong influence on the results. Cluster analysis was conducted using different combinations of latitude, longitude, long-term average rainfall and real evapotranspiration variables to characterize their influence on the identification of HHRs and thus on the regionalized flows. Latitude and longitude were used to obtain geographically continuous regions (Rao & Srinivas, 2006), and the long-term average precipitation and real evapotranspiration were used to represent the input and output, respectively, of water in the hydrographic basin as they are usually the main components of the water balance.
Long-term average precipitation (P) was obtained from the analyzed historical series and spatialized over the basin by interpolation using the inverse distance weighting (IDW) method, as shown in Figure 2a.
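The IDW interpolation mentioned above can be sketched as follows. This is a minimal illustration only: the exponent of 2, the station coordinates and the precipitation values are assumptions, since the text does not state the IDW settings that were used.

```python
import math

def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at the point (x, y).

    stations: list of (xi, yi, value) tuples.
    power: distance exponent; 2 is an assumed, commonly used value.
    """
    num = 0.0
    den = 0.0
    for xi, yi, value in stations:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return value  # the target point coincides with a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical stations: (x, y, long-term average precipitation in mm)
stations = [(0.0, 0.0, 1200.0), (10.0, 0.0, 1400.0), (0.0, 10.0, 1000.0)]
p_mid = idw(stations, 5.0, 0.0)
```

Because the weights are positive and sum to one after normalization, an IDW estimate always lies between the smallest and largest station values, which is one reason the resulting maps are smooth and cannot extrapolate beyond the observed range.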
Evapotranspiration was calculated using the Penman-Monteith FAO 56 method, as described by Allen et al. (1998), to obtain the monthly reference evapotranspiration (ETo) for each climatological station. ETo was then interpolated using the IDW method. To estimate the real evapotranspiration (ETr) at every pluviometric station, we used the climatological water balance (CWB) method of Thornthwaite & Mather (1955), adopting an available water capacity of 100 mm (Passos et al., 2017). The yearly average ETr at the pluviometric station locations was obtained by summing the monthly amounts. By interpolating these values using the IDW method, we obtained a spatialized map of the yearly average ETr (Figure 2b).
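A simplified sketch of a Thornthwaite and Mather style monthly water balance is given below to show how ETr can follow from precipitation, ETo and the 100 mm available water capacity. The initialization (soil starting full, year repeated a few times) and the monthly values are assumptions made for illustration; they are not taken from the study.

```python
import math

def thornthwaite_mather(p, eto, awc=100.0, spinup_years=5):
    """Monthly climatological water balance, returning monthly ETr in mm.

    p, eto: 12 monthly precipitation and reference ET values (mm).
    awc: available water capacity (mm); 100 mm follows the text.
    spinup_years: assumed number of repetitions of the year, so that
    the result does not depend on the starting storage.
    """
    storage = awc
    etr = [0.0] * 12
    for _ in range(spinup_years):
        for m in range(12):
            diff = p[m] - eto[m]
            if diff >= 0.0:
                # Wet month: ET is not limited and the surplus refills the soil.
                storage = min(awc, storage + diff)
                etr[m] = eto[m]
            else:
                # Dry month: storage decays exponentially with the
                # accumulated potential water loss (negative value).
                apwl = awc * math.log(storage / awc) + diff
                new_storage = awc * math.exp(apwl / awc)
                etr[m] = p[m] + (storage - new_storage)
                storage = new_storage
    return etr

# Hypothetical monthly values (mm), January to December
p = [200, 150, 120, 60, 30, 10, 10, 15, 40, 100, 180, 220]
eto = [120, 110, 110, 90, 70, 60, 65, 80, 90, 110, 115, 120]
etr = thornthwaite_mather(p, eto)
annual_etr = sum(etr)  # yearly ETr as the sum of the monthly amounts
```

In wet months ETr equals ETo, while in dry months it equals precipitation plus the water withdrawn from the soil, so ETr never exceeds ETo.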
Cluster analysis is dependent on the units and scales in which the descriptive variables are measured (Nathan & Mcmahon, 1990); thus, the variables were rescaled according to [eq. (1)]. For the cluster analysis, the non-hierarchical K-means method was used, which requires the number of groups to be specified a priori. As this value was not known in advance, we used four group validation indices to estimate the number of groups: Calinski-Harabasz (Calinski & Harabasz, 1974), Dunn (Dunn, 1974), Silhouette width (Rousseeuw, 1987) and Xie-Beni (Xie & Beni, 1991).
The dissimilarity measure adopted in this study was the generalized Euclidean distance.
FIGURE 2. Average long-term precipitation (a) and average annual evapotranspiration (b) of the Doce River Basin.
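The clustering step described above can be illustrated with the following sketch. Several parts are assumptions: min-max rescaling stands in for eq. (1), which is not reproduced here; the K-means start is a deterministic farthest-point rule chosen only to make the example repeatable; only the Silhouette width is computed, not all four validation indices; and the station values are hypothetical.

```python
import math

def rescale(rows):
    """Min-max rescale each column to [0, 1] (assumed stand-in for eq. 1)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def kmeans(points, k, iters=100):
    """Lloyd's K-means with an assumed deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def silhouette_width(points, labels):
    """Mean silhouette width; single-member clusters contribute 0."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(own) / len(own)
        b = min(sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
                / labels.count(c)
                for c in clusters if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical stations: (latitude, longitude, long-term precipitation in mm)
stations = [
    (-19.0, -42.0, 1200.0), (-19.1, -42.1, 1210.0),
    (-19.0, -42.2, 1190.0), (-19.2, -42.0, 1205.0),
    (-20.5, -43.5, 1500.0), (-20.6, -43.4, 1490.0),
    (-20.4, -43.6, 1510.0), (-20.5, -43.3, 1495.0),
]
points = rescale(stations)
scores = {k: silhouette_width(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

For these two well-separated hypothetical groups, the Silhouette width is highest for two clusters. Without the rescaling step, precipitation in millimeters would dominate the distances and latitude and longitude would have almost no influence.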

Flow regionalization
Parametric regression is one of the most widely used methods in regionalization studies (Pruski et al., 2013). To relate the flow to the basin characteristics, a power function is usually used (Samuel et al., 2011).
The dependent variable used in flow regionalization was the long-term average flow (Qmlt), which represents the potential water availability of the basin. The independent variable used [eq. (2)] was the flow equivalent to the precipitated volume after subtraction of the precipitation abstraction factor for flow formation (Novaes, 2005; Pruski et al., 2013; Gonçalves et al., 2018). This factor accounts for the part of the rain that does not reach the stream, mainly because of evapotranspiration, so [eq. (2)] is an indirect estimate of the mean runoff in the watercourses. The value of 750 mm was used because it produced better results in previous studies in the Doce River basin (GPRH & IGAM, 2012).

$$Peq_{750} = \frac{(P - 750)\,A}{31536} \qquad (2)$$

Where: Peq750 = flow equivalent to the precipitated volume, less the 750 mm abstraction, in m³ s⁻¹; P = average annual precipitation in the considered drainage area in mm; A = drainage area in km²; and 31536 = a constant that converts P to meters, A to square meters, and the time step from years to seconds.
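The unit conversion behind the constant 31536 can be checked with a short sketch. The function name and the example precipitation and area are hypothetical; the formula is the one implied by the variable definitions above.

```python
# 1 mm = 1e-3 m, 1 km2 = 1e6 m2, 1 year = 31,536,000 s,
# so mm * km2 / year = 1e3 / 31,536,000 m3/s = 1 / 31536 m3/s.
CONVERSION = 31536.0

def equivalent_flow(p_mm, area_km2, abstraction_mm=0.0):
    """Flow (m3/s) equivalent to the annual precipitated volume,
    after subtracting an abstraction depth in mm."""
    return (p_mm - abstraction_mm) * area_km2 / CONVERSION

# Hypothetical section: P = 1300 mm, A = 1000 km2
peq750 = equivalent_flow(1300.0, 1000.0, abstraction_mm=750.0)
peq = equivalent_flow(1300.0, 1000.0)
```

With these hypothetical values, Peq750 is about 17.4 m³ s⁻¹, against about 41.2 m³ s⁻¹ when no abstraction is subtracted.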
The use of Peq750 allows one to relate the precipitation and drainage areas as a single variable that has the advantage of a two-dimensional representation of the relationship between the dependent and independent variables. In addition, another benefit is obtaining an additional degree of freedom for statistical analysis, reducing the variance of the estimates for an equation with a single explanatory variable compared to an equation with two variables (Pruski et al., 2013).
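A single-variable power model of the form Qmlt = a·Peq750^b can be fitted as sketched below. Fitting by ordinary least squares on log-transformed data is an assumption made for this example, as the fitting procedure is not described in the text, and the data are synthetic.

```python
import math

def fit_power(x, y):
    """Fit y = a * x**b by least squares on log-transformed data.

    Returns (a, b, adjusted R2), with R2 computed in log space for a
    model with one explanatory variable.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    ln_a = my - b * mx
    ss_res = sum((v - (ln_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - my) ** 2 for v in ly)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return math.exp(ln_a), b, r2_adj

# Synthetic data generated from Q = 0.5 * Peq750**0.9 (not from the study)
peq750 = [10.0, 50.0, 200.0, 1000.0]
qmlt = [0.5 * v ** 0.9 for v in peq750]
a, b, r2_adj = fit_power(peq750, qmlt)
```

With one explanatory variable, the adjusted R² uses n − 2 residual degrees of freedom; a model with precipitation and area as separate variables would use n − 3, which is the extra degree of freedom mentioned above.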
To obtain the total drainage area of each river section, we used the Hydrographically Conditioned Digital Elevation Model (HCDEM) generated from contour lines produced by the Brazilian Institute of Geography and Statistics (IBGE) and the mapped hydrography at a scale of 1:100,000 and, when available, 1:25,000.

Statistical behavior analysis of the regionalized flows
An analysis of the fit of the regression model to the sample data was conducted for each HHR based on the coefficient of determination (R²). A comparison of the efficiency of the methods for identifying HHRs from the flow estimates was based on five statistical indices: (a) the amplitude of the relative error (AE), which is the difference between the largest and smallest relative error values (Pruski et al., 2013); (b) the average relative error (ERm); (c) the Nash-Sutcliffe efficiency with logarithmic data (E1) (Krause et al., 2005; Samuel et al., 2011); (d) the relative Nash-Sutcliffe efficiency (E2) (Krause et al., 2005); and (e) the modified Nash-Sutcliffe efficiency (E3) (Krause et al., 2005).
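The five indices can be computed as in the sketch below, which follows the usual forms of the logarithmic, relative and modified (exponent 1) Nash-Sutcliffe efficiencies. Two points are assumptions: relative errors are expressed in percent of the observed value, and ERm is taken as the mean of the absolute relative errors. The flow values are hypothetical.

```python
import math

def regionalization_indices(obs, est):
    """Return AE, ERm, E1, E2 and E3 for observed and estimated flows."""
    n = len(obs)
    rel = [100.0 * (e - o) / o for o, e in zip(obs, est)]  # percent
    mean_obs = sum(obs) / n
    ln_obs = [math.log(o) for o in obs]
    ln_est = [math.log(e) for e in est]
    mean_ln = sum(ln_obs) / n
    e1 = 1.0 - (sum((lo - le) ** 2 for lo, le in zip(ln_obs, ln_est))
                / sum((lo - mean_ln) ** 2 for lo in ln_obs))
    e2 = 1.0 - (sum(((o - e) / o) ** 2 for o, e in zip(obs, est))
                / sum(((o - mean_obs) / mean_obs) ** 2 for o in obs))
    e3 = 1.0 - (sum(abs(o - e) for o, e in zip(obs, est))
                / sum(abs(o - mean_obs) for o in obs))
    return {
        "AE": max(rel) - min(rel),            # amplitude of relative error
        "ERm": sum(abs(r) for r in rel) / n,  # assumed: mean absolute value
        "E1": e1,                             # log-transformed efficiency
        "E2": e2,                             # relative efficiency
        "E3": e3,                             # modified efficiency
    }

# Hypothetical observed and estimated long-term average flows (m3/s)
obs = [10.0, 20.0, 40.0]
est = [11.0, 18.0, 40.0]
idx = regionalization_indices(obs, est)
```

Here the relative errors are +10%, −10% and 0%, so AE is 20 percentage points. The three efficiencies equal 1 for a perfect fit and decrease as the estimates depart from the observations.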

Physical behavior analysis of the regionalized flow
Regionalization models obtained for each HHR allow one to estimate the Qmlt along the hydrography. However, performing a physical examination of the estimated absolute flows is a difficult task because of their wide interval of variation. To address this problem, the physical behavior analysis of Qmlt was performed considering the runoff coefficient (RC), obtained using the following equation:

$$RC = \frac{Q_{mlt}}{Peq}$$

Where: Peq = flow equivalent to the precipitated volume in m³ s⁻¹.
The physical behavior evaluation was conducted by analyzing the RC spatial distribution and checking the trend of this variable in relation to the precipitation map. In addition, the mean RC values were compared with the mean precipitation, and the estimated average RC was compared with the range observed in the HHRs.
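As an illustration, the runoff coefficient can be read as the ratio between Qmlt and the flow equivalent to the full precipitated volume (no abstraction subtracted). The function name and numbers below are hypothetical.

```python
def runoff_coefficient(qmlt, p_mm, area_km2):
    """RC = Qmlt / Peq, with Peq = P * A / 31536 in m3/s."""
    peq = p_mm * area_km2 / 31536.0
    return qmlt / peq

# Hypothetical section: Qmlt = 12 m3/s, P = 1300 mm, A = 1000 km2
rc = runoff_coefficient(12.0, 1300.0, 1000.0)
```

Because RC is dimensionless and normally lies between 0 and 1, it can be compared across sections with very different drainage areas, which is not possible with the absolute flows.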
At the same time, a safety analysis associated with the obtained estimates was proposed to mitigate the uncertainty associated with the flow regionalization, especially in extrapolation areas at the limit of the validity of the regression equation.
Safety analysis was performed by means of a security histogram built in two stages. The first refers to the construction of an individual histogram for each of the identified HHRs. The intervals of each individual histogram were defined on the basis of the minimum and maximum RC values observed at the stream gauge stations of the HHR under analysis, considering that the maximum physical limit of the RC is 1. It is noteworthy that the RC may be greater than one if subsurface flows exist between two or more adjacent catchments (Şen & Altunkaynak, 2006), as in karstic regions, which is not the case for the Doce River basin. The second stage consisted of joining the hydrography sections of the different HHRs that fall within the same individual histogram intervals. In this way, one can build a final histogram for each grouping.
An estimated RC value lower than the lowest RC value observed at the stream gauge stations is physically possible and is on the safe side for the planning and management of water resources, since it underestimates the RC relative to the values observed at the stations.
Estimated RC values within the interval bounded by the minimum and maximum RC values observed at the stream gauge stations are physically acceptable and likely.
Estimated RC values between the maximum value observed at the stream gauge stations and the maximum physically possible value (RC = 1) characterize an interval whose values are considered to be unreliable for the planning and management of water resources, as they may lead to an overestimation of the RC in the considered section. An interval consisting of RC values greater than one comprises the fourth class of the relative risk histogram for Qmlt and characterizes a behavior that is defined, in this study, as physically unacceptable.
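The four classes of the security histogram can be expressed as a simple classification rule, sketched below. Treating each hydrography section as one unit in the percentages is an assumption (the text does not say whether sections are weighted, for example by length), and the estimated RC values are hypothetical; the observed limits of 0.17 and 0.60 are those reported for the stream gauge stations of the study area.

```python
from collections import Counter

def classify_rc(rc, rc_min_obs, rc_max_obs):
    """Assign an estimated RC to one of the four security classes."""
    if rc > 1.0:
        return "physically unacceptable"
    if rc > rc_max_obs:
        return "potentially overestimated"
    if rc >= rc_min_obs:
        return "acceptable/likely"
    return "potentially underestimated"

def security_histogram(rc_values, rc_min_obs, rc_max_obs):
    """Percentage of sections in each class (unweighted, an assumption)."""
    counts = Counter(classify_rc(rc, rc_min_obs, rc_max_obs)
                     for rc in rc_values)
    n = len(rc_values)
    return {label: 100.0 * c / n for label, c in counts.items()}

# Hypothetical estimated RC values for five hydrography sections
estimated_rc = [0.05, 0.20, 0.45, 0.70, 1.10]
hist = security_histogram(estimated_rc, rc_min_obs=0.17, rc_max_obs=0.60)
```

A final histogram for a grouping would be obtained by pooling the classified sections of all its HHRs, each classified with the observed limits of its own HHR.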

HHR identification
HHR identification using the geographical convenience method resulted in three groupings, termed 'Geo 1', 'Geo 2' and 'Geo 3', as presented in Figure 3. The grouping 'Geo 1' (Figure 3a) was proposed to characterize the physical and statistical behavior of the regionalized flows when the entire area under study was considered to be hydrologically homogeneous, while the groupings 'Geo 2' and 'Geo 3' (Figures 3b and 3c) were proposed because they had the best statistical performances among the groupings tested using the geographical convenience method.
Regarding the HHR identification using cluster analysis, Table 1 presents the results of the validation indices for the following combinations of variables: (a) precipitation and evapotranspiration (Cluster 1); (b) precipitation (Cluster 2); (c) latitude, longitude and real evapotranspiration (Cluster 3); and (d) latitude, longitude and precipitation (Cluster 4). The values in bold indicate the optimal number of clusters for each validation index.
As listed in Table 1, the validation indices for all of the analyzed groupings diverge with respect to the recommended number of clusters into which the stations must be divided. Thus, the number of clusters was chosen as the one indicated by most of the four validation indices.
For 'Cluster 1', the validation indices predominantly indicate that the recommended number of clusters is four. For the grouping 'Cluster 2', there is no predominance for the number of clusters because the Silhouette and Xie-Beni indices lead to a recommendation of two clusters, while the Calinski-Harabasz and Dunn indices indicate a number of clusters equal to five. Therefore, for the grouping 'Cluster 2', both options were considered, defined as 'Cluster 2/2' and 'Cluster 2/5'.
The validation indices for the groupings 'Cluster 3' and 'Cluster 4' predominantly indicate that the recommended number of clusters for both groupings is two. Therefore, cluster analyses in which the latitude and longitude variables were included ('Cluster 3' and 'Cluster 4') showed greater agreement in regard to their validation indices compared to analyses in which they were not included ('Cluster 1' and 'Cluster 2').
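The prevalence rule used to choose the number of clusters can be sketched as a simple vote count. The votes for 'Cluster 2' follow the description above; the second set of votes is hypothetical and only illustrates a case with a clear majority.

```python
from collections import Counter

def recommended_k(votes):
    """Return the cluster number(s) indicated by the most indices.

    votes: mapping from validation index name to its recommended number
    of clusters. More than one value is returned when there is a tie.
    """
    counts = Counter(votes.values())
    top = max(counts.values())
    return sorted(k for k, c in counts.items() if c == top)

# Recommendations for 'Cluster 2' as described in the text
cluster2_votes = {"Calinski-Harabasz": 5, "Dunn": 5,
                  "Silhouette": 2, "Xie-Beni": 2}
tie = recommended_k(cluster2_votes)

# Hypothetical votes with a clear majority
example_votes = {"Calinski-Harabasz": 4, "Dunn": 4,
                 "Silhouette": 4, "Xie-Beni": 2}
majority = recommended_k(example_votes)
```

A tie, as for 'Cluster 2', means the indices alone cannot decide, which is why both candidate groupings were carried forward.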
The groupings 'Cluster 2/2', 'Cluster 3' and 'Cluster 4' have the same configuration in relation to the stream gauge stations, that is, the HHRs formed for these groupings are identical. Thus, the three groupings were represented by the grouping 'Cluster 3'. The grouping 'Cluster 2/5' was renamed to 'Cluster 2'. Briefly, cluster analysis yielded three distinct results, defined as 'Cluster 1', 'Cluster 2' and 'Cluster 3', as presented in Figure 4.

Behavior analysis of the regionalized flows
In Table 2, the equations of regression and adjusted coefficients of determination obtained for each HHR identified using the grouping methods in the study are listed. The methodologies generated regression models with adjusted coefficients of determination higher than 0.85.
The statistical indices used to analyze the behavior of the regionalized flows are listed in Table 3. These indices did not agree on which grouping method produced the best statistical fit. The groupings 'Geo 2' and 'Geo 3' presented the best statistical performance with regard to AE, E1, E2 and ERm. Thus, we considered that these groupings had the better statistical results. The grouping 'Geo 1' showed the worst performance with regard to AE, E1, E2 and E3 and was considered to be the grouping with the worst statistical result. Given that this grouping treats the entire study basin as a single hydrologically homogeneous region, this result highlights the importance of identifying HHRs in regionalization studies, mainly for large basins.
Regarding the groupings obtained using cluster analysis, 'Cluster 1' showed a better statistical performance than 'Cluster 3' for all of the evaluated indices, while 'Cluster 2' performed better than 'Cluster 3' in four of the five evaluated indices. Thus, among them, 'Cluster 3' was considered to have the worst statistical result. With regard to 'Cluster 1' and 'Cluster 2', we cannot infer which showed the best statistical performance, as neither prevailed across the analyzed indices.
As seen in Figure 5a, the grouping 'Geo 1' presented an RC distribution consistent with the precipitation map; however, the estimated RC values ranged from 0.20 to 0.45, while the RC values observed at the stream gauge stations ranged from 0.17 to 0.60. That is, considering the area under study as a single HHR reduced the amplitude of the estimated RC values compared to the observed values, causing all of the hydrography segments to be classified as acceptable/likely in the security analysis. Even in the extrapolation regions of the regionalization equation, the amplitude of the estimated RC values was much lower than that of the observed values, which is inconsistent from a physical point of view. For this reason, the grouping 'Geo 1' was considered to be unsatisfactory for the regionalization of the Qmlt in the Doce River Basin.
Analysis of the estimated RC spatial distribution relative to the grouping 'Cluster 2' (Figure 5b) showed that in 17.8% of the hydrography, the RC values were greater than 1. From a physical point of view, this behavior was also considered to be unsatisfactory for Qmlt regionalization.
For HHR 4 of the grouping 'Cluster 1' (Figure 5c), the Qmlt regionalization yielded estimated RC values less than 0.01 for 76% of the hydrography, while the smallest RC value observed in the region was 0.17. This behavior was observed in the hydrography segments with small drainage areas, while for those associated with larger drainage areas, the estimated RC reached values from 0.31 to 0.45, which are close to those observed at the stream gauge stations. This behavior is physically inconsistent, as watercourses with estimated RC values of approximately 0.01 feed rivers with estimated RC values greater than 0.31. A possible physical justification for this behavior would be increased precipitation downstream; however, the precipitation map does not support this justification, since precipitation decreases downstream. On the basis of this behavior, we considered 'Cluster 1' to be unsatisfactory for Qmlt regionalization.
From these results, we observe that the physical behavior analysis of regionalized flows assists in determining HHRs and validating regionalized flows because although the groupings 'Cluster 1' and 'Cluster 2' showed better statistical results among the groupings from the cluster analysis (Table 3), their physical behaviors were considered to be unsatisfactory.
The groupings 'Geo 2', 'Geo 3' and 'Cluster 3', unlike the groupings 'Geo 1', 'Cluster 1' and 'Cluster 2', were considered to be satisfactory from a physical point of view. However, for some of the hydrography sections, dubious physical behaviors were identified and are described below. This result demonstrates that for the Doce River Basin, acceptable results were found when the HHRs were geographically continuous.
The RC spatial distribution of the grouping 'Geo 2' (Figure 5d) presented RC values ranging from 0.03 to 0.78, an increase in the amplitude of the estimated RC values compared to the interval of variation of the values observed at the stream gauge stations (0.17 to 0.60), which resulted in 23.2% of the hydrography being potentially overestimated and 13.4% of the hydrography being potentially underestimated (Table 4).
The amplitude of the estimated RC along the hydrography and that observed at the stream gauge stations as well as the average of the estimated RC values in the hydrography sections and the P for each HHR of grouping 'Geo 2' are listed in Table 4.
The Qmlt regionalization from the grouping 'Geo 2' generated the following: (a) 63% of the hydrography sections with RC values greater than the maximum observed at the stream gauge stations (HHR 1); (b) the average of the estimated RC values along the hydrography was 0.62, while the maximum value observed in the stream gauge stations was 0.60 (HHR 1); and (c) the minimum estimated RC value along the hydrography was higher than the minimum observed at the stream gauge stations (HHR 1 and 2). These behaviors can generate false expectations regarding the availability of water in the respective regions.
The regionalization of Qmlt from the grouping 'Geo 2' presented some inconsistencies between the RC distribution map and the isohyet map. Although the precipitation in HHRs 1 and 2 is similar, the difference in the average estimated RC values along the hydrography is significant (Table 4). A similar behavior was observed for HHRs 3 and 4. HHR 4 from the grouping 'Geo 2' showed upstream hydrography segments with estimated RC values smaller than those in the downstream hydrography sections. This behavior is physically dubious, but unlike the grouping 'Cluster 2', the minimum estimated RC value was 0.03 and the average estimated value in the hydrography sections was 0.08.
The Qmlt regionalization from the grouping 'Geo 3' (Figure 5e) presented RC values ranging from 0.03 to 0.63. The percentage of hydrography that was potentially underestimated was 13.17%, and that which was acceptable/likely was 86.63%. HHRs 2 and 3 from 'Geo 3' are identical, respectively, to HHRs 3 and 4 from 'Geo 2'; thus, the considerations are the same as previously reported.
The estimated RC amplitude along the hydrography and observed at the stream gauge stations, as well as the average of the estimated RC values of the hydrography sections and P, in each HHR from grouping 'Geo 3' are listed in Table 4.
The Qmlt regionalization in HHR 1 presented the minimum estimated RC value along the hydrography as higher than the minimum observed at the stream gauge stations, which leads one to believe that the model overestimates Qmlt at the hydrography sections and thus can generate false expectations of water availability.
The Qmlt regionalization from the grouping 'Cluster 3' (Figure 5f) presented RC values ranging from 0.04 to 0.63. The percentage of hydrography that was potentially underestimated was 30%, and that within the acceptable/likely zone was 70%. HHR 1 from grouping 'Cluster 3' is identical to that of HHR 1 from grouping 'Geo 3'; thus, the comments are the same as previously reported.
The estimated RC amplitude along the hydrography and observed at stream gauge stations as well as the average of the estimated RC values in the hydrography sections and the P in each HHR from the grouping 'Cluster 3' are listed in Table 4. HHR 2 from the grouping 'Cluster 3' presented upstream hydrography segments with an estimated RC smaller than that in the downstream hydrography. This behavior is physically dubious, but unlike the grouping 'Cluster 2', the minimum estimated RC value was 0.04.
The regionalization conducted considering the groupings 'Geo 2', 'Geo 3' and 'Cluster 3' was considered to be acceptable from a physical point of view, despite the discussed dubious behaviors. In this way, a comparative analysis between the groupings was conducted to identify which grouping's statistical and physical behaviors presented the lower risk for Qmlt regionalization.
Comparative analysis between the groupings 'Geo 2' and 'Geo 3' showed that the Qmlt regionalization considering the grouping 'Geo 3' performed better in the following respects: (a) a greater resemblance between the isohyet map and spatialized RC behavior (HHR 1); (b) a reduction in the inconsistency noted in Table 4, in which similar precipitation values are associated with distinct estimates for RC; and (c) lower estimated RC values in the southern area of the study, resulting in values that were classified as overestimated being moved to the acceptable/likely zone, reducing the risk associated with regionalization. Regarding the results obtained for the Qmlt regionalization considering the grouping 'Geo 2', we observed that the minimum estimated RC value in HHR 2 (0.32) was higher than that observed at the stream gauge stations (0.31), yet lower than the estimated value when considering 'Geo 3' (0.37); thus, for this region, the grouping 'Geo 2' has less risk of overestimating Qmlt. Therefore, we could not conclusively determine which of the two groupings represented a lower risk of overestimation in the Qmlt regionalization.
A comparative analysis between the groupings 'Geo 2', 'Geo 3' and 'Cluster 3' showed that the results obtained from Qmlt regionalization considering the groupings 'Geo 2' and 'Geo 3' had a better statistical performance. The grouping 'Cluster 3' presented one of the worst statistical performances; however, its physical behavior was the most consistent, as it showed more similarity between the isohyet map and spatialized RC behavior, which contributed to a reduction in the inconsistencies shown in Table 4, in which similar precipitation values are associated with different RC estimates. The comparative analysis between the groupings 'Geo 2', 'Geo 3' and 'Cluster 3' did not indicate which grouping represented less risk for the Qmlt regionalization because, while the groupings 'Geo 2' and 'Geo 3' showed a superior statistical performance, the grouping 'Cluster 3' proved to be more satisfactory when considering the physical behavior of the average regionalized flow.

CONCLUSIONS
In this study, an analysis of the physical behavior of the regionalized long-term average flow obtained from HHRs, identified by means of geographical convenience methods and cluster analysis, assisted in choosing the most satisfactory results and in reducing the subjectivity of HHR identification.
The influence of different combinations of latitude, longitude, precipitation and evapotranspiration on the HHRs identified by cluster analysis, and thus on the regionalized flows, was examined. For the combinations that included latitude and longitude, the grouping validation indices showed greater agreement, and the physical behavior of the regionalized flows was considered to be satisfactory.
Among the results obtained using cluster analysis, the physical analysis was of crucial importance in choosing the HHRs because, although 'Cluster 1' and 'Cluster 2' showed better statistical results, their physical behavior was considered to be unsatisfactory. On the other hand, 'Cluster 3' showed the worst statistical performance, but its physical behavior was evaluated as satisfactory.
The HHR delimitation method by geographical convenience produced a better statistical result for regionalized flows; however, its physical behavior was less consistent than that obtained from cluster analysis.