MULTIVARIATE STATISTICAL ANALYSIS TO SUPPORT THE MINIMUM STREAMFLOW REGIONALIZATION

This study aimed to develop a methodology based on multivariate statistical analysis, namely principal component analysis and cluster analysis, to identify the most representative variables in studies of minimum streamflow regionalization and to optimize the identification of hydrologically homogeneous regions in the Doce river basin. Ten variables were used, individualized for each of the 61 gauging stations: three dependent variables indicative of minimum streamflow (Q7,10, Q90 and Q95) and seven independent variables describing the climatic and morphometric characteristics of the basin (total annual rainfall, Pa; total semiannual rainfall of the dry and of the rainy season, Pss and Psc; watershed drainage area, Ad; length of the main river, Lp; total length of the rivers, Lt; and average watershed slope, SL). The principal component analysis indicated that SL was the least representative variable for the study, and it was therefore discarded. The most representative independent variables were Ad and Psc. The best divisions into hydrologically homogeneous regions for the three flow characteristics were obtained with the Mahalanobis similarity matrix and the complete linkage clustering method. The cluster analysis identified four hydrologically homogeneous regions in the Doce river basin.


INTRODUCTION
The hydrological behavior of a river basin depends on its geomorphological characteristics, climate and type of plant cover. Several physical and biotic variables of a basin therefore play an important role in the processes of the hydrological cycle.
River basins with large drainage areas can behave differently, in hydrological terms, in distinct parts. For that reason, identifying the hydrologically homogeneous regions of a basin is one of the first goals to be reached for sound water resources management.
Hydrological regionalization is defined as any process of transferring information from a region of known hydrological behavior to other places, usually without observations. The transfer can involve the data series themselves, relevant statistical parameters such as the average, the variance and maximum and minimum events, or equations and parameters related to these statistics. MISHRA & COULIBALY (2009) describe the importance of having reliable information at the watershed scale because of its numerous practical uses: hydrology, agronomy, climatology, hydrogeology, water resources planning and management, decision processes for implementing public policies, and the installation of industrial plants.
In this context, multivariate statistical analyses can substantially assist hydrological regionalization studies by reducing database processing time and increasing the reliability of the results. Internationally, this is borne out by several studies aimed at hydrological regionalization (ASSANI et al., 2011; MWALE et al., 2011; SAMUEL et al., 2011; ENGELAND & HISDAL, 2009; CASTIGLIONI et al., 2009).
Principal component analysis aims to examine the correlations between the studied variables, to summarize a large set of variables in a smaller set that conveys equivalent information, to evaluate the importance of each variable, and to eliminate those that contribute little, in terms of variation, to the group of evaluated individuals (WILKS, 2006).
Cluster analysis is an exploratory multivariate tool that aims to classify individuals into homogeneous groups (WILKS, 2006). It has been employed in several areas of knowledge, for instance genetics (CARVALHO et al., 2009), management (COUTO JR & GALDI, 2012), health (RESENDES et al., 2010) and environmental engineering (HATVANI et al., 2011). In hydrology, cluster analysis is often used to define classes or to group stations into homogeneous climate regions, i.e. regionalization.
In light of the above, this study aimed to develop a methodology based on principal component analysis and cluster analysis to identify the most representative variables in studies of minimum streamflow regionalization and to optimize the delineation of hydrologically homogeneous regions in the Doce river basin.

Region of study
The Doce river basin is located in the Southeast region of Brazil, between the parallels 17°45' and 21°15' S and the meridians 39°30' and 43°45' W, with an average altitude of 578 m. Its drainage area is approximately 83,400 km², of which 86% belongs to the state of Minas Gerais and 14% to the state of Espírito Santo. The basin has approximately 3.1 million inhabitants, 70% of whom live in urban areas. About 98% of the basin area lies in the Brazilian Atlantic Rainforest biome, and the rest belongs to the Cerrado biome. The main economic activities are mining, metallurgy, forestry and farming (CBH-Doce, 2010).

Database and Applications
The study used data from 61 gauging stations belonging to the hydrometeorological network of the National Water Agency (NWA). The series consisted of daily flow data for the base period from 1976 to 2005 (Table 1). Data were used only up to 2005 because, at the beginning of the study, this was the most recent year with consistent data provided by NWA.
The vector base of elevation (contour lines and spot heights) and of hydrography for the hydrographic region was obtained from the Brazilian Institute of Geography and Statistics (IBGE) at a scale of 1:250,000.
The generation of the hydrographically conditioned digital elevation model (HCDEM), the automatic extraction of the morphometric variables and average precipitations, and the spatialization of the results were carried out with the ArcGIS® 10.0 software, used as a tool for vector geoprocessing and spatial representation of data. The multivariate statistical analyses were conducted with the Statistica® 7.0 software, and the multiple regression equations were obtained with the SisCORV 1.0.3 software (SOUSA et al., 2008), developed by the Research Group of Water Resources of the Department of Agricultural Engineering of the Federal University of Viçosa.
In the present study, 10 variables were considered: three dependent variables to be regionalized (the minimum seven-day average flow with a ten-year return period, Q7,10; and the minimum streamflows associated with 90% and 95% permanence in time, Q90 and Q95) and seven independent variables (total annual rainfall, Pa; total semiannual rainfall of the dry season, Pss, and of the rainy season, Psc, in mm; watershed drainage area, Ad, in km²; length of the main river, Lp, in km; total length of the rivers, Lt, in km; and average watershed slope, SL, in %).

Principal component analysis
Based on the principal component analysis, the original set of observed independent variables (Pa, Pss, Psc, Ad, Lp, Lt and SL) was transformed into a new set of variables, named principal components, meeting the following criteria (JOLLIFE, 2002): a) each principal component Yi of the data matrix is a linear combination of the seven independent variables considered; b) the sum of the squares of the coefficients aij is equal to 1; c) each principal component has its own coefficients; d) the components are not correlated with one another; e) among all the components, Y1 presents the greatest variance, Y2 the second greatest, and so forth; f) the sum of the variances of the principal components (Yi) is equal to the sum of the variances of the variables (Xj).
As R is a symmetric correlation matrix of dimension p x p, from which the eigenvalues (λi) and the eigenvectors (ai) are extracted, the solution was obtained by solving the system (eq. 1):

$$(R - \lambda_i I)\,a_i = \Phi \tag{1}$$

in which λi are the characteristic roots (or eigenvalues) of the R matrix, there being p eigenvalues that correspond to the variances of the p principal components; I is the identity matrix of dimension p x p; ai is the eigenvector or characteristic vector, a p x 1 matrix containing the p coefficients associated with the eigenvalue λi of the principal component Yi; and Φ is a zero vector of dimension p x 1.
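As a small worked illustration of the eigenvalue system (R − λI)a = Φ, consider a hypothetical case with only p = 2 standardized variables whose correlation is r > 0 (an assumption made purely for illustration; the study uses p = 7):

$$R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad |R - \lambda I| = (1 - \lambda)^2 - r^2 = 0 \;\Rightarrow\; \lambda_1 = 1 + r, \quad \lambda_2 = 1 - r$$

$$a_1 = \tfrac{1}{\sqrt{2}}\,(1,\ 1)^T, \qquad a_2 = \tfrac{1}{\sqrt{2}}\,(1,\ -1)^T$$

In this sketch the eigenvalues add up to p = 2, the total variance of the standardized variables, and the squared coefficients of each eigenvector add up to 1, in line with the criteria listed for principal components.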
One of the most common problems in the application of multivariate statistical models is that they depend on the units and scales in which the variables were measured. The variables were therefore standardized through eq. (2), eliminating the dependence on the units and scales in which they were presented:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma(X_j)} \tag{2}$$

in which Zij is the standardized variable, and σ(Xj) and X̄j are the standard deviation and the average of the j-th original variable.
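The standardization step can be sketched in a few lines of Python. This is only an illustration: the function name `standardize`, the sample values and the use of the sample standard deviation are assumptions, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def standardize(columns):
    """Return z-scores for each variable, given as a list of station values."""
    standardized = []
    for values in columns:
        m = mean(values)
        s = stdev(values)  # sample standard deviation (assumption)
        standardized.append([(x - m) / s for x in values])
    return standardized


# Hypothetical values for two variables at four stations (not data from the study).
drainage_area_km2 = [200.0, 900.0, 3000.0, 40000.0]
rainy_season_mm = [1100.0, 1050.0, 980.0, 900.0]

z_area, z_rain = standardize([drainage_area_km2, rainy_season_mm])
```

After this step each variable has mean 0 and standard deviation 1, so variables measured in km² and in mm can be compared on the same footing.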
The importance of each principal component was evaluated through its correlation with each variable Xj, which, for standardized variables, is given by (eq. 3):

$$r_{X_j, Y_i} = a_{ij}\sqrt{\lambda_i} \tag{3}$$

The following criteria were used to select the principal components in this study:
• accumulated percentage of the total variance of the original data greater than or equal to 75% (JOLLIFE, 2002); and
• eigenvalues greater than or equal to the average of the eigenvalues (RENCHER, 2002).
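The two selection criteria can be sketched as follows. The function `select_components` and the eigenvalues are hypothetical and serve only to show how each criterion yields a number of components to retain.

```python
def select_components(eigenvalues, min_cumulative=0.75):
    """Return the number of components kept by each of the two criteria."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    average = total / len(ordered)

    # Criterion 1: smallest k whose accumulated variance share reaches the threshold.
    cumulative = 0.0
    k_variance = len(ordered)
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_cumulative:
            k_variance = k
            break

    # Criterion 2: components whose eigenvalue is at least the average eigenvalue.
    k_average = sum(1 for value in ordered if value >= average)
    return k_variance, k_average


# Hypothetical eigenvalues for seven standardized variables (they sum to about 7).
hypothetical_eigenvalues = [3.2, 2.3, 0.7, 0.4, 0.2, 0.1, 0.1]
k_variance, k_average = select_components(hypothetical_eigenvalues)
```

With these illustrative numbers both criteria point to the first two components; with real data the two counts may differ and a judgment is needed.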

Cluster analysis
The clustering process comprised two steps: first, the estimation of the Mahalanobis similarity between the 61 gauging stations to be grouped and, second, the adoption of a clustering technique, either the single linkage or the complete linkage method, to form the groups.
When the Euclidean distance is used for cluster analysis, the variables are assumed to be uncorrelated with one another, although this assumption is usually ignored. To avoid this common problem in hydrological regionalization studies, the similarity matrix was built with the Mahalanobis distance. In practice, the Mahalanobis distance amounts to applying the Euclidean distance (eq. 4) to the standardized data matrix.
The single linkage method starts from a distance (dissimilarity) matrix between gauging stations (individuals). The two most similar individuals, those separated by the smallest distance, were identified and joined in an initial group. The distance from this first group to each of the other individuals was then calculated.
The distance between a group and an individual was given by the expression (eq. 5):

$$d_{(ab)c} = \min\{d_{ac},\ d_{bc}\} \tag{5}$$

that is, the distance between the group formed by the individuals a and b and the individual c was given by the smallest element of the set of distances of the pairs ac and bc.
From the smallest distances between the group formed and the neighboring individuals, a new dissimilarity matrix, of smaller dimension than the initial one, was built, and the most similar individuals and/or groups were identified. They were then either incorporated into the initial group or arranged into a second group, depending on whether the smallest distance in the new dissimilarity matrix was found between two other individuals.
In the subsequent stages, increasingly smaller dissimilarity matrices were used until all individuals were gathered in a single group, composing a dendrogram or tree.
The complete linkage clustering method follows a procedure similar to that of the single linkage method, with one important difference: at each stage, the distance between two individuals and/or groups was taken as the greatest distance between their members.
The distance between a group and an individual was given by the expression (eq. 6):

$$d_{(ab)c} = \max\{d_{ac},\ d_{bc}\} \tag{6}$$

that is, the distance between the group formed by the individuals a and b and the individual c was given by the greatest element of the set of distances of the pairs ac and bc.
The construction of the dissimilarity matrices, of smaller dimension than the initial one, followed the same procedure described for the single linkage method. The only difference was that the groups were formed through maximum distances (complete linkage) rather than minimum distances (single linkage).
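A minimal sketch of building the dissimilarity matrix is given below. It follows the simplification stated in the text, the Euclidean distance applied to standardized data; a full Mahalanobis distance would additionally account for the correlations between variables through the inverse covariance matrix, a step not shown here. The function names and the station values are hypothetical.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two stations described by the same variables."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows):
    """Symmetric matrix of pairwise distances between stations (one row each)."""
    n = len(rows)
    return [[euclidean(rows[i], rows[k]) for k in range(n)] for i in range(n)]


# Hypothetical standardized values of two variables at three stations.
stations = [[-1.0, 0.5], [-0.8, 0.7], [1.2, -0.9]]
D = distance_matrix(stations)
```

The resulting matrix has zeros on the diagonal and is symmetric, which is the form expected by the linkage methods described next.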
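The two linkage rules can be sketched with one small routine in which only the rule for the group distance changes: `min` gives single linkage and `max` gives complete linkage. The function `agglomerate` and the four example positions are hypothetical, and the sketch recomputes group distances from the original matrix instead of shrinking it, which gives the same merges for these two rules.

```python
def agglomerate(dist, linkage=max):
    """Merge groups step by step; linkage=min is single, linkage=max is complete.

    Returns a list of (merge distance, members of the new group), one per step.
    """
    clusters = [[i] for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group distance: smallest or greatest distance between members.
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        history.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Four hypothetical individuals placed on a line; distances are absolute differences.
positions = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in positions] for p in positions]

single_steps = agglomerate(dist, linkage=min)
complete_steps = agglomerate(dist, linkage=max)
```

In this toy case both rules merge the individuals in the same order, but the merge distances differ (1, 2, 4 for single linkage and 1, 3, 7 for complete linkage), which is what changes the shape of the dendrogram.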
The number of homogeneous regions for each flow characteristic was defined using the criterion of inertia between jumps, in which the first visible discontinuity in the plot of dissimilarity distance versus clustering steps is taken as the cut-off (MELO JÚNIOR et al., 2006; RENCHER, 2002; WILKS, 2006).
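The idea of a first discontinuity can be illustrated with a short sketch. In the study the discontinuity is identified visually; the threshold `min_jump`, the function name and the merge distances below are assumptions introduced only to make the idea concrete.

```python
def first_discontinuity(merge_distances, min_jump):
    """Return (step, groups) for the first jump larger than min_jump, or None.

    merge_distances[k] is the distance at which clustering step k + 1 occurred.
    """
    n_individuals = len(merge_distances) + 1
    for step in range(1, len(merge_distances)):
        jump = merge_distances[step] - merge_distances[step - 1]
        if jump > min_jump:
            # Cutting after `step` merges leaves n_individuals - step groups.
            return step, n_individuals - step
    return None


# Hypothetical merge distances for seven individuals (six clustering steps).
heights = [0.5, 0.6, 0.8, 0.9, 2.5, 2.8]
cut = first_discontinuity(heights, min_jump=1.0)
```

Here the first large jump lies between steps 4 and 5, so cutting there leaves three groups; the choice of threshold remains a judgment call, as does the visual inspection it stands in for.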

Multiple regression analysis
The regression models tested to build the regionalization equations for each hydrologically homogeneous region were linear, potential (power), exponential, logarithmic and reciprocal.
The models resulting from the multiple regression applied to the hydrologically homogeneous regions given by the cluster analysis were selected according to the following criteria:
• equation representative of the studied event;
• smaller number of independent variables, according to the relative significance given by the principal component analysis;
• greater values of the adjusted determination coefficient;
• lower values of the factorial standard error;
• significant results by the F test;
• continuity of flows; and
• convenience of geographic spatialization of the obtained equations.
To verify the fit of the adopted regression models to the data, the following were required: an adjusted determination coefficient greater than 0.70 (r²a > 0.70), a standard error of estimate lower than 0.5 (EP < 0.5) and a significance level of 5% by the F test.
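As an illustration of how a potential (power) model and its fit statistics might be obtained, the sketch below fits Q = a·Ad^b by least squares on log-transformed values. It is not the SisCORV procedure: the function name, the noise-free sample values and the choice of computing r²a and the standard error in log space are all assumptions.

```python
from math import exp, log, sqrt


def fit_power_model(x, y):
    """Fit y = a * x**b by ordinary least squares on log-transformed data."""
    lx = [log(v) for v in x]
    ly = [log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = exp(my - b * mx)

    # Fit statistics computed on the log scale (assumption).
    ss_res = sum((v - (log(a) + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - my) ** 2 for v in ly)
    p = 1  # number of independent variables in this sketch
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    std_error = sqrt(ss_res / (n - p - 1))
    return a, b, r2_adj, std_error


# Hypothetical, noise-free drainage areas (km²) and flows (m³/s).
areas = [200.0, 500.0, 1000.0, 3000.0, 8000.0]
flows = [0.004 * area ** 0.9 for area in areas]

a, b, r2_adj, std_error = fit_power_model(areas, flows)
accepted = r2_adj > 0.70 and std_error < 0.5
```

The final line mirrors the acceptance thresholds quoted in the text; the F test is omitted from this sketch.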

Principal components analysis
The principal component analysis was conducted with the seven independent variables (Pa, Pss, Psc, Ad, Lp, Lt and SL) for each of the 61 gauging stations adopted. Because the data were standardized to mean 0 and variance 1, the total variance of the multivariate data set was equal to the number of variables analyzed.
Table 2 displays the correlation matrix between the standardized independent variables. To evaluate the importance of each variable and to eliminate those that contribute little in terms of variation, the principal components of the studied variables were identified for the group of individuals evaluated in the flow regionalization analysis (Table 3).

Table 2 legend: Pa, average total annual rainfall; Pss, average total semiannual rainfall of the dry season; Psc, average total semiannual rainfall of the rainy season; Ad, watershed drainage area; Lp, length of the main river; Lt, total length of the rivers; and SL, average watershed slope.

According to HELENA et al. (2000), correlation coefficients greater than 0.5 express a strong relationship between the evaluated variables. Table 2 shows that the climate variables Pa and Psc are strongly correlated with one another and that Pss is moderately correlated with Pa and Psc. The morphometric variables Ad, Lp and Lt are highly correlated with one another, whereas the morphometric variable SL is weakly correlated with the remaining variables (r < 0.5), which indicates that it could possibly be excluded from this study.
Based on the data presented in Table 3, only the first two components (Y1 and Y2) were considered, as they simultaneously met the two selection criteria adopted (accumulated variance greater than or equal to 75% of the total data variation, and eigenvalues greater than or equal to 1, the average eigenvalue for standardized variables). The other components, which together explained 22.08% of the total variation, were not considered. Table 4 presents the correlations, or load factors, between the seven standardized variables and the first two principal components. Table 4 shows that the standardized variables Z4, Z5 and Z6 (Ad, Lp and Lt) present greater correlations with the first principal component (Y1), while the variables Z1, Z2 and Z3 (Pa, Pss and Psc) present greater correlations with the second principal component (Y2). The variable Z7 (SL) can be discarded from the study, as it contributes little to the variation in the group of evaluated individuals, confirming the result obtained from the analysis of the correlation matrix R.
The average watershed slope (SL) showed negligible representativeness with respect to the behavior of the studied flow characteristics, since it treats each drainage area as a uniform surface, which does not physically represent the natural process of runoff in river channels. For this reason, the exclusion of SL from the selected set was expected.
The monitoring of water resources involves a great number of variables, and reducing the amount of unnecessary information leads directly to savings in time and resources. MISHRA & COULIBALY (2009) demonstrated the importance of having reliable variables in watershed engineering studies. CASTIGLIONI et al. (2009) also used physiographic variables when trying to identify hydrologically homogeneous regions.
Physically, the principal component Y1 represents the most representative morphometric variables, and the principal component Y2 represents the average rainfall over the drainage areas upstream of each gauging station. ASSANI et al. (2011) achieved good results using principal component analysis in river basins in Canada.
In agreement with WILKS (2006), the results showed that the use of principal component analysis for the regionalization of flows, even in a preliminary way, is fundamental to eliminating variables of little significance, thus increasing the spatial reliability of the hydrologically homogeneous regions.
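The retained share of variance can be checked with simple arithmetic, assuming, as stated for the standardized data, that the total variance equals the number of variables (p = 7). The individual eigenvalues are not reproduced here, so only their sums are used:

$$\sum_{i=1}^{7} \lambda_i = \operatorname{tr}(R) = 7, \qquad \frac{\lambda_1 + \lambda_2}{7} = 1 - 0.2208 = 0.7792 \;\Rightarrow\; \lambda_1 + \lambda_2 \approx 5.45$$

$$\sum_{i=3}^{7} \lambda_i = 0.2208 \times 7 \approx 1.55, \qquad \bar{\lambda} = \frac{7}{7} = 1$$

Since the average eigenvalue is 1 for standardized variables, the criterion of eigenvalues greater than or equal to the average is the same as requiring eigenvalues greater than or equal to 1.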

Cluster analysis
After SL was discarded on the basis of the principal component analysis, the homogeneous regions for the three flows were obtained separately from the Mahalanobis distance matrix, using the standardized variables that presented the greatest correlations with the first two principal components (Ad, Lt, Lp, Pa, Psc and Pss).
The single linkage (nearest neighbor) method produced irregular clusters for the three studied flow characteristics and was discarded. MELO JÚNIOR et al. (2006) found a similar situation and also disregarded the clusters obtained with the nearest neighbor method.
The complete linkage method gave results that were easy to interpret and the same number of clusters for the three evaluated flows. For this method, the cut-off can be placed at a dissimilarity distance of approximately 19% in the dendrogram, at which four groups with homogeneous flow characteristics are formed for all the flows considered. To illustrate the result, Figures 1 and 2 present, respectively, the plot of dissimilarity distance vs. clustering steps and the dendrogram for Q7,10. In Figure 1, the first discontinuity is observed between clustering steps 56 and 57. From this result, four hydrologically homogeneous regions were identified for the three flow characteristics analyzed, which behaved in the same way under the clustering method (Figure 2). MELO JÚNIOR et al. (2006) obtained good results with the complete linkage clustering method in studies of precipitation.

Homogeneous regions
Through the complete linkage clustering method, four regions with homogeneous flow characteristics were obtained for the Doce river basin, as described below:
• Region I: composed of stations with smaller flows and drainage areas. Spatially, it comprises headwater regions and small tributaries. Seventeen (17) gauging stations make up this region for all studied flows, with drainage areas varying from 166 to 970 km².
• Region II: intermediate region between regions I and III, composed of 12 gauging stations with drainage areas varying from 757 to 1,396 km².
• Region III: intermediate region between regions II and IV. Spatially, it consists of the main tributaries of the rivers with greater flows in the basin and comprises 13 gauging stations with drainage areas varying from 1,200 to 3,055 km².
• Region IV: composed of stations with greater flows and drainage areas. Spatially, it consists of the main channel of the Doce River and its main tributaries: Piracicaba, Santo Antônio, Suaçuí and Manhuaçu. Nineteen gauging stations make up this homogeneous region, with drainage areas varying from 2,578 to 81,940 km².
Figure 3 presents the spatial configuration of the four hydrologically homogeneous regions for the flows Q7,10, Q90 and Q95, which showed the same hydrological behavior. To delimit the homogeneous regions, the areas of influence of their gauging stations were extended down to the outlet in the largest river downstream, following the procedure described by MARQUES et al. (2009). Figure 3 shows that the homogeneous region with the greatest spatial extent is region I (headwater regions and smaller drainage areas), followed by region IV (channel of the main river and main tributaries), region III and region II.
It should be noted that drainage areas smaller than 160 km² and larger than 82,000 km² were included in hydrological regions I and IV, respectively. However, much of the Doce river basin (drainage areas smaller than 160 km²) lacks adequate monitoring, so other criteria are required to extend the results to these areas. RIBEIRO et al. (2005) worked with reference minimum streamflows (Q7,10, Q90 and Q95) and obtained seven hydrologically homogeneous regions for the Doce River basin. MARQUES et al. (2009) investigated the same watershed using reference minimum streamflows (Q7,10, Q90 and Q95) for quarterly periods, obtaining hydrologically homogeneous regions.
It should be noted that the regions considered by these authors, which were based on existing water resources management units and sub-basins of the Doce River basin, are subdivisions and/or spatial junctions of the hydrologically homogeneous regions found with the methodology presented in this study.
It should be emphasized that the methodology proposed in this study is based on multivariate statistical analysis and that the results obtained reflect the general hydrological behavior of the Doce River basin.
The result obtained with the proposed methodology was complemented by multiple regression analysis between the dependent variables (minimum streamflow characteristics) and the independent variables (climate and morphometric variables), to obtain regional equations for the four hydrologically homogeneous regions.

Multiple regression analysis
For the hydrologically homogeneous regions obtained with the proposed approach, multiple regression equations of the linear, potential, exponential, logarithmic and reciprocal types were fitted to the investigated flow characteristics. Table 5 presents, for each homogeneous region, the regression equations that best fitted the variables Q7,10, Q90 and Q95. To meet the selection criteria for the regression equations, it was necessary to exclude three gauging stations from region I (56570000, 56935000, 56993002) and one gauging station from region IV (56880000).

From Table 5, the following can be observed:
• The regression model that best fitted the flow data was the potential model. The same result for the regional equations was obtained by RIBEIRO et al. (2005) and MARQUES et al. (2009) for the Doce River basin;
• The most important independent variable for the study was the drainage area (Ad), followed by the average semiannual rainfall of the rainy season (Psc);
• The regional equations presented for the four hydrologically homogeneous regions defined by the methodology proposed in this study showed determination coefficients higher than 0.70, standard errors of estimate lower than 0.5 and significance at the 5% level by the F test.
The results achieved through multiple regression analysis were considered satisfactory, supporting the methodology presented in this study.
Combined with previous knowledge of the region, spatial analysis tools and the experience of a hydrologist, principal component analysis and cluster analysis can contribute to the delineation of hydrologically homogeneous regions, enabling more consistent decision-making based on a more reliable database (by eliminating variables that contribute little to the study) and on the clusters obtained (by verifying the statistical behavior of the flow characteristics in the dendrogram).

CONCLUSIONS
Principal component analysis gave satisfactory results in excluding variables of little representativeness from the identification of hydrologically homogeneous regions.
The first two principal components, Y1 and Y2, accounted for 77.92% of the total variation of the data.
The Mahalanobis similarity matrix and the complete linkage clustering method gave good results in the identification of hydrologically homogeneous regions for all studied flows. Four hydrologically homogeneous regions were obtained for all studied minimum flow characteristics.
The regionalization equations obtained through multiple regression analysis for the minimum flow characteristics were considered satisfactory, supporting the methodology presented in this study.
The methodology proposed for identifying the number of homogeneous regions gave good results, enabling the elimination of subjectivity in the identification of hydrologically homogeneous regions.

FIGURE 1. Dissimilarity distance vs. clustering steps for Q7,10, obtained with the furthest neighbor (complete linkage) method.

FIGURE 2. Dendrogram for Q7,10 showing the clustering steps, obtained with the furthest neighbor (complete linkage) method.

FIGURE 3. Hydrologically homogeneous regions for minimum streamflow obtained for the Doce river basin.

TABLE 1. Gauging stations selected for minimum streamflow regionalization.

TABLE 2. Correlation matrix R between the independent variables considered.

TABLE 4. Load factors between the standardized variables (VP) and the principal components (CP), and variance (λi) of each principal component (i = 1, 2).

TABLE 5. Regression models that best fitted the minimum and average flow characteristics, and the fit statistics obtained. Flows in m³ s⁻¹, Ad in km² and Psc in mm. (**) Equations valid within the interval of the independent variables of the gauging stations that make up the hydrologically homogeneous region. (***) Adjusted determination coefficient (r²a), standard error of estimate (EP) and significance level of 5% by the F test.