Optimal pressure management in water distribution networks through district metered area creation based on machine learning

ABSTRACT Integrated management of water supply systems with efficient use of natural resources requires optimization of operational performance. Dividing water supply networks into small units, so-called district metered areas (DMAs), is a strategy that allows the development of specific operational rules, which improve network performance. In this context, clustering methods group neighboring nodes according to similar features, such as elevation or distance to the water source. Taking into account hydraulic, operational and mathematical criteria to determine the configuration of DMAs, this work presents two clustering methods: the k-means model and a hybrid model that combines a self-organizing map (SOM) with the k-means algorithm. Four mathematical criteria to determine the number of DMAs are compared, namely Silhouette, GAP, Calinski-Harabasz and Davies-Bouldin. The influence of three clustering topological criteria is evaluated: the water demand, node elevation and pipe length, in order to determine the optimal number of clusters. Furthermore, to identify the best DMA configuration, the particle swarm optimization (PSO) method was applied to determine the number, cost and pressure setting of pressure reducing valves (PRVs) and the location of DMA entrances.


INTRODUCTION
Water supply systems play a key role in urban design, not only to ensure that citizens have access to essential goods, but also for public safety reasons (DI NARDO; DI NATALE, 2011; GRAYMAN et al., 2009). The management of water supply systems becomes increasingly complex as available natural resources shrink and the need to reduce energy consumption and water loss grows.
The division of the water distribution network (WDN) into districts allows better management and higher hydraulic and energy efficiency, since operations are directed to the needs of each district, and offers greater control through measurement and monitoring. However, such division can be a complex task due to the size of the network and its peculiarities, such as the number of loops, the variation of the geometric dimensions and the changes in hydraulic conditions, which can make the division inconsistent if they are not considered (DIAO et al., 2012).
For the definition of a district metered area (DMA) it is necessary to determine the supply points (entrance points) and their influence regions. In this definition, water supply should provide sufficient quantity and quality to consumers. Operating pressures must be ensured inside a standardized range, a condition normally achieved by using pressure reducing valves (PRVs). The location of supply points in the district and the operating pressure are fundamental in the clustering process.
Corroborating the importance of dividing networks into districts, important works in the literature have proposed clustering tools. Tzatchkov et al. (2006) present a model based on graph theory for the segmentation of supply networks, relying on graph analysis and graph partitioning to find a suitable design for the DMAs. Swamee and Sharma (2008) propose the segmentation of multiple-source networks by assigning pre-defined zones of influence for the clustering. Herrera et al. (2010) proposed partitioning with methods based on machine learning for the definition of DMAs. Also based on graph partitioning, the authors included an unsupervised learning approach, developing a hybrid graph theory / data mining algorithm for DMA design. Diao et al. (2012) proposed the automatic creation of boundaries for the determination of measurement districts based on social structures, a tool in the field of Artificial Intelligence, and the decomposition theorem of complex systems (SIMON, 1962). Campbell et al. (2014) proposed a clustering method based on social networks for the determination of districts using energy efficiency as the criterion. In this work, the authors found a robust and computationally efficient technique for DMA design in large networks based on graph partitioning and a data mining technique for nodal clustering. The authors used topological criteria, such as the maximal demand of a district or the maximal difference in node elevation, as criteria for the graph partition. Di Nardo et al. (2014) proposed a method based on graph theory coupled to an optimization algorithm for the determination of the districts of a supply network, also aiming at energy efficiency improvement.
Among the several clustering tools, the k-means algorithm is the most prominent. Initially proposed by Steinhaus (1956), it is widely used for clustering problems due to its simplicity, versatility and speed of operation (WU et al., 2008), emphasizing its ability to handle a large amount of data (HUANG, 1998). On the other hand, with the advent of modern neurology and the consequent discoveries of cerebral functioning, mathematical models based on the behavior of this organ were proposed. Among them, Alhoniemi et al. (1999), Vesanto and Alhoniemi (2000) and Kohonen (2001) proposed the use of a self-organizing map (SOM), which simulates the recognition of patterns by the brain for grouping, classifying, estimating and predicting different types of problems, being widely used in the area of water resources.
The challenge of creating DMAs in supply networks cannot be fully solved from a database alone. Once the districts are defined, it is necessary to define the entrance of each of them, thus allowing the installation of control elements, such as PRVs, and ensuring complete isolation in cases of emergency or maintenance. Current proposals make use of hybrid optimizer-cluster models to determine the districts, minimizing structural costs and deterioration (GALDIERO et al., 2015).
During the last decades, water companies have been dividing their networks, aiming at better management. The recommendation made in the United Kingdom in the early 1980s (FARLEY, 1985) changed the management of water distribution systems: through the strategic placement of pressure control devices, the leakage rate could be reduced. Nevertheless, creating DMAs is still a complex task because many variables play important roles, such as topological and topographic features, costs and benefits. In order to develop an automatic tool for DMA design coupled to optimal pressure management, this work develops and analyzes two models of DMA creation in water supply networks using two sets of criteria, mathematical and topological. The first model is based on the k-means clustering algorithm and the second one is a hybrid method, combining the SOM and k-means methods, both with the purpose of determining the optimum number of groups of nodes with similar characteristics. Four mathematical criteria to determine the number of DMAs are evaluated, namely Silhouette, GAP, Calinski-Harabasz and Davies-Bouldin. In addition, the influence of three clustering topological criteria is evaluated: the maximal water demand, maximal difference in node elevation and total pipe length. Finally, an optimization model, based on the bio-inspired particle swarm optimization (PSO) algorithm, is applied for the allocation of control and isolation valves of the districts, as well as their operating point, minimizing the installation costs.
In this sense, the proposed method is composed of two stages. In the first one, a clustering algorithm (k-means) is applied based on physical (elevation) and topological (spatial position) parameters of the network. The algorithm divides the network into K groups, based on Euclidean distances from K centers that are initially randomly distributed and then iteratively updated to the mean value of each group. The important task at this stage is to define the value of K. To help solve this task, mathematical and topological criteria are explored in this paper. Each criterion is considered separately. For future works, mainly for the topological criteria, the analysis of correlation or interference between the criteria could be considered.

Self-organizing maps
The main objective of a SOM is to process input data in arbitrary dimensions and bring them to a one- or two-dimensional set of data, with transformations that guarantee topological similarity (HAYKIN, 2001). In general, the algorithm distributes a group of neurons within the characteristic space and, as iterations occur, this group changes so that the synaptic weights are representative of the multidimensional space, without previous knowledge of the behavior of such surface.
The position of each node $j$ of the network, $w_j$, also called the neuron, can be represented by equation 1:

$$w_j = [w_{j1}, w_{j2}, \ldots, w_{jm}]^T, \quad j = 1, 2, \ldots, N \qquad (1)$$

where $m$ is the dimension of the input space and $N$ is the total number of neurons in the network. The similarity between a weight vector $w_{ji}$ and an input pattern $x_i$ can be measured in terms of the distance between the two vectors. The neuron that satisfies the optimal condition of minimum distance is called the winning neuron and has associated to it a topological neighborhood that will define an activation zone. The criterion of similarity is given by equation 2:

$$\| x - w_c \| = \min_{j} \| x - w_j \| \qquad (2)$$

in which $\| x - w_c \|$ represents the Euclidean distance between the input pattern and the winning neuron, and $c$ represents the chosen winning neuron.
The weights of the winning neuron and its neighboring neurons are then adjusted according to equation 3:

$$w_j(t+1) = w_j(t) + h_c(t) \left[ x_i(t) - w_j(t) \right] \qquad (3)$$

where $t$ represents the iteration of the training, $x_i(t)$ is the input pattern and $h_c(t)$ is the neighborhood kernel around the winning neuron.
The definition of the neighborhood usually follows the idea in which the activation of nearby neurons is greater than the activation of distant neurons. Figure 1 presents, in a simplified way, a two-dimensional SOM with a two-dimensional input vector. The darker circle at the center represents the winning neuron and the gray scale shows the influence of the neighborhood in the adaptive process.
Once the actuation neighborhood is defined, each of the weights is updated so that all topological proximity information is considered. With the learning process finalized, each neuron will be close to a certain set of input data represented in the output space. Each of the neurons can then be defined as the center of a cluster with a set of data around it, then labeled.
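As an illustration only, the training loop described above (selection of the winning neuron by minimum Euclidean distance, followed by a neighborhood-weighted update of the weights) can be sketched in Python. The Gaussian neighborhood kernel, the linearly decaying learning rate and radius, the initialization from random samples and the names `train_som` and `best_matching_unit` are assumptions of this sketch and not the configuration used in this study.

```python
import math
import random


def train_som(data, rows, cols, iterations, radius0=2.0, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: each neuron starts at a randomly chosen input sample.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        x = rng.choice(data)
        # Winning neuron: minimum Euclidean distance to the input pattern.
        winner = min(weights, key=lambda j: math.dist(x, weights[j]))
        # Assumed schedule: learning rate and radius decay linearly with t.
        frac = 1.0 - t / iterations
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for j, w in weights.items():
            grid_d2 = (j[0] - winner[0]) ** 2 + (j[1] - winner[1]) ** 2
            # Gaussian neighborhood kernel centered on the winner.
            h = lr * math.exp(-grid_d2 / (2.0 * radius ** 2))
            for d in range(dim):
                w[d] += h * (x[d] - w[d])
    return weights


def best_matching_unit(weights, x):
    """Grid position of the neuron closest to the input pattern x."""
    return min(weights, key=lambda j: math.dist(x, weights[j]))
```

After training, each input node can be labeled with its best matching unit, and the neuron weight vectors can then be passed to a second clustering step, which is the idea behind the hybrid model.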

k-means
K-means is an unsupervised learning algorithm used to group the points of a network according to similar characteristics. The algorithm works by determining the centroid of each cluster. The best clustered data will have their centroids located farthest from each other, with the points of the network allocated to the nearest centroids. The k centroids are selected randomly in the input space and each input data point is classified according to its distance to the centroids. After the allocation, it is necessary to recalculate the position of the centroids and evaluate whether there is any change regarding the previous position, repeating the process until there are no changes. The new position of the centroids is calculated with equation 4:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (4)$$

where $\| x_i^{(j)} - c_j \|^2$ is the distance between an input vector $x_i^{(j)}$ and the centroid $c_j$, $k$ is the number of centroids and $n$ is the number of nodes in the network. In this study, the input vector $x_i$ has four dimensions, representing the demand, elevation, latitude and longitude of node $i$ of the network, as shown in equation 5:

$$x_i = [\mathrm{demand}_i, \ \mathrm{elevation}_i, \ \mathrm{latitude}_i, \ \mathrm{longitude}_i] \qquad (5)$$
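The assignment and update steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function name `kmeans`, the choice of initial centroids among the input nodes and the use of unscaled features are assumptions, and the node values in the usage example are invented.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Allocate every node to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no change: the process has converged
            break
        labels = new_labels
        # Move each centroid to the mean of the nodes allocated to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical nodes: (demand, elevation, latitude, longitude)
nodes = [(1.0, 10.0, 0.0, 0.0), (2.0, 11.0, 0.0, 1.0), (1.5, 10.5, 1.0, 0.0),
         (50.0, 80.0, 10.0, 10.0), (52.0, 82.0, 10.0, 11.0), (51.0, 81.0, 11.0, 10.0)]
labels, centroids = kmeans(nodes, k=2)
```

Because the four features have different units, a practical application would probably rescale them before clustering; the text does not state whether this was done here.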

Criteria for clustering in districts
Clustering criteria are used to feed the algorithms with information in order to identify similar network nodes, grouping them in specific DMAs. Two types of criteria were considered for clustering: topological and mathematical. The first takes into consideration only the physical features of water supply networks. The second considers the quality of the clusters created.

Topological criteria
The topological criteria of a water supply system, such as the maximal water demand, maximal difference in node elevation and total pipe length, define the hydraulic behavior of the network. Identifying such criteria in the clustering process can favor the pressure management in the districts.
The maximum water demand, the maximum elevation difference between nodes and the maximum pipe length of the same district were used to determine the number of clusters, varying the limit values of each one separately to verify the influence of each factor.
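One possible way to check a clustering against these limits is sketched below. The procedure is an assumption, since the text does not detail how the limits are enforced: for a given labeling, the total demand, the elevation difference and the total length of the pipes inside each district are computed and compared with the chosen limit. The names `district_stats` and `meets_limit` and the data layout are hypothetical.

```python
def district_stats(labels, demand, elevation, pipes):
    """Per-district totals for a clustering.

    labels[n], demand[n] and elevation[n] refer to node n;
    pipes is a list of (start_node, end_node, length) tuples.
    """
    members = {}
    for node, lab in enumerate(labels):
        members.setdefault(lab, []).append(node)
    stats = {}
    for lab, nodes in members.items():
        zs = [elevation[n] for n in nodes]
        stats[lab] = {
            "demand": sum(demand[n] for n in nodes),
            "elev_diff": max(zs) - min(zs),
            "length": 0.0,
        }
    # Only pipes with both ends in the same district count towards its length.
    for u, v, length in pipes:
        if labels[u] == labels[v]:
            stats[labels[u]]["length"] += length
    return stats


def meets_limit(stats, criterion, limit):
    """True if every district respects the limit for the given criterion."""
    return all(s[criterion] <= limit for s in stats.values())
```

Under this assumption, the number of clusters could be increased one at a time until `meets_limit` holds for the criterion under study, for example a demand limit in L/s or an elevation difference in m.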

Mathematical criteria
The main purpose of clustering data is to determine groups with solid characteristics that differ as much as possible from each other. In addition, the more compact the clusters, the less ambiguous the overall clustering. Thus, quality measures are presented in the literature as means to evaluate both the distance between clusters and their compactness. The mathematical criteria used for the cluster-quality analysis were: GAP, Silhouette, Davies-Bouldin and Calinski-Harabasz.

GAP
The GAP criterion (TIBSHIRANI et al., 2001) consists of obtaining a graph of error measurements in the clustering relating to the number of clusters of the network. The optimal clustering occurs when the maximum reduction of related error is achieved. Reduction in errors in relation to the number of clusters represents higher GAP values, with the optimal result occurring at the highest GAP value, local or global, considering tolerance limits. The GAP value is defined as shown in equation 6:

$$\mathrm{Gap}_n(k) = E_n^{*}\{\log(W_k)\} - \log(W_k) \qquad (6)$$

where $n$ is the sample size, $k$ is the number of clusters being evaluated and $W_k$ is the measure of dispersion within each cluster.
The expected value $E_n^{*}\{\log(W_k)\}$ is determined by the Monte Carlo method through a reference distribution and $\log(W_k)$ is computed from the sample data, as shown in equation 7:

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r \qquad (7)$$

where $n_r$ is the number of data points in cluster $r$, and $D_r$ is the sum of the distances between all points of cluster $r$.
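A minimal sketch of the GAP computation is given below. It assumes that the reference distribution is uniform over the bounding box of the data and that the clustering routine is passed in as a function `cluster_fn(points, k)` returning one label per point; both choices, as well as the function names, are assumptions of this sketch. The pairwise sum $D_r$ is taken over ordered pairs, which is why it is divided by $2 n_r$.

```python
import math
import random


def within_dispersion(points, labels):
    """W_k = sum over clusters of D_r / (2 n_r), D_r summed over ordered pairs."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        d_r = sum(math.dist(a, b) for a in members for b in members)
        total += d_r / (2 * len(members))
    return total


def gap_statistic(points, k, cluster_fn, n_refs=20, seed=0):
    """Monte Carlo estimate of Gap(k) with a uniform reference distribution."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    log_wk = math.log(within_dispersion(points, cluster_fn(points, k)))
    ref_logs = []
    for _ in range(n_refs):
        # Reference sample: same size as the data, uniform in the bounding box.
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
               for _ in points]
        ref_logs.append(math.log(within_dispersion(ref, cluster_fn(ref, k))))
    return sum(ref_logs) / n_refs - log_wk
```

The sketch fails with a math domain error if every cluster contains a single point, since $W_k$ is then zero; a production routine would need to handle that case.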

Silhouette
The Silhouette criterion (ROUSSEEUW, 1987; KAUFMAN; ROUSSEEUW, 1990) consists of a similarity analysis of specific data points in relation to the data of the same cluster, compared with the data of other clusters. The silhouette value ranges from -1 to +1, with low or negative values representing poor results and high values representing appropriate clustering results. This value is given by equation 8:

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (8)$$

where $a_i$ is the average distance from the $i$-th point to the other points in the same cluster and $b_i$ is the smallest mean distance from the $i$-th point to the points of a different cluster.
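For illustration, the silhouette value of each point, $(b_i - a_i) / \max(a_i, b_i)$, can be computed directly from its definition. The function name `silhouette_values` and the convention of assigning zero to points in single-member clusters are assumptions of this sketch.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of each point, in the order of the input."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            values.append(0.0)  # assumed convention for degenerate cases
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


values = silhouette_values([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
mean_silhouette = sum(values) / len(values)  # approximately 0.8997
```

The mean of these values over all points is the quantity usually compared across candidate numbers of clusters.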

Davies-Bouldin
The Davies-Bouldin criterion (DAVIES; BOULDIN, 1979) consists of a ratio of the distance of nodes within a given cluster to the distance between clusters. The Davies-Bouldin index is given by equation 9:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} D_{i,j} \qquad (9)$$

where $k$ is the number of clusters and $D_{i,j}$ is the ratio of the distances within clusters $i$ and $j$ to the distance between clusters $i$ and $j$. Equation 10 shows this ratio in mathematical terms:

$$D_{i,j} = \frac{d_i + d_j}{d_{i,j}} \qquad (10)$$

where $d_i$ is the mean distance between each point in the $i$-th cluster and its centroid, $d_j$ is the mean distance between each point in the $j$-th cluster and its centroid, and $d_{i,j}$ is the Euclidean distance between the centroids of the $i$-th and $j$-th clusters. The maximum value of $D_{i,j}$ results in the worst DMA creation performance, while the minimum value represents optimal creation.
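The index can be sketched in a few lines, assuming at least two clusters with distinct centroids. The function name `davies_bouldin` is hypothetical; for each cluster, the worst (largest) within-to-between ratio against any other cluster is taken and the results are averaged.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values indicate better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: [sum(c) / len(m) for c in zip(*m)]
                 for lab, m in clusters.items()}
    # Mean distance of the points of each cluster to its own centroid.
    spread = {lab: sum(math.dist(p, centroids[lab]) for p in m) / len(m)
              for lab, m in clusters.items()}
    worst = []
    for i in clusters:
        worst.append(max((spread[i] + spread[j])
                         / math.dist(centroids[i], centroids[j])
                         for j in clusters if j != i))
    return sum(worst) / len(clusters)


# Two tight, well separated clusters on a line give a small index.
db = davies_bouldin([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])  # 0.1
```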

Calinski-Harabasz
The Calinski-Harabasz criterion, or "variance ratio criterion" (VRC) (CALINSKI; HARABASZ, 1974), relates the variance between clusters to the variance within clusters. The VRC is given by equation 11:

$$VRC_k = \frac{SS_B}{SS_W} \times \frac{N - k}{k - 1} \qquad (11)$$

where $SS_B$ is the total variance between clusters (equation 12), $SS_W$ is the total variance within each cluster (equation 13), $k$ is the number of clusters and $N$ is the number of observations.

$$SS_B = \sum_{i=1}^{k} n_i \left\| m_i - m \right\|^2 \qquad (12)$$

where $n_i$ is the number of observations in cluster $i$, $m_i$ is the centroid of cluster $i$, $m$ is the overall average of the sample data, and $\| m_i - m \|$ is the $L^2$ norm (Euclidean distance) between the two vectors.

$$SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - m_i \right\|^2 \qquad (13)$$

where $x$ is a sample data point, $c_i$ is the $i$-th cluster and $\| x - m_i \|$ is the $L^2$ norm (Euclidean distance) between the two vectors.
High values of $SS_B$ and low values of $SS_W$ represent well-defined clusters. The higher the $VRC_k$ index, the better the clustering, with the optimum number of clusters defined by the solution with the highest Calinski-Harabasz index.
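A small sketch of the variance ratio criterion follows, assuming at least two clusters and a non-zero within-cluster variance. The function name `calinski_harabasz` is hypothetical. The between-cluster sum weights each centroid by its cluster size, and the result is scaled by $(N - k)/(k - 1)$.

```python
import math


def calinski_harabasz(points, labels):
    """Variance ratio criterion: higher values indicate better clustering."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = [sum(c) / n for c in zip(*points)]
    ss_b = 0.0  # variance between clusters
    ss_w = 0.0  # variance within clusters
    for members in clusters.values():
        centroid = [sum(c) / len(members) for c in zip(*members)]
        ss_b += len(members) * math.dist(centroid, overall) ** 2
        ss_w += sum(math.dist(p, centroid) ** 2 for p in members)
    return (ss_b / ss_w) * (n - k) / (k - 1)


# Worked example: SS_B = 100, SS_W = 1, N = 4, k = 2, so VRC = 200.
vrc = calinski_harabasz([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
```

Evaluating this function for each candidate number of clusters and keeping the maximum reproduces the selection rule described above.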

Optimal pressure management
To achieve optimum pressure management within each DMA, this study proposes the optimal allocation of valves at the entrance of each district, together with their pressure settings, aiming at the highest uniformity of pressure within the district.

The choice of the nodes belonging to a previously grouped DMA should comply with the minimum and maximum pressure constraints in addition to operational criteria that are raised throughout the study and enable better management of the districts.
Considering as decision variables the location of each of the valves and their respective pressure settings, the problem can be written as the minimization of the operating pressures of the system and of the pressure uniformity parameter ($PU_k$), which expresses the pressure deviation of each node with respect to the mean pressure of the nodes of a district. This measure, proposed by Alhimiary and Alsuhaily (2007), is shown in equation 14. The minimization problem is subject to the pressure constraints (equation 15) and the number of nodes belonging to a DMA (equation 16).
In these equations, $PU_k$ is the pressure uniformity parameter for a given district $k$, evaluated over the total simulation period, $N_k$ is the number of nodes belonging to district $k$, $P_{i,t}$ is the pressure at a given node $i$ for the time step $t$, $\bar{P}_{k,t}$ is the mean pressure of district $k$ in time step $t$, and $P_{min}$ and $P_{max}$ are the minimum and maximum standardized pressures, respectively. The bio-inspired Particle Swarm Optimization (PSO) algorithm is used to determine the position of the valves and their respective pressure settings.

Particle Swarm Optimization -PSO
Particle Swarm Optimization (PSO) is a population-based algorithm that has particles as the elemental unit. The particles are composed of two vectors of size D (dimension of the problem).
One of these vectors represents the position of the particle and the other its displacement velocity. The first step of the method is the initialization of the particles, done randomly within a range of interest, both for the position and for the velocity. At each iteration $n$, the particle information is updated, considering its best position ever achieved ($p_{id}$) and the group best position ($g_{id}$), as shown in equations 18 and 19 (EBERHART; KENNEDY, 1995). The process continues until one of the stopping criteria is reached, such as reaching the optimum within a prescribed error tolerance, the maximum number of iterations, the lack of improvement in the objective function for a given number of iterations and other stopping criteria widely used in numerical problems (FAIRES; BURDEN, 1998).

$$v_{id}^{n+1} = v_{id}^{n} + c_1 r_1 \left( p_{id} - x_{id}^{n} \right) + c_2 r_2 \left( g_{id} - x_{id}^{n} \right) \qquad (18)$$

$$x_{id}^{n+1} = x_{id}^{n} + v_{id}^{n+1} \qquad (19)$$

where $x_{id}$ and $v_{id}$ are the position and velocity of particle $i$ in dimension $d$, $d = 1, 2, \ldots, m$, with $m$ the number of variables of the problem, and $n = 1, 2, \ldots, N$, with $N$ the maximum number of iterations. Also, $r_1$ and $r_2$ are numbers randomly chosen within the range [0,1], and $c_1$ and $c_2$ are the cognitive and social coefficients, respectively. The first is used in the initial iterations to perform a global search, while the second improves the local search in the final iterations, when the swarm is expected to be close to an optimal solution.
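The update of velocity and position by the personal and group best positions can be illustrated with the generic sketch below. It is not the valve placement model of this study: the objective is a simple test function, and the velocity clamp, the clipping of positions to the bounds, the parameter values and the function name `pso` are assumptions added to keep the basic scheme stable.

```python
import random


def pso(objective, bounds, n_particles=20, n_iter=100, c1=2.0, c2=2.0, seed=0):
    """Basic PSO (minimization); returns (best_position, best_value, history)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Assumption: velocities are clamped to 20% of each variable's range.
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]
    p_val = [objective(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g][:], p_val[g]
    history = [g_val]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: cognitive term plus social term.
                v[i][d] += (c1 * r1 * (p_best[i][d] - x[i][d])
                            + c2 * r2 * (g_best[d] - x[i][d]))
                v[i][d] = max(-v_max[d], min(v_max[d], v[i][d]))
                lo, hi = bounds[d]
                # Position update, clipped to the search range.
                x[i][d] = max(lo, min(hi, x[i][d] + v[i][d]))
            val = objective(x[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = x[i][:], val
                if val < g_val:
                    g_best, g_val = x[i][:], val
        history.append(g_val)
    return g_best, g_val, history


def sphere(z):
    """Test objective with its minimum at the origin."""
    return sum(c * c for c in z)


best, value, history = pso(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

In the application described here, each particle would instead encode candidate valve locations and pressure settings, and the objective would be evaluated through a hydraulic simulation of the network.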

RESULTS AND DISCUSSION
The method proposed was applied to the D-Town network (MARCHI et al., 2013), composed of 398 nodes, 458 pipes, 7 tanks, 1 reservoir, 13 pumps and 4 valves, as shown in Figure 2.
The SOM was configured with 25 rows and 25 columns in a square topology, a maximum of 4000 iterations and a topological neighborhood size of 4 neurons. This arrangement was chosen through a sensitivity analysis, considering the processing time and the efficiency of the algorithm, measured by the quantification of the errors.

Topological criteria
A total of 18 scenarios were generated, 9 with the k-means algorithm and 9 with the hybrid algorithm, varying the district's maximal water demand, maximal difference in node elevation and total pipe length for the district. For each criterion, the cluster quality was evaluated using the Calinski-Harabasz index (VRC), in which a higher index value represents a higher quality of DMA creation.
Starting with the demand criterion, Table 1 presents the VRC index values for each of the limits used. For hybrid clustering, the demand limit of 140 L/s differs only slightly from the other values. Still, the best value of VRC is obtained by creating DMAs with the k-means method. Figure 3 presents the best creation scenario for each of the methods using the demand of 140 L/s as the limit value.
A spatial difference in the clustering patterns can be noticed between one method and the other. The k-means method generates more circular districts, around a center of gravity, which is more compatible with reality.
Following the evaluation of the criteria for DMA creation, Table 2 shows the value of the VRC index using the elevation as parameter. In both methods the best value for VRC occurs with the maximum elevation difference of 75 m. Figure 4 shows the final distribution of the districts for each of the algorithms.
The last topological criterion analyzed was the maximum total pipe length for the district. Table 3 shows the value of the VRC index for each of the criterion limits. The district with a maximum of 15 km has the best performance, and the clustered network for this limit value, for each of the algorithms, is presented in Figure 5.
Within the topological criteria, the one that presented the best performance, when evaluated by the VRC index, was the scenario generated by the k-means algorithm with the maximum district length criterion. This result is very close to that of the districts generated by the same algorithm with the maximum demand criterion. In general, the k-means algorithm alone performed better than the hybrid model.

Mathematical criteria
A total of 8 scenarios were generated, 4 with the k-means algorithm and 4 with the hybrid algorithm, varying the mathematical criteria. For each mathematical criterion, the quality of the district was also evaluated using the Calinski-Harabasz index. Table 4 shows the value of the VRC index for each of the mathematical criteria used.
It can be noticed that, for the k-means method, the scenario obtained by the VRC criterion itself had the best result, similar to those found with the topological criteria. On the other hand, the hybrid model had a better result with the scenario generated by the Davies-Bouldin criterion (DB) but, once again, in all cases the hybrid model produced clusters of lower quality than the pure k-means method. Figure 6 shows the final distribution of the districts for each of the criteria.

Optimization of entrance location and operational point of PRVs
For each criterion used in the creation of DMAs, an optimization was performed on the k-means method, with the purpose of analyzing the cost involved in the optimal allocation of PRVs and the distribution of the pressures in the network under conditions of maximum and minimum demand for a period of 24 hours. The choice of k-means models is justified because they presented better results in the creation of DMAs, with well-distributed and compact clusters.
The total cost represents the cost involved in the installation of PRVs, while the unit cost represents the cost per valve installed. The costs of PRVs are based on Saldarriaga et al. (2019). This analysis was made in order to obtain insights into the costs associated with the pressure optimization. Table 5 presents the optimization results for each criterion.
A good pressure distribution in the network occurs when the operating pressures of the system and the standard deviation between them are minimized, both in the conditions of minimum and maximum demand, compared with the situation without optimization. The topological criteria presented an improvement in the distribution of pressure in the network, with emphasis on the "Length 15 km" criterion, which showed a significant reduction in the pressure required by the network, evident in Figure 7. The mathematical criteria also showed an improvement in the distribution of pressure in the network, with emphasis on Calinski-Harabasz, which presented a significant reduction in the pressure required by the network, evident in Figure 8.

DISCUSSION
It is possible to notice from Figures 3-6 that the models that used only k-means to group the nodes of the network produced well-distributed and compact districts. With the hybrid model, all the clusterings kept the same pattern of diagonal bands, losing the compactness of the clusters and possibly creating difficulties in the strategic management of the districts, since they have an elongated shape.
The variation of the topological criteria resulted in changes in the arrangements and number of districts, in which the increase of the criteria values tended to reduce the number of districts.
Novarini et al.

The mathematical criteria did not show drastic differences among them, with the Calinski-Harabasz criterion presenting the largest number of districts and the GAP criterion the lowest number of districts in the case of the model using only k-means.
When analyzing the Calinski-Harabasz index in the clustering, it is possible to notice that the models with the k-means algorithm presented, in general, higher indexes and thus a higher quality. The best clustering with respect to the topological criteria was given for the maximum DMA water demand equal to 140 L/s (6 DMAs generated), the maximum difference in node elevation within a DMA equal to 75 m (6 DMAs generated) and the maximum total pipe length of the DMA equal to 15 km (6 DMAs generated). The best clustering in relation to the mathematical criteria was obtained by using the Calinski-Harabasz method (8 DMAs generated), although the Silhouette (2 DMAs generated) and Davies-Bouldin (2 DMAs generated) methods presented very close indexes.
When analyzing the creation of DMAs in terms of mathematical criteria, the Silhouette, Davies-Bouldin and GAP criteria presented poor hydraulic results, with only 2 DMAs created, which is not a significant improvement for management purposes. The Calinski-Harabasz criterion presented a good result, with 8 compact districts well distributed throughout the network and good quality evaluation indexes, in addition to a lower unit cost for PRV installation (U$ 1,780).
When analyzing the creation of DMAs in terms of the topological criteria, all presented good results, with 6 DMAs created, compact and with well-distributed characteristics. The criterion "Demand 140 L/s" presented the lowest total cost (U$ 33,092) and unit cost (U$ 1,947) for PRV installation.
It is possible to notice that the total cost of installation increases with the number of DMAs. However, the unit cost tends to decrease, as there are more boundary pipes and it is more likely that valves will be installed on smaller diameters, reducing the cost of PRVs. Figures 7 and 8 highlight the efficiency of the network optimization as to the distribution of pressures under the conditions of minimum and maximum demand of the system, reducing the overall pressure required by the distribution network. From the quantitative point of view, the PU in the network was reduced from 52.07 to 44.33 on average. This reduction in PU corresponds to a leakage reduction of 30% over the entire network. This leakage is calculated following the methodology presented by Brentan et al. (2017) and takes into account a scenario of the network in operation without DMAs and the scenarios with DMAs.
Even if the benefits of DMA design are clear, given the diversity and dynamics of WDNs it is hard to evaluate how large this benefit will be for a water utility without simulations and deeper studies of particular cases.

CONCLUSION
This work presented the comparison between a hybrid model (SOM + k-means) and a pure k-means model for the creation of DMAs with the purpose of optimizing the water supply system, considering the similarity of the topological conditions of the nodes of the network, mathematical criteria and topological criteria to find the optimum number of DMAs.
The topological similarity of nodes in the water distribution network was essential for the effective creation of DMAs. The k-means method performed well, presenting good quality assessment indexes and the ability to simplify the water supply network, an important feature for water distribution management.
The use of mathematical criteria alone can generate a solution that is impractical from the hydraulic point of view. In future work, the topological criteria must be considered jointly with the mathematical criteria to improve the quality of DMA creation.
Depending on the criteria used, the size and configuration of the DMAs will be unique and it is up to the system's managers to choose the criteria that will best suit the water distribution network, considering the costs involved.
From the mathematical point of view, the DMA design process can be affected not only by the hydraulic or physical features, but also by the formulation of the optimization problem. In this work, the optimization is applied to the optimal placement of control valves. In this sense, the costs of the valves (related to the number of valves and their diameters) are minimized, taking into account operational parameters, such as the pressure deficit and pressure uniformity, in a single-objective approach. The problem could easily be recast as a multi-objective optimization, either by considering the evaluation parameters (resilience, pressure uniformity, etc.) as objectives or by turning the constraints of the problem into objectives to be reached. If, on the one hand, the multi-objective approach can be useful for real and complex problems, on the other hand, the final Pareto front must be analyzed and the opinion of decision makers will play an important role in the final solution of the problem.