Parallel Implementations of RCM Algorithm for Bandwidth Reduction of Sparse Matrices

. The Reverse Cuthill-McKee (RCM) algorithm is a well-known heuristic for reordering sparse matrices. It is typically used to speed up the computation of sparse linear systems of equations. This paper describes two parallel approachesfor the RCM algorithm as well as an optimized version of each one based on some proposed enhancements. The ﬁrst one exploits a strategy for reducing lazy threads, while the second one makes use of a static bucket array as the main data structure and suppress some steps performed by the original algorithm. These related changes led to outstanding reordering time results and signiﬁcant bandwidth reductions. The performance of two algorithms is compared with the respective implementation made available by Boost library. The OpenMP framework is used for supporting the parallelism and both versions of the algorithm are tested with large sparse and structural symmetric matrices.


INTRODUCTION
Computation involving sparse matrices have been of widespread use since the 1950s, and its application includes electrical networks and power distribution, structural engineering, reactor diffusion, and, in general, solutions to partial differential equations [29].The typical way to solve such equations is to discretize them, i.e., to approximate them by equations that involve a finite number of unknowns.The linear systems that arise from these discretizations are of the type Ax = b, in which A is a large and sparse matrix, that is, it has very few nonzero entries.In this context, a matrix with a small bandwidth is useful in direct methods for solving sparse linear systems since it allows a simple data structure to be used.It is also useful in iterative methods because the nonzero elements will be clustered close to the diagonal, thereby enhancing data locality.Given a symmetric matrix A, a bandwidth-reduction ordering aims to find a permutation P so that the bandwidth of P AP T is small.The bandwidth of a matrix A denoted by β(A) is defined as the greatest distance from the first nonzero element to the diagonal, considering all rows of the matrix [29].More formally, for the i-th row of A, i = 1, 2, . . ., n, let f i (A) = min{ j | a i j = 0}, and b i (A) = i − f i (A).So, β(A) = max i=2,3,...,n {b i (A)}.
Since Papadimitrou [23] proved that the bandwidth minimization problem is NP-complete, several heuristic algorithms have been presented in the literature aiming to find good quality solutions as fast as possible.An important class of these algorithms treats a matrix bandwidth reduction under the perspective of a graph labeling problem.In this way, reordering a sparse matrix is considered a problem of labeling the vertices of the corresponding graph in such way that closest labels are assigned to most linked vertices.
In this paper, the parallel Unordered and Leveled implementations of the RCM algorithm proposed by [16] are presented.In the first one, the underlying matrix graph is completely processed in one step, followed by the permutation vector generation.On the other hand, in the second one, the graph traversing and the vector computation take place iteratively level-by-level.The original implementation of these algorithms is based on the Galois system 1 .However, as the OpenMP 2 framework has widely been used in researches related to shared-memory parallel systems, this work analyses the results obtained by an OpenMP implementation of these algorithms.Furthermore, an optimized version of each one is introduced.Actually, the set of proposed optimizations leads to outstanding improvements on the performance of the algorithms.In the Unordered RCM, a strategy to reduce lazy threads is evaluated.On the other hand, the Leveled RCM is presented with an alternative main data structure -a static Bucket Array [3] -in replacement of FIFO queue [2] proposed originally.Relevant CPU time improvements were observed by means of this change.
The outline of the paper is as follow.In the next section, an overview of the serial RCM is presented.Section 3 is dedicated to description of the Unordered RCM algorithm as well as some proposed optimizations.In Section 4, the Leveled RCM algorithm is presented along with an optimized version of it.All tests and achieved results are described in Section 5. Conclusions and future works are addressed in Section 6.

SERIAL REVERSE CUTHILL-MCKEE
The serial Cuthill-McKee algorithm [8] is based on a Breadth-First-Search (BFS) strategy, in which the graph is traversed by level sets 3 .As soon as a level set is traversed, its nodes are marked and numbered.The neighbors of each of these nodes are then inspected.Each time a neighbor of a visited vertex that is not numbered is encountered, it is added to a list and labeled as the next element of the next level set.The order in which each level itself is traversed gives rise to different orderings or permutations of rows and columns.In the Cuthill-McKee ordering, the nodes adjacent to a visited node are always traversed from the lowest to the highest degree.Galois is a system that automatically executes serial C++ or Java code in parallel on shared-memory machines [10].OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs [22] A level set structure of a graph is defined recursively as the set of all unmarked neighbors of all nodes of a previous level set.Initially, a level set consists of one node.
Tend.Mat.Apl.Comput., 18, N. 3 (2017) However, in 1971, the Reverse Cuthill-McKee algorithm was presented by [20].It was empirically observed that reversing the Cuthill-McKee ordering yields a better permutation scheme for matrix reordering problems.Algorithm 1 presents the pseudo-code of serial RCM.
Find all unlabelled neighbors of the vertex v i 4: Label the vertices found in increasing order of degree The root r of the level structure is usually chosen from the pseudo-peripheral4 nodes of the associated graph.In this work, the parallel pseudo-peripheral node finding algorithm presented by [30] was implemented and used for obtaining the root vertex required by RCM.

UNORDERED PARALLEL REVERSE CUTHILL-MCKEE
The first RCM parallelization suggested by [16] is composed of four main steps.Based on this, the overall skeleton of the Unordered Parallel RCM is not iterative, that is, there is no a global loop wherewith the algorithm traverses the graph gradually.Instead, the four steps are executed just once and the permutation array is yielded as final output.These related steps are detailed as follow.
Step 1: Generate a level structure through the Unordered BFS algorithm.The Unordered BFS algorithm relies on the perception of local minimums in the graph regarding the levels of the adjacent nodes.More precisely, the level of every node (excepting the root) may be computed as the highest level among the respective neighbors added of one [11].In this way, the level computation for a node n may be described as a fixpoint system5 : Algorithm 2 shows the pseudo-code of the implemented Unordered BFS and the respective optimizations suggested by this work.In order to explore the fixed point feature, an unordered worklist (wl) structure must by maintained by the algorithm.A structure of this type makes possible any node to be picked up.Thereafter, the algorithm is able to processing several nodes in parallel.Nevertheless, as threads access the worklist in a non-deterministic way, a node may be eventually set with a level higher than the correct final value.Despite this, the successive iterations executed by threads over the worklist generate a monotonic decrement of the level of each processed nodes.This decrement stabilizes at the appropriated node level (fixed point iteration).
This process is exemplified by Figure 1.Initially at step (i), three nodes have already been inspected by the algorithm.The root r is the first node processed (black background) by a thread, and its respective neighbors are active (gray background) in the worklist.The level of them has been defined as the level of the parent (zero) added of one.At this moment, both nodes (a and b) may be taken by a thread.Eventually, in the step (ii), the node b was selected and processed by a thread.Consequently, its unprocessed neighbors (just the node c) were selected and became actives in the worklist.Next, two nodes are available to be processed: a (from the previous step) and c (from the current step).As there is no a specific order to process the nodes, anyone may be taken.In this case, the node c was selected and its not processed neighbor (node d) become active with the level defined as three.Finally, at step (iii), just the node a remains in the worlist.So, it is taken by a thread, and the level of its not processed neighbor (node d) is set to the appropriated value, i.e., two.Two optimizations previously presented by [11] were incorporated into the Unordered BFS (Algorithm 2).

Work chunking:
In order to minimize the access to the global worklist ws by threads, it was adopted the strategy of making each thread able to dequeue a bunch of nodes (WORK CHUNK) from the worklist rather than only one at a time (lines 6-10).As each thread caches a local worklist threadws (a subset of ws), they can process it independently of each other.Thus, a local set of new actived nodes (relaxedws) is built by the running threads, and finally, each relaxedws is merged into the global worklist (lines [20][21][22].The size of the bunch of nodes picked up by each thread is dynamically defined in the algorithm.Actually, empirical test points out to a better algorithm performance when considering the WORK CHUNK size as fifty percent of the total of nodes available in the global worklist [24].int level = n.getLevel()+ 1; 16: v.setLevel(level); relaxedws.add(v);atomic ws.add(m); 23: } } } }

Wasted work reduction:
It was implemented a strategy to reduce the time wasted by each waiting thread (idle threads) in which all threads remove active elements from one end of the worklist and add to the other.The concurrent access of each worklist end is managed by two distinct access lock.Naturally, this approach relaxes the strict order in which the worklist is processed.However, to ensure this strict order increases the access time of the worklist beyond the benefit of reducing the amount of wasted work.
These two related optimizations were earlier explored in the work [24].But the comparison of the parallel Unordered RCM was made against a correspondent serial algorithm implemented by [13].
Step 2: Count nodes by level.Initially, all nodes of the graph are divided among the set of threads.Each thread counts locally how many nodes, from its respective subset of nodes, belong to each level.Moreover, a local maximum level is determined by each thread.Next, the global maximum level is computed through the comparison of each maximum local level of each thread.Finally, as the number of levels of the graph is already computed, thus a range of levels is assigned to each thread that, in turn, counts how many nodes were computed by all threads in its respective range.The result is stored in the global array.
Step 3: Compute the prefix sum of levels.The prefix sum6 calculus implemented in this work is based on the algorithm proposed by [1].Initially, each thread computes the prefix sums of the n p elements it has locally.The total number of elements (n) corresponds to the maximum level accounted for by the previous step of the Unordered RCM algorithm.The value p is related to the number of threads.Hereafter, the last prefix sum of each thread is assigned to arrays responsible for guiding the data exchanging process among the threads.In fact, the local prefix sum values are exchanged and each thread accumulates the respective received value.The rule to determine a pair of threads that is going to communicate is through a XOR (exclusive OR) bitwise operation between the unique identifier of the sender thread and a constant related to the group of the receiver thread.Finally, each thread merges the result from the accumulated prefix sums with each local prefix sum initially computed.
Step 4: Build the permutation array.The underlying concept behind the operation of this phase is the pipelining of threads actions among the levels of the graph.For this, one thread is assigned for each level, and the communication among them takes place in pairs: a thread responsible for a level l plays a producer (writer) role, while a thread assigned to the level l + 1 acts as a data consumer (reader).Every read/write operation happens over the permutation array.The controller of this implemented producer/consumer paradigm is done through the prefix sum array (sums) generated in the previous step.Actually, it is never changed once its values are used as bounds for threads operations.In this way, this pipeline makes possible the construction of the permutation array in a parallel way: while a thread writes the children nodes from a level l in the permutation array, another thread reads these ones in order to write the corresponding neighbors of them at level l + 1 in the permutation array.

LEVELED PARALLEL REVERSE CUTHILL-MCKEE
The second parallelization of RCM presented by [16] makes use of a different approach.There is no an unordered execution of threads through the graph as a whole.Instead, the graph is traversed level by level, and the parallelism is restricted to one level per time.In each level-iteration, the permutation array is built gradually.The step-by-step executed by the Leveled RCM is detailed in Algorithm 3.
Iterations are composed of four steps carried out inside the outermost parallel loop (line 3).Initially, threads explore the neighbors of the parent nodes previously stored in the permutation array (lines 6-16).This expansion step sets the appropriated level of each explored neighbor.In the two next steps, the number of children (neighbors) per level is accounted (reduction steplines 18-20), and the prefix sum is generated subsequently (lines [22][23][24].Finally, the neighbors processed in the current iteration are stored in the permutation array (placement step).} }

Optimized Leveled Parallel RCM
The proposed new version of the Leveled Parallel RCM follows the same strategy of the original one.It is based on a graph traversing level-by-level and the parallelism is restricted to each level.However, an alternative main data structure is used in order to conduct the nodes processing iteratively [25].Actually, during each iteration, the processing relies on a Bucket Array in which inspected nodes are stored in specific buckets or cells.In order to determine the correct bucket for storing a node, it is used a hashing function.As this type of function may generate a same output for distinct inputs, each bucket array position is able to enqueue nodes.Figure 2 presents an operating scheme of the bucket array implemented for the optimized Leveled RCM.Furthermore, the use of the Bucket Array as main data structure produces an important impact on the design of the new proposed algorithm.In effect, the number of steps of the original algorithm were merged into just two.The algorithm idea is presented in Algorithm 4. // Expansion The algorithm starts setting a pseudo-peripheral node (source node) at the first position of the permutation array.Hereafter, the parallel processing comes into play inside the outermost loop (line 3).Every iteration is composed of two steps.In the first one, all nodes to be processed are read from the permutation array (P), and the respective children of each node are inspected.A global range (parent1:parentN -line 6) is kept in order to control the parents to be processed in each iteration.Thus, the appropriated level is set to every U N L AB E L E D child (line 11), the status of each one is updated to L AB E L E D (line 13), and they are stored in the correct positions in the bucket array generat ion (line 14).Two features related to the use of a bucket array in this step are ever relevant for the algorithm performance: • Implicit reduction step.As each processed child node is stored in the bucket of its respective parent, the number of children per parent becomes automatically available through a simple node counter.It is incremented at each bucket insertion.Thus, there is no necessity of a separated step (reduction) for to execute this computation, as is done by the original algorithm.
• Cheap hashing function.Each cell or bucket of the main data structure corresponds to a parent node of some inspected node in the current iteration.So, the hashing function employed in mapping nodes to buckets involves just a memory address computation.In order words, to find out the correct position of a node in the bucket array is related to the cost of one memory access.
After all threads conclude the building bucket array process by means of the parent nodes inspection, they advance to the second iteration step.In this phase, each thread takes a bucket and stored the children mapped to it in the final permutation array (lines [21][22].This way of traversing the processed nodes, i.e. bucket-by-bucket instead of node-by-node, represents an important improvement for the performance of the algorithm.In effect, this strategy makes possible threads execute all placement work independently of each other.This because only two synchronized operations (lines 18 and 19) are necessary.Both of them are related to updating the range of the permutation array to be made available for the next iteration.In contrast, the original algorithm guides the threads work through the prefix sum computed in a previous step.This specific step is not necessary in the bucket-by-bucket computation proposed by the optimized algorithm.
All aforementioned modifications implemented in the original algorithm were early evaluated in a previous work [25].However, the comparison was made against a free OpenMP-based implementation of the original algorithm.The evaluation shown that the Leveled RCM speedup is critically compromised by a straightforward replication of it to the OpenMP platform.

EXPERIMENTAL RESULTS
The performance evaluation of the Optimized Leveled and the Optimized Unordered RCM algorithms was against a serial implementation of RCM made available by the Boost Library [28].
A set of twenty structural symmetric and square matrices was selected from the University of Florida Sparse Matrix Collection [9].These matrices cover multiple type of problems in order to increase the dataset variety and the percentage of sparsity of each one is higher than 99.97%.The set of tested matrices is shown in Tables 2 and 3 show a quality and performance comparison, respectively, among the serial Boost Library implementation of RCM and the two modified parallel versions of the algorithm, Unordered and Leveled, which have been proposed in this work.The algorithms were performed five times for each pair (m i , t j ), where m i is a matrix of Table 1, and t j is the number of threads between 1 and 12 (in steps of 2) used by each algorithm.For each (m i , t j ) tested pair, the average was calculated from the reported values.In order to compare both algorithms, for each matrix m i , it was selected the number of threads t j that reached the best time reordering for the Unordered algorithm.This same number of threads t j was used to select the corresponding (m i , t j ) tested pair from the Leveled algorithm.Moreover, the Compressed Sparse Row format was the mechanism used to store each tested matrix.The operations applied on them were also performed using this format.

Reordering Quality
Table 2 shows the new bandwidth obtained after applying the permutation generated by each algorithm.In the column Bandwidth the original bandwith of the matrices is presented.The same band reordering values were reached by the three implementations for five of tested matrices.The Boost library overcomes both parallel algorithms only for three matrices: Ga41 As41H 72, audikw 1, and diel Filt er V 2real.For the remaining matrices, both proposed algorithms attained the smallest values for the bandwidth.In particular, only for the Serena matrix, the algorithms reached a less expressive reordering result around 0.17%.Five matrices achieved an intermediate reduction rate (up to 56.04%): Fault 639, Ga41 As41H 72, tmt sym benzene, and tmt unsym.With the other matrices, the percentage of reduction various between 95.60% (nlpkkt 80) until 99.75% (helm2d03).The quality reordering may also be graphically attested through Figure 3.It shows the five matrices with best bandwidth reduction rate reached by the algorithms.The first row of each group presents the matrix sparsity before reordering.In the below rows, each respective matrix is exhibited as result of a permutation of rows and columns derived from the Unordered RCM algorithm.Actually, the reordering pattern reached by any one of the three tested algorithms might be used to depicts the quality of applied permutation.This because the differences among the bandwith reduction reached by each one is less than 0.002% and may be negligible.

Reordering Performance
Table 3 shows the elapsed time (Time columns) by the algorithms, in scale of 10 −3 seconds, to reorder each matrix considering the best set of threads (column #Th.).Other additional columns (Reduction) present the time reduction percentage reached by the respective parallel algorithm.The pseudo-peripheral vertex used as source node by the three evaluated algorithms is obtained through the algorithm mentioned in Section 2. Because this, the time spent in each pseudoperipheral computation was not included in this performance analysis.As showed by the Table 3, for all tested instances both proposed algorithms attained better performance than Boost library.In fact, the Unordered RCM decreased the time reordering in two order of magnitude for twelve tested matrices.For other seven matrices, there was one order of magnitude in the time reduction.The Leveled RCM also presented an outstanding efficiency for the set of matrices.A direct performance comparison between the two parallel approaches points out the superior efficiency of the Unordered algorithm.Indeed, the best times were reached by the Unordered RCM for sixteen tested matrices.On the other hand, the Leveled RCM obtained the best values for another subset of four instances.The time reduction percentage reached by the Unordered algorithm various between 69.83% (G3 circuit ) and 98.08% (F1).Analogously, the reduction ratio obtained by the Leveled RCM was from 71.37% (t hermomech T C) to 92.96% (msdoor).
Figure 4 presents how speedups of the Unordered RCM scale as the number of threads is increased.Five matrices with the highest rate of speedup were selected.As shown by this figure, audikw 1 and diel Filt er V 3real are the matrices with best speedups of this group.It is relevant to notice the related matrices are able to scale until 12 threads, and the maximum speedup ratio reached by each one was 4.48X and 4.07X respectively.The speedup ratio reached by the other three matrices was of 3.06X (Ga41 As41H 72), 3.02X (F1), and 2.46X (msdoor).An important feature to point out about these five matrices is the large number of nonzeros per row of each one (approximately an average of 71 NNZ/row -for details see Table 1).Therefore, speedups of parallel algorithms like Unordered RCM which are based on a BFS approach are higher for graphs with a larger number of edges per node.Another feature is related to the no significant speedup observed for the Leveled RCM.As it uses a different general strategy based on a parallelism by level [16], a barrier must be placed between each level.Because of this, increasing the number of threads do not necessarily lead to an improvement of performance.Both parallel implementations of the RCM presented in this work were supported by the OpenMP framework.However, some works have addressed the reordering problem through the use of another kind of parallelism tool.As example, [7] has developed a parallel tool for graph partitioning, and several works have discussed the parallelization of algorithms by the Galois System [11].Moreover, other data structures and BFS strategies have been proposed for the parallelism of RCM.In fact, Hassam et al. [11] present relevant results from a wavefront BFS implementation, and Leiserson and Schardl [18] propose a novel implementation of a workset data structure, called "bag", in place of FIFO queue usually employed in BFS algorithms.The use of these new structures and strategies might promotes more improvements to the algorithms studied in this work.Moreover, additional comparisons against other mathematical libraries may ever endorse the relevant performance reached by the algorithms.In this way, a confrontation with the Harwell Subroutine Library (HSL) [12] -a collection of state-of-the-art packages for large-scale scientific computation -constitutes an important complementary test to be made.Palavras-chave: RCM paralelo, reduc ¸ão de largura de banda, matrizes esparsas.

Figure 3 :
Figure 3: Sparse matrix pattern yielded by the Unordered RCM.

Table 1 .
[31]columns present matrices and some characteristics of them: dimension, number of non-zeros, and average of non-zeros per row (NNZ/row).The program was coded in the C language and the parallelism was supported by OpenMP framework.The experiments were performed on a PC running Ubuntu Linux, version 14.04.5 LTS, with Kernel version 3.19.0-25.It consists of 2 Xeon processors of 8 cores each, operating at 2.4 GHz.Each core has a unified 256KB L2 cache and each processor has a shared 20MB L3 cache.The PC contains 32GB of main memory and code was compiled with GNU gcc version 4.8.4.The complete source code is available on GitHub repository[31].
This paper analysed two parallel strategies for the traditional Reverse Cuthill-McKee reordering algorithm.Two modified implementations were presented and the results achieved by both algorithms represent a significant improvement on the reordering time.With the Unordered RCM algorithm, it reached a time reduction of until 98.08% of the serial time obtained by the Boost library.Other significant results show the Unordered RCM achieving speedups until 4.478 with 12 threads.A relevant performance was also achieved by the Leveled RCM algorithm.The permutations generated by it reduced the bandwidth until 99.75%.It is noticeable the changes proposed for the two parallel algorithms led to outstanding performance enhancements.About the reordering quality, both implementations attained relevant bandwidth reduction.Therefore, the two modified parallel RCM algorithms may be considered as efficient approaches for the bandwidth minimization problem.