Parallel Implementation of a Two-level Algebraic ILU(k)-based Domain Decomposition Preconditioner

ABSTRACT. We discuss the parallel implementation of a two-level algebraic ILU(k)-based domain decomposition preconditioner using the PETSc library. We present strategies to improve performance and minimize communication among processes during the setup and application phases. We compare our implementation with an off-the-shelf preconditioner in PETSc for solving linear systems arising in reservoir simulation problems, and show that for some cases our implementation performs better.


INTRODUCTION
This paper discusses the formulation and the parallel implementation of an algebraic ILU(k)-based two-level domain decomposition preconditioner first introduced in [2].
In this work we present and discuss details of the implementation using the MPI-based PETSc suite [3], a set of data structures and routines for the parallel solution of scientific applications modeled by partial differential equations. We also present results of computational experiments involving matrices from oil reservoir simulation. We have tested with different numbers of processes and compared the results with the default PETSc preconditioner, block Jacobi, which is a usual option in the oil industry.
Multilevel preconditioners have been an active research area for the last 30 years. One of the main representatives of this class is the algebraic multigrid (AMG) method [19, 22, 30, 31], when used as a preconditioner rather than as a solver. It is widely used in oil reservoir simulation as part of the CPR (constrained pressure residual) preconditioner [6, 33]. Despite its general acceptance, there is room for new alternatives, as there are problems where AMG can perform poorly; see, for instance, [13]. Among these alternatives we find the two-level preconditioners. There are many variations within this family of preconditioners, but basically we can discern at least two subfamilies: algebraic [1, 4, 12, 14, 18, 20, 23, 28, 32] and operator dependent [15, 16, 17]. Among the operator-dependent preconditioners we should highlight the spectral methods [21, 25].
Incomplete LU factorization, ILU(k) [26], has long been used as a preconditioner in reservoir simulation (an ingenious parallel implementation is discussed in [11]). Due to the difficulty of parallelizing ILU(k), it is quite natural to combine ILU(k) and block Jacobi, so much so that this combination constitutes PETSc's default parallel preconditioner [3]. The algorithm proposed in [2], whose parallel implementation we discuss in this article, seeks to combine the use of (sequential) ILU(k) with two ideas borrowed from domain decomposition methods: (i) the introduction of an interface that connects subdomains, allowing, as opposed to block Jacobi, for the interaction between subdomains to be taken into account, and (ii) the introduction of a second level, associated with a coarse version of the problem, that speeds up the resolution of low-frequency modes. These improvements come at the cost of greater communication, requiring a more involved parallel implementation.
The main contribution of the two-level preconditioner proposed in [2] is the fine preconditioner, as the coarse-level component is quite simple. Accordingly, the main contribution of this article is the development of a careful parallel implementation of that fine part; nonetheless, we also take care of the coarse part by proposing low-cost PETSc-based parallel code for its construction and application. Besides the parallel implementation, we present and discuss a set of performance tests of this preconditioner for solving synthetic and real-world oil reservoir simulation problems.
The paper is organized as follows. Section 2 introduces the notation used throughout the paper. Section 3 describes in detail the proposed two-level algebraic ILU(k)-based domain decomposition preconditioner, which we call iSchur. Its parallel implementation is detailed in Section 4, including the communication layout and strategies to minimize data transfer between processes, while results of performance experiments and comparisons with another preconditioner are presented and discussed in Section 5. Conclusions are drawn in Section 6, along with some directions for future work.

NOTATION
In this section, we introduce some notation that will be necessary to define iSchur.

Consider the linear system of algebraic equations

Ax = b,    (2.1)

arising from the discretization of a system of partial differential equations (PDEs) modeling multiphase flow in porous media by a finite difference scheme with n gridcells. We denote by n_dof the number of degrees of freedom (DOFs) per gridcell, so that A is a matrix of dimension (n_dof n) × (n_dof n). For instance, for the standard black-oil formulation, see [24], the DOFs are oil pressure, oil saturation and water saturation, so that n_dof = 3.
It will be convenient to think of A as a block matrix, A = (a_ij), where each block a_ij is a matrix of dimension n_dof × n_dof.
The block-sparsity pattern of A is determined by the stencil of the finite-difference scheme used to discretize the PDEs. Figure 1 depicts a two-dimensional domain discretized by a 3 × 3 mesh and the sparsity pattern of the associated matrix for the black-oil model, assuming a five-point finite-difference stencil. We say that two gridcells i and j are neighbors if at least one of the blocks a_ij or a_ji is not null. For the five-point stencil, gridcells are neighbors when they share an edge. We identify the gridcells with their indices, so that Ω denotes either the domain of the PDE or the index set {1, 2, ..., n}. We introduce a disjoint partition of Ω, i.e., we break Ω = {1, 2, ..., n} into P pieces Ω_1, Ω_2, ..., Ω_P, in such a way that each gridcell belongs to exactly one subdomain Ω_K; see Figure 2. We note that there are gridcells that, while belonging to one subdomain, have neighbors in other subdomains. Our domain decomposition approach requires the definition of disconnected subdomains that can be dealt with in parallel. To that end, we define a separator set, called the interface and denoted by Γ, and (disconnected) subdomain interiors Ω_J^Int. A gridcell j ∈ Ω_J is said to belong to Γ if it is a neighbor of at least one gridcell in another subdomain with greater index, i.e., in Ω_K with K > J (definition (2.3)). We then define the subdomain interior Ω_J^Int as the portion of Ω_J not in Γ, that is, Ω_J^Int = Ω_J \ Γ; see Figure 2. These definitions are just enough to ensure that if gridcell j is in Ω_J^Int and gridcell k is in Ω_K^Int, with J ≠ K, then they are not neighbors. Indeed, to fix the notation, assume J < K. Since j ∈ Ω_J^Int ⊂ Ω_J and k ∈ Ω_K^Int ⊂ Ω_K, if j and k were neighbors, j would be in Γ by definition (2.3) and therefore not in Ω_J^Int. We now define the local interface Γ_J associated with each subdomain Ω_J as the intersection of Γ and Ω_J, i.e., Γ_J = Γ ∩ Ω_J. Notice that the sets Γ_J, 1 ≤ J ≤ P, form a disjoint partition of Γ; see Figure 2.
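To make the separator definition concrete, the following sketch (plain Python, with hypothetical helper names; not part of the implementation described here) computes Γ from a cell-to-subdomain assignment and a neighbor relation:

```python
def interface_cells(neighbors, subdomain):
    """Return the interface Gamma: cells having a neighbor in a
    subdomain with a greater index, as in the definition above."""
    gamma = set()
    for j, nbrs in neighbors.items():
        if any(subdomain[k] > subdomain[j] for k in nbrs):
            gamma.add(j)
    return gamma

# 1-D chain of cells 0-1-2-3 split into subdomains {0, 1} and {2, 3}:
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
subdomain = {0: 1, 1: 1, 2: 2, 3: 2}
gamma = interface_cells(neighbors, subdomain)
# cell 1 (subdomain 1) touches cell 2 (subdomain 2), so Gamma = {1};
# the interiors {0} and {2, 3} are then not neighbors of each other.
```

Note that cell 2 is not in Γ: its only off-subdomain neighbor lies in a lower-indexed subdomain, which is exactly what keeps the local interfaces disjoint.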
Finally, we define extended subdomains Ω̄_J and extended local interfaces Γ̄_J, which incorporate the portions of the interface connected to the subdomain interior Ω_J^Int. We define Ω̄_J as the union of Ω_J^Int with the interface gridcells that are neighbors of gridcells in Ω_J^Int, and Γ̄_J as its restriction to Γ, i.e., Γ̄_J = Γ ∩ Ω̄_J; see Figure 2.
Notice that Γ_J ⊂ Γ̄_J ⊂ Γ. We point out that Γ̄_J is the result of augmenting Γ_J with the gridcells of the interface Γ that are neighbors of gridcells in Ω_J. We refer to Γ̄_J as an extended interface; see Figure 2.
If the equations/variables are reordered, starting with the ones corresponding to Ω_1^Int, followed by the other Ω_J^Int and finally by Γ, then A has the following block structure:

A = [ A_11                     A_1Γ ]
    [      A_22                A_2Γ ]
    [           ...             ... ]
    [                A_PP      A_PΓ ]
    [ A_Γ1 A_Γ2 ...  A_ΓP      A_ΓΓ ]    (2.7)

where the submatrices A_JJ contain the rows and columns of A associated with Ω_J^Int, A_ΓΓ the ones associated with Γ, A_JΓ the rows associated with Ω_J^Int and the columns associated with Γ, and A_ΓJ the rows associated with Γ and the columns associated with Ω_J^Int. Therefore, denoting by |S| the number of elements in a set S, the dimensions of A_JJ, A_ΓΓ, A_JΓ and A_ΓJ are, respectively, n_dof n_J × n_dof n_J, n_dof m × n_dof m, n_dof n_J × n_dof m, and n_dof m × n_dof n_J, where n_J = |Ω_J^Int| and m = |Γ|. It is important to notice that, since a_jk = 0 for any j ∈ Ω_J^Int and k ∈ Ω_K^Int with J ≠ K, the submatrices A_JK of rows associated with Ω_J^Int and columns associated with Ω_K^Int are all null (and therefore omitted in (2.7)). This block-diagonal structure of the leading portion of the matrix (which encompasses most of the variables/equations) allows for efficient parallelization of many tasks, and was the ultimate motivation for all the definitions above. We point out that although not required by the method, the implementation discussed in this work assumes that the matrix A is structurally symmetric; in this case the matrices A_JJ and A_ΓΓ are also structurally symmetric. Furthermore, A_JΓ and A_ΓJ^T have the same nonzero pattern.

DESCRIPTION OF THE TWO-LEVEL PRECONDITIONER
The goal of a preconditioner M is to replace the linear system (2.1) by an equivalent one, such as MAx = Mb (left preconditioning) or AMy = b with x = My (right preconditioning). A good preconditioner should render AM or MA much better conditioned than A (i.e., M should approximate, in some sense, A^{-1}) and, in order to be of practical interest, its construction and application must not be overly expensive.
We now present our preconditioner, which combines ideas from domain decomposition methods (DDM) and level-based incomplete LU factorization, ILU(k). In DDM, one tries to build an approximation to the action of A^{-1} based on (approximations of) inverses of smaller, local versions of A (the subdomain-interior submatrices A_JJ, in our case). In this work, the action of the local inverses is approximated by ILU(k_Int), while the other components required by the preconditioner (see equation (3.4)) are approximated with different levels (k_Bord, k_Prod or k_Γ). The motivation for this approach is that ILU(k) has been widely used with success as a preconditioner in reservoir simulation, and DDM has been shown to be an efficient and scalable technique for parallel architectures.
It is well established that in linear systems associated with parabolic problems, the high-frequency components of the error are damped quickly, while the low-frequency ones take many iterations to fade; see [31]. In reservoir simulation, the pressure unknown is of parabolic nature.
A two-level preconditioner tackles this problem by combining a "fine" preconditioner component M_F, as the one mentioned in the previous paragraph, with a coarse component M_C, the purpose of which is to annihilate the projection of the error onto a coarse space (associated with the low frequencies). If the two components are combined multiplicatively (see [29]), the resulting two-level preconditioner is

M = M_F (I − A M_C) + M_C.    (3.1)

In Subsection 3.1 we present an ILU(k)-based fine component M_F and in Subsection 3.2 we describe a simple coarse component.
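As a sanity check of the multiplicative combination (3.1), the dense numpy sketch below (illustrative only; the formula is assumed to match (3.1)) verifies that when the coarse component happens to be exact, the combined preconditioner is exact as well:

```python
import numpy as np

def combine_multiplicative(MF, MC, A):
    # M = MF (I - A MC) + MC: the coarse component removes the coarse
    # part of the error, the fine component acts on the remainder.
    n = A.shape[0]
    return MF @ (np.eye(n) - A @ MC) + MC

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
MF = np.diag(1.0 / np.diag(A))                   # crude fine component (Jacobi)
MC = np.linalg.inv(A)                            # hypothetical exact coarse part
M = combine_multiplicative(MF, MC, A)
# with MC = A^{-1}, the residual factor (I - A MC) vanishes and M = A^{-1}
```

In practice M_C only inverts A on a small coarse space, and M_F cleans up the remaining (high-frequency) error components.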

ILU(k) based Domain Decomposition
The fine component of the domain decomposition preconditioner is based on a block LU factorization of A (3.2), in which A_JJ = L_J U_J, C_J = L_J^{-1} A_JΓ, B_J = A_ΓJ U_J^{-1}, and

S = A_ΓΓ − Σ_{J=1..P} B_J C_J = A_ΓΓ − Σ_{J=1..P} A_ΓJ A_JJ^{-1} A_JΓ

is the Schur complement of A with respect to the interior points.
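The role of S can be illustrated with a small dense computation (numpy; exact inverses here, whereas the preconditioner only approximates them with ILU(k)): eliminating the interior unknowns reduces the system to one on the interface alone.

```python
import numpy as np

def schur_interface_solve(A, interiors, interface, b):
    """Solve A x = b for the interface unknowns only, via the Schur
    complement S = A_GG - sum_J A_GJ inv(A_JJ) A_JG (dense sketch)."""
    G = interface
    S = A[np.ix_(G, G)].copy()
    g = b[G].copy()
    for J in interiors:
        # exact local solves; the paper replaces these with ILU(k_Int)
        S -= A[np.ix_(G, J)] @ np.linalg.solve(A[np.ix_(J, J)], A[np.ix_(J, G)])
        g -= A[np.ix_(G, J)] @ np.linalg.solve(A[np.ix_(J, J)], b[J])
    return np.linalg.solve(S, g)

rng = np.random.default_rng(1)
n = 7
A = rng.standard_normal((n, n)) + n * np.eye(n)
interiors = [np.array([0, 1, 2]), np.array([3, 4])]  # two subdomain interiors
interface = np.array([5, 6])
A[np.ix_(interiors[0], interiors[1])] = 0.0          # interiors are disconnected
A[np.ix_(interiors[1], interiors[0])] = 0.0
b = rng.standard_normal(n)
x_gamma = schur_interface_solve(A, interiors, interface, b)
# x_gamma matches the interface entries of the full solution of A x = b
```

Because the interiors are mutually disconnected, each term of the sum involves only one subdomain, which is what makes the construction parallelizable.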
From the decomposition in (3.2) we obtain an explicit expression (3.3) for the inverse of A. We want to define a preconditioner M_F approximating the action of A^{-1} on a vector. Therefore, we need to define suitable approximations for L_J^{-1}, U_J^{-1}, B_J and C_J, J = 1, ..., P, and for S^{-1}. These approximations are denoted by L̃_J^{-1}, Ũ_J^{-1}, B̃_J, C̃_J, and S̃_F^{-1}, respectively. The fine preconditioner M_F is then defined by substituting these approximations into (3.3), yielding equation (3.4). In the remainder of this subsection, we describe precisely how these approximations are chosen.
First we define L̃_J and Ũ_J as the result of the incomplete LU factorization of A_JJ with level of fill k_Int, [L̃_J, Ũ_J] = ILU(A_JJ, k_Int). In our numerical experiments we used k_Int = 1, which is a usual choice in reservoir simulation. Even though L̃_J and Ũ_J are sparse, L̃_J^{-1} A_JΓ and Ũ_J^{-T} A_ΓJ^T (which would approximate C_J and B_J^T) are not. To ensure sparsity, C̃_J ≈ L̃_J^{-1} A_JΓ is defined as the result of an incomplete triangular solve with multiple right-hand sides of the system L̃_J C̃_J = A_JΓ, by extending the definition of level of fill as follows. Let v^l and w^l be the sparse vectors corresponding to the l-th columns of A_JΓ and C̃_J, respectively, so that L̃_J w^l = v^l. Based on the solution of triangular systems by forward substitution, the components of v^l and w^l are related by

w^l_k = v^l_k − Σ_{i<k} (L̃_J)_ki w^l_i.    (3.6)

The level of fill of component k of w^l is defined recursively as

Lev(w^l_k) = min{ Lev(v^l_k), min_{i<k} [ Lev(w^l_i) + Lev((L̃_J)_ki) + 1 ] },    (3.7)

where Lev(v^l_k) = 0 when v^l_k ≠ 0 and Lev(v^l_k) = ∞ otherwise, and Lev((L̃_J)_ki) is the level of fill of entry ki in the ILU(k_Int) decomposition of A_JJ when (L̃_J)_ki ≠ 0 and Lev((L̃_J)_ki) = ∞ otherwise. The approximation C̃_J to C_J = L_J^{-1} A_JΓ is then obtained by an incomplete forward substitution (IFS) with level of fill k_Bord, in which we discard any term with level of fill greater than k_Bord during the forward substitution process. We denote this by C̃_J = IFS(L̃_J, A_JΓ, k_Bord); the approximation B̃_J is defined analogously. Similarly, in order to define an approximation S̃ to S, we start by defining F_J = B̃_J C̃_J and a level of fill for the entries of F_J,

Lev((F_J)_kl) = min{ Lev((A_ΓΓ)_kl), min_{1≤i≤n_J} [ Lev((B̃_J)_ki) + Lev((C̃_J)_il) + 1 ] },    (3.8)

where n_J = |Ω_J^Int| is the number of columns of B̃_J and rows of C̃_J, Lev((A_ΓΓ)_kl) = 0 when (A_ΓΓ)_kl ≠ 0 and Lev((A_ΓΓ)_kl) = ∞ otherwise, Lev((C̃_J)_il) is the level of fill according to definition (3.7) when (C̃_J)_il ≠ 0 and Lev((C̃_J)_il) = ∞ otherwise, and Lev((B̃_J)_ki) is defined in the same way. Next, F̃_J is defined as the matrix obtained by keeping only the entries of F_J with level less than or equal to k_Prod according to (3.8). We refer to this incomplete product (IP) as F̃_J = IP(B̃_J, C̃_J, k_Prod).
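A small symbolic sketch (plain Python; the recursion is our reading of definition (3.7)) shows how levels propagate down a triangular solve, so that the threshold k_Bord truncates the fill:

```python
import math

def ifs_levels(L_lev, v_lev):
    """Level of fill of w in the forward solve L w = v:
    lev(w_k) = min(lev(v_k), min over i < k of lev(w_i) + lev(L_ki) + 1)."""
    n = len(v_lev)
    w_lev = []
    for k in range(n):
        best = v_lev[k]
        for i in range(k):
            best = min(best, w_lev[i] + L_lev[k][i] + 1)
        w_lev.append(best)
    return w_lev

INF = math.inf
# Bidiagonal L with level-0 entries below the diagonal; v nonzero only in row 0.
L_lev = [[INF] * 3 for _ in range(3)]
L_lev[1][0] = 0
L_lev[2][1] = 0
v_lev = [0, INF, INF]
levels = ifs_levels(L_lev, v_lev)
# fill cascades: lev(w_0) = 0, lev(w_1) = 1, lev(w_2) = 2;
# with k_Bord = 0 only w_0 would be kept.
```

The incomplete product level (3.8) composes in the same additive way, with the pattern of A_ΓΓ playing the role of the level-zero seed.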

S̃ is then defined as

S̃ = A_ΓΓ − Σ_{J=1..P} IP(B̃_J, C̃_J, k_Prod).    (3.9)

While S̃ approximates S, we still need to define an approximation for S^{-1}. Since S̃ is defined on the global interface Γ, it is not practical to perform ILU on it. Instead, we follow the approach employed in [7] and define for each subdomain a local version of S̃,

S̃_J = R_J S̃ R_J^T,    (3.10)

where R_J : Γ → Γ̄_J is a restriction operator such that S̃_J is the result of pruning S̃ so that only the rows and columns associated with Γ̄_J remain. Finally, our approximation S̃_F^{-1} to S^{-1} is defined as

S̃_F^{-1} = Σ_{J=1..P} T_J Ũ_{S_J}^{-1} L̃_{S_J}^{-1} R_J,    (3.11)

where L̃_{S_J} and Ũ_{S_J} are given by the ILU(k_Γ) factorization of S̃_J. Here T_J : Γ̄_J → Γ is an extension operator that takes the values of a vector defined on Γ̄_J, scales them by w_1^J, ..., w_{n_{Γ̄_J}}^J (which we call weights), and places them in the corresponding positions of a vector defined on Γ. Therefore, using the same notation as before, the k-th column of T_J is w_k^J e_{i_k}. Different choices of weights give rise to different options for T_J; our choice is one that avoids communication among processes: the weight is 1 for the entries in Γ_J and 0 for the remaining entries of Γ̄_J.
We note that the operators T_J, J = 1, ..., P, form a partition of unity. Our approach is based on the Restricted Additive Schwarz method, proposed in [5].
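The weighting in T_J can be sketched as follows (plain Python, hypothetical names): each process keeps only the entries it owns (those in Γ_J) and zeros out the borrowed part of Γ̄_J, so summing the extended vectors needs no averaging and hence no extra communication.

```python
def ras_extend(local, ext_idx, owned, m):
    """Extension T_J with RAS-style weights: 1 on owned indices
    (Gamma_J), 0 on the rest of the extended interface."""
    z = [0.0] * m
    for val, i in zip(local, ext_idx):
        if i in owned:
            z[i] = val
    return z

m = 4                                # |Gamma|
ext = {1: [0, 1, 2], 2: [1, 2, 3]}   # extended interfaces (they overlap)
owned = {1: {0, 1}, 2: {2, 3}}       # local interfaces: a disjoint partition
ones = {J: [1.0] * len(ext[J]) for J in ext}
total = [sum(col) for col in
         zip(*(ras_extend(ones[J], ext[J], owned[J], m) for J in ext))]
# extending all-ones vectors and summing yields all-ones: partition of unity
```

Because the owned sets {Γ_J} partition Γ, each global entry is written by exactly one process, which is the partition-of-unity property noted above.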

Coarse Space Correction
We define a coarse space spanned by the columns of an (n × P) matrix that we call R_0^T. The J-th column of R_0^T is associated with the extended subdomain Ω̄_J and its i-th entry is

(R_0^T)_iJ = 1/µ_i if i ∈ Ω̄_J, and 0 otherwise,

where µ_i counts how many extended subdomains the i-th node belongs to. Notice that (R_0^T)_iJ = 1 for all i ∈ Ω_J^Int, and that the columns of R_0^T form a partition of unity, in the sense that their sum is a vector with all entries equal to 1.
We define M_C by the formula

M_C = R_0^T (R_0 A R_0^T)^{-1} R_0.    (3.12)

Notice that this definition ensures that M_C A is a projection onto range(R_0^T), and that for A symmetric positive definite this projection is A-orthogonal. Since R_0 A R_0^T is small (P × P), we use exact LU (rather than ILU) when applying its inverse.
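A dense numpy sketch of the coarse correction (assuming the standard Galerkin form M_C = R_0^T (R_0 A R_0^T)^{-1} R_0 for (3.12)) verifies the projection property stated above:

```python
import numpy as np

def coarse_correction(A, R0T):
    # M_C = R0^T (R0 A R0^T)^{-1} R0; the small P x P coarse matrix
    # is factored exactly, as in the paper.
    R0 = R0T.T
    return R0T @ np.linalg.solve(R0 @ A @ R0T, R0)

rng = np.random.default_rng(2)
n, P = 8, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
R0T = np.zeros((n, P))
R0T[:5, 0] = 1.0
R0T[4:, 1] = 1.0
R0T[4, :] = 0.5        # node 4 belongs to both extended subdomains: mu = 2
MC = coarse_correction(A, R0T)
MCA = MC @ A
# MCA is a projection onto range(R0^T): MCA @ MCA == MCA, MCA @ R0T == R0T
```

The columns of R0T above sum to one in every row, mirroring the partition-of-unity property of the coarse basis.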
Summing up, the complete preconditioner has two components: one related to the complete grid, called M_F in equation (3.5), and another related to the coarse space (3.12). The two components can be combined additively, M = M_F + M_C (3.13); it is also possible to subtract a third term that improves the performance of the whole preconditioner, although increasing its computational cost. These formulations mean the components are applied additively (3.13) or multiplicatively (3.1), and the result can be interpreted as a two-level preconditioner, see [7]. In the following sections we refer to this two-level algebraic ILU(k)-based domain decomposition preconditioner as iSchur (for "incomplete Schur").

PARALLEL IMPLEMENTATION
In this section we discuss the parallel implementation of the preconditioner, considering data locality, communication among processes, and strategies to avoid communication.
We use a distributed-memory model, so the matrix A is distributed among the processes and, as we use PETSc, the rows corresponding to the elements of each subdomain, including interior and local interfaces, reside in the same process; see Figure 3 for a four-subdomain example.

Preconditioner Setup
Algorithm 1 describes the construction of the fine and coarse components of the preconditioner.
Step 1 of Algorithm 1 is the incomplete LU factorization. Because each subdomain is loaded entirely in a single process and there are no connections between subdomain interiors, this step can be done in parallel in each process without communication. We use a native, optimized PETSc routine that computes the incomplete LU factorization.

Algorithm 1: Preconditioner Setup
1: [L̃_J, Ũ_J] = ILU(A_JJ, k_Int), J = 1, ..., P;
2: C̃_J = IFS(L̃_J, A_JΓ, k_Bord), J = 1, ..., P;
3: B̃_J = IFS applied analogously to obtain B̃_J, J = 1, ..., P;
4: S̃ = A_ΓΓ − Σ_J IP(B̃_J, C̃_J, k_Prod);
5: S̃_J = R_J S̃ R_J^T, J = 1, ..., P;
6: [L̃_{S_J}, Ũ_{S_J}] = ILU(S̃_J, k_Γ), J = 1, ..., P;
7: form and factor R_0 A R_0^T with exact LU.

In Step 2, C̃_J = IFS(L̃_J, A_JΓ, 0) computes an approximation C̃_J ≈ L̃_J^{-1} A_JΓ through a level-zero incomplete triangular solve applied to the columns of A_JΓ. For each column a_i of A_JΓ, we take I_i, the set of indices of the nonzero elements of a_i. We then solve c_i(I_i) = L(I_i, I_i) \ a_i(I_i), which is a small dense triangular system, where c_i is the i-th column of C̃_J. In the accompanying figure, light gray rows and columns represent the indices I_i applied to the rows and columns of the matrix L̃_J, and the dark gray elements are the ones in L(I_i, I_i). We use a PETSc routine to extract the submatrix L(I_i, I_i) as a small dense matrix, use our own dense triangular solver to compute c_i(I_i), a small dense vector, and then use another PETSc routine to store it in the correct positions of c_i, a column of the sparse matrix C̃_J. We observe that the nonzero columns of A_JΓ taking part in these operations in each subdomain are related to the extended interface, but are stored locally.
Note that L̃_J and A_JΓ (see Equation (2.5)) reside in process J, as can be seen in Figure 3, which means that Step 2 is done locally without any communication.
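The gather/solve/scatter pattern of Step 2 can be sketched in a few lines of numpy (illustrative only; the real code uses PETSc routines for the gather and scatter):

```python
import numpy as np

def ifs_column(L, a):
    """Level-zero incomplete forward solve of L c = a for one column:
    restrict to the nonzero pattern I of a, solve the small dense
    triangular block L(I, I), and scatter the result back."""
    I = np.flatnonzero(a)
    c = np.zeros_like(a, dtype=float)
    c[I] = np.linalg.solve(L[np.ix_(I, I)], a[I])
    return c

L = np.array([[1.0,  0.0,  0.0, 0.0],
              [0.5,  1.0,  0.0, 0.0],
              [0.25, 0.5,  1.0, 0.0],
              [0.0,  0.25, 0.5, 1.0]])
a = np.array([2.0, 0.0, 4.0, 0.0])   # nonzeros at I = {0, 2}
c = ifs_column(L, a)
# c(I) solves the 2 x 2 system [[1, 0], [0.25, 1]] c(I) = [2, 4],
# so c = [2, 0, 3.5, 0]; fill outside I is simply discarded.
```

With a full-pattern right-hand side the same routine reduces to an exact forward solve, which is the k_Bord → ∞ limit.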
Step 3 also computes B̃_J through an incomplete forward substitution, using the same idea described for Step 2, but A_ΓJ is not stored entirely in process J. In fact, A_ΓJ has nonzero elements in the rows associated with the extended interface Γ̄_J, so the elements of A_ΓJ are distributed between process J and its neighbors. Hence each process has to communicate only with its own neighbors to compute B̃_J. The matrix B̃_J is stored locally in process J.
The communication layout of Step 3 is illustrated for four subdomains in Figure 5, where the arrows show the flow of data transfers. In this example, using, for instance, five-point centered finite differences, each subdomain has at most two neighbors; on other 2-D or 3-D grids and with other discretization schemes, each subdomain can have more neighbors.
Step 4 computes the Schur complement approximation through the incomplete product described by equation (3.9). The submatrix A_ΓΓ is associated with the subdomain interfaces and is therefore distributed among the processes. Each process computes its contribution IP(B̃_J, C̃_J, k_Prod) locally and subtracts it globally from A_ΓΓ to build S̃, which is distributed among the processes in the same way as A_ΓΓ. Again, each subdomain communicates with its neighbors following the pattern illustrated in Figure 5, but with the arrows reversed, and there is a synchronization barrier to form the global matrix S̃.
In Steps 5 and 6, each process takes a copy of the local part of S̃ relative to the extended interface Γ̄_J as in (3.10), communicating only with neighbors, and computes its incomplete LU factorization. The use of local copies avoids communication during the parallel application of S̃ throughout the linear solver iterations when using the RAS [5] version of the preconditioner. In Step 7, the matrix R_0 A R_0^T is initially formed as a parallel matrix distributed among the processes. Local copies of this matrix are then made in all processes and an LU factorization is performed redundantly. The order of this matrix is P, which is in general quite small.

Preconditioner Application
Algorithm 2 describes the application of the preconditioner M to a vector r, i.e., the computation of z = Mr. We define two additional restriction operators, R_{Ω_J} : Ω → Ω_J^Int and R_Γ : Ω → Γ. These restriction operators are needed to operate on the correct entries of a vector; in the implementation, each can be seen as a subset of indices.
Step 1 of Algorithm 2 solves, in each process, a triangular linear system involving the L̃_J factor and the local interior portion of the residual, obtaining z_J. This is done locally, and no communication is necessary because the residual r is distributed among the processes following the same parallel layout as the matrix.
Step 2 applies B̃_J, multiplying it by z_J and subtracting the result from r_Γ to obtain z_Γ. Communication between neighboring subdomains is necessary only for r_Γ and z_Γ, because B̃_J is stored locally in process J.
The Schur complement application is done in Step 3. Because each process has a local copy of its part of S̃ factored into L̃_S and Ũ_S, only communication of z_Γ between neighboring subdomains is necessary. As described above, T_J is a diagonal operator, so we can store it as a vector, which is applied through an element-wise product. The result is then summed globally into z_Γ. At this point there is a synchronization barrier among all processes.
In Step 4 we apply the Ũ_J factor to z_J − C̃_J z_Γ. The matrix-vector product C̃_J z_Γ requires communication of z_Γ among neighbors; the triangular solve is then done locally with no communication.

The vectors z_J and the vector z_Γ are gathered into z_F in Step 5, where z_F is the residual preconditioned by the fine component of the preconditioner.

Step 6 applies the coarse component, completing the computation of the preconditioned residual. The triangular solves involved are local because each process holds local copies of L_C and U_C, so no communication is necessary.
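With exact factors in place of the incomplete ones, Steps 1 through 5 reduce to an exact block solve; the numpy sketch below makes the data flow of the fine application explicit (single subdomain, dense, for illustration only):

```python
import numpy as np

def apply_fine(LJ, UJ, BJ, CJ, S, rJ, rG):
    zJ = np.linalg.solve(LJ, rJ)            # Step 1: local forward solve
    yG = rG - BJ @ zJ                       # Step 2: interface update
    zG = np.linalg.solve(S, yG)             # Step 3: Schur complement solve
    zJ = np.linalg.solve(UJ, zJ - CJ @ zG)  # Step 4: local backward solve
    return np.concatenate([zJ, zG])         # Step 5: gather z_F

rng = np.random.default_rng(3)
nJ, m = 5, 3
G = rng.standard_normal((nJ, nJ))
AJJ = G @ G.T + nJ * np.eye(nJ)             # SPD interior block (so Cholesky works)
AJG = rng.standard_normal((nJ, m))
AGJ = rng.standard_normal((m, nJ))
AGG = rng.standard_normal((m, m)) + m * np.eye(m)
LJ = np.linalg.cholesky(AJJ); UJ = LJ.T     # exact factors of A_JJ
BJ = AGJ @ np.linalg.inv(UJ)                # B_J = A_GJ U_J^{-1}
CJ = np.linalg.solve(LJ, AJG)               # C_J = L_J^{-1} A_JG
S = AGG - BJ @ CJ                           # exact Schur complement
A = np.block([[AJJ, AJG], [AGJ, AGG]])
r = rng.standard_normal(nJ + m)
z = apply_fine(LJ, UJ, BJ, CJ, S, rJ=r[:nJ], rG=r[nJ:])
# with exact factors this fine application returns A^{-1} r
```

In the actual preconditioner the factors are incomplete and the Schur solve in Step 3 is the RAS combination (3.11), so the result only approximates A^{-1} r.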

COMPARISON TESTS
We use two metrics to evaluate the preconditioners: the number of iterations and the solution time. In this section we describe the platform where the tests were run, present the results of the proposed preconditioner, and compare it with a native PETSc preconditioner.
We consider five matrices from reservoir simulation problems. They arise from linear systems generated by applying Newton's method to the nonlinear equations that discretize the system of PDEs modeling multiphase flow in porous media. Matrices F8, U26, U31 and Psalt were generated by a Petrobras reservoir simulation tool. The Psalt problem is a synthetic presalt reservoir model. The fifth matrix is the standard SPE10 model [10], with over a million active cells. All the problems are black-oil models using a fully implicit method, with four degrees of freedom per gridcell.
Table 1 shows the problems' sizes, considering the total number of cells and the dimensions of their respective matrices. For both preconditioners the problems were solved using the native PETSc right-preconditioned GMRES method [27], with relative residual tolerance of 1e-4, maximum number of iterations of 1000, and restarting at every 30th iteration. The mesh was partitioned using PT-Scotch [9]. iSchur uses fill-in level 1 in the interior of the subdomains and zero on the interfaces, namely k_Int = 1, k_Bord = k_Prod = k_Γ = 0, as this specific configuration presented the best behavior in preliminary Matlab tests.
We compare the proposed two-level preconditioner with PETSc's native block Jacobi preconditioner (see [3]), combined (as in equation (3.1)) with the same coarse component presented in this paper. We refer to this combination simply as block Jacobi. We note that block Jacobi/ILU is the default preconditioner in PETSc and a common choice in reservoir simulation, which makes it an appropriate benchmark.
The first set of tests was run on an Intel Xeon(R) 64-bit CPU E5-2650 v2 @ 2.60GHz with 16 cores (hyper-threading disabled), 20MB LLC, 64GB RAM (ECC enabled). The iSchur preconditioner, both in the setup and application phases, requires more computation and communication than block Jacobi, so it has to deliver a pronounced reduction in the number of iterations compared with block Jacobi in order to be competitive. Table 2 shows the number of iterations and the total time, in seconds, for both preconditioners, for all the tested problems, ranging from one to sixteen processes. For each test, the best results are highlighted in gray. We can observe that the iSchur preconditioner does reduce the number of iterations compared to block Jacobi for all problems on any number of processes. For one process (sequential execution), both preconditioners are identical, as there are no interface points, so we show those results just for the sake of measuring speedup and scalability. On average, iSchur performs 80% as many iterations as block Jacobi. In this set of experiments, iSchur was also more robust than block Jacobi, as the latter failed to converge in two experiments.
We also observe that, for some of the matrices, the number of iterations increases with the number of processes. This is to be expected, since block Jacobi neglects larger portions of the matrix as the size of the subdomains decreases. A related effect occurs with iSchur: since we use different levels of fill in the interface-related parts and in the interior parts, more information is neglected with more subdomains, as the number of interface points increases. Table 2 also shows the total time of the solution using both preconditioners. We can see that, despite the reduction in the number of iterations, the solver time of iSchur is smaller than block Jacobi's only in a few cases. One of the main reasons is that iSchur's setup time is greater than block Jacobi's, especially when running on few processes. Although iSchur's setup is intrinsically more expensive than block Jacobi's, we believe that this part of our code can still be further optimized to reduce this gap.
In reservoir simulation, the sparsity pattern of the Jacobian matrix only changes when the operating conditions of the wells change, and therefore it remains the same over the course of many nonlinear iterations. This is an opportunity to save computational effort in the setup phase, since in this case the symbolic factorization does not need to be redone. Table 3 shows the time comparison when the symbolic factorization is not considered; in this scenario iSchur is faster in most cases. Furthermore, block Jacobi is a highly optimized, mature code (the default PETSc preconditioner), while iSchur could likely be further optimized.
In order to examine the behavior of the preconditioners as the number of subdomains increases, we ran tests with up to 512 processes on a cluster consisting of nodes with 16GB RAM and two Intel Xeon E5440 quad-core processors each, totaling 8 CPU cores per node. The results are depicted in Figure 6. We see that iSchur consistently outperforms block Jacobi in terms of number of iterations. Nonetheless, the scalability behavior of the two preconditioners cannot be predicted from these experiments; tests with more MPI processes, on a larger cluster, are still required.

CONCLUSIONS AND REMARKS
The presented preconditioner successfully reduced the number of iterations for the reservoir problems compared to block Jacobi over the full range of partitions tested. The scalability of both preconditioners is ongoing research, as tests on a larger cluster are still necessary, but these initial results indicate a slight advantage for iSchur.
For the current implementation, the total solution time (setup plus iterative solve) of iSchur is smaller in only a few cases; one important factor is the inefficiency of the setup step. Block Jacobi in PETSc is highly optimized, while our implementation still has room for improvement.
Excluding the symbolic factorization time (considering the case when it is done once and reused many times, which is usual in reservoir simulation), the solution using the iSchur preconditioner is faster in most tested cases. This shows that the presented preconditioner can be a better alternative to block Jacobi as the fine component of two-level preconditioners for solving large-scale reservoir simulation problems. Furthermore, we believe that the application part of the code can still be optimized.

Figure 1 :
Figure 1: Discretization grid and associated matrix, assuming three unknowns per gridcell and a five-point stencil.

Figure 3 :
Figure 3: Subdomain and matrix distribution among processes.

Figure 6 :
Figure 6: Comparison between block Jacobi and iSchur up to 512 processes.

Table 1 :
Test case problems.
The operating system is a Debian GNU/Linux distribution running kernel version 3.2.65. The compiler is Intel Parallel Studio XE 2015.3.187 Cluster Edition.

Table 2 :
Number of iterations and the total time for both preconditioners from 1 to 16 processes.(NC) stands for "not converged".

Table 3 :
Time of iterative solver without the symbolic factorization.