
Parallel Implementation of a Two-level Algebraic ILU(k)-based Domain Decomposition Preconditioner

ABSTRACT

We discuss the parallel implementation of a two-level algebraic ILU(k)-based domain decomposition preconditioner using the PETSc library. We present strategies to improve performance and minimize communication among processes during setup and application phases. We compare our implementation with an off-the-shelf preconditioner in PETSc for solving linear systems arising in reservoir simulation problems, and show that for some cases our implementation performs better.

Keywords:
Two-level preconditioner; domain decomposition; Krylov methods; linear systems; parallelism; PETSc


1 INTRODUCTION

This paper discusses the formulation and the parallel implementation of an algebraic ILU(k)-based two-level domain decomposition preconditioner first introduced in [2].

In this work we present and discuss details of the implementation using the MPI-based PETSc suite [3], a set of data structures and routines for the parallel solution of scientific applications modeled by partial differential equations. We also present results of computational experiments involving matrices from oil reservoir simulation. We tested with different numbers of processes and compared the results with the default PETSc preconditioner, block Jacobi, which is a usual option in the oil industry.

Multilevel preconditioners have been an active research area for the last 30 years. One of the main representatives of this class is the algebraic multigrid (AMG) method [19], [22], [30], [31], when used as a preconditioner rather than as a solver. It is widely used in oil reservoir simulation as part of the CPR (constrained pressure residual) preconditioner [6], [33]. Despite its general acceptance there is room for new alternatives, as there are problems where AMG can perform poorly; see, for instance, [13]. Among these alternatives we find the two-level preconditioners. There are many variations within this family of preconditioners, but basically we can discern at least two subfamilies: algebraic [1], [4], [12], [14], [18], [20], [23], [28], [32] and operator dependent [15], [16], [17]. Within the operator-dependent preconditioners we should highlight the spectral methods [21], [25].

Incomplete LU factorization ILU(k) [26] has long been used as a preconditioner in reservoir simulation (an ingenious parallel implementation is discussed in [11]). Due to the difficulty in parallelizing ILU(k), it is quite natural to combine ILU(k) and block Jacobi, so much so that this combination constitutes PETSc's default parallel preconditioner [3]. The algorithm proposed in [2], whose parallel implementation we discuss in this article, seeks to combine the use of (sequential) ILU(k) with two ideas borrowed from domain decomposition methods: (i) the introduction of an interface that connects subdomains, allowing, as opposed to block Jacobi, for the interaction between subdomains to be taken into account, and (ii) the introduction of a second level, associated with a coarse version of the problem, that speeds up the resolution of low-frequency modes. These improvements come at the cost of greater communication, requiring a more involved parallel implementation.

The main contribution of the two-level preconditioner proposed in [2] is the fine preconditioner, as the coarse-level component is quite simple. Accordingly, the main contribution of our article is the development of a careful parallel implementation of that fine part; nonetheless, we also take care of the coarse part by proposing low-cost PETSc-based parallel codes for its construction and application. Besides the parallel implementation, we present and discuss a set of performance tests of this preconditioner for solving synthetic and real-world oil reservoir simulation problems.

The paper is organized as follows. Section 2 introduces the notation used throughout the paper. Section 3 describes in detail the proposed two-level algebraic ILU(k)-based domain decomposition preconditioner, which we call iSchur. Its parallel implementation is detailed in Section 4, including the communication layout and strategies to minimize data transfer between processes, while results of performance experiments and comparisons with another preconditioner are presented and discussed in Section 5. The conclusion is drawn in Section 6 along with some future work directions.

2 NOTATION

In this section, we introduce some notation that will be necessary to define iSchur.

Consider the linear system of algebraic equations

A x = b (2.1)

arising from the discretization of a system of partial differential equations (PDEs) modeling multiphase flow in porous media by a finite-difference scheme with n gridcells. We denote by $n_{\mathrm{dof}}$ the number of degrees of freedom (DOFs) per gridcell, so that A is a matrix of dimension $n_{\mathrm{dof}}\,n \times n_{\mathrm{dof}}\,n$. For instance, for the standard black-oil formulation, see [24], the DOFs are oil pressure, oil saturation and water saturation, so that $n_{\mathrm{dof}} = 3$.

It will be convenient to think of A as a block-matrix:

$$A=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn}\end{bmatrix},$$

where a block, denoted $a_{ij}$, is a matrix of dimension $n_{\mathrm{dof}} \times n_{\mathrm{dof}}$.

The block-sparsity pattern of A is determined by the stencil of the finite-difference scheme used for the discretization of the PDEs. Figure 1 depicts a bidimensional domain discretized by a 3×3 mesh and the sparsity pattern of the associated matrix for the black-oil model, assuming a five-point finite-difference stencil. We say that two gridcells i and j are neighbors if either one of the blocks $a_{ij}$ or $a_{ji}$ is not null. For the five-point stencil, gridcells are neighbors when they share an edge.

Figure 1:
Discretization grid and associated matrix, assuming three unknowns per gridcell and a five-point stencil.
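As an illustration of this structure, the short sketch below (our own toy example in Python with numpy/scipy, not code from the paper) assembles the block-sparsity pattern of Figure 1 for a 3×3 grid with a five-point stencil and $n_{\mathrm{dof}} = 3$.

```python
# Build the block-sparsity pattern of a 3x3 grid, five-point stencil, 3 DOFs per
# gridcell: A is 27x27 and every pair of neighboring cells contributes a dense
# 3x3 block (plus one diagonal block per cell).
import numpy as np
import scipy.sparse as sp

nx, ny, ndof = 3, 3, 3
cell = lambda i, j: i * ny + j                    # gridcell index in {0, ..., 8}

rows, cols = [], []
for i in range(nx):
    for j in range(ny):
        c = cell(i, j)
        nbrs = [c]                                # the diagonal block a_cc
        if i > 0:      nbrs.append(cell(i - 1, j))
        if i < nx - 1: nbrs.append(cell(i + 1, j))
        if j > 0:      nbrs.append(cell(i, j - 1))
        if j < ny - 1: nbrs.append(cell(i, j + 1))
        for nb in nbrs:                           # expand block a_{c,nb} into ndof x ndof entries
            for p in range(ndof):
                for q in range(ndof):
                    rows.append(c * ndof + p)
                    cols.append(nb * ndof + q)

pattern = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                        shape=(nx * ny * ndof, nx * ny * ndof)).tocsr()
print(pattern.nnz)                                # 9*9 + 24*9 = 297 nonzero entries
```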

We identify the gridcells with their respective indices, so that Ω denotes either the domain of the PDE or the set of indices {1, 2, …, n}. We introduce a disjoint partition of Ω, i.e., we break Ω = {1, 2, …, n} into P pieces $\Omega_1, \Omega_2, \ldots, \Omega_P$, in such a way that each gridcell belongs to exactly one subdomain $\Omega_J$, see Figure 2. More precisely, we define

$$\{\Omega_J\}_{1\le J\le P} \quad \text{s.t.} \quad \bigcup_{J=1}^{P}\Omega_J=\Omega \quad \text{and} \quad \Omega_I\cap\Omega_J=\emptyset \;\;\forall\, I\neq J. \qquad (2.2)$$

Figure 2:
2D Domain partitioned into 4 subdomains.

We note that there are gridcells that, while belonging to one subdomain, have neighbors in other subdomains. Our domain decomposition approach requires the definition of disconnected subdomains that can be dealt with in parallel. To that end, we define a separator set, called interface and denoted by Γ, and (disconnected) subdomain interiors $\Omega_J^{\mathrm{Int}}$. A gridcell j is said to belong to Γ if it is in a subdomain $\Omega_J$ while being a neighbor of at least one gridcell in another subdomain with greater index, i.e., $\Omega_K$ with K > J:

$$\Gamma=\{\, j\in\Omega \mid \exists\, k,J,K \ \text{s.t.}\ j\in\Omega_J,\ k\in\Omega_K,\ K>J,\ (a_{jk}\neq 0 \ \text{or}\ a_{kj}\neq 0)\,\}. \qquad (2.3)$$

We now define the subdomain interior $\Omega_J^{\mathrm{Int}}$ as the portion of $\Omega_J$ not in Γ:

$$\Omega_J^{\mathrm{Int}}=\Omega_J\setminus\Gamma, \qquad (2.4)$$

see Figure 2. These definitions are just enough to ensure that if gridcell j is in $\Omega_J^{\mathrm{Int}}$ and gridcell k is in $\Omega_K^{\mathrm{Int}}$, with $J \neq K$, then they are not neighbors.¹ Indeed, to fix the notation, assume $J < K$. Since $j \in \Omega_J^{\mathrm{Int}} \subset \Omega_J$ and $k \in \Omega_K^{\mathrm{Int}} \subset \Omega_K$, if j and k were neighbors, j would be in Γ by definition (2.3) and therefore not in $\Omega_J^{\mathrm{Int}}$.

We now define the local interface $\Gamma_J$ associated with each subdomain $\Omega_J$ as the intersection of Γ and $\Omega_J$, or equivalently,

$$\Gamma_J=\{\, j\in\Omega_J \mid \exists\,(K>J \ \text{and}\ k\in\Omega_K) \ \text{s.t.}\ (a_{jk}\neq 0\ \text{or}\ a_{kj}\neq 0)\,\}. \qquad (2.5)$$

Notice that $\{\Gamma_J\}_{1\le J\le P}$ form a disjoint partition of Γ. See Figure 2.

Finally, we define extended subdomains $\bar{\Omega}_J$ and extended local interfaces $\bar{\Gamma}_J$, which incorporate the portions of the interface connected to the subdomain interior $\Omega_J^{\mathrm{Int}}$. We define $\bar{\Omega}_J$ as

$$\bar{\Omega}_J=\Omega_J\cup\{\, k\in\Gamma \mid \exists\, j\in\Omega_J^{\mathrm{Int}} \ \text{s.t.}\ (a_{jk}\neq 0\ \text{or}\ a_{kj}\neq 0)\,\}, \qquad (2.6)$$

and $\bar{\Gamma}_J$ as its restriction to Γ, i.e., $\bar{\Gamma}_J=\Gamma\cap\bar{\Omega}_J$, see Figure 2.

Notice that $\Gamma_J\subset\bar{\Gamma}_J\subset\Gamma$. We point out that $\bar{\Gamma}_J$ is the result of augmenting $\Gamma_J$ with the gridcells of the interface Γ that are neighbors of gridcells in $\Omega_J^{\mathrm{Int}}$. We refer to $\bar{\Gamma}_J$ as an extended interface, see Figure 2.
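To make these set definitions concrete, the sketch below (our own illustration; the names `adj` and `part` are assumptions, and a structurally symmetric adjacency is assumed) computes Γ, the subdomain interiors, the local interfaces and the extended interfaces from a gridcell adjacency list and a partition.

```python
# Classify gridcells according to definitions (2.3)-(2.6).
# adj[j]  : set of gridcells that are neighbors of gridcell j (symmetric)
# part[j] : index in {0, ..., P-1} of the subdomain that owns gridcell j
def classify(adj, part, P):
    cells = range(len(part))
    # Gamma: cells with a neighbor in a subdomain of greater index, as in (2.3)
    Gamma = {j for j in cells if any(part[k] > part[j] for k in adj[j])}
    interior = [set() for _ in range(P)]        # Omega_J^Int = Omega_J \ Gamma
    local_if = [set() for _ in range(P)]        # Gamma_J = Gamma intersected with Omega_J
    for j in cells:
        (local_if if j in Gamma else interior)[part[j]].add(j)
    # extended interface: Gamma_J plus interface cells neighboring Omega_J^Int
    ext_if = [set(local_if[J]) for J in range(P)]
    for J in range(P):
        for j in interior[J]:
            ext_if[J].update(k for k in adj[j] if k in Gamma)
    return Gamma, interior, local_if, ext_if
```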

If the equations/variables are reordered, starting with the ones corresponding to $\Omega_1^{\mathrm{Int}}$, followed by the other $\Omega_J^{\mathrm{Int}}$ and finally by Γ, then A has the following block structure:

$$A=\begin{bmatrix} A_{11} & & & A_{1\Gamma}\\ & \ddots & & \vdots\\ & & A_{PP} & A_{P\Gamma}\\ A_{\Gamma 1} & \cdots & A_{\Gamma P} & A_{\Gamma\Gamma}\end{bmatrix}, \qquad (2.7)$$

where the submatrices $A_{JJ}$ contain the rows and columns of A associated with $\Omega_J^{\mathrm{Int}}$, $A_{\Gamma\Gamma}$ the ones associated with Γ, $A_{J\Gamma}$ the rows associated with $\Omega_J^{\mathrm{Int}}$ and the columns associated with Γ, and $A_{\Gamma J}$ the rows associated with Γ and the columns associated with $\Omega_J^{\mathrm{Int}}$. Therefore, denoting by |S| the number of elements in a set S, the dimensions of $A_{JJ}$, $A_{\Gamma\Gamma}$, $A_{J\Gamma}$ and $A_{\Gamma J}$ are, respectively, $n_{\mathrm{dof}} n_J \times n_{\mathrm{dof}} n_J$, $n_{\mathrm{dof}} m \times n_{\mathrm{dof}} m$, $n_{\mathrm{dof}} n_J \times n_{\mathrm{dof}} m$, and $n_{\mathrm{dof}} m \times n_{\mathrm{dof}} n_J$, where $n_J=|\Omega_J^{\mathrm{Int}}|$ and $m=|\Gamma|$. It is important to notice that, since $a_{jk}=0$ for any $j\in\Omega_J^{\mathrm{Int}}$ and $k\in\Omega_K^{\mathrm{Int}}$ with $J\neq K$, the submatrices $A_{JK}$ of rows associated with $\Omega_J^{\mathrm{Int}}$ and columns associated with $\Omega_K^{\mathrm{Int}}$ are all null (and therefore omitted in (2.7)). This block-diagonal structure of the leading portion of the matrix (which encompasses most of the variables/equations) allows for efficient parallelization of many tasks, and was the ultimate motivation for the definitions above. We point out that, although not required by the method, the implementation discussed in this work assumes that the matrix A is structurally symmetric²; in this case the matrices $A_{JJ}$ and $A_{\Gamma\Gamma}$ are also structurally symmetric. Furthermore, $A_{J\Gamma}$ and $A_{\Gamma J}^T$ have the same nonzero pattern.

3 DESCRIPTION OF THE TWO-LEVEL PRECONDITIONER

The goal of a preconditioner is to replace the linear system (2.1) by one of the equivalent ones:

$$(MA)\,x=Mb \ \ \text{(left preconditioned)} \qquad \text{or} \qquad (AM)\,y=b \ \ \text{(right preconditioned, where } x=My\text{)}.$$

A good preconditioner should render AM or MA much better conditioned than A (i.e., M should approximate, in some sense, $A^{-1}$) and, in order to be of practical interest, its construction and application must not be overly expensive.

We now present our preconditioner, which combines ideas from domain decomposition methods (DDM) and level-based incomplete LU factorization, ILU(k). In DDM, one tries to build an approximation to the action of $A^{-1}$ based on (approximations of) inverses of smaller, local versions of A (the subdomain-interior submatrices $A_{JJ}$, in our case). In this work, the action of the local inverses is approximated by ILU($k_{\mathrm{Int}}$), while the other components required by the preconditioner (see equation (3.4)) are approximated with different levels ($k_{\mathrm{Bord}}$, $k_{\mathrm{Prod}}$ or $k_\Gamma$). The motivation for this approach is that ILU(k) has been widely used with success as a preconditioner in reservoir simulation, and DDM has been shown to be an efficient and scalable technique for parallel architectures.

It is well established that in linear systems associated with parabolic problems, the high-frequency components of the error are damped quickly, while the low-frequency ones take many iterations to fade, see [31]. In reservoir simulation, the pressure unknown is of parabolic nature. A two-level preconditioner tackles this problem by combining a "fine" preconditioner component $M_F$, like the one mentioned in the previous paragraph, with a coarse component $M_C$, the purpose of which is to annihilate the projection of the error onto a coarse space (associated with the low frequencies). If the two components are combined multiplicatively (see [29]), the resulting two-level preconditioner is

$$M = M_F + M_C - M_F A M_C. \qquad (3.1)$$
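For concreteness, applying (3.1) to a residual vector r can be written as below; `apply_MF` and `apply_MC` are placeholder callables standing for the components defined in the next two subsections (this is an illustrative sketch, not the implementation of Section 4).

```python
# z = M r with the multiplicative combination M = M_F + M_C - M_F A M_C of (3.1).
def apply_two_level(A, apply_MF, apply_MC, r):
    z_c = apply_MC(r)                      # coarse correction  M_C r
    return apply_MF(r) + z_c - apply_MF(A @ z_c)
```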

In Subsection 3.1 we present an ILU(k)-based fine component $M_F$ and in Subsection 3.2 we describe a simple coarse component.

3.1 ILU(k) based Domain Decomposition

The fine component of the domain decomposition preconditioner is based on the following block LU factorization of A:

$$A=LU=\begin{bmatrix} L_1 & & & \\ & \ddots & & \\ & & L_P & \\ B_1 & \cdots & B_P & I\end{bmatrix}\begin{bmatrix} U_1 & & & C_1\\ & \ddots & & \vdots\\ & & U_P & C_P\\ & & & S\end{bmatrix}, \qquad (3.2)$$

where $A_{JJ}=L_J U_J$ is the LU factorization of $A_{JJ}$, $B_J=A_{\Gamma J}U_J^{-1}$, $C_J=L_J^{-1}A_{J\Gamma}$, and

$$S=A_{\Gamma\Gamma}-\sum_{J=1}^{P}A_{\Gamma J}A_{JJ}^{-1}A_{J\Gamma}=A_{\Gamma\Gamma}-\sum_{J=1}^{P}B_J C_J, \qquad (3.3)$$

is the Schur complement of A with respect to the interior points.
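The following dense numpy check, a toy example of our own with P = 2 and small diagonally dominant blocks, verifies that the factors of (3.2), built with the Schur complement (3.3), reproduce A.

```python
import numpy as np

def lu_nopivot(M):
    # Doolittle LU without pivoting; adequate for the diagonally dominant blocks below
    n = M.shape[0]
    L, U = np.eye(n), M.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return L, U

rng = np.random.default_rng(0)
n1, n2, m = 4, 5, 3
A11 = 5 * np.eye(n1) + rng.random((n1, n1)); A22 = 5 * np.eye(n2) + rng.random((n2, n2))
A1G, A2G = rng.random((n1, m)), rng.random((n2, m))
AG1, AG2 = rng.random((m, n1)), rng.random((m, n2))
AGG = 5 * np.eye(m) + rng.random((m, m))
Z12, Z1G, Z21, Z2G = (np.zeros(s) for s in [(n1, n2), (n1, m), (n2, n1), (n2, m)])
A = np.block([[A11, Z12, A1G], [Z21, A22, A2G], [AG1, AG2, AGG]])   # structure (2.7)

L1, U1 = lu_nopivot(A11); L2, U2 = lu_nopivot(A22)
B1, B2 = AG1 @ np.linalg.inv(U1), AG2 @ np.linalg.inv(U2)   # B_J = A_{Gamma J} U_J^{-1}
C1, C2 = np.linalg.solve(L1, A1G), np.linalg.solve(L2, A2G) # C_J = L_J^{-1} A_{J Gamma}
S = AGG - B1 @ C1 - B2 @ C2                                 # Schur complement (3.3)

L = np.block([[L1, Z12, Z1G], [Z21, L2, Z2G], [B1, B2, np.eye(m)]])
U = np.block([[U1, Z12, C1], [Z21, U2, C2], [Z1G.T, Z2G.T, S]])
print(np.allclose(L @ U, A))                                # True: A = LU as in (3.2)
```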

From the decomposition in (3.2), we can show that the inverse of A is

$$A^{-1}=\begin{bmatrix} U_1^{-1} & & & \\ & \ddots & & \\ & & U_P^{-1} & \\ & & & I\end{bmatrix}\begin{bmatrix} I & & & -C_1\\ & \ddots & & \vdots\\ & & I & -C_P\\ & & & I\end{bmatrix}\begin{bmatrix} I & & & \\ & \ddots & & \\ & & I & \\ & & & S^{-1}\end{bmatrix}\begin{bmatrix} I & & & \\ & \ddots & & \\ & & I & \\ -B_1 & \cdots & -B_P & I\end{bmatrix}\begin{bmatrix} L_1^{-1} & & & \\ & \ddots & & \\ & & L_P^{-1} & \\ & & & I\end{bmatrix}. \qquad (3.4)$$

We want to define a preconditioner $M_F$ approximating the action of $A^{-1}$ on a vector. Therefore, we need to define suitable approximations for $L_J^{-1}$, $U_J^{-1}$, $B_J$ and $C_J$, J=1,…,P, and for $S^{-1}$. These approximations are denoted by $\tilde{L}_J^{-1}$, $\tilde{U}_J^{-1}$, $\tilde{B}_J$, $\tilde{C}_J$, and $S_F^{-1}$, respectively. The fine preconditioner is then defined as

$$M_F=\begin{bmatrix} \tilde{U}_1^{-1} & & & \\ & \ddots & & \\ & & \tilde{U}_P^{-1} & \\ & & & I\end{bmatrix}\begin{bmatrix} I & & & -\tilde{C}_1\\ & \ddots & & \vdots\\ & & I & -\tilde{C}_P\\ & & & I\end{bmatrix}\begin{bmatrix} I & & & \\ & \ddots & & \\ & & I & \\ & & & S_F^{-1}\end{bmatrix}\begin{bmatrix} I & & & \\ & \ddots & & \\ & & I & \\ -\tilde{B}_1 & \cdots & -\tilde{B}_P & I\end{bmatrix}\begin{bmatrix} \tilde{L}_1^{-1} & & & \\ & \ddots & & \\ & & \tilde{L}_P^{-1} & \\ & & & I\end{bmatrix}. \qquad (3.5)$$

In the remainder of this subsection, we describe precisely how these approximations are chosen.

First we define $\tilde{L}_J$ and $\tilde{U}_J$ as the result of the incomplete LU factorization of $A_{JJ}$ with level of fill $k_{\mathrm{Int}}$, $[\tilde{L}_J,\tilde{U}_J]=\mathrm{ILU}(A_{JJ},k_{\mathrm{Int}})$. In our numerical experiments, we used $k_{\mathrm{Int}}=1$, which is a usual choice in reservoir simulation. Even though $\tilde{L}_J$ and $\tilde{U}_J$ are sparse, $\tilde{L}_J^{-1}A_{J\Gamma}$ and $\tilde{U}_J^{-T}A_{\Gamma J}^T$ (which would approximate $C_J$ and $B_J^T$) are not. To ensure sparsity, $\tilde{C}_J\approx\tilde{L}_J^{-1}A_{J\Gamma}$ is defined as the result of the incomplete triangular solve with multiple right-hand sides of the system $\tilde{L}_J\tilde{C}_J=A_{J\Gamma}$, obtained by extending the definition of level of fill as follows. Let $v_l$ and $w_l$ be the sparse vectors corresponding to the l-th columns of $A_{J\Gamma}$ and $\tilde{C}_J$, respectively, so that $\tilde{L}_J w_l=v_l$. Based on the solution of triangular systems by forward substitution, the components of $v_l$ and $w_l$ are related by

$$(w_l)_k=(v_l)_k-\sum_{i=1}^{k-1}(\tilde{L}_J)_{ki}\,(w_l)_i. \qquad (3.6)$$

The level of fill-in of component k of $w_l$ is defined recursively as

$$\mathrm{Lev}\big((w_l)_k\big)=\min\Big\{\mathrm{Lev}\big((v_l)_k\big),\ \min_{1\le i\le k-1}\big\{\mathrm{Lev}\big((\tilde{L}_J)_{ki}\big)+\mathrm{Lev}\big((w_l)_i\big)+1\big\}\Big\}, \qquad (3.7)$$

where $\mathrm{Lev}((v_l)_k)=0$ when $(v_l)_k\neq 0$ and $\mathrm{Lev}((v_l)_k)=\infty$ otherwise, and $\mathrm{Lev}((\tilde{L}_J)_{ki})$ is the level of fill of entry ki in the ILU($k_{\mathrm{Int}}$) decomposition of $A_{JJ}$ when $(\tilde{L}_J)_{ki}\neq 0$ and $\mathrm{Lev}((\tilde{L}_J)_{ki})=\infty$ otherwise. The approximation $\tilde{C}_J$ to $C_J=L_J^{-1}A_{J\Gamma}$ is then obtained by an incomplete forward substitution (IFS) with level of fill $k_{\mathrm{Bord}}$, in which we do not compute any terms with level of fill greater than $k_{\mathrm{Bord}}$ during the forward substitution process. We denote $\tilde{C}_J=\mathrm{IFS}(\tilde{L}_J,A_{J\Gamma},k_{\mathrm{Bord}})$. Notice that when $k_{\mathrm{Bord}}=k_{\mathrm{Int}}$, $\tilde{C}_J$ is what would result from a standard partial incomplete factorization of A. The approximation $\tilde{B}_J\approx A_{\Gamma J}\tilde{U}_J^{-1}$ to $B_J$ is defined analogously.
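A possible realization of this incomplete forward substitution for a single column, following our reading of (3.6)-(3.7) (the names below are ours, not the paper's), is sketched next; entries and their levels are kept in dictionaries.

```python
import math

def ifs_column(L, lev_L, v, k_bord):
    """Incomplete forward substitution for one sparse column.
    L      : indexable lower-triangular factor, entry L[k][i]
    lev_L  : dict {(k, i): level} with the ILU(k_Int) levels of L; absent means infinity
    v      : dict {k: value} with the nonzeros of the right-hand-side column
    returns: dict {k: value} keeping only components with level <= k_bord"""
    n = len(L)
    w, lev_w = {}, {}
    for k in range(n):
        # level of component k, following (3.7)
        lev = 0 if k in v else math.inf
        for i in w:
            if i < k and (k, i) in lev_L:
                lev = min(lev, lev_L[(k, i)] + lev_w[i] + 1)
        if lev > k_bord:
            continue                      # component dropped: level of fill too high
        # value of component k, following (3.6), using only the kept entries of w
        val = v.get(k, 0.0)
        for i, wi in w.items():
            if i < k and (k, i) in lev_L:
                val -= L[k][i] * wi
        w[k], lev_w[k] = val, lev
    return w
```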

Similarly, in order to define an approximation $\tilde{S}$ to S, we start by defining $F_J=\tilde{B}_J\tilde{C}_J$ and a level of fill for the entries of $F_J$,

$$\mathrm{Lev}\big((F_J)_{kl}\big)=\min\Big\{\mathrm{Lev}\big((A_{\Gamma\Gamma})_{kl}\big),\ \min_{1\le i\le m}\big\{\mathrm{Lev}\big((\tilde{B}_J)_{ki}\big)+\mathrm{Lev}\big((\tilde{C}_J)_{il}\big)+1\big\}\Big\}, \qquad (3.8)$$

where $m=\#\Omega_J^{\mathrm{Int}}$ is the number of columns in $\tilde{B}_J$ and rows in $\tilde{C}_J$, $\mathrm{Lev}((A_{\Gamma\Gamma})_{kl})=0$ when $(A_{\Gamma\Gamma})_{kl}\neq 0$ and $\mathrm{Lev}((A_{\Gamma\Gamma})_{kl})=\infty$ otherwise, and $\mathrm{Lev}((\tilde{C}_J)_{il})$ is the level of fill according to definition (3.7) when $(\tilde{C}_J)_{il}\neq 0$ and $\mathrm{Lev}((\tilde{C}_J)_{il})=\infty$ otherwise ($\mathrm{Lev}((\tilde{B}_J)_{ki})$ is defined in the same way). Next, $\tilde{F}_J$ is defined as the matrix obtained by keeping only the entries of $F_J$ with level less than or equal to $k_{\mathrm{Prod}}$ according to (3.8). We refer to this incomplete product (IP) as $\tilde{F}_J=\mathrm{IP}(\tilde{B}_J,\tilde{C}_J,k_{\mathrm{Prod}})$.

$\tilde{S}$ is then defined as

$$\tilde{S}=A_{\Gamma\Gamma}-\sum_{J=1}^{P}\tilde{F}_J. \qquad (3.9)$$
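A companion sketch of the incomplete product, again our interpretation of (3.8) rather than the authors' code, computes $F_J=\tilde{B}_J\tilde{C}_J$ sparsely and keeps an entry only when its level does not exceed $k_{\mathrm{Prod}}$; the kept matrices are then subtracted from $A_{\Gamma\Gamma}$ as in (3.9).

```python
import math

def incomplete_product(B, lev_B, C, lev_C, lev_AGG, k_prod):
    """B, C: dicts {(row, col): value}; lev_*: dicts {(row, col): level}, absent = infinity.
    Returns the entries of F = B @ C whose level per (3.8) is <= k_prod."""
    C_by_row = {}
    for (i, l), c in C.items():
        C_by_row.setdefault(i, []).append((l, c))
    F, lev_F = {}, {}
    for (k, i), b in B.items():
        for l, c in C_by_row.get(i, []):
            F[(k, l)] = F.get((k, l), 0.0) + b * c
            lev = min(lev_AGG.get((k, l), math.inf),
                      lev_B[(k, i)] + lev_C[(i, l)] + 1)
            lev_F[(k, l)] = min(lev_F.get((k, l), math.inf), lev)
    return {kl: val for kl, val in F.items() if lev_F[kl] <= k_prod}
```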

While $\tilde{S}$ approximates S, we still need to define an approximation for $S^{-1}$. Since $\tilde{S}$ is defined on the global interface Γ, it is not practical to perform ILU on it. Instead, we follow the approach employed in [7] and define for each subdomain a local version of $\tilde{S}$,

$$\tilde{S}_J=R_J\tilde{S}R_J^T, \qquad (3.10)$$

where $R_J:\Gamma\to\bar{\Gamma}_J$ is a restriction operator such that $\tilde{S}_J$ is the result of pruning $\tilde{S}$ so that only the rows and columns associated with $\bar{\Gamma}_J$ remain. More precisely, if $\{i_1,i_2,\ldots,i_{n_{\bar{\Gamma}_J}}\}$ is a list of the nodes in Γ that belong to $\bar{\Gamma}_J$, then the k-th row of $R_J$ is $e_{i_k}^T$, the $i_k$-th row of the $n_\Gamma\times n_\Gamma$ identity matrix,

$$R_J=\begin{bmatrix} e_{i_1}^T\\ \vdots\\ e_{i_{n_{\bar{\Gamma}_J}}}^T\end{bmatrix}.$$

Finally, our approximation $S_F^{-1}$ to $S^{-1}$ is defined as

$$S_F^{-1}=\sum_{J=1}^{P}T_J\left(L_{\tilde{S}_J}U_{\tilde{S}_J}\right)^{-1}R_J\approx\sum_{J=1}^{P}T_J\tilde{S}_J^{-1}R_J, \qquad (3.11)$$

where $L_{\tilde{S}_J}$ and $U_{\tilde{S}_J}$ are given by ILU($k_\Gamma$) of $\tilde{S}_J$. Here $T_J:\bar{\Gamma}_J\to\Gamma$ is an extension operator that takes the values of a vector defined on $\bar{\Gamma}_J$, scales them by $w_1^J,\ldots,w_{n_{\bar{\Gamma}_J}}^J$ (which we call weights), and places them in the corresponding positions of a vector defined on Γ. Therefore, using the same notation as before, the k-th column of $T_J$ is $w_k^J e_{i_k}$,

$$T_J=\begin{bmatrix} w_1^J e_{i_1} & \cdots & w_{n_{\bar{\Gamma}_J}}^J e_{i_{n_{\bar{\Gamma}_J}}}\end{bmatrix}.$$

Different choices of weights give rise to different options for $T_J$; our choice avoids communication among processes and is defined as follows:

$$w_k^J=\begin{cases}1, & \text{if the } i_k\text{-th node of } \Gamma \text{ belongs to } \Gamma_J,\\ 0, & \text{if the } i_k\text{-th node of } \Gamma \text{ belongs to } \bar{\Gamma}_J\setminus\Gamma_J.\end{cases}$$

We note that the $T_J$, J=1,...,P, form a partition of unity. Our approach is based on the Restricted Additive Schwarz (RAS) method proposed in [5].
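In the implementation, $R_J$ and $T_J$ need not be stored as matrices; the sketch below (our illustration, assuming the interface index sets of Section 2 are available) realizes $R_J$ as a gather and $T_J$ as a weighted scatter, so that applying $S_F^{-1}$ amounts to summing `T(local_solve(R(z_gamma)))` over the subdomains, where `local_solve` stands for the local ILU($k_\Gamma$) solve with $\tilde{S}_J$.

```python
import numpy as np

def make_R_T(ext_if_J, local_if_J, n_gamma):
    """ext_if_J: global interface indices of the nodes in Gamma_bar_J;
    local_if_J: the subset belonging to Gamma_J; n_gamma: size of Gamma."""
    idx = np.array(sorted(ext_if_J))                   # i_1, ..., i_{n_{Gamma_bar_J}}
    w = np.array([1.0 if i in local_if_J else 0.0 for i in idx])   # RAS weights
    R = lambda z_gamma: z_gamma[idx]                   # R_J z: gather
    def T(y):                                          # T_J y: scale and scatter
        z = np.zeros(n_gamma)
        z[idx] = w * y
        return z
    return R, T
```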

3.2 Coarse Space Correction

We define a coarse space spanned by the columns of an $(n\times P)$ matrix that we call $R_0^T$. The J-th column of $R_0^T$ is associated with the extended subdomain $\bar{\Omega}_J$ and its i-th entry is

$$(R_0^T)_{iJ}=\begin{cases}0, & \text{if node } i \text{ is not in } \bar{\Omega}_J,\\ 1/\mu_{\Omega_i}, & \text{if node } i \text{ is in } \bar{\Omega}_J,\end{cases}$$

where the i-th entry of $\mu_\Omega$, denoted $\mu_{\Omega_i}$, counts how many extended subdomains the i-th node belongs to. Notice that $(R_0^T)_{iJ}=1$ for all $i\in\Omega_J^{\mathrm{Int}}$, and that the columns of $R_0^T$ form a partition of unity, in the sense that their sum is a vector with all entries equal to 1.

We define $M_C$ by the formula

$$M_C=R_0^T\left(R_0 A R_0^T\right)^{-1}R_0. \qquad (3.12)$$

Notice that this definition ensures that $M_C A$ is a projection onto range($R_0^T$), and for A symmetric positive definite this projection is A-orthogonal. Since $R_0 A R_0^T$ is small ($P\times P$), we use exact LU (rather than ILU) when applying its inverse.
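A serial, dense sketch of the coarse component follows (our own simplification: in the parallel code of Section 4, $R_0 A R_0^T$ is assembled in parallel and its LU factorization is held redundantly by every process).

```python
import numpy as np

def build_R0T(ext_domains, n):
    """ext_domains[J]: set of node indices of the extended subdomain Omega_bar_J."""
    P = len(ext_domains)
    R0T = np.zeros((n, P))
    mu = np.zeros(n)                       # mu_Omega_i: number of extended subdomains containing i
    for J, dom in enumerate(ext_domains):
        for i in dom:
            R0T[i, J] = 1.0
            mu[i] += 1.0
    return R0T / mu[:, None]               # rows sum to 1: partition of unity

def apply_MC(A, R0T, r):
    Ac = R0T.T @ A @ R0T                   # P x P coarse operator R_0 A R_0^T
    return R0T @ np.linalg.solve(Ac, R0T.T @ r)   # M_C r, equation (3.12)
```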

Summing up, the complete preconditioner has two components: one related to the complete grid, called $M_F$ in equation (3.5), and another related to the coarse space (3.12). It is possible to subtract a third term that improves the performance of the whole preconditioner, although it increases the computational cost. The combined preconditioners are written as

$$M = M_F + M_C \qquad (3.13)$$

or

$$M = M_F + M_C - M_F A M_C. \qquad (3.14)$$

This formulation implies that the preconditioners will be applied additively (3.13) or multiplicatively (3.1), and they can be interpreted as having two levels, see [7]. In the following sections we refer to this two-level algebraic ILU(k)-based domain decomposition preconditioner as iSchur (for "incomplete Schur").

4 PARALLEL IMPLEMENTATION

In this section we discuss the parallel implementation of the preconditioner, considering data locality, communication among processes, and strategies to avoid communication.

We use a distributed-memory model, so the matrix A is distributed among the processes. As we use PETSc, the rows corresponding to the elements of each subdomain, including its interior and local interface, reside in the same process; see Figure 3 for a four-subdomain representation.

Figure 3:
Subdomain and matrix distribution among processes.

4.1 Preconditioner Setup

Algorithm 1 describes the construction of the fine and coarse components of the preconditioner.

Algorithm 1:
Preconditioner Setup

Step 1 of Algorithm 1 is the incomplete LU factorization. Because each subdomain is completely loaded in only one process and there are no connections between subdomain interiors, this step can be done in parallel in each process without communication. We use a native, optimized PETSc routine that computes the incomplete LU factorization.

In Step 2, $\tilde{C}_J=\mathrm{IFS}(\tilde{L}_J,A_{J\Gamma},0)$ computes an approximation $\tilde{C}_J\approx\tilde{L}_J^{-1}A_{J\Gamma}$ through a level-zero incomplete triangular solve applied to the columns of $A_{J\Gamma}$. For each column $a_i$ of $A_{J\Gamma}$, we take $\mathcal{I}_i$, the set of indices of the nonzero elements of $a_i$. Then we solve $c_i(\mathcal{I}_i)=\tilde{L}_J(\mathcal{I}_i,\mathcal{I}_i)\backslash a_i(\mathcal{I}_i)$, which is a small dense triangular system, where $c_i$ is the i-th column of $\tilde{C}_J$. See Figure 4. The light gray rows and columns represent the $\mathcal{I}_i$ indices applied to the rows and columns of the matrix $\tilde{L}_J$, and the dark gray elements are the ones in $\tilde{L}_J(\mathcal{I}_i,\mathcal{I}_i)$. We use a PETSc routine to extract the submatrix $\tilde{L}_J(\mathcal{I}_i,\mathcal{I}_i)$ as a small dense matrix, use our own dense triangular solver to compute $c_i(\mathcal{I}_i)$, which is a small dense vector, and then use another PETSc routine to store it in the correct positions of $c_i$, which is a column of the sparse matrix $\tilde{C}_J$. We observe that the nonzero columns of $A_{J\Gamma}$ that take part in these operations, in each subdomain, are related to the extended interface, but are stored locally.

Figure 4:
Incomplete Forward Substitution.

Note that $\tilde{L}_J$ and $A_{J\Gamma}$ (see Equation (2.5)) are in process J, as can be seen in Figure 3, which means that Step 2 is done locally without any communication.
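The per-column kernel of Step 2 could be written as follows, with scipy standing in for the PETSc routines mentioned above (a sketch under the assumptions that the factor has a unit diagonal and that the sparse column has sorted indices).

```python
import scipy.sparse as sp
from scipy.linalg import solve_triangular

def ifs0_column(L_csr, a_i):
    """L_csr: sparse lower-triangular factor; a_i: one sparse (CSC) column of A_{J Gamma}."""
    I = a_i.indices                                   # nonzero row indices of a_i (sorted)
    L_dense = L_csr[I, :][:, I].toarray()             # gather the small dense block L(I, I)
    c_small = solve_triangular(L_dense, a_i.data.copy(),
                               lower=True, unit_diagonal=True)
    # scatter the small dense result back into a sparse column of C_tilde_J
    return sp.csc_matrix((c_small, I, [0, len(I)]), shape=a_i.shape)
```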

Step 3 computes $\tilde{B}_J$, also through the incomplete forward substitution, using the same idea described for Step 2, but $A_{\Gamma J}$ is not stored entirely in process J. Actually, $A_{\Gamma J}$ has nonzero elements in the rows associated with the extended interface $\bar{\Gamma}_J$; therefore the elements of $A_{\Gamma J}$ are distributed among process J and its neighbors. So each process has to communicate only with its own neighbors to compute $\tilde{B}_J$. The matrix $\tilde{B}_J$ is stored locally in process J.

The communication layout of Step 3 is illustrated in an example for four subdomains in Figure 5, where the arrows show the data transfer flow. In this example, using, for instance, five-point centered finite differences, each subdomain has at most two neighbors, but in different 2-D or 3-D grids and with different discretization schemes, each subdomain can have more neighbors.

Figure 5:
Domain distribution among processes.

Step 4 computes the approximate Schur complement through an incomplete product, as described by equation (3.9). The submatrix $A_{\Gamma\Gamma}$ is formed by the subdomain interfaces and is therefore distributed among the processes. Each process computes locally its contribution $\mathrm{IP}(\tilde{B}_J,\tilde{C}_J,k_{\mathrm{Prod}})$ and subtracts it globally from $A_{\Gamma\Gamma}$ to build up $\tilde{S}$, which is distributed among the processes in the same way $A_{\Gamma\Gamma}$ is. Again, each subdomain communicates with its neighbors following the pattern illustrated in Figure 5, but with the arrows reversed, and there is a synchronization barrier to form the global matrix $\tilde{S}$.

In Steps 5 and 6, each process takes a copy of the local part of $\tilde{S}$ relative to the extended interface $\bar{\Gamma}_J$, as in (3.10), communicating only with neighbors, and computes an incomplete LU factorization. The use of local copies avoids communication during the parallel application of $\tilde{S}$ throughout the linear solver iterations when using the RAS [5] version of the preconditioner.

In Step 7 the matrix $R_0 A R_0^T$ is initially formed as a parallel matrix, distributed among the processes. Local copies of this matrix are made in all processes and then an LU factorization is performed redundantly. The order of this matrix is P, which is in general quite small.

4.2 Preconditioner Application

Algorithm 2 describes the application of the preconditioner M to the vector r, i.e., the computation of z=Mr. We define two additional restriction operators, $R_{\Omega_J}:\Omega\to\Omega_J^{\mathrm{Int}}$ and $R_\Gamma:\Omega\to\Gamma$. These restriction operators are needed to operate on the correct entries of the vector; in the implementation they can be seen as subsets of indices.

Algorithm 2:
Preconditioner Application

Step 1 of Algorithm 2 solves a triangular linear system in each process involving the $\tilde{L}_J$ factor and the local interior portion of the residual, obtaining $z_J$. This step is done locally and no communication is necessary, because the residual r is distributed among the processes following the same parallel layout as the matrix.

Step 2 applies $\tilde{B}_J$, multiplying it by $z_J$ and subtracting the result from $r_\Gamma$ to obtain $z_\Gamma$; communication between neighboring subdomains is necessary only for $r_\Gamma$ and $z_\Gamma$, because $\tilde{B}_J$ is stored locally in process J.

The Schur complement application is done in Step 3. Because each process has a local copy of its part of $\tilde{S}$ factored into $L_{\tilde{S}_J}$ and $U_{\tilde{S}_J}$, only communication in $z_\Gamma$ between neighboring subdomains is necessary. As described above, $T_J$ acts as a diagonal operator, so we can store it as a vector and apply it through an element-wise product. The result is then summed globally into $z_\Gamma$. At this point there is a synchronization barrier among all processes.

In Step 4 we apply the $\tilde{U}_J$ factor to $z_J-\tilde{C}_J z_\Gamma$. The matrix-vector product $\tilde{C}_J z_\Gamma$ requires communication among neighbors in $z_\Gamma$; the triangular solve is then done locally with no communication.

The vectors $z_J$ and the vector $z_\Gamma$ are gathered into $z_F$ in Step 5, where $z_F$ is the residual preconditioned by the fine component of the preconditioner.

Step 6 applies the coarse component, completing the computation of the preconditioned residual. The triangular solves involved are local because there are local copies of $L_C$ and $U_C$ in each process, so no communication is necessary.
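Putting Steps 1-6 together, a serial sketch of the whole application follows (our reconstruction; in the actual code each J-indexed operation runs in its own MPI process, with the communication indicated in the comments).

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def apply_iSchur(r_int, r_gamma, Lt, Ut, Bt, Ct, S_solve, R_idx, T_w, coarse):
    """r_int[J], r_gamma: residual pieces; Lt, Ut, Bt, Ct: per-subdomain CSR factors/blocks;
    S_solve[J]: callable applying (L_SJ U_SJ)^{-1}; R_idx[J], T_w[J]: indices/weights of R_J, T_J;
    coarse: callable applying M_C. All names are ours, for illustration only."""
    P = len(r_int)
    # Step 1: z_J = Lt_J^{-1} r_J (local triangular solves, no communication)
    z = [spsolve_triangular(Lt[J], r_int[J], lower=True) for J in range(P)]
    # Step 2: z_Gamma = r_Gamma - sum_J Bt_J z_J (neighbor communication on the interface)
    z_g = r_gamma - sum(Bt[J] @ z[J] for J in range(P))
    # Step 3: z_Gamma <- S_F^{-1} z_Gamma = sum_J T_J (L_SJ U_SJ)^{-1} R_J z_Gamma
    acc = np.zeros_like(z_g)
    for J in range(P):
        acc[R_idx[J]] += T_w[J] * S_solve[J](z_g[R_idx[J]])   # weighted (RAS) scatter
    z_g = acc
    # Step 4: z_J <- Ut_J^{-1} (z_J - Ct_J z_Gamma)
    z = [spsolve_triangular(Ut[J], z[J] - Ct[J] @ z_g, lower=False) for J in range(P)]
    # Steps 5-6: gather the fine result and add the coarse correction
    return np.concatenate(z + [z_g]) + coarse(np.concatenate(r_int + [r_gamma]))
```

The last line combines the components additively, as in (3.13); the multiplicative variant (3.14) would follow the pattern of the sketch given after equation (3.1).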

5 COMPARISON TESTS

We use two metrics to evaluate the preconditioners: the number of iterations and the solution time. In this section we describe the platform where the tests were run, present the results of the proposed preconditioner, and compare with a native PETSc preconditioner.

We consider five matrices from reservoir simulation problems. They are linear systems arising from applying Newton's method to solve the nonlinear equations discretizing the system of PDEs that model multiphase flow in porous media. Matrices F8, U26, U31 and Psalt were generated from a Petrobras reservoir simulation tool. The Psalt problem is a synthetic presalt reservoir model. The fifth matrix is the standard SPE10 model [10], with over a million active cells. All problems are black-oil models using the fully implicit method, with four degrees of freedom per gridcell.

Table 1 shows the size of each problem: the total number of cells and the dimension of the corresponding matrix.

Table 1:
Test case problems.

For both preconditioners the problems were solved using PETSc's native right-preconditioned GMRES method [27], with a relative residual tolerance of 1e-4, a maximum of 1000 iterations, and restarts every 30 iterations. The mesh was partitioned using PT-Scotch [9]. iSchur uses fill-in level 1 in the subdomain interiors and level zero in the interface-related parts, namely $k_{\mathrm{Int}}=1$, $k_{\mathrm{Bord}}=k_{\mathrm{Prod}}=k_\Gamma=0$, as this configuration presented the best behavior in preliminary Matlab tests.
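For reference, these baseline solver settings can be reproduced with a few lines of petsc4py (a sketch of the block Jacobi baseline only; iSchur itself is custom code built on top of PETSc primitives, and the ILU level set below is an adjustable choice, not a value prescribed here).

```python
from petsc4py import PETSc

def make_baseline_solver(A):
    # A is a distributed PETSc Mat; returns a configured KSP solver
    ksp = PETSc.KSP().create(comm=A.getComm())
    ksp.setOperators(A)
    ksp.setType(PETSc.KSP.Type.GMRES)
    ksp.setTolerances(rtol=1e-4, max_it=1000)
    ksp.getPC().setType(PETSc.PC.Type.BJACOBI)
    opts = PETSc.Options()
    opts['ksp_pc_side'] = 'right'        # right preconditioning
    opts['ksp_gmres_restart'] = 30       # restart every 30 iterations
    opts['sub_pc_type'] = 'ilu'          # ILU on each Jacobi block
    opts['sub_pc_factor_levels'] = 1     # ILU fill level per block (PETSc's default is 0)
    ksp.setFromOptions()
    return ksp
```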

We compare the proposed two-level preconditioner with PETSc's native block Jacobi preconditioner (see [3]) combined (as in equation (3.1)) with the same coarse component presented in this paper. We refer to this combination simply as block Jacobi. We note that block-Jacobi/ILU is the default preconditioner in PETSc and is a common choice of preconditioner in reservoir simulation, which makes it an appropriate benchmark.

The first set of tests was run on an Intel Xeon(R) 64-bit CPU E5-2650 v2 @ 2.60GHz with 16 cores (hyper-threading disabled), 20MB LLC, 64GB RAM (ECC enabled). The operating system is a Debian GNU/Linux distribution running kernel version 3.2.65. The compiler is Intel Parallel Studio XE 2015.3.187 Cluster Edition.

The iSchur preconditioner, both in the setup and application phases, requires more computation and communication than block Jacobi, so it has to present a pronounced reduction in the number of iterations in order to be competitive. Table 2 shows the number of iterations and the total time, in seconds, for both preconditioners, for all the tested problems, ranging from one to sixteen processes. For each test, the best results are highlighted in gray. We can observe that the iSchur preconditioner does reduce the number of iterations compared to block Jacobi for all problems and process counts. For one process (sequential execution), both preconditioners are identical, as there are no interface points, so we show the results just for the sake of measuring speedup and scalability. On average, iSchur takes 80% as many iterations as block Jacobi. In this set of experiments, iSchur was more robust than block Jacobi, as the latter failed to converge in two experiments.

Table 2:
Number of iterations and the total time for both preconditioners from 1 to 16 processes. (NC) stands for “not converged”.

We also observe that, for some of the matrices, the number of iterations increases with the number of processes. This is to be expected, since block Jacobi neglects larger portions of the matrix as the size of the subdomains decreases. A related effect occurs with iSchur: since we use different levels of fill in the interface-related parts and in the interior parts, more information is neglected as the number of subdomains grows and the number of interface points increases.

Table 2 also shows the total time of the solution using both preconditioners. We can see that, despite the reduction in the number of iterations, the solver time of iSchur is smaller than block Jacobi's in only a few cases. One of the main reasons is that the iSchur setup time is greater than block Jacobi's, especially when running on few processes. Although iSchur's setup is intrinsically more expensive than block Jacobi's, we believe that this part of our code can still be further optimized to reduce this gap.

In reservoir simulation, the sparsity pattern of the Jacobian matrix only changes when the operating conditions of the wells change, and therefore it remains the same over the course of many nonlinear iterations. This is an opportunity to spare some computational effort in the setup phase, since in this case the symbolic factorization does not need to be redone. We can see in Table 3 the time comparison when the symbolic factorization is not considered; in this scenario iSchur is faster in most cases. Furthermore, block Jacobi is a highly optimized, mature code (the default PETSc preconditioner), while iSchur could possibly be further optimized.

Table 3:
Time of iterative solver without the symbolic factorization.

In order to examine the behavior of the preconditioners as the number of subdomains increases, we ran tests up to 512 processes on a cluster consisting of nodes with 16GB RAM and two Intel Xeon E5440 Quad-core processors each, totaling 8 CPU cores per node. The results are depicted in Figure 6. We see that iSchur consistently outperforms block Jacobi in terms of number of iterations. Nonetheless, the behaviour of both preconditioners, with respect to scalability, cannot be predicted from these experiments. Tests with more MPI processes, in a larger cluster, are still required.

Figure 6:
Comparison between block Jacobi and iSchur up to 512 processes.

6 CONCLUSIONS AND REMARKS

The presented preconditioner successfully reduced the number of iterations for the target reservoir problems compared to block Jacobi for the full range of partitions tested.

The scalability of both preconditioners is an ongoing research topic, as tests on a larger cluster are still necessary, but these initial results indicate a slight advantage for iSchur.

For the current implementation, the total solution time (setup plus iterative solve) of iSchur is smaller in only a few cases; one important factor is the inefficiency of the setup step. The block Jacobi implementation in PETSc is highly optimized, while ours still has room for improvement.

When the symbolic factorization time is removed, considering the case in which it is done once and reused many times, which is usual in reservoir simulation, the solution using the iSchur preconditioner is faster for most tested cases. This shows that the presented preconditioner can be a better alternative than block Jacobi for the fine component of two-level preconditioners when solving large-scale reservoir simulation problems. Furthermore, we believe the application part of the code can still be optimized.

REFERENCES

  • 1
    T. M. Al-Shaalan, H. Klie, A.H. Dogru, & M. F. Wheeler. Studies of robust two stage preconditioners for the solution of fullyimplicit multiphase flow problems. In SPE Reservoir Simulation Symposium, 2-4 February, The Woodlands, Texas,number SPE-118722 in MS, (2009).
  • 2
    D. A. Augusto, L. M. Carvalho, P. Goldfeld, I. C. L. Nievinski, J.R.P. Rodrigues, & M. Souza. An algebraic ILU(k) based two-level domain decomposition preconditioner. In Proceeding Series of the Brazilian Society of Computational and Applied Mathematics, 3 (2015), 010093-1-010093-7.
  • 3
    S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B.F. Smith, S. Zampini, H. Zhang, & H. Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, (2016).
  • 4
    M. Bollhöfer & V. Mehrmann. Algebraic multilevel methods and sparse approximate inverses. SIAM Journal on Matrix Analysis and Applications, 24(1) (2002), 191-218.
  • 5
    X. Cai & M. Sarkis. A Restricted Additive Schwarz Preconditioner for General Sparse Linear Systems. SIAM Journal on Scientific Computing, 21(2) (1999), 792-797.
  • 6
    H. Cao, H. Tchelepi, J. Wallis & H. Yardumian. Parallel scalable unstructured CPR-type linear solver for reservoir simulation. In SPE Annual Technical Conference and Exhibition, 9-12 October 2005, Dallas, Texas. Society of Petroleum Engineers, (2005).
  • 7
    L. M. Carvalho, L. Giraud & P. L. Tallec. Algebraic two-level preconditioners for the Schur complement method, SIAM Journal on Scientific Computing, 22(6) (2001), 1987-2005.
  • 8
    T. F. Chan, J. Xu & L. Zikatanov. An agglomeration multigrid method for unstructured grids. Contemporary Mathematics, 218 (1998), 67-81.
  • 9
    C. Chevalier & F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Computing, 34(6) (2008), 318-331.
  • 10
    M. A. Christie & M. J. Blunt. Tenth SPE comparative solution project: A comparison of upscaling techniques. In SPE Reservoir Simulation Symposium, (2001).
  • 11
    D. A. Collins, J.E. Grabenstetter & P. H. Sammon. A Shared-Memory Parallel Black-Oil Simulator with a Parallel ILU Linear Solver. SPE-79713-MS. In SPE Reservoir Simulation Symposium, Houston, 2003. Society of Petroleum Engineers.
  • 12
    L. Formaggia & M. Sala. Parallel Computational Fluids Dynamics. Practice and Theory., chapter Algebraic coarse grid operators for domain decomposition based preconditioners. Elsevier, (2002), 119-126.
  • 13
    S. Gries, K. Stüben, G. L. Brown, D. Chen & D. A. Collins. Preconditioning for efficiently applying algebraic multigrid in fully implicit reservoir simulations. SPE Journal, 19(04): 726-736, August 2014. SPE-163608-PA.
  • 14
    H. Jackson, M. Taroni & D. Ponting. A two-level variant of additive Schwarz preconditioning for use in reservoir simulation. arXiv preprint arXiv: 1401.7227, (2014).
  • 15
    P. Jenny, S. Lee & H. Tchelepi. Multi-scale finite-volume method for elliptic problems in subsurface flow simulation. Journal of Computational Physics, 187(1) (2003), 47- 67.
  • 16
    P. Jenny, S. Lee & H. Tchelepi. Adaptive fully implicit multi-scale finite-volume method for multi-phase flow and transport in heterogeneous porous media. Journal of Computational Physics, 217(2) (2006), 627-641.
  • 17
    P. Jenny, S. H. Lee & H. A. Tchelepi. Adaptive multiscale finite-volume method for multiphase flow and transport in porous media. Multiscale Modeling & Simulation, 3(1) (2005), 50-64.
  • 18
    A. Manea, J. Sewall & H. Tchelepi. Parallel multiscale linear solver for reservoir simulation. Lecture Notes - slides - 4th SESAAI Annual Meeting Nov 5, 2013, (2013).
  • 19
    A. Muresan & Y. Notay. Analysis of aggregation-based multigrid. SIAM Journal on Scientific Computing, 30(2) (2008), 1082-1103.
  • 20
    R. Nabben & C. Vuik. A comparison of deflation and coarse grid correction applied to porous media flow. SIAM J. Numer. Anal., 42 (2004), 1631-1647.
  • 21
    F. Nataf, H. Xiang & V. Dolean. A two level domain decomposition preconditioner based on local Dirichlet-to-Neumann maps. Comptes Rendus Mathematique, 348(21-22) (2010), 1163-1167.
  • 22
    Y. Notay. Algebraic multigrid and algebraic multilevel methods: a theoretical comparison. Numer. Linear Algebra Appl, 12(5-6) (2005), 419-451.
  • 23
    Y. Notay. Algebraic analysis of two-grid methods: The nonsymmetric case. Numerical Linear Algebra with Applications, 17(1) (2010), 73-96.
  • 24
    D. W. Peaceman. Fundamentals of Numerical Reservoir Simulation. Elsevier, (1977).
  • 25
    A. Quarteroni & A. Valli. Applied and Industrial Mathematics: Venice - 1, 1989, chapter Theory and Application of Steklov-Poincaré Operators For Boundary-Value Problems. Kluwer, (1991), 179-203.
  • 26
    Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, (2003).
  • 27
    Y. Saad & M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3) (1986), 856-869.
  • 28
    Y. Saad & J. Zhang. BILUM: Block versions of multielimination and multilevel ILU preconditioner for general sparse linear systems. SIAM Journal on Scientific Computing, 20(6) (1999), 2103-2121.
  • 29
    B. Smith, P. Bjørstad & W. Gropp. Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, 1st edition, (1996).
  • 30
    J. Tang, R. Nabben, C. Vuik & Y. Erlangga. Theoretical and numerical comparison of various projection methods derived from deflation, domain decomposition and multigrid methods. Reports of the Department of Applied Mathematical Analysis 07-04, Delft University of Technology, January 2007.
  • 31
    U. Trottenberg, C. Oosterlee & A. Schuller. Multigrid. Academic Press, Inc. Orlando, USA, (2001).
  • 32
    C. Vuik & J. Frank. Coarse grid acceleration of a parallel block preconditioner. Future Generation Computer Systems, JUN, 17(8) (2001), 933-940.
  • 33
    J. Wallis, R. Kendall & T. Little. Constrained residual acceleration of conjugate residual methods. In SPE Reservoir Simulation Symposium, number SPE 13563 in 8 th Symposium on Reservoir Simulation, Dallas, 1985. Society of Petroleum Engineers.
  • This article is based on work presented at CNMAC2016.
  • 1
    Had the condition K>J been dropped from definition (2.3), Γ would still be a separator set. But it would be unnecessarily large, which would yield a more expensive preconditioner.
  • 2
    If A is structurally symmetric, then $A^T$ and A have the same nonzero structure but are not necessarily equal.

Publication Dates

  • Publication in this collection
    Jan-Apr 2018

History

  • Received
    28 Nov 2016
  • Accepted
    26 Sept 2017