
In the ILU(0) factorization, the LU factors have the same nonzero patterns as the original matrix Ã, so that the references to the entries belonging to the external subdomains in the ILU(0) factorization are identical with those of the matrix-by-vector product operation with the matrix Ã. This is not the case for the more accurate ILU(p) factorization, with p > 0. If an attempt is made to implement a wavefront ILU preconditioner on a distributed memory computer, a difficulty arises because the natural ordering for the original sparse problem may put an unnecessary limit on the amount of parallelism available. Instead, a two-level ordering is used. First, define a "global" ordering which is a wavefront ordering for the subdomains. This is based on the graph which describes the coupling between the subdomains: two subdomains are coupled if and only if they contain at least a pair of coupled unknowns, one from each subdomain. Then, within each subdomain, define a local ordering.
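As an illustration (not from the text), the "global" wavefront ordering of the subdomains can be obtained as the level sets of a breadth-first traversal of the subdomain coupling graph. The `coupling` adjacency and starting subdomain below are hypothetical:

```python
# Hypothetical sketch: group subdomains into wavefronts, i.e., the BFS
# level sets of the subdomain coupling graph.  For a grid-like partition
# such as the example below, subdomains in the same wavefront are
# mutually uncoupled and can be processed in parallel.

def wavefront_order(coupling, start=0):
    """Return the subdomains grouped into BFS levels from `start`."""
    seen = {start}
    level = [start]
    fronts = []
    while level:
        fronts.append(level)
        nxt = []
        for s in level:
            for t in sorted(coupling[s]):
                if t not in seen:
                    seen.add(t)
                    nxt.append(t)
        level = nxt
    return fronts

# Example: a 2x2 grid of subdomains coupled to horizontal/vertical neighbors.
coupling = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(wavefront_order(coupling))   # [[0], [1, 2], [3]]
```

Within each wavefront, the local orderings of the individual subdomains are then chosen independently.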

To describe the possible parallel implementations of these ILU(0) preconditioners, it is sufficient to consider a local view of the distributed sparse matrix, illustrated in Figure 12.8. The problem is partitioned into p subdomains or subgraphs using some graph partitioning technique. This results in a mapping of the matrix into processors where it is assumed that the i-th equation (row) and the i-th unknown are mapped to the same processor. We distinguish between interior points and interface points. The interior points are those nodes that are not coupled with nodes belonging to other processors. Interface nodes are those local nodes that are coupled with at least one node which belongs to another processor.

Thus, processor number 10 in the figure holds a certain number of rows that are local rows. Consider the rows associated with the interior nodes. The unknowns associated with these nodes are not coupled with variables from other processors. As a result, the rows associated with these nodes can be eliminated independently in the ILU(0) process. The rows associated with the nodes on the interface of the subdomain will require more attention. Recall that an ILU(0) factorization is determined entirely by the order in which the rows are processed. The interior nodes can be eliminated first. Once this is done, the interface rows can be eliminated in a certain order. There are two natural choices for this order.

The first would be to impose a global order based on the labels of the processors. Thus, in the illustration, the interface rows belonging to Processors 2, 4, and 6 are processed before those in Processor 10. The interface rows in Processor 10 must in turn be processed before those of Processors 13 and 14. The local order, i.e., the order in which we process the interface rows in the same processor (e.g., Processor 10), may not be as important. This global order based on PE-number defines a natural priority graph and parallelism can be exploited easily in a data-driven implementation.
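The interior/interface splitting just described can be sketched as follows. The adjacency dictionary `adj` (node to neighbors) and the `owner` map (node to processor) are hypothetical names, not from the text:

```python
# Minimal sketch: split the nodes local to processor `pe` into interior
# nodes (all neighbors on the same processor) and interface nodes
# (at least one neighbor owned by another processor).

def split_local_nodes(adj, owner, pe):
    interior, interface = [], []
    for node, nbrs in adj.items():
        if owner[node] != pe:
            continue                      # not a local node of `pe`
        if any(owner[m] != pe for m in nbrs):
            interface.append(node)
        else:
            interior.append(node)
    return interior, interface

# Tiny example: a path graph 0-1-2-3 split across two processors.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
owner = {0: 0, 1: 0, 2: 1, 3: 1}
print(split_local_nodes(adj, owner, 0))   # ([0], [1])
```

The interior rows returned this way are exactly those that each processor can eliminate with no communication.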

[Figure 12.8. A local view of the distributed ILU(0): Processor 10 and its neighbors (Processors 2, 4, 6, 13, and 14), with the internal and external interface points marked.]

It is somewhat unnatural to base the ordering just on the processor labeling. Observe that a proper order can also be defined for performing the elimination by replacing the PE-numbers with any labels, provided that any two neighboring processors have a different label. The most natural way to do this is by performing a multicoloring of the subdomains, and using the colors in exactly the same way as before to define an order of the tasks. The algorithms will be written in this general form, i.e., with a label associated with each processor. Thus, the simplest valid labels are the PE numbers, which lead to the PE-label-based order. In the following, we define Lab_j as the label of Processor number j.
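One way to obtain such labels is a greedy multicoloring of the subdomain coupling graph; the sketch below (with a hypothetical `coupling` adjacency) guarantees that any two coupled subdomains receive different colors:

```python
# Hypothetical sketch: assign labels by greedy multicoloring of the
# subdomain coupling graph, so that neighboring processors never share
# a label.

def greedy_color(coupling):
    label = {}
    for s in sorted(coupling):            # any fixed traversal order works
        used = {label[t] for t in coupling[s] if t in label}
        c = 0
        while c in used:
            c += 1                        # smallest color unused by neighbors
        label[s] = c
    return label

coupling = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(greedy_color(coupling))   # {0: 0, 1: 1, 2: 1, 3: 0}
```

For a grid of subdomains this typically yields far fewer distinct labels than PE numbers, shortening the priority chains among interface rows.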

ALGORITHM: Distributed ILU(0) Factorization

1. In each processor P_i, i = 1, ..., p, Do:
2.    Perform the ILU(0) factorization for interior local rows.
3.    Receive the factored rows from the adjacent processors j with
4.       Lab_j < Lab_i.
5.    Perform the ILU(0) factorization for the interface rows with
6.       pivots received from the external processors in step 3.
7.    Perform the ILU(0) factorization for the boundary nodes, with
8.       pivots from the interior rows completed in step 2.
9.    Send the completed interface rows to adjacent processors j with
10.      Lab_j > Lab_i.
11. EndDo
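To see the parallelism implied by the labels, one can compute, for each processor, the earliest parallel phase at which its interface factorization can run: one more than the latest phase among its lower-labeled neighbors. The following sketch is illustrative only (the data and names are hypothetical):

```python
# Illustrative sketch: a processor factors its interface rows only after
# every lower-labeled neighbor has sent its factored rows.  Since labels
# of coupled processors differ, the precedence graph is acyclic and the
# recursion below terminates.

def interface_phases(coupling, label):
    phase = {}
    def depth(p):
        if p not in phase:
            preds = [q for q in coupling[p] if label[q] < label[p]]
            phase[p] = 1 + max((depth(q) for q in preds), default=0)
        return phase[p]
    for p in coupling:
        depth(p)
    return phase

coupling = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
label = {0: 0, 1: 1, 2: 1, 3: 0}          # a valid 2-coloring of the graph
print(sorted(interface_phases(coupling, label).items()))
# [(0, 1), (1, 2), (2, 2), (3, 1)]
```

With the 2-coloring, all interface rows complete in two phases, whereas a PE-number-based order on the same graph could force a longer chain of waits.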

Step 2 of the above algorithm can be performed in parallel because it does not depend on

data from other subdomains. Once this distributed ILU(0) factorization is completed, the

preconditioned Krylov subspace algorithm will require a forward and backward sweep at

each step. The distributed forward/backward solution based on this factorization can be

implemented as follows.

ALGORITHM: Distributed Forward and Backward Sweep