equations of the form Mz = y.

This section only considers those traditional preconditioners, such as ILU or SOR or SSOR, in which the solution with M is the result of solving triangular systems. Since these are commonly used, it is important to explore ways to implement them efficiently in a parallel environment. We only consider lower triangular systems of the form

Lx = b.

Without loss of generality, it is assumed that L is unit lower triangular.
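For concreteness, a small worked instance (the numbers are illustrative, not from the text): take

\[
L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 1 & 3 & 1 \end{pmatrix},
\qquad
b = \begin{pmatrix} 1 \\ 0 \\ 2 \end{pmatrix}.
\]

Forward substitution gives \(x_1 = 1\), then \(x_2 = 0 - 2(1) = -2\), then \(x_3 = 2 - 1(1) - 3(-2) = 7\); each unknown uses only previously computed unknowns, which is the dependence structure exploited below.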

Parallelism in Forward Sweeps

Typically in solving a lower triangular system, the solution is overwritten onto the right-hand side on return. In other words, there is one array x for both the solution and the right-hand side. Therefore, the forward sweep for solving a lower triangular system with coefficients al(i,j) and right-hand side x is as follows.

Algorithm: Sparse Forward Elimination

Do i=2, n
  For (all j such that al(i,j) is nonzero) Do:
    x(i) := x(i) - al(i,j) * x(j)
  EndDo
EndDo
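The sweep above can be sketched in Python (a sketch, assuming 0-based indexing and a dict-of-dicts al holding only the strictly lower triangular nonzeros; these names are illustrative, not from the text):

```python
def forward_sweep(al, x):
    """Solve a unit lower triangular system in place: al[i] maps each
    column index j < i to the nonzero coefficient l(i,j)."""
    n = len(x)
    for i in range(1, n):                  # unknown 0 needs no update
        for j, lij in al.get(i, {}).items():
            x[i] -= lij * x[j]             # x(i) := x(i) - al(i,j) * x(j)
    return x
```

For example, al = {1: {0: 2.0}, 2: {0: 1.0, 1: 3.0}} encodes the strictly lower part of a 3 x 3 unit lower triangular matrix; the solution overwrites the right-hand side, as in the algorithm.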

Assume that the matrix is stored row wise in the general Compressed Sparse Row (CSR)

format, except that the diagonal elements (ones) are not stored. Then the above algorithm

translates into the following code segment:

Do i=2, n
  Do j=ial(i), ial(i+1) - 1
    x(i) = x(i) - al(j) * x(jal(j))
  EndDo
EndDo
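The CSR code segment can likewise be sketched in Python (assuming 0-based arrays: ial holds the row pointers, jal the column indices, and al the values of the strictly lower part; the unit diagonal is not stored):

```python
def csr_forward_sweep(ial, jal, al, x):
    """In-place forward sweep on a strictly lower triangular matrix in CSR
    form: row i occupies al[ial[i]:ial[i+1]] with columns jal[ial[i]:ial[i+1]]."""
    n = len(x)
    for i in range(1, n):                    # sequential outer loop
        for k in range(ial[i], ial[i + 1]):  # sparse dot product with x
            x[i] -= al[k] * x[jal[k]]
    return x
```

With ial = [0, 0, 1, 3], jal = [0, 0, 1], al = [2.0, 1.0, 3.0], this stores the same 3 x 3 example as before, one contiguous row segment at a time.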

The outer loop corresponding to the variable i is sequential. The j loop is a sparse dot product of the i-th row of L and the (dense) vector x. This dot product may be split among the processors and the partial results may be added at the end. However, the length of the vector involved in the dot product is typically short. So, this approach is quite inefficient in general. We examine next a few alternative approaches. The regularly structured and the irregularly structured cases are treated separately.
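For the regularly structured case, the alternative groups grid unknowns into wavefronts that can be processed in parallel. A minimal sketch (assuming 0-based indices and a 5-point stencil in which the unknown at (i, j) depends only on (i-1, j) and (i, j-1)) assigns each point the level i + j and gathers the points level by level:

```python
def wavefront_levels(nx, ny):
    """Group the points of an nx-by-ny grid into wavefronts: all points
    (i, j) with i + j == lev are mutually independent in the forward
    sweep and can be solved simultaneously."""
    levels = [[] for _ in range(nx + ny - 1)]
    for i in range(nx):
        for j in range(ny):
            levels[i + j].append((i, j))
    return levels
```

On a 4 x 3 grid (an illustrative size), the wavefront sizes grow from 1 up to min(nx, ny) = 3 and shrink back to 1, which is the parallelism profile discussed in the next section.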

Level Scheduling: The Case of 5-Point Matrices

First, consider an example which consists of a 5-point matrix associated with a mesh as represented in Figure 11.10. The lower triangular matrix associated with this mesh is represented in the left side of Figure 11.10. The stencil represented in the right side of Figure 11.10 establishes the data dependence between the unknowns in the lower triangular system solution when considered from the point of view of a grid of unknowns. It tells us that in order to compute the unknown in position (i,j), only the two unknowns in positions (i-1,j) and (i,j-1) are needed. The unknown x_{1,1} does not depend on any other variable and can be computed first. Then the value of x_{1,1} can be used to get x_{1,2} and x_{2,1} simultaneously. Then these two values will in turn enable x_{3,1}, x_{2,2}, and x_{1,3} to be obtained simultaneously, and so on. Thus, the computation can proceed in wavefronts. The steps for this wavefront algorithm are shown with dashed lines in Figure 11.10. Observe that the


maximum degree of parallelism (or vector length, in the case of vector processing) that can be reached is the minimum of nx, ny, the number of mesh points in the x and y directions, respectively, for 2-D problems. For 3-D problems, the parallelism is of the order of the maximum size of the sets of domain points x_{i,j,k}, where i + j + k = lev, a constant level lev. It is important to note that there is little parallelism or vectorization at the beginning and at the end of the sweep. The degree of parallelism is equal to one initially, and then increases by one for each wave until it reaches its maximum, and then decreases back down to

one at the end of the sweep. For example, for a grid, the levels (sets of equations that