On distributed memory computers, some assumptions must be made about the way
in which the vectors are distributed. The main assumption is that the vectors x
and y are distributed in the same manner among the processors, meaning the
indices of the components of any vector that are mapped to a given processor
are the same. In this case, the vector-update operation will be translated into
independent vector updates, requiring no communication. Specifically, if nloc
is the number of variables local to a given processor, this processor will
simply execute a vector loop of the form

      y(1:nloc) = y(1:nloc) + a * x(1:nloc)

and all processors will execute a similar operation simultaneously.

Dot products   A number of operations use all the components of a given vector
to compute a single floating-point result which is then needed by all
processors. These are termed Reduction Operations and the dot product is the
prototype example. A distributed version of the dot-product is needed to
compute the inner product of two vectors x and y that are distributed the same
way across the processors. In fact, to be more specific, this distributed
dot-product operation should compute the inner product t = x^T y of these two
vectors and then make the result t available in each processor. Typically, this
result is needed to perform vector updates or other operations in each node.
For a large number of processors, this sort of operation can be demanding in
terms of communication costs. On the other hand, parallel computer designers
have become aware of its importance and are starting to provide hardware and
software support for performing global reduction operations efficiently.
Reduction operations that can be useful include global sums, global max/min
calculations, etc. A commonly adopted convention provides a single subroutine
for all these operations, and passes the type of operation to be performed
(add, max, min, multiply, ...) as one of the arguments. With this in mind, a
distributed dot-product function can be programmed roughly as follows.

      real*8 function distdot(n, x, incx, y, incy)
      integer n, incx, incy
      real*8 x(*), y(*), tloc, tsum, ddot
      external ddot
      tloc = ddot(n, x, incx, y, incy)
      call REDUCE('add', tloc, tsum)
      distdot = tsum
      return
      end

The function DDOT performs the usual BLAS-1 dot product of x and y with strides
incx and incy, respectively. The REDUCE operation, which is called with "add"
as the operation-type parameter, sums all the variables "tloc" from each
processor and puts the resulting global sum in the variable tsum in each
processor.

¢ ¢

§

XU P QI¥ AW @ f f¤¥ H QIb 3e

2 ¡2 I6

6 BeH H

U

Y9 P

W

To conclude this section, the following important observation can be made
regarding the practical implementation of Krylov subspace accelerators, such as
PCG or GMRES. The only operations that involve communication are the dot
product, the matrix-by-vector product, and, potentially, the preconditioning
operation. There is a mechanism for delegating the last two operations to a
calling program, outside of the Krylov accelerator. The result of this is that
the Krylov acceleration routine will be free of any matrix data structures as
well as communication calls. This makes the Krylov routines portable, except
for the possible redefinition of the inner product distdot.

This mechanism, particular to FORTRAN programming, is known as reverse
communication. Whenever a matrix-by-vector product or a preconditioning
operation is needed, the subroutine is exited and the calling program unit
performs the desired operation. Then the subroutine is called again, after
placing the desired result in one of its vector arguments.

A typical execution of a flexible GMRES routine with reverse communication is
shown in the code segment below. The integer parameter icode indicates the type
of operation needed by the subroutine. When icode is set to one, then a
preconditioning operation must be applied to the vector wk1. The result is
copied in wk2 and FGMRES is called again. If it is equal to two, then the
vector wk1 must be multiplied by the matrix A. The result is then copied in
wk2 and FGMRES is called again.

      icode = 0
 1    continue
      call fgmres(n, im, rhs, sol, vv, w, wk1, wk2,
     *            eps, maxits, iout, icode)
      if (icode .eq. 1) then
c        apply the preconditioner:  wk2 = M^{-1} * wk1
         call precon(n, wk1, wk2)
         goto 1
      else if (icode .eq. 2) then
c        matrix-by-vector product:  wk2 = A * wk1
         call matvec(n, wk1, wk2)
         goto 1
      endif

Reverse communication enhances the flexibility of the FGMRES routine
substantially. For example, when changing preconditioners, we can iterate on a
coarse mesh and do the
