nodes whose i-th bit is 0 and those whose i-th bit is 1. This will be called tearing along the
i-th direction. Since there are n bits, there are n directions. One important consequence of
this is that arbitrary meshes with dimension <= n can be mapped on hypercubes. However,

the hardware cost for building a hypercube is high, because each node becomes difficult

to design for larger dimensions. For this reason, recent commercial vendors have tended to

prefer simpler solutions based on two- or three-dimensional meshes.

Distributed memory computers come in two different designs, namely, SIMD and

MIMD. Many of the early projects have adopted the SIMD organization. For example,

the historical ILLIAC IV Project of the University of Illinois was a machine based on a

mesh topology where all processors execute the same instructions.

SIMD distributed processors are sometimes called array processors because of the

regular arrays that they constitute. In this category, systolic arrays can be classified as an

example of distributed computing. Systolic arrays are distributed memory computers in

which each processor is a cell which is programmed (possibly micro-coded) to perform

only one of a few operations. All the cells are synchronized and perform the same task.

Systolic arrays are designed in VLSI technology and are meant to be used for special

purpose applications, primarily in signal processing.

TYPES OF OPERATIONS

Now consider two prototype Krylov subspace techniques, namely, the preconditioned Con-

jugate Gradient method for the symmetric case and the preconditioned GMRES algorithm

for the nonsymmetric case. For each of these two techniques, we analyze the types of oper-

ations that are performed. It should be emphasized that other Krylov subspace techniques

require similar operations.

Preconditioned CG

Consider Algorithm 9.1. The first step when implementing this algorithm on a high-
performance computer is identifying the main operations that it requires. We distinguish
five types of operations, which are:

1. Preconditioner setup.
2. Matrix-by-vector multiplications.
3. Vector updates.
4. Dot products.
5. Preconditioning operations.

In the above list the potential bottlenecks are (1), setting up the preconditioner, and (5),
solving linear systems with the preconditioner M, i.e., the preconditioning operation. Section 11.6 discusses

the implementation of traditional preconditioners, and the last two chapters are devoted

to preconditioners that are specialized to parallel environments. Next come the matrix-

by-vector products which deserve particular attention. The rest of the algorithm consists

essentially of dot products and vector updates which do not cause significant difficulties in
parallel machines, although inner products can lead to some loss of efficiency on certain

types of computers with large numbers of processors.

GMRES

The only new operation here with respect to the Conjugate Gradient method is the orthogonalization
of the vector Av_i against the previous v's. The usual way to accomplish this is
via the modified Gram-Schmidt process, which is basically a sequence of subprocesses of
the form:

Compute alpha = (w, v).
Compute w := w - alpha * v.

This orthogonalizes a vector against another vector of norm one. Thus, the outer loop of

the modified Gram-Schmidt is sequential, but the inner loop, i.e., each subprocess, can be

parallelized by dividing the inner product and SAXPY operations among processors. Al-

though this constitutes a perfectly acceptable approach for a small number of processors,

the elementary subtasks may be too small to be efficient on a large number of processors.

An alternative for this case is to use a standard Gram-Schmidt process with reorthogonalization.
This replaces the previous sequential orthogonalization process by a matrix operation
of the form w := w - V V^T w, i.e., BLAS-1 kernels are replaced by BLAS-2 kernels.

Recall that the next level of BLAS, i.e., level 3 BLAS, exploits blocking in dense

matrix operations in order to obtain performance on machines with hierarchical memories.

Unfortunately, level 3 BLAS kernels cannot be exploited here because at every step, there

is only one vector to orthogonalize against all previous ones. This may be remedied by

using block Krylov methods.

Vector Operations

These are usually the simplest operations to implement on any computer. In many cases,

compilers are capable of recognizing them and invoking the appropriate machine instruc-

tions, possibly vector instructions. In the specific case of CG-like algorithms, there are two

types of operations: vector updates and dot products.

Vector Updates Operations of the form

y(1:n) = y(1:n) + a * x(1:n),

where a is a scalar and x and y are two vectors, are known as vector updates or SAXPY

operations. They are typically straightforward to implement in all three machine models

discussed earlier. On an SIMD computer, the above code segment can be used on many

of the recent systems and the compiler will translate it into the proper parallel version.

The above line of code is written in FORTRAN 90, which is the prototype programming

language for this type of computer. On shared memory computers, we can simply write

the usual FORTRAN loop, possibly in the above FORTRAN 90 style on some computers,