5 Consider p processors P_1, P_2, ..., P_p connected in a chain, so that each processor can communicate only with its immediate neighbors.

(a) A message of length m is “broadcast” to all processors by sending it from P_1 to P_2 and then from P_2 to P_3, etc., until it reaches all destinations, i.e., until it reaches P_p. How much time does it take for the message to complete this process?

(b) Now split the message into k packets of equal size and pipeline the data transfer. Typically, each processor will receive packet number i from the previous processor, while sending packet i - 1, which it has already received, to the next processor. The packets will travel in chain from P_1 to P_2, ..., to P_p. In other words, each processor executes a program that is described roughly as follows:

    Do i = 1, Number_of_Packets
       Receive packet number i from the previous processor
       Send packet number i - 1 to the next processor
    EndDo

There are a few additional conditionals. Assume that the number of packets is k. How much time does it take for all packets to reach all p processors? How does this compare with the simple method in (a)?
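For reference, the two schemes can be compared under a simple linear communication model; the startup time β and per-word transfer time τ are assumptions of this sketch, not quantities defined in the text:

```latex
% (a) store-and-forward broadcast: p-1 hops, each carrying all m words
T_{\text{simple}} = (p-1)\,(\beta + m\tau)

% (b) pipelined broadcast with k packets of m/k words each: the first
% packet reaches P_p after p-1 hops, and the remaining k-1 packets
% arrive one hop-time apart
T_{\text{pipe}} = (p+k-2)\left(\beta + \tfrac{m}{k}\,\tau\right)
```

For large m and a well-chosen k, the pipelined time approaches mτ plus lower-order terms, i.e., roughly a factor of p - 1 faster than the store-and-forward scheme.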

6 (a) Write a short FORTRAN routine (or C function) which sets up the level number of each

unknown of an upper triangular matrix. The input matrix is in CSR format and the output should

be an array of length n containing the level number of each node. (b) What data structure should

be used to represent levels? Without writing the code, show how to determine this data structure

from the output of your routine. (c) Assuming the data structure of the levels has been deter-

mined, write a short FORTRAN routine (or C function) to solve an upper triangular system

using the data structure resulting from the previous question. Show clearly which loop should be

executed in parallel.
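For part (a), a hedged sketch in C rather than FORTRAN (the function name, argument names, and 0-based CSR indexing are assumptions of this sketch, not the text's): since the matrix is upper triangular, unknown i depends only on unknowns j > i, so a single backward sweep over the rows suffices.

```c
/* Sketch: compute level numbers for a backward solve with an upper
   triangular matrix in CSR format (0-based indices).
   ia, ja : CSR row pointers and column indices
   levels : output array of length n; levels[i] is the level of node i */
void csr_levels(int n, const int *ia, const int *ja, int *levels)
{
    /* Row i depends only on columns j > i, whose levels are already
       known when the rows are visited from the bottom up. */
    for (int i = n - 1; i >= 0; i--) {
        int lev = 0;
        for (int k = ia[i]; k < ia[i + 1]; k++) {
            int j = ja[k];
            if (j > i && levels[j] > lev)
                lev = levels[j];
        }
        levels[i] = lev + 1;   /* nodes with no dependencies get level 1 */
    }
}
```

For part (b), a natural data structure is a permutation of the nodes grouped level by level, together with a pointer array marking where each level starts (exactly analogous to a CSR row pointer); it can be built from the levels array with one counting pass.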

7 In the jagged diagonal format described in Section 11.5.5, it is necessary to preprocess the matrix

by sorting its rows by decreasing number of nonzero elements. What type of sorting should be used for this

purpose?
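A natural candidate is a bucket (counting) sort: the sort keys are the nonzero counts of the rows, which are small bounded integers, so the rows can be ordered in O(n) time and the sort is easily kept stable. A sketch in C (the function name and conventions are assumptions of this sketch):

```c
#include <stdlib.h>

/* Sketch: order rows by decreasing nonzero count with a counting sort.
   ia   : CSR row pointer array (0-based); row i holds ia[i+1]-ia[i] nonzeros
   perm : output; perm[k] is the index of the row with the k-th largest count */
void sort_rows_by_nnz(int n, const int *ia, int *perm)
{
    int maxnnz = 0;
    for (int i = 0; i < n; i++) {
        int nz = ia[i + 1] - ia[i];
        if (nz > maxnnz) maxnnz = nz;
    }
    int *start = calloc(maxnnz + 1, sizeof *start);
    for (int i = 0; i < n; i++)
        start[ia[i + 1] - ia[i]]++;          /* histogram of counts */
    int pos = 0;
    for (int c = maxnnz; c >= 0; c--) {      /* prefix sums, largest count first */
        int tmp = start[c];
        start[c] = pos;
        pos += tmp;
    }
    for (int i = 0; i < n; i++)              /* stable placement */
        perm[start[ia[i + 1] - ia[i]]++] = i;
    free(start);
}
```

A general comparison sort (O(n log n)) would also work, but counting sort exploits the bounded keys and preserves the original relative order of rows with equal counts.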

8 In the jagged diagonal format described in Section 11.5.5, the matrix had to be preprocessed by

sorting it by rows of decreasing number of elements.

(a) What is the main reason it is necessary to reorder the rows?

(b) Assume that the same process of extracting one element per row is used. At some point the

extraction process will come to a stop and the remainder of the matrix can be put into a

CSR data structure. Write down a good data structure to store the two pieces of data and a

corresponding algorithm for matrix-by-vector products.

(c) This scheme is efficient in many situations but can lead to problems if the first row is very

short. Suggest how to remedy the situation by padding with zero elements, as is done for the

Ellpack format.
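As background for part (b), the matrix-by-vector product for the plain jagged diagonal format can be sketched as follows; the layout (val, col, jdptr, perm) and the 0-based indexing are assumptions of this sketch, not the text's:

```c
/* Sketch: y = A*x for A in jagged diagonal (JAD) format, 0-based.
   val, col : values and column indices, stored diagonal by diagonal
   jdptr    : jdptr[d]..jdptr[d+1]-1 delimit jagged diagonal d
   perm     : perm[i] is the original index of the i-th sorted row */
void jad_matvec(int n, int ndiag, const double *val, const int *col,
                const int *jdptr, const int *perm,
                const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int d = 0; d < ndiag; d++) {
        int len = jdptr[d + 1] - jdptr[d];   /* rows long enough for diag d */
        for (int i = 0; i < len; i++) {      /* long inner loop: vectorizable */
            int k = jdptr[d] + i;
            y[perm[i]] += val[k] * x[col[k]];
        }
    }
}
```

For the hybrid scheme of part (b), one would stop after the jagged diagonals that remain profitably long and finish with an ordinary CSR product over the leftover entries, accumulating into the same vector y.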

9 Many matrices that arise in PDE applications have a structure that consists of a few diagonals

and a small number of nonzero elements scattered irregularly in the matrix. In such cases, it is

advantageous to extract the diagonal part and put the rest in a general sparse (e.g., CSR) format.

Write a pseudo-code to extract the main diagonals and the sparse part. As input parameter, the

number of diagonals desired must be specified.
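A hedged sketch of such an extraction in C (0-based CSR; the offsets of the desired diagonals are passed in ioff, and the caller zero-initializes diag; all names here are assumptions of this sketch):

```c
/* Sketch: split A (CSR) into ndiag stored diagonals plus a CSR remainder.
   ioff[d]     : offset of diagonal d (0 = main, >0 super-, <0 sub-diagonal)
   diag[d*n+i] : receives entry a(i, i+ioff[d]); caller zero-initializes diag
   ib, jb, b   : CSR arrays receiving everything not on a kept diagonal */
void split_diag(int n, int ndiag, const int *ioff,
                const int *ia, const int *ja, const double *a,
                double *diag, int *ib, int *jb, double *b)
{
    int nnz = 0;
    ib[0] = 0;
    for (int i = 0; i < n; i++) {
        for (int k = ia[i]; k < ia[i + 1]; k++) {
            int off = ja[k] - i, kept = 0;
            for (int d = 0; d < ndiag; d++) {
                if (ioff[d] == off) {        /* belongs to a kept diagonal */
                    diag[d * n + i] = a[k];
                    kept = 1;
                    break;
                }
            }
            if (!kept) {                     /* goes to the sparse remainder */
                jb[nnz] = ja[k];
                b[nnz] = a[k];
                nnz++;
            }
        }
        ib[i + 1] = nnz;
    }
}
```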

NOTES AND REFERENCES. Kai Hwang's book [124] is recommended for an overview of parallel
architectures. More general recommended reading on parallel computing includes the book by Bertsekas

and Tsitsiklis [25] and a more recent volume by Kumar et al. [139]. One characteristic of high-

performance architectures is that trends come and go rapidly. A few years ago, it seemed that massive parallelism was synonymous with distributed memory computing, specifically of the hypercube

type. Currently, many computer vendors are mixing message-passing paradigms with “global address

space,” i.e., a shared memory viewpoint. This is illustrated in the recent T3D machine built by CRAY

Research. This machine is configured as a three-dimensional torus and allows all three programming

paradigms discussed in this chapter, namely, data-parallel, shared memory, and message-passing. It

is likely that the T3D will set a certain trend. However, another recent development is the advent of

network supercomputing which is motivated by astounding gains both in workstation performance

and in high-speed networks. It is possible to solve large problems on clusters of workstations and to


obtain excellent performance at a fraction of the cost of a massively parallel computer.

Regarding parallel algorithms, the survey paper of Ortega and Voigt [156] gives an exhaustive

bibliography for research done before 1985 in the general area of solution of Partial Differential

Equations on supercomputers. An updated bibliography by Ortega, Voigt, and Romine is available in

[99]. See also the survey [178] and the monograph [71]. Until the advent of supercomputing in the

mid 1970s, storage schemes for sparse matrices were chosen mostly for convenience as performance

was not an issue, in general. The first paper showing the advantage of diagonal storage schemes in
sparse matrix computations is probably [133]. The first discovery by supercomputer manufacturers of
the specificity of sparse matrix computations was the painful realization that without hardware support, vector computers could be inefficient. Indeed, the early CRAY machines did not have hardware
instructions for gather and scatter operations but this was soon remedied in the second-generation
machines. For a detailed account of the beneficial impact of hardware for “scatter” and “gather” on

vector machines, see [146].

Level scheduling is a textbook example of topological sorting in graph theory and was discussed

from this viewpoint in, e.g., [8, 190, 228]. For the special case of finite difference matrices on rectangular domains, the idea was suggested by several authors independently, [208, 209, 111, 186, 10]. In

fact, the level scheduling approach described in this chapter is a “greedy” approach and is unlikely

to be optimal. There is no reason why an equation should be solved as soon as it is possible. For

example, it may be preferable to use backward scheduling [7], which consists of defining the levels

from bottom up in the graph. Thus, the last level consists of the leaves of the graph, the previous level

consists of their predecessors, etc. Dynamic scheduling can also be used as opposed to static schedul-

ing. The main difference is that the level structure is not preset; rather, the order of the computation is

determined at run-time. The advantage over pre-scheduled triangular solutions is that it allows pro-

cessors to always execute a task as soon as its predecessors have been completed, which reduces idle

time. On loosely coupled distributed memory machines, this approach may be the most viable since

it will adjust dynamically to irregularities in the execution and communication times that can cause

a lock-step technique to become inefficient. However, for those shared memory machines in which

hardware synchronization is available and inexpensive, dynamic scheduling would have some dis-

advantages since it requires managing queues and explicitly generates busy waits. Both approaches

have been tested and compared in [22, 189] where it was concluded that on the Encore Multimax

dynamic scheduling is usually preferable except for problems with few synchronization points and a

large degree of parallelism. In [118], a combination of prescheduling and dynamic scheduling was

found to be the best approach on a Sequent Balance 21000. There seems to have been no comparison

of these two approaches on distributed memory machines or on shared memory machines with mi-

crotasking or hardware synchronization features. In [22, 24] and [7, 8], a number of experiments are

presented to study the performance of level scheduling within the context of preconditioned Conju-

gate Gradient methods. Experiments on an Alliant FX-8 indicated that a speed-up of around 4 to 5

can be achieved easily. These techniques have also been tested for problems in Computational Fluid

Dynamics [214, 216].

…mial preconditioning.

A different strategy altogether is to enhance parallelism by using graph theory algorithms, such as graph-coloring techniques. These consist of coloring nodes such that two adjacent nodes have different colors. The gist of this approach is that all unknowns associated with the same color can be determined simultaneously in the forward and backward sweeps of the ILU preconditioning operation.

Finally, a third strategy uses generalizations of “partitioning” techniques, which can