to it. It can also be used to compute the product of the transpose of a matrix

by a vector, when the matrix is stored (row-wise) in the CSR format. Normally, the vector

y(ja(k1:k2)) is gathered and the SAXPY operation is performed in vector mode. Then the

resulting vector is “scattered” back into the positions ja(*), by what is called a Scatter

operation. This is illustrated in Figure 11.7.

A major difficulty with the above FORTRAN program is that it is intrinsically sequential.

First, the outer loop is not parallelizable as it is, but this may be remedied as will be

seen shortly. Second, the inner loop involves writing back results of the right-hand side

into memory positions that are determined by the indirect address function ja. To be

correct, y(ja(1)) must be copied first, followed by y(ja(2)), etc. However, if it is known that

the mapping ja(i) is one-to-one, then the order of the assignments no longer matters. Since

compilers are not capable of deciding whether this is the case, a compiler directive from

the user is necessary for the Scatter to be invoked.
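As a concrete illustration, the gather/SAXPY/scatter step can be sketched in plain Python (the function name is illustrative; arrays mirror the FORTRAN ia, ja, a, but use 0-based indexing, and the gather/scatter pair collapses to an indexed update):

```python
# Hedged sketch of the column-oriented (scatter) product y = y + A*x, with A
# stored column-wise: for column j, a[k1:k2] holds the nonzero values and
# ja[k1:k2] their row positions.
def csc_matvec_scatter(n, ia, ja, a, x, y):
    for j in range(n):
        k1, k2 = ia[j], ia[j + 1]
        for k in range(k1, k2):
            # "Gather" y[ja[k]], apply the SAXPY update, "Scatter" it back.
            # This is only order-independent if the positions ja[k1:k2]
            # within one column are distinct (the one-to-one condition).
            y[ja[k]] += x[j] * a[k]
    return y
```

For example, for the 2-by-2 matrix with columns {1, 2} and {3} (rows 0, 1 and row 1), stored as ia = [0, 2, 3], ja = [0, 1, 1], a = [1, 2, 3], the product with x = [1, 1] yields y = [1, 5].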

Figure 11.7: Illustration of the column-oriented matrix-by-vector multiplication: a Gather of y(*), the operation y(*) = y(*) + x(j)*a(*,j), and a Scatter back into y(1:n).

Going back to the outer loop, subsums can be computed (independently) into

separate temporary vectors and then these subsums can be added at the completion of

all these partial sums to obtain the final result. For example, an optimized version of the

previous algorithm can be implemented as follows:


1. tmp(1:n,1:p) = 0.0

2. Do m=1, p

3. Do j = m, n, p

4. k1 = ia(j)

5. k2 = ia(j + 1)-1

6. tmp(ja(k1:k2),m) = tmp(ja(k1:k2),m) + x(j) * a(k1:k2)

7. EndDo

8. EndDo

9. y(1:n) = SUM(tmp,dim=2)

The SUM across the second dimension at the end of the algorithm constitutes additional

work, but it is highly vectorizable and parallelizable.
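The split form above can be sketched in plain Python as follows (the function name is illustrative, indexing is 0-based, and each of the p outer iterations is independent and could run in parallel):

```python
# Hedged sketch of the split sparse column product: p temporary vectors
# accumulate independent subsums (columns m, m+p, m+2p, ... go to temporary
# vector m), which are then summed to obtain the final result.
def csc_matvec_split(n, p, ia, ja, a, x):
    tmp = [[0.0] * n for _ in range(p)]     # tmp(1:n,1:p) in the FORTRAN
    for m in range(p):                      # independent across m
        for j in range(m, n, p):
            k1, k2 = ia[j], ia[j + 1]
            for k in range(k1, k2):
                tmp[m][ja[k]] += x[j] * a[k]
    # y(1:n) = SUM(tmp, dim=2): add the p partial vectors
    return [sum(tmp[m][i] for m in range(p)) for i in range(n)]
```

On the same small example as before (ia = [0, 2, 3], ja = [0, 1, 1], a = [1, 2, 3], x = [1, 1], p = 2), this again yields y = [1, 5].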

Matvecs in the Diagonal Format

The above storage schemes are general but they do not exploit any special structure of the

matrix. The diagonal storage format was one of the first data structures used in the context

of high performance computing to take advantage of special sparse structures. Often, sparse

matrices consist of a small number of diagonals, in which case the matrix-by-vector product

can be performed by diagonals as in Algorithm 11.3. For sparse matrices, most of the

diagonals invoked in the outer loop of Algorithm 11.3 are zero. There are again different

variants of Matvec algorithms for the diagonal format, related to different orderings of the

loops in the basic FORTRAN program. Recall that the matrix is stored in a rectangular

array diag(1:n,1:ndiag) and the offsets of these diagonals from the main diagonal may be

stored in a small integer array offset(1:ndiag). Consider a “dot-product” variant first.


1. Do i = 1, n

2. tmp = 0.0d0

3. Do j = 1, ndiag

4. tmp = tmp + diag(i,j)*x(i+offset(j))

5. EndDo

6. y(i) = tmp

7. EndDo
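The dot-product variant can be sketched in plain Python as follows (the function name is illustrative and indexing is 0-based; where the FORTRAN relies on zero padding of diag so that out-of-range references contribute nothing, the sketch guards the index explicitly):

```python
# Hedged sketch of the dot-product DIA variant: diag[i][j] is the i-th entry
# of the j-th stored diagonal, offset[j] its offset from the main diagonal.
def dia_matvec_dot(n, ndiag, diag, offset, x):
    y = [0.0] * n
    for i in range(n):
        tmp = 0.0
        for j in range(ndiag):
            col = i + offset[j]
            if 0 <= col < n:        # guard replacing the zero padding
                tmp += diag[i][j] * x[col]
        y[i] = tmp
    return y
```

For instance, the tridiagonal matrix with sub-, main, and super-diagonal entries 1, 2, 1 on three rows (offsets -1, 0, 1) times x = [1, 1, 1] gives y = [3, 4, 3].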

In a second variant, the result vector y is initialized to zero, and then x is multiplied by each of the

diagonals and the separate results are added to y. The innermost loop in this computation

is sometimes called a Triad operation.


1. y = 0.0d0

2. Do j = 1, ndiag

3. joff = offset(j)

4. i1 = max(1, 1-offset(j))

5. i2 = min(n, n-offset(j))

6. y(i1:i2) = y(i1:i2) + diag(i1:i2,j)*x(i1+joff:i2+joff)

7. EndDo

Good speeds can be reached on vector machines when the matrix is large enough.
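The triad variant can likewise be sketched in plain Python (illustrative name, 0-based indexing, so i1 = max(0, -joff) and i2 = min(n, n - joff) as an exclusive upper bound replace the 1-based FORTRAN loop limits):

```python
# Hedged sketch of the triad DIA variant: y is built up diagonal by diagonal
# with the vector triad y(i1:i2) = y(i1:i2) + diag(i1:i2,j)*x(i1+joff:i2+joff).
def dia_matvec_triad(n, ndiag, diag, offset, x):
    y = [0.0] * n
    for j in range(ndiag):
        joff = offset[j]
        i1 = max(0, -joff)
        i2 = min(n, n - joff)       # exclusive upper bound
        for i in range(i1, i2):     # the "triad" loop; vectorizable
            y[i] += diag[i][j] * x[i + joff]
    return y
```

On the same tridiagonal example (offsets -1, 0, 1, entries 1, 2, 1, x = [1, 1, 1]) this produces the same result y = [3, 4, 3] as the dot-product variant.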