E[K] = .

2

6.7 Hash functions

In this section, we apply the tools we have developed thus far to a par-

ticularly important area of computer science: the theory and practice of

hashing.

The scenario is as follows. We have ¬nite, non-empty sets A and Z, with

|A| = k and |Z| = n, and a ¬nite, non-empty set H of hash functions, each

of which map elements of A into Z. More precisely, each element h ∈ H

de¬nes a function that maps a ∈ A to an element z ∈ Z, and we write

z = h(a); the value z is called the hash code of a (under h), and we say

that a hashes to z (under h). Note that two distinct elements of H may

happen to de¬ne the same function. We call H a family of hash functions

(from A to Z).

Let H be a random variable whose distribution is uniform on H. For any

a ∈ A, H(a) denotes the random variable whose value is z = h(a) when

H = h. For any = 1, . . . , k, we say that H is an -wise independent

family of hash functions if each H(a) is uniformly distributed over Z, and the

collection of all H(a) is -wise independent; in case = 2, we say that H is

a pairwise independent family of hash functions. Pairwise independence

is equivalent to saying that for all a, a ∈ A, with a = a , and all z, z ∈ Z,

1

P[H(a) = z § H(a ) = z ] = .

n2

Example 6.25. Examples 6.17 and 6.18 provide explicit constructions for

pairwise independent families of hash functions. In particular, from the

discussion in Example 6.17, if n is prime, and we take A := Zn , Z := Zn ,

and H := {hx,y : x, y ∈ Zn }, where for hx,y ∈ H and a ∈ A we de¬ne

hx,y (a) := ax + y, then H is a pairwise independent family of hash functions

from A to Z.

Similarly, Example 6.18 yields a pairwise independent family of hash func-

tions from A := Z—t to Z := Zn , with H := {hx1 ,...,xt ,y : x1 , . . . , xt , y ∈ Zn },

n

126 Finite and discrete probability distributions

where for hx1 ,...,xt ,y ∈ H and (a1 , . . . , at ) ∈ A, we de¬ne

hx1 ,...,xt ,y (a1 , . . . , at ) := a1 x1 + · · · + at xt + y.

In practice, the inputs to such a hash function may be long bit strings, which

we chop into small pieces so that each piece can be viewed as an element of

Zn . 2

6.7.1 Hash tables

Pairwise independent families of hash functions may be used to implement

a data structure known as a hash table, which in turn may be used to

implement a dictionary.

Assume that H is a family of hash functions from A to Z, where |A| = k

and |Z| = n. A hash function is chosen at random from H; an element

a ∈ A is inserted into the hash table by storing the value of a into a bin

indexed by the hash code of a; likewise, to see if a particular value a ∈ A

is stored in the hash table, one must search in the bin indexed by the hash

code of a.

So as to facilitate fast storage and retrieval, one typically wants the ele-

ments stored in the hash table to be distributed in roughly equal proportions

among all the bins.

Assuming that H is a pairwise independent family of hash functions, one

can easily derive some useful results, such as the following:

• If the hash table holds q values, then for any value a ∈ A, the expected

number of other values that are in the bin indexed by a™s hash code

is at most q/n. This result bounds the expected amount of “work”

we have to do to search for a value in its corresponding bin, which

is essentially equal to the size of the bin. In particular, if q = O(n),

then the expected amount of work is constant. See Exercise 6.32

below.

• If the table holds q values, with q(q ’ 1) ¤ n, then with probability

at least 1/2, each value lies in a distinct bin. This result is useful if

one wants to ¬nd a “perfect” hash function that hashes q ¬xed values

to distinct bins: if n is su¬ciently large, we can just choose hash

functions at random until we ¬nd one that works. See Exercise 6.33

below.

• If the table holds n values, then the expected value of the maximum

number of values in any bin is O(n1/2 ). See Exercise 6.34 below.

Results such as these, and others, can be obtained using a broader notion

6.7 Hash functions 127

of hashing called universal hashing. We call H a universal family of hash

functions if for all a, a ∈ A, with a = a , we have

1

P[H(a) = H(a )] ¤

.

n

Note that the pairwise independence property implies the universal prop-

erty (see Exercise 6.29 below). There are even weaker notions that are

relevant in practice; for example, in some applications, it is su¬cient to

require that P[H(a) = H(a )] ¤ c/n for some constant c.

Exercise 6.29. Show that any pairwise independent family of hash func-

tions is also a universal family of hash functions.

—(t+1)

Exercise 6.30. Let A := Zn and Z := Zn , where n is prime. Let

H := {hx1 ,...,xt : x1 , . . . , xt ∈ Zn } be a family of hash functions from A to

Z, where for hx1 ,...,xt ∈ H, and for (a0 , a1 , . . . , at ) ∈ A, we de¬ne

hx1 ,...,xt (a0 , a1 , . . . , at ) := a0 + a1 x1 + · · · + at xt .

Show that H is universal, but not pairwise independent.

Exercise 6.31. Let k be a prime and let n be any positive integer. Let

A := {0, . . . , k ’ 1} and Z := {0, . . . , n ’ 1}. Let

H := {hx,y : x = 1, . . . , k ’ 1, y = 0, . . . , k ’ 1},

be a family of hash functions from A to Z, where for hx,y ∈ H and for a ∈ A,

we de¬ne

hx,y (a) := ((ax + y) mod k) mod n.

Show that H is universal. Hint: ¬rst show that for any a, a ∈ A with a = a ,

the number of h ∈ H such that h(a) = h(a ) is equal to the number of pairs

of integers (r, s) such that

0 ¤ r < k, 0 ¤ s < k, r = s, and r ≡ s (mod n).

In the following three exercises, assume that H is a universal family of

hash functions from A to Z, where |A| = k and |Z| = n, and that H is a

random variable uniformly distributed over H.

Exercise 6.32. Let a1 , . . . , aq be distinct elements of A, and let a ∈ A.

De¬ne L to be the number of indices i = 1, . . . , q such that H(ai ) = H(a).

Show that

1 + (q ’ 1)/n if a ∈ {a1 , . . . , aq };

E[L] ¤

q/n otherwise.

128 Finite and discrete probability distributions

Exercise 6.33. Let a1 , . . . , aq be distinct elements of A, and assume that

q(q ’ 1) ¤ n. Show that the probability that H(ai ) = H(aj ) for some i, j

with i = j, is at most 1/2.

Exercise 6.34. Assume k ≥ n, and let a1 , . . . , an be distinct elements of

A. For z ∈ Z, de¬ne the random variable Bz := {ai : H(ai ) = z}. De¬ne

the random variable M := max{|Bz | : z ∈ Z}. Show that E[M ] = O(n1/2 ).

Exercise 6.35. A family H of hash functions from A to Z is called -

universal if for H uniformly distributed over H, and for all a, a ∈ A with

a = a , we have P[H(a) = H(a )] ¤ . Show that if H is -universal, then we

must have

1 1

≥ ’ .

|Z| |A|

Hint: using Exercise 6.26, ¬rst show that if H, A, A are mutually inde-

pendent random variables, with H uniformly distributed over H, and A

and A uniformly distributed over A, then P[A = A § H(A) = H(A )] ≥

1/|Z| ’ 1/|A|.

6.7.2 Message authentication

Pairwise independent families of hash functions may be used to implement

a message authentication scheme, which is a mechanism to detect if

a message has been tampered with in transit between two parties. Unlike

an error correcting code (such as the one discussed in §4.5.1), a message

authentication scheme should be e¬ective against arbitrary tampering.

As above, assume that H is a family of hash functions from A to Z, where

|A| = k and |Z| = n. Suppose that Alice and Bob somehow agree upon a

hash function chosen at random from H. At some later time, Alice transmits

a message a ∈ A to Bob over an insecure network. In addition to sending

a, Alice also sends the hash code z of a. Upon receiving a pair (a, z), Bob

checks that the hash code of a is indeed equal to z: if so, he accepts the