experiment (i.e., sampling exercise) is repeated

Figure 5.1: The empirical distribution function of the monthly mean SO index as computed from 1933–84 observations. [Axis label: SOI.]

is estimated by the frequency with which observations fall into that set in the available sample.

Several examples of histograms are shown in Chapter 3, for example, Figures 3.3, 3.5, 3.7, 3.18, or 3.20.
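The frequency estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name `bin_frequencies`, the half-open bins, and the sample values are assumptions made for the example and are not taken from the text.

```python
def bin_frequencies(sample, edges):
    """Relative frequency of the sample in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for value in sample:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break  # each observation falls into at most one bin
    # Observations outside every bin are not counted but remain in the denominator.
    return [count / len(sample) for count in counts]


# Made-up observations, used only for illustration.
sample = [-1.3, 0.4, 0.1, 2.2, -0.7, 0.9, -0.2, 1.5]

# The same sample evaluated on two different sets of bin edges.
coarse = bin_frequencies(sample, [-2.0, 0.0, 2.0, 4.0])
fine = bin_frequencies(sample, [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
print(coarse)
print(fine)
```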

Note that the histogram depends on the details of the partitioning, and that the partitioning is chosen subjectively.

5.2.2 Empirical Distribution Function. Combining the definition of the cumulative distribution function in (2.14) with the definition of the estimated probability density function in (5.2) gives the following natural estimator of the distribution function:

$$\hat{F}_X(x) = \frac{|\{X_k : X_k \le x\}|}{n} = \hat{P}(X \le x) = \hat{H}([-\infty, x]). \tag{5.3}$$

$\hat{F}_X$ is often called the empirical distribution function. It is a non-decreasing step function with $\hat{F}_X(-\infty) = 0$ and $\hat{F}_X(\infty) = 1$. The value of the function increases by a step of $1/n$ at each observation (or by a multiple of $1/n$ if several observations have the same value). Note that $\hat{F}_X(x_{(n|n)}) = 1$, and that the estimated probability of observing a value larger than the largest value in the sample, $x_{(n|n)}$, or a value smaller than the smallest value, $x_{(1|n)}$, is zero.$^{2}$ A slightly different estimator of the distribution function is described in [5.2.4].

$^{2}$ A reminder: $x_{(j|n)}$ is the $j$th order statistic of the sample $\{x_1, \ldots, x_n\}$, that is, the $j$th smallest value in the sample.

The empirical distribution function of the monthly mean Southern Oscillation Index (see Figures 1.2 and 1.4, and subsections [1.2.2], [2.8.7], and [8.1.4]) is shown in Figure 5.1.

5.2.3 Goodness-of-fit Tests: a Diversion. The subject of goodness-of-fit tests arises naturally in the context of estimating the distribution function. It is sometimes of interest to know whether a given sample $\{x_1, \ldots, x_n\}$ could be realizations of a random variable $Y$ with a particular type of probability distribution, such as the normal distribution. One approach to this type of goodness-of-fit question compares the empirical distribution function $\hat{F}_X$ with the proposed distribution function $F_Y$. The difference $\hat{F}_X - F_Y$ is a random variable, and it is therefore possible to construct goodness-of-fit tests that determine whether the observed difference is larger than is likely under the null hypothesis $H_0: F_X = F_Y$. Conover [88] provides a good introduction to the subject. Stephens [356] [357] provides technical details of a variety of goodness-of-fit tests not discussed by Conover.

The Kolmogorov–Smirnov test is a popular goodness-of-fit test that compares an empirical distribution function with a specified distribution function $F_Y$. The Kolmogorov–Smirnov test statistic,

$$D_{KS} = \max_x |\hat{F}_X(x) - F_Y(x)|,$$

measures the distance between the empirical distribution function and the specified distribution. A large difference indicates an inconsistency between the data and the statistical model $F_Y$.

There is a large family of related tests, some of which feature norms other than the max-norm.$^{3}$

$^{3}$ Other tests, such as the Anderson–Darling test and the Cramér–von Mises test (see [356], [357], [307]), use statistics that are more difficult to compute, but they are also more powerful and more sensitive to departures from the hypothesized distribution in the tails of the distribution.

The Kolmogorov–Smirnov test becomes ‘conservative,’ that is, rejects the null hypothesis


less frequently than indicated by the significance level, when $F_Y$ has parameters that are estimated from the sample.

This problem often occurs when we want to test for normality in a set of data. The Lilliefors test [253] is a variation of the Kolmogorov–Smirnov test that accounts for the uncertainty of the estimate of the mean and variance. The Lilliefors test statistic is given by

$$D_L = \max_x |\hat{F}_X(x) - F_N^{*}(x)|,$$

where $F_N^{*} \sim \mathcal{N}(\hat{\mu}_X, \hat{\sigma}_X)$ is the normal distribution in which the mean and standard deviation are replaced with the sample mean and standard deviation. $D_L$ measures the distance between the empirical distribution function and the normal distribution fitted to the data. Large realizations of $D_L$ indicate that $H_0$ should be rejected. Conover (see Section 6.1 and Table 15 in [88]) provides tables with thresholds for rejection as a function of sample size and significance level. Stephens [356] offers approximate formulae for the same purpose.

Figure 5.2 redisplays the empirical distribution function of the SO index as $(\hat{F}_X(x), F_N^{*}(x))$ pairs, where $F_N^{*}(x)$ is the normal distribution function with mean and variance estimated from the SO data. These points are expected to lie more or less on the $\hat{F}_X(x) = F_N^{*}(x)$ line when the fit is good (i.e., when $H_0$ is true). Note that the placement of the thresholds parallel to the $\hat{F}_X(x) = F_N^{*}(x)$ line is correct only if the iid assumption holds for the SOI, which is known not to be true. The results of the test can therefore not be taken literally.

5.2.4 Probability Plots. Subsection [3.1.3] discusses the format of a probability plot that is similar to Figure 5.2, but more useful for determining whether $F_Y = F_X$. A probability plot depicts the graph of the function $y \to F_X^{-1}[F_Y(y)]$, where $F_Y$ is some prescribed, possibly hypothesized, distribution and $F_X$ is the distribution of the data. The graph is plotted linearly in $y$ but the horizontal axis is labelled with the probabilities $F_Y(y)$ (see Figure 3.2).

A probability plot may be derived from a finite sample by plotting points $(F_Y^{-1}(\tilde{F}_X(x_i)), x_i)$, where $\tilde{F}_X$ is an estimator of the distribution function. Since $\hat{F}_X(x_n) = 1$ (taking the sample in ascending order, so that $x_n$ is its largest value), we cannot use $\tilde{F}_X = \hat{F}_X$.
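To make the preceding definitions concrete, here is a minimal Python sketch using only the standard library. It evaluates the empirical distribution function (5.3), computes the largest gap between it and a normal distribution fitted to the sample (the quantity $D_L$), and then shows that the empirical distribution function equals 1 at the largest observation, where the normal quantile is not finite. The function names and the sample values are hypothetical. The use of the $n-1$ standard deviation for the fit is an assumption. No rejection thresholds are computed, so this is not a complete test.

```python
from statistics import NormalDist, StatisticsError, mean, stdev


def ecdf(sample, x):
    """Empirical distribution function (5.3): fraction of observations <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def lilliefors_distance(sample):
    """Largest gap between the ECDF and a normal distribution fitted to the sample."""
    n = len(sample)
    # Assumption: the n-1 (sample) standard deviation is used for the fit.
    fitted = NormalDist(mean(sample), stdev(sample))
    gap = 0.0
    for i, value in enumerate(sorted(sample), start=1):
        cdf = fitted.cdf(value)
        # The ECDF steps from (i-1)/n to i/n at the i-th sorted value,
        # so the largest gap can occur on either side of the step.
        gap = max(gap, abs(i / n - cdf), abs((i - 1) / n - cdf))
    return gap


def normal_quantile(p):
    """Standard normal quantile, or None when it is not finite (p <= 0 or p >= 1)."""
    try:
        return NormalDist().inv_cdf(p)
    except StatisticsError:
        return None


# Made-up observations, used only for illustration (not SOI data).
sample = [-1.3, 0.4, 0.1, 2.2, -0.7, 0.9, -0.2, 1.5]

top = ecdf(sample, max(sample))
print("D_L:", lilliefors_distance(sample))
print("ECDF at the largest observation:", top)
print("normal quantile at that probability:", normal_quantile(top))
```

Checking both sides of each step is the usual way to evaluate the maximum over $x$ for a step function. The thresholds for rejection still have to come from tables or approximations such as those cited above.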

Using $\hat{F}_X$ itself would place the point $(\infty, x_n)$ on the scatter plot. Alternative estimators are

$$\tilde{F}_X(x) = \frac{|\{X_k : X_k \le x\}|}{n+1} = \frac{n}{n+1}\,\hat{F}_X(x)$$

and

$$\tilde{F}_X(x) = \frac{|\{X_k : X_k \le x\}| - 0.5}{n} = \hat{F}_X(x) - \frac{0.5}{n} \tag{5.4}$$

so that the points to be plotted are $(F_Y^{-1}(\frac{i}{n+1}), x_i)$ or $(F_Y^{-1}(\frac{i-0.5}{n}), x_i)$. Equation (5.4) is used in

[Figure 5.2 axes: Gaussian distribution and Empirical distribution (SOI), each running from 0.0 to 1.0.]
Figure 5.2: The empirical distribution function (5.3) of the SOI plotted against the cumulative