of realizations of a paired random vector. The statistical model treats the SLP and SST fields as a paired random variable $(X, Y)$ with covariance and cross-covariance matrices $\Sigma_X$, $\Sigma_Y$, and $\Sigma_{XY}$. These matrices are estimated in the conventional

phrase that seems to be able to override most statistical scepticism.

⁴ This statement must not be reduced, or changed, to the misleading statement ‘the interval contains the true parameter with (the selected high) probability.’ While the latter is technically equivalent, it encourages the mistake of regarding the parameter, rather than the endpoints of the interval, as being random.

4.1.7 The Test of a Null Hypothesis. We briefly touched on the subject of statistical hypothesis testing in [1.2.7]. Here, we continue to discuss the concept in an intuitive manner before using a more rigorous approach in Chapter 6.

A statistical test is a decision making procedure that attempts to determine whether a given set

4: Concepts in Statistical Inference


of observations contains information consistent with a concept that was formulated a priori. This ‘concept’ is known as the null hypothesis and is usually denoted with the symbol $H_0$.

In general, only two decisions are possible about $H_0$:

• reject $H_0$ (if sufficient evidence is found that it is false), or

• do not reject $H_0$ (if sufficient evidence cannot be found that it is false).

The decision is a random variable because it is a function of the sample. Thus, there will be some sampling variability in the decision. The same decision about $H_0$ may not be made in every replication of the experiment that produced the sample.

The decision making rule used in hypothesis testing is constructed using a statistical model so that effects of the sampling variability on the average decision are known, and so that the rule extracts the strongest possible evidence against $H_0$ from the sample.

Since there is sampling variability, there is a chance of rejecting $H_0$ when $H_0$ is true. The probability, or risk $\tilde{p}$, of making this incorrect decision is called the significance level. The amount of risk can be controlled by the user of the test. The only way to avoid all risk is to set $\tilde{p} = 0$ so that $H_0$ is never rejected, which, of course, makes the test useless. However, the risk of false rejection can be set very near zero, at the expense of reducing the chances of rejecting $H_0$ when it is false.

It is important to remember that the concept of significance is an artifact of the conceptual model that we place around our data gathering. The significance level $\tilde{p}$ is realized only if the statistical model we are using is correct and only if the ‘experiment’ that generated the data is replicated ad infinitum. In the real world we need to base our decision about $H_0$ on a single sample.

The decision making mechanism often consists of a statistic $T$ and an interval designed so that it contains $(1 - \tilde{p}) \times 100\%$ of the realizations of $T$ when $H_0$ is true. Then $H_0$ is rejected at the $\tilde{p} \times 100\%$ significance level if the observed value of $T$, say $T = t$, falls outside the interval.

The important aspects of a statistical test are as follows.

• The statistical model correctly reflects the stochastic properties of the observed random variables and the way in which they were observed. The actual significance level of the decision is different from the specified level $\tilde{p}$ if there is a problem with the statistical model.

• The decision rule should be constructed so that the chances of rejecting $H_0$ are optimized when $H_0$ is false. That is, the decision rule should maximize the power of the test.

Usually the model and the null hypothesis are separate but related entities.

The statistical model used to represent an experiment is expressed in terms of a random variable and the way in which it was observed. For example, if the null hypothesis is that the mean of a random variable is zero (i.e., $H_0$: $\mu = 0$), we might use a model that says that the sample was drawn at random from a normal distribution with known variance $\sigma^2$ and unknown mean $\mu$. That is, the model describes, in statistical terms, the way in which observations were collected (they were drawn at random), and the probability distribution (normal, with known variance $\sigma^2$) of the random variable which is observed.

Note that it is often not necessary, or desirable, to prescribe a particular probability distribution. Our test of the mean can be conducted almost as efficiently if we assume only that the observations are drawn from a symmetric distribution with unknown mean $\mu$.

The null hypothesis $H_0$ specifies a value of the unknown parameter in the statistical model of the experiment. Note that in general the model may have many parameters and $H_0$ might specify values for only a few of them. The parameters that are not specified are called nuisance parameters and must be estimated. The testing procedure must properly account for the uncertainty of any parameter estimates.

4.1.8 Example: Number of Hurricanes in a Pair of GCM Experiments. As an example, we consider Bengtsson, Botzet, and Esch's simulation [42, 43] of possible changes of the frequency of hurricanes due to increasing atmospheric concentrations of greenhouse gases. They dealt with hurricanes in both hemispheres, but we limit ourselves in the following to their results for the Northern Hemisphere.

Bengtsson et al. conducted a pair of ‘time-slice experiments’ with a high-resolution Atmospheric General Circulation Model. One experiment was performed with present-day sea ice and sea-surface temperature distributions, and atmospheric CO2 concentration. In the other experiment, doubled

4.1: General

CO2 concentrations were prescribed together with anomalous sea ice and SST conditions simulated in an earlier experiment in which the GCM was coupled to a low-resolution ocean.⁵ The number of hurricanes in a model year is treated as a random variable.

The number of hurricanes in a year in the 1×CO2 and the 2×CO2 experiment is labelled $N_1$ and $N_2$, respectively. The question of whether the number of storms changes in the 2×CO2 experiment can be expressed as the null hypothesis:

$H_0$: $E(N_1) = E(N_2)$

or, in words, ‘the expected number of hurricanes in the 1×CO2-model world equals the expected number of hurricanes in the 2×CO2-model world.’ We adopt a significance level of 5%, that is, we accept a 5% risk of incorrectly rejecting the null hypothesis.

To design a test strategy we consider the number of hurricanes in any model year as being statistically independent. We also assume that the shape of the distribution of the number of hurricanes is the same in both the 1×CO2 and 2×CO2 experiments. That is, we assume that the mean changes in response to CO2 doubling but that the higher moments (see [2.6.7]) do not.

Given these assumptions we may then use the Mann–Whitney test [6.6.11]. This test operates with the sum of the ranks of the samples. Rank 1 is given to the smallest number of hurricanes found in all years from both time-slice experiments, rank 2 to the second smallest number and so on. Then the sum of the ranks of the yearly

In Bengtsson et al.'s case, the sample sizes are $n_1 = n_2 = 5$ since both simulations were run for five years. The yearly hurricane frequencies in the simulations are:

year   1×CO2   2×CO2
1      49      41
2      55      42
3      63      46
4      51      38
5      63      38

The rank sum for the $n_2 = 5$ realizations of $N_2$ is 15. Note that all are smaller than any of the realizations of $N_1$. When the null hypothesis is true, the 5% threshold value for the rank sum is 18; that is, if $H_0$ is true, the rank sum will be greater than or equal to 18 in 19 out of every 20 replications of this experiment, and it will be less than 18 only once. Since the actual rank sum of 15 is smaller than the 5% threshold of 18, we reject the null hypothesis at the 5% significance level.⁶ We may conclude, at least in the framework of the GCM world, that an increase of the CO2-concentration will reduce the frequency of Northern Hemisphere hurricanes.

4.1.9 Testing a Null Hypothesis: Interpretation of the Result. Given a particular sample, the decision to reject $H_0$ with a significance level of $\tilde{p}$ may occur for several reasons.

• We may have incorrectly rejected a true null hypothesis. Occasional errors of this kind are unavoidable if we wish to make decisions. We saw in the example above that unusual rank sums can occur even when there is no change
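The rank-sum calculation of [4.1.8] can be retraced with a short Python sketch. It is an illustration only, not the procedure of [6.6.11]: the function names midranks, rank_sum and null_cdf are hypothetical, tied counts are given their mean rank, and the null distribution is obtained by assuming that, when the null hypothesis is true, every assignment of five of the ten ranks to the 2×CO2 sample is equally likely and that ties can be ignored (here the ties occur only within a sample, so they do not affect the rank sum).

```python
from fractions import Fraction
from itertools import combinations

# Yearly hurricane counts from the table in [4.1.8].
n1_counts = [49, 55, 63, 51, 63]  # 1xCO2 experiment
n2_counts = [41, 42, 46, 38, 38]  # 2xCO2 experiment


def midranks(values):
    """Rank values from the smallest (rank 1) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [Fraction(0)] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = Fraction((start + 1) + (end + 1), 2)  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum(sample, other):
    """Sum of the ranks that `sample` receives within the pooled data."""
    pooled = list(sample) + list(other)
    return sum(midranks(pooled)[: len(sample)])


def null_cdf(w, n_sample, n_total):
    """P(W <= w) under H0, assuming all rank assignments are equally likely (no ties)."""
    sums = [sum(c) for c in combinations(range(1, n_total + 1), n_sample)]
    return Fraction(sum(s <= w for s in sums), len(sums))


w_obs = rank_sum(n2_counts, n1_counts)   # rank sum of the 2xCO2 sample
p_obs = null_cdf(w_obs, 5, 10)           # chance of a rank sum this small under H0
p_below_18 = null_cdf(17, 5, 10)         # chance of a rank sum below 18 under H0
print(w_obs, p_obs, p_below_18)
```

Under these assumptions the sketch reproduces the rank sum of 15 and gives a probability of 1/252 for a rank sum that small, and 4/252 for any rank sum below 18, both within the 5% risk adopted in the example. Enumerating all 252 ways of choosing five ranks out of ten is practical only for small samples; larger samples call for tables or an approximation.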