N2 is formed. Very small or large rank sums give evidence that the null hypothesis is false, because rank sums of this type occur when most of the yearly hurricane frequencies in one experiment are greater than those in the other experiment. Under the null hypothesis we would expect a roughly equal number of large frequencies in both experiments. Rank sum thresholds for making decisions about H0 at various significance levels are listed in Appendix I.

• The statistical model adopted for the observations may not be valid. The observations may not have been sampled in the way assumed by the model (e.g., they might not be independent) or they might not have the assumed distribution (e.g., it might not be symmetric about the mean). The resulting decision making procedure may reject H0 much more frequently than specified by p̃ even when H0 is true.
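The rank-sum decision rule described above is straightforward to reproduce: pool the two samples, rank them, sum the ranks of one sample, and ask how extreme that sum is among all equally likely splits of the pooled data. Below is a minimal Python sketch; the yearly hurricane counts are made-up illustrative numbers, not the values from the experiments.

```python
from itertools import combinations

def rank_sum(sample_a, sample_b):
    """Sum of the ranks of sample_a within the pooled, sorted data.
    Tied values receive the average of the ranks they span."""
    pooled = sorted(sample_a + sample_b)
    def avg_rank(v):
        first = pooled.index(v) + 1          # ranks are 1-based
        last = first + pooled.count(v) - 1   # last slot occupied by ties
        return (first + last) / 2
    return sum(avg_rank(v) for v in sample_a)

def two_sided_p(sample_a, sample_b):
    """Exact permutation p-value: the fraction of all equally likely
    splits whose rank sum is at least as far from its H0 expectation
    as the observed rank sum."""
    pooled = sample_a + sample_b
    n_a, n = len(sample_a), len(sample_a) + len(sample_b)
    observed = rank_sum(sample_a, sample_b)
    expected = n_a * (n + 1) / 2             # E[rank sum] under H0
    count = total = 0
    for idx in combinations(range(n), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(n) if i not in idx]
        total += 1
        if abs(rank_sum(a, b) - expected) >= abs(observed - expected):
            count += 1
    return count / total

# Hypothetical yearly counts: five 1xCO2 years and five 2xCO2 years
ctrl = [8, 7, 9, 6, 8]
doubled = [5, 4, 6, 3, 5]
print(two_sided_p(ctrl, doubled))  # small: every ctrl rank exceeds expectation
```

For two samples of five years there are only 252 splits, so the exact permutation distribution is cheap to enumerate; for larger samples one would rely on tabulated thresholds such as those in Appendix I, or on a normal approximation.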

5 Briefly, the rationale for this methodology is as follows: Hurricanes are not resolved in the low-resolution GCMs. It is, however, assumed that the low-resolution model simulates the large-scale SST and sea-ice distributions well. It is also assumed that the atmospheric circulation is, to a first order approximation, in equilibrium with its lower boundary conditions. These assumptions make it possible to assess the impact of the changed SST and sea-ice distributions and the enhanced CO2 concentration on hurricanes in the high-resolution GCM.

6 If the null hypothesis is true, the probability that the five years representative of 2×CO2 conditions all have fewer storms than those representative of the 1×CO2 conditions is 1/252 (0.4%).

7 We reiterate that the significance level determines the frequency with which we will make this type of error (which statisticians call a 'type I' error). A testing procedure that operates at the 5% significance level will make a type I error 5% of the time when H0 is true.
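The 1/252 quoted in footnote 6 is a simple combinatorial count, and can be verified directly (Python serves only as a calculator here):

```python
from math import comb

# Under H0 every ranking of the ten yearly counts is equally likely,
# so each of the comb(10, 5) ways of choosing which five of the ten
# ranked years belong to the 2xCO2 experiment is equally likely.
# Exactly one of those choices puts all five 2xCO2 years below all
# five 1xCO2 years.
p = 1 / comb(10, 5)
print(comb(10, 5))        # 252
print(round(100 * p, 2))  # 0.4 (percent)
```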

4: Concepts in Statistical Inference

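Footnote 7 states that a test operating at the 5% significance level makes a type I error 5% of the time when H0 is true. A small Monte Carlo sketch, using hypothetical Gaussian data with known variance, illustrates the point:

```python
import random

random.seed(1)  # reproducibility of the illustration only

def z_test_rejects(n):
    """Sample n values from N(0, 1) -- so H0: mean = 0 is true -- and
    report whether a two-sided z test rejects at the 5% level."""
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    z = mean * n ** 0.5    # known sigma = 1, so z = mean / (1 / sqrt(n))
    return abs(z) > 1.96   # 1.96: two-sided 5% point of N(0, 1)

trials = 20_000
rate = sum(z_test_rejects(30) for _ in range(trials)) / trials
print(rate)  # close to 0.05, the advertised type I error rate
```

The rejection frequency settles near 0.05 precisely because H0 is true in every trial; the test is doing what it advertises, not making a mistake in its design.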

Protection against this type of error can be partly obtained by using robust statistical methods. Robust methods continue to perform reasonably well under moderate departures from the assumed model (see Section 6.6 and [8.3.17]). However, in general, there is no way to determine positively that the model underlying the test is valid. Instead, additional physical arguments are required to support the model. Also, other statistical tests can sometimes be used to ensure that the data are not grossly inconsistent with the adopted model (e.g., one can test the null hypothesis that the observations come from a normal distribution).

• We may have correctly rejected a false H0.

Similarly, the decision not to reject H0 can happen for several reasons.

• H0 may be false, but the test may not have sufficient evidence to reject H0. The probability of this type of error depends upon the power of the test. The probability of not rejecting H0 when it is false must also be nonzero to have a useful decision making mechanism.

• The model adopted for the observations may not be valid, and the decision making procedure developed from this model rejects H0 too infrequently even when H0 is false. This error in the model results in a test with very low power.

• H0 may be true and insufficient evidence was found to reject H0. This is the desired outcome.

The relevant catch phrase in all of this is 'statistical significance,' which may be markedly different from 'physical significance.' The size of departure that is detectable by a statistical test is a function of the amount of information about the tested parameter available in the sample. Large samples contain more information than do small samples, and thus even physically trivial departures from H0 will be found to be statistically significant given a large enough sample.

4.1.10 Source of Confusion: The Significance Level. The term significance level sometimes causes confusion. Some people, particularly climatologists, interpret the 'significance level' as 'one minus the probability of rejecting a correct null hypothesis.' With this convention, large probabilities, for example, 99%, are associated with statistical significance. This usage is contrary to the convention used in the statistical literature. Here we follow the statistical convention and define the 'significance level' as the probability of incorrectly rejecting the null hypothesis. A smaller significance level implies more evidence that H0 is false. If H0 is rejected with a significance level of 1%, then there is 1 chance in 100 of obtaining the result by accident when the null hypothesis is true.

4.1.11 Source of Confusion: Confidence and Significance. One often reads statements that an author is '95% confident that the null hypothesis is false' or that 'the null hypothesis is rejected at the 95% confidence level.' These statements interpret rejection of the null hypothesis at the 5% significance level incorrectly. When we reject a null hypothesis we are simply stating that the value of the test statistic is unusual in the context of the null hypothesis (i.e., we have observed a value of the test statistic that occurs less than 5% of the time when H0 is true). Because the value is unusual, we conclude that the null hypothesis is likely false. But we can not express this 'likelihood' as a probability.8

8 At least not in the 'frequentist' paradigm we use in this book. Bayesian statisticians extend the notion of probability to include subjective assessments of the likelihood that a parameter has one value as opposed to another. It then becomes possible to solve statistical decisions by comparing the odds in favour of one hypothesis with those in favour of another. See Gelman et al. [139] for an introduction to Bayesian analysis.

The precise logical statement in the argument is 'H0 true → 1 out of 20 decisions is "reject H0",' which is not at all related to the statement 'reject H0 → H0 false in 19 out of 20 cases.'

4.2 Random Samples

4.2.1 Sampling. The conceptual model for a simple random sample is that a simple, repeatable experiment is performed that has the effect of drawing elements from a sample space at random and with replacement.

The amount of imagination required to apply this paradigm depends upon the problem at hand. We will briefly consider three examples.

• Suppose one wanted to estimate the height of the average human living today. We can literally accomplish this by selecting humans at random from the global population (about five billion people) and recording their


heights. With care, and a lot of preparation, it is at least conceptually possible to ensure that everyone has the same probability of being selected. Thus, we can be assured that, if we sample the population 1000 times, the resulting sample of 1000 heights will be representative of the entire global population.

Here, the concept of a simple random sample representative of the population is easy to comprehend because the population from which the sample is to be drawn is finite. The logistics required to obtain the sample (i.e., preparing a list of five billion names and selecting randomly from those names) are easily visualized.

• Suppose now that one wanted to estimate global mean temperature at 00 UTC on a given day: again, an easily imagined accomplishment. One approach would be to select randomly n locations on the globe and to measure the temperature at each location at precisely 00 UTC. Our thinking in this example is necessarily a bit more abstract than in the previous example. The number of points at which a temperature measurement can be taken is infinite, and the logistics of placing a thermometer are more difficult for some points than for others. None the less, given the desire and sufficient resources, this

process is ergodic (meaning that sampling a given realization of the process in time yields information equivalent to randomly sampling independent realizations of the same process).

It is clear, then, that the concept of sampling a geophysical process is complex, and that very strong assumptions are implicit in the analysis of climate data.

4.2.2 Models for a Collection of Data. Usually, the sampling exercise can be represented by a collection of independent and identically distributed random variables, say {X1, ..., Xn}. When the sample is taken, we end up with a set of realizations {x1, ..., xn}. Part of the conceptual baggage we carry is the idea that the sample could be taken again, resulting in another set of realizations, say {x′1, ..., x′n}, of {X1, ..., Xn}. The statistical model describes the range of possible realizations of the sample and the relative likelihood of each realization.

The phrase independent and identically distributed represents two sampling assumptions that are almost always needed when using classical inference methods (see Chapters 5–9). The assumptions are as follows.

• The observations x1, ..., xn are realizations of n independent random variables