U = 12.8 (11.1) and L = 1.24 (1.64) for p = 0.95 (0.90) when … has 5 degrees of freedom.

12 For example, … are replaced with efficient estimators. In this case the procedure is known as the 'parametric' bootstrap.
13 We write α(X) instead of just α to emphasize that α is a random variable whose distribution is derived from that of X.
15 That is, the true coverage of confidence intervals and true significance levels of tests will approach the specified values when samples become large.

5: Estimation

5.5.2 Ordinary Bootstrap. The bootstrap is an example of a resampling procedure. When F_X(x) is given by the empirical distribution function (5.3), step 1 above is equivalent to taking a sample of size n, with replacement, from the n observations x_1, …, x_n.16

To see that this is so, consider again step 2 above. When F_X is a smooth function, there will be a unique y_j for every u_j. However, when F_X is a step function such as the empirical distribution function, a range of u values can produce the same y_j; the resulting sample may therefore contain a given y_j more than once. In particular, the empirical distribution function has n steps of equal height that completely partition the interval (0, 1). Thus it follows that y_j will be equal to x_i for some i, and that every member of the sample {x_1, …, x_n} has the same probability of selection; that is, random resampling with replacement.

Each sample produced by the procedure described above is called a bootstrap sample. When F_X is the empirical distribution function, (2n−1 choose n) different samples can be generated. Consequently, the bootstrapped estimate of the distribution of α(x) will be quite coarse when n is small. However, the 'resolution' of the estimator quickly increases with increasing n. Even for moderate sample sizes, the cost of evaluating α(x) for all possible samples becomes prohibitive (and is generally not necessary). Satisfactory bootstrapped variance estimates can often be made with as few as 100 bootstrap samples. A somewhat larger number of samples is required to produce good confidence intervals, since these require estimates of quantiles in the tails of the α distribution.

There are some problems for which bootstrap estimators can be derived analytically. For example, the bootstrap estimator of σ² is σ̂² = ((n−1)/n) S² (see [5.2.6] and [5.3.3]).

5.5.3 Moving Blocks Bootstrap. As with all the other estimators discussed in this chapter, bootstrapped estimators are vulnerable to the effects of departures from the sampling assumptions. Zwiers [442] illustrates what can happen when serial correlation is ignored. The difficulty arises because the resampling procedure does not preserve the temporal dependence of the observations in the sample; the resampling done in the ordinary bootstrap produces samples of independent observations regardless of the dependencies that may exist within the original sample.

A simple adaptation that accounts for short-term dependence is called the moving blocks bootstrap (see Künsch [235], Liu and Singh [254], and also Léger, Politis, and Romano [248]). Instead of resampling individual observations, blocks of l consecutive observations are resampled, thus preserving much of the dependent structure in the observations. In general, the block length should be related to the 'memory', or persistence, of the process that has been sampled, with longer block lengths used when the process is more persistent. Wilks [423] points out that care is required to choose l appropriately. Blocks that are too long will result in confidence intervals with coverage greater than the nominal p, and vice versa. Theoretical work [235, 254] also shows that the block length l should increase with sample size n in such a way that l/n tends to zero as n approaches infinity.

Wilks [423] describes the use of the moving blocks bootstrap when constructing confidence intervals for the difference of two means from data that are serially correlated. He gives simple expressions for the block length that can be used when data come from auto-regressions of order 1 or 2 (Chapter 10). Wang and Zwiers [414] applied the moving blocks bootstrap to GCM simulated precipitation.

16 That is, the elements of the sample are obtained one at a time by drawing an observation at random, noting its value, and returning it to the pool of observations.
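Both resampling schemes can be sketched in a few lines. The following is a minimal illustration, not the exact algorithm of any of the works cited above: the function names and toy data are invented for the example, the moving-blocks version draws overlapping blocks and trims the concatenation back to the original sample size, and confidence intervals would require considerably more than 100 bootstrap samples.

```python
import random
import statistics

def ordinary_bootstrap(x, stat, n_boot=100, rng=None):
    """Resample individual observations with replacement and
    evaluate the statistic on each bootstrap sample."""
    rng = rng or random.Random(0)
    n = len(x)
    return [stat([x[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

def moving_blocks_bootstrap(x, stat, block_len, n_boot=100, rng=None):
    """Resample blocks of block_len consecutive observations, which
    preserves short-term dependence within each block."""
    rng = rng or random.Random(0)
    n = len(x)
    n_blocks = -(-n // block_len)        # enough blocks to cover n values
    starts = range(n - block_len + 1)    # all overlapping block positions
    samples = []
    for _ in range(n_boot):
        y = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            y.extend(x[s:s + block_len])
        samples.append(stat(y[:n]))      # trim to the original length
    return samples

# Bootstrap estimates of the standard error of the mean for a toy sample.
data = [0.2, 1.1, 0.7, -0.3, 0.9, 1.4, 0.5, 0.1, 1.0, 0.6]
print(statistics.stdev(ordinary_bootstrap(data, statistics.mean)))
print(statistics.stdev(moving_blocks_bootstrap(data, statistics.mean, 3)))
```

For serially correlated data the ordinary version understates the sampling variability because every drawn observation is treated as independent; the block version keeps runs of l consecutive values together, at the price of having to choose l.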

Part II

Confirmation and Analysis



Overview

In Part II we address the problem of determining the correctness of a certain statistical model1 in the light of empirical evidence. To make sure that the assessment is fair, the model must be tested with information that is gathered independently of that which is used to formulate the model. The standard method is called statistical hypothesis testing. We deal with this concept at some length in Chapter 6 (see also [1.2.7] and [4.1.7–11]).2 Examples of applications in climate research are presented in Chapter 7.

The application of hypothesis testing in climate research is fraught with problems that are not always encountered in other fields.

• In climate research it is rarely possible to perform real independent experiments (see Navarra [289]) with the observed climate system. There is usually only one observational record, which is analysed again and again until the processes of building and testing hypotheses are hardly separable. Dynamical climate models often provide a way out of this dilemma. Hypotheses that are formed by analysing the observed record can frequently be tested by running independent experiments with GCMs. However, even these experiments are not completely independent of the observed record, since GCMs rely heavily on parameterizations that have been tuned with the observed record.

Even though fully independent tests are not possible, testing is often useful as an interpretational aid because it helps quantify unusual aspects of the data. On the other hand, we need to be wary of indiscriminate testing because it sometimes allows unusual quirks to draw our attention away from physically significant aspects of our data.

• Almost all data in climate research have spatial and temporal correlations. This is most useful, since it allows us to infer the space–time state of the atmosphere and the ocean from a limited number of observations (cf. [1.2.2]). However, this correlation causes difficulties in testing problems, since most standard statistical techniques assume that the data are realizations of independent random variables.

Because of these difficulties, the use of statistical tests in a cookbook manner is particularly dangerous. Tests can become very unreliable when the statistical model implicit in the test procedure does not properly account for properties such as spatial or temporal correlation. The problems caused by the indiscriminate use of recipes are compounded when obscure sophisticated techniques are used. It is fashionable to surprise the community with miraculous new techniques, even though the statistical model implicit in the method is often not understood.
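The cost of ignoring temporal correlation can be quantified with a standard rule of thumb: n observations from an AR(1) process with lag-1 correlation ρ carry roughly the information of n(1 − ρ)/(1 + ρ) independent observations, so a test that treats them as independent overstates its evidence. A minimal sketch of this approximation (the sample size and ρ values are illustrative):

```python
def effective_sample_size(n, rho):
    """Approximate equivalent number of independent observations in n
    values from an AR(1) process with lag-1 correlation rho."""
    return n * (1.0 - rho) / (1.0 + rho)

# A sample of 100 dependent values can be worth far fewer independent ones.
for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, round(effective_sample_size(100, rho), 1))
```

At ρ = 0.6 the 100 dependent values are worth about 25 independent ones; at ρ = 0.9, only about 5.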

Hypothesis testing is carried out by formulating two propositions: the null hypothesis that is to be tested, and the alternative hypothesis, which usually encompasses a range of possibilities that may be true if the null hypothesis is false. The alternative hypothesis indirectly influences the test because it affects the interpretation of the evidence against the null hypothesis. The null hypothesis is rejected if the evidence against it is strong enough; it is not rejected when the evidence is weak, but this does not imply rejection of the alternative. We then continue to entertain the possibility that either of the hypotheses is true.

Null hypotheses are typically of the type A = B, and in climate research the alternative A ≠ B is usually correct. Often, though, the difference between A and B is small and physically irrelevant. Statistical tests cannot be used to distinguish between physically significant and insignificant differences. The strength of the evidence against the null hypothesis, and thus for the detection of a 'statistically significant' difference, depends on the amount of evidence, that is, the number of independent samples. As the sample size increases, so do the chances of detecting A ≠ B. With the very large sample sizes that can be constructed with GCMs, almost every physically irrelevant difference can achieve statistically significant status.
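The dependence on sample size is easy to make concrete. The sketch below assumes the simplest setting, a two-sample z test for a difference of means with known standard deviation, and an invented, physically negligible difference of 0.01 standard deviations; the arithmetic shows the difference becoming 'statistically significant' purely because n grows.

```python
import math

def z_statistic(delta, sigma, n):
    """z statistic for a difference of two means, each estimated from n
    independent observations with known standard deviation sigma."""
    return delta / (sigma * math.sqrt(2.0 / n))

def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

delta, sigma = 0.01, 1.0
for n in (10**2, 10**4, 10**6, 10**8):
    z = z_statistic(delta, sigma, n)
    print(n, round(z, 2), two_sided_p(z) < 0.05)
```

At n = 100 the difference is invisible (z ≈ 0.07); at n = 10^6 the same 0.01 standard deviations is detected with z ≈ 7.1.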

1 To be more precise: an attempt is made to determine whether the model is incorrect; absolute correctness can not be determined statistically.
2 There are two approaches to statistical decision making. We use the frequentist approach, since it is more common in climatology than the Bayesian approach (see, e.g., Gelman et al. [139]).


Obviously, this is not satisfactory. We introduce recurrence analysis (see Sections 6.9–6.10) as an alternative for assessing the strength of the difference A − B. This technique produces estimates of the degree of separation between A and B that are independent of the sample size.

6 The Statistical Test of a Hypothesis

6.0.0 Summary. In this chapter we introduce the ideas behind the art of testing statistical hypotheses (Section 6.1). The general concepts are described, terminology is introduced, and several elementary examples are discussed. We also examine some philosophical questions about testing and some extensions to cases in which it is difficult to build the statistical models needed for testing.

The significance level, power, bias, efficiency, and robustness of statistical tests are discussed in Section 6.2. The application of Monte Carlo simulation to testing problems is discussed in Section 6.3, and in Section 6.4 we examine how hypotheses are formulated and explore some of the limitations of statistical testing. The spatial correlation structure of the atmosphere often impacts testing problems. Strategies for coping with and using this structure are discussed in Section 6.5. A number of tests of the null …

… altering the way the evidence in the sample is judged.

A hypothesis testing process can only have two outcomes: either H0 is rejected or it is not rejected. The former does not imply acceptance of Ha; it simply means that we have fairly strong evidence that H0 is false. Failure to reject H0 simply means that the evidence in the sample is not inconsistent with H0.

6.1.2 The Ingredients of a Test. We need two objects to perform a statistical test: the object to be examined (a set of observations that, for convenience, we collect in a single vector x) and a rule that determines whether to reject the null hypothesis or not. This rule usually takes the form 'reject H0 if S(x) > κp', where S is a predetermined function that measures the evidence against H0, and κp is a threshold value for S