23 An auto-regressive process of order 1 (an AR(1) process) is written formally as $X_t = \alpha X_{t-1} + N_t$, where $N_t$ is a series of independent random variables (sometimes called ‘white noise’). Chapters 10 and 11 explain AR(p) processes in some detail.

24 This procedure is closely related to the bootstrap (Section 5.5).
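As a rough illustration of the AR(1) recursion $X_t = \alpha X_{t-1} + N_t$ mentioned in footnote 23, the following sketch simulates such a process with Gaussian white noise and checks that the sample lag-1 correlation is close to $\alpha$. The function name `simulate_ar1`, the value $\alpha = 0.7$, the series length and the zero starting value are all assumptions made for the example, not choices taken from the text.

```python
import numpy as np

def simulate_ar1(alpha, n, rng, noise_std=1.0):
    """Simulate X_t = alpha * X_{t-1} + N_t with Gaussian white noise N_t.

    Hypothetical helper for illustration; the series starts at X_0 = 0.
    """
    x = np.zeros(n)
    noise = rng.normal(scale=noise_std, size=n)
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + noise[t]
    return x

rng = np.random.default_rng(42)
x = simulate_ar1(alpha=0.7, n=5000, rng=rng)

# For a stationary AR(1) process the lag-1 autocorrelation equals alpha,
# so the sample estimate should be close to 0.7 for a long series.
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
```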

Part III

Fitting Statistical Models


Overview

In this part of the book we introduce two classical, fully developed¹ methods of inference: ‘regression’ and ‘analysis of variance’ (ANOVA). We do not expect that there will be significant changes in the overall formulation of these techniques, but new applications and improved approaches for special cases may emerge.

Both regression and ANOVA are methods for the estimation of parametric models of the relationship between related random variables, or between a random variable and one or more non-random external factors. While regression techniques have been used almost from the beginning of quantitative climate research in different degrees of complexity (see, e.g., Brückner [70]), ANOVA has only recently been applied to climatic problems [441, 444].

The regression technique is introduced in detail and illustrated with several examples in Chapter 8.² Regression is used to describe relationships that involve variables and factors measured on a continuous scale. Examples of regression problems include modelling the trend in a time series by means of a polynomial function of time (which would be a non-random external factor), or the description of the link between two concurrent events, such as the width of a tree ring and the temperature, with the purpose of constructing ‘best guesses’ of temperature in ancient times when no instrumental data are available. Also, time-lagged events are linked through regression, such as the wind force in the German Bight and the water level in Hamburg several hours later. The derived model is then used for storm surge forecasts.
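The first example above, a polynomial trend in a time series, can be sketched in a few lines. This is only a minimal illustration on synthetic data: the helper name `fit_polynomial_trend`, the linear (degree 1) trend, the slope and the noise level are assumptions made for the example, and `numpy.polyfit` is used as one convenient least-squares routine.

```python
import numpy as np

def fit_polynomial_trend(t, y, degree=1):
    """Least-squares fit of a polynomial in time.

    Hypothetical helper; returns coefficients with the highest power first.
    """
    return np.polyfit(t, y, degree)

# Synthetic annual series: linear trend plus white noise (assumed values).
rng = np.random.default_rng(0)
years = np.arange(1900, 2000)
t = years - years.mean()          # centre time to keep the fit well conditioned
true_slope = 0.01
series = 0.2 + true_slope * t + rng.normal(scale=0.1, size=t.size)

coeffs = fit_polynomial_trend(t, series, degree=1)   # [slope, intercept]
trend = np.polyval(coeffs, t)                        # fitted trend values
residuals = series - trend                           # what the trend leaves unexplained
```

Time plays the role of the non-random external factor here. A higher `degree` would fit a curved trend with the same call.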

The reader may notice that climatologists often use the term ‘specify’ when they refer to regressed values, as opposed to the term ‘forecast’ commonly used by statisticians. Neither word is perfect. ‘Forecast’ implies that there will be error in the estimated value, but sometimes has irrelevant time connotations. ‘Specify’ eliminates the confusion about time but suggests that the estimate is highly accurate. However, despite its inadequacies, we use ‘specify,’ except when discussing projections forward in time, in which case we refer to forecasts.

The analysis of variance was designed by R.A. Fisher for problems arising in agriculture. In his words, ANOVA deals with ‘the separation of the variance ascribable to one group of causes from the variance ascribable to other groups.’ Separation of variance is also often required in climate diagnostics. A typical problem is to discriminate between the effect of internal and external processes on the global mean temperature. In that case, an internal process might be the formation and decay of storms in midlatitudes, while an external factor might be the stratospheric loading of volcanic aerosols. Another typical application treats sea-surface variability on monthly and longer time scales as an external process. In this case several independent climate simulations might be performed such that the same time series of sea-surface temperatures is prescribed in each simulation. ANOVA methods are then used to identify the simulated atmospheric variability that results from the prescribed sea-surface temperatures. The ANOVA technique is explained in detail in Chapter 9 and its merits are demonstrated with examples.
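The ensemble application described above can be illustrated with a one-way partition of variance. The sketch below is an assumption-laden toy, not the procedure of Chapter 9: it treats ‘year’ as the only factor, uses synthetic Gaussian data, and the names `one_way_anova`, `forced` and `internal` are hypothetical. The variability shared by all ensemble members in a given year stands in for the response to the prescribed sea-surface temperatures, and the scatter between members stands in for internal variability.

```python
import numpy as np

def one_way_anova(x):
    """Partition the variance of x, shaped (n_years, n_members).

    Hypothetical helper. Returns the between-year and within-year sums of
    squares, the total sum of squares, and the F-ratio of the mean squares.
    """
    n_years, n_members = x.shape
    grand_mean = x.mean()
    year_means = x.mean(axis=1)

    ss_between = n_members * np.sum((year_means - grand_mean) ** 2)
    ss_within = np.sum((x - year_means[:, None]) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)

    df_between = n_years - 1
    df_within = n_years * (n_members - 1)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, ss_total, f_ratio

# Synthetic ensemble: every member sees the same forced signal each year.
rng = np.random.default_rng(1)
n_years, n_members = 20, 6
forced = rng.normal(scale=1.0, size=(n_years, 1))            # common response (assumed)
internal = rng.normal(scale=0.5, size=(n_years, n_members))  # member-specific noise (assumed)
x = forced + internal

ss_between, ss_within, ss_total, f_ratio = one_way_anova(x)
```

The identity `ss_between + ss_within == ss_total` (up to rounding) is the ‘separation of variance’ in its simplest form. An F-ratio well above 1 suggests that the common, externally forced component is detectable against the internal variability.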

1 By ‘fully developed’ we mean that for each parameter involved there is at least an asymptotic distribution theory so hypothesis tests and confidence intervals can be readily constructed.

2 In fact, regression techniques appear throughout the book, as in Sections 14.3 and 14.4, which deal with Canonical Correlation Analysis and Redundancy Analysis, respectively.


8 Regression

8.1 Introduction

8.1.1 Outline. We start by describing methods used to estimate and make inferences about correlation coefficients. Then, we describe some of the ideas that underlie regression analysis, methods in which the mean of a response (or dependent) variable is described in terms of a simple function of one or more predictors (or independent variables). The models we consider are said to be linear because they are linear in their unknown parameters. We describe a variety of inferential methods and model diagnostics, and consider the robustness of the estimators of the model parameters.

A simple example is a naive model of climate change in which global annual mean temperature increases, on average, logarithmically with CO₂ concentration:

$$T_{year}^{globe} = a_0 + a_1 \ln(c_{\mathrm{CO_2}}) + \epsilon_{year}. \qquad (8.1)$$

We know that global annual mean temperature is subject to fluctuation induced by a variety of physical processes whose collective effect results in apparently stochastic behaviour. On the other hand, CO₂ concentration appears to have only a minor stochastic component, at least on interannual time scales, and can therefore be considered to be deterministic to a first approximation. The model proposes that global annual mean temperature, denoted $T_{year}^{globe}$, is trending upwards approximately logarithmically as the CO₂ concentration, denoted $c_{\mathrm{CO_2}}$, increases. It also proposes that $T_{year}^{globe}$ has a stochastic component, which is represented by the noise process $\{\epsilon_{year}\}$. There are two free parameters, $a_0$ and $a_1$, that must be estimated from the data. This is something that is often (although not always best) done using the method of least squares. Here least squares estimation of the parameters is simple because the model is linear in its parameters. If inferences are to be made about the parameters (e.g., tests of hypothesis or construction of confidence intervals), then it is required that (8.1) also include some sort of assumption about the characteristics of the stochastic component, typically that this component behaves as normally distributed white noise. Other less restrictive assumptions are possible, but they may require the use of more sophisticated inference methods than those described in this chapter.

After introducing simple linear models, our discussion of regression goes on to consider multivariate linear models and methods for model selection. We close the chapter with two short sections on model selection and some other related topics, including nonlinear regression models. It is worth repeating that statisticians distinguish between linear and nonlinear models on the basis of the model’s parameters, not on how the predictors enter the model.

An example of a simple nonlinear model, which may be better suited than (8.1) to the example above, is

$$T_{year}^{globe} = b_0 + b_1 \ln(c_{\mathrm{CO_2},\,year} + b_2) + \epsilon_{year}.$$

Note that this model is nonlinear in $b_2$.

8.1.2 The Statistical Setting. Most of the discussion in this chapter takes place in the context of normal random variables, not because other types of data are uncommon, but because it is relatively easy to introduce concepts in this framework. Nevertheless, note that departures from assumptions can affect the reliability of some statistical analyses quite drastically.

8.1.3 Example: ENSO Indices. This example was considered briefly in [1.2.2] and [2.8.8]. Wright [426] described a tropical Pacific sea-surface temperature index that captures information about ENSO that is very similar to the information captured by the classical Southern Oscillation Index (SOI) based on the difference between mean sea-level pressure at Darwin and Tahiti. Wright’s index is based on SSTs observed east of the date line and roughly between 5° N and 10° S. A scatter plot of the monthly mean values of these indices for 1933–84 inclusive is


[Figure: SO and Tropical Pacific SST Indices]

parameterizations used in GCMs can estimate $\tau$ and $A_c$, the fraction of the grid box that is cloud covered, but they are not able to estimate $\sigma_\tau$. However, it turns out that the mean log cloud optical depth $\ln \tau$ is closely related to $\sigma_\tau$. Thus