ρ^S_XY = −1 when the two rank orders are the reverse of each other. Small sample critical values for testing H0: ρ^S_XY = 0 with ρ^S_XY are given in Appendix K. Approximate large sample (i.e., n > 30) critical values for testing H0 against Ha: ρ^S_XY ≠ 0 at the (1 − p̃) × 100% significance level are given by ±Z_(1+p̃)/2 / √(n − 1), where Z_(1+p̃)/2 is the ((1 + p̃)/2)-quantile of the standard normal distribution.

3 If there are ties, the tied observations are assigned the corresponding average rank.
4 Also known as Pearson's r.

the variance of Y that is attributable to knowledge of the conditional mean.

Yet another way to view the relationship between Y and X is to write Y in the form

Y = a_0 + a_1 X + E,    (8.9)

where E is independent of X. In geometrical terms, a realization of the pair (X, a_0 + a_1 X) randomly selects a point on one of the axes of the ellipse depicted in Figure 2.10, and Y is subsequently determined by deviating vertically from the chosen


point. By computing means and variances we obtain

σ_E^2 = σ_Y^2 (1 − ρ_XY^2)
a_1 = ρ_XY σ_Y/σ_X
a_0 = µ_Y − ρ_XY (σ_Y/σ_X) µ_X.

The purpose of regression analysis, discussed in the next section, is to diagnose relationships such as (8.9) between a response (or dependent) variable and one or more factor (or independent) variables. As the derivation above showed, the language used in many statistics textbooks can be misleading. If the factors that affect the mean of the response variable are determined externally to the studied system, either by an experimenter (as in a doubled CO2 experiment conducted with a GCM) or by nature (e.g., by altering the climate's external forcing through the effects of volcanoes), then words such as dependent and independent or response and factor can be used to describe relationships between variables. However, often in climatology both X and Y are responses of the climate system to some other unobserved factor. Then regression analysis can be used to document the relationship between the means of X and Y, but it would be inappropriate to use language that implies causality.

8.3 Fitting and Diagnosing Simple Regression Models

Our purpose here is to describe the anatomy of a simple linear regression in which it is postulated that the conditional mean of a response variable Y depends linearly upon a random factor X (the arguments in the next few subsections work equally well if this factor is deterministic). Suppose that we have n pairs of observations {(x_i, y_i): i = 1, . . . , n}, each representing the realizations of a corresponding random variable pair (X_i, Y_i), all pairs being independent and identically bivariate normally distributed.

8.3.1 Least Squares Estimate of a Simple Linear Regression. Assume that the conditional means satisfy

µ_{Y_i|X=x_i} = a_0 + a_1 x_i,

so that, conditional upon X_i = x_i, the ith response can be represented as a random variable Y_i such that

Y_i = a_0 + a_1 x_i + E_i.    (8.10)

Following on from the discussion in Section 8.2, the random variables E_i must be independent normal random variables with mean zero and variance

σ_E^2 = σ_Y^2 (1 − ρ_XY^2).    (8.11)

The corresponding representation for the realized value of Y_i is

y_i = a_0 + a_1 x_i + e_i,

where e_i represents the realized value of E_i. If we have estimates â_0 and â_1 of the unknown coefficients a_0 and a_1, estimates of the realized errors (which are generally called residuals) are given by

ê_i = y_i − â_0 − â_1 x_i.    (8.12)

A reasonable strategy for estimating a_0 and a_1 is to minimize some measure of the size of the estimated errors ê_i. While many metrics can be used, the sum of squared errors Σ_{i=1}^n ê_i^2 is the most common. The resulting estimators of a_0 and a_1 are called least squares estimators. We will see later that least squares estimators have some potential pitfalls, so they are not always the best choice. However, they are prominent in the normal setup because of the tractability of their distributional derivation, their ease of interpretation, and their optimality within this particular restricted parametric framework.

The least squares estimators of a_0 and a_1 are obtained as follows. The sum of squared errors is

SSE = Σ_{i=1}^n (y_i − â_0 − â_1 x_i)^2.    (8.13)

Taking partial derivatives with respect to the unknown parameters â_0 and â_1 and setting these to zero yields the normal equations

Σ_{i=1}^n (y_i − â_0 − â_1 x_i) = 0    (8.14)
Σ_{i=1}^n (y_i − â_0 − â_1 x_i) x_i = 0.    (8.15)

The normal equations have solutions

â_0 = ȳ − â_1 x̄    (8.16)
â_1 = (Σ_{i=1}^n x_i y_i − n x̄ ȳ) / (Σ_{i=1}^n x_i^2 − n x̄^2).    (8.17)

As will be shown in [8.3.20], an unbiased estimate of σ_E^2 (8.11) is given by

σ̂_E^2 = SSE / (n − 2).    (8.18)
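The estimators (8.16) and (8.17), the residuals (8.12), and the variance estimate (8.18) are easy to compute directly. The sketch below is a minimal plain-Python illustration, not code from the text; the data are invented numbers (not the SOI series) and the function name is ours:

```python
def fit_simple_regression(x, y):
    """Least squares fit of y = a0 + a1*x, following (8.16)-(8.18)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Slope (8.17): (sum x_i y_i - n xbar ybar) / (sum x_i^2 - n xbar^2)
    a1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (
        sum(xi ** 2 for xi in x) - n * xbar ** 2)
    # Intercept (8.16): ybar - a1 xbar
    a0 = ybar - a1 * xbar
    # Residuals (8.12)
    resid = [yi - a0 - a1 * xi for xi, yi in zip(x, y)]
    # Unbiased estimate of the error variance (8.18): SSE / (n - 2)
    var_e = sum(e ** 2 for e in resid) / (n - 2)
    return a0, a1, resid, var_e

# Invented illustrative data, not the SOI example from the text
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
a0, a1, resid, var_e = fit_simple_regression(x, y)

# The normal equations (8.14)-(8.15) say the residuals sum to zero
# and are orthogonal to x; a useful sanity check on any fit.
assert abs(sum(resid)) < 1e-9
assert abs(sum(e * xi for e, xi in zip(resid, x))) < 1e-9
```

For routine work, library calls such as numpy.polyfit(x, y, 1) package the same least squares arithmetic for the slope and intercept.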


Returning to our SOI example [8.1.3], the parameter estimates obtained using (8.16)–(8.18) are â_0 = −0.09, â_1 = 0.15, and σ̂_E = 12.2. The fitted line is shown as the upwards sloping line that passes through the cloud of points in Figure 8.1, and σ̂_E is an estimate of the standard deviation of the vertical scatter about the fitted line. Note that the eye is not always a good judge of where the least squares line should be placed; our initial impression of Figure 8.1 is that the slope of the fitted line is not steep enough.

8.3.2 Partitioning Variance. The slope estimate (8.17) is often written

â_1 = S_XY / S_XX,

where

S_XY = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n x_i y_i − n x̄ ȳ

and S_XX = Σ_{i=1}^n (x_i − x̄)^2 = Σ_{i=1}^n x_i^2 − n x̄^2, the denominator of (8.17).

The regression sum of squares is SSR = Σ_{i=1}^n (â_0 + â_1 x_i − ȳ)^2. The least squares fitting process thus provides a partition of the total variability SST = Σ_{i=1}^n (y_i − ȳ)^2 into a component that is attributed to the fitted line (SSR) and a component that is due to departures from that line (SSE). That is,

SST = SSR + SSE.    (8.19)

In the SOI example, this partitioning of the total sum of squares is

Source              Sum of squares
Regression (SSR)          74 463.2
Error (SSE)               92 738.1
Total (SST)              167 201.3

8.3.3 Coefficient of Multiple Determination. An immediately available diagnostic of the ability of the fitted line to explain variation in the data is the coefficient of multiple determination, denoted R^2, given by

R^2 = SSR/SST.    (8.20)

The use of the phrase coefficient of determination
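The partition (8.19) and the diagnostic (8.20) can be verified numerically for any least squares fit. A minimal sketch in plain Python (invented data again, not the SOI example; the slope and intercept formulas repeat (8.16) and (8.17)):

```python
def variance_partition(x, y):
    """Return (SST, SSR, SSE) for the least squares line through (x, y)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    a1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (
        sum(xi ** 2 for xi in x) - n * xbar ** 2)           # slope (8.17)
    a0 = ybar - a1 * xbar                                   # intercept (8.16)
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total variability
    ssr = sum((a0 + a1 * xi - ybar) ** 2 for xi in x)       # fitted-line component
    sse = sum((yi - a0 - a1 * xi) ** 2                      # departures from line
              for xi, yi in zip(x, y))
    return sst, ssr, sse

# Invented illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.8, 3.1, 2.9, 4.4, 4.8]
sst, ssr, sse = variance_partition(x, y)

# The partition (8.19) holds for the least squares fit
assert abs(sst - (ssr + sse)) < 1e-9

r_squared = ssr / sst   # coefficient of multiple determination (8.20)
```

The identity (8.19) holds (up to rounding) only because â_0 and â_1 are the least squares values; for any other choice of line, the cross-product term in the expansion of SST does not vanish.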