equation as

$$\hat{\mu}_{Y|X=x} = \hat{a}_0 + \hat{a}_1 x.$$

By substituting for $\hat{a}_0$ with (8.16) we obtain

$$\hat{\mu}_{Y|X=x} = \bar{y} + \hat{a}_1 (x - \bar{x}). \tag{8.21}$$

Computing variances, we see that

$$\sigma^2_{\hat{\mu}_{Y|X=x}} = \sigma_E^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{XX}} \right). \tag{8.22}$$

This can be derived by first substituting (8.17) for $\hat{a}_1$ in (8.21), then substituting the model (8.10) wherever $Y_i$ appears, and finally computing

interval for the conditional mean at $x$ has bounds

$$\hat{\mu}_{Y|X=x} \pm t_{(1+\tilde{p})/2}\, \hat{\sigma}_E \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{XX}}}, \tag{8.23}$$

where $t_{(1+\tilde{p})/2}$ is the $((1 + \tilde{p})/2)$-quantile of $t(n - 2)$ (Appendix F).

An example of a fitted regression line and the confidence bound curves defined by (8.23) is illustrated in Figure 8.5. The pair of curves closest to the regression line illustrates a separate 95% confidence interval at each $x$. The curves bound the vertical interval at each $x$ that covers the regression line 95% of the time on average. As mentioned


SO index) in our SOI example. Again, the exact interpretation here hinges upon the independence of observations. However, dependence has a relatively minor effect on this particular inference because the regression line itself is well estimated; only the sampling variability of the regression line is affected by dependence. Note also that in this case the curves do not bound the region that will simultaneously cover 95% of all possible values of the response variable.
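The distinction between bounds for the regression line and bounds for an individual value of the response can be sketched numerically. The following Python fragment is an illustration only and is not taken from the text: the data are synthetic, the names `fit_line` and `half_widths` are hypothetical, $\hat{\sigma}_E^2$ is assumed to be estimated by SSE/(n - 2), and the t quantile is a tabulated value for 18 degrees of freedom. It compares the half-width of the interval (8.23) for the conditional mean with the half-width obtained when the variance of a single error $E$ is added under the square root.

```python
import math
import random


def fit_line(x, y):
    """Least squares fit of y = a0 + a1*x; also returns xbar, SXX and the error s.d. estimate."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = ybar - a1 * xbar
    sse = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_e = math.sqrt(sse / (n - 2))  # assumed estimator: SSE / (n - 2)
    return a0, a1, xbar, sxx, sigma_e


def half_widths(x0, n, xbar, sxx, sigma_e, t_quant):
    """Half-widths at x0 of the intervals for the conditional mean and for the response."""
    core = 1.0 / n + (x0 - xbar) ** 2 / sxx
    mean_hw = t_quant * sigma_e * math.sqrt(core)        # conditional mean, as in (8.23)
    resp_hw = t_quant * sigma_e * math.sqrt(1.0 + core)  # extra 1 accounts for the error E
    return mean_hw, resp_hw


# Synthetic illustration data (an assumption, not from the text).
random.seed(0)
n = 20
x = [i / (n - 1) for i in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 0.1) for xi in x]

a0, a1, xbar, sxx, sigma_e = fit_line(x, y)
T_975_DF18 = 2.101  # tabulated 0.975-quantile of t(18); n - 2 = 18 here

for x0 in (0.0, 0.5, 1.0):
    mean_hw, resp_hw = half_widths(x0, n, xbar, sxx, sigma_e, T_975_DF18)
    print(f"x={x0:.1f}  fit={a0 + a1 * x0:.3f}  "
          f"mean +/-{mean_hw:.3f}  response +/-{resp_hw:.3f}")
```

In this sketch the interval for the response is wider than the interval for the conditional mean at every x, and both are narrowest at the sample mean of x. Both are pointwise statements: neither pair of curves is a simultaneous bound over all x.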

[Scatter plot of Y against X not reproduced.]

Figure 8.6: This diagram illustrates the least squares fit of a straight line to a sample of 100 observations generated from the model $Y = 1 + 0.1x^2 + E$ where $E \sim N(0, 0.005^2)$. Even though $R^2 = 0.92$, the model fits the data poorly.

previously, accounting for dependence would increase the distance between the confidence bound curves.

8.3.11 A Confidence Interval for the Response Variable. While confidence interval (8.23) accounts for uncertainty in our estimate of the conditional mean, it does not indicate the range of values of the response variable that is likely for a given value $x$ of $X$. To solve this problem we need to interpret the fitted regression equation, when evaluated at $x$, as an estimate of $Y$ rather than as an estimate of the conditional mean $\mu_{Y|X=x}$. The estimation (or specification) error in this context is

$$\mu_{Y|X=x} + E - \hat{\mu}_{Y|X=x}.$$

Since $E$ is independent of $\hat{\mu}_{Y|X=x}$, we see using (8.22) that the variance of the estimation error is

$$\sigma_E^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{XX}} \right).$$

8.3.12 Diagnostics: $R^2$. The inferential methods described above are based on the assumptions that the conditional mean of $Y$ given $X = x$ is a linear function of $x$ and that the errors $E_i$ in model (8.10) are iid normal.

We have already seen one diagnostic (8.20)

$$R^2 = \mathrm{SSR}/\mathrm{SST}$$

associated with a fitted model. However, $R^2$, the proportion of variance in the response variable that is explained by the fitted model, should not be confused with the model's goodness-of-fit. The correct interpretation of $R^2$ is that it is an estimate of the model's ability to specify unrealized values of the response variable $Y$.

A large $R^2$ does not indicate that the model fits well in a statistical sense (i.e., that inferences made with the methods above are reliable). Figure 8.6 illustrates the least squares fit of a linear regression model to data that closely approximate a quadratic. The $R^2$ for this fit is large ($R^2 = 0.92$) but it would not be correct to say that the fit is a good one because the deviations from the fitted line display systematic behaviour. In this case the assumption that the errors are iid normal is not satisfied and thus inferences are not likely to be reliable.

Neither does a small $R^2$ indicate that the model fits poorly. Figure 8.7 illustrates a least squares fit of a linear regression model to simulated data from a linear model. The $R^2$ for this fit is only moderately large ($R^2 = 0.51$) but the deviations from the fitted line do not show any kind of