Figure 8.7: This diagram illustrates the least squares fit of a straight line to a sample of 100 observations generated from the model Y = 1 + 0.1x + E where E ∼ N(0, 0.025²). Even though R² = 51%, the model fits the data well.

Figure 8.8: In the upper panel the standardized residuals (departures from the fitted line divided by σ_E) are plotted as a function of the estimated conditional mean µ_{Y|X=x} for the fit displayed in Figure 8.6. The absolute values of the residuals are plotted in the lower panel.
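The point made in the Figure 8.7 caption, that a straight line can fit the data well even though R² is only about 51%, is easy to reproduce. The following Python sketch fits a least squares line to data generated from Y = 1 + 0.1x + E; the seed and the uniform sampling of x are assumptions, so the exact R² will differ from run to run:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the setup of Figure 8.7: 100 observations from
# Y = 1 + 0.1x + E with E ~ N(0, 0.025^2).
n = 100
x = rng.uniform(0.0, 1.0, n)  # sampling x uniformly is an assumption
y = 1.0 + 0.1 * x + rng.normal(0.0, 0.025, n)

# Least squares fit of a straight line (polyfit returns highest degree first).
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# R^2: the proportion of the sample variance of Y explained by the fit.
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"slope = {slope:.3f}, R^2 = {r2:.2f}")
```

Because the noise standard deviation (0.025) is comparable to the spread of the signal 0.1x over [0, 1], the population R² is only about 0.57; a modest R² here reflects the signal-to-noise ratio, not a poorly specified model.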

that R² is an optimistic indicator of model specification performance for unrealized values of X (see, e.g., Davis [101]). Climatologists and meteorologists call this phenomenon artificial skill. The artificial skill arises because the fitted model, as a consequence of the fitting process, has adapted itself to the data. Cross-validation (see Section 18.5) provides a more reliable means of predicting future model performance.

8.3.13 Diagnostics: Using Scatter Plots. Some fundamental tools in model diagnostics include scatter plots of the standardized residuals e_i/σ_E (see (8.18)) against the corresponding estimates of the conditional mean (8.21), and scatter plots of the absolute standardized residuals against the estimates of the conditional mean.

Figure 8.6 illustrates a violation of the assumption that the conditional mean varies linearly with x. This is revealed through systematic behaviour in the standardized residuals, as displayed in Figure 8.8. This type of behaviour is generally easier to detect in displays of the standardized residuals (upper panel of Figure 8.8) than in displays of the absolute standardized residuals (lower panel of Figure 8.8). Other kinds of departures from the fitted model are easier to detect in displays of the absolute standardized residuals.

Figure 8.9 illustrates an example in which the assumption that the errors E_i all have common variance is violated. This is known as heteroscedasticity. In this case the error variance appears to increase until x = 0.5 and then decrease again beyond x = 0.5. Heteroscedasticity is generally easier to detect in scatter plots of the absolute residuals. Heteroscedastic errors can sometimes be dealt with by transforming the data before fitting a regression model [8.6.2]. At other times it may be necessary to use weighted regression techniques in which the influence of a squared error in determining the fit is inversely proportional to its variance (see Section 8.6 and [104]).

Finally, Figure 8.10 results from a simulated linear regression with two inserted errant observations. Attempts to detect these observations are made by looking for outliers, that is, residuals that are greater in absolute value than the rest. As a general rule, residuals more than three standard deviations from the fitted line should be examined for errors in the corresponding observations of the response and factor variables. Outliers are generally easier to detect using the plot of the absolute residuals. However, they may not always be easy to detect, especially when more than one outlier is present in a sample. In this example, the data were generated using the model Y = 1 + 0.1x + E, where E is normally distributed noise with mean zero and standard deviation 0.05, and x varies between 0 and 1. The error at x = 0.5 was set to be 0.15 (3 standard deviations) and the error at x = 0.95 was set to be −0.15. The outlier at x = 0.5 is detected in our residual display, but

[Page 156, 8: Regression. Figure panels: scatter plots of samples from Y = 1 + 0.1*X + X*(1-X)*noise (left) and Y = 1.0 + 0.1*X + noise + outliers (right) against X, each with its linear fit, together with the corresponding plots of the absolute standardized residuals ("Linear fit to Y = 1 + 0.1*X + X*(1-X)*noise" and "Linear fit to Y = 1.0 + 0.1*X + noise + outliers").]
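The outlier diagnostic discussed in [8.3.13] can be sketched in a few lines. Assuming the Figure 8.10 setup (Y = 1 + 0.1x + E with σ_E = 0.05 and errant observations of ±0.15 inserted near x = 0.5 and x = 0.95; the seed and the evenly spaced x values are assumptions), the following computes standardized residuals and applies the three-standard-deviation rule:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate the Figure 8.10 setup: Y = 1 + 0.1x + E with
# E ~ N(0, 0.05^2) and two inserted errant observations.
n = 100
x = np.linspace(0.0, 1.0, n)
e = rng.normal(0.0, 0.05, n)
e[50] = 0.15    # errant observation near x = 0.5 (3 standard deviations)
e[95] = -0.15   # errant observation near x = 0.95
y = 1.0 + 0.1 * x + e

# Least squares fit and raw residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Standardize by the estimated error standard deviation
# (two degrees of freedom lost to the fitted intercept and slope).
sigma_e = np.std(resid, ddof=2)
std_resid = resid / sigma_e

# Flag residuals more than three standard deviations from the fitted
# line; the absolute residuals are what the lower diagnostic panels plot.
flagged = np.flatnonzero(np.abs(std_resid) > 3.0)
print("largest |standardized residual|:", np.abs(std_resid).max().round(2))
print("indices flagged at 3 sigma:", flagged)
```

Because the fitted line adapts to the outliers and they inflate the estimate of σ_E, their standardized residuals typically fall somewhat below 3, which echoes the remark above that outliers are not always easy to detect, particularly when more than one is present in the sample.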