8: Regression

The effect of multicolinearity is to make the matrix XᵀX nearly uninvertible, resulting in highly variable parameter estimators (see Property 3 of [8.4.2]) and making it difficult to diagnose the factors that are most important in specifying Y. Parameter estimates are sensitive to small variations in the data when there is multicolinearity.

8.4.12 Ridge Regression. One way to cope with multicolinearity is to remove redundant factors from the model. However, this is not always possible or desirable for either aesthetic or visual reasons. In this case ridge regression (see, e.g., [104] or [420]) is an alternative. The idea is to give up the unbiased property of least squares estimation in exchange for reduced estimator uncertainty.

In ridge regression, constraints are implicitly placed on the model parameters and the least squares problem is then solved subject to those constraints. These constraints result in a modified, less variable, least squares estimator. First, note that the ordinary least squares estimator (8.27) can be written

    â = P Λ⁻¹ Pᵀ Xᵀ y,

where the columns of P are the normalized eigenvectors of XᵀX and Λ is the corresponding diagonal matrix of eigenvalues. That is, XᵀX = P Λ Pᵀ. One form of a generalized ridge regression estimator, which conveys the general idea and the source of the term 'ridge,' is given by

    â_ridge = P (Λ + D)⁻¹ Pᵀ Xᵀ y

where D is a diagonal matrix of positive constants. The effect of inflating the eigenvalues in this way is to downplay the importance of the off-diagonal elements of XᵀX when this matrix is inverted. The constants are, of course, not known. Ridge regression algorithms use a variety of procedures to choose appropriate constants for a given design matrix X.

8.5 Model Selection

None of the inference methods described in Sections 8.3 and 8.4 performs reliably if factors are missing from the model. On the other hand, if the model contains unnecessary factors it will be unnecessarily complex and will specify more poorly than it could otherwise. We therefore briefly discuss methods helpful for developing parsimonious models. The main goal here is not so much to specify accurately or estimate a complete model, as it is to perform screening to discover which factors contribute significantly to variation in the response.

The primary screening principle we use is that a variable should not be included in a model if it does not significantly increase the regression sum of squares SSR. A careful and systematic approach is needed because a test of an individual parameter, which asks whether a specific factor makes a significant contribution after accounting for all other factors, may hide the importance of that factor within a group of factors.

When the number of factors in a problem is small it is usually possible to choose a suitable model, as in the example above, using the tools of [8.4.9]. However, in problems with a large number of factors that are each potentially important for representing the conditional mean of the response variable, an automated procedure is needed.

8.5.1 Stepwise Regression: Introduction. Stepwise regression is the iterative application of forward selection and backward elimination steps. We first describe these procedures and then return to the subject of stepwise regression. However, we need to introduce some additional notation before delving into detail.

We use SSR_{l1,...,lp} to represent the sum of squares due to regression when the p factors X_{l1}, ..., X_{lp} are included in the multiple regression model. Similar notation is used for the sum of squared errors. We use SSR_{l(p+1)|l1,...,lp} to denote the increase in the regression sum of squares that comes about by adding factor X_{l(p+1)} to the model. That is

    SSR_{l(p+1)|l1,...,lp} = SSR_{l1,...,l(p+1)} − SSR_{l1,...,lp}.

8.5.2 Forward Selection. Before any fitting is done, a decision should be made about whether or not to include an intercept in the model. If an intercept is to be included, it should be included at all steps of the forward selection procedure. The steps are as follows.

1 Simple linear regression is performed with each factor. The factor X_{l1} for which SSR_{l1} is greatest is selected as the initial factor.

2 Search for factor X_{l2}, l2 ∉ {l1}, for which the incremental regression sum of squares SSR_{l2|{l1}} is greatest. The notation {l1} denotes the list of previously selected factors and l2 ∉ {l1} denotes any factor not in {l1}. This list contains only the initial factor after step 1 has been completed.

3 Test the hypothesis that inclusion of X_{l2} significantly increases the regression sum of squares by computing

    F = SSR_{l2|{l1}} / [SSE_{{l1},l2} / (n′ − (1 + |{l1}|))]

where n′ = n or n − 1 depending upon whether or not the intercept is included, and {l1} denotes the list of previously selected factors. F is compared with the critical values of F(1, n′ − (1 + |{l1}|)).

4 Stop at the previous iteration if X_{l2} does not significantly increase the regression sum of squares. Otherwise, include X_{l2} in the model and repeat steps 2 and 3.

8.5.3 Backward Elimination. The backward elimination procedure operates similarly to the forward selection procedure.

1 Fit the full model.

2 Search for the factor that reduces the regression sum of squares by the smallest amount when it is removed from the model.

3 Conduct an F test to determine whether this factor explains a significant amount of variance in the presence of all other factors remaining in the model at this point. Remove the variable from the model if it does not contribute significant variance.

4 Repeat steps 2 and 3 until no variable can be removed from the model.

8.5.4 Stepwise Regression. The stepwise regression procedure combines forward selection with backward elimination. As forward selection progresses, factors selected early on may become redundant when related factors are selected during later steps. Therefore, in stepwise regression, backward elimination is performed after every forward selection step to remove redundant variables from the model. Forward regression and

The idea here is to choose the model that minimizes the AIC criterion given by

    AIC = −2l(a_{l1}, ..., a_{lp}) + 2p
        = n log(2π σ̂²_E) + SSE_{l1,...,lp} / σ̂²_E + 2p,

where l(a_{l1}, ..., a_{lp}) is the log-likelihood function (see [8.3.4]). That is, minimizing AIC is equivalent to maximizing likelihood, but penalized for the number of parameters in the model. As with Cp, we use the best available estimate of the error variance when computing AIC, the estimator of σ²_E obtained from the least squares fit of the full model. Note the similarity between Cp and AIC.

8.5.6 Numerical Forecast Improvement. One meteorological application for screening regression techniques is in the development of statistical procedures for improving numerical weather forecasts. Improvement is required because global, and even regional, numerical forecast models do not accurately represent sub-grid scale processes. Statistical procedures attempt to exploit systematic relationships between the large-scale flow of the free atmosphere, which is both well observed and well represented by numerical forecast models, and local phenomena.

MOS procedures (see Glahn and Lowry [140] or Klein and Glahn [226]) rely upon 'specification equations' that describe statistical relationships between numerical forecasts of atmospheric conditions in the troposphere (i.e., model output)
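The forward selection procedure of [8.5.2], with its incremental-SSR F test, is the workhorse behind screening applications such as the MOS development described here. The following NumPy code is a minimal sketch, not the text's algorithm verbatim: it centers the data so the intercept is implicit (hence n′ = n − 1), it takes a user-supplied critical value f_crit instead of reading F tables, and, unlike step 1 above, it also applies the significance test to the initial factor. The names sse and forward_selection are illustrative.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors of the least squares fit of y on the columns of X."""
    if X.shape[1] == 0:
        return float(np.sum(y ** 2))  # no factors selected yet: the fit is identically zero
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

def forward_selection(X, y, f_crit):
    """Greedy forward selection driven by the incremental regression sum of squares.

    Centers X and y internally, so the intercept is implicit and n' = n - 1.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, m = X.shape
    selected, remaining = [], list(range(m))
    while remaining:
        base_sse = sse(X[:, selected], y)
        # The incremental SSR of adding factor j equals the resulting drop in SSE,
        # since the total sum of squares is fixed.
        gains = {j: base_sse - sse(X[:, selected + [j]], y) for j in remaining}
        best = max(gains, key=gains.get)
        # F test of the candidate: F = SSR gain / (SSE / (n' - (1 + |selected|)))
        trial_sse = base_sse - gains[best]
        df = (n - 1) - (1 + len(selected))
        F = gains[best] / (trial_sse / df)
        if F < f_crit:
            break  # the candidate does not significantly increase SSR; stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice the threshold would come from the critical values of F(1, n′ − (1 + |{l1}|)), e.g. via scipy.stats.f.ppf, rather than a fixed f_crit.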