to by the regression algorithm. A system of simultaneous equations is
only “determined” [31] if the number of equations [32] is greater than the
number of unknowns. That is, only if the sample size N is greater than the
number of regression coefficients (K minus 1, the subtraction accounting
for the coefficient for the intercept).

What information is “known” prior to running the regression?

• All values of the independent variables are known first. In theory, the independent variables are the “experiment.”

[31] That is, it can be solved to estimate the optimization parameters (the regression coefficients in the case of a regression).

[32] The sample size N in the case of a regression.


Statistical Analysis with Excel

• Once the “experiment” is conducted, the values of the dependent series Y are known. (Note that this “experiment” analogy holds even if the data for the independent and dependent variables are obtained from the same data collection survey.)

• The regression minimizes the sum of the squared residuals, that is, the sum of the squared differences between the observed and the predicted values of the dependent series. The number of residuals equals the number of observations. Thus, the number of equations equals the number of observations.
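To make the "known" side concrete, here is a minimal sketch in Python (not Excel, which is an assumption of this example) of a one-variable regression. The data values and the function names `fit_simple_ols` and `sum_squared_residuals` are hypothetical and chosen only for illustration. The sketch shows that each observation produces one residual, and that the fitted coefficients give a smaller sum of squared residuals than a nearby alternative.

```python
# Hypothetical data: x is the "experiment", y is observed afterwards.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x from its mean (the denominator).
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Sum of cross-products of deviations (the numerator).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Return (sum of squared residuals, list of residuals)."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(r ** 2 for r in residuals), residuals


intercept, slope = fit_simple_ols(x, y)
ssr, residuals = sum_squared_residuals(x, y, intercept, slope)

# One residual (one equation) per observation.
print(len(residuals), "residuals for", len(x), "observations")

# Any other line gives a larger sum of squared residuals.
ssr_shifted, _ = sum_squared_residuals(x, y, intercept + 0.1, slope)
print(ssr < ssr_shifted)
```

The comparison against a shifted intercept is only a spot check of the minimization, not a proof.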

What information is “unknown” prior to running the regression?

The regression coefficients (the betas) are unknown. Once the regression coefficients are known, one can estimate the predicted dependent variables, errors/residuals, R-square, etc. The number of unknowns equals the number of regression coefficients.
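A small sketch may help show why the count of equations matters. It assumes the simplest possible case: a single observation (one equation) and two unknowns (an intercept and a slope). The observation and the candidate coefficient pairs are hypothetical. Every candidate fits the single observation perfectly, so the data cannot single out one set of coefficients.

```python
# One observation gives one equation: y_obs = b0 + b1 * x_obs.
x_obs, y_obs = 2.0, 5.0

# Hypothetical candidate (intercept, slope) pairs: two unknowns each.
candidates = [(1.0, 2.0), (3.0, 1.0), (5.0, 0.0)]

# Residual of the single observation under each candidate.
residuals = [y_obs - (b0 + b1 * x_obs) for b0, b1 in candidates]

for (b0, b1), r in zip(candidates, residuals):
    print(f"intercept={b0}, slope={b1}, residual={r}")

# Every candidate fits perfectly, so no unique solution exists.
all_fit_perfectly = all(r == 0.0 for r in residuals)
print(all_fit_perfectly)
```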

12.1.D ASSUMPTION 4: NOT ALL THE VALUES OF ANY ONE INDEPENDENT SERIES CAN BE THE SAME

A model uses the effect of variation in X to explain variation in Y. If X

does not vary, then the series cannot have any role in explaining the

variation in Y.

Note that the formulas for estimating the regression coefficients (the
betas) use the “squared deviations from mean” in the denominator of
the formula. If the X values do not vary, then all the values equal the
mean, implying that the sum of the “squared deviations from mean” is zero. This will


Chapter 12: Regression

make the regression coefficient indeterminate because the denominator of

the formula equals zero.
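The zero denominator can be seen in a short Python sketch. It assumes a regression with a single independent variable, where the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X. The data and the helper names `sum_squared_deviations` and `slope_or_none` are hypothetical.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def slope_or_none(x, y):
    """Slope of a one-variable regression, or None if x does not vary."""
    sxx = sum_squared_deviations(x)
    if sxx == 0:
        # Denominator is zero: the slope is indeterminate.
        return None
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


y = [1.0, 3.0, 2.0, 5.0]
constant_x = [4.0, 4.0, 4.0, 4.0]   # X does not vary
varying_x = [1.0, 2.0, 3.0, 4.0]    # X varies

print(sum_squared_deviations(constant_x))  # the denominator for constant X
print(slope_or_none(constant_x, y))        # no slope can be computed
print(slope_or_none(varying_x, y))         # a slope exists once X varies
```

Returning `None` is a design choice of this sketch; dividing directly would raise a `ZeroDivisionError` in Python.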

12.1.E ASSUMPTION 5: THE RESIDUAL OR DISTURBANCE ERROR TERMS FOLLOW SEVERAL RULES

This is the most important assumption, and most diagnostic tests check whether it holds. In several textbooks, you will find this assumption broken into parts, but I prefer to list the parts as rules of Assumption 5:

Assumption 5a: The mean/average or expected value of the disturbance

equals zero

If not, then you know that the model has a systematic bias, which makes it inaccurate, especially because one does not typically know what is causing the bias.

Assumption 5b: The disturbance terms all have the same variance

This assumption is also called homoskedasticity. Given that the expected value of any disturbance equals zero, if one disturbance has a higher variance than another, the observation underlying the high variance should be given less importance because its relative accuracy is suspect. (This is the reason that weighted regression is used to correct for nonconformity with this rule.)
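The idea of giving less importance to high-variance observations can be sketched as a weighted regression with one independent variable. This sketch assumes that the variance of each disturbance is known and uses its inverse as the weight; in practice the variances are rarely known and must be estimated. The data, the variances, and the function name `fit_weighted` are hypothetical.

```python
def fit_weighted(x, y, weights):
    """Weighted least squares for one regressor; returns (intercept, slope).

    Minimizes sum(w_i * (y_i - b0 - b1 * x_i) ** 2).
    Equal weights reproduce ordinary least squares.
    """
    total_w = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total_w
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total_w
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: y = 2 * x exactly, except the last, noisy observation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 16.0]

# Assumed (hypothetical) disturbance variances; the last one is much larger.
variances = [1.0, 1.0, 1.0, 1.0, 100.0]
weights = [1.0 / v for v in variances]

ols_intercept, ols_slope = fit_weighted(x, y, [1.0] * len(x))
wls_intercept, wls_slope = fit_weighted(x, y, weights)

print("unweighted slope:", ols_slope)
print("weighted slope:  ", wls_slope)
```

With the small weight on the last observation, the weighted slope stays close to the slope of 2 that the first four points follow, while the unweighted slope is pulled upward.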


Assumption 5c: A disturbance term for one observation should have no

relation with the disturbance terms for other observations or with any of

the independent variables

The disturbance term must be truly random: one should not be able to predict or guess the value of any disturbance term given any of the information on the model data. The disturbance term is also called the error term. This error is assumed random. If this is not the case, then your model may have failed to capture all the underlying independent variables, may have incorrectly measured independent variables, or may have correlation between successive observations in a series sorted by one of the independent variables.

Typically, time series data suffer from the problem of disturbance terms being related to the values of previous periods. It is for this reason that time series analysis requires special data manipulation procedures prior to creating any prediction model.
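One simple way to look for a relation between a disturbance and the disturbance of the previous period is the lag-1 correlation of the residuals. The sketch below is an illustration only, not a procedure prescribed in this chapter; the residual series and the function name `lag1_autocorrelation` are hypothetical. A value near zero is consistent with unrelated disturbances, while a value far from zero suggests that successive residuals are related.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    # Cross-products of each residual with its predecessor.
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c ** 2 for c in centered)
    return numerator / denominator


# Hypothetical residuals that drift steadily (related to the previous period).
drifting = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Hypothetical residuals that flip sign every period.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(lag1_autocorrelation(drifting))     # positive: neighbors move together
print(lag1_autocorrelation(alternating))  # negative: neighbors move oppositely
```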