Figure 18.8: The correlation skill scores ρ_FP of two sets of forecasts of an index of the Madden-and-Julian Oscillation. Both series are constructed from 15 trials using the same initial conditions. One series (solid) was prepared with the POP method [15.3.3], the other with a dynamical forecast model (dashed). From [388].

Figure 18.9: The average lead time, in days, at which the winter (DJF) mean anomaly correlation coefficient ρ^A_FP of the forecasts of the Northern Hemisphere 500 mb height field falls below 60%. The boxes indicate the average lead time at which forecasts prepared by the US National Meteorological Center during the winters (DJF) of 1981/82 to 1989/90 fall below 60%. The triangles indicate the corresponding average lead time for persistence forecasts. From Kalnay et al. [209].

18.4.5 Example: The Skill of Weather Prediction. In [18.2.8] we examined the performance of the operational weather forecasts of the US National Meteorological Center, and displayed a plot of winter mean anomaly correlation coefficients from Kalnay et al. [209]. They compared the operational forecast against the persistence forecast. The lead time beyond which the skill of the forecasts falls below the 60% threshold (see [18.3.5]) is shown in Figure 18.9 for both the operational and the persistence forecasts. Clearly, the operational forecast outperforms persistence.
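The criterion behind Figure 18.9 is mechanical: scan the anomaly correlation coefficient as a function of lead time and report the first lead at which it drops below 0.60. A minimal sketch, assuming hypothetical ACC curves (the numbers below are illustrative, not values from [209]):

```python
def lead_time_below(acc_by_lead, threshold=0.60):
    """Return the first lead time (in days) at which the anomaly
    correlation coefficient falls below the threshold."""
    for lead, acc in enumerate(acc_by_lead, start=1):
        if acc < threshold:
            return lead
    return None  # skill never drops below the threshold

# Hypothetical winter-mean ACC curves for leads of 1..8 days
# (illustrative values only).
operational = [0.98, 0.95, 0.90, 0.83, 0.74, 0.64, 0.55, 0.47]
persistence = [0.90, 0.75, 0.58, 0.45, 0.35, 0.28, 0.22, 0.18]

print(lead_time_below(operational))  # 7
print(lead_time_below(persistence))  # 3
```

With these illustrative curves the operational forecast retains useful skill for about twice as many days as persistence, which is the kind of comparison the boxes and triangles in Figure 18.9 summarize.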

Also, the diagram shows that the improvement in the operational forecasts after 1985 is not due to increased persistence of the Northern Hemisphere circulation.

18.4.6 Effect of Trends. Many meteorological time series exhibit a trend on decadal and longer time scales. That is, the series contains either a deterministic or a low-frequency component. These trends reflect a variety of processes, of both natural and anthropogenic origin.10

10 For a short discussion of processes potentially responsible for trends, see [1.2.3].

The definition of a (trivial) reference forecast may become difficult in the presence of a trend. The skill of the persistence forecast is generally not affected much by a trend because the amplitudes of trends are generally small relative to the natural variability of the forecasted process. On the other hand, the climatological forecast might become useless because the climatology may no longer be the mean value of the present observations. The random forecast is undefined simply because the statistical parameters E(P) and Var(P) have become moving targets.

Livezey [255] presents an interesting and convincing example of a forecasting scheme whose reputed merits were entirely due to the systematic exploitation of the urbanization effect [1.2.3]. Several of the scores introduced in this chapter sometimes exhibit pathological behaviour if sufficient care is not exercised in designing the forecast evaluation.
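The degradation of the climatological reference forecast under a trend can be illustrated with synthetic data: estimate the climatology from the early half of a trended record and use it as the reference forecast for the late half. A minimal stdlib sketch; the slope, noise level, and record length are assumptions chosen for illustration:

```python
import random

random.seed(1)

n = 100
trend = [0.05 * t for t in range(n)]               # deterministic trend
noise = [random.gauss(0.0, 1.0) for _ in range(n)]  # natural variability
series = [a + b for a, b in zip(trend, noise)]

# 'Climatology' estimated from the first half of the record ...
climatology = sum(series[:n // 2]) / (n // 2)

# ... used as the reference forecast for the second half.
errors = [x - climatology for x in series[n // 2:]]
bias = sum(errors) / len(errors)

# The trend makes the old climatology a biased reference:
# the mean of the present observations has moved away from it.
print(round(bias, 2))
```

Here the bias of the climatological reference is comparable to, or larger than, the standard deviation of the natural variability, so skill scores computed against it would flatter any forecast that merely tracks the trend.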

18.4.7 Artificial Skill. Skill scores should be constructed so that they give an unbiased view of the true utility of the forecasting scheme. This requirement is violated when statistical forecast schemes are built if the same data are used to develop the scheme and evaluate its skill. Quite often, the statistical forecast model is fitted to the data by maximizing a skill score or a quantity, such as mean squared error, that is related to a skill score. If the sample size is small, or if the number of parameters fitted to the data is large relative to the sample size, the skill score is artificially enhanced because in such circumstances the fitted model is able to adapt itself to the available data.

The sample used to fit the model is often called the training sample. The estimate of skill obtained from the training sample is called the hindcast skill. The hindcast skill is always somewhat greater than the forecast skill, and this optimistic bias in estimated skill is called artificial skill.
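A classic way to see the hindcast-versus-forecast gap is predictor screening: among many candidate predictors that are in truth pure noise, select the one with the best correlation in a small training sample. Its hindcast skill is substantial by construction, while its skill on independent data is near zero. A minimal sketch under assumed sample sizes and candidate counts:

```python
import random

random.seed(2)

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

n_train, n_test = 20, 200
predictand_train = [random.gauss(0, 1) for _ in range(n_train)]
predictand_test = [random.gauss(0, 1) for _ in range(n_test)]

# 50 candidate 'predictors' that are pure noise, i.e. truly skill-free.
candidates = [
    ([random.gauss(0, 1) for _ in range(n_train)],
     [random.gauss(0, 1) for _ in range(n_test)])
    for _ in range(50)
]

# 'Fitting' = picking the predictor with the best hindcast correlation.
best = max(candidates, key=lambda c: corr(c[0], predictand_train))

hindcast_skill = corr(best[0], predictand_train)  # artificially inflated
forecast_skill = corr(best[1], predictand_test)   # near zero

print(round(hindcast_skill, 2), round(forecast_skill, 2))
```

The inflation here comes purely from letting the selection step adapt to the training sample; no genuine predictive relationship exists.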

Techniques such as cross-validation (see Section 18.5) and bootstrapping (see Section 5.5) can sometimes be used to provide good estimates of the forecast skill.

A notoriously efficient manner in which to introduce artificial skill into (time series) forecast models is to time-filter the analysed data in order to suppress high-frequency variations, and to fit and verify the forecast model against these smoothed observations. The time filtering makes future information about the predictand available at the time of the forecast; in a real-time setup this future information would not be available.

18.4.8 Skill Scores Derived from Non-Randomly Chosen Subsets. In real applications the skill score is derived from a finite ensemble of forecasts. These ensembles can usually be thought of as random samples representative of the process being forecast. For example, an ensemble might consist of all cases during a certain time period. The ensemble is sometimes also sub-sampled using criteria available at forecast time in order to make inferences about forecast skill under prescribed conditions (an example can be found in [18.1.6]). Both of these approaches to estimating skill are perfectly legitimate.

On the other hand, it is somewhat misleading to sub-sample an ensemble using criteria that are available only at verification (rather than forecast) time. An example of such a criterion is the strength of the predictand. This type of sub-sampling criterion automatically enhances the correlation skill score ρ_FP and the proportion of explained variance R²_FP (cf. [18.1.4]).

To demonstrate this, consider a forecast of the form F = P + N, where N is random error independent of P. Then

    ρ_FP = √(V/(V + N))
    R²_FP = 1 − N/V
    S²_FP = N,

where V = Var(P) and N = Var(N). If we calculate the skill scores for the subset of forecasts for which |P| ≥ p̃, we find, with Ṽ = Var(P | |P| ≥ p̃) > V, that

    ρ̃_FP = √(Ṽ/(Ṽ + N)) > ρ_FP
    R̃²_FP = 1 − N/Ṽ > R²_FP
    S̃²_FP = N = S²_FP.

18.5 Cross-validation

18.5.1 General. It is generally desirable to be able to estimate the skill of a forecast or specification model before it is actually applied.11 However, skill estimates that are obtained from the data used to identify and fit the model tend to be overly optimistic (see Davis [101]) because the fitting process, by definition, chooses parameters that 'adapt' the model to the data as closely as possible. This phenomenon, called artificial skill, is of particular concern in small samples.12

11 A forecast model is used to extrapolate into the future; specification models are used to estimate present or past unobserved values.

12 Small is a relative term in this context. The reliability of an internal estimate of skill increases with sample size but decreases with the number of free parameters in the fitted model.

One simple way to avoid the artificial skill effect is to divide the data into 'learning' and 'validation' data sets; the model is fitted to the learning data and tested on the independent information contained in the validation subset. However, the data sets in which artificial predictability is particularly troublesome are not large enough to use this strategy effectively. These samples are too small to withhold a substantial fraction of the sample for validation, but if only a few observations are withheld, validation cannot be performed effectively.

18.5.2 Cross-validation. Cross-validation avoids the difficulty described above, in essence, by making all of the data available for validation. The procedure is simple to apply provided that the model fitting can be automated. The first step is to withhold a small part of the sample. For example, one might withhold 1 or 2 years of data when building a model for seasonal climate forecasting from a 45-year data base. The model is fitted to the data that is retained and is used to make forecasts or specifications of the data that are withheld. These steps are performed repeatedly, either until no new verification data sets can be selected or until there are enough forecast/verification or specification/verification pairs to estimate skill accurately.
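The procedure above can be sketched in a few lines: withhold one case, refit the model on the rest, forecast the withheld case, and accumulate forecast/verification pairs. The leave-one-out setting, the simple least-squares line, and the synthetic 45-year data set below are illustrative assumptions, not part of the text:

```python
import random

random.seed(3)

def cross_validated_pairs(predictor, predictand, fit, predict):
    """Leave-one-out cross-validation: refit the model with each
    case withheld and forecast the withheld case."""
    pairs = []
    for i in range(len(predictand)):
        x_train = predictor[:i] + predictor[i + 1:]
        y_train = predictand[:i] + predictand[i + 1:]
        model = fit(x_train, y_train)
        pairs.append((predict(model, predictor[i]), predictand[i]))
    return pairs

# A toy 'seasonal forecasting' problem: predictand = 0.8 * predictor + noise,
# observed over a hypothetical 45-year record.
x = [random.gauss(0, 1) for _ in range(45)]
y = [0.8 * a + random.gauss(0, 0.6) for a in x]

def fit(xs, ys):
    """Least-squares line (slope, intercept) as the forecast scheme."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return b, my - b * mx

def predict(model, a):
    b, c = model
    return b * a + c

pairs = cross_validated_pairs(x, y, fit, predict)
mse = sum((f - p) ** 2 for f, p in pairs) / len(pairs)
print(round(mse, 2))
```

Because every forecast is made by a model that never saw its verification case, the resulting mean squared error estimates the true forecast skill rather than the hindcast skill, while still using all 45 cases for verification.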

18: Forecast Quality Evaluation

406

[273] (and also Barnston [25] and the references variety ways in which the model ¬tted to the

therein) for more information. learning data can be in¬‚uenced by the information

in the validation data. For example, in many

18.5.3 Some Cautions. Care must be taken analyses the annual mean is ¬rst estimated and

to ensure that the information used to ¬t the removed, and models are subsequently ¬tted to

model in each cross-validation step is completely the anomalies that remain. If cross-validation is

independent of the information that is withheld for performed by repeatedly dividing the anomalies

the validation data. Barnston and van den Dool into learning and veri¬cation subsets, the model