biases induced by analysis systems or from errors induced by observing systems. Examples of the latter are the various biases that are inherent in the many different rain gauge designs used throughout the world [247]. We ignore these biases in this chapter.

the true unknown parameter. Therefore, forecast skill evaluation can be thought of as a form of parameter estimation (see Chapter 5), even though the problem of forecast skill evaluation is generally not considered as such by the practitioners of the parameter estimation art.

1 This chapter is based in part on Livezey [255], Murphy and Epstein [285] and Stanski, Wilson, and Burrows [355]. For further reading, we also recommend Murphy and Daan [284], Murphy and Winkler [286], Murphy, Brown, and Chen [283], Barnston [25] and Livezey [256].

2 We call these statistics skill scores for convenience. We use the phrase 'skill score' somewhat less formally than is dictated by statistical convention, where this expression is limited to statistics that have a specific functional form, such as the Heidke skill score (18.1) and the Brier skill score (18.5). These formal scores compare the actual rate of success with the success rate of a reference forecast.

18.0.4 Categorical and Quantitative Forecasts. We restrict ourselves to examples in which the forecast F and the predictand P are either both quantitative (i.e., a number such as '13 °C') or categorical statements (such as 'warmer than normal'). If the forecast is categorical, we require that the category boundaries ('normal') are unequivocally defined. We will not discuss probabilistic forecasts such as 'The chance of precipitation tomorrow is 70%.' We begin by discussing categorical forecasts in Section 18.1. Quantitative forecasts are discussed in Section 18.2.
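The distinction between quantitative and categorical forecasts can be made concrete with a small sketch (our own illustration; the 13.0 °C 'normal' threshold is hypothetical): a quantitative forecast is a number, while its categorical counterpart is a class with an unequivocally defined boundary.

```python
# Illustrative sketch (our own; the threshold value is hypothetical) of
# turning a quantitative forecast/predictand pair into two-class
# categorical statements with an unequivocally defined boundary.

NORMAL = 13.0  # hypothetical long-term mean defining 'normal', in degrees C

def categorize(value):
    """Map a quantitative value onto the two-class categorical scheme."""
    return "above normal" if value > NORMAL else "below normal"

forecast, predictand = 14.2, 12.5  # a quantitative (F, P) pair ...
# ... and its categorical counterpart:
print(categorize(forecast), "/", categorize(predictand))
# prints: above normal / below normal
```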

18: Forecast Quality Evaluation

18.1 The Skill of Categorical Forecasts

18.1.1 Categorical Forecasts. Categorical forecasts are often made in two or three (or more) classes, such as above normal, near normal and below normal, that are clearly defined in terms of a priori specified threshold values. For example, two-class forecasts often specify either above normal or below normal, where the threshold normal is the long-term mean of the predicted parameter. The outcome of a two-class categorical forecasting scheme can be summarized in a 2 × 2 contingency table (see Table 18.1).

                                   Forecast
    Verifying analysis    above normal    below normal
    above normal             p_{aa}          p_{ba}        p_a^P
    below normal             p_{ab}          p_{bb}        p_b^P
                             p_a^F           p_b^F           1

Table 18.1: An illustration of a 2 × 2 contingency table used to summarize the performance of a categorical forecasting system.

The entries in the table are defined as follows. The probability that the forecast F and the predictand P jointly fall in the above normal category is p_{aa}. Similarly, p_{ab} is the probability that the forecast falls into the above normal category and the predictand falls into the below normal category. Probabilities p_{ba} and p_{bb} are defined analogously.

The marginal probability distribution (cf. [2.3.12]) of the forecast F is given by

    p_a^F = P(F = above normal) = p_{aa} + p_{ab},
    p_b^F = P(F = below normal) = p_{bb} + p_{ba}.

The marginal probability distribution for the predictand P is defined similarly.

18.1.2 The Heidke Skill Score. A useful measure of the skill of a two-class categorical forecasting scheme is the Heidke skill score (Heidke [173]), which is given by

    S = (p_C − p_E) / (1 − p_E)        (18.1)

where p_C is the probability of a correct forecast, given by

    p_C = p_{aa} + p_{bb},

and p_E is the probability of a correct forecast when the forecast carries no information about the subsequent observation (a 'random forecast'). We obtain a random forecast when F and P are independent, and therefore find that

    p_E = p_a^P p_a^F + p_b^P p_b^F.

If the above normal and below normal classes are equally likely for both F and P, then p_E = 0.5 because p_a^F = p_a^P = p_b^F = p_b^P = 0.5. On the other hand, the two classes may not be equally likely. For example, we might have p_a^F = p_a^P = 0.6 and p_b^F = p_b^P = 0.4. Then p_E = 0.4² + 0.6² = 0.52.

It is easily demonstrated that the skill S of a random forecast is zero and that the skill of a perfect forecast (i.e., p_C = 1) is 1. If there is perfect reverse reliability, that is, every forecast is wrong, then p_C = 0 and S = −p_E/(1 − p_E). In this case we obtain S = −1 if both classes are equally likely for F and P.

When sample sizes are finite, the Heidke skill score (18.1) is often written as

    S = (n p̂_C − n p̂_E) / (n − n p̂_E)        (18.2)

where n is the number of (F, P) realizations in the sample, and the hat notation, as usual, indicates that the probability is estimated. In this expression, n p̂_C is the number of correct forecasts and n p̂_E is an estimate of the expected number of correct random forecasts.

The Heidke skill score may be extended to categorical forecasts with more than two categories. Many other useful skill scores for categorical forecasts may also be defined (see, for example, Stanski et al. [355]). Also, the term p_E in (18.1), which represents the probability of correct random forecasts, may be replaced by the probability of a correct forecast produced by any other reference forecasting system (such as persistence, in which the class the predictand will occupy at the next verifying time is forecast to be the class currently occupied by the predictand).

18.1.3 Example: The Old Farmer's Almanac. The following example is taken from Walsh and Allen [412], who evaluated five years of regular monthly mean temperature forecasts for the USA issued by The Old Farmer's Almanac [364]. The success rate for temperature was 50.7%. The corresponding rate for precipitation was 51.9%. The Old Farmer's Almanac's forecasts have some skill, with S = 7/500 = 0.014 for temperature and S = 0.038 for precipitation, if we assume that the monthly means have symmetric distributions. However, the distributions are actually somewhat skewed: there are fewer (but larger) above-normal temperature extremes than below-normal extremes. If we assume that p_a^F = p_a^P = 0.45 for temperature, then p_E = 0.45² + 0.55² = 0.505,

18.1.5 The Skill Score is Subject to Sampling Variation. The Heidke skill score (18.1) is a one-number summary of the forecasting scheme's performance relative to a competing reference scheme. As noted in [18.0.3], the forecast and predictand should be viewed as a (hopefully) correlated pair of random variables (F, P), and the skill score S should properly be viewed as an estimator of some characteristic of the joint distribution of F and P. One might therefore ask how accurate this estimate is. One might also ask what the likelihood is of obtaining a positive realization of the skill score from a finite sample of random forecasts. There are no general answers to these questions. Radok [327], however, has suggested an estimate of the sampling error of the
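The second question raised in [18.1.5] can be explored numerically. The following Monte Carlo sketch (our own illustration, not Radok's estimator) draws finite samples of independent forecast-predictand pairs, for which the true skill is zero, and shows that a sizeable fraction of such samples nevertheless yields a positive realization of the score (18.2).

```python
# A Monte Carlo sketch (our own illustration, not Radok's estimator) of the
# sampling variation of the Heidke skill score: forecast and predictand are
# drawn independently, so the true skill is zero, yet many finite samples
# yield a positive realization of the estimated score (18.2).

import random

def sample_skill(n, p_above=0.5, rng=random):
    """Draw n independent (F, P) pairs and return the estimated skill (18.2)."""
    correct = n_fa = n_pa = 0  # correct forecasts / 'above' counts for F and P
    for _ in range(n):
        f = rng.random() < p_above   # categorical forecast: above normal?
        p = rng.random() < p_above   # categorical predictand, independent of f
        correct += (f == p)
        n_fa += f
        n_pa += p
    # Estimated number of correct forecasts expected by chance: n * p_E-hat.
    n_pe = (n_fa * n_pa + (n - n_fa) * (n - n_pa)) / n
    return (correct - n_pe) / (n - n_pe)

random.seed(1)  # reproducible illustration
scores = [sample_skill(30) for _ in range(2000)]
frac_positive = sum(score > 0 for score in scores) / len(scores)
print(round(frac_positive, 2))
```

With verification samples of only 30 random forecasts, a substantial fraction of realizations of the score is positive even though the true skill is zero, which is why an apparently positive skill computed from a small sample should be interpreted with care.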