efficient than the unbiased estimator S². We will see shortly that the biased estimator is also the maximum likelihood estimator of σ².

The proof is easy to demonstrate:

    M(α̂; α) = E[(α̂ − α)²]
             = E[((α̂ − E(α̂)) − (α − E(α̂)))²]
             = E[(α̂ − E(α̂))²] + (α − E(α̂))²
               − 2 (α − E(α̂)) E[α̂ − E(α̂)].

The cross-product term in the last expression is zero, so (5.28) follows. Therefore, any asymptotically unbiased estimator with variance that is asymptotically zero is consistent.

5.3.7 Bias Correction and the Jackknife. We showed in [5.3.3] that σ̃² is a biased estimator of σ² with B(σ̃²) = −(1/n) σ². We also showed that the sample variance S² corrects this bias by multiplying the estimator σ̃² by n/(n − 1). Many bias corrections are of the above form; that is, a bias correction is often made by scaling α̃, a biased estimator of α, by a constant c(n) so that the resulting estimator α̂ = α̃/c(n) is an unbiased estimator of α. Biases and the corresponding bias corrections come in a variety of forms, however, so there is no general rule about the form of these corrections.

An empirical approach frequently used to find bias corrections is called the jackknife (see Efron [111] or Quenouille [326]). The idea is that the estimator is computed from the full sample, then recomputed n times, leaving a different observation out each time. These estimators are denoted α̃ and α̃_(i), where the subscript (i) indicates that α̃_(i) is computed with X_i removed from the sample. The jackknife bias correction, which is subtracted from α̃, is then given by

    α̃_B = (n − 1)(α̃_(·) − α̃),

where

    α̃_(·) = (1/n) Σ_{i=1}^n α̃_(i).

The jackknifed estimator, α̂ = α̃ − α̃_B, can often be re-expressed in the form α̂ = α̃/c(n).

It can be shown, with some algebraic manipulation, that the jackknifed bias correction for σ̃² is

    σ̃_B = −(1/(n(n − 1))) Σ_{i=1}^n (X_i − X̄)².
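The leave-one-out recipe described above is easy to check numerically. The sketch below applies the jackknife correction to the biased, divisor-n variance estimator; the function names and sample values are invented for illustration, and the final comparison confirms that the corrected estimate equals the sample variance S² with divisor n − 1:

```python
def biased_var(xs):
    """Plug-in variance estimator with divisor n (biased)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def jackknife_bias_correction(xs, estimator):
    """Compute (n - 1) * (alpha_dot - alpha_full), to be subtracted from the estimate."""
    n = len(xs)
    alpha_full = estimator(xs)
    # Recompute the estimator n times, leaving out one observation each time.
    leave_one_out = [estimator(xs[:i] + xs[i + 1:]) for i in range(n)]
    alpha_dot = sum(leave_one_out) / n
    return (n - 1) * (alpha_dot - alpha_full)

xs = [2.1, -0.3, 1.7, 0.9, -1.2, 0.4]   # invented sample
n = len(xs)
corrected = biased_var(xs) - jackknife_bias_correction(xs, biased_var)

# The jackknifed estimator reproduces the sample variance S^2 (divisor n - 1):
m = sum(xs) / n
s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
print(abs(corrected - s2) < 1e-12)  # True
```

The same `jackknife_bias_correction` routine works for any estimator passed in as `estimator`; the exact agreement with S² is special to the variance case.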


Therefore the jackknifed estimator of σ² is

    σ̂² = σ̃² − σ̃_B = S².

A jackknifing approach can also be used to estimate the variance of an estimator α̃. Tukey [374] suggested that the variance of α̃, say σ²_α̃, could be estimated with

    σ̂²_α̃ = ((n − 1)/n) Σ_{i=1}^n (α̃_(i) − α̃_(·))².

Efron [111] explains why this works. The jackknife estimator of the variance of the sample mean is

    σ̂²_μ̂ = (1/n) S²,

which is also the estimator obtained when we replace σ² with S² in (5.24).

5.3.8 Maximum Likelihood Method. The estimators introduced in this section have been arbitrary so far. One systematic approach to obtaining estimators is the Method of Maximum Likelihood, introduced by R.A. Fisher [119, 120] in the 1920s.

The Maximum Likelihood Estimator of the Parameter of the Binomial Distribution. The idea is most easily conveyed through an example. For simplicity, suppose that our sample consists of n iid Bernoulli random variables {X_1, …, X_n} [2.4.2], which take values 0 or 1 with probabilities 1 − p and p, respectively. The problem is to estimate p.

The probability of observing a particular set of realizations {x_1, …, x_n} is

    P(X_1 = x_1, …, X_n = x_n) = p^h (1 − p)^(n−h),   (5.29)

where h = Σ_{i=1}^n x_i. Therefore, we see that the useful information about p is carried not by the individual random variables X_i but by their sum

    H = Σ_{i=1}^n X_i.

We come to this conclusion because (5.29) has the same value regardless of the order in which the contributions to h (i.e., the 0s and 1s) were observed. Thus our estimator should be based on the statistic H.

The probability distribution of H is the binomial distribution (2.7),

    f_H(h; p) = C(n, h) p^h (1 − p)^(n−h),   (5.30)

where C(n, h) denotes the binomial coefficient.

Now suppose that we have observed H = h. The likelihood of observing h for a particular value of the parameter p is given by the likelihood function

    L_H(p) = f_H(h; p).   (5.31)

The likelihood function is identical to the probability distribution of our statistic H except that it is now viewed as a function of the parameter p.

The maximum likelihood estimator (MLE) of p is now obtained by determining the value of the parameter p for which the observed value h of H is most likely. That is, given H = h, (5.31) is maximized with respect to p.

It is often easier to maximize the log-likelihood function

    l_H(p) = ln(L_H(p)),

which is defined as the natural log of the likelihood function. For this example the log-likelihood is given by

    l_H(p) = ln C(n, h) + h ln(p) + (n − h) ln(1 − p).   (5.32)

We maximize (5.32) by taking the derivative of l_H(p) with respect to p and solving the equation obtained by setting the derivative to zero. In the present example there will be only one solution to this equation. However, there may be many solutions in general, and it is necessary to select the solution that produces the overall maximum of l (or, equivalently, L).

Taking the partial derivative of (5.32) and setting it to zero, we obtain

    ∂l_H(p)/∂p = h/p − (n − h)/(1 − p) = 0.   (5.33)

The unique solution of (5.33) is p̂ = h/n. The corresponding MLE of p, written in random variable form, is p̂ = H/n. Thus, we have discovered that here the estimator we would intuitively use to estimate p is also its maximum likelihood estimator.

The Maximum Likelihood Estimator in General. We will continue to assume that our sample consists of n iid random variables {X_1, …, X_n}, all distributed as random variable X. For convenience we will assume that they are continuous, and refer to probability density functions rather than probability distributions. However, everything here can be repeated with probability distributions simply by replacing all
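The binomial result p̂ = h/n can also be recovered by brute force. The sketch below (the values of n and h are made up, and a grid search stands in for solving (5.33) analytically) maximizes the log-likelihood (5.32) numerically:

```python
from math import comb, log

n, h = 20, 7  # invented sample size and observed number of 1s

def log_likelihood(p):
    # l_H(p) = ln C(n, h) + h ln(p) + (n - h) ln(1 - p), equation (5.32)
    return log(comb(n, h)) + h * log(p) + (n - h) * log(1 - p)

# Scan a fine grid of p values in (0, 1) and pick the maximizer.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat, h / n)  # the grid maximizer coincides with h/n = 0.35
```

Because (5.32) is strictly concave in p, the grid maximizer sits exactly at h/n whenever that value lies on the grid.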


occurrences of density functions with probability distributions.

Let f_X(x; α) be the density function of X, where α is a vector containing the parameters of the distribution of X. The joint probability density

Differentiation yields

    ∂l_{X_1…X_n}(μ, σ²)/∂μ = Σ_{i=1}^n (x_i − μ)/σ²   (5.36)

    ∂l_{X_1…X_n}(μ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)².
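Setting these derivatives to zero gives the familiar stationary values μ̂ = x̄ and σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)². A minimal numerical sanity check, with invented sample values, that these values do maximize the normal log-likelihood:

```python
from math import log, pi

xs = [1.2, 0.7, -0.4, 2.3, 0.1]  # invented sample
n = len(xs)

mu_hat = sum(xs) / n                               # solves (5.36) = 0
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # solves the sigma^2 equation

def log_lik(mu, var):
    # Normal log-likelihood of the sample, as a function of (mu, sigma^2).
    return -0.5 * n * log(2 * pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

# The stationary point should beat small perturbations of either parameter.
best = log_lik(mu_hat, var_hat)
print(all(log_lik(mu_hat + d, var_hat) <= best and log_lik(mu_hat, var_hat + d) <= best
          for d in (-0.01, 0.01)))  # True
```

Note that the maximum likelihood estimator of σ² uses divisor n, i.e. it is the biased estimator discussed earlier, not S².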