function for the random vector (X1, . . . , Xn)^T is

    f_{X_1 \cdots X_n}(x_1, \ldots, x_n; \alpha) = \prod_{i=1}^{n} f_X(x_i; \alpha)

(see (2.12)). Suppose we have observed Xi = xi, i = 1, . . . , n. Then the likelihood function for the unknown parameters α is

    L_{X_1 \cdots X_n}(\alpha) = \prod_{i=1}^{n} f_X(x_i; \alpha),    (5.34)

and the corresponding log-likelihood function is given by

    l_{X_1 \cdots X_n}(\alpha) = \sum_{i=1}^{n} \ln(f_X(x_i; \alpha)).    (5.35)

The maximum likelihood estimator α̂ of α is found by maximizing (5.34) or (5.35) with respect to α.

The Appeal of Maximum Likelihood Estimators. There are several good reasons to use maximum likelihood estimators. First, as we have noted, the method of maximum likelihood provides a systematic way to search for estimators. Second, MLEs tend to have pleasing asymptotic properties. They can be shown to be consistent and asymptotically normal under fairly general conditions (see, e.g., Cox and Hinkley [92], Section 9.2). The asymptotic normality can, in turn, be used to construct asymptotic confidence regions.8

8 That is, confidence regions that attain the specified coverage, say 95%, as the sample becomes large.

5.3.9 Maximum Likelihood Estimators of the Mean and the Variance of a Normal Random Variable. We derive the MLEs of the mean and the variance of a normal distribution N(µ, σ²) from a sample of n iid normal random variables using (5.35). The natural log of the normal density function is given by

    \ln(f_X(x; \mu, \sigma^2)) = -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}.

Consequently, the log-likelihood function is given by

    l_{X_1 \cdots X_n}(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2},

and its partial derivatives with respect to µ and σ² are

    \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)    (5.36)

    \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2.    (5.37)

We obtain the MLE of the mean by setting (5.36) to zero, to obtain

    \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i.    (5.38)

Re-expressing (5.38) in the random variable form, we find that the sample mean µ̂ = X̄ (see [4.3.1], [5.3.3] and [5.3.5]) is the maximum likelihood estimator of the mean.

Similarly, setting (5.37) to zero, we obtain

    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2.

Then, replacing µ with its MLE, and rewriting the resulting expression in random variable form, we obtain

    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2

as the maximum likelihood estimator of the variance. Thus, we see that the MLE of the variance is the biased estimator introduced in [5.2.6].

5.3.10 MLEs of Related Estimators. The following theorem (see, for example, Pugachev [325]) extends the utility of a maximum likelihood estimator:

Consider a random vector X with two parameters α and β, related to each other through g(α) = β and g⁻¹(β) = α, where g and g⁻¹ are both continuous. If α̂ is an MLE of α, then β̂ = g(α̂) is an MLE of β. Similarly, if β̂ is an MLE of β, then α̂ = g⁻¹(β̂) is an MLE of α.

There are various applications of this theorem. For example, suppose X is a normal random vector with covariance matrix Σ. Let {λ1, . . . , λn} be the eigenvalues of Σ and let {e1, . . . , en} be the corresponding eigenvectors (see Chapter 13). Both the covariance matrix (corresponding to α in the theorem above) and its eigenvalues and eigenvectors (corresponding to β) are parameter vectors of X. Moreover, there is a continuous, one-to-one relationship between these two representations of

the covariance structure of X. Therefore, since the covariance estimator Σ̂ in [5.2.7] is the maximum likelihood estimator of Σ, it follows from the theorem that the eigenvalues and eigenvectors of Σ̂ are MLEs of the eigenvalues and eigenvectors of Σ.
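The closed-form results of [5.3.9] are easy to check numerically. The following sketch (plain Python; the seed, sample size, and parameter values are illustrative assumptions, not from the text) computes µ̂ and the biased σ̂² and verifies that they maximize the log-likelihood (5.35) relative to nearby parameter values.

```python
import math
import random
import statistics

# Illustrative sample (values assumed for this sketch, not from the text).
random.seed(1)
n = 50
xs = [random.gauss(2.0, 2.0) for _ in range(n)]  # iid N(mu = 2, sigma^2 = 4)

# MLE of the mean, equation (5.38): the sample mean.
mu_hat = sum(xs) / n

# MLE of the variance: the biased estimator of [5.2.6] (divisor n, not n - 1).
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

def loglik(mu, s2):
    """Log-likelihood (5.35) for an iid N(mu, s2) sample."""
    return (-0.5 * n * math.log(2 * math.pi * s2)
            - sum((x - mu) ** 2 for x in xs) / (2 * s2))

# (mu_hat, sigma2_hat) should beat nearby parameter values.
assert loglik(mu_hat, sigma2_hat) > loglik(mu_hat + 0.1, sigma2_hat)
assert loglik(mu_hat, sigma2_hat) > loglik(mu_hat, 1.1 * sigma2_hat)

# The MLE differs from the unbiased estimator by the factor (n - 1)/n.
assert abs(sigma2_hat - (n - 1) / n * statistics.variance(xs)) < 1e-9
```

Replacing the divisor n by n − 1 gives the familiar unbiased estimator; the MLE is slightly smaller, which is exactly the bias noted in [5.2.6].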
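The eigenvalue application of the theorem in [5.3.10] can likewise be sketched for a bivariate sample. Everything below (the seed, the mixing coefficients, and the closed-form 2×2 eigenvalue formula) is an illustrative assumption of this sketch, not part of the text.

```python
import math
import random

# Hypothetical bivariate sample with correlated components.
random.seed(2)
n = 200
data = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    data.append((2.0 * z1, 0.6 * z1 + 0.8 * z2))

# MLE of the covariance matrix (divisor n, as in [5.2.7]).
m1 = sum(x for x, _ in data) / n
m2 = sum(y for _, y in data) / n
s11 = sum((x - m1) ** 2 for x, _ in data) / n
s22 = sum((y - m2) ** 2 for _, y in data) / n
s12 = sum((x - m1) * (y - m2) for x, y in data) / n

# Eigenvalues of the 2x2 symmetric matrix [[s11, s12], [s12, s22]] in
# closed form. Because the eigendecomposition depends continuously and
# one-to-one on the covariance matrix, the invariance theorem says these
# are the MLEs of the eigenvalues of the true covariance matrix.
tr = s11 + s22
det = s11 * s22 - s12 ** 2
disc = math.sqrt(tr ** 2 / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc

assert lam1 >= lam2 > 0
assert abs(lam1 + lam2 - tr) < 1e-9  # eigenvalues sum to the trace
```

The same reasoning gives, for instance, σ̂ = (σ̂²)^{1/2} as the MLE of the standard deviation, since the square root is continuous and invertible on the positive reals.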

5.4 Interval Estimators

5.4.1 What are Confidence Intervals? So far we have dealt with point estimates, that is, prescriptions that describe how to use the information in a sample to estimate a specific parameter of a random variable. We were sometimes able to make statements about the statistical properties of the estimators in repeated sampling, such as their mean squared errors, their biases and their variances.

In the following we deal with interval estimation, that is, the estimation of intervals or regions that will cover the unknown, but fixed, parameter with a given probability.

Statisticians often use the word coverage when discussing confidence intervals since the location of parameter α is fixed on the real line. A p̃ × 100% confidence interval for α is constructed from two statistics α̂_L and α̂_U, α̂_L < α̂_U, such that

    P((\hat{\alpha}_L, \hat{\alpha}_U) \ni \alpha) = \tilde{p}.    (5.39)

We use the symbol ∋ to mean that the set on the left covers the point on the right in (5.39). The confidence level p̃ is chosen to be relatively large (e.g., p̃ = 0.95). The upper and lower limits of the confidence interval are random variables; they are functions of the n random variables X1, . . . , Xn that represent the sampling mechanism. Thus, the interval varies in length and location on the real line. The interval is constructed so that it will cover the fixed point α on the real line p̃ × 100% of the time. That is, p̃ × 100% of the realizations of the confidence interval will lie on top of point α. Figure 5.4 illustrates this concept.

Figure 5.4: Ten realizations of a 95% confidence interval for unknown parameter α. On average, 19 out of 20 intervals will cover α. In this example, α = 0. The curve shows the density function of the sampled random variable.

Many authors use the word 'contain' in this context, but the parameter is fixed; thus, no probabilistic interpretation can be given to the interval. Rather, the interval is interpreted as reporting a range of parameter values that are strongly consistent with the realized sample (i.e., this is a range of possible parameters for which the likelihood function [5.3.8] is large). The confidence level indicates the average behaviour of the reporting procedure, but it does not, and can not, give a probabilistic interpretation to any one realization of the confidence interval.

5.4.2 Confidence Interval for a Random Variable—Optional. While the discussion to this point has focused on the probability that a random interval (α̂_L, α̂_U), defined as a function of random variables X1, . . . , Xn, covers a fixed parameter α, our thinking need not be restricted to fixed targets.

Consider an experiment in which n + 1 observations are obtained in such a way that they can be represented by n + 1 iid random variables X1, . . . , Xn, Xn+1. Suppose that there is an interval between the time the first n