Section 6.10. Nelder and Mead (1965) give a widely used algorithm based on simplex search. Lange (2000) gives a general account of numerical-analytic
$\mu = \bar{y}$ with any $\lambda$ not too small is
$$\frac{b\sqrt{2\pi}\,\exp\{-n(y_k - \bar{y})^2/2\}}{h(1 - b)}. \tag{7.2}$$
This ratio is of interest because it has to be greater than 1 for the optimum likelihood around $\mu = y_k$ to dominate the likelihood at $\mu = \bar{y}$. Unless $b/h$ is large and $y_k$ is close to $\bar{y}$ the ratio will be less than 1. The possibility of anomalous behaviour with misleading numerical consequences is much more serious with the continuous version of the likelihood.
7.2 Irregular form
A key step in the discussion of even the simplest properties is that the maximum likelihood estimating equation is unbiased, i.e., that
$$E\{\nabla l(\theta; Y); \theta\} = 0. \tag{7.3}$$
This was proved by differentiating under the integral sign the normalizing equation
$$\int f(y; \theta)\,dy = 1. \tag{7.4}$$
If this differentiation is not valid there is an immediate difficulty and we may call the problem irregular. All the properties of maximum likelihood estimates sketched above are now suspect. The most common reason for failure is that the range of integration effectively depends on $\theta$.
The simplest example of this is provided by a uniform distribution with unknown range.
Example 7.2. Uniform distribution. Suppose that $Y_1, \ldots, Y_n$ are independently distributed with a uniform distribution on $(0, \theta)$, i.e., have constant density $1/\theta$ in that range and zero elsewhere. The normalizing condition for $n$ independent observations is, integrating in $R^n$,
$$\int_0^\theta (1/\theta^n)\,dy = 1. \tag{7.5}$$
If this is differentiated with respect to $\theta$ there is a contribution not only from differentiating $1/\theta$ but also one from the upper limit of the integral and the key property fails.
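For a single observation ($n = 1$) the failure can be checked by hand. The following is a short worked calculation using the Leibniz rule for a variable upper limit; it is an illustration added here, not part of the original argument.

```latex
\frac{d}{d\theta}\int_0^\theta \frac{1}{\theta}\,dy
  = \underbrace{\frac{1}{\theta}}_{\text{upper limit}}
  + \underbrace{\int_0^\theta \frac{\partial}{\partial\theta}
      \Bigl(\frac{1}{\theta}\Bigr)\,dy}_{=\,E\{\partial \log f(Y;\theta)/\partial\theta\}}
  = \frac{1}{\theta} - \frac{1}{\theta} = 0 .
```

The two terms cancel, as they must, but the second term alone is the expected score, which is therefore $-1/\theta$ rather than zero.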
Direct examination of the likelihood shows it to be discontinuous, being zero if $\theta < y_{(n)}$, the largest observation, and equal to $1/\theta^n$ otherwise. Thus $y_{(n)}$ is the maximum likelihood estimate and is also the minimal sufficient statistic for $\theta$. Its properties can be studied directly by noting that
$$P(Y_{(n)} \le z) = \prod_j P(Y_j \le z) = (z/\theta)^n. \tag{7.6}$$
For inference about $\theta$, consider the pivot $n(\theta - Y_{(n)})/\theta$ for which
$$P\{n(\theta - Y_{(n)})/\theta < t\} = 1 - (1 - t/n)^n, \tag{7.7}$$
so that as $n$ increases the pivot is asymptotically exponentially distributed. In this situation the maximum likelihood estimate is a sure lower limit for $\theta$, the asymptotic distribution is not Gaussian and, particularly significantly, the errors of estimation are $O_p(1/n)$ not $O_p(1/\sqrt{n})$. An upper confidence limit for $\theta$ is easily calculated from the exact or from the asymptotic pivotal distribution.
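The behaviour of the pivot can be checked by simulation. The sketch below is illustrative only: the function names `simulate_max` and `upper_limit`, the values of `theta`, `n`, the number of replications and the seed are all assumptions made for the example. It compares the empirical distribution function of the pivot with the unit exponential distribution function $1 - e^{-t}$ and checks the coverage of the exact upper limit $y_{(n)}/\alpha^{1/n}$ implied by (7.6).

```python
import numpy as np

def simulate_max(theta, n, reps, seed=0):
    """Largest observation in each of `reps` samples of size n from U(0, theta)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)

def upper_limit(y_max, n, alpha=0.05):
    """Exact upper 1 - alpha limit for theta, using P(Y_(n)/theta <= z) = z**n."""
    return y_max / alpha ** (1.0 / n)

theta, n = 3.0, 200          # illustrative values, not from the text
y_max = simulate_max(theta, n, reps=20000)
pivot = n * (theta - y_max) / theta

# Empirical distribution function of the pivot against the exponential limit.
for t in (0.5, 1.0, 2.0):
    print(t, np.mean(pivot < t), 1.0 - np.exp(-t))

# Proportion of samples in which the upper limit is at or above theta.
coverage = np.mean(theta <= upper_limit(y_max, n))
print("coverage of nominal 95% upper limit:", coverage)
```

Because the maximum never exceeds $\theta$, the pivot is always non-negative, which is the sense in which the estimate is a sure lower limit.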
Insight into the general problem can be obtained from the following simple generalization of Example 7.2.
Example 7.3. Densities with power-law contact. Suppose that $Y_1, \ldots, Y_n$ are independently distributed with the density $(a + 1)(\theta - y)^a/\theta^{a+1}$, where $a$ is a known constant. For Example 7.2, $a = 0$. As $a$ varies the behaviour of the density near the critical end-point $\theta$ changes.
The likelihood has a change of behaviour, although not necessarily a local maximum, at the largest observation. A slight generalization of the argument used above shows that the pivot
$$n^{1/(a+1)}(\theta - Y_{(n)})/\theta \tag{7.8}$$
has a limiting Weibull distribution with distribution function $1 - \exp(-t^{a+1})$. Thus for $-1/2 < a < 1$ inference about $\theta$ is possible from the maximum observation at an asymptotic rate faster than $1/\sqrt{n}$.
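The Weibull limit for the pivot (7.8) can be examined numerically. The following is a minimal sketch under stated assumptions: observations are drawn by inverting the distribution function $F(y) = 1 - \{(\theta - y)/\theta\}^{a+1}$ on $(0, \theta)$, and the function name `power_law_pivot`, the value $a = 0.5$, the sample size and the seed are chosen purely for illustration.

```python
import numpy as np

def power_law_pivot(a, theta=1.0, n=200, reps=10000, seed=1):
    """Simulate n**(1/(a+1)) * (theta - Y_(n)) / theta for the power-law density."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=(reps, n))
    # Inverse-CDF draw: theta - Y = theta * U**(1/(a+1)) with U uniform on (0, 1).
    y = theta * (1.0 - u ** (1.0 / (a + 1.0)))
    y_max = y.max(axis=1)
    return n ** (1.0 / (a + 1.0)) * (theta - y_max) / theta

a = 0.5                      # illustrative value of the known constant
pivot_a = power_law_pivot(a)

# Empirical distribution function against the limiting Weibull form.
for t in (0.5, 1.0, 1.5):
    print(t, np.mean(pivot_a < t), 1.0 - np.exp(-t ** (a + 1.0)))
```

Setting `a = 0` reproduces the uniform case of Example 7.2, for which the Weibull limit reduces to the exponential.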
Consider next the formal properties of the maximum likelihood estimating equation. We differentiate the normalizing equation
$$\int_0^\theta \{(a + 1)(\theta - y)^a/\theta^{a+1}\}\,dy = 1, \tag{7.9}$$
obtaining from the upper limit a contribution equal to the integrand evaluated there and thus zero if $a > 0$ and nonzero (and possibly unbounded) if $a \le 0$.
For $0 < a \le 1$ the score statistic for a single observation is
$$\partial \log f(Y, \theta)/\partial\theta = a/(\theta - Y) - (a + 1)/\theta \tag{7.10}$$
and has zero mean and infinite variance; an estimate based on the largest observation is preferable.
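The two moment statements can be verified directly from the density of Example 7.3. The following worked calculation is an added illustration, valid for $a > 0$.

```latex
E\Bigl\{\frac{1}{\theta - Y}\Bigr\}
  = \int_0^\theta \frac{(a+1)(\theta - y)^{a-1}}{\theta^{a+1}}\,dy
  = \frac{a+1}{a\,\theta},
\qquad\text{so}\qquad
E\Bigl\{\frac{a}{\theta - Y} - \frac{a+1}{\theta}\Bigr\}
  = \frac{a+1}{\theta} - \frac{a+1}{\theta} = 0 ,

E\Bigl\{\frac{1}{(\theta - Y)^2}\Bigr\}
  = \int_0^\theta \frac{(a+1)(\theta - y)^{a-2}}{\theta^{a+1}}\,dy ,
\quad\text{which is finite only if } a - 2 > -1,\ \text{i.e. } a > 1 .
```

Hence for $0 < a \le 1$ the score has mean zero but its second moment diverges at the end-point $y = \theta$.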
For $a > 1$ the normalizing condition for the density can be differentiated twice under the integral sign, Fisher’s identity relating the expected information to the variance of the score holds and the problem is regular.
The relevance of the extreme observation for asymptotic inference thus depends on the level of contact of the density with the $y$-axis at the terminal point. This conclusion applies also to more complicated problems with more parameters. An example is the displaced exponential distribution with density
$$\rho\exp\{-\rho(y - \theta)\} \tag{7.11}$$
for $y > \theta$ and zero otherwise, which has a discontinuity in density at $y = \theta$.
The sufficient statistics in this case are the smallest observation $y_{(1)}$ and the mean. Estimation of $\theta$ by $y_{(1)}$ has error $O_p(1/n)$ and estimation of $\rho$ with error $O_p(1/\sqrt{n})$ can proceed as if $\theta$ is known at least to the first order of asymptotic theory. The estimation of the displaced Weibull distribution with density
$$\rho\gamma\{\rho(y - \theta)\}^{\gamma - 1}\exp[-\{\rho(y - \theta)\}^{\gamma}] \tag{7.12}$$
for $y > \theta$ illustrates the richer possibilities of Example 7.3.
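The two different rates for the displaced exponential distribution (7.11) can be seen in a small simulation. This is a sketch under assumptions: the helper `displaced_exp_estimates`, the parameter values, the sample sizes and the number of replications are illustrative choices. The estimates used are the maximum likelihood estimates, namely the smallest observation for $\theta$ and the reciprocal of the mean excess over it for $\rho$.

```python
import numpy as np

def displaced_exp_estimates(y):
    """Maximum likelihood estimates (theta_hat, rho_hat) for density (7.11)."""
    theta_hat = y.min()
    rho_hat = 1.0 / (y.mean() - theta_hat)
    return theta_hat, rho_hat

rng = np.random.default_rng(4)
theta, rho = 2.0, 1.5        # illustrative true values
errors = {}
for n in (100, 400, 1600):
    est = np.array([
        displaced_exp_estimates(theta + rng.exponential(1.0 / rho, size=n))
        for _ in range(2000)
    ])
    # Mean error of theta_hat (always positive) and standard deviation of rho_hat.
    errors[n] = (np.mean(est[:, 0] - theta), np.std(est[:, 1]))
    print(n, errors[n])
```

Quadrupling $n$ should roughly quarter the error in the estimate of $\theta$ but only halve the standard deviation of the estimate of $\rho$.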
There is an important additional aspect about the application of these and similar results already discussed in a slightly different context in Example 7.1.
The observations are treated as continuously distributed, whereas all real data are essentially recorded on a discrete scale, often in groups or bins of a particular width. For the asymptotic results to be used to obtain approximations to the distribution of extremes, it is, as in the previous example, important that the grouping interval in the relevant range is chosen so that rounding errors are unimportant relative to the intrinsic random variability of the continuous variables in the region of concern. If this is not the case, it may be more relevant to consider an asymptotic argument in which the grouping interval is fixed as $n$ increases. Then with high probability the grouping interval containing the terminal point of support will be identified and estimation of the position of the maximum within that interval can be shown to be an essentially regular problem leading to a standard error that is $O(1/\sqrt{n})$.
While the analytical requirements justifying the expansions of the log likelihood underlying the theory of Chapter 6 are mild, they can fail in ways less extreme than those illustrated in Examples 7.2 and 7.3. For example the log likelihood corresponding to $n$ independent observations from the Laplace density $\exp(-|y - \theta|)/2$ leads to a score function of
$$\sum_k \operatorname{sgn}(y_k - \theta). \tag{7.13}$$
It follows that the maximum likelihood estimate is the median, although if $n$ is odd the score is not differentiable there. The second derivative of the log likelihood is not defined, but it can be shown that the variance of the score does determine the asymptotic variance of the maximum likelihood estimate.
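Both claims about the Laplace case can be illustrated numerically. In the sketch below, which is an added example with assumed sample sizes, seed and helper name `laplace_loglik`, the log likelihood is maximized over a grid and compared with the median, and the variance of the median over repeated samples is compared with $1/n$, the value implied by a score whose variance per observation is 1.

```python
import numpy as np

def laplace_loglik(theta, y):
    """Log likelihood for the density exp(-|y - theta|) / 2."""
    return -np.sum(np.abs(y - theta)) - len(y) * np.log(2.0)

rng = np.random.default_rng(2)

# One sample with n odd: the median maximizes the log likelihood.
y = rng.laplace(loc=1.5, scale=1.0, size=101)
theta_hat = np.median(y)
grid = np.linspace(y.min(), y.max(), 2001)
grid_loglik = np.array([laplace_loglik(t, y) for t in grid])
print("median:", theta_hat, "grid maximizer:", grid[np.argmax(grid_loglik)])

# Repeated samples: n * var(median) should be close to 1 for large n.
n, reps = 401, 4000
medians = np.median(rng.laplace(loc=0.0, scale=1.0, size=(reps, n)), axis=1)
scaled_var = n * np.var(medians)
print("n * var(median):", scaled_var)
```

For moderate $n$ the scaled variance tends to sit slightly above 1, so the comparison should be read as approximate.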
A more subtle example of failure is provided by a simple time series problem.
Example 7.4. Model of hidden periodicity. Suppose that $Y_1, \ldots, Y_n$ are independently normally distributed with variance $\sigma^2$ and with
$$E(Y_k) = \mu + \alpha\cos(k\omega) + \beta\sin(k\omega), \tag{7.14}$$
where $(\mu, \alpha, \beta, \omega, \sigma^2)$ are unknown. A natural interpretation is to think of the observations as equally spaced in time. It is a helpful simplification to restrict $\omega$ to the values
$$\omega_p = 2\pi p/n \quad (p = 1, 2, \ldots, [n/2]), \tag{7.15}$$
where $[x]$ denotes the integer part of $x$. We denote the true value of $\omega$ holding in (7.14) by $\omega_q$.
The finite Fourier transform of the data is defined by
$$\tilde{Y}(\omega_p) = \sqrt{2/n}\,\sum_k Y_k e^{ik\omega_p} = \tilde{A}(\omega_p) + i\tilde{B}(\omega_p). \tag{7.16}$$
Because of the special properties of the sequence $\{\omega_p\}$ the transformation from $Y_1, \ldots, Y_n$ to $\sum_k Y_k/\sqrt{n}, \tilde{A}(\omega_1), \tilde{B}(\omega_1), \ldots$ is orthogonal and hence the transformed variables are independently normally distributed with variance $\sigma^2$. Further, if the linear model (7.14) holds with $\omega = \omega_q$ known, it follows that:
• for $p \ne q$, the finite Fourier transform has expectation zero;
• the least squares estimates of $\alpha$ and $\beta$ are $\sqrt{2/n}$ times the real and imaginary parts of $\tilde{Y}(\omega_q)$, the expectations of which are therefore $\sqrt{n/2}\,\alpha$ and $\sqrt{n/2}\,\beta$;
• the residual sum of squares is the sum of squares of all finite Fourier transform components except those at $\omega_q$.
Now suppose that $\omega$ is unknown and consider its profile likelihood after estimating $(\mu, \alpha, \beta, \sigma^2)$. Equivalently, because (7.14) for fixed $\omega$ is a normal-theory linear model, we may use minus the residual sum of squares. From the points listed above it follows that the residual sum of squares varies randomly across the values of $\omega$ except at $\omega = \omega_q$, where the residual sum of squares is much smaller. The dip in values, corresponding to a peak in the profile likelihood, is isolated at one point. Even though the log likelihood is differentiable as a function of the continuous variable $\omega$, the fluctuations in its value are so rapid and extreme that two terms of a Taylor expansion about the maximum likelihood point are totally inappropriate.
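The isolated dip can be seen by fitting the linear model (7.14) at each Fourier frequency in turn. The following is a minimal sketch with assumed values: the function name `rss_by_frequency`, the choices $n = 101$, $q = 17$, the coefficients and the seed are illustrative, and $n$ is taken odd so that every $\omega_p$ gives a full-rank design.

```python
import numpy as np

def rss_by_frequency(y):
    """Residual sum of squares of model (7.14) at each omega_p = 2*pi*p/n."""
    n = len(y)
    k = np.arange(1, n + 1)
    rss = {}
    for p in range(1, n // 2 + 1):
        w = 2.0 * np.pi * p / n
        X = np.column_stack([np.ones(n), np.cos(k * w), np.sin(k * w)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss[p] = float(np.sum((y - X @ coef) ** 2))
    return rss

rng = np.random.default_rng(3)
n, q = 101, 17               # illustrative sample size and true frequency index
k = np.arange(1, n + 1)
wq = 2.0 * np.pi * q / n
y = 1.0 + 2.0 * np.cos(k * wq) + 1.0 * np.sin(k * wq) + rng.normal(size=n)

rss = rss_by_frequency(y)
p_hat = min(rss, key=rss.get)
others = min(v for p, v in rss.items() if p != p_hat)
print("frequency index with smallest RSS:", p_hat)
print("RSS there:", rss[p_hat], "smallest RSS elsewhere:", others)
```

Even the immediate neighbours of the minimizing frequency show no reduction in the residual sum of squares, which is why a quadratic approximation around the maximum is of no use here.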