
with corresponding changes in the covariance matrix and that $Z$ can always be transformed to $d$ independent standardized normal random variables. See also Note 2.6.

It follows that the pivots

$$(\theta - \hat\theta)^T i(\hat\theta)\,(\theta - \hat\theta) \qquad (6.53)$$

or

$$(\theta - \hat\theta)^T \hat{\jmath}\,(\theta - \hat\theta) \qquad (6.54)$$

can be used to form approximate confidence regions for $\theta$. In particular, the second, and more convenient, form produces a series of concentric similar ellipsoidal regions corresponding to different confidence levels.
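As a concrete illustration (not from the book; the numbers below are invented stand-ins for the output of a maximum likelihood fit), the following Python sketch tests whether a point lies in the ellipsoidal confidence region defined by the pivot (6.54) for a two-dimensional parameter:

```python
import numpy as np
from scipy.stats import chi2

# Confidence region from the pivot (6.54): with theta_hat and observed
# information j_hat, the approximate 1-c region is the ellipsoid
# {theta : (theta - theta_hat)^T j_hat (theta - theta_hat) <= k}, where k
# is the upper c point of chi-squared with d = 2 degrees of freedom.
theta_hat = np.array([1.0, 0.5])
j_hat = np.array([[40.0, 12.0],
                  [12.0, 25.0]])        # observed information at theta_hat
c = 0.05
k = chi2.ppf(1 - c, df=2)

def in_region(theta):
    d = theta - theta_hat
    return d @ j_hat @ d <= k           # quadratic form of the pivot

print(in_region(np.array([1.1, 0.4])))  # test a point against the region
```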

The three quadratic statistics discussed in Section 6.3.4 take the forms, respectively,

$$W_E = (\hat\theta - \theta_0)^T i(\theta_0)\,(\hat\theta - \theta_0), \qquad (6.55)$$

$$W_L = 2\{l(\hat\theta) - l(\theta_0)\}, \qquad (6.56)$$

$$W_U = U(\theta_0; Y)^T i^{-1}(\theta_0)\, U(\theta_0; Y). \qquad (6.57)$$

Again we defer discussion of the relative merits of these until Sections 6.6 and 6.11.
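As an illustration (not from the book; the exponential model, data and null value are invented for the example), the following Python sketch evaluates the three statistics for a scalar exponential model with rate $\theta$, for which $l(\theta) = n\log\theta - \theta\sum y_k$, $U(\theta) = n/\theta - \sum y_k$ and $i(\theta) = n/\theta^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

theta0 = 1.0                                  # null value under test
y = rng.exponential(scale=1 / 1.3, size=200)  # data generated with theta = 1.3
n, s = y.size, y.sum()
theta_hat = n / s                             # maximum likelihood estimate

def loglik(theta):
    return n * np.log(theta) - theta * s

U = n / theta0 - s                            # score at theta0
i0 = n / theta0**2                            # expected information at theta0

W_E = (theta_hat - theta0)**2 * i0              # (6.55)
W_L = 2 * (loglik(theta_hat) - loglik(theta0))  # (6.56)
W_U = U**2 / i0                                 # (6.57)

# All three are asymptotically chi-squared with 1 degree of freedom under
# theta = theta0, and agree with one another to first order.
print(W_E, W_L, W_U)
```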

6.4 Nuisance parameters

6.4.1 The information matrix

In the great majority of situations with a multidimensional parameter $\theta$, we need to write $\theta = (\psi, \lambda)$, where $\psi$ is the parameter of interest and $\lambda$ the nuisance parameter. Correspondingly, we partition $U(\theta; Y)$ into two components $U_\psi(\theta; Y)$ and $U_\lambda(\theta; Y)$. Similarly we partition the information matrix and its inverse in the form

$$i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix} \qquad (6.58)$$

and

$$i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}, \qquad (6.59)$$

where the components of the inverse are written with superscripts.

There are corresponding partitions for the observed information $\hat{\jmath}$.

6.4.2 Main distributional results

A direct approach for inference about $\psi$ is based on the maximum likelihood estimate $\hat\psi$, which is asymptotically normal with mean $\psi$ and covariance matrix $i^{\psi\psi}$, or equivalently $\hat{\jmath}^{\,\psi\psi}$. In terms of a quadratic statistic, we have for testing whether $\psi = \psi_0$ the form

$$W_E = (\hat\psi - \psi_0)^T (i^{\psi\psi})^{-1}(\hat\psi - \psi_0), \qquad (6.60)$$

with the possibility of using $(\hat{\jmath}^{\,\psi\psi})^{-1}$ rather than $(i^{\psi\psi})^{-1}$. Note also that even if $\psi_0$ were used in the calculation of $i^{\psi\psi}$, it would still be necessary to estimate $\lambda$, except for those special problems in which the information does not depend on $\lambda$.

Now to study $\psi$ via the gradient vector, as far as possible separated from $\lambda$, it turns out to be helpful to write $U_\psi$ as a linear combination of $U_\lambda$ plus a term uncorrelated with $U_\lambda$, i.e., as a linear least squares regression plus an uncorrelated deviation. This representation is

$$U_\psi = i_{\psi\lambda} i_{\lambda\lambda}^{-1} U_\lambda + U_{\psi\cdot\lambda}, \qquad (6.61)$$

say, where $U_{\psi\cdot\lambda}$ denotes the deviation of $U_\psi$ from its linear regression on $U_\lambda$. Then a direct calculation shows that

$$\mathrm{cov}(U_{\psi\cdot\lambda}) = i_{\psi\psi\cdot\lambda}, \qquad (6.62)$$

where

$$i_{\psi\psi\cdot\lambda} = i_{\psi\psi} - i_{\psi\lambda} i_{\lambda\lambda}^{-1} i_{\lambda\psi} = (i^{\psi\psi})^{-1}. \qquad (6.63)$$

The second form follows from a general expression for the inverse of a partitioned matrix.
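The identity (6.63) is easy to check numerically. The following sketch (mine, not the book's) builds an arbitrary symmetric positive definite "information matrix", partitions it, and verifies that the Schur complement $i_{\psi\psi\cdot\lambda}$ is the inverse of the $(\psi, \psi)$ block of $i^{-1}(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_psi, d_lam = 2, 3
A = rng.normal(size=(d_psi + d_lam, d_psi + d_lam))
info = A @ A.T + np.eye(d_psi + d_lam)   # symmetric positive definite i(theta)

i_pp = info[:d_psi, :d_psi]              # i_{psi psi}
i_pl = info[:d_psi, d_psi:]              # i_{psi lambda}
i_lp = info[d_psi:, :d_psi]              # i_{lambda psi}
i_ll = info[d_psi:, d_psi:]              # i_{lambda lambda}

# Schur complement i_{psi psi . lambda} of (6.63)
i_pp_dot_l = i_pp - i_pl @ np.linalg.solve(i_ll, i_lp)
# (psi, psi) block of the inverse information matrix
i_inv_pp = np.linalg.inv(info)[:d_psi, :d_psi]

assert np.allclose(i_pp_dot_l, np.linalg.inv(i_inv_pp))
```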

A further property of the adjusted gradient, which follows by direct evaluation of the resulting matrix products using (6.63), is that

$$E(U_{\psi\cdot\lambda}; \theta + \delta) = i_{\psi\psi\cdot\lambda}\,\delta_\psi + O(\delta^2), \qquad (6.64)$$

i.e., to the first order the adjusted gradient does not depend on $\lambda$. This has the important consequence that in using the gradient-based statistic to test a null hypothesis $\psi = \psi_0$, namely

$$W_U = U_{\psi\cdot\lambda}^T(\psi_0, \lambda)\; i^{\psi\psi}(\psi_0, \lambda)\; U_{\psi\cdot\lambda}(\psi_0, \lambda), \qquad (6.65)$$

it is enough to replace $\lambda$ by, for example, its maximum likelihood estimate given $\psi_0$, or even by inefficient estimates.
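As an illustration (not from the book; the gamma model and data are invented for the example), the sketch below computes $W_U$ for testing the shape $a = a_0$ of a gamma distribution with the rate $b$ as nuisance parameter. With $b$ replaced by its maximum likelihood estimate given $a_0$, the nuisance score vanishes, so $U_{\psi\cdot\lambda}$ reduces to the $\psi$ score itself, and the adjusted information $n\{\mathrm{trigamma}(a_0) - 1/a_0\}$ follows from the partitioned expected information:

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(6)
y = rng.gamma(shape=2.0, scale=1.0, size=200)
n, s, slog = y.size, y.sum(), np.log(y).sum()

a0 = 2.0                          # null value of the shape
b_hat = n * a0 / s                # MLE of the rate b given a = a0
# Score for a at (a0, b_hat); the score for b is zero there by construction.
U_psi = n * np.log(b_hat) - n * digamma(a0) + slog

# i_{psi psi . lambda} = n*(trigamma(a0) - 1/a0), the Schur complement of the
# expected information i_aa = n*trigamma(a), i_ab = -n/b, i_bb = n*a/b**2.
i_adj = n * (polygamma(1, a0) - 1 / a0)

W_U = U_psi**2 / i_adj            # compare with chi-squared, 1 degree of freedom
print(W_U)
```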

The second version of the quadratic statistic (6.56), corresponding more directly to the likelihood function, requires the collapsing of the log likelihood into a function of $\psi$ alone, i.e., the elimination of dependence on $\lambda$.


This might be achieved by a semi-Bayesian argument in which $\lambda$ but not $\psi$ is assigned a prior distribution but, in the spirit of the present discussion, it is done by maximization. For given $\psi$ we define $\hat\lambda_\psi$ to be the maximum likelihood estimate of $\lambda$ and then define the profile log likelihood of $\psi$ to be

$$l_P(\psi) = l(\psi, \hat\lambda_\psi), \qquad (6.66)$$

a function of $\psi$ alone and, of course, of the data. The analogue of the previous likelihood ratio statistic for testing $\psi = \psi_0$ is now

$$W_L = 2\{l_P(\hat\psi) - l_P(\psi_0)\}. \qquad (6.67)$$

Expansions of the log likelihood about the point $(\psi_0, \lambda)$ show that in the asymptotic expansion we have, to the first term, that $W_L = W_U$, and therefore that $W_L$ has a limiting chi-squared distribution with $d_\psi$ degrees of freedom when $\psi = \psi_0$. Further, because of the relation between significance tests and confidence regions, the set of values of $\psi$ defined as

$$\{\psi : 2\{l_P(\hat\psi) - l_P(\psi)\} \le k_{d_\psi; c}\}, \qquad (6.68)$$

where $k_{d_\psi; c}$ is the upper $c$ point of the chi-squared distribution with $d_\psi$ degrees of freedom, forms an approximate $1 - c$ level confidence set for $\psi$.
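A minimal sketch (not from the book; the normal model and data are invented) of the confidence set (6.68) for the mean $\psi$ of a normal sample, with the variance as nuisance parameter: for fixed $\psi$ the maximizing variance is $\sum(y_k - \psi)^2/n$, so $l_P(\psi) = -(n/2)\log\{\sum(y_k - \psi)^2/n\}$ up to a constant.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
y = rng.normal(loc=10.0, scale=2.0, size=50)
n = y.size

def profile_loglik(psi):
    # Variance maximized out for each fixed psi, up to an additive constant.
    return -0.5 * n * np.log(np.mean((y - psi)**2))

psi_hat = y.mean()                 # overall maximum likelihood estimate
c = 0.05
k = chi2.ppf(1 - c, df=1)          # upper c point of chi-squared, d_psi = 1

# Scan a grid and keep the values of psi inside the confidence set (6.68).
grid = np.linspace(psi_hat - 3, psi_hat + 3, 2001)
lp = np.array([profile_loglik(p) for p in grid])
inside = grid[2 * (profile_loglik(psi_hat) - lp) <= k]
print(inside.min(), inside.max())  # approximate 95% limits for psi
```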

6.4.3 More on profile likelihood

The possibility of obtaining tests and confidence sets from the profile log likelihood $l_P(\psi)$ stems from the relation between the curvature of $l_P(\psi)$ at its maximum and the corresponding properties of the initial log likelihood $l(\psi, \lambda)$. To see this relation, let $\nabla_\psi$ and $\nabla_\lambda$ denote the $d_\psi \times 1$ and $d_\lambda \times 1$ operations of partial differentiation with respect to $\psi$ and $\lambda$ respectively, and let $D_\psi$ denote total differentiation of any function of $\psi$ and $\hat\lambda_\psi$ with respect to $\psi$. Then, by the definition of total differentiation,

$$D_\psi^T l_P(\psi) = \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \{(D_\psi \hat\lambda_\psi^T)\,\nabla_\lambda l(\psi, \hat\lambda_\psi)\}^T. \qquad (6.69)$$

Now apply $D_\psi$ again to get the Hessian matrix of the profile likelihood in the form

$$
\begin{aligned}
D_\psi D_\psi^T l_P(\psi) = {} & \nabla_\psi \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \{\nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\,(\nabla_\psi \hat\lambda_\psi^T)\}^T \\
& + \{\nabla_\lambda^T l(\psi, \hat\lambda_\psi)\}\{\nabla_\psi \nabla_\psi^T \hat\lambda_\psi^T\} + (\nabla_\psi \hat\lambda_\psi^T)\,\nabla_\lambda \nabla_\psi^T l(\psi, \hat\lambda_\psi) \\
& + (\nabla_\psi \hat\lambda_\psi^T)\{\nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\,(\nabla_\psi \hat\lambda_\psi^T)^T\}.
\end{aligned}
\qquad (6.70)
$$

The maximum likelihood estimate $\hat\lambda_\psi$ satisfies for all $\psi$ the equation

$\nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0$. Differentiate totally with respect to $\psi$ to give

$$\nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi) + (D_\psi \hat\lambda_\psi^T)\,\nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0. \qquad (6.71)$$

Thus three of the terms in (6.70) are equal except for sign, and the third term is zero in the light of the definition of $\hat\lambda_\psi$. Thus, eliminating $D_\psi \hat\lambda_\psi$, we have that the formal observed information matrix, calculated as minus the Hessian matrix of $l_P(\psi)$ evaluated at $\hat\psi$, is

$$\hat{\jmath}_{P,\psi\psi} = \hat{\jmath}_{\psi\psi} - \hat{\jmath}_{\psi\lambda}\hat{\jmath}_{\lambda\lambda}^{-1}\hat{\jmath}_{\lambda\psi} = \hat{\jmath}_{\psi\psi\cdot\lambda}, \qquad (6.72)$$

where the two expressions on the right-hand side of (6.72) are calculated from $l(\psi, \lambda)$. Thus the information matrix for $\psi$ evaluated from the profile likelihood is the same as that evaluated via the full information matrix of all parameters.

This argument takes an especially simple form when bothψandλare scalar parameters.
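The scalar case is easy to check numerically. The sketch below (an illustration with invented data, not from the book) fits a gamma model with shape $\psi = a$ and rate $\lambda = b$, which are not orthogonal, and compares minus the second derivative of $l_P$ at its maximum with the Schur complement $\hat{\jmath}_{\psi\psi} - \hat{\jmath}_{\psi\lambda}\hat{\jmath}_{\lambda\lambda}^{-1}\hat{\jmath}_{\lambda\psi}$ of the full observed information, both obtained by finite differences:

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
y = rng.gamma(shape=2.0, scale=1.0, size=200)
n, s, slog = y.size, y.sum(), np.log(y).sum()

def loglik(a, b):
    return n * a * np.log(b) - n * gammaln(a) + (a - 1) * slog - b * s

def profile_loglik(a):
    return loglik(a, n * a / s)     # rate maximized out: b_hat(a) = n*a/s

a_hat = minimize_scalar(lambda a: -profile_loglik(a),
                        bounds=(0.1, 20), method='bounded').x
b_hat = n * a_hat / s

h = 1e-4
def d2(f, x):                       # central second difference
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# Full observed information at (a_hat, b_hat) by finite differences.
j_aa = -d2(lambda a: loglik(a, b_hat), a_hat)
j_bb = -d2(lambda b: loglik(a_hat, b), b_hat)
j_ab = -(loglik(a_hat + h, b_hat + h) - loglik(a_hat + h, b_hat - h)
         - loglik(a_hat - h, b_hat + h) + loglik(a_hat - h, b_hat - h)) / (4 * h**2)

j_schur = j_aa - j_ab**2 / j_bb     # Schur complement, as in (6.72)
j_profile = -d2(profile_loglik, a_hat)
print(j_schur, j_profile)           # agree to finite-difference accuracy
```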

6.4.4 Parameter orthogonality

An interesting special case arises when $i_{\lambda\psi} = 0$, so that approximately $j_{\lambda\psi} = 0$.

The parameters are then said to be orthogonal. In particular, this implies that the corresponding maximum likelihood estimates are asymptotically independent and, by (6.71), that $D_\psi \hat\lambda_\psi = 0$ and, by symmetry, that $D_\lambda \hat\psi_\lambda = 0$. In nonorthogonal cases, if $\psi$ changes by $O(1/\sqrt{n})$, then $\hat\lambda_\psi$ changes by $O_p(1/\sqrt{n})$; for orthogonal parameters, however, the change is $O_p(1/n)$. This property may be compared with that of orthogonality of factors in a balanced experimental design. There the point estimates of the main effect of one factor, being contrasts of marginal means, are not changed by assuming, say, that the main effects of the other factor are null. That is, the $O_p(1/n)$ term in the above discussion is in fact zero.

There are a number of advantages to having orthogonal or nearly orthogonal parameters, especially component parameters of interest. Independent errors of estimation may ease interpretation, stability of the estimates of one parameter under changing assumptions about another can give added security to conclusions, and convergence of numerical algorithms may be speeded. Nevertheless, so far as parameters of interest are concerned, subject-matter interpretability has primacy.

Example 6.4. Mixed parameterization of the exponential family. Consider a full exponential family problem in which the canonical parameter $\phi$ and the canonical statistic $s$ are partitioned as $(\phi_1, \phi_2)$ and $(s_1, s_2)$ respectively, thought of as column vectors. Suppose that $\phi_2$ is replaced by $\eta_2$, the corresponding component of the mean parameter $\eta = \nabla k(\phi)$, where $k(\phi)$ is the cumulant generating function occurring in the standard form for the family. Then $\phi_1$ and $\eta_2$ are orthogonal.


To prove this, we find the Jacobian matrix of the transformation from $(\phi_1, \eta_2)$ to $(\phi_1, \phi_2)$ in the form

$$\partial(\phi_1, \eta_2)/\partial(\phi_1, \phi_2) = \begin{pmatrix} I & 0 \\ \nabla_1 \nabla_2^T k(\phi) & \nabla_2 \nabla_2^T k(\phi) \end{pmatrix}. \qquad (6.73)$$

Here $\nabla_l$ denotes partial differentiation with respect to $\phi_l$ for $l = 1, 2$.

Combination with (6.51) proves the required result. Thus, in the analysis of the $2 \times 2$ contingency table the difference of column means is orthogonal to the log odds ratio; in a normal distribution the mean and variance are orthogonal.
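For the normal case the orthogonality can be seen directly: with $v = \sigma^2$, the cross derivative of the log density is $\partial^2 l/\partial\mu\,\partial v = -(y - \mu)/v^2$, whose expectation is zero. A quick Monte Carlo check (mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, v = 2.0, 3.0
y = rng.normal(loc=mu, scale=np.sqrt(v), size=10**6)

# Estimate of -E(d2 l / dmu dv) = E{(Y - mu)/v**2}, the (mu, v) element of
# the expected information per observation.
cross_info = np.mean((y - mu) / v**2)
print(cross_info)   # close to zero: mean and variance are orthogonal
```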

Example 6.5. Proportional hazards Weibull model. For the Weibull distribution with density

$$\gamma\rho(\rho y)^{\gamma - 1}\exp\{-(\rho y)^\gamma\} \qquad (6.74)$$

and survivor function, or one minus the cumulative distribution function,

$$\exp\{-(\rho y)^\gamma\}, \qquad (6.75)$$

the hazard function, being the ratio of the two, is $\gamma\rho(\rho y)^{\gamma - 1}$.

Suppose that $Y_1, \ldots, Y_n$ are independent random variables with Weibull distributions all with the same $\gamma$. Suppose that there are explanatory variables $z_1, \ldots, z_n$ such that the hazard is proportional to $e^{\beta z}$. This is achieved by writing the value of the parameter $\rho$ corresponding to $Y_k$ in the form $\exp\{(\alpha + \beta z_k)/\gamma\}$.

Here without loss of generality we take $\sum z_k = 0$. In many applications $z$ and $\beta$ would be vectors but here, for simplicity, we take one-dimensional explanatory variables. The log likelihood is

$$\sum\{(\log\gamma + \alpha + \beta z_k) + (\gamma - 1)\log y_k - \exp(\alpha + \beta z_k)\,y_k^\gamma\}. \qquad (6.76)$$

Direct evaluation now shows that, in particular,

$$E\left(\frac{\partial^2 l}{\partial\alpha\,\partial\beta}\right) = 0, \qquad (6.77)$$

$$E\left(\frac{\partial^2 l}{\partial\beta\,\partial\gamma}\right) = -\frac{\beta\sum z_k^2}{\gamma}, \qquad (6.78)$$

$$E\left(\frac{\partial^2 l}{\partial\gamma\,\partial\alpha}\right) = -\frac{0.5772 + \alpha}{\gamma}, \qquad (6.79)$$

where Euler's constant, 0.5772, arises from the integral

$$\int_0^\infty e^{-v}\,v\log v\,dv. \qquad (6.80)$$

Now locally near $\beta = 0$ the information elements involving $\beta$ are zero or small, implying local orthogonality of $\beta$ to the other parameters and in particular to $\gamma$. Thus not only are the errors of estimating $\beta$ almost uncorrelated with those of the other parameters but, more importantly in some respects, the value of $\hat\beta_\gamma$ will change only slowly with $\gamma$. In some applications this may mean that an analysis based on the exponential distribution, $\gamma = 1$, is relatively insensitive to that assumption, at least so far as the value of the maximum likelihood estimate of $\beta$ is concerned.
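To make this concrete, here is a sketch (not from the book; data, sample size and starting values are invented) that fits the log likelihood (6.76) by direct maximization and then re-maximizes over $(\alpha, \beta)$ with $\gamma$ held at several fixed values, illustrating how slowly $\hat\beta_\gamma$ changes with $\gamma$ when $\beta$ is small:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)
z -= z.mean()                            # centre so that sum(z_k) = 0, as in the text
alpha, beta, gamma = 0.2, 0.1, 1.5       # beta small: near local orthogonality
rho = np.exp((alpha + beta * z) / gamma)
y = rng.weibull(gamma, size=n) / rho     # Y has survivor function exp{-(rho*y)^gamma}

def negloglik(params):
    a, b, log_g = params
    g = np.exp(log_g)                    # keeps gamma > 0 during the search
    eta = a + b * z
    return -np.sum(np.log(g) + eta + (g - 1) * np.log(y) - np.exp(eta) * y**g)

fit = minimize(negloglik, x0=[0.0, 0.0, 0.0], method='Nelder-Mead')
print(fit.x[0], fit.x[1], np.exp(fit.x[2]))   # estimates of alpha, beta, gamma

# Hold gamma fixed at several values and re-maximize over (alpha, beta) only:
# beta-hat_gamma varies little with gamma, as the local orthogonality suggests.
for g_fixed in (1.0, 1.5, 2.0):
    fit_g = minimize(lambda ab: negloglik([ab[0], ab[1], np.log(g_fixed)]),
                     x0=[0.0, 0.0], method='Nelder-Mead')
    print(g_fixed, fit_g.x[1])
```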
