Mathematical Statistics
Class notes
Paulo Soares
December 2022
Table of contents

Formulae
1 Principles of data reduction
 1.1 Introduction
 1.2 Sufficiency
 1.3 Exponential families of distributions
 1.4 Sufficiency in restricted models
 1.5 Sufficiency and Fisher's information
 1.6 Some principles of data reduction
2 Point estimation
 2.1 Estimators and estimates
 2.2 The search for the best estimator
 2.3 Methods of finding estimators
3 Hypothesis testing
 3.1 Introduction
 3.2 Uniformly most powerful tests
 3.3 Likelihood ratio tests
4 Set estimation
 4.1 Methods of finding set estimators
 4.2 Optimal confidence sets
5 The Bayesian choice
 5.1 There's no theorem like Bayes' theorem
 5.2 Parametric Bayesian statistics
 5.3 The main characteristics of Bayesian statistics
 5.4 The prior distribution
6 Bayesian inference
 6.1 Summarizing posterior inference
 6.2 Prediction
 6.3 Computation
Formulae

Special functions

Indicator function
I_A(x) = 1 if x ∈ A; 0 if x ∉ A

Gamma function
Γ(x) = ∫_0^{+∞} t^{x−1} e^{−t} dt, x > 0
Γ(x + 1) = xΓ(x), x > 0;  Γ(n) = (n − 1)!, n ∈ ℕ

Beta function
B(x, y) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt, x, y > 0;  B(x, y) = Γ(x)Γ(y)/Γ(x + y)

Discrete distributions

Discrete uniform distribution
Uniform({a, …, a + n − 1}), a ∈ ℤ, n ∈ ℕ
f(x) = (1/n) I_S(x), S = {a, …, a + n − 1}
E[X] = a + (n − 1)/2,  Var[X] = (n² − 1)/12,  M(t) = e^{at}(1 − e^{nt}) / (n(1 − e^t))

Binomial distribution
Binomial(n, θ), n ∈ ℕ, θ ∈ ]0, 1[
E[X] = nθ,  Var[X] = nθ(1 − θ),  M(t) = (θe^t + (1 − θ))^n

Hypergeometric distribution
Hypergeometric(N, M, n), N ∈ ℕ, M ∈ {1, …, N}, n ∈ {1, …, N}
f(x) = C(M, x) C(N − M, n − x) / C(N, n) · I_S(x), S = {max{0, n − N + M}, …, min{n, M}}
E[X] = nM/N,  Var[X] = n (M/N) ((N − M)/N) ((N − n)/(N − 1))

Negative binomial distribution
NegativeBinomial(r, θ), r ∈ ℕ, θ ∈ ]0, 1[
f(x) = C(x − 1, r − 1) θ^r (1 − θ)^{x−r} I_S(x), S = {r, r + 1, …}
E[X] = r/θ,  Var[X] = r(1 − θ)/θ²,  M(t) = (θe^t / (1 − (1 − θ)e^t))^r
Geometric(θ) ≡ NegativeBinomial(1, θ)

Poisson distribution
Poisson(λ), λ ∈ ℝ⁺
f(x) = e^{−λ} λ^x / x! · I_S(x), S = ℕ₀
E[X] = Var[X] = λ,  M(t) = e^{λ(e^t − 1)}

Continuous distributions

Continuous uniform distribution
Uniform(α, β), α, β ∈ ℝ, α < β
f(x) = 1/(β − α) · I_S(x), S = [α, β]
E[X] = (α + β)/2,  Var[X] = (β − α)²/12,  M(t) = (e^{βt} − e^{αt}) / ((β − α)t), t ≠ 0

Gamma distribution
Gamma(α, β), α, β ∈ ℝ⁺
f(x) = (β^α / Γ(α)) x^{α−1} e^{−βx} I_S(x), S = ℝ₀⁺
E[X] = α/β,  Var[X] = α/β²,  M(t) = (β/(β − t))^α, t < β
χ²(n) ≡ Gamma(n/2, 1/2), n ∈ ℕ;  Exponential(λ) ≡ Gamma(1, λ)

Normal distribution
Normal(μ, σ²), μ ∈ ℝ, σ² ∈ ℝ⁺
f(x) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}
E[X] = μ,  Var[X] = σ²,  M(t) = e^{μt + σ²t²/2}

Lognormal distribution
Lognormal(μ, σ²), μ ∈ ℝ, σ² ∈ ℝ⁺
f(x) = (1/(x√(2πσ²))) e^{−(log x − μ)²/(2σ²)} I_S(x), S = ℝ⁺
E[X] = e^{μ + σ²/2},  Var[X] = (e^{σ²} − 1) e^{2μ + σ²},  E[X^k] = e^{kμ + k²σ²/2}

Beta distribution
Beta(α, β), α, β ∈ ℝ⁺
f(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} I_S(x), S = [0, 1]
E[X] = α/(α + β),  Var[X] = αβ/((α + β)²(α + β + 1))
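These closed-form moments can be double-checked numerically. The snippet below is an illustrative sketch (not part of the original notes) that recomputes E[X] and Var[X] for the Binomial(n, θ) entry by direct enumeration of the pmf; the function name and the values n = 10, θ = 0.3 are arbitrary choices.

```python
from math import comb

# Check the Binomial(n, θ) mean and variance formulas by enumerating
# the pmf f(x) = C(n, x) θ^x (1 − θ)^(n − x) over the support {0, ..., n}.
def binomial_moments(n, theta):
    pmf = [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean)**2 * p for x, p in enumerate(pmf))
    return mean, var

mean, var = binomial_moments(n=10, theta=0.3)
# Table values: E[X] = nθ = 3.0 and Var[X] = nθ(1 − θ) = 2.1
assert abs(mean - 3.0) < 1e-9
assert abs(var - 2.1) < 1e-9
```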
1 Principles of data reduction
1.1 Introduction
Data obtained from the observation of a random vector X taking values in 𝒳 (sample space)

Assumption of a probability model for X, (𝒳, 𝒜, ℱ), where 𝒜 is a σ-algebra defined in 𝒳

ℱ is a family of distributions. What can ℱ be?
ℱ = { … continuous distributions … }
ℱ = {F(x ∣ θ) : θ ∈ Θ}, where Θ is the parameter space

Usual questions:
1. Does the data contradict ℱ or ℱ₀ ⊂ ℱ? ⇝ Hypothesis testing
2. Assessed the validity of ℱθ, can we refine our initial model choice? ⇝ Point or set estimation
3. Can we use ℱ to predict unobserved data? ⇝ Prediction

A frequent scenario:
𝒳 ⊂ ℝⁿ, with n fixed
X is a random sample from some unknown F ∈ ℱ = {F(x ∣ θ) : θ ∈ Θ} ⇝ Parametric inference

What is the sample information about F (or θ)?
1. The observed values x = (x_1, …, x_n);
2. The sampling distribution
f(x ∣ θ) = ∏_{i=1}^n f(x_i ∣ θ).

How do we use the data?

Frequentist statistics rely heavily on the use of … statistics
T(X) : 𝒳 ⟶ ℝᵏ, 1 ≤ k ≤ n
used as:
point estimators;
a starting point to find pivotal quantities;
test statistics …

Since, usually, k ≪ n, a statistic provides a reduction of the data and, potentially, some loss of information.

Do we always lose information by using any statistic?
1.2 Sufficiency

Definition
A statistic T = T(X) is said sufficient for θ if and only if the distribution of X ∣ T = t does not depend on θ.

Does a sufficient statistic always exist?

Example
Let (X_1, …, X_n) be a random sample from ℱ = {F(x ∣ θ) : θ ∈ Θ}.
1. (X_1, …, X_n) is a (trivial) sufficient statistic for θ;
2. (X_{(1)}, …, X_{(n)}) is also a sufficient statistic for θ.

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Determine whether T_1 = X_1 + X_n and T_2 = ∑_{i=1}^n X_i are sufficient statistics for θ.

How to find sufficient statistics?

f(x ∣ t, θ) = f(x, t ∣ θ)/f(t ∣ θ) = { 0, x : T(x) ≠ t;  f(x ∣ θ)/f(t ∣ θ), x : T(x) = t },  ∀t

So, T is a sufficient statistic for θ if and only if f(x ∣ θ)/f(t ∣ θ) does not depend on θ, ∀x ∈ 𝒳.
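The defining property — that the conditional distribution of the sample given T does not depend on θ — can be illustrated by enumeration for the Bernoulli exercise above. This is a sketch only; n = 4 and the two values of θ are arbitrary choices.

```python
from itertools import product

# For a Bernoulli(θ) sample, the conditional distribution of the sample
# given T = Σ x_i should be uniform over the C(n, t) arrangements with
# sum t, whatever θ is — the defining property of sufficiency.
def conditional_dist(theta, n, t):
    samples = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = [theta**sum(x) * (1 - theta)**(n - sum(x)) for x in samples]
    total = sum(probs)
    return [p / total for p in probs]

d1 = conditional_dist(0.2, n=4, t=2)
d2 = conditional_dist(0.9, n=4, t=2)
# Same conditional distribution for both θ: uniform over the 6 samples with t = 2
assert all(abs(a - b) < 1e-12 for a, b in zip(d1, d2))
assert all(abs(p - 1 / 6) < 1e-12 for p in d1)
```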
Factorization theorem
A statistic T(X) is sufficient for θ if and only if
f(x ∣ θ) = g(T(x), θ) h(x), ∀x ∈ 𝒳, ∀θ ∈ Θ,
for some non-negative functions g and h.

Your turn
Find a sufficient statistic for each of the following models:
a. ℱ₁ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}
b. ℱ₂ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}

Are all sufficient statistics equal?

Definition
Π_T = {π_t}_{t ∈ T(𝒳)}, where π_t = {x ∈ 𝒳 : T(x) = t}, is the partition of 𝒳 induced by T(X).

Your turn
Let T(X) be a statistic and Π_T its partition of 𝒳. Consider another statistic U(X) that is some function of T(X).
Discuss how Π_T and Π_U can be related.

Definition
T(X) is equivalent to U(X) if and only if Π_T = Π_U.
Definition
Given two statistics, T(X) and U(X), Π_T is nested in Π_U if and only if
∀π ∈ Π_T ∃π* ∈ Π_U : π ⊂ π*.

Example
Let (X_1, …, X_n), n > 2, be a random sample from ℱ and consider the statistics:
T_0 = (X_1, …, X_n),  T_1 = (X_{(1)}, …, X_{(n)}),  T_2 = (X_{(1)}, X_{(n)}),  T_3 = X_{(n)}

If T(X) is sufficient for θ what characterizes Π_T?

Note
Sufficient statistic ≡ sufficient partition

Theorem
If T(X) is sufficient for θ then any other statistic U(X) such that T = g(U) is also sufficient for θ.

Definition
A statistic is called a minimal sufficient statistic if and only if it is sufficient and a function of any other sufficient statistic.
Lehmann & Scheffé's method
If there exists a statistic T(X) such that, ∀(x, y) ∈ 𝒳∖Π₀, where Π₀ = {x ∈ 𝒳 : f(x ∣ θ) = 0, ∀θ ∈ Θ},
f(y ∣ θ) = c(x, y) f(x ∣ θ), ∀θ ∈ Θ ⟺ T(x) = T(y)
for some positive function c, then T(X) is a minimal sufficient statistic.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
Find a minimal sufficient statistic for (μ, σ²).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Show that the sufficient statistic found before is minimal.

Definition
A statistic whose distribution does not depend on θ is called an ancillary statistic.

Example
1. T(X) = c is a trivial ancillary statistic in any model.
2. Let (X_1, …, X_n) be a random sample from any member of the location-scale family of distributions. Any statistic that is a function of
((X_1 − λ)/δ, …, (X_n − λ)/δ)
is an ancillary statistic.
Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Show that
(R, C) = (X_{(n)} − X_{(1)}, (X_{(1)} + X_{(n)})/2)
is a minimal sufficient statistic but R is an ancillary statistic.

Definition
A statistic T(X) is called a complete statistic if and only if
E[h(T) ∣ θ] = 0 ⟹ P(h(T) = 0) = 1, ∀θ ∈ Θ.

Note
1. Any non-constant function of a complete statistic cannot be an ancillary statistic.
2. T is complete ⟹ U = g(T) is also complete.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Show that T(X) = ∑_{i=1}^n X_i is a sufficient and complete statistic.

Theorem
A sufficient and complete statistic is also a minimal sufficient statistic.

Your turn
1. Let (X_1, …, X_n) be a random sample from ℱ = {U(0, θ) : θ ∈ ℝ⁺}. Show that T(X) = X_{(n)} is a sufficient and complete statistic.
2. Show that for any model with a minimal sufficient statistic that is not complete there are no sufficient and complete statistics.
3. Let (X_1, …, X_n) be a random sample from ℱ = {U(θ, 2θ) : θ ∈ ℝ⁺}. Find a minimal sufficient statistic for θ. Is that statistic complete?

Basu's theorem
A sufficient and complete statistic is independent of any ancillary statistic.

Note
Useful to prove independence without finding a joint distribution, but … proving that a statistic is complete is often a difficult problem.
1.3 Exponential families of distributions

Definition
A family of distributions ℱ = {F(x ∣ θ) : θ ∈ Θ ⊂ ℝᵖ} is a k-parametric exponential family if
f(x ∣ θ) = c(θ) h(x) exp{∑_{j=1}^k n_j(θ) T_j(x)}
for some non-negative functions c and h.

Note
The support of f cannot depend on θ.

α_j = n_j(θ), j = 1, …, k – natural parameters
A = {α ∈ ℝᵏ : θ(α) ∈ Θ} – natural parameter space

Example
Show that ℱ = {Geo(θ) : θ ∈ ]0, 1[} is a uniparametric exponential family.

Some properties
1. An exponential family is closed under random sampling.
2. For any member of a k-parametric exponential family there is always a k-dimensional sufficient statistic regardless of the sample size n.
3. Any model with a k-dimensional sufficient statistic for any sample size n is a k-parametric exponential family if its support does not depend on the parameter.

Theorem
The sufficient statistic for a k-parametric exponential family is complete if the natural parameter space contains an open set of ℝᵏ.

Your turn
1. Show that the minimal sufficient statistic for ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺} found before is also complete.
2. Investigate the model ℱ = {N(θ, θ²) : θ ∈ ℝ∖{0}} regarding completeness.
Example
Back to Basu's theorem …
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
Show that X̄ and S² are independent.

1.4 Sufficiency in restricted models

Quite often some parameters in a model are just auxiliary and there is no real inferential interest in them – the so-called nuisance parameters.

Consider ℱθ with θ = (γ, ϕ) ∈ Γ × Φ = Θ.

Definition
T = T(X) is said specific sufficient (ancillary) for γ if T is sufficient (ancillary) for γ, ∀ϕ ∈ Φ.

T is specific sufficient for γ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ) f(t ∣ γ, ϕ)

T is specific ancillary for ϕ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ, γ) f(t ∣ γ)

Definition
T(X) is said partial sufficient for γ if it is specific sufficient for γ and specific ancillary for ϕ.

T is partial sufficient for γ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ) f(t ∣ γ)

Your turn
1. Check if X̄ and S² are partial sufficient for μ and σ² in ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
2. Let (X_i, Y_i), i = 1, …, n, be a random sample from (X, Y) such that X ∣ ϕ ∼ Poi(ϕ) and Y ∣ X, γ ∼ Bi(x, γ). With T = ∑_{i=1}^n X_i and U = ∑_{i=1}^n Y_i, show that T is partial sufficient for ϕ but U is not specific sufficient nor specific ancillary for γ.
1.5 Sufficiency and Fisher's information

Definition
A uniparametric model with the following properties is called a regular model:
1. The model is identifiable, that is, θ → Fθ is a one-to-one transformation, and Θ is an open interval;
2. The support of the model does not depend on θ;
3. f(x ∣ θ) is differentiable in Θ and ∂f/∂θ is integrable in 𝒳;
4. The operators ∂/∂θ and ∫ dx can be permuted.

Definition
S(x ∣ θ) = ∂ log f(x ∣ θ)/∂θ
is called the score function, with S(x ∣ θ) = 0 in 𝒳₀ = {x : f(x ∣ θ) = 0, ∀θ}.

Note
S(x ∣ θ) measures the variation of log f(x ∣ θ) in Θ for a given x.

Definition
I_X(θ) = Var[S(X ∣ θ)]
is called the Fisher's information measure.

Note
A measure of dispersion of S(X ∣ θ) in X is taken as a measure of sample information.

Theorem
For a regular model we have E[S(X ∣ θ)] = 0 and, therefore,
I_X(θ) = E[S²(X ∣ θ)].

Some properties
1. For θ = g(ϕ) with g differentiable,
I_X(ϕ) = I_X(g(ϕ)) (g′(ϕ))².
2. For a regular model with 0 < I_X(θ) < +∞,
I_X(θ) = −E[∂² log f(X ∣ θ)/∂θ²].
3. If X = (X_1, X_2) with X_1 and X_2 independent then I_X(θ) = I_{X_1}(θ) + I_{X_2}(θ) and, consequently, for a random sample I_X(θ) = nI(θ), where I(θ) represents the Fisher's information for a single observation.
4. For any statistic T = T(X) we have I_T(θ) ≤ I_X(θ), and I_T(θ) = I_X(θ) if and only if T is sufficient.

Note
The previous definitions and properties can be naturally generalized for θ ∈ ℝᵏ, with k > 1 (the Fisher's information matrix).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Bi(k, θ) : θ ∈ ]0, 1[}.
Determine the Fisher's information measure for θ and for ϕ = θ/(1 − θ).
1.6 Some principles of data reduction

Definition
An experiment is defined as E = (X, ℱθ, θ), where ℱθ = {F(x ∣ θ) : θ ∈ Θ}.
Any inference about θ given X = x obtained through the experiment E is denoted by Ev(E, x).

The sufficiency principle
Consider an experiment E and a sufficient statistic T(X) for θ. Then,
∀(x, y) ∈ 𝒳² : T(x) = T(y) ⟹ Ev(E, x) = Ev(E, y).

Note
1. This principle seems quite reasonable and, therefore, appealing.
2. However, it is very model-dependent and so it requires firm belief in the model.
3. Common frequentist statistical procedures violate this principle. For example, model checking using residuals is usually not based on sufficient statistics.

The conditionality principle
Consider a set of k experiments with a common parameter θ, E_i = (X_i, {f_i(x_i ∣ θ)}, θ), i = 1, …, k, from which an experiment E*_J is randomly selected with probabilities p_j = P(J = j), j = 1, …, k, independent of θ. Then,
Ev(E*_J, {j, x_j}) = Ev(E_j, x_j).

Note
1. In practice, this principle is well accepted.
2. However, on theoretical grounds it raises some difficulties when used together with the sufficiency principle.

Definition
Let (X_1, …, X_n) be a random sample from ℱ = {F(x ∣ θ) : θ ∈ Θ}. The function
L(θ ∣ x) ≡ f(x ∣ θ)
is called the likelihood function.

The likelihood function is another data-reduction device that is widely used in Statistics:
Maximum likelihood estimation;
Hypothesis testing (the likelihood ratio statistic);
Fisher's fiducial inference.

The likelihood principle
Consider two experiments E_1 and E_2 with a common parameter θ. Then, ∀x_1 ∈ 𝒳_1, ∀x_2 ∈ 𝒳_2:
L_1(θ ∣ x_1) = c(x_1, x_2) L_2(θ ∣ x_2), ∀θ ∈ Θ ⟹ Ev(E_1, x_1) = Ev(E_2, x_2).
Note
Many frequentist statistical procedures violate this principle and so it faces strong rejection.

Example
Two experimenters, E_1 and E_2, wanted to test H_0 : θ = 1/2 against H_1 : θ > 1/2 in ℱ = {Ber(θ) : θ ∈ ]0, 1[}. Both observed 9 successes and 3 failures from two different experiments:
E_1 observed X_1 = "number of successes in 12 independent trials", X_1 ∣ θ ∼ Bin(12, θ);
E_2 observed X_2 = "number of independent trials until 3 failures", X_2 ∣ θ ∼ NegBin(3, 1 − θ).

Note that the likelihood functions are proportional:
L_1(θ ∣ X_1 = 9) = C(12, 9) θ^9 (1 − θ)^3,  L_2(θ ∣ X_2 = 12) = C(11, 2) (1 − θ)^3 θ^9.

The p-values are given by:
p_1 = P(X_1 ≥ 9 ∣ θ = 1/2) = 1 − F_{Bin(12,1/2)}(8) = 0.073
p_2 = P(X_2 ≥ 12 ∣ θ = 1/2) = 1 − F_{NegBin(3,1/2)}(11) = 0.033

H_0 could be rejected at a significance level α = 0.05 by E_2 but not by E_1!

Theorem
The likelihood principle implies the sufficiency principle.

Birnbaum's theorem
The sufficiency and the conditionality principles are jointly equivalent to the likelihood principle.
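The two p-values can be reproduced with exact tail sums; a short sketch (0.073 and 0.033 are the rounded values of 299/4096 and 67/2048):

```python
from math import comb

# E1: X1 ~ Bin(12, 1/2); p1 = P(X1 >= 9).
p1 = sum(comb(12, k) for k in range(9, 13)) / 2**12
# E2: X2 = trials until the 3rd failure, with θ = P(success) = 1/2;
# X2 >= 12 iff at most 2 failures occur in the first 11 trials.
p2 = sum(comb(11, k) for k in range(0, 3)) / 2**11

assert round(p1, 3) == 0.073   # 299/4096
assert round(p2, 3) == 0.033   # 67/2048
```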
Note
1. The exact conditions under which this theorem is valid remain controversial to this day (Evans, M. (2013), "What does the proof of Birnbaum's theorem prove?", http://arxiv.org/abs/1302.5468).
2. To this day, frequentist Statistics has failed to establish itself on solid and universally accepted principles.
2 Point estimation
2.1 Estimators and estimates
The problem
Given a sample x ∈ 𝒳 from some F ∈ ℱ = {F(x ∣ θ) : θ ∈ Θ}, determine a plausible value for θ (or some function ψ(θ)) from x.

The procedure
Select a statistic T(X) such that Θ ⊂ T(𝒳), use T(X) as an estimator and call any value T(x) an estimate of θ.

Note
For any given parameter, many estimators can always be proposed. How to choose between them?
It is not possible to evaluate estimates. So, some properties of any candidate estimator must be known beforehand.

Some properties of estimators

Finite sample properties
1. Minimal sufficiency
Nice, but … as we have seen it can be difficult or even impossible to assure.
2. Bias
Bias_{ψ(θ)}[T ∣ θ] = E[T ∣ θ] − ψ(θ)
May be too restrictive and is not enough …
3. Efficiency
MSE_{ψ(θ)}[T ∣ θ] = E[(T − ψ(θ))² ∣ θ] = Var[T ∣ θ] + Bias²_{ψ(θ)}[T ∣ θ]
The MSE represents a compromise between:
accuracy (measured by Bias_{ψ(θ)}[T ∣ θ]) and
precision (measured by Var[T ∣ θ]).
The relative efficiency of T and U:
e(T, U ∣ θ) = MSE_{ψ(θ)}[T ∣ θ] / MSE_{ψ(θ)}[U ∣ θ]

Asymptotic properties
1. Consistency
T is consistent for ψ(θ) ⟺ {T_n}_{n∈ℕ} ⟶ ψ(θ) as n → +∞ (in probability)
Theorem
If lim_{n→+∞} Bias_{ψ(θ)}[T_n ∣ θ] = lim_{n→+∞} Var[T_n ∣ θ] = 0 then T is consistent for ψ(θ).
2. Asymptotic efficiency
3. Asymptotic normality

Robustness
How well an estimator behaves in the presence of deviations from the postulated model for the data.

Your turn
Compare S_n² and S_{n−1}² as estimators of the parameter σ² from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
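For this comparison, both MSEs have standard closed forms under normality (they follow from the χ² distribution of (n − 1)S²_{n−1}/σ²), and the divide-by-n estimator has uniformly smaller MSE. A numerical sketch of that standard computation:

```python
# Closed-form MSEs of the two variance estimators under normality:
#   MSE[S²_{n−1}] = 2σ⁴/(n − 1)      (unbiased)
#   MSE[S²_n]     = (2n − 1)σ⁴/n²    (biased, but less variable)
def mse_unbiased(n, sigma2):
    return 2 * sigma2**2 / (n - 1)

def mse_divide_by_n(n, sigma2):
    return (2 * n - 1) * sigma2**2 / n**2

# The divide-by-n estimator has smaller MSE for every n > 1,
# since (2n − 1)(n − 1) = 2n² − 3n + 1 < 2n².
assert all(mse_divide_by_n(n, 1.0) < mse_unbiased(n, 1.0) for n in range(2, 200))
```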
2.2 The search for the best estimator

There is no such thing as an overall best estimator!

To establish different criteria we will:
1. choose some property to compare estimators (usually the MSE);
2. restrict the search to some suitable class of estimators or models.

Definition
U(ψ(θ)) = {T(X) : E[T(X) ∣ θ] = ψ(θ), ∀θ ∈ Θ}
is the class of unbiased estimators of ψ(θ).

Definition
LU(ψ(θ)) = {T(X) ∈ U(ψ(θ)) : T(X) = ∑_{i=1}^n a_i X_i}
is the class of linear unbiased estimators of ψ(θ).

Best linear unbiased estimators

Definition
An estimator T is said the best linear unbiased estimator (BLUE) of ψ(θ) if T ∈ LU(ψ(θ)) and Var[T ∣ θ] ≤ Var[W ∣ θ], ∀θ ∈ Θ, ∀W ∈ LU(ψ(θ)).

Note
To find a BLUE we need to solve a constrained optimization problem.

Example
Let X_1, …, X_n be uncorrelated variables with common and finite mean μ and variance σ². Find the BLUE of μ.

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Find the BLUE of θ and compare it with C = (X_{(1)} + X_{(n)})/2.
Note
Cov[X_{(1)}, X_{(n)}] = 1/((n + 1)²(n + 2))

Best unbiased estimators

Hereafter we will consider only regular models.

Theorem
If T is an estimator of ψ(θ) with a differentiable bias b(θ) then
MSE_{ψ(θ)}[T ∣ θ] ≥ [ψ′(θ) + b′(θ)]²/(nI(θ)) + b²(θ).
Fréchet-Cramér-Rao inequality
Let T be an estimator in U(ψ(θ)), where ψ is a differentiable function. Then
Var[T ∣ θ] ≥ [ψ′(θ)]²/(nI(θ)) = R(ψ(θ)).

Note
We will call R(ψ(θ)) the FCR lower bound.

Definition
An estimator T is said the best unbiased estimator (BUE) of ψ(θ) if T ∈ U(ψ(θ)) and Var[T ∣ θ] = R(ψ(θ)), ∀θ ∈ Θ.

Note
An estimator T is said the asymptotically best unbiased estimator of ψ(θ) if T ∈ U(ψ(θ)) and
Var[T ∣ θ]/R(ψ(θ)) ⟶ 1 as n → +∞, ∀θ ∈ Θ.

When is it possible to attain the FCR lower bound?

Theorem
Let T be an estimator in U(ψ(θ)). T is the BUE of ψ(θ) if and only if
S(X ∣ θ) = g(θ)[T(X) − ψ(θ)],
for some function g.

Corollary
Let T be an estimator of ψ(θ) with a differentiable bias b(θ). MSE_{ψ(θ)}[T ∣ θ] equals the FCR lower bound if and only if
S(X ∣ θ) = g(θ)[T(X) − (ψ(θ) + b(θ))],
for some function g.

Corollary
The FCR lower bound is attainable for some ψ(θ) if and only if T is a sufficient statistic of a uniparametric exponential family.

Note
{BUE} ⊂ {sufficient statistics}
∄ one-dimensional sufficient statistic ⟹ ∄ BUE

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ ℝ⁺}.
Is there a BUE of λ? For which parametric functions does a BUE exist?

Your turn
Let (X_1, …, X_n) be a random sample from a mixture of the distributions Exp(1/θ) and Gamma(2, 1/θ) with weights 1/(θ + 1) and θ/(θ + 1). Find the BUE of
ψ(θ) = (3 + 2θ)(2 + θ)/(θ + 1).
Uniform minimum variance unbiased estimators

Definition
An estimator T is said the uniform minimum variance unbiased estimator (UMVUE) of ψ(θ) if T ∈ U(ψ(θ)) and
Var[T ∣ θ] ≤ Var[W ∣ θ], ∀W ∈ U(ψ(θ)), ∀θ ∈ Θ.

Rao-Blackwell's theorem
Let T be a sufficient statistic for θ, W ∈ U(ψ(θ)) and U = E[W ∣ T]. Then,
1. E[U ∣ θ] = ψ(θ);
2. Var[U ∣ θ] ≤ Var[W ∣ θ], ∀θ ∈ Θ.

Note
The equality in 2. happens if and only if W is a function of T;
If the UMVUE exists it must be a function of a sufficient statistic;
Rao-Blackwell's theorem does not provide the UMVUE. However …

Lehmann-Scheffé's theorem
If a model admits a complete sufficient statistic T and there is an unbiased estimator for ψ(θ), then there is a unique UMVUE for ψ(θ) that is a function of T.

So, we have two possible strategies to find the UMVUE:
1. Apply Rao-Blackwell's theorem using an unbiased estimator and a complete sufficient statistic;
2. Directly find an unbiased function of a complete sufficient statistic.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Find the UMVUE of θ².

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ ℝ⁺}, with n > 2.
a. Let U be the UMVUE of λ and consider the class of estimators of λ defined by kU/(n − 1), with k ∈ ℕ. In this class, find the estimator with uniform minimum MSE. What does this say about the UMVU criterion?
b. Determine the UMVUE of 1/λ² and show that it is the asymptotically BUE.

Summary
For regular models: BUE ⟶ UMVUE ⟶ BLUE
For non-regular models: UMVUE ⟶ BLUE
Without a complete sufficient statistic it is usually difficult to find an UMVUE.
The restriction to unbiased estimators is still a limitation.

2.3 Methods of finding estimators

In simple situations, some ingenuity combined with the previous criteria can provide good estimators;
For more complex models we need more methodical ways of estimating parameters.
Method of moments
For a random sample (X_1, …, X_n) from ℱ = {F(x ∣ θ) : θ = (θ_1, …, θ_k)}, equate the first k (at least) sample moments to the corresponding population moments,
M_r = (1/n) ∑_{i=1}^n X_i^r = g_r(θ) = E[X^r] = μ_r, r = 1, …, k.
Solving this system of equations for θ we find the method of moments estimators
θ̂_r = h_r(M_1, …, M_k), r = 1, …, k.

The properties of these estimators can be derived from the properties of the sample moments, which are:
1. Unbiased and consistent estimators of the population moments:
E[M_r ∣ θ] = μ_r,  Var[M_r ∣ θ] = (μ_{2r} − μ_r²)/n
2. Asymptotically normal. Using the CLT,
√n (M_r − μ_r) ⟶_D N(0, μ_{2r} − μ_r²)

Note
1. Under general conditions, the method of moments estimators are consistent, asymptotically unbiased and normal. However, their efficiency can usually be improved.
2. They can be helpful as a source of reasonable starting values for other numerical methods of estimation.
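As an illustration of the method, for the Gamma(α, β) model in the rate parameterization of the Formulae section the first two moment equations can be inverted explicitly: μ_1 = α/β and μ_2 − μ_1² = α/β² give α̂ = M_1²/(M_2 − M_1²) and β̂ = M_1/(M_2 − M_1²). A sketch (the sample below is an arbitrary toy dataset):

```python
# Method-of-moments estimators for Gamma(α, β), rate parameterization:
#   α̂ = M1² / (M2 − M1²),   β̂ = M1 / (M2 − M1²).
def gamma_mom(xs):
    n = len(xs)
    m1 = sum(xs) / n                      # first sample moment
    m2 = sum(x * x for x in xs) / n       # second sample moment
    s2 = m2 - m1 * m1                     # sample analogue of α/β²
    return m1 * m1 / s2, m1 / s2          # (alpha_hat, beta_hat)

# Sanity check on data whose sample moments are known exactly:
# m1 = 2.5, m2 = 7.5, s2 = 1.25  →  α̂ = 5.0, β̂ = 2.0
a, b = gamma_mom([1.0, 2.0, 3.0, 4.0])
assert abs(a - 5.0) < 1e-9 and abs(b - 2.0) < 1e-9
```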
M-estimators

Definition
The solutions of
θ̂ = arg min_{θ∈Θ} ∑_{i=1}^n g(X_i, θ)
are called M-estimators of θ (the M stands for "Maximum likelihood type").

The function g may be chosen to provide estimators with desirable properties, in particular, regarding robustness.

Particular cases
1. Least squares estimation in linear models where g is defined as the square of a residual, such as
g(Y_i, β) = (Y_i − (β_0 + β_1 x_i))²,
in a simple linear regression model.
2. Maximum likelihood estimation with
g(X_i, θ) = − log f(X_i ∣ θ).

Maximum likelihood estimation

Definition
θ̂ ∈ Θ : L(θ̂ ∣ X) ≥ L(θ ∣ X), ∀θ ∈ Θ
is the maximum likelihood estimate of θ.

If the likelihood function is differentiable then θ̂_ML may be any solution of S(X ∣ θ) = 0 such that
∂S(X ∣ θ)/∂θ |_{θ=θ̂_ML} < 0.

Two possible exceptions cannot be forgotten:
1. the global maximum can be in the boundary of Θ;
2. the global maximum can occur in a point where the likelihood function has no derivative.
Your turn
Find the MLE of θ based on a random sample (X_1, …, X_n) from each of the following models:
a. {Ber(θ) : θ ∈ ]0, 1[};
b. {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.

Note
1. The MLE may not exist and may not be unique.
2. Boundary problems: consider the closure of the parameter space.
3. Numerical methods are usually required.

Sufficiency
If T is a sufficient statistic can we claim that the MLE is a function of T?
Example
For the uniform model in the last exercise,
T = sin²(X_{(2)})(X_{(n)} − 1/2) + cos²(X_{(2)})(X_{(1)} + 1/2)
is a MLE of θ that is not a function of the sufficient statistic (X_{(1)}, X_{(n)}).

Efficiency
In a regular model, if the BUE exists then it must be a MLE.

Invariance
For any g : Θ ⊂ ℝᵏ → ℝᵖ with p ≤ k we have
ĝ_ML(θ) = g(θ̂_ML).
L-estimators

Definition
Linear combinations of the order statistics, T = ∑_{i=1}^n a_i X_{(i)}, are called L-estimators.

Intuitive estimators for location and scale parameters and for quantiles.
Often provide robust estimators.
Non-asymptotic properties are hard to investigate.
Examples: sample mean, trimmed means, range, mid-range and quantiles.
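Two of the listed L-estimators, sketched to show the contrast in robustness (the data vector is an arbitrary toy example with one gross outlier):

```python
# L-estimators are weighted sums of the order statistics X_(1) <= ... <= X_(n).
def trimmed_mean(xs, k):
    """Drop the k smallest and k largest order statistics, average the rest."""
    ys = sorted(xs)[k:len(xs) - k]
    return sum(ys) / len(ys)

def midrange(xs):
    """(X_(1) + X_(n)) / 2: weight 1/2 on each extreme order statistic."""
    ys = sorted(xs)
    return (ys[0] + ys[-1]) / 2

data = [0.1, 0.4, 0.5, 0.6, 9.9]                  # one gross outlier
assert abs(trimmed_mean(data, 1) - 0.5) < 1e-9     # robust: outlier dropped
assert abs(midrange(data) - 5.0) < 1e-9            # not robust: driven by the outlier
```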
Your turn
1. Let (X_1, …, X_n), with n > 1, be a random sample from ℱ = {Poi(λ) : λ ∈ ℝ⁺}. Show that the UMVUE of P(X > 0 ∣ λ) exists but that it is not the BUE.
2. Let (X_1, …, X_n) be a random sample from ℱ = {N(0, σ²) : σ² ∈ ℝ⁺}. Find the UMVUE of σ and check if it is also the BUE.
3. Based on a random sample of size n from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺} we want to estimate the relative precision measured by the square of the reciprocal of the coefficient of variation. Find the MLE and the UMVUE of that measure.
3 Hypothesis testing
3.1 Introduction
Definition
A statistical hypothesis is a statement about the distribution of some observable quantity in a population.

The parametric case
Given a sample x ∈ 𝒳 from some F in ℱθ, test a null hypothesis
H_0 : θ ∈ Θ_0
against an alternative hypothesis
H_1 : θ ∈ Θ_1,
in which Θ_0 ∩ Θ_1 = ∅ and Θ_0 ∪ Θ_1 = Θ.

A hypothesis testing procedure is a partition defined in 𝒳 that leads to one of two possible decisions:
1. to reject H_0
2. to not reject H_0

Note
H_0 is usually chosen in a conservative way – it should only be rejected if strong evidence against it is found.
As a consequence, the two hypotheses are not permutable.

The procedure
Define a statistic T : 𝒳 → [0, 1] such that
T(x) = P(Reject H_0 ∣ x) = { 1, x ∈ 𝒳_c;  ε, x ∈ 𝒳_r;  0, otherwise },
with 𝒳_c, 𝒳_r ⊂ 𝒳, 𝒳_c ∩ 𝒳_r = ∅ and ε ∈ [0, 1[.

Note
𝒳_c is called the critical region;
If ε ≠ 0 then the test is called a randomized test.

The evaluation of tests
Any test can lead to one of two possible wrong decisions:
Type I error: rejecting H_0 given that H_0 is true
Type II error: not rejecting H_0 given that H_0 is false

Definition
The function
β_T(θ) = P(Reject H_0 ∣ θ) = E[T(X) ∣ θ]
is called the power function of the test T.

β_T(θ) = { P(Type I error), θ ∈ Θ_0;  1 − P(Type II error), θ ∈ Θ_1 }
Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, 1) : μ ∈ ℝ}.
To test the hypothesis
H_0 : μ = 0 against H_1 : μ ≠ 0
at a significance level α we can use the known test T(X) = I_{C_α}(X) with
C_α = {x ∈ 𝒳 : |√n x̄| > Φ^{−1}(1 − α/2)}.
Obtain the power function for this test.

The ideal test
β_T(θ) = I_{Θ_1}(θ)
It is impossible to get arbitrarily close to this ideal test since the two probabilities of error cannot both be made arbitrarily small. So, in practice some compromise is required.
For instance, we could try to minimize
k β_T(θ ∣ H_0) + (1 − k)(1 − β_T(θ ∣ H_1)), k ∈ ]0, 1[,
for some fixed k.
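The power function requested in the exercise above has the closed form β(μ) = 1 − Φ(Φ⁻¹(1 − α/2) − √n μ) + Φ(−Φ⁻¹(1 − α/2) − √n μ), since X̄ ∼ N(μ, 1/n). A numerical sketch using only the error function (the values n = 25 and α = 0.05 are arbitrary):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, n, z):
    """β(μ) for the two-sided z-test with critical value z = Φ⁻¹(1 − α/2)."""
    return 1 - Phi(z - sqrt(n) * mu) + Phi(-z - sqrt(n) * mu)

z = 1.959963984540054                               # Φ⁻¹(0.975), i.e. α = 0.05
assert abs(power(0.0, n=25, z=z) - 0.05) < 1e-6     # size at μ = 0 equals α
assert power(0.5, n=25, z=z) > 0.5                  # power grows away from H0
```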
3.2 Uniformly most powerful tests

Definition
For a test T,
α = sup_{θ∈Θ_0} β_T(θ)
is said the size of the test, and T is said a size-α test.
Any test U with sup_{θ∈Θ_0} β_U(θ) ≤ α is said a level-α test.

Definition
A test T is an uniformly most powerful (UMP) test in a class C of tests if
β_T(θ) ≥ β_{T*}(θ), ∀θ ∈ Θ_1,
where T* is any other test in C.

Note
If an UMP α-size test exists and S is a sufficient statistic for θ then there is an UMP α-size test that is a function of S.
Neyman-Pearson lemma
To test H_0 : θ = θ_0 against H_1 : θ = θ_1 in the model ℱ = {f(x ∣ θ) : θ ∈ {θ_0, θ_1}}, the test
T(x) = { 1, f(x ∣ θ_1) > k f(x ∣ θ_0);  ε, f(x ∣ θ_1) = k f(x ∣ θ_0);  0, f(x ∣ θ_1) < k f(x ∣ θ_0) },
for some k > 0 and ε ∈ [0, 1[, is the essentially unique MP test of its level.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ_0²) : μ ∈ {μ_0, μ_1}}.
Find the MP α-size test for the hypothesis
H_0 : μ = μ_0 against H_1 : μ = μ_1 (> μ_0).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ {λ_0, λ_1}}, with λ_1 > λ_0.
Find the MP α-size test for the hypothesis
H_0 : λ = λ_0 against H_1 : λ = λ_1.
Your turn
Let (X_1, …, X_10) be a random sample from ℱ = {Poi(λ) : λ ∈ {1, 2}}.
Find the MP test with size 0.05 for the hypothesis
H_0 : λ = 2 against H_1 : λ = 1.

Example
As we have seen, the test
T(x) = { 1, Z_0 > Φ^{−1}(1 − α);  0, otherwise },
with Z_0 = √n (X̄ − μ_0)/σ_0, is the MP α-size test for H_0 : μ = μ_0 against H_1 : μ = μ_1 (> μ_0).
Note that the MP test does not depend on the actual value of μ_1. What can be concluded from that?

Example
The former test is the UMP α-size test for H_0 : μ = μ_0 against H_1 : μ > μ_0. Can it also be the UMP test for
H_0 : μ ≤ μ_0 against H_1 : μ > μ_0?
Let's look at
β_T(μ) = 1 − Φ(Φ^{−1}(1 − α) − √n (μ − μ_0)/σ_0).
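For the Poisson "Your turn" above, the Neyman-Pearson test rejects for small values of T = ∑X_i (the likelihood ratio is decreasing in T), with T ∼ Poi(20) under H_0, and randomization is needed to attain size 0.05 exactly. A sketch of that computation (the search loop is one way to locate the boundary point):

```python
from math import exp, factorial

def poi_pmf(t, lam=20.0):
    """Poisson(λ) pmf; under H0, T = ΣX_i ~ Poi(10 × 2) = Poi(20)."""
    return exp(-lam) * lam**t / factorial(t)

alpha = 0.05
# Find the boundary c with P(T < c) <= α < P(T <= c), then randomize at c:
# ε = (α − P(T < c)) / P(T = c), so that P(T < c) + ε·P(T = c) = α exactly.
c, below = 0, 0.0
while below + poi_pmf(c) <= alpha:
    below += poi_pmf(c)
    c += 1
eps = (alpha - below) / poi_pmf(c)
size = below + eps * poi_pmf(c)

assert 0 <= eps < 1
assert abs(size - alpha) < 1e-12
```

Here the rejection region works out to T ≤ 12 with randomization at T = 13.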
Your turn
Consider now the problem of testing
H_0 : μ = μ_0 against H_1 : μ ≠ μ_0
in ℱ = {N(μ, σ_0²) : μ ∈ ℝ}.
a. Check if the test in the previous example can still be the UMP α-size test for these new hypotheses.
b. Is there an UMP test for this case?
Hint: consider the MP test for H_0 : μ = μ_0 against H_1 : μ = μ_1 (< μ_0).

Where can we find UMP tests?

Definition
A model ℱ = {f(x ∣ θ) : θ ∈ Θ ⊂ ℝ} has a monotone likelihood ratio (MLR) in a real-valued statistic S if for all θ_2 > θ_1 the likelihood ratio f(x ∣ θ_2)/f(x ∣ θ_1) is a nondecreasing function of S in {x ∈ 𝒳 : f(x ∣ θ_1) > 0 or f(x ∣ θ_2) > 0}.

Note
1. We will consider c/0 = +∞ for c > 0.
2. If the likelihood ratio is a nonincreasing function of S then the model has a MLR in −S.

Example
Let ℱ be a member of the uniparametric exponential family of distributions.
Under which conditions can this model have a MLR?

Your turn
Show that the model ℱ = {U(0, θ) : θ ∈ ℝ⁺} has a MLR in some statistic.
Lemma
If a model ℱθ has a MLR in a statistic S and g is a nondecreasing function then E[g(S) ∣ θ] is a nondecreasing function of θ.

Karlin-Rubin's theorem
If the model ℱ = {f(x ∣ θ) : θ ∈ Θ ⊂ ℝ} has a monotone likelihood ratio in a real-valued statistic S(X) then the test
T(x) = { 1, S(x) > k;  ε, S(x) = k;  0, S(x) < k },
for some k and ε ∈ [0, 1[, is an UMP test of its size to test H_0 : θ ≤ θ_0 against H_1 : θ > θ_0.

Note
For the hypotheses H_0 : θ ≥ θ_0 against H_1 : θ < θ_0 we can use the reparametrization λ = −θ.

Your turn
Consider a random sample of size 20 from the model ℱ = {U(0, θ) : θ ∈ ℝ⁺} and find the UMP test with size 0.05 for the hypothesis H_0 : θ ≥ 1 against H_1 : θ < 1.

3.3 Likelihood ratio tests

For many problems there is no UMP test among all tests of a given size!
We could keep applying the same optimality criterion in restricted classes of tests:
1. unbiased tests (UMPU)
3. . . .
Definition
For the hypotheses H_0 : θ ∈ Θ_0 against H_1 : θ ∈ Θ_1 with Θ_0 ∪ Θ_1 = Θ, the test
T(X) = I_{[0,k[}(Λ(X)),
where k ∈ [0, 1] and
Λ(X) = sup_{θ∈Θ_0} L(θ ∣ X) / sup_{θ∈Θ} L(θ ∣ X),
is called a likelihood ratio test.

Note
1. 0 ≤ Λ(x) ≤ 1, ∀x ∈ 𝒳.
2. We can also write Λ(X) = L(θ̂_0 ∣ X)/L(θ̂ ∣ X), where θ̂ is the MLE of θ and θ̂_0 is the MLE of θ restricted to Θ_0.
3. If S is a sufficient statistic for θ then Λ(X) can be written as a function of S.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ_0²) : μ ∈ ℝ}.
Find the LR test for the hypothesis H_0 : μ = μ_0 against H_1 : μ ≠ μ_0.
Your turn
Let (X_1, …, X_n) be a random sample from
ℱ = {f(x ∣ θ) = e^{−(x−θ)} I_{[θ,+∞[}(x) : θ ∈ ℝ}.
Find the LR test for the hypothesis H_0 : θ < θ_0 against H_1 : θ ≥ θ_0.

It is possible to construct LR tests provided:
1. the distribution of Λ(X) under H_0 is known, or
2. the LR test can equivalently be written as a function of a statistic S(X) whose distribution under H_0 is known.
Otherwise it may be difficult to find a LR test!
Wilks' LR test statistic
Under some regularity conditions we have that
W(X) = −2 log Λ(X) ⟶_D χ²(r) under H_0, with r = dim(Θ) − dim(Θ_0),
and
T(X) = I_{[c,+∞[}(W(X)), with c = F^{−1}_{χ²(r)}(1 − α),
is the α-size Wilks asymptotic LR test.

Your turn
1. Let (X_1, X_2) be a random sample from the model {Poi(θ), θ > 0}. We want to test H_0 : θ ≤ 1 against H_1 : θ > 1.
a. Show that the size of the test T_1(X_1, X_2) = 1 − I_{{0,1}}(X_1) is approximately 0.26.
b. Define and interpret the test
T_2(X_1, X_2) = E[T_1(X_1, X_2) ∣ X_1 + X_2 = t].
Should we prefer T_2 over T_1?
c. Identify the UMP test with size α = E[T_1(X_1, X_2) ∣ θ = 1].

2. Let X_1, …, X_n be a random sample from the model {Ber(θ), 0 < θ < 1}.
a. Define the LRT for the hypotheses H_0 : θ ≤ θ_0 and H_1 : θ > θ_0.
b. Is the previous test a UMP test? Justify.
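The Wilks statistic is easy to compute once the restricted and unrestricted MLEs are available. A sketch for a point-null variant H_0 : θ = θ_0 of the Bernoulli exercise (so r = 1; the data s = 62 successes in n = 100 is an arbitrary illustration, and 3.841 is the 0.95 quantile of χ²(1)):

```python
from math import log

# Bernoulli log-likelihood with s successes in n trials (0 < θ < 1, 0 < s < n).
def loglik(theta, s, n):
    return s * log(theta) + (n - s) * log(1 - theta)

# W = −2 log Λ = 2[ℓ(θ̂) − ℓ(θ0)], with unrestricted MLE θ̂ = s/n.
def wilks(s, n, theta0):
    theta_hat = s / n
    return 2 * (loglik(theta_hat, s, n) - loglik(theta0, s, n))

w = wilks(s=62, n=100, theta0=0.5)
assert w >= 0            # Λ ≤ 1 always, so W ≥ 0
assert w > 3.841         # 62/100 successes reject θ0 = 1/2 at the asymptotic 5% level
```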