Mathematical Statistics
Class notes
Paulo Soares
December 2022
Table of contents

Formulae
1 Principles of data reduction
 1.1 Introduction
 1.2 Sufficiency
 1.3 Exponential families of distributions
 1.4 Sufficiency in restricted models
 1.5 Sufficiency and Fisher's information
 1.6 Some principles of data reduction
2 Point estimation
 2.1 Estimators and estimates
 2.2 The search for the best estimator
 2.3 Methods of finding estimators
3 Hypothesis testing
 3.1 Introduction
 3.2 Uniformly most powerful tests
 3.3 Likelihood ratio tests
4 Set estimation
 4.1 Methods of finding set estimators
 4.2 Optimal confidence sets
5 The Bayesian choice
 5.1 There's no theorem like Bayes' theorem
 5.2 Parametric Bayesian statistics
 5.3 The main characteristics of Bayesian statistics
 5.4 The prior distribution
6 Bayesian inference
 6.1 Summarizing posterior inference
 6.2 Prediction
 6.3 Computation
Formulae

Special functions

Indicator function
I_A(x) = 1 if x ∈ A; 0 if x ∉ A

Gamma function
Γ(x) = ∫_0^{+∞} t^{x−1} e^{−t} dt, x > 0
Γ(x + 1) = xΓ(x), x > 0;  Γ(n) = (n − 1)!, n ∈ ℕ

Beta function
B(x, y) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt, x, y > 0;  B(x, y) = Γ(x)Γ(y)/Γ(x + y)

Discrete distributions

Discrete uniform distribution
Uniform({a, …, a + n − 1}), a ∈ ℤ, n ∈ ℕ
f(x) = (1/n) I_S(x), S = {a, …, a + n − 1}
E[X] = a + (n − 1)/2,  Var[X] = (n² − 1)/12,  M(t) = e^{at}(1 − e^{nt}) / (n(1 − e^t))

Binomial distribution
Binomial(n, θ), n ∈ ℕ, θ ∈ ]0, 1[
E[X] = nθ,  Var[X] = nθ(1 − θ),  M(t) = (θe^t + (1 − θ))^n

Hypergeometric distribution
Hypergeometric(N, M, n), N ∈ ℕ, M ∈ {1, …, N}, n ∈ {1, …, N}
f(x) = C(M, x) C(N − M, n − x) / C(N, n) · I_S(x), S = {max{0, n − N + M}, …, min{n, M}}
E[X] = nM/N,  Var[X] = n (M/N) ((N − M)/N) ((N − n)/(N − 1))

Negative binomial distribution
NegativeBinomial(r, θ), r ∈ ℕ, θ ∈ ]0, 1[
f(x) = C(x − 1, r − 1) θ^r (1 − θ)^{x−r} I_S(x), S = {r, r + 1, …}
E[X] = r/θ,  Var[X] = r(1 − θ)/θ²,  M(t) = (θe^t / (1 − (1 − θ)e^t))^r
Geometric(θ) ≡ NegativeBinomial(1, θ)

Poisson distribution
Poisson(λ), λ ∈ ℝ⁺
f(x) = e^{−λ} λ^x / x! · I_S(x), S = ℕ₀
E[X] = Var[X] = λ,  M(t) = e^{λ(e^t − 1)}

Continuous distributions

Continuous uniform distribution
Uniform(α, β), α, β ∈ ℝ, α < β
f(x) = 1/(β − α) · I_S(x), S = [α, β]
E[X] = (α + β)/2,  Var[X] = (β − α)²/12,  M(t) = (e^{βt} − e^{αt}) / ((β − α)t), t ≠ 0

Gamma distribution
Gamma(α, β), α, β ∈ ℝ⁺
f(x) = (β^α / Γ(α)) x^{α−1} e^{−βx} I_S(x), S = ℝ₀⁺
E[X] = α/β,  Var[X] = α/β²,  M(t) = (β/(β − t))^α, t < β
χ²(n) ≡ Gamma(n/2, 1/2), n ∈ ℕ;  Exponential(λ) ≡ Gamma(1, λ)

Normal distribution
Normal(μ, σ²), μ ∈ ℝ, σ² ∈ ℝ⁺
f(x) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}
E[X] = μ,  Var[X] = σ²,  M(t) = e^{μt + σ²t²/2}

Lognormal distribution
Lognormal(μ, σ²), μ ∈ ℝ, σ² ∈ ℝ⁺
f(x) = (1/(x√(2πσ²))) e^{−(log x − μ)²/(2σ²)} I_S(x), S = ℝ⁺
E[X] = e^{μ + σ²/2},  Var[X] = (e^{σ²} − 1) e^{2μ + σ²},  E[X^k] = e^{kμ + k²σ²/2}

Beta distribution
Beta(α, β), α, β ∈ ℝ⁺
f(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} I_S(x), S = [0, 1]
E[X] = α/(α + β),  Var[X] = αβ/((α + β)²(α + β + 1))
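These closed-form moments can be double-checked numerically. The snippet below is an illustrative sketch (not part of the original notes) that recomputes E[X] and Var[X] for the Binomial(n, θ) entry by direct enumeration of the pmf; the function name and the values n = 10, θ = 0.3 are arbitrary choices.

```python
from math import comb

# Check the Binomial(n, θ) mean and variance formulas by enumerating
# the pmf f(x) = C(n, x) θ^x (1 − θ)^(n − x) over the support {0, ..., n}.
def binomial_moments(n, theta):
    pmf = [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean)**2 * p for x, p in enumerate(pmf))
    return mean, var

mean, var = binomial_moments(n=10, theta=0.3)
# Table values: E[X] = nθ = 3.0 and Var[X] = nθ(1 − θ) = 2.1
assert abs(mean - 3.0) < 1e-9
assert abs(var - 2.1) < 1e-9
```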
1 Principles of data reduction
1.1 Introduction
Data obtained from the observation of a random vector X taking values in 𝒳 (sample space)

Assumption of a probability model for X, (𝒳, 𝒜, ℱ), where 𝒜 is a σ-algebra defined in 𝒳

ℱ is a family of distributions. What can ℱ be?
ℱ = { … continuous distributions … }
ℱ = {F(x ∣ θ) : θ ∈ Θ}, where Θ is the parameter space

Usual questions:
1. Does the data contradict ℱ or ℱ₀ ⊂ ℱ? ⇝ Hypothesis testing
2. Assessed the validity of ℱθ, can we refine our initial model choice? ⇝ Point or set estimation
3. Can we use ℱ to predict unobserved data? ⇝ Prediction

A frequent scenario:
𝒳 ⊂ ℝⁿ, with n fixed
X is a random sample from some unknown F ∈ ℱ = {F(x ∣ θ) : θ ∈ Θ} ⇝ Parametric inference

What is the sample information about F (or θ)?
1. The observed values x = (x_1, …, x_n);
2. The sampling distribution
f(x ∣ θ) = ∏_{i=1}^n f(x_i ∣ θ).

How do we use the data?

Frequentist statistics rely heavily on the use of … statistics
T(X) : 𝒳 ⟶ ℝᵏ, 1 ≤ k ≤ n
used as:
point estimators;
a starting point to find pivotal quantities;
test statistics …

Since, usually, k ≪ n, a statistic provides a reduction of the data and, potentially, some loss of information.

Do we always lose information by using any statistic?
1.2 Sufficiency

Definition
A statistic T = T(X) is said sufficient for θ if and only if the distribution of X ∣ T = t does not depend on θ.

Does a sufficient statistic always exist?

Example
Let (X_1, …, X_n) be a random sample from ℱ = {F(x ∣ θ) : θ ∈ Θ}.
1. (X_1, …, X_n) is a (trivial) sufficient statistic for θ;
2. (X_{(1)}, …, X_{(n)}) is also a sufficient statistic for θ.

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Determine whether T_1 = X_1 + X_n and T_2 = ∑_{i=1}^n X_i are sufficient statistics for θ.

How to find sufficient statistics?

f(x ∣ t, θ) = f(x, t ∣ θ)/f(t ∣ θ) = { 0, x : T(x) ≠ t;  f(x ∣ θ)/f(t ∣ θ), x : T(x) = t },  ∀t

So, T is a sufficient statistic for θ if and only if f(x ∣ θ)/f(t ∣ θ) does not depend on θ, ∀x ∈ 𝒳.
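The defining property — that the conditional distribution of the sample given T does not depend on θ — can be illustrated by enumeration for the Bernoulli exercise above. This is a sketch only; n = 4 and the two values of θ are arbitrary choices.

```python
from itertools import product

# For a Bernoulli(θ) sample, the conditional distribution of the sample
# given T = Σ x_i should be uniform over the C(n, t) arrangements with
# sum t, whatever θ is — the defining property of sufficiency.
def conditional_dist(theta, n, t):
    samples = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = [theta**sum(x) * (1 - theta)**(n - sum(x)) for x in samples]
    total = sum(probs)
    return [p / total for p in probs]

d1 = conditional_dist(0.2, n=4, t=2)
d2 = conditional_dist(0.9, n=4, t=2)
# Same conditional distribution for both θ: uniform over the 6 samples with t = 2
assert all(abs(a - b) < 1e-12 for a, b in zip(d1, d2))
assert all(abs(p - 1 / 6) < 1e-12 for p in d1)
```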
Factorization theorem
A statistic T(X) is sufficient for θ if and only if
f(x ∣ θ) = g(T(x), θ) h(x), ∀x ∈ 𝒳, ∀θ ∈ Θ,
for some non-negative functions g and h.

Your turn
Find a sufficient statistic for each of the following models:
a. ℱ₁ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}
b. ℱ₂ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}

Are all sufficient statistics equal?

Definition
Π_T = {π_t}_{t ∈ T(𝒳)}, where π_t = {x ∈ 𝒳 : T(x) = t}, is the partition of 𝒳 induced by T(X).

Your turn
Let T(X) be a statistic and Π_T its partition of 𝒳. Consider another statistic U(X) that is some function of T(X).
Discuss how Π_T and Π_U can be related.

Definition
T(X) is equivalent to U(X) if and only if Π_T = Π_U.
Definition
Given two statistics, T(X) and U(X), Π_T is nested in Π_U if and only if
∀π ∈ Π_T ∃π* ∈ Π_U : π ⊂ π*.

Example
Let (X_1, …, X_n), n > 2, be a random sample from ℱ and consider the statistics:
T_0 = (X_1, …, X_n),  T_1 = (X_{(1)}, …, X_{(n)}),  T_2 = (X_{(1)}, X_{(n)}),  T_3 = X_{(n)}

If T(X) is sufficient for θ what characterizes Π_T?

Note
Sufficient statistic ≡ sufficient partition

Theorem
If T(X) is sufficient for θ then any other statistic U(X) such that T = g(U) is also sufficient for θ.

Definition
A statistic is called a minimal sufficient statistic if and only if it is sufficient and a function of any other sufficient statistic.
Lehmann & Scheffé's method
If there exists a statistic T(X) such that, ∀(x, y) ∈ 𝒳∖Π₀, where Π₀ = {x ∈ 𝒳 : f(x ∣ θ) = 0, ∀θ ∈ Θ},
f(y ∣ θ) = c(x, y) f(x ∣ θ), ∀θ ∈ Θ ⟺ T(x) = T(y)
for some positive function c, then T(X) is a minimal sufficient statistic.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
Find a minimal sufficient statistic for (μ, σ²).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Show that the sufficient statistic found before is minimal.

Definition
A statistic whose distribution does not depend on θ is called an ancillary statistic.

Example
1. T(X) = c is a trivial ancillary statistic in any model.
2. Let (X_1, …, X_n) be a random sample from any member of the location-scale family of distributions. Any statistic that is a function of
((X_1 − λ)/δ, …, (X_n − λ)/δ)
is an ancillary statistic.
Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Show that
(R, C) = (X_{(n)} − X_{(1)}, (X_{(1)} + X_{(n)})/2)
is a minimal sufficient statistic but R is an ancillary statistic.

Definition
A statistic T(X) is called a complete statistic if and only if
E[h(T) ∣ θ] = 0 ⟹ P(h(T) = 0) = 1, ∀θ ∈ Θ.

Note
1. Any non-constant function of a complete statistic cannot be an ancillary statistic.
2. T is complete ⟹ U = g(T) is also complete.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Show that T(X) = ∑_{i=1}^n X_i is a sufficient and complete statistic.

Theorem
A sufficient and complete statistic is also a minimal sufficient statistic.

Your turn
1. Let (X_1, …, X_n) be a random sample from ℱ = {U(0, θ) : θ ∈ ℝ⁺}. Show that T(X) = X_{(n)} is a sufficient and complete statistic.
2. Show that for any model with a minimal sufficient statistic that is not complete there are no sufficient and complete statistics.
3. Let (X_1, …, X_n) be a random sample from ℱ = {U(θ, 2θ) : θ ∈ ℝ⁺}. Find a minimal sufficient statistic for θ. Is that statistic complete?

Basu's theorem
A sufficient and complete statistic is independent of any ancillary statistic.

Note
Useful to prove independence without finding a joint distribution, but … proving that a statistic is complete is often a difficult problem.
1.3 Exponential families of distributions

Definition
A family of distributions ℱ = {F(x ∣ θ) : θ ∈ Θ ⊂ ℝᵖ} is a k-parametric exponential family if
f(x ∣ θ) = c(θ) h(x) exp{∑_{j=1}^k n_j(θ) T_j(x)}
for some non-negative functions c and h.

Note
The support of f cannot depend on θ.

α_j = n_j(θ), j = 1, …, k – natural parameters
A = {α ∈ ℝᵏ : θ(α) ∈ Θ} – natural parameter space

Example
Show that ℱ = {Geo(θ) : θ ∈ ]0, 1[} is a uniparametric exponential family.

Some properties
1. An exponential family is closed under random sampling.
2. For any member of a k-parametric exponential family there is always a k-dimensional sufficient statistic regardless of the sample size n.
3. Any model with a k-dimensional sufficient statistic for any sample size n is a k-parametric exponential family if its support does not depend on the parameter.

Theorem
The sufficient statistic for a k-parametric exponential family is complete if the natural parameter space contains an open set of ℝᵏ.

Your turn
1. Show that the minimal sufficient statistic for ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺} found before is also complete.
2. Investigate the model ℱ = {N(θ, θ²) : θ ∈ ℝ∖{0}} regarding completeness.
Example
Back to Basu's theorem …
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
Show that X̄ and S² are independent.

1.4 Sufficiency in restricted models

Quite often some parameters in a model are just auxiliary and there is no real inferential interest in them – the so-called nuisance parameters.

Consider ℱθ with θ = (γ, ϕ) ∈ Γ × Φ = Θ.

Definition
T = T(X) is said specific sufficient (ancillary) for γ if T is sufficient (ancillary) for γ, ∀ϕ ∈ Φ.

T is specific sufficient for γ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ) f(t ∣ γ, ϕ)

T is specific ancillary for ϕ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ, γ) f(t ∣ γ)

Definition
T(X) is said partial sufficient for γ if it is specific sufficient for γ and specific ancillary for ϕ.

T is partial sufficient for γ:
f(x ∣ γ, ϕ) = f(x ∣ t, ϕ) f(t ∣ γ)

Your turn
1. Check if X̄ and S² are partial sufficient for μ and σ² in ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
2. Let (X_i, Y_i), i = 1, …, n, be a random sample from (X, Y) such that X ∣ ϕ ∼ Poi(ϕ) and Y ∣ X, γ ∼ Bi(x, γ). With T = ∑_{i=1}^n X_i and U = ∑_{i=1}^n Y_i, show that T is partial sufficient for ϕ but U is not specific sufficient nor specific ancillary for γ.
1.5 Sufficiency and Fisher's information

Definition
A uniparametric model with the following properties is called a regular model:
1. The model is identifiable, that is, θ → Fθ is a one-to-one transformation, and Θ is an open interval;
2. The support of the model does not depend on θ;
3. f(x ∣ θ) is differentiable in Θ and ∂f/∂θ is integrable in 𝒳;
4. The operators ∂/∂θ and ∫ dx can be permuted.

Definition
S(x ∣ θ) = ∂ log f(x ∣ θ)/∂θ
is called the score function, with S(x ∣ θ) = 0 in 𝒳₀ = {x : f(x ∣ θ) = 0, ∀θ}.

Note
S(x ∣ θ) measures the variation of log f(x ∣ θ) in Θ for a given x.

Definition
I_X(θ) = Var[S(X ∣ θ)]
is called the Fisher's information measure.

Note
A measure of dispersion of S(X ∣ θ) in X is taken as a measure of sample information.

Theorem
For a regular model we have E[S(X ∣ θ)] = 0 and, therefore,
I_X(θ) = E[S²(X ∣ θ)].

Some properties
1. For θ = g(ϕ) with g differentiable,
I_X(ϕ) = I_X(g(ϕ)) (g′(ϕ))².
2. For a regular model with 0 < I_X(θ) < +∞,
I_X(θ) = −E[∂² log f(X ∣ θ)/∂θ²].
3. If X = (X_1, X_2) with X_1 and X_2 independent then I_X(θ) = I_{X_1}(θ) + I_{X_2}(θ) and, consequently, for a random sample I_X(θ) = nI(θ), where I(θ) represents the Fisher's information for a single observation.
4. For any statistic T = T(X) we have I_T(θ) ≤ I_X(θ), and I_T(θ) = I_X(θ) if and only if T is sufficient.

Note
The previous definitions and properties can be naturally generalized for θ ∈ ℝᵏ, with k > 1 (the Fisher's information matrix).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Bi(k, θ) : θ ∈ ]0, 1[}.
Determine the Fisher's information measure for θ and for ϕ = θ/(1 − θ).
1.6 Some principles of data reduction

Definition
An experiment is defined as E = (X, ℱθ, θ), where ℱθ = {F(x ∣ θ) : θ ∈ Θ}.
Any inference about θ given X = x obtained through the experiment E is denoted by Ev(E, x).

The sufficiency principle
Consider an experiment E and a sufficient statistic T(X) for θ. Then,
∀(x, y) ∈ 𝒳² : T(x) = T(y) ⟹ Ev(E, x) = Ev(E, y).

Note
1. This principle seems quite reasonable and, therefore, appealing.
2. However, it is very model-dependent and so it requires firm belief in the model.
3. Common frequentist statistical procedures violate this principle. For example, model checking using residuals is usually not based on sufficient statistics.

The conditionality principle
Consider a set of k experiments with a common parameter θ, E_i = (X_i, {f_i(x_i ∣ θ)}, θ), i = 1, …, k, from which an experiment E*_J is randomly selected with probabilities p_j = P(J = j), j = 1, …, k, independent of θ. Then,
Ev(E*_J, {j, x_j}) = Ev(E_j, x_j).

Note
1. In practice, this principle is well accepted.
2. However, on theoretical grounds it raises some difficulties when used together with the sufficiency principle.

Definition
Let (X_1, …, X_n) be a random sample from ℱ = {F(x ∣ θ) : θ ∈ Θ}. The function
L(θ ∣ x) ≡ f(x ∣ θ)
is called the likelihood function.

The likelihood function is another data-reduction device that is widely used in Statistics:
Maximum likelihood estimation;
Hypothesis testing (the likelihood ratio statistic);
Fisher's fiducial inference.

The likelihood principle
Consider two experiments E_1 and E_2 with a common parameter θ. Then, ∀x_1 ∈ 𝒳_1, ∀x_2 ∈ 𝒳_2:
L_1(θ ∣ x_1) = c(x_1, x_2) L_2(θ ∣ x_2), ∀θ ∈ Θ ⟹ Ev(E_1, x_1) = Ev(E_2, x_2).
Note
Many frequentist statistical procedures violate this principle and so it faces strong rejection.

Example
Two experimenters, E_1 and E_2, wanted to test H_0 : θ = 1/2 against H_1 : θ > 1/2 in ℱ = {Ber(θ) : θ ∈ ]0, 1[}. Both observed 9 successes and 3 failures from two different experiments:
E_1 observed X_1 = "number of successes in 12 independent trials", X_1 ∣ θ ∼ Bin(12, θ);
E_2 observed X_2 = "number of independent trials until 3 failures", X_2 ∣ θ ∼ NegBin(3, 1 − θ).

Note that the likelihood functions are proportional:
L_1(θ ∣ X_1 = 9) = C(12, 9) θ^9 (1 − θ)^3,  L_2(θ ∣ X_2 = 12) = C(11, 2) (1 − θ)^3 θ^9.

The p-values are given by:
p_1 = P(X_1 ≥ 9 ∣ θ = 1/2) = 1 − F_{Bin(12,1/2)}(8) = 0.073
p_2 = P(X_2 ≥ 12 ∣ θ = 1/2) = 1 − F_{NegBin(3,1/2)}(11) = 0.033

H_0 could be rejected at a significance level α = 0.05 by E_2 but not by E_1!

Theorem
The likelihood principle implies the sufficiency principle.

Birnbaum's theorem
The sufficiency and the conditionality principles are jointly equivalent to the likelihood principle.
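The two p-values can be reproduced with exact tail sums; a short sketch (0.073 and 0.033 are the rounded values of 299/4096 and 67/2048):

```python
from math import comb

# E1: X1 ~ Bin(12, 1/2); p1 = P(X1 >= 9).
p1 = sum(comb(12, k) for k in range(9, 13)) / 2**12
# E2: X2 = trials until the 3rd failure, with θ = P(success) = 1/2;
# X2 >= 12 iff at most 2 failures occur in the first 11 trials.
p2 = sum(comb(11, k) for k in range(0, 3)) / 2**11

assert round(p1, 3) == 0.073   # 299/4096
assert round(p2, 3) == 0.033   # 67/2048
```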
Note
1. The exact conditions under which this theorem is valid remain controversial to this day (Evans, M. (2013), "What does the proof of Birnbaum's theorem prove?", http://arxiv.org/abs/1302.5468).
2. To this day, frequentist Statistics has failed to establish itself on solid and universally accepted principles.
2 Point estimation
2.1 Estimators and estimates
The problem
Given a sample x ∈ 𝒳 from some F ∈ ℱ = {F(x ∣ θ) : θ ∈ Θ}, determine a plausible value for θ (or some function ψ(θ)) from x.

The procedure
Select a statistic T(X) such that Θ ⊂ T(𝒳), use T(X) as an estimator and call any value T(x) an estimate of θ.

Note
For any given parameter, many estimators can always be proposed. How to choose between them?
It is not possible to evaluate estimates. So, some properties of any candidate estimator must be known beforehand.

Some properties of estimators

Finite sample properties
1. Minimal sufficiency
Nice, but … as we have seen it can be difficult or even impossible to assure.
2. Bias
Bias_{ψ(θ)}[T ∣ θ] = E[T ∣ θ] − ψ(θ)
May be too restrictive and is not enough …
3. Efficiency
MSE_{ψ(θ)}[T ∣ θ] = E[(T − ψ(θ))² ∣ θ] = Var[T ∣ θ] + Bias²_{ψ(θ)}[T ∣ θ]
The MSE represents a compromise between:
accuracy (measured by Bias_{ψ(θ)}[T ∣ θ]) and
precision (measured by Var[T ∣ θ]).
The relative efficiency of T and U:
e(T, U ∣ θ) = MSE_{ψ(θ)}[T ∣ θ] / MSE_{ψ(θ)}[U ∣ θ]

Asymptotic properties
1. Consistency
T is consistent for ψ(θ) ⟺ {T_n}_{n∈ℕ} ⟶ ψ(θ) as n → +∞ (in probability)
Theorem
If lim_{n→+∞} Bias_{ψ(θ)}[T_n ∣ θ] = lim_{n→+∞} Var[T_n ∣ θ] = 0 then T is consistent for ψ(θ).
2. Asymptotic efficiency
3. Asymptotic normality

Robustness
How well an estimator behaves in the presence of deviations from the postulated model for the data.

Your turn
Compare S_n² and S_{n−1}² as estimators of the parameter σ² from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺}.
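For this comparison, both MSEs have standard closed forms under normality (they follow from the χ² distribution of (n − 1)S²_{n−1}/σ²), and the divide-by-n estimator has uniformly smaller MSE. A numerical sketch of that standard computation:

```python
# Closed-form MSEs of the two variance estimators under normality:
#   MSE[S²_{n−1}] = 2σ⁴/(n − 1)      (unbiased)
#   MSE[S²_n]     = (2n − 1)σ⁴/n²    (biased, but less variable)
def mse_unbiased(n, sigma2):
    return 2 * sigma2**2 / (n - 1)

def mse_divide_by_n(n, sigma2):
    return (2 * n - 1) * sigma2**2 / n**2

# The divide-by-n estimator has smaller MSE for every n > 1,
# since (2n − 1)(n − 1) = 2n² − 3n + 1 < 2n².
assert all(mse_divide_by_n(n, 1.0) < mse_unbiased(n, 1.0) for n in range(2, 200))
```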
2.2 The search for the best estimator

There is no such thing as an overall best estimator!

To establish different criteria we will:
1. choose some property to compare estimators (usually the MSE);
2. restrict the search to some suitable class of estimators or models.

Definition
U(ψ(θ)) = {T(X) : E[T(X) ∣ θ] = ψ(θ), ∀θ ∈ Θ}
is the class of unbiased estimators of ψ(θ).

Definition
LU(ψ(θ)) = {T(X) ∈ U(ψ(θ)) : T(X) = ∑_{i=1}^n a_i X_i}
is the class of linear unbiased estimators of ψ(θ).

Best linear unbiased estimators

Definition
An estimator T is said the best linear unbiased estimator (BLUE) of ψ(θ) if T ∈ LU(ψ(θ)) and Var[T ∣ θ] ≤ Var[W ∣ θ], ∀θ ∈ Θ, ∀W ∈ LU(ψ(θ)).

Note
To find a BLUE we need to solve a constrained optimization problem.

Example
Let X_1, …, X_n be uncorrelated variables with common and finite mean μ and variance σ². Find the BLUE of μ.

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.
Find the BLUE of θ and compare it with C = (X_{(1)} + X_{(n)})/2.
Note
Cov[X_{(1)}, X_{(n)}] = 1/((n + 1)²(n + 2))

Best unbiased estimators

Hereafter we will consider only regular models.

Theorem
If T is an estimator of ψ(θ) with a differentiable bias b(θ) then
MSE_{ψ(θ)}[T ∣ θ] ≥ [ψ′(θ) + b′(θ)]²/(nI(θ)) + b²(θ).
Fréchet-Cramér-Rao inequality
Let T be an estimator in U(ψ(θ)), where ψ is a differentiable function. Then
Var[T ∣ θ] ≥ [ψ′(θ)]²/(nI(θ)) = R(ψ(θ)).

Note
We will call R(ψ(θ)) the FCR lower bound.

Definition
An estimator T is said the best unbiased estimator (BUE) of ψ(θ) if T ∈ U(ψ(θ)) and Var[T ∣ θ] = R(ψ(θ)), ∀θ ∈ Θ.

Note
An estimator T is said the asymptotically best unbiased estimator of ψ(θ) if T ∈ U(ψ(θ)) and
Var[T ∣ θ]/R(ψ(θ)) ⟶ 1 as n → +∞, ∀θ ∈ Θ.

When is it possible to attain the FCR lower bound?

Theorem
Let T be an estimator in U(ψ(θ)). T is the BUE of ψ(θ) if and only if
S(X ∣ θ) = g(θ)[T(X) − ψ(θ)],
for some function g.

Corollary
Let T be an estimator of ψ(θ) with a differentiable bias b(θ). MSE_{ψ(θ)}[T ∣ θ] equals the FCR lower bound if and only if
S(X ∣ θ) = g(θ)[T(X) − (ψ(θ) + b(θ))],
for some function g.

Corollary
The FCR lower bound is attainable for some ψ(θ) if and only if T is a sufficient statistic of a uniparametric exponential family.

Note
{BUE} ⊂ {sufficient statistics}
∄ one-dimensional sufficient statistic ⟹ ∄ BUE

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ ℝ⁺}.
Is there a BUE of λ? For which parametric functions does a BUE exist?

Your turn
Let (X_1, …, X_n) be a random sample from a mixture of the distributions Exp(1/θ) and Gamma(2, 1/θ) with weights 1/(θ + 1) and θ/(θ + 1). Find the BUE of
ψ(θ) = (3 + 2θ)(2 + θ)/(θ + 1).
Uniform minimum variance unbiased estimators

Definition
An estimator T is said the uniform minimum variance unbiased estimator (UMVUE) of ψ(θ) if T ∈ U(ψ(θ)) and
Var[T ∣ θ] ≤ Var[W ∣ θ], ∀W ∈ U(ψ(θ)), ∀θ ∈ Θ.

Rao-Blackwell's theorem
Let T be a sufficient statistic for θ, W ∈ U(ψ(θ)) and U = E[W ∣ T]. Then,
1. E[U ∣ θ] = ψ(θ);
2. Var[U ∣ θ] ≤ Var[W ∣ θ], ∀θ ∈ Θ.

Note
The equality in 2. happens if and only if W is a function of T;
If the UMVUE exists it must be a function of a sufficient statistic;
Rao-Blackwell's theorem does not provide the UMVUE. However …

Lehmann-Scheffé's theorem
If a model admits a complete sufficient statistic T and there is an unbiased estimator for ψ(θ), then there is a unique UMVUE for ψ(θ) that is a function of T.

So, we have two possible strategies to find the UMVUE:
1. Apply Rao-Blackwell's theorem using an unbiased estimator and a complete sufficient statistic;
2. Directly find an unbiased function of a complete sufficient statistic.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {Ber(θ) : θ ∈ ]0, 1[}.
Find the UMVUE of θ².

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ ℝ⁺}, with n > 2.
a. Let U be the UMVUE of λ and consider the class of estimators of λ defined by kU/(n − 1), with k ∈ ℕ. In this class, find the estimator with uniform minimum MSE. What does this say about the UMVU criterion?
b. Determine the UMVUE of 1/λ² and show that it is the asymptotically BUE.

Summary
For regular models: BUE ⟶ UMVUE ⟶ BLUE
For non-regular models: UMVUE ⟶ BLUE
Without a complete sufficient statistic it is usually difficult to find an UMVUE.
The restriction to unbiased estimators is still a limitation.

2.3 Methods of finding estimators

In simple situations, some ingenuity combined with the previous criteria can provide good estimators;
For more complex models we need more methodical ways of estimating parameters.
Method of moments
For a random sample (X_1, …, X_n) from ℱ = {F(x ∣ θ) : θ = (θ_1, …, θ_k)}, equate the first k (at least) sample moments to the corresponding population moments,
M_r = (1/n) ∑_{i=1}^n X_i^r = g_r(θ) = E[X^r] = μ_r, r = 1, …, k.
Solving this system of equations for θ we find the method of moments estimators
θ̂_r = h_r(M_1, …, M_k), r = 1, …, k.

The properties of these estimators can be derived from the properties of the sample moments, which are:
1. Unbiased and consistent estimators of the population moments:
E[M_r ∣ θ] = μ_r,  Var[M_r ∣ θ] = (μ_{2r} − μ_r²)/n
2. Asymptotically normal. Using the CLT,
√n (M_r − μ_r) ⟶_D N(0, μ_{2r} − μ_r²)

Note
1. Under general conditions, the method of moments estimators are consistent, asymptotically unbiased and normal. However, their efficiency can usually be improved.
2. They can be helpful as a source of reasonable starting values for other numerical methods of estimation.
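As an illustration of the method, for the Gamma(α, β) model in the rate parameterization of the Formulae section the first two moment equations can be inverted explicitly: μ_1 = α/β and μ_2 − μ_1² = α/β² give α̂ = M_1²/(M_2 − M_1²) and β̂ = M_1/(M_2 − M_1²). A sketch (the sample below is an arbitrary toy dataset):

```python
# Method-of-moments estimators for Gamma(α, β), rate parameterization:
#   α̂ = M1² / (M2 − M1²),   β̂ = M1 / (M2 − M1²).
def gamma_mom(xs):
    n = len(xs)
    m1 = sum(xs) / n                      # first sample moment
    m2 = sum(x * x for x in xs) / n       # second sample moment
    s2 = m2 - m1 * m1                     # sample analogue of α/β²
    return m1 * m1 / s2, m1 / s2          # (alpha_hat, beta_hat)

# Sanity check on data whose sample moments are known exactly:
# m1 = 2.5, m2 = 7.5, s2 = 1.25  →  α̂ = 5.0, β̂ = 2.0
a, b = gamma_mom([1.0, 2.0, 3.0, 4.0])
assert abs(a - 5.0) < 1e-9 and abs(b - 2.0) < 1e-9
```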
M-estimators

Definition
The solutions of
θ̂ = arg min_{θ∈Θ} ∑_{i=1}^n g(X_i, θ)
are called M-estimators of θ (the M stands for "Maximum likelihood type").

The function g may be chosen to provide estimators with desirable properties, in particular, regarding robustness.

Particular cases
1. Least squares estimation in linear models where g is defined as the square of a residual, such as
g(Y_i, β) = (Y_i − (β_0 + β_1 x_i))²,
in a simple linear regression model.
2. Maximum likelihood estimation with
g(X_i, θ) = − log f(X_i ∣ θ).

Maximum likelihood estimation

Definition
θ̂ ∈ Θ : L(θ̂ ∣ X) ≥ L(θ ∣ X), ∀θ ∈ Θ
is the maximum likelihood estimate of θ.

If the likelihood function is differentiable then θ̂_ML may be any solution of S(X ∣ θ) = 0 such that
∂S(X ∣ θ)/∂θ |_{θ=θ̂_ML} < 0.

Two possible exceptions cannot be forgotten:
1. the global maximum can be in the boundary of Θ;
2. the global maximum can occur in a point where the likelihood function has no derivative.
Your turn
Find the MLE of θ based on a random sample (X_1, …, X_n) from each of the following models:
a. {Ber(θ) : θ ∈ ]0, 1[};
b. {U(θ − 1/2, θ + 1/2) : θ ∈ ℝ}.

Note
1. The MLE may not exist and may not be unique.
2. Boundary problems: consider the closure of the parameter space.
3. Numerical methods are usually required.

Sufficiency
If T is a sufficient statistic can we claim that the MLE is a function of T?
Example
For the uniform model in the last exercise,
T = sin²(X_{(2)})(X_{(n)} − 1/2) + cos²(X_{(2)})(X_{(1)} + 1/2)
is a MLE of θ that is not a function of the sufficient statistic (X_{(1)}, X_{(n)}).

Efficiency
In a regular model, if the BUE exists then it must be a MLE.

Invariance
For any g : Θ ⊂ ℝᵏ → ℝᵖ with p ≤ k we have
ĝ_ML(θ) = g(θ̂_ML).
L-estimators

Definition
Linear combinations of the order statistics, T = ∑_{i=1}^n a_i X_{(i)}, are called L-estimators.

Intuitive estimators for location and scale parameters and for quantiles.
Often provide robust estimators.
Non-asymptotic properties are hard to investigate.
Examples: sample mean, trimmed means, range, mid-range and quantiles.
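Two of the listed L-estimators, sketched to show the contrast in robustness (the data vector is an arbitrary toy example with one gross outlier):

```python
# L-estimators are weighted sums of the order statistics X_(1) <= ... <= X_(n).
def trimmed_mean(xs, k):
    """Drop the k smallest and k largest order statistics, average the rest."""
    ys = sorted(xs)[k:len(xs) - k]
    return sum(ys) / len(ys)

def midrange(xs):
    """(X_(1) + X_(n)) / 2: weight 1/2 on each extreme order statistic."""
    ys = sorted(xs)
    return (ys[0] + ys[-1]) / 2

data = [0.1, 0.4, 0.5, 0.6, 9.9]                  # one gross outlier
assert abs(trimmed_mean(data, 1) - 0.5) < 1e-9     # robust: outlier dropped
assert abs(midrange(data) - 5.0) < 1e-9            # not robust: driven by the outlier
```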
Your turn
1. Let (X_1, …, X_n), with n > 1, be a random sample from ℱ = {Poi(λ) : λ ∈ ℝ⁺}. Show that the UMVUE of P(X > 0 ∣ λ) exists but that it is not the BUE.
2. Let (X_1, …, X_n) be a random sample from ℱ = {N(0, σ²) : σ² ∈ ℝ⁺}. Find the UMVUE of σ and check if it is also the BUE.
3. Based on a random sample of size n from ℱ = {N(μ, σ²) : μ ∈ ℝ, σ² ∈ ℝ⁺} we want to estimate the relative precision measured by the square of the reciprocal of the coefficient of variation. Find the MLE and the UMVUE of that measure.
3 Hypothesis testing
3.1 Introduction
Definition
A statistical hypothesis is a statement about the distribution of some observable quantity in a population.

The parametric case
Given a sample x ∈ 𝒳 from some F in ℱθ, test a null hypothesis
H_0 : θ ∈ Θ_0
against an alternative hypothesis
H_1 : θ ∈ Θ_1,
in which Θ_0 ∩ Θ_1 = ∅ and Θ_0 ∪ Θ_1 = Θ.

A hypothesis testing procedure is a partition defined in 𝒳 that leads to one of two possible decisions:
1. to reject H_0
2. to not reject H_0

Note
H_0 is usually chosen in a conservative way – it should only be rejected if strong evidence against it is found.
As a consequence, the two hypotheses are not permutable.

The procedure
Define a statistic T : 𝒳 → [0, 1] such that
T(x) = P(Reject H_0 ∣ x) = { 1, x ∈ 𝒳_c;  ε, x ∈ 𝒳_r;  0, otherwise },
with 𝒳_c, 𝒳_r ⊂ 𝒳, 𝒳_c ∩ 𝒳_r = ∅ and ε ∈ [0, 1[.

Note
𝒳_c is called the critical region;
If ε ≠ 0 then the test is called a randomized test.

The evaluation of tests
Any test can lead to one of two possible wrong decisions:
Type I error: rejecting H_0 given that H_0 is true
Type II error: not rejecting H_0 given that H_0 is false

Definition
The function
β_T(θ) = P(Reject H_0 ∣ θ) = E[T(X) ∣ θ]
is called the power function of the test T.

β_T(θ) = { P(Type I error), θ ∈ Θ_0;  1 − P(Type II error), θ ∈ Θ_1 }
Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, 1) : μ ∈ ℝ}.
To test the hypothesis
H_0 : μ = 0 against H_1 : μ ≠ 0
at a significance level α we can use the known test T(X) = I_{C_α}(X) with
C_α = {x ∈ 𝒳 : |√n x̄| > Φ^{−1}(1 − α/2)}.
Obtain the power function for this test.

The ideal test
β_T(θ) = I_{Θ_1}(θ)
It is impossible to get arbitrarily close to this ideal test since the two probabilities of error cannot both be made arbitrarily small. So, in practice some compromise is required.
For instance, we could try to minimize
k β_T(θ ∣ H_0) + (1 − k)(1 − β_T(θ ∣ H_1)), k ∈ ]0, 1[,
for some fixed k.
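The power function requested in the exercise above has the closed form β(μ) = 1 − Φ(Φ⁻¹(1 − α/2) − √n μ) + Φ(−Φ⁻¹(1 − α/2) − √n μ), since X̄ ∼ N(μ, 1/n). A numerical sketch using only the error function (the values n = 25 and α = 0.05 are arbitrary):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu, n, z):
    """β(μ) for the two-sided z-test with critical value z = Φ⁻¹(1 − α/2)."""
    return 1 - Phi(z - sqrt(n) * mu) + Phi(-z - sqrt(n) * mu)

z = 1.959963984540054                               # Φ⁻¹(0.975), i.e. α = 0.05
assert abs(power(0.0, n=25, z=z) - 0.05) < 1e-6     # size at μ = 0 equals α
assert power(0.5, n=25, z=z) > 0.5                  # power grows away from H0
```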
3.2 Uniformly most powerful tests

Definition
For a test T,
α = sup_{θ∈Θ_0} β_T(θ)
is said the size of the test, and T is said a size-α test.
Any test U with sup_{θ∈Θ_0} β_U(θ) ≤ α is said a level-α test.

Definition
A test T is an uniformly most powerful (UMP) test in a class C of tests if
β_T(θ) ≥ β_{T*}(θ), ∀θ ∈ Θ_1,
where T* is any other test in C.

Note
If an UMP α-size test exists and S is a sufficient statistic for θ then there is an UMP α-size test that is a function of S.
Neyman-Pearson lemma
To test H_0 : θ = θ_0 against H_1 : θ = θ_1 in the model ℱ = {f(x ∣ θ) : θ ∈ {θ_0, θ_1}}, the test
T(x) = { 1, f(x ∣ θ_1) > k f(x ∣ θ_0);  ε, f(x ∣ θ_1) = k f(x ∣ θ_0);  0, f(x ∣ θ_1) < k f(x ∣ θ_0) },
for some k > 0 and ε ∈ [0, 1[, is the essentially unique MP test of its level.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ_0²) : μ ∈ {μ_0, μ_1}}.
Find the MP α-size test for the hypothesis
H_0 : μ = μ_0 against H_1 : μ = μ_1 (> μ_0).

Your turn
Let (X_1, …, X_n) be a random sample from ℱ = {Exp(λ) : λ ∈ {λ_0, λ_1}}, with λ_1 > λ_0.
Find the MP α-size test for the hypothesis
H_0 : λ = λ_0 against H_1 : λ = λ_1.
Your turn
Let (X_1, …, X_10) be a random sample from ℱ = {Poi(λ) : λ ∈ {1, 2}}.
Find the MP test with size 0.05 for the hypothesis
H_0 : λ = 2 against H_1 : λ = 1.

Example
As we have seen, the test
T(x) = { 1, Z_0 > Φ^{−1}(1 − α);  0, otherwise },
with Z_0 = √n (X̄ − μ_0)/σ_0, is the MP α-size test for H_0 : μ = μ_0 against H_1 : μ = μ_1 (> μ_0).
Note that the MP test does not depend on the actual value of μ_1. What can be concluded from that?

Example
The former test is the UMP α-size test for H_0 : μ = μ_0 against H_1 : μ > μ_0. Can it also be the UMP test for
H_0 : μ ≤ μ_0 against H_1 : μ > μ_0?
Let's look at
β_T(μ) = 1 − Φ(Φ^{−1}(1 − α) − √n (μ − μ_0)/σ_0).
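For the Poisson "Your turn" above, the Neyman-Pearson test rejects for small values of T = ∑X_i (the likelihood ratio is decreasing in T), with T ∼ Poi(20) under H_0, and randomization is needed to attain size 0.05 exactly. A sketch of that computation (the search loop is one way to locate the boundary point):

```python
from math import exp, factorial

def poi_pmf(t, lam=20.0):
    """Poisson(λ) pmf; under H0, T = ΣX_i ~ Poi(10 × 2) = Poi(20)."""
    return exp(-lam) * lam**t / factorial(t)

alpha = 0.05
# Find the boundary c with P(T < c) <= α < P(T <= c), then randomize at c:
# ε = (α − P(T < c)) / P(T = c), so that P(T < c) + ε·P(T = c) = α exactly.
c, below = 0, 0.0
while below + poi_pmf(c) <= alpha:
    below += poi_pmf(c)
    c += 1
eps = (alpha - below) / poi_pmf(c)
size = below + eps * poi_pmf(c)

assert 0 <= eps < 1
assert abs(size - alpha) < 1e-12
```

Here the rejection region works out to T ≤ 12 with randomization at T = 13.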
Your turn
Consider now the problem of testing
H_0 : μ = μ_0 against H_1 : μ ≠ μ_0
in ℱ = {N(μ, σ_0²) : μ ∈ ℝ}.
a. Check if the test in the previous example can still be the UMP α-size test for these new hypotheses.
b. Is there an UMP test for this case?
Hint: consider the MP test for H_0 : μ = μ_0 against H_1 : μ = μ_1 (< μ_0).

Where can we find UMP tests?

Definition
A model ℱ = {f(x ∣ θ) : θ ∈ Θ ⊂ ℝ} has a monotone likelihood ratio (MLR) in a real-valued statistic S if for all θ_2 > θ_1 the likelihood ratio f(x ∣ θ_2)/f(x ∣ θ_1) is a nondecreasing function of S in {x ∈ 𝒳 : f(x ∣ θ_1) > 0 or f(x ∣ θ_2) > 0}.

Note
1. We will consider c/0 = +∞ for c > 0.
2. If the likelihood ratio is a nonincreasing function of S then the model has a MLR in −S.

Example
Let ℱ be a member of the uniparametric exponential family of distributions.
Under which conditions can this model have a MLR?

Your turn
Show that the model ℱ = {U(0, θ) : θ ∈ ℝ⁺} has a MLR in some statistic.
Lemma
If a model ℱθ has a MLR in a statistic S and g is a nondecreasing function then E[g(S) ∣ θ] is a nondecreasing function of θ.

Karlin-Rubin's theorem
If the model ℱ = {f(x ∣ θ) : θ ∈ Θ ⊂ ℝ} has a monotone likelihood ratio in a real-valued statistic S(X) then the test
T(x) = { 1, S(x) > k;  ε, S(x) = k;  0, S(x) < k },
for some k and ε ∈ [0, 1[, is an UMP test of its size to test H_0 : θ ≤ θ_0 against H_1 : θ > θ_0.

Note
For the hypotheses H_0 : θ ≥ θ_0 against H_1 : θ < θ_0 we can use the reparametrization λ = −θ.

Your turn
Consider a random sample of size 20 from the model ℱ = {U(0, θ) : θ ∈ ℝ⁺} and find the UMP test with size 0.05 for the hypothesis H_0 : θ ≥ 1 against H_1 : θ < 1.

3.3 Likelihood ratio tests

For many problems there is no UMP test among all tests of a given size!
We could keep applying the same optimality criterion in restricted classes of tests:
1. unbiased tests (UMPU)
3. . . .
Definition
For the hypotheses H_0 : θ ∈ Θ_0 against H_1 : θ ∈ Θ_1 with Θ_0 ∪ Θ_1 = Θ, the test
T(X) = I_{[0,k[}(Λ(X)),
where k ∈ [0, 1] and
Λ(X) = sup_{θ∈Θ_0} L(θ ∣ X) / sup_{θ∈Θ} L(θ ∣ X),
is called a likelihood ratio test.

Note
1. 0 ≤ Λ(x) ≤ 1, ∀x ∈ 𝒳.
2. We can also write Λ(X) = L(θ̂_0 ∣ X)/L(θ̂ ∣ X), where θ̂ is the MLE of θ and θ̂_0 is the MLE of θ restricted to Θ_0.
3. If S is a sufficient statistic for θ then Λ(X) can be written as a function of S.

Example
Let (X_1, …, X_n) be a random sample from ℱ = {N(μ, σ_0²) : μ ∈ ℝ}.
Find the LR test for the hypothesis H_0 : μ = μ_0 against H_1 : μ ≠ μ_0.
Your turn
Let (X_1, …, X_n) be a random sample from
ℱ = {f(x ∣ θ) = e^{−(x−θ)} I_{[θ,+∞[}(x) : θ ∈ ℝ}.
Find the LR test for the hypothesis H_0 : θ < θ_0 against H_1 : θ ≥ θ_0.

It is possible to construct LR tests provided:
1. the distribution of Λ(X) under H_0 is known, or
2. the LR test can equivalently be written as a function of a statistic S(X) whose distribution under H_0 is known.
Otherwise it may be difficult to find a LR test!
Wilks' LR test statistic
Under some regularity conditions we have that
W(X) = −2 log Λ(X) ⟶_D χ²(r) under H_0, with r = dim(Θ) − dim(Θ_0),
and
T(X) = I_{[c,+∞[}(W(X)), with c = F^{−1}_{χ²(r)}(1 − α),
is the α-size Wilks asymptotic LR test.

Your turn
1. Let (X_1, X_2) be a random sample from the model {Poi(θ), θ > 0}. We want to test H_0 : θ ≤ 1 against H_1 : θ > 1.
a. Show that the size of the test T_1(X_1, X_2) = 1 − I_{{0,1}}(X_1) is approximately 0.26.
b. Define and interpret the test
T_2(X_1, X_2) = E[T_1(X_1, X_2) ∣ X_1 + X_2 = t].
Should we prefer T_2 over T_1?
c. Identify the UMP test with size α = E[T_1(X_1, X_2) ∣ θ = 1].

2. Let X_1, …, X_n be a random sample from the model {Ber(θ), 0 < θ < 1}.
a. Define the LRT for the hypotheses H_0 : θ ≤ θ_0 and H_1 : θ > θ_0.
b. Is the previous test a UMP test? Justify.
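The Wilks statistic is easy to compute once the restricted and unrestricted MLEs are available. A sketch for a point-null variant H_0 : θ = θ_0 of the Bernoulli exercise (so r = 1; the data s = 62 successes in n = 100 is an arbitrary illustration, and 3.841 is the 0.95 quantile of χ²(1)):

```python
from math import log

# Bernoulli log-likelihood with s successes in n trials (0 < θ < 1, 0 < s < n).
def loglik(theta, s, n):
    return s * log(theta) + (n - s) * log(1 - theta)

# W = −2 log Λ = 2[ℓ(θ̂) − ℓ(θ0)], with unrestricted MLE θ̂ = s/n.
def wilks(s, n, theta0):
    theta_hat = s / n
    return 2 * (loglik(theta_hat, s, n) - loglik(theta0, s, n))

w = wilks(s=62, n=100, theta0=0.5)
assert w >= 0            # Λ ≤ 1 always, so W ≥ 0
assert w > 3.841         # 62/100 successes reject θ0 = 1/2 at the asymptotic 5% level
```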