MATHEMATICAL PRELIMINARIES
4.6 BAYESIAN ESTIMATION *
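This expression shows that the minimization can be carried out by minimizing the conditional expectation

$$\mathrm{E}\{\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2 \mid \mathbf{x}_T\} = \hat{\boldsymbol{\theta}}^T\hat{\boldsymbol{\theta}} - 2\hat{\boldsymbol{\theta}}^T\,\mathrm{E}\{\boldsymbol{\theta} \mid \mathbf{x}_T\} + \mathrm{E}\{\boldsymbol{\theta}^T\boldsymbol{\theta} \mid \mathbf{x}_T\} \tag{4.69}$$

The right-hand side is obtained by evaluating the squared norm and noting that $\hat{\boldsymbol{\theta}}$ is a function of the observations $\mathbf{x}_T$ only, so that it can be treated as a nonrandom vector when computing the conditional expectation (4.69). The result (4.67) now follows directly by computing the gradient $2\hat{\boldsymbol{\theta}} - 2\,\mathrm{E}\{\boldsymbol{\theta} \mid \mathbf{x}_T\}$ of (4.69) with respect to $\hat{\boldsymbol{\theta}}$ and equating it to zero.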
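Written out, the step from the squared norm to (4.69) and then to (4.67) is short. The lines below only restate the argument above in expanded form; the closing remark on the second derivative is an added check rather than part of the original derivation.

```latex
\begin{align*}
\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2
  &= \hat{\boldsymbol{\theta}}^T\hat{\boldsymbol{\theta}}
     - 2\hat{\boldsymbol{\theta}}^T\boldsymbol{\theta}
     + \boldsymbol{\theta}^T\boldsymbol{\theta}, \\
\frac{\partial}{\partial \hat{\boldsymbol{\theta}}}\,
  \mathrm{E}\{\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2 \mid \mathbf{x}_T\}
  &= 2\hat{\boldsymbol{\theta}} - 2\,\mathrm{E}\{\boldsymbol{\theta} \mid \mathbf{x}_T\} = \mathbf{0}
  \;\Longrightarrow\;
  \hat{\boldsymbol{\theta}} = \mathrm{E}\{\boldsymbol{\theta} \mid \mathbf{x}_T\}.
\end{align*}
```

The matrix of second derivatives with respect to $\hat{\boldsymbol{\theta}}$ is $2\mathbf{I}$, which is positive definite, so the stationary point is a minimum.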
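The minimum mean-square estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}$ is unbiased since

$$\mathrm{E}\{\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}\} = \mathrm{E}_{\mathbf{x}}\{\mathrm{E}\{\boldsymbol{\theta} \mid \mathbf{x}_T\}\} = \mathrm{E}\{\boldsymbol{\theta}\} \tag{4.70}$$

The minimum mean-square estimator (4.67) is theoretically very significant because of its conceptual simplicity and generality. This result holds for all distributions for which the joint distribution $p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x})$ exists, and remains unchanged if a weighting matrix $\mathbf{W}$ is added into the criterion (4.66) [407].

However, actual computation of the minimum mean-square estimator is often very difficult. This is because in practice we only know or assume the prior distribution $p_{\boldsymbol{\theta}}(\boldsymbol{\theta})$ and the conditional distribution of the observations $p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x} \mid \boldsymbol{\theta})$ given the parameters $\boldsymbol{\theta}$. In constructing the optimal estimator (4.67), one must first compute the posterior density from Bayes' formula (see Section 2.4)

$$p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T) = \frac{p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta})}{p_{\mathbf{x}}(\mathbf{x}_T)} \tag{4.71}$$

where the denominator is computed by integrating the numerator: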
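
$$p_{\mathbf{x}}(\mathbf{x}_T) = \int_{-\infty}^{\infty} p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta})\, d\boldsymbol{\theta} \tag{4.72}$$

The computation of the conditional expectation (4.67) then requires still another integration. These integrals are usually impossible to evaluate, at least analytically, except for special cases.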
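For a scalar parameter, the two integrations can at least be approximated numerically on a grid. The following is a minimal sketch of that idea, assuming a zero-mean gaussian prior and independent gaussian observation noise; this model, the grid limits, and the names `gauss_pdf` and `posterior_mean_on_grid` are illustrative choices and are not taken from the text.

```python
import math


def gauss_pdf(v, mean, var):
    """Density of a scalar gaussian with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_mean_on_grid(obs, prior_var, noise_var, lo=-10.0, hi=10.0, n=4001):
    """Approximate E{theta | x_T} with Riemann sums over a uniform theta grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]

    # Numerator of Bayes' formula (4.71): likelihood times prior (assumed model).
    numer = []
    for theta in grid:
        like = 1.0
        for x in obs:
            like *= gauss_pdf(x, theta, noise_var)
        numer.append(like * gauss_pdf(theta, 0.0, prior_var))

    # Denominator (4.72): first integration, over theta.
    evidence = sum(numer) * step
    # Conditional mean (4.67): second integration, of theta times the posterior.
    return sum(t * p for t, p in zip(grid, numer)) * step / evidence


obs = [1.0, 3.0]
estimate = posterior_mean_on_grid(obs, prior_var=4.0, noise_var=1.0)
print(estimate)
```

In this assumed gaussian setup the posterior mean is also known in closed form (prior variance times the sum of the observations, divided by noise variance plus $T$ times prior variance), which gives $16/9$ for the values above, so the grid result can be checked against it. The grid approach scales poorly with the dimension of $\boldsymbol{\theta}$, which is the practical difficulty described above.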
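There are, however, two important special cases where the minimum mean-square estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}$ for random parameters $\boldsymbol{\theta}$ can be determined fairly easily. If the estimator $\hat{\boldsymbol{\theta}}$ is constrained to be a linear function of the data: $\hat{\boldsymbol{\theta}} = \mathbf{L}\mathbf{x}_T$, then it can be shown [407] that the optimal linear estimator $\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}}$ minimizing the MSE criterion (4.66) is

$$\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}} = \mathbf{m}_{\boldsymbol{\theta}} + \mathbf{C}_{\boldsymbol{\theta}\mathbf{x}}\mathbf{C}_{\mathbf{x}}^{-1}(\mathbf{x}_T - \mathbf{m}_{\mathbf{x}}) \tag{4.73}$$

where $\mathbf{m}_{\boldsymbol{\theta}}$ and $\mathbf{m}_{\mathbf{x}}$ are the mean vectors of $\boldsymbol{\theta}$ and $\mathbf{x}_T$, respectively, $\mathbf{C}_{\mathbf{x}}$ is the covariance matrix of $\mathbf{x}_T$, and $\mathbf{C}_{\boldsymbol{\theta}\mathbf{x}}$ is the cross-covariance matrix of $\boldsymbol{\theta}$ and $\mathbf{x}_T$. The error covariance matrix corresponding to the optimum linear estimator $\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}}$ is

$$\mathrm{E}\{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{LMSE}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{LMSE}})^T\} = \mathbf{C}_{\boldsymbol{\theta}} - \mathbf{C}_{\boldsymbol{\theta}\mathbf{x}}\mathbf{C}_{\mathbf{x}}^{-1}\mathbf{C}_{\mathbf{x}\boldsymbol{\theta}} \tag{4.74}$$

where $\mathbf{C}_{\boldsymbol{\theta}}$ is the covariance matrix of the parameter vector $\boldsymbol{\theta}$. We can conclude that if the minimum mean-square estimator is constrained to be linear, it suffices to know the first-order and second-order statistics of the data $\mathbf{x}$ and the parameters $\boldsymbol{\theta}$, that is, their means and covariance matrices.

If the joint probability density $p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x}_T)$ of the parameters $\boldsymbol{\theta}$ and data $\mathbf{x}_T$ is gaussian, the results (4.73) and (4.74) obtained by constraining the minimum mean-square estimator to be linear are quite generally optimal. This is because the conditional density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$ is also gaussian with the conditional mean (4.73) and covariance matrix (4.74); see Section 2.5. This again underlines the fact that for the gaussian distribution, linear processing and knowledge of first and second order statistics are usually sufficient to obtain optimal results.

4.6.2 Wiener filtering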
In this subsection, we take a somewhat different, signal processing viewpoint on linear minimum MSE estimation. Many estimation algorithms have in fact been developed in the context of various signal processing problems [299, 171].
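As a concrete reference point for linear minimum MSE estimation, the estimator (4.73) and its error covariance (4.74) can be evaluated directly from first-order and second-order statistics. The sketch below assumes a scalar parameter observed twice in independent unit-variance noise; the numbers and the helper names `solve` and `lmse_scalar` are hypothetical, and the small elimination routine merely stands in for whatever linear solver is available.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lmse_scalar(m_theta, c_theta, m_x, c_x, c_theta_x, x):
    """Linear MMSE estimate (4.73) and error variance (4.74) for a scalar theta."""
    # g = C_x^{-1} C_{x theta}; since C_x is symmetric, g^T = C_{theta x} C_x^{-1}.
    g = solve(c_x, c_theta_x)
    estimate = m_theta + sum(gi * (xi - mi) for gi, xi, mi in zip(g, x, m_x))
    error_var = c_theta - sum(gi * ci for gi, ci in zip(g, c_theta_x))
    return estimate, error_var


# Assumed model: theta has mean 0 and variance 4; x_i = theta + n_i, var(n_i) = 1.
c_x = [[5.0, 4.0], [4.0, 5.0]]   # covariance matrix of the data
c_theta_x = [4.0, 4.0]           # cross-covariance of theta with each x_i
estimate, error_var = lmse_scalar(0.0, 4.0, [0.0, 0.0], c_x, c_theta_x, [1.0, 3.0])
print(estimate, error_var)
```

Only means and covariances enter the computation, which is the point made in connection with (4.73) and (4.74): no knowledge of the full joint density is needed once the estimator is constrained to be linear.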
Consider the following linear filtering problem. Let $\mathbf{z}$ be an $m$-dimensional data or input vector of the form

$$\mathbf{z} = [z_1, z_2, \ldots, z_m]^T \tag{4.75}$$

and

$$\mathbf{w} = [w_1, w_2, \ldots, w_m]^T \tag{4.76}$$

an $m$-dimensional weight vector with adjustable weights (elements) $w_i$, $i = 1, \ldots, m$, operating linearly on $\mathbf{z}$ so that the output of the filter is

$$y = \mathbf{w}^T\mathbf{z} \tag{4.77}$$

In Wiener filtering, the goal is to determine the linear filter (4.77) that minimizes the mean-square error

$$\mathcal{E}_{\mathrm{MSE}} = \mathrm{E}\{(y - d)^2\} \tag{4.78}$$

between the desired response $d$ and the output $y$ of the filter. Inserting (4.77) into (4.78) and evaluating the expectation yields

$$\mathcal{E}_{\mathrm{MSE}} = \mathbf{w}^T\mathbf{R}_{\mathbf{z}}\mathbf{w} - 2\mathbf{w}^T\mathbf{r}_{\mathbf{z}d} + \mathrm{E}\{d^2\} \tag{4.79}$$

Here $\mathbf{R}_{\mathbf{z}} = \mathrm{E}\{\mathbf{z}\mathbf{z}^T\}$ is the data correlation matrix, and $\mathbf{r}_{\mathbf{z}d} = \mathrm{E}\{\mathbf{z}d\}$ is the cross-correlation vector between the data vector $\mathbf{z}$ and the desired response $d$. Minimizing the mean-square error (4.79) with respect to the weight vector $\mathbf{w}$ provides as the optimum solution the Wiener filter [168, 171, 419, 172]

$$\hat{\mathbf{w}}_{\mathrm{MSE}} = \mathbf{R}_{\mathbf{z}}^{-1}\mathbf{r}_{\mathbf{z}d} \tag{4.80}$$

provided that $\mathbf{R}_{\mathbf{z}}$ is nonsingular. This is almost always the case in practice due to the noise and the statistical nature of the problem. The Wiener filter is usually computed by directly solving the linear normal equations

$$\mathbf{R}_{\mathbf{z}}\hat{\mathbf{w}}_{\mathrm{MSE}} = \mathbf{r}_{\mathbf{z}d} \tag{4.81}$$

In practice, the correlation matrix $\mathbf{R}_{\mathbf{z}}$ and the cross-correlation vector $\mathbf{r}_{\mathbf{z}d}$ are usually unknown. They must then be replaced by their estimates, which can be computed easily from the available finite data set. In fact the Wiener estimate then becomes a standard least-squares estimator (see exercises). In signal processing applications, the correlation matrix $\mathbf{R}_{\mathbf{z}}$ is often a Toeplitz matrix, since the data vectors $\mathbf{z}(i)$ consist of subsequent samples from a single signal or time series (see Section 2.8). For this special case, various fast algorithms are available for solving the normal equations efficiently [169, 171, 419].

4.6.3 Maximum a posteriori (MAP) estimator
Instead of minimizing the mean-square error (4.66) or some other performance index, we can apply to Bayesian estimation the same principle as in the maximum likelihood method. This leads to the maximum a posteriori (MAP) estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$, which is defined as the value of the parameter vector $\boldsymbol{\theta}$ that maximizes the posterior density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$ of $\boldsymbol{\theta}$ given the measurements $\mathbf{x}_T$. The MAP estimator can be interpreted as the most probable value of the parameter vector $\boldsymbol{\theta}$ for the available data $\mathbf{x}_T$. The principle behind the MAP estimator is intuitively well justified and appealing.

We have earlier noted that the posterior density can be computed from Bayes' formula (4.71). Note that the denominator in (4.71) is the prior density $p_{\mathbf{x}}(\mathbf{x}_T)$ of the data $\mathbf{x}_T$, which does not depend on the parameter vector $\boldsymbol{\theta}$ and merely normalizes the posterior density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$. Hence for finding the MAP estimator it suffices to find the value of $\boldsymbol{\theta}$ that maximizes the numerator of (4.71), which is the joint density

$$p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x}_T) = p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \tag{4.82}$$

Quite similarly to the maximum likelihood method, the MAP estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ can usually be found by solving the (logarithmic) likelihood equation. This now has the form
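
$$\frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\boldsymbol{\theta}, \mathbf{x}_T) = \frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_T \mid \boldsymbol{\theta}) + \frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\boldsymbol{\theta}) = \mathbf{0} \tag{4.83}$$

where we have dropped the subscripts of the probability densities for notational simplicity.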
A comparison with the respective likelihood equation (4.50) for the maximum likelihood method shows that these equations are otherwise the same, but the MAP likelihood equation (4.83) contains an additional term $\partial(\ln p(\boldsymbol{\theta}))/\partial\boldsymbol{\theta}$, which takes into account the prior information on the parameters $\boldsymbol{\theta}$. If the prior density $p(\boldsymbol{\theta})$ is uniform for parameter values $\boldsymbol{\theta}$ for which $p(\mathbf{x}_T \mid \boldsymbol{\theta})$ is markedly greater than zero, then the MAP and maximum likelihood estimators become the same. In this case, they are both obtained by finding the value $\hat{\boldsymbol{\theta}}$ that maximizes the conditional density $p(\mathbf{x}_T \mid \boldsymbol{\theta})$. This is the case when there is no prior information about the parameters $\boldsymbol{\theta}$ available. However, when the prior density $p(\boldsymbol{\theta})$ is not uniform, the MAP and ML estimators are usually different.

Example 4.8 Assume that we have
$T$ independent observations $x(1), \ldots, x(T)$ from a scalar random quantity $x$ that is gaussian distributed with mean $\mu_x$ and variance $\sigma_x^2$. This time the mean $\mu_x$ is itself a gaussian random variable having mean zero and the variance $\sigma_\mu^2$. We assume that both the variances $\sigma_x^2$ and $\sigma_\mu^2$ are known and wish to estimate $\mu_x$ using the MAP method.

Using the preceding information, it is straightforward to form the likelihood equation for the MAP estimator $\hat{\mu}_{\mathrm{MAP}}$ and solve it. The solution is (the derivation is left as an exercise)

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\sigma_\mu^2}{\sigma_x^2 + T\sigma_\mu^2} \sum_{j=1}^{T} x(j) \tag{4.84}$$

The case in which we do not have any prior information on $\mu_x$ can be modeled by letting $\sigma_\mu^2 \to \infty$, reflecting our uncertainty about $\mu_x$ [407]. Then clearly

$$\hat{\mu}_{\mathrm{MAP}} \to \frac{1}{T} \sum_{j=1}^{T} x(j) \tag{4.85}$$

so that the MAP estimator $\hat{\mu}_{\mathrm{MAP}}$ tends to the sample mean. The same limiting value is obtained if the number of samples $T \to \infty$. This shows that the influence of the prior information, contained in the variance $\sigma_\mu^2$, gradually decreases as the number of measurements increases. Hence asymptotically the MAP estimator coincides with the maximum likelihood estimator $\hat{\mu}_{\mathrm{ML}}$, which we found earlier in (4.56) to be the sample mean (4.85).

Note also that if we are relatively confident about the prior value
$0$ of the mean $\mu_x$, but the samples are very noisy so that $\sigma_x^2 \gg \sigma_\mu^2$, the MAP estimator (4.84) for small $T$ stays close to the prior value $0$ of $\mu_x$, and the number $T$ of samples must grow large until the MAP estimator approaches its limiting value (4.85). In contrast, if $\sigma_\mu^2 \gg \sigma_x^2$, so that the samples are reliable compared to the prior information on $\mu_x$, the MAP estimator (4.84) rapidly approaches the sample mean (4.85). Thus the MAP estimator (4.84) weights in a meaningful way the prior information and the samples according to their relative reliability.

Roughly speaking, the MAP estimator is a compromise between the general minimum mean-square error estimator (4.67) and the maximum likelihood estimator.
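The weighting behavior of Example 4.8 can be made concrete with a short numerical sketch. It evaluates the closed form (4.84) in the two regimes discussed in the example and checks that the result satisfies the likelihood equation (4.83) for this gaussian model. The sample values and the names `map_mean` and `log_posterior_slope` are hypothetical, chosen only for illustration.

```python
def map_mean(samples, noise_var, prior_var):
    """MAP estimate (4.84) of a gaussian mean with a zero-mean gaussian prior."""
    t = len(samples)
    return prior_var / (noise_var + t * prior_var) * sum(samples)


def log_posterior_slope(mu, samples, noise_var, prior_var):
    """Derivative of ln p(x_T | mu) + ln p(mu) with respect to mu, as in (4.83)."""
    return sum(x - mu for x in samples) / noise_var - mu / prior_var


samples = [1.0, 3.0]
sample_mean = sum(samples) / len(samples)

# Noisy samples, confident prior: the estimate stays near the prior mean 0.
noisy = map_mean(samples, noise_var=100.0, prior_var=1.0)
# Reliable samples, vague prior: the estimate is close to the sample mean.
reliable = map_mean(samples, noise_var=1.0, prior_var=100.0)
# The slope of the log posterior vanishes at the MAP estimate.
slope = log_posterior_slope(noisy, samples, noise_var=100.0, prior_var=1.0)

print(noisy, reliable, sample_mean, slope)
```

With only two samples, the first estimate is pulled almost entirely toward the prior mean while the second nearly equals the sample mean, which matches the limiting behavior stated in (4.85).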
The MAP method has the advantage over the maximum likelihood method that it takes into account the (possibly available) prior information about the parameters $\boldsymbol{\theta}$, but it is computationally somewhat more difficult to determine because a second term appears in the likelihood equation (4.83). On the other hand, both the ML and MAP estimators are obtained from likelihood equations, avoiding the generally difficult integrations needed in computing the minimum mean-square estimator. If the posterior distribution $p(\boldsymbol{\theta} \mid \mathbf{x}_T)$ is symmetric around its peak value, the MAP estimator and MSE estimator coincide.

There is no guarantee that the MAP estimator is unbiased. It is also generally difficult to compute the covariance matrix of the estimation error for the MAP and ML estimators. However, the MAP estimator is intuitively sensible, yields in most cases good results in practice, and it has good asymptotic properties under appropriate conditions. These desirable characteristics justify its use.