BAYESIAN ESTIMATION * - MATHEMATICAL PRELIMINARIES

MATHEMATICAL PRELIMINARIES

4.6 BAYESIAN ESTIMATION *


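This expression shows that the minimization can be carried out by minimizing the conditional expectation

$$E\{\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2 \mid \mathbf{x}_T\} = \hat{\boldsymbol{\theta}}^T \hat{\boldsymbol{\theta}} - 2\hat{\boldsymbol{\theta}}^T E\{\boldsymbol{\theta} \mid \mathbf{x}_T\} + E\{\boldsymbol{\theta}^T \boldsymbol{\theta} \mid \mathbf{x}_T\} \tag{4.69}$$

The right-hand side is obtained by evaluating the squared norm and noting that $\hat{\boldsymbol{\theta}}$ is a function of the observations $\mathbf{x}_T$ only, so that it can be treated as a nonrandom vector when computing the conditional expectation (4.69). The result (4.67) now follows directly by computing the gradient $2\hat{\boldsymbol{\theta}} - 2E\{\boldsymbol{\theta} \mid \mathbf{x}_T\}$ of (4.69) with respect to $\hat{\boldsymbol{\theta}}$ and equating it to zero.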

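The minimum mean-square estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}$ is unbiased since

$$E\{\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}\} = E_{\mathbf{x}}\{E\{\boldsymbol{\theta} \mid \mathbf{x}_T\}\} = E\{\boldsymbol{\theta}\} \tag{4.70}$$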

The minimum mean-square estimator (4.67) is theoretically very significant because of its conceptual simplicity and generality. This result holds for all distributions for which the joint distribution $p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x})$ exists, and remains unchanged if a weighting matrix $\mathbf{W}$ is added into the criterion (4.66) [407].

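However, actual computation of the minimum mean-square estimator is often very difficult. This is because in practice we only know or assume the prior distribution $p_{\boldsymbol{\theta}}(\boldsymbol{\theta})$ and the conditional distribution of the observations $p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x} \mid \boldsymbol{\theta})$ given the parameters $\boldsymbol{\theta}$. In constructing the optimal estimator (4.67), one must first compute the posterior density from Bayes’ formula (see Section 2.4)

$$p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T) = \frac{p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta})}{p_{\mathbf{x}}(\mathbf{x}_T)} \tag{4.71}$$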

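where the denominator is computed by integrating the numerator:

$$p_{\mathbf{x}}(\mathbf{x}_T) = \int_{-\infty}^{\infty} p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta})\, d\boldsymbol{\theta} \tag{4.72}$$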

The computation of the conditional expectation (4.67) then requires still another integration. These integrals are usually impossible to evaluate, at least analytically, except in special cases.
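To make the two integrations concrete, the following is a minimal numerical sketch for a scalar parameter. It assumes a setting chosen purely for illustration: a zero-mean gaussian prior with variance `prior_var` and independent gaussian observations with mean $\theta$ and variance `noise_var`. The function names, the grid limits, and the sample values are hypothetical. In this particular gaussian case the posterior mean is also known in closed form, which gives a check on the numerical result.

```python
import math

def log_gauss(v, mean, var):
    """Log-density of a scalar gaussian with the given mean and variance."""
    return -0.5 * (v - mean) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)

def posterior_mean(samples, prior_var, noise_var, lo=-10.0, hi=10.0, n=4001):
    """Approximate E{theta | x_T} with Riemann sums on a uniform grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Numerator of Bayes' formula (4.71): likelihood times prior.
    joint = [
        math.exp(sum(log_gauss(x, theta, noise_var) for x in samples)
                 + log_gauss(theta, 0.0, prior_var))
        for theta in grid
    ]
    evidence = sum(joint) * step                                   # integral (4.72)
    first_moment = sum(t * p for t, p in zip(grid, joint)) * step  # second integration
    return first_moment / evidence

samples = [1.2, 0.7, 1.9, 1.1]          # illustrative observations
numeric = posterior_mean(samples, prior_var=4.0, noise_var=1.0)
# Closed form for this gaussian prior and gaussian likelihood (assumed model).
closed_form = 4.0 * sum(samples) / (1.0 + len(samples) * 4.0)
print(numeric, closed_form)
```

A grid of this kind is only practical for one or two parameters; for a parameter vector of higher dimension the cost grows exponentially with the dimension, which is one way to see why the text calls the computation difficult.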

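There are, however, two important special cases where the minimum mean-square estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MSE}}$ for random parameters $\boldsymbol{\theta}$ can be determined fairly easily. If the estimator $\hat{\boldsymbol{\theta}}$ is constrained to be a linear function of the data: $\hat{\boldsymbol{\theta}} = \mathbf{L}\mathbf{x}_T$, then it can be shown [407] that the optimal linear estimator $\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}}$ minimizing the MSE criterion (4.66) is

$$\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}} = \mathbf{m}_{\boldsymbol{\theta}} + \mathbf{C}_{\boldsymbol{\theta}\mathbf{x}} \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x}_T - \mathbf{m}_{\mathbf{x}}) \tag{4.73}$$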

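where $\mathbf{m}_{\boldsymbol{\theta}}$ and $\mathbf{m}_{\mathbf{x}}$ are the mean vectors of $\boldsymbol{\theta}$ and $\mathbf{x}_T$, respectively, $\mathbf{C}_{\mathbf{x}}$ is the covariance matrix of $\mathbf{x}_T$, and $\mathbf{C}_{\boldsymbol{\theta}\mathbf{x}}$ is the cross-covariance matrix of $\boldsymbol{\theta}$ and $\mathbf{x}_T$. The error covariance matrix corresponding to the optimum linear estimator $\hat{\boldsymbol{\theta}}_{\mathrm{LMSE}}$ is

$$E\{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{LMSE}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\mathrm{LMSE}})^T\} = \mathbf{C}_{\boldsymbol{\theta}} - \mathbf{C}_{\boldsymbol{\theta}\mathbf{x}} \mathbf{C}_{\mathbf{x}}^{-1} \mathbf{C}_{\mathbf{x}\boldsymbol{\theta}} \tag{4.74}$$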

where $\mathbf{C}_{\boldsymbol{\theta}}$ is the covariance matrix of the parameter vector $\boldsymbol{\theta}$. We can conclude that if the minimum mean-square estimator is constrained to be linear, it suffices to know the first-order and second-order statistics of the data $\mathbf{x}$ and the parameters $\boldsymbol{\theta}$, that is, their means and covariance matrices.
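As an illustration of how little is needed to evaluate (4.73) and (4.74), here is a minimal NumPy sketch. The helper `lmse_estimator` and the linear observation model $\mathbf{x} = \mathbf{A}\boldsymbol{\theta} + \mathbf{n}$ with white noise used to generate the statistics are assumptions made for this example only; they do not come from the text. Under that assumed model the error covariance also has a second well-known form, $(\mathbf{C}_{\boldsymbol{\theta}}^{-1} + \mathbf{A}^T\mathbf{A}/\sigma^2)^{-1}$, which serves as a consistency check.

```python
import numpy as np

def lmse_estimator(x, m_theta, m_x, C_theta, C_x, C_theta_x):
    """Evaluate (4.73) and (4.74) from means and covariance matrices only."""
    # gain = C_theta_x C_x^{-1}; solve() avoids forming the inverse explicitly
    # and the transpose trick is valid because C_x is symmetric.
    gain = np.linalg.solve(C_x, C_theta_x.T).T
    theta_hat = m_theta + gain @ (x - m_x)
    error_cov = C_theta - gain @ C_theta_x.T      # C_x_theta = C_theta_x^T
    return theta_hat, error_cov

# Hypothetical model: x = A theta + noise, noise white with variance noise_var.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
m_theta = np.array([1.0, -1.0])
C_theta = np.array([[2.0, 0.3], [0.3, 1.0]])
noise_var = 0.5

m_x = A @ m_theta
C_x = A @ C_theta @ A.T + noise_var * np.eye(5)
C_theta_x = C_theta @ A.T

theta_true = rng.multivariate_normal(m_theta, C_theta)
x = A @ theta_true + np.sqrt(noise_var) * rng.standard_normal(5)

theta_hat, error_cov = lmse_estimator(x, m_theta, m_x, C_theta, C_x, C_theta_x)
print(theta_true, theta_hat)
print(error_cov)
```

Note that the function never sees the distributions themselves, only their first- and second-order statistics, which is the point made in the paragraph above.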

If the joint probability density $p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x}_T)$ of the parameters $\boldsymbol{\theta}$ and data $\mathbf{x}_T$ is gaussian, the results (4.73) and (4.74) obtained by constraining the minimum mean-square estimator to be linear are quite generally optimal. This is because the conditional density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$ is also gaussian with the conditional mean (4.73) and covariance matrix (4.74); see Section 2.5. This again underlines the fact that for the gaussian distribution, linear processing and knowledge of first- and second-order statistics are usually sufficient to obtain optimal results.

4.6.2 Wiener filtering

In this subsection, we take a somewhat different, signal processing viewpoint on linear minimum MSE estimation. Many estimation algorithms have in fact been developed in the context of various signal processing problems [299, 171].

Consider the following linear filtering problem. Let $\mathbf{z}$ be an $m$-dimensional data or input vector of the form

$$\mathbf{z} = [z_1, z_2, \ldots, z_m]^T \tag{4.75}$$

and

$$\mathbf{w} = [w_1, w_2, \ldots, w_m]^T \tag{4.76}$$

an $m$-dimensional weight vector with adjustable weights (elements) $w_i$, $i = 1, \ldots, m$, operating linearly on $\mathbf{z}$ so that the output of the filter is

$$y = \mathbf{w}^T \mathbf{z} \tag{4.77}$$

In Wiener filtering, the goal is to determine the linear filter (4.77) that minimizes the mean-square error

$$\mathcal{E}_{\mathrm{MSE}} = E\{(y - d)^2\} \tag{4.78}$$

between the desired response $d$ and the output $y$ of the filter. Inserting (4.77) into (4.78) and evaluating the expectation yields

$$\mathcal{E}_{\mathrm{MSE}} = \mathbf{w}^T \mathbf{R}_{\mathbf{z}} \mathbf{w} - 2\mathbf{w}^T \mathbf{r}_{\mathbf{z}d} + E\{d^2\} \tag{4.79}$$

Here $\mathbf{R}_{\mathbf{z}} = E\{\mathbf{z}\mathbf{z}^T\}$ is the data correlation matrix, and $\mathbf{r}_{\mathbf{z}d} = E\{\mathbf{z}d\}$ is the cross-correlation vector between the data vector $\mathbf{z}$ and the desired response $d$. Minimizing the mean-square error (4.79) with respect to the weight vector $\mathbf{w}$ provides as the optimum solution the Wiener filter [168, 171, 419, 172]

$$\hat{\mathbf{w}}_{\mathrm{MSE}} = \mathbf{R}_{\mathbf{z}}^{-1} \mathbf{r}_{\mathbf{z}d} \tag{4.80}$$


provided that $\mathbf{R}_{\mathbf{z}}$ is nonsingular. This is almost always the case in practice due to the noise and the statistical nature of the problem. The Wiener filter is usually computed by directly solving the linear normal equations

$$\mathbf{R}_{\mathbf{z}} \hat{\mathbf{w}}_{\mathrm{MSE}} = \mathbf{r}_{\mathbf{z}d} \tag{4.81}$$
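The step from the error criterion (4.78) to the normal equations (4.81) can be sketched as follows. The sketch assumes only that $\mathbf{w}$ is deterministic, so that it can be moved outside the expectations, and that $\mathbf{R}_{\mathbf{z}}$ is symmetric, which holds by its definition $E\{\mathbf{z}\mathbf{z}^T\}$.

```latex
\begin{align*}
\mathcal{E}_{\mathrm{MSE}}
  &= E\{(\mathbf{w}^T\mathbf{z} - d)^2\} \\
  &= \mathbf{w}^T E\{\mathbf{z}\mathbf{z}^T\}\,\mathbf{w}
     - 2\,\mathbf{w}^T E\{\mathbf{z}d\} + E\{d^2\} \\
  &= \mathbf{w}^T \mathbf{R}_{\mathbf{z}} \mathbf{w}
     - 2\,\mathbf{w}^T \mathbf{r}_{\mathbf{z}d} + E\{d^2\}, \\[4pt]
\frac{\partial \mathcal{E}_{\mathrm{MSE}}}{\partial \mathbf{w}}
  &= 2\,\mathbf{R}_{\mathbf{z}}\mathbf{w} - 2\,\mathbf{r}_{\mathbf{z}d} = \mathbf{0}
  \quad\Longrightarrow\quad
  \mathbf{R}_{\mathbf{z}}\hat{\mathbf{w}}_{\mathrm{MSE}} = \mathbf{r}_{\mathbf{z}d}.
\end{align*}
```

Since $\mathbf{R}_{\mathbf{z}}$ is positive semidefinite, the criterion is a convex quadratic function of $\mathbf{w}$, so the stationary point is a minimum, and it is unique when $\mathbf{R}_{\mathbf{z}}$ is nonsingular.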

In practice, the correlation matrix $\mathbf{R}_{\mathbf{z}}$ and the cross-correlation vector $\mathbf{r}_{\mathbf{z}d}$ are usually unknown. They must then be replaced by their estimates, which can be computed easily from the available finite data set. In fact, the Wiener estimate then becomes a standard least-squares estimator (see exercises). In signal processing applications, the correlation matrix $\mathbf{R}_{\mathbf{z}}$ is often a Toeplitz matrix, since the data vectors $\mathbf{z}(i)$ consist of subsequent samples from a single signal or time series (see Section 2.8). For this special case, various fast algorithms are available for solving the normal equations efficiently [169, 171, 419].
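The remark that the sample-based Wiener estimate reduces to least squares can be checked numerically. The following NumPy sketch uses synthetic data; the filter `w_true`, the noise level, and the sample size are hypothetical values picked for the example. It replaces $\mathbf{R}_{\mathbf{z}}$ and $\mathbf{r}_{\mathbf{z}d}$ by sample averages, solves the normal equations (4.81), and compares the result with an ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
T, m = 500, 3
Z = rng.standard_normal((T, m))                  # row i holds the data vector z(i)^T
w_true = np.array([0.5, -1.0, 2.0])              # hypothetical underlying filter
d = Z @ w_true + 0.1 * rng.standard_normal(T)    # noisy desired response

R_hat = Z.T @ Z / T                              # sample estimate of R_z = E{z z^T}
r_hat = Z.T @ d / T                              # sample estimate of r_zd = E{z d}
w_wiener = np.linalg.solve(R_hat, r_hat)         # normal equations (4.81)

w_ls, *_ = np.linalg.lstsq(Z, d, rcond=None)     # ordinary least-squares solution
print(w_wiener)
print(w_ls)
```

The factor $1/T$ cancels between the two sample estimates, so the normal equations with estimated statistics are exactly the least-squares normal equations $\mathbf{Z}^T\mathbf{Z}\mathbf{w} = \mathbf{Z}^T\mathbf{d}$. The sketch uses a general solver; it does not exploit Toeplitz structure.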

4.6.3 Maximum a posteriori (MAP) estimator

Instead of minimizing the mean-square error (4.66) or some other performance index, we can apply to Bayesian estimation the same principle as in the maximum likelihood method. This leads to the maximum a posteriori (MAP) estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$, which is defined as the value of the parameter vector $\boldsymbol{\theta}$ that maximizes the posterior density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$ of $\boldsymbol{\theta}$ given the measurements $\mathbf{x}_T$. The MAP estimator can be interpreted as the most probable value of the parameter vector $\boldsymbol{\theta}$ for the available data $\mathbf{x}_T$. The principle behind the MAP estimator is intuitively well justified and appealing.

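We have earlier noted that the posterior density can be computed from Bayes’ formula (4.71). Note that the denominator in (4.71) is the prior density $p_{\mathbf{x}}(\mathbf{x}_T)$ of the data $\mathbf{x}_T$, which does not depend on the parameter vector $\boldsymbol{\theta}$ and merely normalizes the posterior density $p_{\boldsymbol{\theta}|\mathbf{x}}(\boldsymbol{\theta} \mid \mathbf{x}_T)$. Hence for finding the MAP estimator it suffices to find the value of $\boldsymbol{\theta}$ that maximizes the numerator of (4.71), which is the joint density

$$p_{\boldsymbol{\theta},\mathbf{x}}(\boldsymbol{\theta}, \mathbf{x}_T) = p_{\mathbf{x}|\boldsymbol{\theta}}(\mathbf{x}_T \mid \boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \tag{4.82}$$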

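Quite similarly to the maximum likelihood method, the MAP estimator $\hat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ can usually be found by solving the (logarithmic) likelihood equation. This now has the form

$$\frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\boldsymbol{\theta}, \mathbf{x}_T) = \frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_T \mid \boldsymbol{\theta}) + \frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\boldsymbol{\theta}) = \mathbf{0} \tag{4.83}$$

where we have dropped the subscripts of the probability densities for notational simplicity.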

A comparison with the respective likelihood equation (4.50) for the maximum likelihood method shows that these equations are otherwise the same, but the MAP likelihood equation (4.83) contains an additional term $\partial(\ln p(\boldsymbol{\theta}))/\partial\boldsymbol{\theta}$, which takes into account the prior information on the parameters $\boldsymbol{\theta}$. If the prior density $p(\boldsymbol{\theta})$ is uniform for parameter values $\boldsymbol{\theta}$ for which $p(\mathbf{x}_T \mid \boldsymbol{\theta})$ is markedly greater than zero, then the MAP and maximum likelihood estimators become the same. In this case, they are both obtained by finding the value $\hat{\boldsymbol{\theta}}$ that maximizes the conditional density $p(\mathbf{x}_T \mid \boldsymbol{\theta})$. This is the case when there is no prior information about the parameters $\boldsymbol{\theta}$ available. However, when the prior density $p(\boldsymbol{\theta})$ is not uniform, the MAP and ML estimators are usually different.

Example 4.8 Assume that we have $T$ independent observations $x(1), \ldots, x(T)$ from a scalar random quantity $x$ that is gaussian distributed with mean $\mu_x$ and variance $\sigma_x^2$. This time the mean $\mu_x$ is itself a gaussian random variable having mean zero and the variance $\sigma_\mu^2$. We assume that both the variances $\sigma_x^2$ and $\sigma_\mu^2$ are known and wish to estimate $\mu_x$ using the MAP method.

Using the preceding information, it is straightforward to form the likelihood equation for the MAP estimator $\hat{\mu}_{\mathrm{MAP}}$ and solve it. The solution is (the derivation is left as an exercise)

$$\hat{\mu}_{\mathrm{MAP}} = \frac{\sigma_\mu^2}{\sigma_x^2 + T\sigma_\mu^2} \sum_{j=1}^{T} x(j) \tag{4.84}$$
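For readers who want to check their own solution, one possible route to (4.84) is outlined below. It is a sketch that uses only the assumptions of Example 4.8: independent gaussian observations with mean $\mu_x$ and known variance $\sigma_x^2$, and a zero-mean gaussian prior on $\mu_x$ with known variance $\sigma_\mu^2$. Terms that do not depend on $\mu_x$ are collected into a constant.

```latex
\begin{align*}
\ln p(\mu_x, \mathbf{x}_T)
  &= -\frac{1}{2\sigma_x^2}\sum_{j=1}^{T}\bigl(x(j)-\mu_x\bigr)^2
     - \frac{\mu_x^2}{2\sigma_\mu^2} + \text{const}, \\[4pt]
\frac{\partial}{\partial \mu_x}\ln p(\mu_x, \mathbf{x}_T)
  &= \frac{1}{\sigma_x^2}\sum_{j=1}^{T}\bigl(x(j)-\mu_x\bigr)
     - \frac{\mu_x}{\sigma_\mu^2} = 0, \\[4pt]
\mu_x\left(\frac{T}{\sigma_x^2}+\frac{1}{\sigma_\mu^2}\right)
  &= \frac{1}{\sigma_x^2}\sum_{j=1}^{T}x(j)
  \quad\Longrightarrow\quad
  \hat{\mu}_{\mathrm{MAP}}
   = \frac{\sigma_\mu^2}{\sigma_x^2+T\sigma_\mu^2}\sum_{j=1}^{T}x(j).
\end{align*}
```

The first term of the derivative is the ordinary maximum likelihood term, and the second is the contribution of the prior, matching the two-term structure of the likelihood equation (4.83).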

The case in which we do not have any prior information on $\mu_x$ can be modeled by letting $\sigma_\mu^2 \to \infty$, reflecting our uncertainty about $\mu_x$ [407]. Then clearly

$$\hat{\mu}_{\mathrm{MAP}} \to \frac{1}{T} \sum_{j=1}^{T} x(j) \tag{4.85}$$

so that the MAP estimator $\hat{\mu}_{\mathrm{MAP}}$ tends to the sample mean. The same limiting value is obtained if the number of samples $T \to \infty$. This shows that the influence of the prior information, contained in the variance $\sigma_\mu^2$, gradually decreases as the number of measurements increases. Hence asymptotically the MAP estimator coincides with the maximum likelihood estimator $\hat{\mu}_{\mathrm{ML}}$, which we found earlier in (4.56) to be the sample mean (4.85).

Note also that if we are relatively confident about the prior value 0 of the mean $\mu_x$, but the samples are very noisy so that $\sigma_x^2 \gg \sigma_\mu^2$, the MAP estimator (4.84) for small $T$ stays close to the prior value 0 of $\mu_x$, and the number $T$ of samples must grow large until the MAP estimator approaches its limiting value (4.85). In contrast, if $\sigma_\mu^2 \gg \sigma_x^2$, so that the samples are reliable compared to the prior information on $\mu_x$, the MAP estimator (4.84) rapidly approaches the sample mean (4.85). Thus the MAP estimator (4.84) weights in a meaningful way the prior information and the samples according to their relative reliability.
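The weighting behaviour described in Example 4.8 can be observed with a few lines of standard-library Python. This is only an illustrative sketch: the function name `map_mean`, the chosen true mean, the variances, and the random seed are all assumptions for the demonstration. The function itself is a direct transcription of (4.84).

```python
import random

def map_mean(samples, prior_var, noise_var):
    """MAP estimate (4.84) of a gaussian mean under a zero-mean gaussian prior."""
    T = len(samples)
    return prior_var * sum(samples) / (noise_var + T * prior_var)

random.seed(0)
true_mean, noise_var = 1.0, 4.0                  # hypothetical values
samples = [random.gauss(true_mean, noise_var ** 0.5) for _ in range(1000)]

for T in (5, 50, 1000):
    subset = samples[:T]
    sample_mean = sum(subset) / T                                    # limit (4.85)
    tight = map_mean(subset, prior_var=0.01, noise_var=noise_var)    # confident prior
    loose = map_mean(subset, prior_var=100.0, noise_var=noise_var)   # vague prior
    print(T, round(sample_mean, 3), round(tight, 3), round(loose, 3))
```

With the small prior variance the estimate is pulled strongly toward the prior value 0 and moves toward the sample mean only as $T$ grows, while with the large prior variance it stays close to the sample mean even for small $T$. In both cases the MAP estimate equals the sample mean multiplied by the factor $T\sigma_\mu^2/(\sigma_x^2 + T\sigma_\mu^2) < 1$.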

Roughly speaking, the MAP estimator is a compromise between the general minimum mean-square error estimator (4.67) and the maximum likelihood estimator.

The MAP method has the advantage over the maximum likelihood method that it takes into account the (possibly available) prior information about the parameters $\boldsymbol{\theta}$, but it is computationally somewhat more difficult to determine because a second term appears in the likelihood equation (4.83). On the other hand, both the ML and MAP estimators are obtained from likelihood equations, avoiding the generally difficult integrations needed in computing the minimum mean-square estimator. If the posterior distribution $p(\boldsymbol{\theta} \mid \mathbf{x}_T)$ is symmetric around its peak value, the MAP estimator and MSE estimator coincide.

There is no guarantee that the MAP estimator is unbiased. It is also generally difficult to compute the covariance matrix of the estimation error for the MAP and ML estimators. However, the MAP estimator is intuitively sensible, yields in most cases good results in practice, and it has good asymptotic properties under appropriate conditions. These desirable characteristics justify its use.

From the document Independent Component Analysis (pages 116-121)
