ENTROPY - MATHEMATICAL PRELIMINARIES - Independent Component Analysis

MATHEMATICAL PRELIMINARIES

5 Information Theory

Estimation theory gives one approach to characterizing random variables. This was based on building parametric models and describing the data by the parameters.

An alternative approach is given by information theory. Here the emphasis is on coding. We want to code the observations. The observations can then be stored in the memory of a computer, or transmitted by a communications channel, for example. Finding a suitable code depends on the statistical properties of the data.

In independent component analysis (ICA), estimation theory and information theory offer the two principal theoretical approaches.

In this chapter, the basic concepts of information theory are introduced. The latter half of the chapter deals with a more specialized topic: approximation of entropy.

These concepts are needed in the ICA methods of Part II.

5.1 ENTROPY

Fig. 5.1 The function $f$ in (5.2), plotted on the interval $[0, 1]$.

not important since it only changes the measurement scale, so it is not explicitly mentioned.

Let us define the function $f$ as

$$f(p) = -p \log p, \quad \text{for } 0 \le p \le 1 \tag{5.2}$$

This is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between; it is plotted in Fig. 5.1. Using this function, entropy can be written as

$$H(X) = \sum_i f(P(X = a_i)) \tag{5.3}$$

Considering the shape of $f$, we see that the entropy is small if the probabilities $P(X = a_i)$ are close to $0$ or $1$, and large if the probabilities are in between.

In fact, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more “random”, i.e., unpredictable and unstructured the variable is, the larger its entropy. Assume that the probabilities are all close to $0$, except for one that is close to $1$ (the probabilities must sum up to one). Then there is little randomness in the variable, since it almost always takes the same value. This is reflected in its small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from $0$ and $1$, and $f$ takes large values. This means that the entropy is large, which reflects the fact that the variable is really random: we cannot predict which value it takes.
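The contrast between a nearly deterministic variable and a uniformly distributed one can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (so that entropy is measured in bits) and the convention $f(0) = 0$; the names `f`, `entropy`, and the two example distributions are illustrative and do not come from the text.

```python
import math

def f(p):
    """f(p) = -p log2 p as in Eq. (5.2), with the convention f(0) = 0."""
    return 0.0 if p == 0.0 else -p * math.log2(p)

def entropy(probs):
    """Entropy in bits: the sum of f over all probabilities, as in Eq. (5.3)."""
    return sum(f(p) for p in probs)

# Hypothetical distributions over four values (each sums to one).
nearly_deterministic = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print("nearly deterministic:", entropy(nearly_deterministic))
print("uniform:", entropy(uniform))  # 2.0 bits, the maximum for four values
```

The first distribution gives an entropy well below one bit, while the uniform one reaches two bits.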

Example 5.1 Let us consider a random variable $X$ that can have only two values, $a$ and $b$. Denote by $p$ the probability that it has the value $a$; then the probability that it is $b$ is equal to $1 - p$. The entropy of this random variable can be computed as

$$H(X) = f(p) + f(1 - p) \tag{5.4}$$


Thus, entropy is a simple function of $p$. (It does not depend on the values $a$ and $b$.) Clearly, this function has the same properties as $f$: it is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between. In fact, it is maximized for $p = 1/2$ (this is left as an exercise). Thus, the entropy is largest when the values are both obtained with a probability of $50\%$. In contrast, if one of these values is obtained almost always (say, with a probability of $99.9\%$), the entropy of $X$ is small, since there is little randomness in the variable.
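The behavior of the two-valued entropy in (5.4) can be explored with a simple grid search. This is only a numerical sketch, not a proof of where the maximum lies; it assumes base-2 logarithms, and the function name `binary_entropy` and the grid resolution are choices made for this illustration.

```python
import math

def binary_entropy(p):
    """H(X) = f(p) + f(1 - p) in bits, with f(q) = -q log2 q and f(0) = 0."""
    def f(q):
        return 0.0 if q == 0.0 else -q * math.log2(q)
    return f(p) + f(1.0 - p)

# Evaluate the entropy on a grid of probabilities between 0 and 1.
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=binary_entropy)

print("maximizing p on the grid:", best_p)
print("entropy at that p:", binary_entropy(best_p))
print("entropy at p = 0.999:", binary_entropy(0.999))
```

On this grid the maximum is found at $p = 0.5$ with an entropy of one bit, whereas $p = 0.999$ gives an entropy close to zero.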

5.1.2 Entropy and coding length

The connection between entropy and randomness can be made more rigorous by considering coding length. Assume that we want to find a binary code for a large number of observations of $X$, so that the code uses the minimum number of bits possible. According to the fundamental results of information theory, entropy is very closely related to the length of the code required. Under some simplifying assumptions, the length of the shortest code is bounded below by the entropy, and this bound can be approached arbitrarily closely; see, e.g., [97]. So, entropy gives roughly the average minimum code length of the random variable.

Since this topic is out of the scope of this book, we will just illustrate it with two examples.

Example 5.2 Consider again the case of a random variable with two possible values, $a$ and $b$. If the variable almost always takes the same value, its entropy is small. This is reflected in the fact that the variable is easy to code. In fact, assume the value $a$ is almost always obtained. Then, one efficient code might be obtained simply by counting how many $a$’s are found between two subsequent observations of $b$, and writing down these numbers. If we need to code only a few numbers, we are able to code the data very efficiently.

In the extreme case where the probability of $a$ is $1$, there is actually nothing left to code and the coding length is zero. On the other hand, if both values have the same probability, this trick cannot be used to obtain an efficient coding mechanism, and every value must be coded separately by one bit.
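The counting idea of Example 5.2 can be sketched as a simple run-length scheme. The code below is one possible reading of that idea, not a complete binary code: it only produces the list of counts, and the function names and the sample sequence are hypothetical.

```python
def run_lengths(seq, rare="b"):
    """Count the symbols preceding each occurrence of `rare`, plus the trailing run."""
    runs, count = [], 0
    for symbol in seq:
        if symbol == rare:
            runs.append(count)
            count = 0
        else:
            count += 1
    runs.append(count)  # symbols after the last rare one
    return runs

def decode(runs, common="a", rare="b"):
    """Rebuild the sequence: runs of `common` separated by single `rare` symbols."""
    return rare.join(common * n for n in runs)

# Hypothetical sequence of 100 observations in which b is rare.
seq = "a" * 40 + "b" + "a" * 25 + "b" + "a" * 33
runs = run_lengths(seq)

print(runs)                 # [40, 25, 33]
print(decode(runs) == seq)  # True
```

Here 100 observations are summarized by three numbers. If $a$ and $b$ were equally likely, the runs would be short and numerous, and nothing would be gained.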

Example 5.3 Consider a random variable $X$ that can have eight different values with probabilities $(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)$. The entropy of $X$ is $2$ bits (this computation is left as an exercise to the reader). If we just coded the data in the ordinary way, we would need 3 bits for every observation. But a more intelligent way is to code frequent values with short binary strings and infrequent values with longer strings. Here, we could use the following strings for the outcomes: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. (Note that the strings can be written one after another with no spaces, since they are designed so that one always knows when a string ends.) With this encoding the average number of bits needed for each outcome is only 2, which is in fact equal to the entropy. So we have gained a 33% reduction of coding length.
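The figures quoted in Example 5.3 can be verified directly. The sketch below uses the probabilities and code strings given in the example; the greedy decoder and the sample message are additions for illustration only, and rely on the fact that no code string is a prefix of another.

```python
import math
from fractions import Fraction

# Probabilities and code strings from Example 5.3.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)] + [Fraction(1, 64)] * 4
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy_bits = sum(-float(p) * math.log2(float(p)) for p in probs)
average_length = sum(p * len(c) for p, c in zip(probs, codes))

def decode(bits):
    """Split a concatenated bit string back into outcome indices."""
    lookup = {c: i for i, c in enumerate(codes)}
    out, current = [], ""
    for b in bits:
        current += b
        if current in lookup:  # a complete code string has been read
            out.append(lookup[current])
            current = ""
    return out

message = [0, 3, 1, 7, 0, 2]  # hypothetical sequence of outcome indices
encoded = "".join(codes[i] for i in message)

print(entropy_bits)                # 2.0
print(average_length)              # 2
print(decode(encoded) == message)  # True
```

Both the entropy and the average code length equal 2 bits, compared with 3 bits for a fixed-length code over eight values.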

5.1.3 Differential entropy

The definition of entropy for a discrete-valued random variable can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy.

The differential entropy $H$ of a random variable $x$ with density $p_x(\cdot)$ is defined as:

$$H(x) = -\int p_x(\xi) \log p_x(\xi) \, d\xi = \int f(p_x(\xi)) \, d\xi \tag{5.5}$$

Differential entropy can be interpreted as a measure of randomness in the same way as entropy. If the random variable is concentrated on certain small intervals, its differential entropy is small.

Note that differential entropy can be negative. Ordinary entropy cannot be negative because the function $f$ in (5.2) is nonnegative in the interval $[0, 1]$, and discrete probabilities necessarily stay in this interval. But probability densities can be larger than $1$, in which case $f$ takes negative values. So, when we speak of a “small differential entropy”, it may be negative and have a large absolute value.

It is now easy to see what kind of random variables have small entropies. They are the ones whose probability densities take large values, since these give strong negative contributions to the integral in (5.8). This means that certain intervals are quite probable. Thus we again find that entropy is small when the variable is not very random, that is, it is contained in some limited intervals with high probabilities.

Example 5.4 Consider a random variable $x$ that has a uniform probability distribution in the interval $[0, a]$. Its density is given by

$$p_x(\xi) = \begin{cases} 1/a, & \text{for } 0 \le \xi \le a \\ 0, & \text{otherwise} \end{cases} \tag{5.6}$$

The differential entropy can be evaluated as

$$H(x) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, d\xi = \log a \tag{5.7}$$

Thus we see that the entropy is large if $a$ is large, and small if $a$ is small. This is natural because the smaller $a$ is, the less randomness there is in $x$. In the limit where $a$ goes to $0$, differential entropy goes to $-\infty$, because in the limit, $x$ is no longer random at all: it is always $0$.
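The result $H(x) = \log a$ for the uniform density, including the negative values obtained for $a < 1$, can be reproduced by numerical integration. The following is a rough sketch using a midpoint rule and natural logarithms; the helper names and the chosen values of $a$ are assumptions made for this illustration.

```python
import math

def differential_entropy(density, lo, hi, n=10_000):
    """Midpoint-rule approximation of -integral of p(xi) log p(xi) over [lo, hi], in nats."""
    width = (hi - lo) / n
    total = 0.0
    for k in range(n):
        p = density(lo + (k + 0.5) * width)
        if p > 0.0:  # f(0) = 0 by convention
            total -= p * math.log(p) * width
    return total

def uniform_density(a):
    """Density of Eq. (5.6): 1/a on [0, a], zero elsewhere."""
    return lambda xi: 1.0 / a if 0.0 <= xi <= a else 0.0

results = {a: differential_entropy(uniform_density(a), 0.0, a) for a in (4.0, 1.0, 0.01)}
for a, h in results.items():
    print(f"a = {a}: numerical {h:.6f}, log a = {math.log(a):.6f}")
```

For $a = 0.01$ the density equals 100 on its support, and the computed entropy is clearly negative.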

The interpretation of entropy as coding length is more or less valid with differential entropy. The situation is more complicated, however, since the coding length interpretation requires that we discretize (quantize) the values of $x$. In this case, the coding length depends on the discretization, i.e., on the accuracy with which we want to represent the random variable. Thus the actual coding length is given by the sum of entropy and a function of the accuracy of representation. We will not go into the details here; see [97] for more information.
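For the uniform variable of Example 5.4, the dependence of coding length on accuracy can be made concrete. The sketch below assumes that $[0, a]$ is cut into equal bins of width `delta` (a hypothetical parameter) and that $a$ is an exact multiple of `delta`. In this special case the entropy of the quantized variable equals the differential entropy plus $\log(1/\text{delta})$ exactly; for other densities the relation holds only approximately, for small bins.

```python
import math

def quantized_entropy_uniform(a, delta):
    """Entropy in bits of a uniform variable on [0, a] after binning with width delta.

    Assumes a is an exact multiple of delta, so that all bins are equally likely.
    """
    n_bins = round(a / delta)
    p = 1.0 / n_bins
    return -n_bins * p * math.log2(p)

a = 8.0
h_diff = math.log2(a)  # differential entropy of Eq. (5.7), in bits

for delta in (1.0, 0.25, 0.0625):
    discrete = quantized_entropy_uniform(a, delta)
    predicted = h_diff + math.log2(1.0 / delta)
    print(f"delta = {delta}: quantized {discrete} bits, predicted {predicted} bits")
```

Each refinement of the bin width by a factor of four costs two additional bits per observation, while the differential entropy term stays fixed.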


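The definition of differential entropy can be straightforwardly generalized to the multidimensional case. Let $\mathbf{x}$ be a random vector with density $p_x(\cdot)$. The differential entropy is then defined as:

$$H(\mathbf{x}) = -\int p_x(\boldsymbol{\xi}) \log p_x(\boldsymbol{\xi}) \, d\boldsymbol{\xi} = \int f(p_x(\boldsymbol{\xi})) \, d\boldsymbol{\xi} \tag{5.8}$$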

5.1.4 Entropy of a transformation

Consider an invertible transformation of the random vector $\mathbf{x}$, say

$$\mathbf{y} = \mathbf{f}(\mathbf{x}) \tag{5.9}$$

In this section, we show the connection between the entropy of $\mathbf{y}$ and that of $\mathbf{x}$.

A short, if somewhat sloppy, derivation is as follows. (A more rigorous derivation is given in the Appendix.) Denote by $J\mathbf{f}(\boldsymbol{\xi})$ the Jacobian matrix of the function $\mathbf{f}$, i.e., the matrix of the partial derivatives of $\mathbf{f}$ at point $\boldsymbol{\xi}$. The classic relation between the density $p_y$ of $\mathbf{y}$ and the density $p_x$ of $\mathbf{x}$, as given in Eq. (2.82), can then be formulated as

$$p_y(\boldsymbol{\eta}) = p_x(\mathbf{f}^{-1}(\boldsymbol{\eta})) \, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \tag{5.10}$$

Now, expressing the entropy as an expectation

$$H(\mathbf{y}) = -E\{\log p_y(\mathbf{y})\} \tag{5.11}$$

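we get

$$\begin{aligned}
E\{\log p_y(\mathbf{y})\} &= E\{\log[\, p_x(\mathbf{f}^{-1}(\mathbf{y})) \, |\det J\mathbf{f}(\mathbf{f}^{-1}(\mathbf{y}))|^{-1}]\} \\
&= E\{\log[\, p_x(\mathbf{x}) \, |\det J\mathbf{f}(\mathbf{x})|^{-1}]\} \\
&= E\{\log p_x(\mathbf{x})\} - E\{\log |\det J\mathbf{f}(\mathbf{x})|\}
\end{aligned} \tag{5.12}$$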

Thus we obtain the relation between the entropies as

$$H(\mathbf{y}) = H(\mathbf{x}) + E\{\log |\det J\mathbf{f}(\mathbf{x})|\} \tag{5.13}$$

In other words, the entropy is increased in the transformation by $E\{\log |\det J\mathbf{f}(\mathbf{x})|\}$.
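As a worked illustration of the relation between $H(\mathbf{y})$ and $H(\mathbf{x})$, and not an example taken from the text, consider the scalar case. Assume that $x$ is Gaussian with mean $\mu$ and variance $\sigma^2$, whose differential entropy is $\tfrac{1}{2}\log(2\pi e \sigma^2)$, and take the invertible transformation $y = e^x$:

```latex
\begin{align*}
y &= f(x) = e^{x}, \qquad Jf(x) = e^{x}, \qquad \log|\det Jf(x)| = x, \\
H(y) &= H(x) + E\{\log|\det Jf(x)|\} = H(x) + E\{x\}
      = \tfrac{1}{2}\log(2\pi e \sigma^{2}) + \mu .
\end{align*}
```

This agrees with the standard expression for the differential entropy of a log-normal variable, so in this case the entropy increases in the transformation when $\mu > 0$ and decreases when $\mu < 0$.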

An important special case is the linear transformation

$$\mathbf{y} = \mathbf{M}\mathbf{x} \tag{5.14}$$

in which case we obtain

$$H(\mathbf{y}) = H(\mathbf{x}) + \log |\det \mathbf{M}| \tag{5.15}$$
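The linear case can be checked numerically for a Gaussian vector, for which a closed-form entropy is available. The sketch below assumes the standard result that an $n$-dimensional Gaussian with covariance $\mathbf{C}$ has differential entropy $\tfrac{1}{2}\log((2\pi e)^n \det \mathbf{C})$, and that $\mathbf{M}\mathbf{x}$ has covariance $\mathbf{M}\mathbf{C}\mathbf{M}^T$. The matrices are arbitrary values chosen for illustration, and the 2-by-2 helpers are written out to keep the example free of external libraries.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose2(m):
    """Transpose of a 2x2 matrix."""
    return [[m[j][i] for j in range(2)] for i in range(2)]

def gaussian_entropy2(cov):
    """Differential entropy (nats) of a 2-D Gaussian with covariance cov."""
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det2(cov))

cov_x = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical covariance of x
M = [[3.0, 1.0], [0.0, 0.5]]      # hypothetical invertible matrix
cov_y = matmul2(matmul2(M, cov_x), transpose2(M))  # covariance of y = Mx

entropy_y = gaussian_entropy2(cov_y)
predicted = gaussian_entropy2(cov_x) + math.log(abs(det2(M)))

print(entropy_y, predicted)
```

The two printed values coincide up to rounding, because $\det(\mathbf{M}\mathbf{C}\mathbf{M}^T) = (\det \mathbf{M})^2 \det \mathbf{C}$.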

This also shows that differential entropy is not scale-invariant. Consider a random variable $x$. If we multiply it by a scalar constant, $\alpha$, differential entropy changes as

$$H(\alpha x) = H(x) + \log |\alpha| \tag{5.16}$$

Thus, just by changing the scale, we can change the differential entropy. This is why the scale of $x$ is often fixed before measuring its differential entropy.
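The effect of rescaling is easy to see for the uniform variable of Example 5.4. This is a small sketch under the assumption that $\alpha > 0$, so that $\alpha x$ is uniform on $[0, \alpha a]$; the values of `a` and `alpha` are arbitrary.

```python
import math

def uniform_entropy(a):
    """Differential entropy (nats) of a uniform variable on [0, a], as in Eq. (5.7)."""
    return math.log(a)

a = 2.0
for alpha in (5.0, 0.1):
    scaled = uniform_entropy(alpha * a)  # alpha * x is uniform on [0, alpha * a]
    predicted = uniform_entropy(a) + math.log(abs(alpha))
    print(f"alpha = {alpha}: H(alpha x) = {scaled:.6f}, H(x) + log|alpha| = {predicted:.6f}")
```

Stretching the variable ($\alpha = 5$) raises the differential entropy and shrinking it ($\alpha = 0.1$) lowers it, even though the shape of the distribution is unchanged.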

5.2 MUTUAL INFORMATION
