MATHEMATICAL PRELIMINARIES
5 Information Theory
Estimation theory gives one approach to characterizing random variables. This was based on building parametric models and describing the data by the parameters.
An alternative approach is given by information theory. Here the emphasis is on coding. We want to code the observations. The observations can then be stored in the memory of a computer, or transmitted by a communications channel, for example. Finding a suitable code depends on the statistical properties of the data.
In independent component analysis (ICA), estimation theory and information theory offer the two principal theoretical approaches.
In this chapter, the basic concepts of information theory are introduced. The latter half of the chapter deals with a more specialized topic: approximation of entropy.
These concepts are needed in the ICA methods of Part II.
5.1 ENTROPY
Fig. 5.1 The function $f$ in (5.2), plotted on the interval $[0, 1]$.
not important since it only changes the measurement scale, so it is not explicitly mentioned.
Let us define the function $f$ as
$$f(p) = -p \log p, \quad \text{for } 0 \le p \le 1 \tag{5.2}$$
This is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between; it is plotted in Fig. 5.1. Using this function, entropy can be written as
$$H(X) = \sum_i f(P(X = a_i)) \tag{5.3}$$
Considering the shape of $f$, we see that the entropy is small if the probabilities $P(X = a_i)$ are close to $0$ or $1$, and large if the probabilities are in between.
In fact, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more “random”, i.e., unpredictable and unstructured, the variable is, the larger its entropy. Assume that the probabilities are all close to $0$, except for one that is close to $1$ (the probabilities must sum to one). Then there is little randomness in the variable, since it almost always takes the same value. This is reflected in its small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from $0$ and $1$, and $f$ takes large values. This means that the entropy is large, which reflects the fact that the variable is really random: we cannot predict which value it takes.
Example 5.1 Let us consider a random variable
$X$ that can have only two values, $a$ and $b$. Denote by $p$ the probability that it has the value $a$; then the probability that it is $b$ is equal to $1 - p$. The entropy of this random variable can be computed as
$$H(X) = f(p) + f(1 - p) \tag{5.4}$$
Thus, entropy is a simple function of $p$. (It does not depend on the values $a$ and $b$.) Clearly, this function has the same properties as $f$: it is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between. In fact, it is maximized for $p = 1/2$ (this is left as an exercise). Thus, the entropy is largest when the values are both obtained with a probability of $50\%$. In contrast, if one of these values is obtained almost always (say, with a probability of $99.9\%$), the entropy of $X$ is small, since there is little randomness in the variable.
5.1.2 Entropy and coding length
The connection between entropy and randomness can be made more rigorous by considering coding length. Assume that we want to find a binary code for a large number of observations of $X$, so that the code uses the minimum number of bits possible. According to the fundamental results of information theory, entropy is very closely related to the length of the code required. Under some simplifying assumptions, the length of the shortest code is bounded below by the entropy, and this bound can be approached arbitrarily closely; see, e.g., [97]. So, entropy gives roughly the average minimum code length of the random variable.
Since this topic is outside the scope of this book, we will just illustrate it with two examples.
Example 5.2 Consider again the case of a random variable with two possible values, $a$ and $b$. If the variable almost always takes the same value, its entropy is small. This is reflected in the fact that the variable is easy to code. In fact, assume the value $a$ is almost always obtained. Then, one efficient code might be obtained simply by counting how many $a$'s are found between two successive observations of $b$, and writing down these numbers. If we need to code only a few numbers, we are able to code the data very efficiently.
In the extreme case where the probability of $a$ is $1$, there is actually nothing left to code and the coding length is zero. On the other hand, if both values have the same probability, this trick cannot be used to obtain an efficient coding mechanism, and every value must be coded separately by one bit.
Example 5.3 Consider a random variable
$X$ that can have eight different values with probabilities $(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)$. The entropy of $X$ is $2$ bits (this computation is left as an exercise to the reader). If we just coded the data in the ordinary way, we would need 3 bits for every observation. But a more intelligent way is to code frequent values with short binary strings and infrequent values with longer strings. Here, we could use the following strings for the outcomes: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. (Note that the strings can be written one after another with no spaces, since they are designed so that one always knows when a string ends.) With this encoding, the average number of bits needed for each outcome is only 2, which is in fact equal to the entropy. So we have gained a 33% reduction in coding length.
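The figures quoted in Example 5.3 can be checked numerically. The following is a minimal Python sketch, not part of the original text; the helper names `entropy_bits` and `decode` are illustrative assumptions. It computes the entropy with the base-2 logarithm (so the unit is bits), the average length of the listed code strings, and decodes a concatenated bit string to illustrate that no separators are needed.

```python
import math

def entropy_bits(probs):
    """Entropy in bits: sum of f(p) = -p*log2(p); p = 0 contributes 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Probabilities and code strings as listed in Example 5.3.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

H = entropy_bits(probs)                                   # entropy of X
avg_len = sum(p * len(c) for p, c in zip(probs, codes))   # mean bits per outcome

def decode(bits, codes):
    """Decode a concatenated bit string by matching code strings greedily.

    This works because no code string is a prefix of another one.
    """
    table = {c: i for i, c in enumerate(codes)}
    out, current = [], ""
    for b in bits:
        current += b
        if current in table:
            out.append(table[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code string")
    return out

# Hypothetical sequence of outcome indices, used only for illustration.
message = [0, 3, 1, 7, 0, 4]
encoded = "".join(codes[i] for i in message)
decoded = decode(encoded, codes)

# Equal probabilities for two values, as at the end of Example 5.2.
H_fair = entropy_bits([0.5, 0.5])

print("entropy (bits):", H)
print("average code length (bits):", avg_len)
print("round trip ok:", decoded == message)
print("entropy of two equiprobable values (bits):", H_fair)
```

Both the entropy and the average code length should come out as 2 bits, up to rounding, while two equiprobable values give 1 bit, in line with the remark that each value must then be coded by one bit.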
5.1.3 Differential entropy
The definition of entropy for a discrete-valued random variable can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy.
The differential entropy $H$ of a random variable $x$ with density $p_x(\cdot)$ is defined as
$$H(x) = -\int p_x(\xi) \log p_x(\xi)\, d\xi = \int f(p_x(\xi))\, d\xi \tag{5.5}$$
Differential entropy can be interpreted as a measure of randomness in the same way as entropy. If the random variable is concentrated on certain small intervals, its differential entropy is small.
Note that differential entropy can be negative. Ordinary entropy cannot be negative because the function $f$ in (5.2) is nonnegative in the interval $[0, 1]$, and discrete probabilities necessarily stay in this interval. But probability densities can be larger than $1$, in which case $f$ takes negative values. So, when we speak of a “small differential entropy”, it may be negative and have a large absolute value.
It is now easy to see what kind of random variables have small entropies. They are the ones whose probability densities take large values, since these give strong negative contributions to the integral in (5.8). This means that certain intervals are quite probable. Thus we again find that entropy is small when the variable is not very random, that is, it is contained in some limited intervals with high probabilities.
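As a numerical illustration of negative differential entropy, the sketch below approximates the integral of $f(p_x(\xi))$ for a zero-mean gaussian density with two different standard deviations. This example is not from the original text: the choice of the gaussian density, the function names, and the midpoint-rule integration are assumptions made for illustration. The result is compared with the standard closed form $\frac{1}{2}\log(2\pi e \sigma^2)$ (natural logarithm), which is not derived in this section.

```python
import math

def gaussian_pdf(xi, sigma):
    """Density of a zero-mean gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (xi / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo, hi, n=20_000):
    """Midpoint-rule approximation of -integral of p*log(p) over [lo, hi], in nats."""
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        p = pdf(lo + (k + 0.5) * step)
        if p > 0.0:                      # f(0) = 0, so zero density adds nothing
            total -= p * math.log(p) * step
    return total

results = {}
for sigma in (1.0, 0.1):
    # Integrate over +/- 10 standard deviations; the tails beyond are negligible.
    h_num = differential_entropy(lambda xi: gaussian_pdf(xi, sigma),
                                 -10 * sigma, 10 * sigma)
    h_exact = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
    results[sigma] = (h_num, h_exact)
    print(f"sigma={sigma}: numerical={h_num:.6f}, closed form={h_exact:.6f}")
```

With the smaller standard deviation the density exceeds $1$ near its peak, and the computed differential entropy is negative, which matches the argument above.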
Example 5.4 Consider a random variable $x$ that has a uniform probability distribution in the interval $[0, a]$. Its density is given by
$$p_x(\xi) = \begin{cases} 1/a, & \text{for } 0 \le \xi \le a \\ 0, & \text{otherwise} \end{cases} \tag{5.6}$$
The differential entropy can be evaluated as
$$H(x) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, d\xi = \log a \tag{5.7}$$
Thus we see that the entropy is large if $a$ is large, and small if $a$ is small. This is natural because the smaller $a$ is, the less randomness there is in $x$. In the limit where $a$ goes to $0$, differential entropy goes to $-\infty$, because in the limit, $x$ is no longer random at all: it is always $0$.
The interpretation of entropy as coding length is more or less valid with differential entropy. The situation is more complicated, however, since the coding length interpretation requires that we discretize (quantize) the values of $x$. In this case, the coding length depends on the discretization, i.e., on the accuracy with which we want to represent the random variable. Thus the actual coding length is given by the sum of entropy and a function of the accuracy of representation. We will not go into the details here; see [97] for more information.
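The dependence of coding length on the accuracy of representation can be made concrete for the uniform variable of Example 5.4. The sketch below is an illustration under stated assumptions, not the book's derivation: the interval $[0, a]$ is split into equal cells of width $\Delta$ (the names `delta` and `quantized_entropy_uniform` are hypothetical), and the ordinary entropy of the quantized variable is compared with $\log a - \log \Delta$, that is, the differential entropy from (5.7) plus a term that depends only on the accuracy. For the uniform density this relation is exact; for other densities it should be expected to hold only approximately, for small $\Delta$.

```python
import math

def quantized_entropy_uniform(n_bins):
    """Ordinary entropy (nats) of a uniform variable quantized into n_bins equal cells.

    Each cell has probability 1/n_bins, whatever the interval length is.
    """
    probs = [1.0 / n_bins] * n_bins
    return -sum(p * math.log(p) for p in probs)

a = 0.5                      # interval length; chosen below 1 so that log(a) < 0
h_diff = math.log(a)         # differential entropy of the uniform density, Eq. (5.7)

rows = []
for n_bins in (4, 64, 1024):
    delta = a / n_bins                       # accuracy of the representation
    H_q = quantized_entropy_uniform(n_bins)  # entropy of the quantized variable
    predicted = h_diff - math.log(delta)     # differential entropy + accuracy term
    rows.append((delta, H_q, predicted))
    print(f"delta={delta:.6f}: quantized entropy={H_q:.6f}, "
          f"h - log(delta)={predicted:.6f}")
```

The quantized entropy is always nonnegative and grows as the cells shrink, even though the differential entropy itself is negative here.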
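The definition of differential entropy can be straightforwardly generalized to the multidimensional case. Let $\mathbf{x}$ be a random vector with density $p_{\mathbf{x}}(\cdot)$. The differential entropy is then defined as
$$H(\mathbf{x}) = -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \log p_{\mathbf{x}}(\boldsymbol{\xi})\, d\boldsymbol{\xi} = \int f(p_{\mathbf{x}}(\boldsymbol{\xi}))\, d\boldsymbol{\xi} \tag{5.8}$$
5.1.4 Entropy of a transformation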
Consider an invertible transformation of the random vector $\mathbf{x}$, say
$$\mathbf{y} = \mathbf{f}(\mathbf{x}) \tag{5.9}$$
In this section, we show the connection between the entropy of $\mathbf{y}$ and that of $\mathbf{x}$.
A short, if somewhat sloppy, derivation is as follows. (A more rigorous derivation is given in the Appendix.) Denote by $J\mathbf{f}(\boldsymbol{\xi})$ the Jacobian matrix of the function $\mathbf{f}$, i.e., the matrix of the partial derivatives of $\mathbf{f}$ at point $\boldsymbol{\xi}$. The classic relation between the density $p_{\mathbf{y}}$ of $\mathbf{y}$ and the density $p_{\mathbf{x}}$ of $\mathbf{x}$, as given in Eq. (2.82), can then be formulated as
$$p_{\mathbf{y}}(\boldsymbol{\eta}) = p_{\mathbf{x}}(\mathbf{f}^{-1}(\boldsymbol{\eta}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\boldsymbol{\eta}))|^{-1} \tag{5.10}$$
Now, expressing the entropy as an expectation
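$$H(\mathbf{y}) = -E\{\log p_{\mathbf{y}}(\mathbf{y})\} \tag{5.11}$$
we get
$$\begin{aligned}
E\{\log p_{\mathbf{y}}(\mathbf{y})\} &= E\{\log[p_{\mathbf{x}}(\mathbf{f}^{-1}(\mathbf{y}))\, |\det J\mathbf{f}(\mathbf{f}^{-1}(\mathbf{y}))|^{-1}]\} \\
&= E\{\log[p_{\mathbf{x}}(\mathbf{x})\, |\det J\mathbf{f}(\mathbf{x})|^{-1}]\} \\
&= E\{\log p_{\mathbf{x}}(\mathbf{x})\} - E\{\log |\det J\mathbf{f}(\mathbf{x})|\}
\end{aligned} \tag{5.12}$$
Thus we obtain the relation between the entropies as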
$$H(\mathbf{y}) = H(\mathbf{x}) + E\{\log |\det J\mathbf{f}(\mathbf{x})|\} \tag{5.13}$$
In other words, the entropy is increased in the transformation by $E\{\log |\det J\mathbf{f}(\mathbf{x})|\}$.
An important special case is the linear transformation
$$\mathbf{y} = \mathbf{M}\mathbf{x} \tag{5.14}$$
in which case we obtain
$$H(\mathbf{y}) = H(\mathbf{x}) + \log |\det \mathbf{M}| \tag{5.15}$$
This also shows that differential entropy is not scale-invariant. Consider a random variable $x$. If we multiply it by a scalar constant $\alpha$, differential entropy changes as
$$H(\alpha x) = H(x) + \log |\alpha| \tag{5.16}$$
Thus, just by changing the scale, we can change the differential entropy. This is why the scale of $x$ is often fixed before measuring its differential entropy.
5.2 MUTUAL INFORMATION