MATHEMATICAL PRELIMINARIES
Thus higher-order cumulants measure the departure of a random vector from a gaussian random vector with an identical mean vector and covariance matrix. This property is highly useful, making it possible to use cumulants for extracting the nongaussian part of a signal. For example, cumulants make it possible to ignore additive gaussian noise corrupting a nongaussian signal.
Moments, cumulants, and characteristic functions have several other properties which are not discussed here. See, for example, the books [149, 319, 386] for more information. However, it is worth mentioning that both moments and cumulants have symmetry properties that can be exploited to reduce the computational load in estimating them [319].
For estimating moments and cumulants, one can apply the procedure introduced in Section 2.2.4. However, the fourth-order cumulants cannot be estimated directly, but one must first estimate the necessary moments as is obvious from (2.106). Practical estimation formulas can be found in [319, 315].
A drawback in utilizing higher-order statistics is that reliable estimation of higher-order moments and cumulants requires many more samples than for second-order statistics [318]. Another drawback is that higher-order statistics can be very sensitive to outliers in the data (see Section 8.3.1). For example, a few data samples having the highest absolute values may largely determine the value of kurtosis. Higher-order statistics can be taken into account in a more robust way by using the nonlinear hyperbolic tangent function $\tanh(u)$, whose values always lie in the interval $(-1, 1)$, or some other nonlinearity that grows slower than linearly with its argument value.
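The sensitivity of kurtosis to outliers can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: it uses the usual kurtosis $\mathrm{E}\{x^4\} - 3(\mathrm{E}\{x^2\})^2$ of mean-removed data, and compares it with a hypothetical bounded statistic, the sample mean of $\tanh^2(u)$, which is chosen here purely for demonstration. The sample size, the seed, and the outlier value 20 are arbitrary.

```python
import math
import random


def kurtosis(samples):
    """Sample kurtosis E{x^4} - 3 (E{x^2})^2 of the mean-removed data."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    m2 = sum(c ** 2 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    return m4 - 3.0 * m2 ** 2


def tanh_statistic(samples):
    """Hypothetical bounded statistic: sample mean of tanh(u)^2, never above 1."""
    return sum(math.tanh(u) ** 2 for u in samples) / len(samples)


rng = random.Random(0)                        # fixed seed, arbitrary choice
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]
contaminated = clean + [20.0]                 # the same data plus one gross outlier

print("kurtosis      :", kurtosis(clean), "->", kurtosis(contaminated))
print("tanh statistic:", tanh_statistic(clean), "->", tanh_statistic(contaminated))
```

A single outlier among a thousand samples moves the kurtosis estimate far from its gaussian value of zero, whereas the bounded statistic can change by at most about $1/K$ for $K$ samples.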
2.8 STOCHASTIC PROCESSES *
2.8.1 Introduction and definition
Fig. 2.10 Sample functions of a stochastic process. [Plot: the first, second, and $n$th sample functions $x_1(t)$, $x_2(t)$, $\ldots$, $x_n(t)$ against time $t$, with the observation instant $t_1$ marked.]
Figure 2.10 shows an example of a scalar stochastic process represented by the set of sample functions $\{x_j(t)\}$, $j = 1, 2, \ldots, n$. Assume that the probability of occurrence of the $i$th sample function $x_i(t)$ is $P_i$, and similarly for the other sample functions. Suppose then we observe the set of waveforms $\{x_j(t)\}$, $j = 1, 2, \ldots, n$, simultaneously at some time instant $t = t_1$, as shown in Figure 2.10. Clearly, the values $\{x_j(t_1)\}$, $j = 1, 2, \ldots, n$, of the $n$ waveforms at time $t_1$ form a discrete random variable with $n$ possible values, each having the respective probability of occurrence $P_j$. Consider then another time instant $t = t_2$. We obtain again a random variable $\{x_j(t_2)\}$, which may have a different distribution than $\{x_j(t_1)\}$.

Usually the number of possible waveforms arising from an experiment is infinitely large due to additive noise. At each time instant a continuous random variable having some distribution arises instead of the discrete one discussed above. However, the time instants $t_1, t_2, \ldots$ at which the stochastic process is observed are discrete due to sampling. Usually the observation intervals are equispaced, and the resulting samples are represented using integer indices $x_j(1) = x_j(t_1), x_j(2) = x_j(t_2), \ldots$ for notational simplicity. As a result, a typical representation for a stochastic process consists of continuous random variables at discrete (integer) time instants.
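The discrete case described above, with $n$ sample functions occurring with probabilities $P_j$, can be made concrete with a small Python sketch. The three sample functions and their probabilities below are hypothetical and chosen only for illustration; the point is that fixing a time instant turns the ensemble into an ordinary discrete random variable.

```python
import math

# Hypothetical ensemble of n = 3 sample functions x_j(t) with probabilities P_j.
sample_functions = [
    lambda t: math.cos(t),
    lambda t: math.sin(t),
    lambda t: 0.5 * t,
]
probabilities = [0.5, 0.3, 0.2]


def distribution_at(t):
    """Discrete random variable formed by the ensemble at time t.

    Returns a list of (value, probability) pairs, one per sample function.
    """
    return [(x(t), p) for x, p in zip(sample_functions, probabilities)]


def ensemble_mean(t):
    """Expected value of the random variable observed at time t."""
    return sum(value * p for value, p in distribution_at(t))


for t in (0.0, math.pi / 2):
    print("t =", t, "values:", distribution_at(t), "mean:", ensemble_mean(t))
```

The two time instants give different sets of values and different means, which is the sense in which the random variables at $t_1$ and $t_2$ may have different distributions.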
2.8.2 Stationarity, mean, and autocorrelation function
Consider a stochastic process $\{x_j(t)\}$ defined at discrete times $t_1, t_2, \ldots, t_k$. For characterizing the process $\{x_j(t)\}$ completely, we should know the joint probability density of all the random variables $\{x_j(t_1)\}$, $\{x_j(t_2)\}$, $\ldots$, $\{x_j(t_k)\}$. The stochastic process is said to be stationary in the strict sense if its joint density is invariant under shifts of the time origin. That is, the joint pdf of the process depends only on the differences $t_i - t_j$ between the time instants $t_1, t_2, \ldots, t_k$ but not directly on them.

In practice, the joint probability density is not known, and its estimation from samples would be too tedious and require an excessive number of samples even if they were available. Therefore, stochastic processes are usually characterized in terms of their first two moments, namely the mean and autocorrelation or autocovariance functions. They give a coarse but useful description of the distribution. Using these statistics is sufficient for linear processing (for example, filtering) of stochastic processes, and the number of samples needed for estimating them remains reasonable.
The mean function of the stochastic process $\{x(t)\}$ is defined by

$$ m_x(t) = \mathrm{E}\{x(t)\} = \int_{-\infty}^{\infty} x(t)\, p_{x(t)}(x(t))\, dx(t) \tag{2.107} $$

Generally, this is a function of time $t$. However, when the process $\{x(t)\}$ is stationary, the probability density functions of all the random variables corresponding to different time instants become the same. This common pdf is denoted by $p_x(x)$. In such a case, the mean function $m_x(t)$ reduces to a constant mean $m_x$ independent of time.

Similarly, the variance function of the stochastic process $\{x(t)\}$

$$ \sigma_x^2(t) = \mathrm{E}\{[x(t) - m_x(t)]^2\} = \int_{-\infty}^{\infty} [x(t) - m_x(t)]^2\, p_{x(t)}(x(t))\, dx(t) \tag{2.108} $$

becomes a time-invariant constant $\sigma_x^2$ for a stationary process.

Other second-order statistics of a random process $\{x(t)\}$ are defined in a similar manner. In particular, the autocovariance function of the process $\{x(t)\}$ is given by

$$ c_x(t, \tau) = \mathrm{cov}[x(t), x(t - \tau)] = \mathrm{E}\{[x(t) - m_x(t)][x(t - \tau) - m_x(t - \tau)]\} \tag{2.109} $$

The expectation here is computed over the joint probability density of the random variables $x(t)$ and $x(t - \tau)$, where $\tau$ is the constant time lag between the observation times $t$ and $t - \tau$. For the zero lag $\tau = 0$, the autocovariance reduces to the variance function (2.108). For stationary processes, the autocovariance function (2.109) is independent of the time $t$, but depends on the lag $\tau$: $c_x(t, \tau) = c_x(\tau)$.

Analogously, the autocorrelation function of the process $\{x(t)\}$ is defined by

$$ r_x(t, \tau) = \mathrm{E}\{x(t)\, x(t - \tau)\} \tag{2.110} $$

If $\{x(t)\}$ is stationary, this again depends on the time lag $\tau$ only: $r_x(t, \tau) = r_x(\tau)$.

Generally, if the mean function $m_x(t)$ of the process is zero, the autocovariance and autocorrelation functions become the same. If the lag $\tau = 0$, the autocorrelation function reduces to the mean-square function $r_x(t, 0) = \mathrm{E}\{x^2(t)\}$ of the process, which becomes a constant $r_x(0)$ for a stationary process $\{x(t)\}$.

These concepts can be extended for two different stochastic processes $\{x(t)\}$ and $\{y(t)\}$ in an obvious manner (cf. Section 2.2.3). More specifically, the cross-correlation function $r_{xy}(t, \tau)$ and the cross-covariance function $c_{xy}(t, \tau)$ of the processes $\{x(t)\}$ and $\{y(t)\}$ are, respectively, defined by

$$ r_{xy}(t, \tau) = \mathrm{E}\{x(t)\, y(t - \tau)\} \tag{2.111} $$

$$ c_{xy}(t, \tau) = \mathrm{E}\{[x(t) - m_x(t)][y(t - \tau) - m_y(t - \tau)]\} \tag{2.112} $$

Several blind source separation methods are based on the use of cross-covariance functions (second-order temporal statistics). These methods will be discussed in Chapter 18.
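When many independent sample functions are available, the expectations in (2.107), (2.109), and (2.112) can be approximated by averaging across the ensemble at fixed time instants. The Python sketch below illustrates this under assumptions that are not part of the text: a hypothetical process with a linear trend $0.1\,t$ plus correlated noise $e(t+1) + 0.5\,e(t)$, where $e$ is unit-variance gaussian noise, for which $m_x(t) = 0.1\,t$, $c_x(t, 0) = 1.25$, $c_x(t, 1) = 0.5$, and $c_x(t, \tau) = 0$ for $|\tau| \geq 2$.

```python
import random


def make_realization(rng, length):
    """One sample function: trend 0.1*t plus correlated noise e(t+1) + 0.5*e(t)."""
    e = [rng.gauss(0.0, 1.0) for _ in range(length + 1)]
    return [0.1 * t + e[t + 1] + 0.5 * e[t] for t in range(length)]


def mean_function(realizations, t):
    """Ensemble estimate of m_x(t) = E{x(t)}, cf. (2.107)."""
    return sum(x[t] for x in realizations) / len(realizations)


def cross_covariance(xs, ys, t, lag):
    """Ensemble estimate of c_xy(t, lag), cf. (2.112); requires t - lag >= 0."""
    m_x = mean_function(xs, t)
    m_y = mean_function(ys, t - lag)
    products = ((x[t] - m_x) * (y[t - lag] - m_y) for x, y in zip(xs, ys))
    return sum(products) / len(xs)


def autocovariance(xs, t, lag):
    """Ensemble estimate of c_x(t, lag), cf. (2.109): cross-covariance of x with itself."""
    return cross_covariance(xs, xs, t, lag)


rng = random.Random(1)                        # fixed seed, arbitrary choice
ensemble = [make_realization(rng, 10) for _ in range(5000)]

print("mean at t=5      :", mean_function(ensemble, 5))
for lag in range(4):
    print("autocov t=5 lag", lag, ":", autocovariance(ensemble, 5, lag))
```

The mean depends on $t$ because of the trend, so this process is not stationary, yet its autocovariance depends only on the lag. Subtracting the mean function in (2.109) is what separates the two effects.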
2.8.3 Wide-sense stationary processes
A very important subclass of stochastic processes consists of wide-sense stationary (WSS) processes, which are required to satisfy the following properties:
1. The mean function $m_x(t)$ of the process is a constant $m_x$ for all $t$.
2. The autocorrelation function is independent of a time shift: $\mathrm{E}\{x(t)\, x(t - \tau)\} = r_x(\tau)$ for all $t$.
3. The variance, or the mean-square value $r_x(0) = \mathrm{E}\{x^2(t)\}$ of the process is finite.

The importance of wide-sense stationary stochastic processes stems from two facts.
First, they can often adequately describe the physical situation. Many practical stochastic processes are actually at least mildly nonstationary, meaning that their statistical properties vary slowly with time. However, such processes are usually roughly WSS over short time intervals. Second, it is relatively easy to develop useful mathematical algorithms for WSS processes. This in turn follows from limiting their characterization to first- and second-order statistics.
Example 2.8 Consider the stochastic process

$$ x(t) = a \cos(\omega t) + b \sin(\omega t) \tag{2.113} $$

where $a$ and $b$ are scalar random variables and $\omega$ a constant parameter (angular frequency). The mean of the process $x(t)$ is

$$ m_x(t) = \mathrm{E}\{x(t)\} = \mathrm{E}\{a\} \cos(\omega t) + \mathrm{E}\{b\} \sin(\omega t) \tag{2.114} $$

and its autocorrelation function can be written

$$ \begin{aligned} r_x(t, \tau) = \mathrm{E}\{x(t)\, x(t - \tau)\} = {} & \tfrac{1}{2} \mathrm{E}\{a^2\} [\cos(\omega(2t - \tau)) + \cos(-\omega\tau)] \\ & + \tfrac{1}{2} \mathrm{E}\{b^2\} [-\cos(\omega(2t - \tau)) + \cos(-\omega\tau)] \\ & + \mathrm{E}\{ab\} \sin(\omega(2t - \tau)) \end{aligned} \tag{2.115} $$

where we have used well-known trigonometric identities. Clearly, the process $x(t)$ is generally nonstationary, since both its mean and autocorrelation functions depend on the time $t$.

However, if the random variables $a$ and $b$ are zero mean and uncorrelated with equal variances, so that

$$ \mathrm{E}\{a\} = \mathrm{E}\{b\} = \mathrm{E}\{ab\} = 0, \qquad \mathrm{E}\{a^2\} = \mathrm{E}\{b^2\} $$

the mean (2.114) of the process becomes zero, and its autocorrelation function (2.115) simplifies to

$$ r_x(\tau) = \mathrm{E}\{a^2\} \cos(\omega\tau) $$

which depends only on the time lag $\tau$. Hence, the process is WSS in this special case (assuming that $\mathrm{E}\{a^2\}$ is finite).

Assume now that $\{x(t)\}$ is a zero-mean WSS process. If necessary, the process can easily be made zero mean by first subtracting its mean $m_x$. It is sufficient to consider the autocorrelation function $r_x(\tau)$ of $\{x(t)\}$ only, since the autocovariance function $c_x(\tau)$ coincides with it. The autocorrelation function has certain properties that are worth noting. First, it is an even function of the time lag $\tau$:

$$ r_x(-\tau) = r_x(\tau) \tag{2.116} $$

Another property is that the autocorrelation function achieves its maximum absolute value for zero lag:
$$ -r_x(0) \leq r_x(\tau) \leq r_x(0) \tag{2.117} $$

The autocorrelation function $r_x(\tau)$ measures the correlation of random variables $x(t)$ and $x(t - \tau)$ that are $\tau$ units apart in time, and thus provides a simple measure for the dependence of these variables which is independent of the time $t$ due to the WSS property. Roughly speaking, the faster the stochastic process fluctuates with time around its mean, the more rapidly the values of the autocorrelation function $r_x(\tau)$ decrease from their maximum $r_x(0)$ as $\tau$ increases.

Using the integer notation for the samples $x(i)$ of the stochastic process, we can represent the last $m + 1$ samples of the stochastic process at time $n$ using the random vector

$$ \mathbf{x}(n) = [x(n), x(n - 1), \ldots, x(n - m)]^T \tag{2.118} $$

Assuming that the values of the autocorrelation function $r_x(0), r_x(1), \ldots, r_x(m)$ are known up to a lag of $m$ samples, the $(m + 1) \times (m + 1)$ correlation (or covariance) matrix of the process $\{x(n)\}$ is defined by

$$ \mathbf{R}_x = \begin{bmatrix} r_x(0) & r_x(1) & r_x(2) & \cdots & r_x(m) \\ r_x(1) & r_x(0) & r_x(1) & \cdots & r_x(m - 1) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_x(m) & r_x(m - 1) & r_x(m - 2) & \cdots & r_x(0) \end{bmatrix} \tag{2.119} $$

The matrix $\mathbf{R}_x$ satisfies all the properties of correlation matrices listed in Section 2.2.2. Furthermore, it is a Toeplitz matrix. This is generally defined so that on each subdiagonal and on the diagonal, all the elements of a Toeplitz matrix are the same. The Toeplitz property is helpful, for example, in solving linear equations, enabling use of faster algorithms than for more general matrices.
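The Toeplitz structure of (2.119) can be checked numerically for the WSS special case of Example 2.8, where $r_x(\tau) = \mathrm{E}\{a^2\} \cos(\omega\tau)$. The following Python sketch builds $\mathbf{R}_x$ from the autocorrelation values and compares it with an ensemble estimate of $\mathrm{E}\{\mathbf{x}(n)\mathbf{x}(n)^T\}$. The choices made here are assumptions for illustration only: $a$ and $b$ are drawn as independent zero-mean gaussian variables, and the values $\omega = 0.7$, $\mathrm{E}\{a^2\} = 2$, $m = 3$, $n = 10$ are arbitrary.

```python
import math
import random


def autocorrelation_example(omega, variance, lag):
    """r_x(lag) = E{a^2} cos(omega * lag), the WSS special case of Example 2.8."""
    return variance * math.cos(omega * lag)


def toeplitz_correlation_matrix(r):
    """Build the (m+1) x (m+1) matrix (2.119) from r = [r_x(0), ..., r_x(m)]."""
    size = len(r)
    return [[r[abs(i - j)] for j in range(size)] for i in range(size)]


def ensemble_correlation_matrix(omega, variance, m, n, num_realizations, rng):
    """Estimate E{x(n) x(n)^T} for x(n) = [x(n), ..., x(n-m)]^T by averaging
    over independent draws of the random amplitudes a and b."""
    std = math.sqrt(variance)
    acc = [[0.0] * (m + 1) for _ in range(m + 1)]
    for _ in range(num_realizations):
        a = rng.gauss(0.0, std)               # assumed gaussian, zero mean
        b = rng.gauss(0.0, std)               # independent of a, same variance
        vec = [a * math.cos(omega * (n - i)) + b * math.sin(omega * (n - i))
               for i in range(m + 1)]
        for i in range(m + 1):
            for j in range(m + 1):
                acc[i][j] += vec[i] * vec[j]
    return [[value / num_realizations for value in row] for row in acc]


omega, variance, m = 0.7, 2.0, 3
R = toeplitz_correlation_matrix(
    [autocorrelation_example(omega, variance, k) for k in range(m + 1)])
R_est = ensemble_correlation_matrix(omega, variance, m, 10, 20000, random.Random(2))

for row_exact, row_est in zip(R, R_est):
    print([round(v, 3) for v in row_exact], [round(v, 3) for v in row_est])
```

Only the $m + 1$ values $r_x(0), \ldots, r_x(m)$ are needed to fill the whole matrix, which is the structure that fast Toeplitz solvers exploit.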
Higher-order statistics of a stationary stochastic process $x(n)$ can be defined in an analogous manner. In particular, the cumulants of a zero-mean process $x(n)$ have the form [315]

$$ \begin{aligned} \mathrm{cum}_{xx}(j) &= \mathrm{E}\{x(i)\, x(i + j)\} \\ \mathrm{cum}_{xxx}(j, k) &= \mathrm{E}\{x(i)\, x(i + j)\, x(i + k)\} \\ \mathrm{cum}_{xxxx}(j, k, l) &= \mathrm{E}\{x(i)\, x(i + j)\, x(i + k)\, x(i + l)\} - \mathrm{E}\{x(i)\, x(i + j)\}\, \mathrm{E}\{x(i + k)\, x(i + l)\} \\ &\quad - \mathrm{E}\{x(i)\, x(i + k)\}\, \mathrm{E}\{x(i + j)\, x(i + l)\} - \mathrm{E}\{x(i)\, x(i + l)\}\, \mathrm{E}\{x(i + j)\, x(i + k)\} \end{aligned} \tag{2.120} $$

These definitions correspond to the formulas (2.106) given earlier for a general random vector $\mathbf{x}$. Again, the second and third cumulant are the same as the respective moments, but the fourth cumulant differs from the fourth moment $\mathrm{E}\{x(i)\, x(i + j)\, x(i + k)\, x(i + l)\}$. The second cumulant $\mathrm{cum}_{xx}(j)$ is equal to the autocorrelation $r_x(j)$ and autocovariance $c_x(j)$.

2.8.4 Time averages and ergodicity
In defining the concept of a stochastic process, we noted that at each fixed time instant $t = t_0$ the possible values $x(t_0)$ of the process constitute a random variable having some probability distribution. An important practical problem is that these distributions (which are different at different times if the process is nonstationary) are not known, at least not exactly. In fact, often all that we have is just one sample of the process corresponding to each discrete time index (since time cannot be stopped to acquire more samples). Such a sample sequence is called a realization of the stochastic process. In handling WSS processes, we need to know in most cases only the mean and autocorrelation values of the process, but even they are often unknown.

A practical way to circumvent this difficulty is to replace the usual expectations of the random variables, called ensemble averages, by long-term sample averages or time averages computed from the available single realization. Assume that this realization contains $K$ samples $x(1), x(2), \ldots, x(K)$. Applying the preceding principle, the mean of the process can be estimated using its time average

$$ \hat{m}_x(K) = \frac{1}{K} \sum_{k=1}^{K} x(k) \tag{2.121} $$

and the autocorrelation function for the lag value $l$ using

$$ \hat{r}_x(l, K) = \frac{1}{K - l} \sum_{k=1}^{K - l} x(k + l)\, x(k) \tag{2.122} $$

The accuracy of these estimates depends on the number $K$ of samples. Note also that the latter estimate is computed over the $K - l$ possible sample pairs having the lag $l$ that can be found from the sample set. The estimates (2.122) are unbiased, but if the number of pairs $K - l$ available for estimation is small, their variance can be high. Therefore, the scaling factor $K - l$ of the sum in (2.122) is often replaced by $K$ in order to reduce the variance of the estimated autocorrelation values $\hat{r}_x(l, K)$, even though the estimates then become biased [169]. As $K \to \infty$, both estimates tend toward the same value.

The stochastic process is called ergodic if the ensemble averages can be equated to the respective time averages. Roughly speaking, a random process is ergodic with respect to its mean and autocorrelation function if it is stationary. A more rigorous treatment of the topic can be found for example in [169, 353, 141].
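The time-average estimates (2.121) and (2.122) translate directly into code. The Python sketch below is a minimal illustration; the function names and the `biased` switch are choices made here, not notation from the text, and the list indices run from 0 to $K - 1$ rather than from 1 to $K$.

```python
def time_average_mean(x):
    """Estimate (2.121): arithmetic mean of the K samples of one realization."""
    return sum(x) / len(x)


def time_average_autocorrelation(x, lag, biased=False):
    """Estimate (2.122) for a nonnegative lag.

    The sum runs over the K - lag sample pairs having the given lag. With
    biased=False the sum is divided by K - lag (unbiased); with biased=True
    it is divided by K (lower variance, but biased).
    """
    K = len(x)
    if not 0 <= lag < K:
        raise ValueError("lag must satisfy 0 <= lag < K")
    total = sum(x[k + lag] * x[k] for k in range(K - lag))
    return total / (K if biased else K - lag)


x = [1.0, 2.0, 3.0, 4.0]                      # a tiny hand-checkable realization
print(time_average_mean(x))                              # 2.5
print(time_average_autocorrelation(x, 1))                # 20/3
print(time_average_autocorrelation(x, 1, biased=True))   # 5.0
```

For lag 1 the three available pairs give the sum $2 \cdot 1 + 3 \cdot 2 + 4 \cdot 3 = 20$, which is divided by $K - l = 3$ in the unbiased version and by $K = 4$ in the biased one. The difference between the two scalings shrinks as $K$ grows relative to the lag.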
For mildly nonstationary processes, one can apply the estimation formulas (2.121) and (2.122) by computing the time averages over a shorter time interval during which the process can be regarded as roughly WSS. It is important to keep this in mind. Sometimes formula (2.122) is applied in estimating the autocorrelation values without taking into account whether the process is actually stationary. The consequences can be drastic, for example, rendering eigenvectors of the correlation matrix (2.119) useless for practical purposes if ergodicity of the process is in reality a grossly invalid assumption.
2.8.5 Power spectrum
A lot of insight into a WSS stochastic process is often gained by representing it in the frequency domain. The power spectrum or spectral density of the process $x(n)$ provides such a representation. It is defined as the discrete-time Fourier transform of the autocorrelation sequence $r_x(0), r_x(1), \ldots$:

$$ S_x(\omega) = \sum_{k=-\infty}^{\infty} r_x(k) \exp(-jk\omega) \tag{2.123} $$

where $j = \sqrt{-1}$ is the imaginary unit and $\omega$ the angular frequency. The time domain representation given by the autocorrelation sequence of the process can be obtained from the power spectrum $S_x(\omega)$ by applying the inverse discrete-time Fourier transform

$$ r_x(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_x(\omega) \exp(jk\omega)\, d\omega, \qquad k = 1, 2, \ldots \tag{2.124} $$

It is easy to see that the power spectrum (2.123) is always a real-valued, even, and periodic function of the angular frequency $\omega$, with period $2\pi$. Note also that the power spectrum is a continuous function of $\omega$, while the autocorrelation sequence is discrete. In practice, the power spectrum must be estimated from a finite number of autocorrelation values. If the autocorrelation values $r_x(k) \to 0$ sufficiently quickly as the lag $k$ grows large, this provides an adequate approximation.

The power spectrum describes the frequency contents of the stochastic process, showing which frequencies are present in the process and how much power they possess. For a sinusoidal signal, the power spectrum shows a sharp peak at its oscillating frequency. Various methods for estimating power spectra are discussed thoroughly in the books [294, 241, 411].
Higher-order spectra can be defined in a similar manner to the power spectrum as Fourier transforms of higher-order statistics [319, 318]. Unlike the power spectrum, they retain information about the phase of signals, and have found many applications in describing nongaussian, nonlinear, and nonminimum-phase signals [318, 319, 315].
2.8.6 Stochastic signal models
A stochastic process whose power spectrum is constant for all frequencies $\omega$ is called white noise. Alternatively, white noise $v(n)$ can be defined as a process for which any two different samples are uncorrelated:

$$ r_v(k) = \mathrm{E}\{v(n)\, v(n - k)\} = \begin{cases} \sigma_v^2, & k = 0 \\ 0, & k = \pm 1, \pm 2, \ldots \end{cases} \tag{2.125} $$

Here $\sigma_v^2$ is the variance of the white noise. It is easy to see that the power spectrum of the white noise is $S_v(\omega) = \sigma_v^2$ for all $\omega$, and that the formula (2.125) follows from the inverse transform (2.124). The distribution of the random variable $v(n)$ forming the white noise can be any reasonable one, provided that the samples are uncorrelated at different time indices. Usually this distribution is assumed to be gaussian. The reason is that white gaussian noise is maximally random because any two uncorrelated samples are also independent. Furthermore, such a noise process cannot be modeled to yield an even simpler random process.
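The defining property (2.125) can be observed on a simulated realization. The Python sketch below generates white gaussian noise and estimates its autocorrelation at a few lags by time averaging; the standard deviation $\sigma_v = 2$, the sample count, and the seed are arbitrary assumptions made for the illustration.

```python
import random


def autocorrelation_estimate(v, lag):
    """Time-average estimate of r_v(lag) = E{v(n) v(n - lag)} over the available pairs."""
    pairs = len(v) - lag
    return sum(v[n] * v[n - lag] for n in range(lag, len(v))) / pairs


sigma_v = 2.0                                 # hypothetical noise standard deviation
rng = random.Random(3)                        # fixed seed, arbitrary choice
v = [rng.gauss(0.0, sigma_v) for _ in range(20000)]

for lag in range(4):
    print("lag", lag, ":", autocorrelation_estimate(v, lag))
```

The estimate at lag 0 should be close to $\sigma_v^2 = 4$, while the estimates at nonzero lags should be close to zero, up to fluctuations that shrink as the number of samples grows.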
Stochastic processes or time series are frequently modeled in terms of autoregressive (AR) processes. They are defined by the difference equation

$$ x(n) = -\sum_{i=1}^{M} a_i x(n - i) + v(n) \tag{2.126} $$

where $v(n)$ is a white noise process, and $a_1, \ldots, a_M$ are constant coefficients (parameters) of the AR model. The model order $M$ gives the number of previous samples on which the current value $x(n)$ of the AR process depends. The noise term $v(n)$ introduces randomness into the model; without it the AR model would be completely deterministic. The coefficients $a_1, \ldots, a_M$ of the AR model can be computed using linear techniques from autocorrelation values estimated from the available data [419, 241, 169]. Since the AR models describe fairly well many natural stochastic processes, for example, speech signals, they are used in many applications. In ICA and BSS, they can be used to model the time correlations in each source process $s_i(t)$. This sometimes greatly improves the performance of the algorithms.

Autoregressive processes are a special case of autoregressive moving average (ARMA) processes described by the difference equation

$$ x(n) + \sum_{i=1}^{M} a_i x(n - i) = v(n) + \sum_{i=1}^{N} b_i v(n - i) \tag{2.127} $$

Clearly, the AR model (2.126) is obtained from the ARMA model (2.127) when the moving average (MA) coefficients