442 IEEE SIGNAL PROCESSING LETTERS, VOL. 9, NO. 12, DECEMBER 2002

Optimized Support Vector Machines for Nonstationary Signal Classification

Manuel Davy, Arthur Gretton, Arnaud Doucet, and Peter J. W. Rayner, Associate Member, IEEE

Abstract—This letter describes an efficient method to perform nonstationary signal classification. A support vector machine (SVM) algorithm is introduced and its parameters optimized in a principled way. Simulations demonstrate that our low-complexity method outperforms state-of-the-art nonstationary signal classification techniques.

Index Terms—Model selection, optimized time-frequency representations, signal classification, support vector machines.

I. INTRODUCTION

THE classification of nonstationary signals is a difficult and much-studied problem. On one hand, nonstationarity precludes classification in the time or frequency domain; on the other hand, nonparametric representations such as time-frequency or time-scale representations, while suited to nonstationary signals, have high dimension. Time-frequency representations (TFRs) and distance measures adapted to their comparison have previously been used to classify nonstationary signals such as underwater acoustic signals [1], speech [2], [3], and loudspeaker test signals [4] with excellent efficiency. However, the decision rules chosen in these studies limit the performance of these classification algorithms.

Support vector machines (SVMs) [5]–[7] provide efficient and powerful classification algorithms that are capable of dealing with high-dimensional input features, and for which statistical learning theory provides bounds on the generalization error and guarantees on the sparseness of the solution [5], [8].

Classifiers based on SVMs have few free parameters requiring tuning, are simple to implement, and are trained through optimization of a convex quadratic cost function, which ensures the uniqueness of the SVM solution. Furthermore, SVM-based solutions are sparse in the training data and are defined only by the most “informative” training points.

In this letter, we propose to use an SVM for binary classification of the TFRs of nonstationary signals. The main problem in both TFR and SVM methods, however, is that of tuning the parameters: TFRs are defined by the choice of a so-called TFR kernel, while SVMs depend on the choice of the SVM kernel.

Manuscript received March 5, 2002; revised July 12, 2002. This work was supported by the European project MOUMIR. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Geert Leus.

M. Davy is with the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN), BP 92101, 44321 Nantes Cedex 03, France, and also with the Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K. (e-mail: Manuel.Davy@irccyn.ec-nantes.fr).

A. Gretton and P. J. W. Rayner are with the Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K.

A. Doucet is with the Department of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Vic., 3010 Australia.

Digital Object Identifier 10.1109/LSP.2002.806070

We present a simple optimization procedure aimed at determining the optimal SVM and TFR kernel parameters. In Section II, we review support vector classifiers. In Section III, we propose a classifier implementation based on Cohen’s group TFRs, and we describe the optimization procedure. Finally, in Section IV, we compare the classification results obtained with the SVM-TFR approach to those found using other classification methods.

II. SUPPORT VECTOR CLASSIFICATION

We first describe how support vector machines may be applied to binary classification, using the ν-SV procedure. The results in this section are derived in [9] and are also described in detail in [5]–[7]. We are given a set of labeled training points

$\mathcal{T} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\}$  (1)

in which $\mathbf{x}_i \in \mathcal{X}$, where $\mathcal{X}$ is the input space, and $y_i \in \mathcal{Y} = \{-1, +1\}$, where $\mathcal{Y}$ is the label space; we assume that the random variable pairs $(\mathbf{x}_i, y_i)$ are generated independently and identically distributed (i.i.d.) according to a distribution $P(\mathbf{x}, y)$. We seek to determine a function that best predicts the label $y$ for an input $\mathbf{x}$. The optimal predicted class label for an input $\mathbf{x}$ is

$y^{\star}(\mathbf{x}) = \arg\max_{y \in \mathcal{Y}} P(y \mid \mathbf{x}).$  (2)

Since we do not know the mapping $\mathbf{x} \mapsto P(y \mid \mathbf{x})$, we define a learning algorithm

$\mathcal{T} \longmapsto f_{\mathcal{T}} \in \mathcal{F}$  (3)

within a class $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$ (here $\mathcal{Y}^{\mathcal{X}}$ refers to the set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$), which we call the hypothesis space, that is flexible enough to model a wide range of decision boundaries.

We next define a feature space $\mathcal{H}$ (which can be infinite dimensional), endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, and a mapping $\Phi$ from $\mathcal{X}$ to $\mathcal{H}$. Let us restrict $\mathcal{F}$ to functions of the form

$f(\mathbf{x}) = \operatorname{sgn}\big(g(\mathbf{x})\big).$  (4)

We can then define a function $g$ in terms of a weight vector $\mathbf{w} \in \mathcal{H}$ and an offset $b$ such that $g(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle_{\mathcal{H}} + b$; thus

$f(\mathbf{x}) = \operatorname{sgn}\big(\langle \mathbf{w}, \Phi(\mathbf{x}) \rangle_{\mathcal{H}} + b\big)$  (5)


and the problem of finding a nonlinear decision boundary in $\mathcal{X}$ has been transformed into a problem of finding the optimal hyperplane in $\mathcal{H}$ separating the two classes, where this hyperplane is parameterized by $(\mathbf{w}, b)$. The mapping $\Phi$ need never be computed explicitly; instead, we use the fact that if $\mathcal{H}$ is the reproducing kernel Hilbert space induced by a kernel $k$, then

$\langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle_{\mathcal{H}} = k(\mathbf{x}_i, \mathbf{x}_j).$  (6)

The latter requirement is met for kernels fulfilling the Mercer conditions (see [5]). These conditions are satisfied for a wide range of kernels, including Gaussian radial basis functions

$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^{2}}{2\sigma^{2}}\right).$  (7)

An estimate $g_{\mathcal{T}}$ associated with a loss function $l$ is attained by minimizing the risk

$R[g] = \int_{\mathcal{X} \times \mathcal{Y}} l\big(g(\mathbf{x}), y\big)\, dP(\mathbf{x}, y).$  (8)

We use the soft margin loss described in [10] and [11]

$l\big(g(\mathbf{x}), y\big) = \begin{cases} \rho - y\, g(\mathbf{x}) & \text{if } y\, g(\mathbf{x}) < \rho \\ 0 & \text{otherwise} \end{cases}$  (9)

which has been used successfully with support vector methods in a wide variety of classification problems [5]. In practice, (8) cannot readily be solved, as we do not usually know the distribution $P(\mathbf{x}, y)$. Minimizing the empirical risk alone does not take into account other factors, such as the complexity of the classifying function, and can therefore result in overfitting, as pointed out in [5] and [8].
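For concreteness, here is a minimal NumPy sketch of the soft margin loss in (9); the function name and the example values are illustrative, not taken from the letter.

```python
import numpy as np

def soft_margin_loss(g, y, rho=1.0):
    """Soft margin loss of (9): rho - y*g(x) when y*g(x) < rho, else 0.

    g   : array of decision-function values g(x_i)
    y   : array of labels in {-1, +1}
    rho : margin parameter (optimized jointly with w and b in the nu-SV problem)
    """
    return np.maximum(0.0, rho - y * g)

# Example: points inside or on the wrong side of the margin incur a positive loss.
g = np.array([2.0, 0.3, -1.5])
y = np.array([+1, +1, +1])
print(soft_margin_loss(g, y))  # [0.  0.7 2.5]
```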

The ν-SV solution $g_{\mathcal{T}}$ is found by minimizing the regularized risk

$R_{\mathrm{reg}}[g] = \frac{1}{2}\|\mathbf{w}\|_{\mathcal{H}}^{2} - \nu\rho + R_{\mathrm{emp}}[g]$  (10)

where we use the soft margin loss from (9) in the empirical risk

$R_{\mathrm{emp}}[g] = \frac{1}{\ell}\sum_{i=1}^{\ell} l\big(g(\mathbf{x}_i), y_i\big)$  (11)

where $\rho \geq 0$. All training points for which $\alpha_i \neq 0$ are known as support vectors; it is only these points that determine $g_{\mathcal{T}}$. The term $\nu$ in (10) is an upper bound on the fraction of training points for which $y_i\, g(\mathbf{x}_i) < \rho$ and can, thus, be thought of as representing our prior belief regarding the degree of overlap of the two classes. It is shown in [9] that the weight vector can be written $\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i y_i \Phi(\mathbf{x}_i)$, and that solving (10) is equivalent to finding

$\max_{\boldsymbol{\alpha}} \; W(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$  (12)

subject to

$0 \leq \alpha_i \leq \frac{1}{\ell}, \qquad \sum_{i=1}^{\ell}\alpha_i y_i = 0, \qquad \sum_{i=1}^{\ell}\alpha_i \geq \nu.$  (13)

Using the expansion $\mathbf{w} = \sum_{i} \alpha_i y_i \Phi(\mathbf{x}_i)$ for the weight vector and the inner product definition in (6), the estimate $g_{\mathcal{T}}$ in (5) can be rewritten

$g_{\mathcal{T}}(\mathbf{x}) = \sum_{i=1}^{\ell} \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b.$  (14)
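As an illustration of (12)–(14), the following sketch trains a ν-SV classifier with a Gaussian RBF kernel using scikit-learn's NuSVC solver; the feature vectors, the value of ν, and the kernel width are placeholders rather than the letter's settings.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)

# Placeholder training data: two Gaussian blobs standing in for the TFR features.
X_pos = rng.normal(loc=+1.0, size=(50, 10))
X_neg = rng.normal(loc=-1.0, size=(50, 10))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

sigma = 2.0                        # SVM kernel width of (7); illustrative value
clf = NuSVC(nu=0.2,                # nu: prior belief on class overlap, as in (10)
            kernel="rbf",
            gamma=1.0 / (2.0 * sigma**2))
clf.fit(X, y)

# g(x) of (14): signed distance to the separating hyperplane in feature space.
g = clf.decision_function(X[:5])
labels = clf.predict(X[:5])        # f(x) = sgn(g(x)), as in (5)
print(g, labels)
print("number of support vectors:", clf.support_.size)
```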

III. KERNEL DESIGN

A. TFRs for Signal Classification

The performance of the ν-SV classification procedure depends on the choice of a ν-SVM kernel $k$ suited to the problem at hand. In this letter, we choose $k$ based on an approach similar to the nonstationary signal classification algorithm introduced in [3] (see also [4] for an industrial application of this technique), which relies on Cohen’s group TFRs.

We write Cohen’s group time-frequency representation of a signal $x$ as $C_x(t, f; \varphi)$ (parameterized by its TFR kernel $\varphi$). Given two signals $x_i$ and $x_j$, the Gaussian radial basis function kernel of (7) then becomes

$k(x_i, x_j) = \exp\!\left(-\frac{\|\widetilde{C}_{x_i} - \widetilde{C}_{x_j}\|^{2}}{2\sigma^{2}}\right)$  (15)

where the notation $\widetilde{C}_x$ emphasizes the normalization of the TFR

$\widetilde{C}_x(t, f; \varphi) = \frac{C_x(t, f; \varphi)}{\left(\iint |C_x(t, f; \varphi)|^{2}\, dt\, df\right)^{1/2}}.$  (16)
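A minimal sketch of (15) and (16): each TFR is normalized to unit energy and the Gaussian kernel is evaluated between two signals. For simplicity the sketch uses a spectrogram as the TFR, whereas the letter uses Cohen’s group TFRs with the optimized kernel described next; the signal lengths and the kernel width are illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

def normalized_tfr(x, fs=1.0):
    """Normalized TFR of (16): the TFR divided by its L2 norm (here a spectrogram)."""
    _, _, C = spectrogram(x, fs=fs, nperseg=64, noverlap=48)
    return C / np.sqrt(np.sum(C**2))

def tfr_rbf_kernel(x_i, x_j, sigma=0.5):
    """Gaussian RBF kernel of (15) between the normalized TFRs of two signals."""
    d2 = np.sum((normalized_tfr(x_i) - normalized_tfr(x_j)) ** 2)
    return np.exp(-d2 / (2.0 * sigma**2))

# Example with two illustrative chirps.
t = np.arange(512)
x1 = np.sin(2 * np.pi * (0.05 * t + 1e-4 * t**2))
x2 = np.sin(2 * np.pi * (0.10 * t + 5e-5 * t**2))
print(tfr_rbf_kernel(x1, x2))
```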

In this formulation, the input space $\mathcal{X}$ defined in the previous section is the space of normalized TFRs [i.e., $\mathbf{x} = \widetilde{C}_x$], which depends on the choice of the TFR kernel $\varphi$. Many parametric TFR kernel shapes have been proposed in the literature, and an efficient choice is the family of radially Gaussian kernels (see [3]), defined in the ambiguity plane as

$\varphi(\xi, \tau) = \exp\!\left(-\frac{r^{2}}{2\,\sigma_{\varphi}^{2}(\psi)}\right)$  (17)

where $r = \sqrt{\xi^{2} + \tau^{2}}$ and $\psi = \arctan(\tau/\xi)$ are the polar coordinates corresponding to $(\xi, \tau)$, and the contour function is

$\sigma_{\varphi}(\psi) = c + \sum_{p=1}^{P}\big[a_p \cos(2p\psi) + b_p \sin(2p\psi)\big]$  (18)

in which $c$ is chosen such that $\sigma_{\varphi}(\psi) > 0$ for all $\psi$ (this TFR kernel should not be confused with the SVM kernel $k$). The TFR kernel parameters are then $a_p$, $b_p$, and $c$, with $p = 1, \ldots, P$; the ν-SVM kernel parameter is $\sigma$. In the following, this set of parameters is denoted $\Theta$.
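To make (17) and (18) concrete, the sketch below evaluates a radially Gaussian TFR kernel on a grid of the ambiguity plane from contour coefficients $a_p$, $b_p$, and $c$; all coefficient values and the grid extent are illustrative placeholders, not optimized values.

```python
import numpy as np

def radially_gaussian_kernel(xi, tau, a, b, c):
    """Radially Gaussian TFR kernel of (17)-(18) on a grid of the ambiguity plane.

    xi, tau : 2-D grids of Doppler/lag coordinates
    a, b    : sequences of Fourier coefficients a_p, b_p (p = 1..P) of the contour
    c       : constant term, chosen so that sigma(psi) stays strictly positive
    """
    r = np.sqrt(xi**2 + tau**2)
    psi = np.arctan2(tau, xi)
    sigma_psi = c * np.ones_like(psi)
    for p, (ap, bp) in enumerate(zip(a, b), start=1):
        sigma_psi += ap * np.cos(2 * p * psi) + bp * np.sin(2 * p * psi)
    if np.any(sigma_psi <= 0):
        raise ValueError("contour sigma(psi) must be strictly positive")
    return np.exp(-(r**2) / (2.0 * sigma_psi**2))

# Illustrative contour with P = 2 harmonics.
grid = np.linspace(-0.5, 0.5, 128)
xi, tau = np.meshgrid(grid, grid)
phi = radially_gaussian_kernel(xi, tau, a=[0.05, 0.02], b=[0.0, 0.01], c=0.2)
print(phi.shape, phi.max())
```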

B. Optimization Procedure

Let $\mathcal{T}_1$ and $\mathcal{T}_2$ be subsets of approximately equal size obtained by randomly partitioning $\mathcal{T}$ into two parts, each containing elements from both classes. There are $\ell_1^{+}$ (respectively $\ell_1^{-}$) samples in $\mathcal{T}_1$ labeled $+1$ (respectively labeled $-1$), and $\ell_2^{+}$ (respectively $\ell_2^{-}$) samples in $\mathcal{T}_2$ labeled $+1$ (respectively labeled $-1$).


In order to automatically tune $\Theta$, we introduce the optimization criterion $J(\Theta)$, which is minimized with respect to $\Theta$ via a standard optimization procedure. For a given training set $\mathcal{T}$ and given kernel parameters $\Theta$, $J(\Theta)$ is computed as follows:

Step 1) Use the set $\mathcal{T}_1$ to train the ν-SVM classifier. More precisely, the parameters $\alpha_i$ and $b$ are computed by solving the optimization problem in (12) and (13), using the learning vectors in $\mathcal{T}_1$.

Step 2) For each element in $\mathcal{T}_2$, compute the decision function $g(\mathbf{x})$ using (14) and the parameters obtained in Step 1). This results in two sets of values: $g^{+} = \{g(\mathbf{x}_i) : \mathbf{x}_i \in \mathcal{T}_2,\ y_i = +1\}$ and $g^{-} = \{g(\mathbf{x}_i) : \mathbf{x}_i \in \mathcal{T}_2,\ y_i = -1\}$.

Step 3) Compute the empirical mean $\widehat{m}^{+}$ and standard deviation $\widehat{s}^{+}$ of $g^{+}$, where

$\widehat{m}^{+} = \frac{1}{\ell_2^{+}} \sum_{\mathbf{x}_i \in \mathcal{T}_2,\, y_i = +1} g(\mathbf{x}_i), \qquad \widehat{s}^{+} = \left(\frac{1}{\ell_2^{+}} \sum_{\mathbf{x}_i \in \mathcal{T}_2,\, y_i = +1} \big[g(\mathbf{x}_i) - \widehat{m}^{+}\big]^{2}\right)^{1/2}.$  (19)

Compute $\widehat{m}^{-}$ and $\widehat{s}^{-}$, where these are defined in an analogous manner.

Step 4) Compute the criterion (introduced in [3] and applied in [4])

$J(\Theta) = \frac{1}{2}\, Q\!\left(\frac{\widehat{m}^{+}}{\widehat{s}^{+}}\right) + \frac{1}{2}\, Q\!\left(\frac{-\widehat{m}^{-}}{\widehat{s}^{-}}\right)$  (20)

where $Q(u) = \frac{1}{\sqrt{2\pi}} \int_{u}^{\infty} e^{-v^{2}/2}\, dv$ denotes the Gaussian tail function.

Steps 1)–4) are repeated several times with different random subsets $\mathcal{T}_1$ and $\mathcal{T}_2$, and the final criterion is obtained as the average of $J(\Theta)$ over all the subsets tested.

Under Gaussian assumptions over the distributions of $g^{+}$ and $g^{-}$, it can be shown [3] that $J(\Theta)$ is an estimate of the classification probability of error for the classifier implemented with parameter $\Theta$. (The Gaussian assumption is verified via a simulation study in Section IV.) In practice, $\Theta$ is selected so as to minimize this estimated probability of error using a standard numerical optimization procedure. When the optimal $\Theta$ is obtained, the ν-SVM classifier is trained over the full learning set $\mathcal{T}$ using the optimal TFR kernel and the optimal SVM kernel parameter. Classification is then performed using both this optimal SVM classifier and optimized-kernel TFRs.
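The following sketch combines Steps 1)–4), using scikit-learn's NuSVC in place of a dedicated ν-SV solver and the Gaussian error-probability estimate as reconstructed in (20); it assumes the normalized TFR feature vectors in X have already been computed for the TFR kernel parameters under test, and the number of random splits is illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import NuSVC

def criterion_J(X, y, sigma, nu=0.2, n_splits=5, seed=0):
    """Estimate of the probability of error (20) for given kernel parameters.

    X     : array (n_samples, n_features) of vectorized, normalized TFRs,
            already computed with the TFR kernel parameters under test
    y     : labels in {-1, +1}
    sigma : SVM kernel width of (15)
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        # Random split into T1 (training) and T2 (validation).
        perm = rng.permutation(len(y))
        half = len(y) // 2
        i1, i2 = perm[:half], perm[half:]

        # Step 1: train the nu-SVM on T1.
        clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
        clf.fit(X[i1], y[i1])

        # Step 2: decision-function values g(x) on T2, split by class.
        g = clf.decision_function(X[i2])
        g_pos, g_neg = g[y[i2] == +1], g[y[i2] == -1]

        # Step 3: empirical means and standard deviations of g+ and g-.
        m_pos, s_pos = g_pos.mean(), g_pos.std()
        m_neg, s_neg = g_neg.mean(), g_neg.std()

        # Step 4: Gaussian estimate of the error probability (equal priors),
        # with norm.sf playing the role of the Q function.
        scores.append(0.5 * norm.sf(m_pos / s_pos) + 0.5 * norm.sf(-m_neg / s_neg))

    # Final criterion: average over the random splits.
    return float(np.mean(scores))
```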

IV. RESULTS

We now apply the ν-SVM algorithm to the binary classification of chirp signals, and we compare our results to those obtained previously in [3] and [12].

Fig. 1. Support of the noise-free TFRs (the gray areas represent the possible instantaneous frequencies for each class).

Fig. 2. Histograms of 20 000 realizations of (left) $g^{+}$ and (right) $g^{-}$, computed with the parameters at their optimal values. The dashed line illustrates the Gaussian distribution with mean $\widehat{m}^{+}$ (respectively $\widehat{m}^{-}$) and standard deviation $\widehat{s}^{+}$ (respectively $\widehat{s}^{-}$).

The test signals are defined as the sum of two linear chirps embedded in noise

$x(t) = \sum_{k=1}^{2} A_k \sin\!\big(2\pi\big[f_k\, t + \tfrac{1}{2}\dot{f}_k\, t^{2}\big]\big) + e(t)$  (21)

where the noise samples $e(t)$ are i.i.d. and are generated by a zero-mean Gaussian process with variance $\sigma_e^{2}$. Each test signal is parameterized by the amplitudes $A_k$, the initial frequencies $f_k$, the chirp rates $\dot{f}_k$, and the noise variance $\sigma_e^{2}$. The problem consists of classifying a given signal into one of the two following classes:

• Class $c_1$: the chirp rates are drawn from a uniform distribution over a first range of values;

• Class $c_2$: the chirp rates are drawn from a uniform distribution over a second, distinct range.

The remaining signal parameters are identical in both classes. The support of the noise-free time-frequency representation for signals in each class is plotted in Fig. 1.
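For illustration, here is a sketch of a test-signal generator following the two-chirp model of (21); every numeric value (signal length, amplitudes, frequency ranges, noise variance) is a placeholder, since the letter's exact class parameters are not reproduced here.

```python
import numpy as np

def make_chirp_signal(f_dots, f0=(0.05, 0.15), amps=(1.0, 1.0),
                      n=256, noise_var=0.1, rng=None):
    """Sum of two linear chirps plus white Gaussian noise, as in (21).

    f_dots : chirp rates (f_dot_1, f_dot_2), the class-dependent parameters
    f0     : initial normalized frequencies (illustrative values)
    amps   : chirp amplitudes (illustrative values)
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n)
    x = sum(a * np.sin(2 * np.pi * (f * t + 0.5 * fd * t**2))
            for a, f, fd in zip(amps, f0, f_dots))
    return x + rng.normal(scale=np.sqrt(noise_var), size=n)

# Illustrative classes: chirp rates drawn uniformly from two disjoint ranges.
rng = np.random.default_rng(1)
x_class1 = make_chirp_signal(rng.uniform(1e-4, 2e-4, size=2), rng=rng)
x_class2 = make_chirp_signal(rng.uniform(3e-4, 4e-4, size=2), rng=rng)
print(x_class1.shape, x_class2.shape)
```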

The ν-SVM algorithm was trained using 100 signals, with an equal number of examples in each class. We specified the value of $\nu$. A radially Gaussian TFR kernel was selected, with parameters computed so as to minimize the error rate observed on the test data. The SVM kernel width $\sigma$ was also optimized. To measure the performance of the algorithm, a total of 20 000 randomly generated test signals were used, again divided equally between the two classes (note that the training signals did not form part of the test set). Fig. 2 displays the histograms of $g^{+}$ and $g^{-}$ computed from the test signals for the optimal parameters. As can be seen, the Gaussian assumption is sound.


TABLE I

ERROR RATES FOR THE CLASSIFICATION OF CHIRPS, USING THE OPTIMIZED ν-SVM METHOD AND OTHER CLASSIFIERS

Table I shows the average error over these test signals, compared with the average obtained over the same number of test signals for alternative classification methods.

We have also implemented the linear SVM kernel obtained by defining $k(x_i, x_j) = \langle \widetilde{C}_{x_i}, \widetilde{C}_{x_j} \rangle$ instead of (15) (dot product kernel). Compared with other state-of-the-art techniques, the optimized ν-SVM algorithm with Gaussian SVM kernel introduced in this letter achieves the lowest error rate, at about half the error rate of the best preexisting method. Note that the nonlinearity of the Gaussian SVM kernel improves the classification accuracy when compared with the linear SVM kernel.

V. CONCLUSION

We have presented a nonstationary signal classification technique based on an optimized support vector machine classifier. Our optimization procedure is simple, efficient, and easy to implement. The ν-SVM method is able to cope with a high-dimensional feature space, namely the TFR of the signal, and does not require any time-frequency feature extraction. Moreover, computer memory is saved, since only the support vector TFRs (those with $\alpha_i \neq 0$) are stored. On a benchmark dataset, our algorithm has been shown to outperform state-of-the-art methods that have classified real data with excellent performance [4]. Finally, we have shown that a nonlinear SVM kernel yields better performance than the linear kernel.

REFERENCES

[1] S. Abeysekera and B. Boashash, “Methods of signal classification using the images produced by the Wigner distribution,” Pattern Recognit. Lett., vol. 12, pp. 717–729, Nov. 1991.

[2] L. Atlas, J. Droppo, and J. McLaughlin, “Optimizing time-frequency distributions for automatic classification,” in Proc. SPIE, vol. 3162, 1997.

[3] M. Davy, C. Doncarli, and G. F. Boudreaux-Bartels, “Improved optimization of time-frequency based signal classifiers,” IEEE Signal Processing Lett., vol. 8, pp. 52–57, Feb. 2001.

[4] M. Davy and C. Doncarli, “A new nonstationary test procedure for improved loudspeaker fault detection,” J. Audio Eng. Soc., vol. 50, no. 6, pp. 458–469, June 2002.

[5] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.

[6] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms. Cambridge, MA: MIT Press, 2002.

[7] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[8] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[9] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett, “New support vector algorithms,” Neural Comput., vol. 12, pp. 1207–1245, 2000.

[10] K. Bennett and O. Mangasarian, “Robust linear programming discrimination of two linearly inseparable sets,” Opt. Meth. Softw., vol. 1, pp. 23–34, 1993.

[11] C. Cortes and V. Vapnik, “Support vector networks,” Mach. Learn., vol. 20, pp. 273–297, 1995.

[12] M. Davy, C. Doncarli, and J. Y. Tourneret, “Classification of chirp signals using hierarchical Bayesian learning and MCMC methods,” IEEE Trans. Signal Processing, vol. 50, pp. 377–388, Feb. 2002.
