
Description of the proposed approach based on the combination of timbral and intonative features

In the document Lise Regnier (pages 110-115)

The main point of the proposed method is to exploit the synergy offered by the two major representations of audio signals and the features we can derive from them. First, in Sec.3.1.1, we discuss the complementary aspects of the timbral and intonative features obtained with the source-filter and the sinusoidal model respectively. We also detail the reasons that make the direct combination of these features impossible. To combine the information offered by these different features, it is necessary to perform independent classifications with each feature type separately and to combine the decisions obtained. The details of the combination method are given in Sec.3.1.2.

3.1.1 Complementarity of the sound descriptions

As shown by the common representation of sound, the spectrogram, a sound is a pattern varying along two dimensions: the time and frequency axes. So, to describe a sound, one can choose to:

• describe the relative amplitudes of the frequencies at a given time, or

• describe the temporal variations of one frequency (or one band of frequencies) during a given interval of time.

These two descriptions adopt literally orthogonal points of view on the sound to be described. The first clearly corresponds to the idea developed in the source-filter model, while the second corresponds to the idea developed in the intonative model applied to partials obtained with the sinusoidal model. Features computed on the spectral envelope describe a characteristic of the sound that varies over its whole duration. Thus, to obtain a complete description of a sound, these features are extracted on frames and repeated over the entire duration of the sound.

Conversely, to be relevant, intonative features have to be computed on long portions of a signal. Vibrato parameters, obtained on the time-varying frequency of partials, have to be computed on a segment covering several cycles of the modulation. The problem is similar for tremolo parameters, computed on the time-varying amplitude of partials. As for the parameters of portamento, they have to be

1. Details on the source-filter and the sinusoidal models are given in Chap.3.

computed on a segment that covers a note transition. In the following, we compute these parameters on partials segmented using the BIC criterion, as explained in Chap.3 Sec.3.2.
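To illustrate why intonative features must be computed on long segments, the sketch below estimates a vibrato rate from a synthetic time-varying frequency track of one partial. All values (frame rate, vibrato rate and depth) are illustrative; the actual partials in the thesis come from the sinusoidal model with BIC segmentation.

```python
import math

# Synthetic f0 track of one partial: 1 s at 100 frames/s, with a
# 6 Hz vibrato of +/- 8 Hz around 440 Hz (illustrative values)
fps = 100.0
n = 100
f0 = [440.0 + 8.0 * math.sin(2 * math.pi * 6.0 * k / fps) for k in range(n)]

mean = sum(f0) / n
dev = [f - mean for f in f0]          # deviation around the mean frequency

def power(freq):
    """Spectral power of the deviation at a candidate modulation rate."""
    c = sum(d * math.cos(2 * math.pi * freq * k / fps) for k, d in enumerate(dev))
    s = sum(d * math.sin(2 * math.pi * freq * k / fps) for k, d in enumerate(dev))
    return c * c + s * s

# The estimated rate is the candidate frequency with the strongest power;
# this only resolves because the segment covers several vibrato cycles
rate = max(range(1, 21), key=power)
print(rate)  # 6
```

A segment much shorter than one modulation period would leave the candidate rates indistinguishable, which is why these global features cannot be computed frame by frame.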

By analogy with the image processing field, features computed on the spectral envelope of a given frame are referred to as local features, while features computed on the whole duration of a note (or note transition) are referred to as global features.

It is rather complex to prove formally the complementarity of these features. However, the interpretation of these features given above clearly shows that they convey non-overlapping information and that they come from orthogonal descriptions of the audio signal. We note that both types of features can only be extracted on voiced portions of a signal.

In this study we use LPC, MFCC, and TECC as timbral features. The details of the computation of these features can be found in Chap.3, Sec.3.1. Experimentally, we have chosen to use 15 LPC, 20 MFCC and 25 TECC coefficients. The aim of the proposed method is to increase singer identification performance by combining complementary information on the signal to be classified. There is no point in combining various timbral features, since they all convey the same information. The idea is therefore to combine timbral features with intonative features.

As explained in Chap.2 Sec.2, there exist different levels of combination. In this case the features cannot be combined to form a unique description of the sound, for several reasons. First of all, they have different dimensions. For a given sample, decomposed into F overlapping frames, on which P partials are extracted, the timbral features lead to a feature matrix of size N×F, while the intonative features lead to a feature matrix of size P×L, where N and L indicate the number of timbral and intonative coefficients respectively.
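The dimension mismatch can be made concrete with a minimal sketch; the sizes N, F, P and L below are made up for illustration and are not the values used in the experiments.

```python
# Hypothetical sizes for one audio sample
N, F = 20, 300   # N timbral coefficients per frame, F overlapping frames
P, L = 15, 6     # P extracted partials, L intonative coefficients per partial

timbral = [[0.0] * F for _ in range(N)]      # N x F feature matrix
intonative = [[0.0] * L for _ in range(P)]   # P x L feature matrix

# The two matrices have unrelated dimensions, so they cannot simply be
# stacked into a single description of the sample
print(len(timbral), len(timbral[0]))         # 20 300
print(len(intonative), len(intonative[0]))   # 15 6
```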

Even if it is possible to transform these two feature matrices (either by repeating or deleting information) to obtain two matrices of the same dimension, which could then be compacted into a single matrix, the information conveyed by these matrices has different meanings. As explained previously, it is more appropriate in this case to combine the decisions of classifiers trained on each feature set independently.

3.1.2 Combining decisions obtained with each sound description

As explained in Chap.2 Sec.2, there are two main schemes to combine classification decisions: sequential and parallel. In the present case, we have only two types of information to be combined.

From preliminary experiments, we assume that timbral features perform better than intonative features for this task. In this situation, parallel combination rules cannot be applied directly.

It would be necessary to learn the combination rule using the outputs of each classification as new features. This solution, however, requires a very large amount of data to avoid over-fitting problems, as explained in Sec.2.3. The remaining solution is to use a sequential combination scheme. The drawback of all sequential combination schemes is that there is no backward analysis of the decisions taken at each step: if a wrong decision is made at one stage, there is no chance to obtain a correct classification at the end.

Considering these last points, we propose a combination method that combines the advantages of the parallel and sequential combination schemes. The underlying idea of this method is the following: the performance of any classification system increases when the number of possible classes decreases.

As a consequence, the decision of a classification system can be combined efficiently with the decisions of more accurate systems if the problem given to the weaker system is simplified. By "simplified problem" we mean a problem with a smaller number of classes.

In the present case, a query sample is given as input to the classification system based on timbral features. The outputs of this system are then analyzed to retain a small number of possible classes.

Then the same query sample is given as input to the classification system based on intonative features, together with the subset of selected classes. This system returns its decisions for the subset of classes. Finally, the consensual decision is taken using classical parallel decision rules applied to the decisions of the two systems on the reduced set of classes. Here, the decision is made after two iterations because we only have two descriptions of the data. In theory, the process can be iterated as long as the last classifier does not return a single class and another system (a new description or a new classifier) is still available. If the method is iterated until only one class remains possible, it works as a decision tree. Conversely, if the class set is not reduced at each step, the method is equivalent to a classical parallel combination scheme.
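The two-stage process described above can be sketched as follows for the two-system case. The helper function, class counts and membership values are toy assumptions for illustration, not the thesis's actual implementation.

```python
def cascade_combine(memb_timbral, memb_intonative, keep):
    """Stage 1 keeps the `keep` most probable classes under the timbral
    system; stage 2 combines both systems on that subset with a sum rule."""
    # Stage 1: classes ranked by the timbral memberships
    subset = sorted(range(len(memb_timbral)),
                    key=lambda c: memb_timbral[c], reverse=True)[:keep]
    # Stage 2: renormalize each system's memberships over the subset
    s1 = sum(memb_timbral[c] for c in subset)
    s2 = sum(memb_intonative[c] for c in subset)
    # Parallel sum rule on the reduced class set
    return max(subset, key=lambda c: memb_timbral[c] / s1 + memb_intonative[c] / s2)

# Toy membership values for 5 singers
m_timbral = [0.05, 0.40, 0.35, 0.15, 0.05]
m_intonative = [0.10, 0.20, 0.45, 0.15, 0.10]
winner = cascade_combine(m_timbral, m_intonative, keep=2)
print(winner)  # 2: the subset {1, 2} is selected, and class 2 wins the sum rule
```

Note that the timbral system alone would have chosen class 1; restricting the problem to two candidates lets the weaker intonative system overturn that decision.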

We present next the general framework of the method, which is specially adapted to:

• combine classifications with different levels of performance;

• solve problems involving a large number of classes with no hierarchical organization of the data;

• combine a low number of representations (when a cascade classification cannot be carried out until only a single class remains possible).

Using the notation introduced in Sec.2 of Chap.2, we have:

• Each pattern z:

– belongs to one of the N possible classes of Ω = {ω1, . . . , ωN};
– can be described with different sets of features;
– is classified with one of the available classifiers.

• If D = {D1, . . . , DL} denotes the set of all available descriptions (features) of z and C = {C1, . . . , CM} denotes the set of available classifiers, a system of classification is given by Sk = (Dk, Ck), where Dk ∈ D and Ck ∈ C.

• During the sequential phase, the number of possible classes is reduced at each step k. Thus, we have: Ωk+1 ⊂ Ωk ⊂ Ω0 = Ω.

• For a given classification task, all systems of classification have the same original set of training samples. The training data set associated with the system Sk (i.e. which is described using Dk) is denoted by Tk, and the training set reduced to samples belonging to Ωk is denoted by T#k.

• We assume that all classifier outputs can be converted into membership measurements for each class: Mk(z) = [mk1(z), . . . , mkN(z)]. Then, for a combination involving K systems of classification and a set of classes reduced to N classes at step K, the final decision is taken by analyzing the K×N membership measurements stored in a decision profile matrix denoted by M.

Alg. 1 HybrideClassif(z, S, T, Ω, r): iterative algorithm to predict the class of pattern z, given a set of K classification systems S = {S1, . . . , SK}, a training set T and a combination rule r

Inputs: z, the training data set T, the set of possible classes Ω and the K classification systems
Output: the predicted class for z: ŵ

1  begin
2    Ω1 = Ω0
3    for k from 1 to K do
4      if |Ωk| > 1
5        Define T#k, the training set reduced to patterns from the classes in Ωk
6        Train Sk using classifier Ck and T#k
7        Compute Mk, the membership measurement vector returned by Sk
8        Deduce Ωk+1, the subset of the Nk most probable classes
9      else
10       return ŵ = Ωk
11   end    // at this stage, NK classes remain possible
12   for k from 1 to K do
13     // the highest NK values of each Mk are kept in a matrix M
14     M(1:NK, k) = Mk(1:NK)
15   end
16   ŵ = r(M)    // make the final decision using the rule r on M
17   return ŵ
18 end
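Alg. 1 can be transcribed as a runnable sketch for an arbitrary number of systems. The function signature and the callable interface for the systems are illustrative assumptions (membership values are passed in directly rather than produced by trained classifiers), not the thesis's actual code.

```python
def hybrid_classif(z, systems, classes, keep, rule=sum):
    """systems[k] is a callable (z, classes) -> membership values over
    `classes`; keep[k] is N_k, the number of classes retained after step k."""
    memberships = []                      # one {class: value} map per system
    for system, n_k in zip(systems, keep):
        if len(classes) == 1:             # a single class remains: stop early
            return classes[0]
        m = system(z, classes)
        memberships.append(dict(zip(classes, m)))
        # keep the n_k most probable classes for the next stage
        classes = sorted(classes, key=lambda c: memberships[-1][c],
                         reverse=True)[:n_k]
    # Decision profile on the surviving classes: each system's memberships
    # are renormalized over the survivors, then combined with `rule`
    def support(c):
        return rule(mk[c] / sum(mk[s] for s in classes) for mk in memberships)
    return max(classes, key=support)

# Toy example with two systems (timbral first, intonative second)
timbral = lambda z, cls: [[0.05, 0.40, 0.35, 0.15, 0.05][c] for c in cls]
intonative = lambda z, cls: [[0.10, 0.20, 0.45, 0.15, 0.10][c] for c in cls]
pred = hybrid_classif(None, [timbral, intonative], list(range(5)), keep=[2, 2])
print(pred)  # 2
```

Setting keep=[1] for the last stage would make the sketch behave as a decision tree, while keeping all classes at every stage reduces it to a purely parallel combination, matching the two limit cases discussed above.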


Figure 5.1: Scheme of the proposed method to combine K systems. [Diagram: each column block is one system S(k) = (D(k), C(k)) with description D(k) = [Ts(k), T(k)] and classifier C(k); the query z and the training set enter each system; the class set shrinks from Ω0 = {ω1, . . . , ωN} through Ω(1) ⊂ Ω(0), Ω(2) ⊂ Ω(1), . . . , down to Ω(K−1); each system outputs a membership vector Mk, and the final decision ω is taken on M = [M1; . . . ; MK].]

Based on these notations, the algorithm of the method is given in Alg.1. An illustration of the method is given in Fig.5.1, where each column block represents one system of classification. The output of a classification system is given to the next one until no system remains available. The final decision is made by analyzing the Mk of the K systems.

We now discuss the choice of the different parameters of the method in a general case and then present the set of parameters adapted to our experiments.

Choice of parameters

• Selection of the remaining classes: The set of classes selected at each step can be chosen using a pre-determined relation Nk = ck·Nk−1, where the ck are defined beforehand. The classes can also be selected using a rule of thumb on the values of Mk.

• Choice of feature space and classifier: There is no restriction on the choice of the descriptions Dk and the classifiers Ck. Thus, for i ≠ j, the combination system can be set up with Di = Dj or Ci = Cj. From our experiments, the proposed method remains accurate even if Si = Sj.

• Sequential organization of the Sk: If knowledge on the relative performances of the K systems is available, the best-performing systems should be placed at the top of the iterative process. If all systems have equivalent performances, or if the relative performances cannot be estimated, the systems that require the fewest computations can be placed at the end of the process to minimize the cost.

• Parallel combination rule: At the end of the sequential stage, NK classes remain possible. The K vectors of measurements Mk are reduced to the values of the classes in Ω(K) before being combined into a decision profile matrix M. We suggest first normalizing each column of M over the N(K) remaining classes, so that the sum of the N(K) membership values for each classifier is equal to one. Then the sum-rule is applied, for the reasons explained in [KHDM02].

In the following experiments we deal with the combination of decisions obtained with two independent descriptions. The first description is given by the timbral features (D(1) = TECC, D(1) = MFCC or D(1) = LPC) and the second by the intonative features (D(2) = INTO). The combination method is tested with 3 classifiers (SVM, GMM, kNN), chosen for their variety as explained in Chap.2 Sec.1.2. The number of classes remaining at the end of the first stage is chosen dynamically: the membership values are normalized so that their sum is equal to one, and the classes that explain 80% of the cumulative posterior probabilities are retained to form the new subset of classes, of size N(1). Once the membership measurements for the remaining classes given by the two classification systems have been normalized and concatenated to form a decision profile matrix, we apply the sum-rule to take the final decision.
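The dynamic selection of the subset size N(1) by cumulative posterior mass can be sketched as follows, with toy membership values standing in for the first system's outputs.

```python
# Toy membership values returned by the first (timbral) system
memberships = [0.05, 0.40, 0.35, 0.12, 0.08]
total = sum(memberships)
p = [m / total for m in memberships]      # normalize so the values sum to one

# Keep the smallest set of classes explaining 80% of the cumulative posterior
order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
subset, cum = [], 0.0
for i in order:
    subset.append(i)
    cum += p[i]
    if cum >= 0.80:
        break
print(subset)  # [1, 2, 3]
```

Here the two strongest classes only account for 75% of the mass, so a third class is retained; the subset size thus adapts to how peaked the first system's posterior is.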
