Postgraduate Program of Systems and Computing Research Master in Systems and Computing
Investigating fuzzy methods for multilingual
speaker identification
Thales Aguiar de Lima
Natal — RN September 2, 2020
Master dissertation entitled Investigating fuzzy methods for multilingual speaker identification, submitted to the Federal University of Rio Grande do Norte as a partial requirement for obtaining the degree of Research Master in Systems and Computing.
Research area:
SIGNAL PROCESSING AND COMPUTATIONAL INTELLIGENCE
Supervisor
PhD. Márjory Cristiany da Costa Abreu
PPgSC – Postgraduate Program in Systems and Computing CCET – Center for Exact and Earth Sciences
UFRN – Federal University of Rio Grande do Norte
Lima, Thales Aguiar de.
Investigating fuzzy methods for multilingual speaker identification / Thales Aguiar de Lima. - 2020.
66f.: il.
Dissertation (Master's) - Federal University of Rio Grande do Norte, Center for Exact and Earth Sciences, Postgraduate Program in Systems and Computing. Natal, 2020.
Advisor: Márjory Cristiany da Costa Abreu.
1. Computing - Dissertation. 2. Fuzzy - Dissertation. 3. Speaker identification - Dissertation. 4. Artificial intelligence - Dissertation. 5. Biometrics - Dissertation. 6. Multilingual - Dissertation. 7. Brazilian Portuguese - Dissertation. I. Abreu, Márjory Cristiany da Costa. II. Title.
RN/UF/CCET CDU 004
Cataloguing in Publication at Source. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET
First, thanks to my family for all the support. Without it, I would not have achieved any of my life goals. Besides, I want to thank Rita de Cássia Nascimento Sousa (my beloved girlfriend) for giving me the strength and emotional support necessary to continue this work.
I thank my advisor, Márjory Cristiany da Costa Abreu, for guiding me through this work and my master’s course. Her guidance, patience, knowledge, and experience have made this work easier to complete and better than I could ever have made it alone.
Also, many thanks to Jing Wang for revising and helping me understand some Chinese results and verifying the data as well. To the CAPES Foundation for funding me during this course.
Finally, to all my childhood friends for maintaining the friendship even though I became a bit crazy with all this research.
Author: Thales Aguiar de Lima
Advisor: PhD. Márjory Cristiany da Costa Abreu
Abstract
Speech is a crucial ability for humans to interact and communicate. Speech-based technologies are becoming more popular with speech interfaces, real-time translation, and low-cost healthcare diagnosis. Besides, the use of voice for identification is an important and relevant topic. There are several ways of doing it, but most depend on the language the user speaks. However, if the idea is to create an all-inclusive and reliable system that uses speech as its input, we must take into account that people can and will speak different languages and with different accents. This research evaluates closed-set text-independent speaker identification systems in a multilingual setup, including both fuzzy and crisp models. Our experiments use three widely spoken languages: Portuguese, English, and Chinese. From the signals, we extract 13 MFCCs along with the log-energy and their respective delta and delta-delta coefficients to use as our feature vector. We adopt four classifiers: Fuzzy C-Means, Fuzzy k-Nearest Neighbours, k-Nearest Neighbours, and Support Vector Machines. Initial tests indicate that the systems have a certain robustness to multiple languages. Accuracy decreases as more languages are added; however, our investigation suggests that this impact comes from the number of classes.
Keywords: speaker identification, speaker recognition, fuzzy, signal processing, multilingual
Figure 1.1 – Human auditory system and computer acoustic model.
Figure 1.2 – Speech recognition sub-areas; the highlighted area is the scope of this paper.
Figure 2.1 – A general SpID system architecture.
Figure 2.2 – Signal representation in Time/Amplitude, Frequency/Magnitude, and Frequency/Phase of “Don’t ask me to carry an oily rag like that.”
Figure 2.3 – Representation of “Devem estar vendo as américas sob uma ótica totalmente deturpada” in Time/Amplitude, Frequency/Magnitude, and Frequency/Phase.
Figure 2.4 – Signal representation in Time/Amplitude, Frequency/Log-Magnitude, and Frequency/Phase.
Figure 2.5 – Voice activity detection steps.
Figure 2.6 – Graphical visualisation of log-Energy and ZCR of a signal.
Figure 2.7 – MFCC extraction steps.
Figure 2.8 – Example of ∆ and ∆∆ extraction.
Figure 3.1 – Original speaker distribution a, and distribution used for experiments b.
Figure 3.2 – Spectral representation of MFCC for three speakers on BP, EN, and CN.
Figure 3.3 – Effect of dropping the first cepstrum of MFCC for three speakers.
Figure 3.4 – Applying cepstral mean subtraction to MFCCs. Original features at 1st row, and reduction results below.
Figure 4.1 – FCM results for speaker recognition on BP.
Figure 4.2 – Results for Fuzzy k-Nearest Neighbours with different configurations.
Figure 4.3 – k-Nearest Neighbours accuracy for k = 2,4,6,8,12 on BP.
Figure 4.4 – Accuracy results on BP with Sigmoid and RBF kernels.
Figure 4.5 – Average accuracy over time for best configurations on BP.
Figure 4.6 – Average training and test duration in picoseconds (ps) on BP.
Figure 4.7 – SVM accuracy using a Polynomial kernel with different degrees for BP.
Figure 4.8 – Results of Linear SVM on BP+EN dataset.
Figure 4.9 – Results of Linear SVM on BP+EN+CN dataset.
Figure 4.10 – Language × Language confusion matrix using two multilingual datasets. Main diagonal represents all correct tests for stratified 3-fold combined.
Figure 4.11 – Speaker × Speaker confusion matrix using two multilingual datasets. Main diagonal represents all correct tests for stratified 3-fold combined.
Figure 4.12 – Results for randomly selecting 1/3 classes of each language.

Table 2.1 – Summary of the datasets.
Table 3.1 – DARPA-TIMIT proportions before and after under-sampling for speaker identification.
Table 3.2 – Characteristics of data used in the experiments.
Table 3.3 – Summary of decisions.
Table 4.1 – FCM accuracy (acc) and standard deviation (σ) for BP.
Table 4.2 – FKNN accuracy (acc) and standard deviation (σ) for BP.
Table 4.3 – KNN accuracy (acc) and standard deviation (σ) for BP.
Table 4.4 – SVM accuracy (acc) and standard deviation (σ) for BP.
Table 4.5 – Polynomial SVM accuracy (acc) and standard deviation (σ) for BP.
Table 4.6 – Linear SVM accuracy (acc) and standard deviation (σ) for BP+EN.
Table 4.7 – RBF SVM accuracy (acc) and standard deviation (σ) for BP+EN+CN.
ANFIS – Adaptive-Network-Based Fuzzy Inference System
ASR – Automatic Speech Recognition
BnF – Bottleneck features
BP – Brazilian Portuguese
CN – Chinese Mandarin
DNN – Deep Neural Network
DTW – Dynamic Time Warping
EN – English
FCM – Fuzzy C-Means
FFT – Fast Fourier Transform
FKNN – Fuzzy K-Nearest Neighbours
GMM – Gaussian Mixture Models
KNN – K-Nearest Neighbours
LBM16K – LapsBenchmark16k
LDC – Linguistic Data Consortium
LPC – Linear Predictive Coefficients
MFCC – Mel-Frequency Cepstrum Coefficients
SnR – Signal-to-Noise Ratio
SpID – Speaker Identification
STT – Speech-to-Text
SVM – Support Vector Machines
ZCR – Zero-Crossing Rate
List of Figures
List of Tables
Contents
1 INTRODUCTION
1.1 Motivation
1.2 Objectives
1.3 Work organisation
2 THEORETICAL FOUNDATIONS
2.1 General approach to SpID
2.2 Available datasets
2.2.1 DARPA-TIMIT
2.2.2 LapsBenchmark16k
2.2.3 AISHELL-1
2.3 Voice activity detection
2.4 Speech features
2.4.1 Mel-Frequency Cepstrum Coefficients
2.4.2 Dynamic features
2.5 Crisp models for speech
2.5.1 k-Nearest Neighbours
2.5.2 Support Vector Machines
2.6 Fuzzy modelling of speech
2.6.1 Fuzzy logic
2.6.2 Fuzzy C-Means
2.6.3 Fuzzy k-Nearest Neighbours
2.7 Chapter summary
3 IDENTIFICATION THROUGH VOICE
3.1 SpID related works
3.2 Methodology
3.2.1 Preparing the data
3.2.2 Biometric feature extraction
4 RESULTS
4.1 Fuzzy C-Means
4.2 Fuzzy k-Nearest Neighbours
4.3 k-Nearest Neighbours
4.4 Support Vector Machines
4.5 Multilingual experiments
4.6 Chapter summary
5 DISCUSSION
5.1 Unsupervised SpID
5.2 Nearest neighbours classification
5.3 SVM classification
5.4 Identification with fuzzy methods
5.5 On the multilingual SpID
6 FINAL REMARKS
6.1 Contribution for speech recognition
6.2 Future works
BIBLIOGRAPHY
APPENDIX
APPENDIX A – HARDWARE INFORMATION

1 Introduction
Speech exists mainly to enable communication between humans. This communication translates into a sequence-dependent and rule-based system that we call language. To talk with one another, humans use a complex system to produce the voice signal: air starts at the lungs, passes through the trachea, stimulates the vocal cords and the larynx tube, and is shaped by the pharynx cavity, the tongue, the velum, and the mouth and nasal cavities to finally produce sound. This procedure is detailed in (RABINER; SCHAFER, 2011, p. 5). Figure 1.1 shows a side-by-side comparison between the human auditory system and a computer acoustic model.
Figure 1.1 – Human auditory system and computer acoustic model.
Source: adapted from Rabiner and Schafer (2011).
Originally, the problem was to enable a computer to recognise and translate a voice signal into text. This area requires broad knowledge from different scientific fields to understand and emulate the human auditory system: from linguistics to speech therapy, from physics to medicine, and finally, computer science (RABINER; RABINER; JUANG, 1993, p. 2). In the authors' words, “Successful speech-recognition systems require knowledge and expertise from a wide range of disciplines [...]” (RABINER; RABINER; JUANG, 1993, p. 2).
This problem is quite a challenge since voice is a highly sensitive signal. It can easily be affected by background noises (RABINER; SCHAFER, 2011), such as cars, heaters, nature sounds, and another person's speech. Furthermore, this signal may change due to ageing (HÄMÄLÄINEN et al., 2015; CAMPBELL, 1997), misspoken words, health, and emotion (CAMPBELL, 1997). Thus, besides carrying personal information, it also presents several physiological and behavioural characteristics. The physiological part of speech is related to the sound production process, such as the trachea opening size or the pitch, while the behavioural part comes from individual experience that affects the way someone talks, like jargon or speaking speed (also known as speaking rate).
Now, the research on speech recognition is far broader and more sophisticated than before. It has branched into several sub-areas, and a portion of the field can be visualised in Figure 1.2. The translation of a voice signal into its respective textual representation is now known as Automatic Speech Recognition (ASR), or Speech-to-Text (STT). However, the term speech recognition can also refer to Text-to-Speech, Speaker Identification (SpID), and Speech Analysis or Synthesis. Therefore, it is now more than just translation. Speech recognition is currently an attempt to reproduce several human abilities, since it is possible to enable a computer to recognise a language or an identity, to create and translate speech, and to recognise several other human characteristics.
Figure 1.2 – Speech recognition sub-areas; the highlighted area is the scope of this paper.
[Diagram nodes: Speech Recognition; Recognition; Coding; Analysis/Synthesis; Speaker Recognition; Language Identification; Speech-to-Text; Speaker Detection; Speaker Verification; Speaker Identification.]
Besides reproducing the human auditory system, new challenges have come to the field. Lately, creating language-independent models has been gaining popularity (LIMA; COSTA-ABREU, 2020). The objective is to create a model that can translate a speech signal from any language, even when the model has no previous knowledge about it. This makes STT an even more challenging problem, and the challenge also extends beyond this branch, appearing in SpID, Text-to-Speech, and other sub-areas.
The original problem in ASR suffers from a vast number of interference sources: not only the previously cited ones but also variation inside the classification targets, the phonemes. These are a textual representation of a basic unit of sound, a phone. These sounds vary between speakers and even for the same person. Furthermore, a sound also changes according to the context and the phones in its neighbourhood, showing how refined and complex the auditory system is. A usual ASR system has an encoder and a decoder. The former translates the signal into a set of features, while the latter recognises them as target labels.
For speaker identification, we see the ASR field branching towards biometrics. It is not commonly used as an authentication service for most situations (SPOLAOR et al.,
authentication methods. Besides the popular fingerprint authentication (SPOLAOR et al., 2016), facial and voice verification approaches are gaining some space. But SpID shines when used for database searches in criminal records (LINDH, 2017). For instance, ALIZE is a popular framework for these systems (BONASTRE; WILS; MEIGNIER, 2005).
Therefore, this sub-field of speech research focuses on identity recognition. Campbell (1997) defines speaker identification as “deciding if a speaker is a specific person or is among a group of persons.”
Moreover, it is possible to specify the classification even further. According to the audio source, the problem is open-set when the speaker of the target audio may not be registered in the system. On the other hand, when the subject is always known, the problem is closed-set. A good description of such a system is given by Hu and Damper (2005): identifying the owner of an input utterance among people known to the system. Finally, these systems can be further specified based on Reynolds (1995). When the contents of the recording are known and used in favour of classification, we say it is a text-dependent problem. In contrast, the problem is text-independent when there is no previous knowledge about the contents of the target recording. In our work, we explore closed-set text-independent speaker identification systems.
The choices described above have the greatest impact on system security. In a closed-set identification system, any feature vector given as input will be classified as one of the speakers known to the system, which leaves it open to spoofing. This is a crucial vulnerability and must be taken into consideration. It is important to understand that our objective is limited to the restrictions placed above. Even though we are aware of possible system vulnerabilities, this work is more concerned with investigating how languages affect the system's performance and how it behaves under these circumstances.
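The difference between the two decision rules can be illustrated with a minimal sketch. The scores, the threshold value, and the function names below are hypothetical and chosen only for illustration; they are not taken from the experiments in this work.

```python
import numpy as np

def closed_set_decision(scores):
    """Always return the index of the best-scoring enrolled speaker."""
    return int(np.argmax(scores))

def open_set_decision(scores, threshold):
    """Return the best-scoring speaker, or None when no score reaches the threshold."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None

# Hypothetical similarity scores of an unregistered voice against three enrolled speakers.
impostor_scores = np.array([0.12, 0.31, 0.27])

print(closed_set_decision(impostor_scores))      # 1
print(open_set_decision(impostor_scores, 0.8))   # None
```

Even with uniformly low scores, the closed-set rule still names an enrolled speaker, which is the vulnerability discussed above; the open-set rule can reject the input, at the cost of having to choose a threshold.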
The literature presents a wide range of experiments on this subject. Classification methods range from established ones such as Dynamic Time Warping (DTW), Support Vector Machines (SVM), and GMM (DEVIKA; SUMITHRA; DEEPIKA, 2014) to more recent DNN models (CHUNG; NAGRANI; ZISSERMAN, 2018; LI et al., 2017; TONG; GARNER; BOURLARD, 2017). Likewise, features range from MFCC and LPC (DEVIKA; SUMITHRA; DEEPIKA, 2014) to those obtained by using models as feature extractors, such as BnF, i-vectors, and x-vectors (MCLAREN; MANDASARI; LEEUWEN, 2012).
When it comes to cross-lingual environments, there seems to be a consensus in the research community that speaker identification is language-dependent (NAGARAJA; JAYANNA, 2012; NAGARAJA; JAYANNA, 2013a); that is, the systems are affected when a subject speaks more than one language, or when the data come from different languages. The main argument is that the vocal cords vibrate differently when the speaker changes language, creating a dependency. However, these works tend to cover closely related languages, mostly from Asia (CHUNG; NAGRANI; ZISSERMAN, 2018). This widens the gap for under-resourced languages such as BP (EBERHARD; SIMONS; FENNIG, 2018) and gives only a narrow view of how these systems behave in a more global environment. Another limitation of the work in this area concerns fuzzy methods (DEVIKA; SUMITHRA; DEEPIKA, 2014; PANDEY et al., 2010). In addition to the research being scarce, the authors barely present their hyperparameter set-up or any modification of the data to prevent biased results.
In this work, we explore some language-independent models for SpID, as highlighted in Figure 1.2, with an emphasis on methods that have received little or no attention in the literature, trying to identify a new path for future investigations. Additionally, we perform our experiments using three languages: English (EN), Brazilian Portuguese (BP), and Chinese Mandarin (CN).
Our experiments support that SpID has a certain robustness when dealing with multiple languages. With our method, adding a third language has a negative impact on the accuracy of the system. However, we also take into consideration possible influences coming from gender and the number of classes. The three languages have not only distinctive structures but also distant ancestors, which suggests that our method carries more information about the user than about the language spoken.
1.1 Motivation
The increasing popularity of speech interfaces, of health diagnosis based on speech technologies, and of speech biometrics is the main motivation to explore this research area. Besides, from the literature presented above, we identified that Brazilian Portuguese is almost absent from research on this topic. Since it is a low-resource language and one of the most spoken in the world, investigating the behaviour of these technologies for it is worthwhile. Furthermore, employing languages with distinct structures and origins can expand the research on SpID, since the literature seems to restrict itself to closely related languages, or to different accents, in multilingual systems. Therefore, using English, Chinese Mandarin, and Brazilian Portuguese, we expect to provide solid results and a rich discussion of these languages from a computational point of view. When it comes to fuzzy systems, there is another gap in this area when compared to the most popular methods. Most of the research on this matter is poorly explained: the details are usually obscure, leading to doubts about the processes and thus about the results. Here, we decided to explore some of these methods and give a thorough description of their hyperparameters. Finally, the “language-independent” SpID systems have caught our attention. By creating a more distinct language combination, we want to investigate how SpID systems behave under these conditions.
1.2 Objectives
In order of priority, from top to bottom:
• Expand the research using Brazilian Portuguese;
• Explore cross and multilingual SpID technologies on a more generic approach, using languages with distinct structures and origins;
• Verify SpID robustness on multilingual environments;
• Investigate the viability of Fuzzy methods on speaker identification systems.
1.3 Work organisation
This work is organised as follows:
Chapter 2 — Theoretical foundations presents information on our data, followed by a general introduction to the feature extraction methods used in this work. Additionally, this chapter gives a brief description of the models adopted in this thesis.
Chapter 3 — Identification through voice expands the discussion of the state of the art and describes our methods. This includes the dataset modifications, how each feature is extracted, and the parameters used to fine-tune each model.
Chapter 4 — Results is where we apply the methodology. The results cover the monolingual performance of each model, then the accuracy of the best model in multilingual experiments, and finally a deeper investigation of the impact of gender and number of classes on these results.
Chapter 5 — Discussion presents a substantial discussion of our results with respect to recognition accuracy, gender misclassification, and the impact of the number of classes on the recognition rate.
Chapter 6 — Final Remarks gives a general overview of our findings, as well as our contribution to the speech recognition field. Then, we highlight a few ideas to further improve our work.
2 Theoretical foundations
This work aims to explore the cross-lingual speaker identification area. We will present three datasets, each for one language: English (EN), Brazilian Portuguese (BP), and Standard Chinese (CN). Therefore, we expect to explore how SpID systems behave under multilingual conditions and further explore fuzzy methods.
Here, we present the datasets used in this work, the main feature extraction methods, and a general description of the classifiers employed in our identification problems. This chapter is limited to a brief formal introduction to our methods, while parameters and validation techniques are explored later in this work.
2.1 General approach to SpID
In this section, we give a brief introduction to the general structure of speaker identification systems. A detailed description can be found in (CAMPBELL, 1997) and (REYNOLDS, 1995). An SpID system has four main steps: data acquisition, feature extraction, model training, and identification. All these steps are presented in Figure 2.1, where solid lines represent the enrolment process and dashed lines represent the decision process. To start, it is necessary to collect voice data from a certain number of participants. This number may vary with the task; for identification, it is desirable to have a large number (over 100) to stress the system (Section 2.2).
Figure 2.1 – A general SpID system architecture.
Then, the data goes through the feature extraction module, or front-end (Sections 2.3 and 2.4). Here, the most significant characteristics of the voice signal are extracted to create a lower-dimensional representation of it. As explained in Chapter 1, voice carries several characteristics. However, for speaker identification, we want to preserve the ones that “uniquely” describe the owner. For that, several approaches can be found in the literature (Chapter 1 and Section 3.1). Overall, the optimal features must have high interspeaker variability and low intraspeaker variability. The former guarantees the discrimination between users, while the latter ensures the features do not vary too much for a single person (CAMPBELL, 1997). Also, in this stage, the data can pass through a number of pre-processing techniques to improve further stages, for example, silence removal, speaking rate normalisation, and cepstral mean normalisation.
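One way to make the interspeaker/intraspeaker criterion concrete is a per-dimension variance ratio. The sketch below is only an illustration under our own assumptions: the function name and the toy data are hypothetical, and this ratio is not a measure used in the experiments of this work.

```python
import numpy as np

def variability_ratio(features, speakers):
    """Per-dimension ratio of between-speaker variance to mean within-speaker variance."""
    features = np.asarray(features, dtype=float)
    speakers = np.asarray(speakers)
    ids = np.unique(speakers)
    # Mean feature vector of each speaker, and the variance of each speaker's own data.
    means = np.stack([features[speakers == s].mean(axis=0) for s in ids])
    within = np.stack([features[speakers == s].var(axis=0) for s in ids]).mean(axis=0)
    between = means.var(axis=0)
    return between / within

# Toy data: dimension 0 separates the two speakers, dimension 1 does not.
features = [[0.0, 0.0], [0.2, 1.0], [1.0, 0.1], [1.2, 0.9]]
speakers = ["a", "a", "b", "b"]
ratio = variability_ratio(features, speakers)
print(ratio[0] > ratio[1])  # True
```

A large ratio indicates a dimension that changes a lot between speakers but little within one speaker, which is the property described above.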
Next, a mathematical model is created from the features to register the users. This process is also called enrolment (Sections 2.5 and 2.6). The model must be capable of distinguishing between users, which can be done by creating voiceprints (for generative models) or by data separation (for descriptive models). However, for simplicity, we will focus on the voiceprint strategy. At this stage, the system is defined as closed-set or open-set. In the latter, background models are created to model any unknown (not registered) speaker. If the former is adopted, these extra models are unnecessary. This is also the final step in creating the system.
Finally, the decision process consists of determining who is the owner of a given voice signal. Here, the development has two lines to follow: verification and identification. The verification process has both the claimed identity and the voice signal as inputs, while identification has only the signal. This is a crucial step and can strongly affect the system behaviour. In verification, the signal is translated into a feature vector (voiceprint), which is compared only to the model of the claimed identity (and, if open-set, also to the universal speaker model). In identification, there is no claimed identity, and the feature vector of the unknown speaker must be compared to every voiceprint. Therefore, verification is not affected by the number of speakers, while identification is. In this work, we take the path of closed-set text-independent speaker identification; thus, the “Who is talking?” decision. The following sections describe each step presented here, except for the decision making, which has a dedicated chapter (Chapter 4).
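The contrast between the two decision lines can be sketched as follows. This is a minimal illustration, assuming that each voiceprint is a single vector and that cosine similarity is the comparison; the speaker names, vectors, and threshold are hypothetical and do not correspond to the models used in this work.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(voiceprints, claimed_id, features, threshold=0.9):
    """Verification: a single comparison, against the claimed identity only."""
    return cosine(voiceprints[claimed_id], features) >= threshold

def identify(voiceprints, features):
    """Closed-set identification: one comparison per enrolled speaker."""
    return max(voiceprints, key=lambda spk: cosine(voiceprints[spk], features))

# Hypothetical enrolled voiceprints and one unknown feature vector.
voiceprints = {
    "spk_a": np.array([1.0, 0.0, 0.0]),
    "spk_b": np.array([0.0, 1.0, 0.0]),
    "spk_c": np.array([0.0, 0.0, 1.0]),
}
unknown = np.array([0.1, 0.9, 0.2])

print(identify(voiceprints, unknown))          # spk_b
print(verify(voiceprints, "spk_b", unknown))   # True
print(verify(voiceprints, "spk_a", unknown))   # False
```

The cost of `verify` does not depend on how many speakers are enrolled, while `identify` loops over all of them, which mirrors the remark that only identification is affected by the number of speakers.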
2.2 Available datasets
In this work, we have decided to explore three distinct datasets in three different languages: DARPA-TIMIT, LapsBenchmark16k, and AISHELL-1. They were chosen because their audio is recorded at the same sampling rate. As presented in Chapter 1, another motivation for using them is that their languages have distinct structures and do not descend from a common ancestor language. Of course, there are several multilingual NIST Speaker Recognition Evaluation datasets, for instance, the Call My Net Corpus (JONES et al., 2017). However, they are distributed by the Linguistic Data Consortium (LDC)¹, which requires a subscription fee that is outside the budget of this research. As presented in subsequent sections and chapters, a significant amount of our effort goes into preparing the data. This problem could be drastically mitigated with access to a multilingual dataset ready for SpID and STT problems.
Each dataset is described in more detail in its respective reference. Therefore, we limit ourselves to a brief description of their main characteristics and how they are organised.
2.2.1 DARPA-TIMIT
DARPA-TIMIT is a free version of TIMIT, which otherwise has to be purchased from the LDC. This free version is available at Kaggle (2019). It has 630 speakers (438 male and 192 female), each of whom recorded 10 phonetically rich prompts, resulting in 6300 samples. The recordings last 2.9 s ± 0.8 s; thus, each speaker has around 30 s of audio. Besides being phonetically rich, this dataset also comprises 8 dialects from different locations: New England, Northern, North Midland, South Midland, Southern, New York City, Western, and Army Brat. These characteristics make it one of the most complete datasets for EN. The authors recorded the audio at 16 kHz and 16 bit. The prompts of this corpus are scripted, and every speaker recorded a distinct set of sentences with a small intersection between them.
Figure 2.2 – Signal representation in Time/Amplitude, Frequency/Magnitude, and Frequency/Phase of “Don’t ask me to carry an oily rag like that.”
[Figure panels: (a) amplitude over time (0.0 s to 3.5 s); (b) magnitude (energy) over frequency (0 to 8000 Hz); (c) phase (radians) over frequency (0 to 8000 Hz); (d) magnitude (dB) over frequency (0 to 8000 Hz).]
Source: DR4_FBAS0_SA2 from DARPA-TIMIT (KAGGLE, 2019).
The files in this dataset are in the NIST format. Therefore, we converted them to standard WAV before executing the feature extraction methods. For that, we used the program recommended by LDC. Figure 2.2 illustrates one sample from this dataset in different representations.

¹ “The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories.” <https://www.ldc.upenn.edu/about>
2.2.2 LapsBenchmark16k
LapsBenchmark16k (LBM16K) is an open speech dataset for BP provided by the FalaBrasil research group (FALABRASIL, 2018) at the Federal University of Pará, through the research conducted by Neto (2011). This corpus contains a total of 35 speakers (25 male and 10 female), with 20 recordings per speaker and around 700 unique phrases. The recordings last approximately 4.62 s ± 0.78 s; thus, around 1m33s per participant. Each prompt was recorded at 16 kHz and 16 bit, making it consistent with DARPA-TIMIT. The complete sentence set is available in (NETO, 2011). Even though its sentences are scripted, they are not the same for each speaker. Also, they are not translations of the EN or CN sentences. Figure 2.3 shows an example from this dataset in different representations.
Figure 2.3 – Representation of “Devem estar vendo as américas sob uma ótica totalmente deturpada” in Time/Amplitude, Frequency/Magnitude, and Frequency/Phase.
[Figure panels: (a) amplitude over time (0.0 s to 4.6 s); (b) magnitude (energy) over frequency (0 to 8000 Hz); (c) phase (radians) over frequency (0 to 8000 Hz); (d) magnitude (dB) over frequency (0 to 8000 Hz).]
Source: M034_BM_0662 from LBM16K (FALABRASIL, 2018).
2.2.3 AISHELL-1
AISHELL-1 (BU et al., 2017) is an open Chinese speech dataset available at the Open Speech and Language Repository². It has a total of 400 participants (188 male and 212 female), with at least four distinct accents. Each audio was recorded at 16 kHz with 16-bit samples.
The sentences in AISHELL-1 are not scripted but have common topics. They include words about “Finance”, “Sports”, “Science and Technology”, “Entertainments”, “News”, and others. Figure 2.4 illustrates one of them in different representations. Each speaker has a different set of sentences, with at least 300 phrases per participant, each lasting 4.57 s ± 1.3 s. The authors already split this dataset into training, development, and test sets, with 340, 40, and 20 speakers, respectively. Moreover, as with LBM16K, the Chinese samples are not direct translations of the EN or BP ones.
Figure 2.4 – Signal representation in Time/Amplitude, Frequency/Log-Magnitude, and Frequency/Phase.
[Figure panels: (a) amplitude over time (0.0 s to 4.2 s); (b) magnitude (energy) over frequency (0 to 8000 Hz); (c) phase (radians) over frequency (0 to 8000 Hz); (d) magnitude (dB) over frequency (0 to 8000 Hz).]
Source: S0724_BAC009S0724 from AISHELL-1 (BU et al., 2017).
Table 2.1 summarises the main characteristics of the datasets. The gender distributions of BP and EN are skewed towards male speakers, whereas CN is close to balanced. However, since our goal is to investigate cross-lingual speech technologies, what matters is that the overall gender distribution is almost even. The number of recordings in each dataset is also quite different, but these characteristics are balanced by under-sampling or reducing the database when needed. Furthermore, the most important characteristic for us is the sampling rate, which ensures that all signals have the same number of samples per second.
Table 2.1 – Summary of the datasets.
Dataset            #Recordings  #Speakers  Gender (M/F)  Sampling rate  Language  Scripted
DARPA-TIMIT        6,300        630        70%/30%       16 kHz/16 bit  EN        Yes
LapsBenchmark16k   700          35         72%/28%       16 kHz/16 bit  BP        Yes
AISHELL-1          141,200      400        47%/53%       16 kHz/16 bit  CN        No
Total              148,200      1,065      48%/52%       —              —         —
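The totals in Table 2.1 can be checked with a few lines of arithmetic. The sketch below assumes that the overall gender split is weighted by the number of recordings rather than by the number of speakers; this is our reading, since it reproduces the 48%/52% figure from the per-dataset rows, and it is not stated explicitly in the text.

```python
# (recordings, speakers, male share) for each dataset, as listed in Table 2.1.
datasets = {
    "DARPA-TIMIT":      (6_300,   630, 0.70),
    "LapsBenchmark16k": (700,     35,  0.72),
    "AISHELL-1":        (141_200, 400, 0.47),
}

total_recordings = sum(r for r, _, _ in datasets.values())
total_speakers = sum(s for _, s, _ in datasets.values())
# Assumption: the male share is weighted by recordings, not by speakers.
male_share = sum(r * m for r, _, m in datasets.values()) / total_recordings

print(total_recordings, total_speakers, round(100 * male_share))  # 148200 1065 48
```

Because AISHELL-1 contributes most of the recordings, its nearly balanced split dominates the total under this weighting.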
This section introduced the data adopted for this work and a portion of the preparation for executing the experiments. Next, the feature extraction methods are introduced, while the configurations used in the experiments are left to later chapters.
2.3
Voice activity detection
Discriminating silence from voiced intervals is an important module of SpID systems. It helps to label where speech starts and stops; that is, the start word and stop word of a sample. It also improves the recognition rate of some biometric systems, since silence is generally the same for everyone. There are several approaches to it, but here we use a short-time analysis of speech to estimate the logarithm energy (dB) and zero-crossing rate (ZCR) of each sample. By splitting the signal into small frames of size L, indexed by r and shifted by R samples, followed by a transformation T(·) on each sample as well as the application of a window w, we obtain a new signal representation throughout time. Usually, a Hamming window is used for energy and a rectangular window for ZCR.
The short-time energy of a finite signal s is defined in Equation (2.1a), while Equation (2.1b) is its zero-peak dB energy. Similarly, Equation (2.1c) is a short-time representation for ZCR. Thus, the energy of a signal is represented by the sum of squared amplitudes, while a zero crossing occurs when successive samples have different algebraic signs.
\[ E_r = \sum_{m=0}^{L-1} \left( s[rR+m]\, w[m] \right)^2 \tag{2.1a} \]
\[ \hat{E}_r = 10\log_{10} E_r - \max_r \left( 10\log_{10} E_r \right) \tag{2.1b} \]
\[ \hat{Z}_r = \frac{R}{2L} \sum_{m=0}^{L-1} \left| \operatorname{sgn}(s[rR+m]) - \operatorname{sgn}(s[rR+m-1]) \right| \tag{2.1c} \]
An interesting fact is that “[...] voiced speech should be characterized by relatively high energy and low zero-crossing rate [...]” (RABINER; SCHAFER, 2011, p. 262). Thus, detecting speech in a signal is as simple as finding frames with high energy and a low zero-crossing rate; the remaining frames are treated as silence. The process of the algorithm is depicted in Figure 2.5, where the thresholds are obtained from the initial 100ms of the signal, passed through a median filter with 3 samples of context. This initial portion of the signal is known to have no speech, as it represents the delay for the person to react and begin speaking (RABINER; SCHAFER, 2011, p. 592).
The algorithm employed in this work is based on the algorithm in (RABINER; SCHAFER, 2011, p. 593). However, instead of using upper and lower thresholds for each representation, we use only one. Consequently, there is no optimisation using context frames. These modifications do not have a large influence on the detection capacity for our data (Figure 2.6).
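The single-threshold detection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the 10 dB energy margin, the use of the leading-frame means as thresholds, and the choice to median-filter the two contours are all hypothetical.

```python
import numpy as np

def frame_signal(s, L, R):
    """Split s into frames of L samples taken every R samples."""
    n_frames = 1 + (len(s) - L) // R
    idx = np.arange(L)[None, :] + R * np.arange(n_frames)[:, None]
    return s[idx]

def log_energy(s, L, R):
    """Equations (2.1a) and (2.1b): zero-peak log energy, Hamming window."""
    frames = frame_signal(s, L, R) * np.hamming(L)
    energy = np.sum(frames ** 2, axis=1)
    db = 10.0 * np.log10(energy + 1e-12)  # small offset avoids log(0)
    return db - db.max()

def zero_crossing_rate(s, L, R):
    """Equation (2.1c): zero crossings per R samples, rectangular window."""
    padded = np.concatenate(([s[0]], s))      # supplies s[n-1] for n = 0
    signs = np.sign(padded)
    changes = np.abs(signs[1:] - signs[:-1])  # |sgn(s[n]) - sgn(s[n-1])|
    return (R / (2.0 * L)) * frame_signal(changes, L, R).sum(axis=1)

def median3(x):
    """Median filter with one frame of context on each side."""
    p = np.pad(x, 1, mode="edge")
    return np.median(np.stack([p[:-2], p[1:-1], p[2:]]), axis=0)

def detect_speech(s, fs, frame_ms=25, stride_ms=10, margin_db=10.0):
    """Return one boolean per frame: True where speech is detected."""
    L = fs * frame_ms // 1000
    R = fs * stride_ms // 1000
    e = median3(log_energy(s, L, R))
    z = median3(zero_crossing_rate(s, L, R))
    lead = max(1, 100 // stride_ms)        # frames starting in the first 100 ms
    e_thr = e[:lead].mean() + margin_db    # assumed rule: noise floor plus margin
    z_thr = z[:lead].mean()                # assumed rule: mean ZCR of the noise
    return (e > e_thr) & (z < z_thr)
```

With a recording whose first 100ms contain only background noise, `detect_speech(s, 16000)` returns a mask that can be used to keep the frames marked as speech. A recording that starts with digital silence (all zeros) would need a different ZCR threshold rule.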
2.4
Speech features
Figure 2.5 – Voice activity detection steps.
Figure 2.6 – Graphical visualisation of log-Energy and ZCR of a signal. Panels over time (s): amplitude, energy (dB), ZCR, and amplitude.
This section introduces the feature extraction methods used in this work. Here we address the processes superficially; details on settings and where each feature is applied are left to later chapters. The next sections explain what each feature is and how it is obtained. The characteristics in Sections 2.4.1 and 2.4.2 were obtained through an implementation by the author, available at GitHub³. However, keep in mind that these implementations may change over time.
2.4.1
Mel-Frequency Cepstrum Coefficients
These features tend to preserve mostly physiological characteristics of speakers (such as glottal amplitude and trachea opening), but can also contain behavioural characteristics. They are widely used for a large number of speech tasks such as phoneme recognition (GROSSINHO et al., 2016; KANNADAGULI; BHAT, 2015), language identification (GUNAWAN; HUSAIN; KARTIWI, 2017), and speaker identification (NAGARAJA; JAYANNA, 2013b; MCLAREN; MANDASARI; LEEUWEN, 2012; LI et al., 2017; SHARMA; SHUKLA; MISHRA, 2014; TONG; GARNER; BOURLARD, 2017). Besides their raw use, some works modify them, creating unique speech representations: specifically, fuzzy transformations with Information Set Theory (ANAND et al., 2017) and dimensionality reduction (KHANUM; FIROS, 2017).
Figure 2.7 – MFCC extraction steps.
To obtain these features, it is necessary to follow the steps shown in Figure 2.7. The first block is not crucial but can be useful in cases where silence may affect the model performance. First, we split the signal into overlapping frames, and an N-point Discrete Fourier Transform is performed on each frame, resulting in X_m[k]. Then, R triangular filters group the values into critical bands. Afterwards, by normalising the power spectrum by the sum of weighted frames, we obtain the Mel-spectrum.
Finally, we take the log of the Mel-spectrum and apply a Discrete Cosine Transform to the resulting frames to get R Mel-Frequency Cepstrum Coefficients (MFCC). Usually, the first 13 cepstrums are used for speech tasks, because most of the signal energy resides in the initial coefficients. Moreover, the first cepstrum can be ignored: since the majority of the energy resides there, it is not a distinctive feature. A final step that can be applied is the cepstral mean subtraction. This last procedure can remove some channel variations from signals, making them more uniform.
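The extraction steps above can be sketched with NumPy alone. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the default values are the ones reported later in Section 3.2.2, the Mel mapping 2595·log10(1 + f/700) is one common convention, and normalising each band by the sum of its filter weights is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low, f_high):
    """Triangular filters equally spaced on the Mel scale (rows = filters)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high),
                                  n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, first n_out coefficients."""
    n = x.shape[1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc(s, fs, frame_ms=25, stride_ms=10, n_fft=512, n_filters=40,
         n_ceps=13, f_low=300.0, f_high=3700.0):
    L = fs * frame_ms // 1000
    R = fs * stride_ms // 1000
    n_frames = 1 + (len(s) - L) // R
    idx = np.arange(L)[None, :] + R * np.arange(n_frames)[:, None]
    frames = s[idx] * np.hamming(L)                  # overlapping frames
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    fb = mel_filterbank(n_filters, n_fft, fs, f_low, f_high)
    mel_spec = (power @ fb.T) / fb.sum(axis=1)       # assumed normalisation
    log_mel = np.log(mel_spec + 1e-12)               # offset avoids log(0)
    return dct_ii(log_mel, n_ceps)                   # (frames, n_ceps)
```

Calling `mfcc(signal, 16000)` on a one-dimensional NumPy array returns one row of 13 coefficients per frame; dropping the first column and subtracting the per-recording mean would correspond to the optional last steps mentioned above.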
2.4.2
Dynamic features
Along with the spectral energy-based features, it is also possible to use dynamic features. They can represent the direction and acceleration of a characteristic. The Delta (∆) and double Delta (∆∆) features are very similar to the first and second derivatives of a function. Since MFCC is a discrete representation, the ∆ and ∆∆ are computed by
\[ \Delta_t = \frac{\sum_{n=1}^{N} n \left( s_{t+n} - s_{t-n} \right)}{2 \sum_{n=1}^{N} n^2} \tag{2.2} \]
where n is a smoothing factor, t is the current frame index, N is the number of neighbouring frames considered on each side, and s is the frame itself. The smoothing gives a better representation of the movement by leaving an interval between the compared frames. This gap is necessary because using consecutive frames may leave us with a continuous line for some highly variable signals.
Figure 2.8 – Example of ∆ and ∆∆ extraction. Axes: frames and features.
The ∆ is obtained by feeding Equation 2.2 with the particular features. Then, giving ∆ as input to Equation 2.2 outputs ∆∆. Both procedures are illustrated in Figure 2.8. For each dimension of the feature, there is a related delta. So, a 13-dimensional MFCC feature vector has 13 ∆ and 13 ∆∆ values associated. Similar to the static features, their dynamic representations also have variants. In particular, the robustness of ∆ under high Signal-to-Noise Ratios (SNR) has been studied (BANSAL; IMAM, 2018).
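As an illustration of Equation 2.2, the sketch below computes ∆ for a matrix of per-frame features and reuses the same function to obtain ∆∆. The function name, the default N = 2, and the repetition of edge frames at the borders are assumptions made for this example.

```python
import numpy as np

def delta(features, N=2):
    """Equation (2.2) applied to a (frames, dimensions) feature matrix.

    Edge frames are repeated so that s[t - n] and s[t + n] always exist
    (an assumption; other border strategies are possible).
    """
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros(features.shape, dtype=float)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + T]    # s[t + n]
        behind = padded[N - n : N - n + T]   # s[t - n]
        out += n * (ahead - behind)
    return out / denom

# Hypothetical usage: static features plus their first and second dynamics.
static = np.random.default_rng(0).standard_normal((100, 13))
d = delta(static)       # Delta
dd = delta(d)           # double Delta, Equation (2.2) applied to Delta
stacked = np.hstack([static, d, dd])   # 39 values per frame
```

For a feature that grows linearly over time, the interior frames have a constant ∆ equal to the slope and a ∆∆ of zero, which matches the derivative analogy in the text.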
2.5
Crisp models for speech
This section provides a brief description of the commonly used methods for speaker identification. Just as in the previous sections, a general introduction to the classifiers used in this work is presented here. Their configurations, however, are described in the following chapters and sections.
2.5.1
k-Nearest Neighbours
k-Nearest Neighbours (KNN) is a supervised learning algorithm. It is easy to understand and simple to implement. The idea is to associate a given data point x_t with its k nearest points according to a similarity metric (COVER; HART, 1967), and to classify it from their labels.
Two main parameters can affect its accuracy: the number of neighbours and the similarity metric. The latter can be, for instance, the Manhattan, Euclidean, or Minkowski metric. Only the Minkowski metric needs to be implemented, since it is equivalent to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1, as given by Equation 2.3.
\[ D(r, s) = \left( \sum_{i=1}^{n} |r_i - s_i|^p \right)^{1/p} \tag{2.3} \]
However, these are not the only possible similarities. For time-dependent series, it is also typical to use Dynamic Time Warping (DTW).
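A minimal sketch of Equation 2.3 and of a majority-vote KNN built on it is given below. The function names and the majority-vote rule are illustrative assumptions; the configuration actually used in the experiments is described in later chapters.

```python
import numpy as np
from collections import Counter

def minkowski(r, s, p=2):
    """Equation (2.3): Manhattan for p = 1, Euclidean for p = 2."""
    return np.sum(np.abs(r - s) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, x, k=3, p=2):
    """Label x with the most frequent class among its k nearest points."""
    dists = np.array([minkowski(row, x, p) for row in X_train])
    nearest = np.argsort(dists)[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]
```

For example, `minkowski(np.array([0.0, 0.0]), np.array([3.0, 4.0]), p=2)` gives the Euclidean distance between the two points, while `p=1` gives their Manhattan distance.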
2.5.2
Support Vector Machines
The Support Vector Machine (SVM) is a method that “maps the input vectors into some high dimensional feature space Z through some non-linear mapping chosen a priori” (CORTES; VAPNIK, 1995, p. 274). In this space, a hyperplane separates the samples. The distance between the decision boundaries and the points is ruled by the C parameter, which controls how the model behaves when presented with a new sample. An improper value for this parameter can make the model learn the training samples so well that it cannot cope with variations in new data and fails to classify unknown samples correctly; that is, the model overfits. In turn, γ controls how much a single sample can influence the hyperplane. This method can use different kernels to model its high-dimensional space. The most popular are Sigmoid, Radial-Basis, Linear, and Polynomial.
2.6
Fuzzy modelling of speech
This section explains, in short, some learning methods based on fuzzy logic. The chapter for the specific problem gives a more in-depth explanation about the configuration of these methods.
2.6.1
Fuzzy logic
Fuzzy logic is an attempt to better represent the imprecision and vagueness of natural language, that is, to better mimic human thinking as a many-valued logic (NGUYEN; WALKER; WALKER, 2006, p. 1–2). While in classical logic the possible outputs are True or False, here a wide range of outputs is possible. This way, it can represent more complex data with little loss of information.
Some basic concepts of fuzzy logic are necessary to understand how it is applied to create more generic artificial intelligence methods. The first concept is the fuzzy set. Following the idea of range representation, the pertinence of an element to a fuzzy set is given by A : U → [0,1]. Therefore, a fuzzy set is simply a function whose co-domain is [0,1] instead of being restricted to {0,1} (NGUYEN; WALKER; WALKER, 2006, p. 3). These functions are called membership functions, and they compute the membership degree of a given element, that is, its “compatibility value” rather than its “truth value” (NGUYEN; WALKER; WALKER, 2006, p. 4). Then, we say that A : U → [0,1] is the membership function of the linguistic variable A; hence, µ_A(u) is the membership degree of u ∈ U in A.
This section briefly introduced the motivation for fuzzy logic and the underlying concepts necessary to understand this work. Next, we describe the fuzzy classifiers adopted in this work. The following methods use an implementation by the authors, available at GitHub⁴.
2.6.2
Fuzzy C-Means
The Fuzzy C-Means (FCM) is an algorithm developed by Dunn (1973) and further improved by Bezdek, Ehrlich and Full (1984). It is closely related to k-Means, as the name suggests. FCM is a more generic version of k-Means, since each sample may belong to every cluster to some degree. This algorithm works by minimising the weighted distance of each sample to every cluster centre. The weights correspond to the partition, or membership, matrix U = [u_ij]. Thus, given a fuzziness factor m, a sample x_i has a membership degree of u_ij for cluster j.
In this method, every cluster works as a fuzzy subset. Those subsets have a fuzziness degree m > 1, which affects the membership degrees of samples evaluated by Equation (2.4).
\[ u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{\frac{2}{m-1}}} \tag{2.4} \]
Therefore, the classification is much like that of a crisp model. However, instead of assigning a single label, the model returns a membership degree for each cluster c. Since, for instance, cluster 0 may not relate to class 0, it is necessary to label each cluster with a specific class. Here, we simply assign the most frequent class to each cluster. Finally, to defuzzify the membership degrees we use
\[ \hat{y} = \arg\max v \tag{2.5} \]
This equation assigns to ŷ the class with the maximum degree of membership for sample x.
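The membership computation of Equation (2.4), the cluster labelling, and the defuzzification of Equation (2.5) can be sketched as below. This is only an illustration: the cluster centres are assumed to be already available (the iterative centre update of FCM is not shown), the function names are hypothetical, and assigning each training sample to its highest-membership cluster before counting classes is an assumption.

```python
import numpy as np

def fcm_memberships(X, centres, m=2.0):
    """Equation (2.4): U[i, j] is the membership of sample i in cluster j."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                        # guards against zero distance
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)               # sum over clusters k

def label_clusters(U, y, n_classes):
    """Give each cluster the most frequent class among its samples."""
    hard = U.argmax(axis=1)                      # assumed crisp assignment
    labels = np.zeros(U.shape[1], dtype=int)
    for j in range(U.shape[1]):
        members = y[hard == j]
        if members.size:
            labels[j] = np.bincount(members, minlength=n_classes).argmax()
    return labels

def fcm_predict(X, centres, cluster_labels, m=2.0):
    """Equation (2.5): pick the class of the highest-membership cluster."""
    U = fcm_memberships(X, centres, m)
    return cluster_labels[U.argmax(axis=1)]
```

Each row of the matrix returned by `fcm_memberships` sums to one, so a sample that lies between two centres receives comparable degrees for both, while a sample close to one centre receives a degree near one for it.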
2.6.3
Fuzzy k-Nearest Neighbours
Another fuzzification of crisp methods is the Fuzzy k-Nearest Neighbours (FKNN). In this dissertation, this model is based on the work of Keller, Gray and Givens (1985).
First, the model creates a membership matrix U[i, j] from the labelled data. There are several methods for this. First, one can assign complete membership to the class to which the vector belongs. It is also possible to generate a centre for each class and calculate the distance of x_j to each centre, which is a k-Means strategy, and then assign to u_ij the normalised inverse of these distances. Finally, Keller, Gray and Givens (1985) suggest a more adaptive initialisation using the KNN itself, with
\[ u_{ij}(x) = \begin{cases} 0.51 + (n_j / L) \cdot 0.49 & \text{if } i = j \\ (n_j / L) \cdot 0.49 & \text{if } i \neq j \end{cases} \tag{2.6} \]
where i is the labelled class of x, n_j is the number of neighbours of x that belong to class j, and L is the total number of neighbours. It is important to note that L can be different from K. Thus, the memberships are initialised proportionally to the presence of each class. This method substantially increases the “training” duration but can provide better results, as presented in their work.
The classification for this method develops just as in its crisp version. The idea is to find the K closest points to an unknown sample x_0 and use Equation (2.7) to give a fuzzified prediction.
\[ u_i(x_0) = \frac{\sum_{j=1}^{K} u_{ij} \, \|x_0 - x_j\|^{-2/(m-1)}}{\sum_{j=1}^{K} \|x_0 - x_j\|^{-2/(m-1)}} \tag{2.7} \]
That is, the membership of the unknown vector x_0 in class i is given by u_i(x_0). The compatibility of neighbour j with class i (u_ij) is weighted by the inverse of the distance between x_0 and that neighbour, the sum of these weights over the K neighbours normalises the membership, and m controls the impact of the weighted distances. Furthermore, notice that u is a vector with as many elements as the number of classes. So, it is necessary to “defuzzify” it again, just as with FCM. For that, we employ Equation (2.5).
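The KNN-based initialisation and the fuzzified prediction can be sketched as follows. The code follows the inverse-distance weighting of Keller, Gray and Givens (1985); the function names, the exclusion of a sample from its own neighbourhood, and the default values of L, K and m are assumptions made for illustration.

```python
import numpy as np

def init_memberships(X, y, n_classes, L=3):
    """Equation (2.6): U[i, j] is the membership of training sample j in class i."""
    U = np.zeros((n_classes, len(X)))
    for j in range(len(X)):
        d = np.linalg.norm(X - X[j], axis=1)
        d[j] = np.inf                       # assumed: not its own neighbour
        neighbours = np.argsort(d)[:L]
        counts = np.bincount(y[neighbours], minlength=n_classes)
        U[:, j] = 0.49 * counts / L
        U[y[j], j] += 0.51                  # extra weight for the labelled class
    return U

def fknn_predict(X, U, x0, K=3, m=2.0):
    """Equation (2.7) followed by the arg max of Equation (2.5)."""
    d = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(d)[:K]
    w = 1.0 / np.fmax(d[nearest], 1e-12) ** (2.0 / (m - 1.0))
    u = (U[:, nearest] * w).sum(axis=1) / w.sum()
    return u, int(np.argmax(u))
```

`fknn_predict` returns both the membership vector and the defuzzified label, so the degrees can still be inspected when the decision between two speakers is close.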
2.7
Chapter summary
This chapter briefly introduced some concepts used in our work. We began with an overview of a general SpID system architecture in Section 2.1. Then, we described our data in Section 2.2: the three datasets DARPA-TIMIT, LBM16K, and AISHELL-1, which we later combine into a multilingual dataset. These datasets were picked because they are public and contain languages from different ancestors, thus having distinct structures. Next, we presented the voice activity detection approach, which is based on a combination of energy and zero-crossing rate to find voiced frames of speech. Then, Section 2.4 introduced our feature extraction methods: first the MFCC, then its derivatives, and the energy of a signal. As presented in Chapter 1, these features are still popular in the literature, as they can represent several characteristics of a speaker.
Finally, we gave a brief introduction to our models in Sections 2.5 and 2.6. There, we began with the traditional methods, KNN and SVM, followed by the fuzzy methods, FCM and FKNN. Further chapters will introduce the hyperparameter settings and proposed models.
3 Identification through voice
This chapter describes our methods. As explained in Chapter 1, our problem consists of closed-set text-independent speaker identification for EN, BP, and CN. Here, our objective is to analyse how the speaker identification system is affected by adding speakers from different languages, including BP. The following section will expand our description and discussions about the related works given in Chapter 1. Then we describe how to prepare the data, extract the biometric features, and set-up the systems for the experiments in Section 3.2.
3.1
SpID related works
The literature presents a wide range of experiments for this biometry. Devika, Sumithra and Deepika (2014) performed speaker identification in a cross-lingual condition. The authors use MFCC as input features, with 40 filter banks, 64ms frame duration and 12.5ms of overlapping. Their dataset consists of 54 speakers recorded at 8kHz with 16bit resolution. The paper proposes a combination of GMM with FCM that achieves good results. Also using a GMM, Basu et al. (2015) apply an ensemble of classifiers to NTIMIT and NISIS along with a novel Principal Component Analysis. Instead of reducing the dimensionality of the data, their analysis removes the correlation between features (MFCC in this case). Ding and Shi (2017), in turn, used the GMM as a speaker identification step in their voice-controlled robot. In another work, a modified KNN with the Mahalanobis distance achieves 94% accuracy for 21 speakers (KACUR, 2016). Thus, those models still have their place in speaker identification.
Several works report the degradation of accuracy for speaker identification systems in cross-lingual environments (NAGARAJA; JAYANNA, 2012; NAGARAJA; JAYANNA, 2013a). These authors have addressed the problem of multilingual speaker identification in a distinct way: their subjects speak the same sentences in different languages. The work of Nagaraja and Jayanna (2013a) achieved good results for three set-ups: mono, cross and multilingual speaker identification. However, the authors do not give a further description of their self-collected database, which leaves room to assume that their model, a Vector Quantisation, has overfitted for a codebook size of 256 and 30 speakers. In their previous work, they obtain similar results for both mono and cross-lingual experiments (NAGARAJA; JAYANNA, 2012). Another way to approach this problem is to use open-set speaker identification. This makes it possible to train a model on language A and test it on unknown speakers in languages other than A (CASANOVA et al., 2020).
speaker identification. Shahin (2015) investigates the performance of such systems under shouted, or loud, conditions. The results showed that the performance significantly degrades under that condition. The author proposes the use of a Third Order Suprasegmental Hidden Markov Model for these situations. Interest in the behaviour of these systems under multilingual conditions is not new; in fact, “There has often been speculation about the effect of language on speaker recognition performance, and particularly of the effect of speakers switching language between training and test” (PRZYBOCKI; MARTIN; LE, 2007, p. 1956). The reason is that the features carry information about the physiological conditions of speech (PRZYBOCKI; MARTIN; LE, 2007), that is, how the vocal cords behave to produce a sound (phoneme), which changes when the person speaks a different language.
There are attempts to increase the robustness of these systems. The collection procedure for VoxCeleb2 is explained in (CHUNG; NAGRANI; ZISSERMAN, 2018). It is a multilingual audio-visual dataset built with open data. Its resources cover over 2,400 hours of speech from more than 6,000 speakers. It includes several real-world problems, such as overlapping voices, background noise, music, and other noises. Besides, it covers more than 145 nationalities. However, they are not well distributed, the language of some recordings is unknown, and BP is not listed. The authors evaluate the data using a ResNet-50 with a spectrum obtained from 25ms frames with 10ms of overlapping, achieving 96.05% accuracy. Another dataset for cross-lingual speaker identification is the “Call My Net Corpus”. The data contains 220 speakers, with 10 samples for each. However, the authors cover only four Asian languages, and the data is not publicly available.
Recent works in the literature mainly employ Deep Neural Networks (DNN). The authors in (CHUNG; NAGRANI; ZISSERMAN, 2018; LI et al., 2017; MATEJKA et al., 2016; TONG; GARNER; BOURLARD, 2017; CAI; CHEN; LI, 2018; ANAND et al., 2019; CHEN; WU, 2020) have achieved results for cross-lingual and multilingual environments that are comparable to or higher than those of previous works. In general, they feed the DNN with a spectrum and extract Bottleneck Features to train a speaker identification model. However, the state of the art for speaker identification is x-vectors (SNYDER et al., 2018), which surpassed the previous i-vector models (MCLAREN; MANDASARI; LEEUWEN, 2012; CAI; CHEN; LI, 2018; CHEUK et al., 2019). Also, these features can be further augmented to create more distinguishable representations, such as the Latent Space features (CHEUK et al., 2019). Furthermore, neural methods present a high variety. Some unique models have been proposed, such as the end-to-end classifier in (CAI; CHEN; LI, 2018). Their work proposes a CNN with different layers that can perform all steps of speaker identification (feature extraction, modelling, and classification) on VoxCeleb data. Also, Cheuk et al. (2019) use a Triplet Neural Network to transform the feature space. These methods are very interesting, as they can perform classification from the raw signal, but, as shown by some works, they can easily underperform with a small amount of data (ANAND et al., 2019). Even so, some effort has been put into making these methods more reliable in that situation (ANAND et al., 2019).
There is also an effort from the research community on fuzzy classification. There are hybrid models composed of FCM and GMM (DEVIKA; SUMITHRA; DEEPIKA, 2014), and an investigation of ANFIS for speaker identification (PANDEY et al., 2010). The former is an identification system that achieves almost 100% accuracy, while the latter obtained 83.32% on a cross-lingual setup. However, the research with ANFIS did not describe how the dataset was used for training, validation, or test. Besides, it does not explain the model configuration used to perform the experiments. While these works explored fuzzy classification, others used this multivalued logic to create unique sound representations (ANAND et al., 2017) and to reduce the dimension of feature sets (KHANUM; FIROS, 2017). Furthermore, some simple fuzzy rule-based systems also show up in the literature (RATHOR; JADON, 2017) with decent results, even though the authors leave some important steps out of the discussion, such as how the rules are created and how many features were used. In addition to these works, the FCM is also used for tests under high SNR (BANSAL; IMAM, 2018) and for Japanese speaker identification (SINGH; SINGH, 2020). Therefore, the FCM holds a prominent place in the literature.
Still, there is a lack of research on short-time speaker identification in multilingual environments. Using short utterances is known to degrade performance (FATIMA; ZHENG, 2012). Some works have investigated methods to face this problem using phonetic-based features, SVM, and Joint Factor Analysis (FATIMA; ZHENG, 2012). Most research defines a recording as short-time when its duration is 5 seconds or less (FATIMA; ZHENG, 2012), while others consider “very short duration training and test conditions, namely 10 s each” (PRZYBOCKI; MARTIN; LE, 2007, p. 1952). Our work falls into this short-duration case.
However, most of these works have a small dataset or do not describe how the dataset was split or which statistical tests were performed. Also, not much research has been done on fuzzy models, even though they have provided decent results. Moreover, only (CASANOVA et al., 2020) has considered BP in their tests. The low occurrence of this language is due to its lack of resources for creating speech technologies. Those problems and a thorough discussion of the topic can be found in our previous work (LIMA; COSTA-ABREU, 2020).
Casanova et al. (2020) have addressed closed-set and open-set text-independent speaker identification, making it the most similar work to ours. The authors use MFCC as input features to several Neural Network methods. These features are extracted from a self-collected Brazilian Portuguese dataset with 40 male speakers. Each speaker has a single session to record the phone /a/ and a sentence with 149 words. This resulted in recordings of 42s to 95s in duration. These audios are then split into 5s windows with 1s of striding, resulting in 2,394 utterances. Further, closed-set experiments are performed with GMM, NN, a Dense Fully Connected Neural Network, and a Convolutional Neural Network. They test a multilingual setup for open-set only, where the accuracy suffers differently for English, Spanish, and Chinese. However, keep in mind that our work focuses on closed-set speaker identification, which differs from their multilingual setup. Furthermore, here we want to use fuzzy models, while they employ several network structures for different speaker tasks.
Next, we introduce our methodology for speaker identification. There, we describe how we prepare the data for our experiments and our feature extraction setup.
3.2
Methodology
This section will present the organisation of our data to perform SpID experiments. It will explain the features employed in this dissertation, as well as describe the information about the parameters used for every extraction method. All processes of feature extraction, signal processing, training and classification procedures were executed on a system with the configuration described in Appendix A.
3.2.1
Preparing the data
The first step of our work was to make sure all the datasets were as similar as possible in their organisation. Two main characteristics kept from the data were gender and number of speakers. Since LBM16K had the least number of classes, both EN and CN had to be adjusted to obtain a more fair experimental setup.
The DARPA-TIMIT had to be reduced to execute speaker identification tests. After this process, this resource ended up with 34 speakers (24 male and 10 female) for the EN language. For that, we randomly selected speakers from the training part while respecting the proportions on the original dataset, as presented in Table 3.1. The resulting ratios may not be the same as the original data; however, they are close enough. Therefore, the number of speakers is consistent with LBM16K individuals.
Table 3.1 – DARPA-TIMIT proportions before and after under-sampling for speaker identification
Reg (DR)   Original                             Reduced
           #Male      #Female    Total          #Male     #Female   Total
1          31 (63%)   18 (37%)   49 (8%)        2 (66%)   1 (33%)   3 (8.8%)
2          71 (70%)   31 (30%)   102 (16%)      4 (80%)   1 (20%)   5 (15%)
3          79 (77%)   23 (23%)   102 (16%)      3 (60%)   2 (40%)   5 (15%)
4          69 (69%)   31 (31%)   100 (16%)      4 (80%)   1 (20%)   5 (15%)
5          62 (63%)   36 (37%)   98 (15%)       3 (60%)   2 (40%)   5 (15%)
6          30 (65%)   16 (35%)   46 (7%)        2 (66%)   1 (33%)   3 (8.8%)
7          74 (74%)   26 (26%)   100 (16%)      4 (80%)   1 (20%)   5 (15%)
8          22 (67%)   11 (33%)   33 (5%)        2 (66%)   1 (33%)   3 (8.8%)
Total      438 (70%)  192 (30%)  630 (100%)     24 (71%)  10 (29%)  34 (100%)
The LBM16K provides 20 recordings per speaker, 10 more than DARPA-TIMIT. For a better comparison, under-sampling was necessary to match the number of samples per EN speaker. Therefore, we selected 10 samples from each speaker with a Roulette algorithm, resulting in recordings 1, 2, 4, 7, 8, 12, 13, 14, 16, and 19 (considering the files in alphabetical order). So, there are 35 (25 male and 10 female) BP speakers with 10 samples each, resulting in 350 audios.
Finally, AISHELL-1 is also under-sampled to be consistent with both LBM16K and DARPA-TIMIT. However, this is more straightforward, since the authors made the development data balanced with respect to regions, gender, etc. Besides, this portion has 40 CN participants, which is close enough to the number of BP and EN speakers. So, we employed this portion of the data for the speaker identification experiments, carefully selecting 10 samples from each class by the same algorithm employed for BP.
As a result of the preparation, the dataset distribution ends up as shown in Figure 3.1b, which is much better balanced than the original version shown in Figure 3.1a. The gender distribution is biased towards male speakers before the preparation, as shown in Table 2.1; whereas after the procedure, we have 55% male and 45% female speakers.
Figure 3.1 – Original speaker distribution (a), and distribution used for experiments (b). Both panels show the male (m) and female (f) speakers of CN, EN, and BP.
First, we test the models on the LBM16K dataset; then the DARPA-TIMIT speakers are added; and finally, the data is augmented with the CN speakers from AISHELL-1. However, to reduce the number of tests and for better visualisation, BP+EN and BP+EN+CN are tested only with the classifier that performed best on the BP data alone. With this experimental setup, it is possible to assess how the model acts when new languages are added. We expect the accuracy to decrease as the number of languages grows if SpID is not a language-independent problem, and to increase otherwise. It is crucial to note that, this way, we do not get information concerning the order in which languages are added. An even broader approach would be to use every combination of languages: from individual languages, to combinations of two, and finally the full multilingual data.
Also, it is crucial to pay attention to the growing number of classes as new languages are added. To assess this problem, we perform 30 experiments with around 34 classes in total, drawn from the three languages. For that, we randomly select 1/3 of the data from each language (considering the distribution in Figure 3.1b), this time ignoring restrictions on dialect, gender, etc. In the end, the results for this procedure should be lower than the monolingual outcomes; otherwise, the accuracy reduction for cross-lingual data is caused not by language but by the number of classes.
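The repeated random selection described above could be organised as in the sketch below. The speaker identifiers are placeholders, the function name is hypothetical, and taking the integer part of one third of each language is an assumption; only the speaker counts (34 EN, 35 BP, 40 CN) and the 30 repetitions come from the text.

```python
import random

def sample_speakers(speakers_by_lang, seed=None):
    """Draw one third of the speakers of every language, without replacement."""
    rng = random.Random(seed)
    chosen = []
    for lang in sorted(speakers_by_lang):
        speakers = speakers_by_lang[lang]
        chosen.extend(rng.sample(speakers, len(speakers) // 3))
    return chosen

# Placeholder identifiers; real speaker IDs would come from the datasets.
speakers_by_lang = {
    "EN": [f"EN{i:02d}" for i in range(34)],
    "BP": [f"BP{i:02d}" for i in range(35)],
    "CN": [f"CN{i:02d}" for i in range(40)],
}

# One class subset per experiment, each with its own seed for repeatability.
runs = [sample_speakers(speakers_by_lang, seed=run) for run in range(30)]
```

Fixing one seed per run keeps each of the 30 subsets reproducible, so that different classifiers can be compared on exactly the same speakers.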
In short, the experiments are run on a reduced version of all data to prevent any bias towards a specific language. A total of 109 speakers are used, with 10 utterances each. Therefore, our data comprises 1,090 samples. Table 3.2 presents details of the processed dataset. Next, we describe and justify the parameters used for extracting the features.
Table 3.2 – Characteristics of data used in the experiments.
Dataset #Size #Speakers Gender (M/F) Lang
DARPA-TIMIT 340 34 70%/30% EN
LapsBenchmark16k 350 35 72%/28% BP
AISHELL-1 400 40 30%/70% CN
Total 1,090 109 55%/45% —
3.2.2
Biometric feature extraction
To prepare the data for the speaker identification task, we first removed any silence segments from the recordings, using the algorithm explained in Section 2.3. This procedure can reduce the computational effort of the FFT, since it may create a shorter signal and avoids computations over zero-valued samples. Furthermore, the information contained in silent parts has no crucial impact on distinguishing speakers (HU; DAMPER, 2005; MARCINIAK; KRZYKOWSKA; WEYCHAN, 2012; REYNOLDS, 1995; SAHOO; PATRA, 2014). As pointed out by these works, silence may degrade the accuracy of some models, such as GMM, since they will end up modelling the silence, which is the same for everyone. More specifically, “For text-independent speaker recognition, it is important to remove silence/noise frames from both the training and testing signal to avoid modelling and detecting the environment rather than the speaker.” (REYNOLDS, 1995, p. 94).
After removing the silence from every sample, we extracted 12 MFCC from each recording, keeping cepstrums 1 to 12. We used frames of 25ms length with 10ms stride, a Hamming window function, and a 512-point FFT. Besides, we used 40 triangular filters spanning from 300Hz to 3700Hz. Then, by computing the ∆ and ∆∆ and appending the logarithm energy, we create a 39-dimensional feature vector. A cepstral mean subtraction is applied to the data before training/testing. There are studies about the effect of the window function on speaker identification for mono and cross-lingual experiments (NAGARAJA; JAYANNA, 2013b), but no research was found for multilingual conditions. The resulting features are shown in Figure 3.2 in their spectral form. There, the energy is represented as a heat-map, where warmer colours show points of energy concentration. Also, achieving a more similar pattern between languages is desirable, since we do not want the model to classify the language, but the speakers.
Figure 3.2 – Spectral representation of MFCC for three speakers on BP, EN, and CN. Panels: (a) Brazilian, (b) English, (c) Chinese, with five recordings each; axes are frame index and MFCC index.
Source: F004–0062, 0065, 0067, 0068, 0069, FJSP0–SA1, SA2, SI1434, SI1763, SI804, and S074–W0121, W0131, W0165, W0171, W0195, for respectively Brazilian (FALABRASIL,2018), English (KAGGLE,2019), and Chinese (BU et al.,2017).
The 512-point FFT follows from the frame length. The sampling rate is $f_s = 16$ kHz, which, by the Nyquist theorem, allows frequencies up to $f_s / 2 = 8$ kHz to be restored from the signal. The number of samples $L$ in each frame of $K$ ms is

$$L = \frac{K}{1000} f_s = 0.025 \cdot 16000 = 400$$

Thus, we can use the next power of 2 that is greater than the frame size $L$, which is $512 = 2^9$. When it comes to the filter design, we chose to pass everything between 300 Hz and 3700 Hz, as this is the interval where speech generally resides. Frequencies outside this range are attenuated on a logarithmic scale. For the number of filters, the most important consideration is to use more filters than the number of cepstrums adopted in the feature vector; here we used 40. Finally, from the 40 resulting cepstrums we keep the first 13.
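The arithmetic above can be checked with a short Python sketch, which also shows one way to place the edges of 40 triangular filters between 300 Hz and 3700 Hz. The mel conversion used here, 2595 log10(1 + f/700), is a common convention and an assumption, since the text does not state which variant was adopted. All function names are hypothetical.

```python
import math


def frame_length(frame_ms, sample_rate):
    """Number of samples in a frame of frame_ms milliseconds."""
    return int(round(frame_ms / 1000 * sample_rate))


def next_power_of_two(n):
    """Smallest power of two that is not smaller than n."""
    size = 1
    while size < n:
        size *= 2
    return size


def hz_to_mel(hz):
    # Assumed mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(n_filters, low_hz, high_hz):
    """Edge frequencies (Hz), equally spaced on the mel scale.

    n_filters triangular filters need n_filters + 2 edges.
    """
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]


samples = frame_length(25, 16000)
fft_size = next_power_of_two(samples)
edges = filter_edges(40, 300.0, 3700.0)
print(samples, fft_size, len(edges))
```

Spacing the edges evenly on the mel scale makes the filters narrow at low frequencies and wider at high frequencies, which is the usual motivation for mel filter banks.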
The BP recordings have most of their energy concentrated in the first cepstrum, while for the other languages the energy is more evenly distributed along the spectrum (Figure 3.2). This behaviour is not due to any characteristic of BP but to the recording quality, which has lower intensity and more noise than the AISHELL-1 and DARPA-TIMIT datasets, as well as to the intrinsic tendency of MFCCs to preserve more energy in the initial cepstrums. To prevent any bias from it, we ignore the first cepstrum and apply cepstral mean normalisation. The effect of ignoring this cepstrum is shown in Figure 3.3, while Figure 3.4 presents the resulting spectrum after subtraction.
Figure 3.3 – Effect of dropping the first cepstrum of MFCC for three speakers. Panels: (a) Brazilian, (b) English, (c) Chinese. Axes: Frame, MFCC.
Source: F004–0074, DR1–FJSP0–SA1, BAC009S0760W0132 samples from LBM16K (FALABRASIL, 2018), DARPA-TIMIT (KAGGLE, 2019), AISHELL-1 (BU et al., 2017).
The visual effect of ignoring the first cepstrum, as shown in Figure 3.3, is stronger for BP, while not much changes for EN and CN. Most importantly, however, it removes a possible bias coming from BP speakers, due to their very distinct representation.
Then, applying cepstral mean subtraction removes possible channel variations from both the feature extraction and the recording process. The result of this operation is illustrated in Figure 3.4, which compares the features before and after the subtraction. Most of the extraction effort went into the MFCCs, since both ∆ and ∆∆ are derived from them, while the total energy of a signal does not offer much variability. The subtraction prevents bias from channel variability when recordings come from different sources, which is our situation, where “if training and testing speech are collected from different microphones or channels (e.g., different handsets and/or lines in telephone applications), this is a crucial step for achieving good recognition accuracy” (REYNOLDS, 1995, p. 95).
Figure 3.4 – Applying cepstral mean subtraction to MFCCs. Original features at 1st row, and reduction results below. Panels: (a) Brazilian, (b) English, (c) Chinese. Axes: Frame, MFCC.
Source: F004–0074, DR1–FJSP0–SA1, BAC009S0760W0132 samples from LBM16K (FALABRASIL, 2018), DARPA-TIMIT (KAGGLE, 2019), AISHELL-1 (BU et al., 2017).
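The two operations discussed in this section, dropping the first cepstrum and subtracting the cepstral mean, can be sketched in a few lines of Python. This is a minimal illustration that assumes the mean is computed per coefficient over all frames of a single recording. The function names and the toy values are hypothetical.

```python
def drop_first_cepstrum(frames):
    """Discard the first coefficient of every frame."""
    return [row[1:] for row in frames]


def cepstral_mean_subtraction(frames):
    """Subtract, per coefficient, the mean over all frames of a recording."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(row[k] for row in frames) / n_frames for k in range(n_coeffs)]
    return [[row[k] - means[k] for k in range(n_coeffs)] for row in frames]


# Toy recording: 2 frames with 3 coefficients each (values are made up).
frames = [[10.0, 1.0, 2.0],
          [12.0, 3.0, 6.0]]
kept = drop_first_cepstrum(frames)
normalised = cepstral_mean_subtraction(kept)
print(normalised)
```

After the subtraction every coefficient has zero mean over the recording, so a constant offset introduced by a microphone or channel is removed.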
3.2.3 Speaker identification systems setup
The description below presents the set of configurations for each parameter used during fine-tuning. This procedure was performed with a grid-search implementation available on GitHub¹, using those hyperparameters and a stratified 3-fold cross-validation. We chose a stratified version because this is an identification problem; therefore, the classes (speakers) are already known to the system. Moreover, using 3 folds ensures that a larger number of recordings per speaker is available to train our models, again keeping all speakers known to the system. Furthermore, FCM and FKNN were implemented by the author and are also available on GitHub², while KNN and SVM come from the SKLearn library (PEDREGOSA et al., 2011).
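To make the stratified 3-fold idea concrete, the sketch below splits recordings so that every speaker appears in each training and test partition, and enumerates a small hyperparameter grid. It is a standard-library illustration only: it does not reproduce the API of the grid-search implementation cited above, and the speaker labels, grid values, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def stratified_folds(labels, n_folds=3):
    """Assign sample indices to folds, spreading each speaker evenly."""
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(labels):
        by_speaker[speaker].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_speaker.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(labels, n_folds=3):
    """Yield (train, test) index lists, holding out one fold at a time."""
    folds = stratified_folds(labels, n_folds)
    for held_out in range(n_folds):
        test = sorted(folds[held_out])
        train = sorted(i for f in range(n_folds) if f != held_out
                       for i in folds[f])
        yield train, test


def parameter_grid(options):
    """Yield every combination of the given hyperparameter values."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical data: 2 speakers with 6 recordings each.
labels = ["spk_a"] * 6 + ["spk_b"] * 6
splits = list(train_test_splits(labels))
grid = list(parameter_grid({"n_neighbors": [1, 3, 5],
                            "weights": ["uniform", "distance"]}))
print(len(splits), len(grid))
```

In a full grid search, each combination in `grid` would be trained and scored on every split, and the combination with the best average score would be kept.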
Since this is a non-standard dataset, we looked for models with distinct characteristics. We decided to use FCM as it is an unsupervised method with a decent background for this problem, as well as being a fuzzy model. KNN is used here for its known ability in multi-label classification, while FKNN serves as its fuzzy counterpart. This allows a direct comparison of both methodologies in our experiments. Finally, the SVM has a very long
¹ See: <http://www.github.com/thalesaguiar21/Gryds>
² See: <http://www.github.com/thalesaguiar21/Fuzzy>