Exploring deep learning representations for biometric multimodal systems.

Eduardo José da Silva Luz

Ouro Preto

2019

Advisor:

Prof. Dr. David Menotti

Co-advisor:

Prof. Dr. Gladston Moreira

Thesis submitted to the Programa de Pós-Graduação em Ciência da Computação of the Instituto de Ciências Exatas e Biológicas, Universidade Federal de Ouro Preto, in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Cataloging: www.sisbin.ufop.br

Advisor: Prof. Dr. David Menotti Gomes.

Co-advisor: Prof. Dr. Gladston Juliano Prates Moreira.

Thesis (Doctorate) - Universidade Federal de Ouro Preto. Instituto de Ciências Exatas e Biológicas. Departamento de Computação. Programa de Pós-Graduação em Ciência da Computação.

Area of concentration: Computer Science.

1. Learning. 2. Biometrics. 3. Transfer learning. I. Gomes, David Menotti. II. Moreira, Gladston Juliano Prates. III. Universidade Federal de Ouro Preto. IV. Title.


Resumo

Biometrics is an important research area today. A complete biometric system comprises sensors, feature extraction, pattern matching algorithms, and decision making. Biometric systems demand high accuracy combined with robustness, and to that end researchers combine several biometric sources, two or more pattern matching algorithms, and different decision-making systems. These systems are called multimodal biometric systems and today represent the state of the art in biometrics. However, feature extraction in multimodal biometric systems remains a major challenge. Deep learning has been used by machine learning researchers to extract features automatically, and several problems have seen considerable advances, face recognition among them. However, deep learning based methods require large amounts of data and, with the exception of face recognition, such data are not available in the literature for the other modalities, which hinders the investigation of deep learning in multimodal methods. In this thesis, we propose a set of contributions to enable the use of deep learning in multimodal biometric systems.

To achieve these goals, we first explore data augmentation and transfer learning techniques for training deep convolutional networks on biometric databases that are restricted in the number of images/signals. Next, we propose a simple protocol, aiming at reproducibility, for the creation and evaluation of chimeric (or synthetic) multimodal databases. Finally, we investigate the impact of fusion in multimodal biometric systems in which all modalities are represented by means of deep descriptors. In this work, we show that the substantial gains already obtained for the face modality can be brought to four other biometric modalities by exploiting deep learning. We also show that fusing modalities is a promising path, even when they are represented by means of deep learning. We advance the state of the art on important databases in the literature, such as FRGC (periocular region), NICE/UBIRIS.V2 (periocular region and iris), MobBio (periocular region and face), CYBHi (off-the-person ECG), UofTDB (off-the-person ECG), and Physionet (EEG signal). Our best multimodal approach, on the chimeric database, achieved a decidability of 9.15 ± 0.16 and perfect recognition (i.e., an EER of 0.00% ± 0.00) in the single-session multimodal scenario. In the scenario across two sessions, we achieved a decidability of 7.91 ± 0.19 and an EER of 0.03% ± 0.03, a gain of more than 22% over the best unimodal case.


Abstract

Biometrics is an important area of research today. A complete biometric system comprises sensors, feature extraction, pattern matching algorithms, and decision making. Biometric systems demand high accuracy and robustness, so researchers combine several biometric sources, two or more pattern matching algorithms, and different decision-making systems. These systems are called multimodal biometric systems and today represent the state of the art in biometrics. However, feature extraction in multimodal biometric systems remains a major challenge.

Deep learning has been used by researchers in the machine learning field to automate feature extraction, and several advances have been achieved, for example in face recognition. However, deep learning based methods require a large amount of data and, with the exception of face recognition, there are no databases large enough for the other biometric modalities, which hinders the application of deep learning in multimodal methods. In this thesis, we propose a set of contributions to favor the use of deep learning in multimodal biometric systems. First, we explore data augmentation and transfer learning techniques for training deep convolutional networks on biometric databases with a restricted number of labeled images. Second, we propose a simple protocol, aiming at reproducibility, for the creation and evaluation of chimeric (or synthetic) multimodal databases. This protocol allows the investigation of combinations of multiple biometric modalities, even less common and novel ones. Finally, we investigate the impact of fusion in multimodal biometric systems in which all modalities are represented by means of deep descriptors.

We show that the substantial gains already obtained for the face modality can be brought to four other biometric modalities by exploiting deep learning. We also show that fusing modalities is a promising path, even when they are represented by means of deep learning. We advance the state of the art on important databases in the literature, such as FRGC (periocular region), NICE/UBIRIS.V2 (periocular region and iris), MobBio (periocular region and face), CYBHi (off-the-person ECG), UofTDB (off-the-person ECG), and Physionet (EEG signal). Our best multimodal approach, on the chimeric database, achieved a decidability of 9.15 ± 0.16 and perfect recognition (i.e., an EER of 0.00% ± 0.00) in the intra-session multimodal scenario. For the inter-session scenario, we report a decidability of 7.91 ± 0.19 and an EER of 0.03% ± 0.03, which represents a gain of more than 22% over the best inter-session unimodal case.


This thesis is the result of my own work, except where explicit reference is made to the work of others, and has not been submitted for another qualification at this or any other university.

Part of this work has already been published, and this text is an adapted composition of articles published by the author.

• Luz, E., Moreira, G., Junior, L. A. Z., and Menotti, D. (2018). Deep periocular representation aiming video surveillance. Pattern Recognition Letters, 114, 2-12.

• Luz, E. J., Moreira, G. J., Oliveira, L. S., Schwartz, W. R., and Menotti, D. (2018). Learning Deep Off-the-Person Heart Biometrics Representations. IEEE Transactions on Information Forensics and Security, 13(5), 1258-1270.

• Silva, P. H., Luz, E., Zanlorensi, L. A., Menotti, D., and Moreira, G. (2018, July). Multimodal Feature Level Fusion based on Particle Swarm Optimization with Deep Transfer Learning. In 2018 IEEE Congress on Evolutionary Computation (CEC) (pp. 1-8). IEEE.

• Schons, T., Moreira, G. J., Silva, P. H., Coelho, V. N., and Luz, E. J. (2017, November). Convolutional Network for EEG-Based Biometric. In Iberoamerican Congress on Pattern Recognition (pp. 601-608). Springer, Cham.

Eduardo José da Silva Luz


First, I thank Prof. Dr. David Menotti for his guidance and for always believing in me and in my potential. I also thank him for the consistently correct and scientific manner in which he approached our work, which was fundamental to my academic training.

I also thank Prof. Dr. Gladston Moreira for helping me with countless important questions in my research, for our conversations, and for his words of encouragement in difficult moments.

I thank my colleagues at CSILab, especially Pedro Silva, for the research partnership.

I thank my colleagues at DECOM/UFOP for allowing me to dedicate myself exclusively to this work, without which its quality would have been considerably compromised.

I thank IBM for the recognition and for granting me the PhD Fellowship 2017-2018. This award attested to the quality of my research and was an important personal milestone for me.

I thank Prof. Dr. Luís Paquete, of the Universidade de Coimbra, for welcoming me with open arms and for the time he dedicated to me.

Finally, I thank my family for their indispensable support. Thank you very much.


List of Figures

List of Tables

1 Introduction

1.1 Objectives and Contributions

1.2 Thesis Outline

2 Biometric Concepts, Multimodality, Deep Learning And Challenges

2.1 Basic Concepts

2.1.1 Biometric Mode Operation

2.1.2 Evaluation Metrics

2.1.3 Biometric Systems Limitations

2.1.4 Multimodality

2.2 Biometrical Multimodal Datasets and Benchmarks

2.3 Biometrical Chimerical Datasets

2.4 Deep Learning for Feature Representation

2.5 Challenges of Using Deep Learning on Multimodal Biometric Systems

3 Biometric Modalities and Databases

3.3 Face Recognition Grand Challenge - FRGC (Eye and Face)

3.4 NICE/UBIRIS.v2 (Iris and Periocular)

3.5 MobBio database (Face and Periocular)

4 Multimodal Chimerical Dataset Creation Protocol

4.1 Cleaning and filtering the data

4.2 Feature extraction

4.3 Building the Chimerical dataset

4.4 Feature matching and Decision

4.5 Comparing Results

5 Data Augmentation Approach

5.1 Off-the-person ECG Study Case

5.1.1 One Dimensional Deep Learning Model

5.1.2 Two Dimensional Deep Learning Model

5.1.3 Data Augmentation and Dropout

5.1.4 Fusion

5.1.5 Experimental Results and Discussion

5.2 Physionet EEG Study Case

5.2.1 Data pre-processing

5.2.2 Segmentation and Data Augmentation

5.2.3 Deep Learning Model

5.3 NICE Periocular and MobBio Study Case

5.4 Conclusion

6 Deep Transfer Learning Approach

6.1 NICE and MobBIO, Periocular - Study Case

6.1.2 DED with feature size control layer

6.1.3 Computational Cost Analysis

6.1.4 Robustness Analysis

6.2 MobBIO, Face Study Case

6.3 FRGC Periocular and Face Study Case

6.3.1 Pre-processing and segmentation

6.4 UBIRIS.v2 and NICE, IRIS Study Case

7 Multimodal Fusion Approach

7.1 Bi-modal Fusion Approach

7.2 Multimodal Chimeric database Evaluation

8 Final Considerations

8.1 Future Works

Bibliography

2.1 Operation modes and data flow in a biometric system. Source: Lumini & Nanni (2017).

2.2 Error rate calculation. a) For a threshold t, FMR = false match rate (a.k.a. False Acceptance Rate) and FNMR = false non-match rate (a.k.a. False Rejection Rate). b) DET curve: the threshold definition allows defining different values of FMR and FNMR. Source: Jain et al. (2004a).

2.3 Most common fusion types. a) Features b) Score c) Decision making. Source: (Unar et al. 2014).

3.1 Example of commercial off-the-person ECG mobile equipment not yet used for biometrics. Source: https://www.alivecor.com/en/

3.2 Database acquisition hardware prototypes.

3.3 Histogram of heartbeat distribution before (a) and after (b) the outlier removal algorithm for CYBHi DB.

3.4 Distribution of the 64 electrodes over the scalp.

3.5

3.6 Example of FRGC face images. Source: (Phillips et al. 2005).

3.7 UBIRIS.v2 images database.

3.8 MobBIO: Face and Eye images.

4.1 Recognition process of a chimerical individual.

5.1 Overview of proposed method.

5.2 Convolutional process of a raw ECG.

5.3 Spectrogram process of a raw ECG.

5.4 Heartbeats morphing due to different heart rates between sessions. Source: (Da Silva et al. 2013).

5.5 Example of data augmentation on one heartbeat from record 1. a) Normal heartbeat. b) Heartbeat in scale 10% bigger. c) Heartbeat in scale 10% smaller. d) P-wave attenuated. e) Gain on P-wave. f) T-wave attenuated. g) Gain on T-wave.

5.6 Impact of fusion rules and data augmentation.

5.7 Final results on CYBHi and UofTDB. [Best viewed in color.]

5.8 Outputs for Arch B CN training with and without data augmentation.

5.9 Deep learning model.

5.10 The distribution of ECG segments used for training and testing. Source: Schons et al. (2017).

5.11 DET Curve for proposed experiments.

5.12 Deep eye descriptor (DED) process for periocular modality.

5.13 Data augmentation examples for training VGG from scratch. a) Original image; b) Rotated by -5 degrees; c) Rotated by +10 degrees; d) Image from the other eye with random noise.

5.14 Training VGG from scratch on periocular UBIRIS.v2 data - 64 new imgs/class generated with DCGAN.

5.15 Synthetic images generated with DC-GAN.

6.1 Deep Eye Descriptor (DED) process for iris modality.

6.2 Fine tuning training on periocular UBIRIS.v2 data for 14 epochs.

6.4 Intra-class and inter-class histogram distribution on NICE evaluation set. The axes of the graph are score (x) vs frequency (y).

6.5 Robustness analysis for NICE test set, using the DED 256 network and an EER threshold of 0.625. Scores before noise/after noise.

6.6 Very noisy images from the NICE test set. Evaluation is done with the DED 256 network, the cosine distance metric, and an EER threshold of 0.625. Top images are used as the probe, and the respective bottom one is the image retrieved from the remaining data as the best match (scores in the middle row). Values lower than the established threshold (0.625) are considered the same subject. a) Images of the same subject, correctly classified; b) Images of different subjects.

6.7 Network outputs for a periocular image as input. Best seen in digital format.

6.8 Face Frontalization on MobBIO.

6.9 Impact of frontalization, MobBio, Face recognition.

6.10 Dlib Facial landmarks.

6.11 Example of the crops made from an image.

6.12 Iris normalization process, represented in polar coordinates.

6.13 Deep iris descriptor (DID) process for iris modality.

7.1 Method overview with fusion on matching score level and feature level for two modalities: iris and periocular. (DID - Deep Iris Descriptor; DED - Deep Eye Descriptor)

7.2 Distribution curves Intra-class and Inter-class on NICE database.

7.3 Distribution impostor/genuine curves. The first row is from the intra-session scenario, while the second one is from the inter-session scenario. The fusion is at score level using the sum rule.

2.1 Multimodal biometric databases in the literature: WVU (Crihalmeanu et al. 2007a); MBGC (Phillips et al. 2009); BiosecurID (Fierrez et al. 2010); BMDB (Ortega-Garcia et al. 2010); SDUMLA-HMT (Yin et al. 2011); MMU GASPFA (Ho et al. 2013); MobBIO (Sequeira et al. 2014a); gb2suMOD (Ríos-Sánchez et al. 2015); LEA (Bharadwaj et al. 2015). Source: (Lumini & Nanni 2017).

3.1 UofTDB subject distribution by acquisition condition and session.

3.2 Outstanding results for on-the-person and off-the-person methods. MS = Multiple Sessions; SS = Single Session.

5.1 Architectures for CN at raw ECG signal.

5.2 Architectures for CN at spectrogram.

5.3 %EER values obtained for the baseline methods and our approach in each scenario. Baseline: Odinaka et al. (2010)¹, Agrafioti et al. (2008)², Irvine et al. (2008)³, Chan et al. (2008)⁴, Islam et al. (2017)-FFT⁵, Islam et al. (2012)⁶. * We follow the fusion methodology proposed in (Islam et al. 2017). For the CYBHi database, the fusion algorithm selected all methods, and for UofTDB the fusion algorithm selected: Odinaka et al. (2010), Agrafioti et al. (2008), Islam et al. (2014), and Islam et al. (2017) (FFT).

5.4 Architecture for EEG biometry. Source: Schons et al. (2017).

5.5 EER obtained for the specified frequency bands.

5.6 Comparison with related works.

5.8 VGG from scratch trained with data augmentation techniques. Mean values after 12 executions.

6.1 Grid search on different numbers of neurons in the new layer for NICE.II. (* same as VGG architecture (without layer 36))

6.2 Memory requirement to store a thousand periocular image representations and time cost for the NICE test phase.

6.3 Results summarization. FS = from scratch with data augmentation; (* evaluation on NICE.II training set with 161 subjects; # mean values after 10 executions).

6.4 Results on different distance metrics.

6.5 Our face recognition approach in MobBIO database. * Best competition methods are not published.

6.6 Comparison of Proença (Proença 2014) and our approach in FRGC database, run 30 times. The mean and standard deviation are reported.

6.7 Results of the approaches for iris recognition in NICE.II database.

7.1 Results of the approaches for iris recognition in NICE.II database.

7.2 Results of the approaches for iris + periocular recognition in NICE.II database. * FS = Feature Selector.

7.3 Results summarization of methods evaluated with open-world scenario scheme.

7.4 Obtained single modality effectiveness (EER and decidability) from 30 executions in verification mode.

7.5 Resulting EER and decidability from 30 executions in verification mode combining modalities two at a time by different rules. (con = simple concatenation; min = minimum; mul = multiplication.)

7.6 Results of EER and decidability from 30 executions in verification mode using three modalities combined by different rules. (con = simple concatenation; min = minimum; mul = multiplication.)

7.7 Results of EER and decidability from 30 executions in verification mode using the four modalities combined by different rules. (con = simple concatenation; min = minimum; mul = multiplication.)

Introduction

Nowadays, robust and trustworthy mechanisms are required worldwide to protect our privacy, in particular when personal information grants access to valuable goods or restricted places. In this context, digital personal recognition is fundamental, and the most common method is still a personal identification number (PIN) or simply a password.

Passwords are usually related to something familiar to the individual, or they are written down somewhere, or encrypted and stored digitally. This scenario is susceptible to attacks from several fronts that attempt to steal important data. For these reasons, more efficient ways to digitally recognize an individual are studied in the literature. In this context, biometric-based techniques are the most promising path.

The biometric approach overcomes these limitations of the PIN/password approach because it uses human physiological or behavioral characteristics, which may be highly unique and are more difficult to copy or fake.

Biometrics is a highly researched topic, and several researchers are funded by governments and corporations. The findings already have real-world uses, from simple smartphone access to terrorist identification in public spaces. Several biometric techniques achieve almost perfect accuracy when the environment is controlled (Masek 2003), although this is not the case in unconstrained (a.k.a. in-the-wild) environments (Luz, Moreira, Junior & Menotti 2018). Furthermore, in a real scenario, individuals are often far from surveillance cameras, subject to different light sources, and may even be wearing accessories to purposely deceive biometric systems (glasses, hats, make-up, etc.). According to Neves et al. (2016), there is no biometric system capable of working with standard surveillance systems. Thus, the unconstrained scenario is the one with the most room for research in the literature. There are three major pathways to improve biometrics in unconstrained environments: improving anti-spoofing techniques, adding robustness to the digital representation of a modality (feature representation), and using multimodal systems. In this work, we focus on the latter two pathways.

Today, deep learning techniques represent the state of the art for robust feature representation. However, with the exception of the face modality, few investigations have been done on the use of deep learning to represent other biometric modalities (Ghosh 2015). Ghosh claims that a number of issues need to be investigated, among them whether the lack of large databases is the obstacle to using deep learning on multiple biometric modalities, since deep learning demands a large volume of data to successfully train very deep convolutional networks (CN) (Parkhi et al. 2015, Schroff et al. 2015a). Bringing the gains achieved with deep learning on the face recognition problem to other modalities may allow robust multimodal systems with real potential for use in surveillance systems (Luz, Moreira, Junior & Menotti 2018).

Another pathway to improve biometric systems is multimodality. A multimodal system is one that uses two or more biometric modalities. A biometric modality is a characteristic that can be used to identify or differentiate an individual. Some examples of biometric modalities are face, fingerprint, iris, voice, vital signs, and gait. A strong reason to use a multimodal system is that it is not possible to define beforehand which is the “best” biometric modality, because each modality has a scenario in which it works better. Moreover, according to ensemble classification theory, diversity of sources can improve recognition rates. In that sense, the modalities can counterbalance one another's weaknesses and yield enhanced performance. Nevertheless, how to merge two or more modalities in a biometric system is still an open problem.

An issue regarding multimodal biometric studies is the limited availability of databases. In the literature, multimodal databases are much scarcer than unimodal ones. Although it is possible to find genuine multimodal databases, such as WVU (Crihalmeanu et al. 2007b), with six modalities (fingerprint, face, iris, palm print, hand geometry, and voice); MOBio (McCool et al. 2012), with face and voice; and MobBIO (Sequeira et al. 2014b), with face, eye, and voice, among others (Jain et al. 2005, Nageshkumar et al. 2009), their number is still insufficient. This raises two important questions addressed by this work: how can deep learning be used to enhance the data representation of less common biometric modalities, which are often associated with limited data? Are there advantages in merging data from multiple modalities when they are represented by deep learning methods?


Another problem raised here concerns the combination of the so-called standard modalities (face, eye, fingerprint) with some unusual ones, such as those coming from vital signs. To the best of our knowledge, there are no databases in the literature that mix the standard modalities with less common ones, such as those provided by vital signals (ECG and EEG). Vital-sign biometrics from heart and brain waves, for instance, have shown outstanding results when used alone (Luz, Moreira, Oliveira, Schwartz & Menotti 2018, Schons et al. 2017). Nevertheless, there are few works in the literature combining these modalities. To the best of our knowledge, there is no work in the literature investigating the fusion of face and eye/iris with modalities from off-the-person cardiac or cerebral signals. Thus, another contribution of this work is to investigate such fusion. Nonetheless, databases that allow such an investigation are not yet available, hence our motivation to explore chimerical databases. A chimerical database is an artificially constructed database, often built by combining multiple databases.

According to Wayman (2006a), the modalities of an individual are correlated with each other, and therefore chimerical databases are not ideal. However, in the absence of multimodal databases in the literature, chimerical databases are the only option remaining to researchers, and their use can lead to a better understanding of the potential impact of combining new modalities to enhance biometric systems.

It is worth mentioning that there is no standard protocol in the literature guiding the creation of chimerical databases. How to combine multiple databases to create a chimerical one is still a point of discussion. The present work also addresses the creation of chimerical databases for biometric purposes.
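To make the idea of a chimerical database concrete, the following minimal Python sketch pairs real subjects from independent unimodal databases into virtual subjects. It is only an illustration under stated assumptions: the function name `build_chimeric_subjects`, the dictionary layout, the toy identifiers, and the random one-to-one pairing are hypothetical and are not the protocol proposed in this thesis, which is described in Chapter 4.

```python
import random


def build_chimeric_subjects(databases, seed=0):
    """Pair real subjects from independent databases into virtual subjects.

    `databases` maps a modality name to a dict {subject_id: samples}.
    Each chimeric subject receives exactly one real subject per modality,
    and no real subject is used twice (hypothetical pairing rule).
    """
    rng = random.Random(seed)  # fixed seed so the pairing is reproducible
    # The smallest database bounds the number of chimeric subjects.
    n_subjects = min(len(subjects) for subjects in databases.values())

    shuffled = {}
    for modality, subjects in databases.items():
        ids = sorted(subjects)  # deterministic order before shuffling
        rng.shuffle(ids)
        shuffled[modality] = ids[:n_subjects]

    # The k-th chimeric subject takes the k-th shuffled subject of each modality.
    return [
        {modality: shuffled[modality][k] for modality in databases}
        for k in range(n_subjects)
    ]


# Toy data: three face subjects and two ECG subjects (illustrative only).
toy_databases = {
    "face": {"f1": [], "f2": [], "f3": []},
    "ecg": {"e1": [], "e2": []},
}
chimeras = build_chimeric_subjects(toy_databases, seed=42)
for chimera in chimeras:
    print(chimera)
```

Repeating the pairing with different seeds gives different chimerical populations, which is one way to report a mean and a standard deviation over several executions.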

The experiments conducted in this work reinforce the use of multimodal systems for the in-the-wild (a.k.a. noncooperative) biometric scenario. Results show the advantages of multimodal systems, which in this work combine five modalities described by deep representations: face, eye, iris, ECG, and EEG. We also explore data augmentation and transfer learning to favour the use of deep learning in modalities for which databases are limited in size, and we outperform state-of-the-art methods for subject recognition in verification mode for modalities from the periocular region (NICE, FRGC, MobBIO), iris (NICE), cardiac signal (CYBHi, UofTDB), and brain signal (Physionet). Furthermore, we conduct several experiments addressing different scenarios, aiming to understand the effects of different types of modality combination/fusion with a systematic and reproducible protocol.

1.1 Objectives and Contributions

Motivated by the challenges previously mentioned, the main objectives of this work are twofold:

1. investigate the feasibility of learning deep feature representations for biometric modalities beyond the face; and

2. investigate whether there are advantages in combining data from multiple modalities when they are represented by deep learning methods.

The specific objectives are:

1. Propose deep learning based methods for biometric recognition with the following modalities: off-the-person ECG signal, EEG signal, iris, periocular region, and face.

2. Investigate and propose data augmentation techniques for signals in one and two dimensions.

3. In the context of deep learning, explore transfer learning from different domains for biometric recognition.

4. Explore multimodal fusion techniques for biometric modalities represented with the aid of deep learning.

5. Propose a protocol for the creation and evaluation of chimeric databases.

In order to achieve the objectives, we propose different approaches. The contributions resulting from these approaches are:

1. Data augmentation approaches for three modalities in which the amount of available data is limited: ECG signal, EEG signal, and eye region. We outperform state-of-the-art methods for off-the-person ECG and brain signal biometrics on public benchmark databases.

2. A novel CN model, based on transfer learning from the face domain to the periocular/iris domain, aiming to improve non-cooperative and in-the-wild periocular/iris recognition. We outperform state-of-the-art methods for the periocular (eye) and iris modalities on challenging databases such as NICE, FRGC, and MobBIO.

3. An easy and reproducible protocol for chimerical database creation and evaluation.

4. Analysis of the fusion of standard modalities (face, eye) with modalities extracted from heart and brain signals.

5. Analysis of late fusion (matching score level) and feature level fusion for biometric modalities represented with deep learning techniques, more specifically convolutional networks (CN).

6. A new feature level fusion by means of particle swarm optimization with the decidability measure as the fitness function.
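The decidability measure mentioned in the last contribution can be sketched as follows. The sketch assumes the commonly used definition of the decidability index, $d' = |\mu_G - \mu_I| / \sqrt{(\sigma_G^2 + \sigma_I^2)/2}$, computed from the genuine (G) and impostor (I) score distributions. The function name and the toy scores are illustrative assumptions, and the particle swarm optimization itself is not shown.

```python
import math
import statistics


def decidability(genuine_scores, impostor_scores):
    """Decidability index d' between genuine and impostor score distributions.

    Assumes the common definition: the absolute difference of the two means
    divided by the square root of the average of the two variances.
    """
    mu_g = statistics.mean(genuine_scores)
    mu_i = statistics.mean(impostor_scores)
    var_g = statistics.pvariance(genuine_scores)   # population variance
    var_i = statistics.pvariance(impostor_scores)
    return abs(mu_g - mu_i) / math.sqrt((var_g + var_i) / 2.0)


# Toy similarity scores (illustrative only, not results from this thesis).
genuine = [0.9, 0.8, 1.0]
impostor = [0.1, 0.2, 0.0]
d_prime = decidability(genuine, impostor)
print(f"decidability = {d_prime:.3f}")
```

A larger value means the two score distributions are better separated, which is why such a measure can plausibly serve as a fitness function to be maximized.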


1.2 Thesis Outline

The remainder of this thesis is structured as follows.

Chapter 2 [Biometric Concepts, Multimodality, Deep Learning And Challenges] presents basic concepts about biometrics, biometric multimodality, and deep learning. It also presents challenges for the use of deep learning in biometrics.

Chapter 3 [Biometric Modalities and Databases] presents the databases investigated in this work and the related works for each database.

Chapter 4 [Multimodal Chimerical Dataset Creation Protocol] describes our proposed protocol for the creation and evaluation of chimerical biometric databases.

Chapter 5 [Data Augmentation Approach] presents our data augmentation techniques for the deep representation of three biometric modalities: off-the-person ECG, EEG signal, and periocular region (eye).

Chapter 6 [Deep Transfer Learning Approach] presents our transfer learning approach for the deep representation of three biometric modalities: periocular region (eye), iris, and face.

Chapter 7 [Multimodal Fusion Approach] analyzes the fusion of the methods proposed in this work. We present one approach for a real bi-modal database and another for a chimerical database.

Chapter 8 [Final Considerations] concludes the thesis by summarizing the main results and presenting possible future research.

Biometric Concepts, Multimodality, Deep Learning And Challenges

In this chapter, we introduce some fundamental concepts regarding biometrics and present metrics for evaluating biometric systems. We discuss concepts about multimodality, multimodal databases, and chimerical database approaches. We also present challenges and limitations; in particular, we discuss the difficulties of representing biometric modalities by means of deep learning.

2.1 Basic Concepts

We start with some basic concepts and definitions. A common approach for biometric systems is to identify an individual from a biometric source captured by a sensor. Most biometric sources are generated/captured from images of body parts, such as face (Wagner et al. 2012), iris (Proença & Alexandre 2012), and fingerprint (Jain & Feng 2011). However, sources from other signals generated by a subject, such as audio/voice (Nakagawa et al. 2012, Reynolds & Rose 1995) and vital signals (Luz et al. 2014), could also be useful for identification, as well as behavioral sources, such as gait (Liu & Sarkar 2007a). A biometric source captured by a sensor is considered a biometric modality.

A biometric system is composed of at least one biometric modality and one identification algorithm. According to Jain et al. (2004a), a human trait qualifies as a biometric modality when it satisfies:

• Universality: it should be present in every human.

• Distinctiveness: it should be capable of differentiating two humans.

• Permanence: it should be time invariant.

• Collectability: it should be quantifiable/measurable.

In general, a biometric system is a pattern recognition system composed of five modules, as shown in Figure 2.1:

1. Data acquisition: digitization of a biometric modality using sensors.

2. Data representation: pattern recognition techniques employed to represent one modality, preferably emphasizing discriminatory traits.

3. Database: after feature extraction, the data is stored in a database to allow future comparison.

4. Classification: compares feature vectors and provides a score.

5. Decision making: decides, based on the classification output, whether the subject is who he/she claims to be or whether the subject belongs to a certain set of subjects.

2.1.1 Biometric Mode Operation

The usage of the modules varies according to the operation mode of the system: enrollment, verification, or identification (see Figure 2.1).

1. Enrollment: during enrollment, the subject makes his/her first contact with the system. One modality is captured by sensors and then a representation is built by means of a feature extraction technique. Finally, the data is stored in a database and a new individual is added to the system.

2. Verification: the system validates whether a subject is who he/she claims to be. That is, the input information is compared against the stored data of the identity previously claimed to the system. Verification is also called positive identification, where the goal is to prevent more than one person from accessing the same information (Jain et al. 2004a). Positive identification is widely used, and its most common forms are alphanumeric passwords and token cards.

Figure 2.1: Operation modes and data flow in a biometric system. Source: Lumini & Nanni (2017).

Considering verification by similarity, a verification system can be mathematically modeled as in Equation 2.1:

$$(I, X_q) \in \begin{cases} w_1, & \text{if } S(X_q, X_I) > t \\ w_2, & \text{otherwise} \end{cases} \qquad (2.1)$$

where $I$ is the identity to be verified, $X_q$ is the input feature vector, $w_1$ is the genuine subject class, $w_2$ is the impostor subject class, $S$ is a function that measures the similarity between two feature vectors, $X_q$ and $X_I$, and $t$ is a pre-defined threshold (Jain et al. 2004a).

The similarity score is calculated using some distance metric. The most common distance metrics are Euclidean distance, Manhattan distance, Mahalanobis distance, Spearman distance, cosine distance, and Hamming distance.

3. Identification: the system determines whether a subject belongs to one subject class among a group of subject classes. The identity of the subject is not previously provided and thus, the input information is compared against all data of all subjects in the group/gallery. This mode is commonly performed with the aid of classifiers.

Since the verification mode resembles an open world problem, it is considered a more challenging mode. Also, when using this mode, the emphasis is on the feature extraction/representation techniques rather than on the classifiers. In this thesis, we are interested in the feature representation of the biometric data and thus, the verification mode is the most appropriate mode. All the experiments performed in this work are evaluated in the verification mode.

Figure 2.2: Error rate calculation. a) For a threshold t, FMR = false match rate (a.k.a. False Acceptance Rate) and FNMR = false non-match rate (a.k.a. False Rejection Rate). b) DET curve: the choice of threshold defines different values of FMR and FNMR. Source: Jain et al. (2004a).

It is important to note that the verification mode uses a threshold t in the comparison. Two sensor acquisitions from the same subject may present variations between them. These variations can be generated by the subject himself/herself (for example, a different haircut or a different pose) or even by sensor imperfection (noise). The threshold directly affects system errors, and finding the most appropriate one depends on the application. It is common in the literature to construct similarity distribution curves (histograms of scores for genuine and impostor pairs) as well as the detection error trade-off (DET) curve for method evaluation (see Figure 2.2).
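The decision rule of Equation 2.1 can be illustrated with a minimal sketch. The choice of cosine similarity, the three-dimensional vectors, and the threshold values are assumptions made only for illustration; they are not the configuration used in the experiments of this thesis.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (higher = more similar)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(x_q, x_i, t, similarity=cosine_similarity):
    """Return "w1" (genuine) if S(x_q, x_i) > t, otherwise "w2" (impostor)."""
    return "w1" if similarity(x_q, x_i) > t else "w2"

# Hypothetical feature vectors, for illustration only.
enrolled = [0.9, 0.1, 0.4]         # template stored for the claimed identity
probe_genuine = [0.8, 0.2, 0.5]    # new acquisition of the same subject
probe_impostor = [-0.3, 0.9, -0.2]

print(verify(probe_genuine, enrolled, t=0.8))
print(verify(probe_impostor, enrolled, t=0.8))
# A stricter threshold rejects the same genuine probe (a false non-match).
print(verify(probe_genuine, enrolled, t=0.99))
```

The last call shows why the threshold must be tuned per application: raising t reduces false matches at the cost of more false non-matches.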

2.1.2 Evaluation Metrics

Among the metrics adopted in the literature for the evaluation of biometric methods, in this thesis we consider the Equal Error Rate (EER) and the decidability.

The decidability (d) measures how well separated the intra-class (genuine) and inter-class (impostor) score distributions are (Daugman & Williams 1996) (see Figure 2.2(a)). Decidability can be seen as a measure of the quality of features. The decidability index can be defined as follows

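$$d = \frac{|\mu_E - \mu_I|}{\sqrt{\frac{1}{2}\left(\sigma_I^2 + \sigma_E^2\right)}} \qquad (2.2)$$

where $\mu_I$ and $\mu_E$ are the means and $\sigma_I$ and $\sigma_E$ are the standard deviations of the intra-class and inter-class score distributions, respectively.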

The Equal Error Rate (EER) is the most popular metric in the literature and is defined as the point at which the False Match Rate (FMR) is equal to the False Non-Match Rate (FNMR) (see Figure 2.2(b)). The FMR is the probability of a type I error (an impostor is accepted), and the FNMR is the probability of a type II error (a genuine subject is rejected).

The EER is a metric that represents the common case for biometric systems evaluation, which explains its wide use in biometric works. In this work, we are especially interested in the quality of the generated features, which also justifies the use of decidability.
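Both metrics can be computed directly from the lists of genuine and impostor scores. The sketch below is only illustrative and makes some assumptions: scores are similarities (higher means more similar, accepted when above the threshold, as in Equation 2.1), the population variance is used in Equation 2.2, and the EER is approximated by scanning the observed scores as candidate thresholds rather than by interpolating the DET curve.

```python
import numpy as np

def decidability(genuine, impostor):
    """Decidability index d of Equation 2.2 (population variances assumed)."""
    g = np.asarray(genuine, dtype=float)
    i = np.asarray(impostor, dtype=float)
    return abs(i.mean() - g.mean()) / np.sqrt(0.5 * (g.var() + i.var()))

def equal_error_rate(genuine, impostor):
    """Approximate EER: the error rate at the threshold where FMR and FNMR are closest."""
    g = np.asarray(genuine, dtype=float)
    i = np.asarray(impostor, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(np.concatenate([g, i])):
        fmr = np.mean(i > t)    # impostor pairs wrongly accepted
        fnmr = np.mean(g <= t)  # genuine pairs wrongly rejected
        gap = abs(fmr - fnmr)
        if gap < best_gap:
            best_gap, eer = gap, (fmr + fnmr) / 2.0
    return float(eer)

# Hypothetical similarity scores, for illustration only.
genuine_scores = [0.6, 0.7, 0.8, 0.4]
impostor_scores = [0.5, 0.3, 0.2, 0.1]

print("decidability:", decidability(genuine_scores, impostor_scores))
print("EER:", equal_error_rate(genuine_scores, impostor_scores))
```

A larger d indicates better separated score distributions, while a lower EER indicates fewer errors at the balanced operating point.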

2.1.3 Biometric Systems Limitations

According to Jain et al. (2004a), biometric systems suffer from the following limitations:

1. Noise: Several types of noise can affect biometric systems, and those types can be produced by the environment or by the subject himself/herself, such as the use of contact lenses in iris recognition or wearing sunglasses or makeup in the case of face recognition.

2. Intra-class variation: Data collected during authentication may be different from the data stored in the database. These variations can be caused by noise or variations during interaction with the sensor (pose).

3. Distinctness: Every biometric modality has a maximum discrimination capability, that is, it has a maximum number of representable patterns. Consequently, every modality has a maximum number of subjects representable.

4. Non-universality: A good biometric modality should cover the majority of individuals. Even widely used modalities, such as fingerprint, have limitations in this area. For example, a subject may be born without limbs or even lose them in accidents.

5. Spoofing attack: Impostors may pretend to be someone else to gain access to third-party resources. In (Menotti et al. 2015), contact lenses are used to attack an iris recognition system.

Simultaneous use of multiple modalities has been shown to be a promising path (Liu & Sarkar 2007b, Ross & Jain 2003, Wang et al. 2003) to reduce the above limitations and to increase the accuracy of biometric systems. Multimodal systems make spoofing attacks more difficult and reduce the problem of non-universality, besides increasing distinctiveness (Jain et al. 2004a).

2.1.4 Multimodality

Biometric systems require robustness combined with high accuracy, and the challenges to identify individuals in uncontrolled/in-the-wild environments are enormous (Huang et al. 2007, Proença et al. 2010, Wolf et al. 2011). In this uncontrolled scenario, multimodal biometrics could be the solution (Bowyer et al. 2006, Karmakar & Murthy 2014, Ross & Jain 2003, Shekhar et al. 2014, Sim et al. 2014a, Wayman 2006b) to advance the field. Although the definition of multimodality terminology is not clear in the literature, this problem has been addressed in (Bowyer et al. 2006) and five terminologies are presented:

1. Multi-Algorithms: In this category, only one sensor is used to capture data from only one physical or behavioral trait. However, two or more different algorithms can be used to extract features and/or to match and fuse patterns. This approach is common because databases considered unimodal can be reused, which saves the cost of creating new databases.

2. Multi-Sample (or Multi-Instance): Multiple samples from the same biometric source are captured, processed by the same biometric system and subsequently merged. The research work presented in (Chang et al. 2005) suggests that multi-sampling can be as efficient or even more efficient than the use of multiple modalities. Furthermore, the multi-sampling system is facilitated by requiring only one type of sensor. However, a tradeoff of this approach is the computational cost of acquisition.

3. Multi-Modal - Orthogonal: This category considers methods that actually use more than one biometric source, that is, more than one body or behavioral trait acquired on different sensors.

4. Multi-Modal - Independent: A single biometric source treated by different algorithms or different views. For example, the partition of the face into multiple sub-images (patches) (Ding & Tao 2015). This type of approach is widely found in the literature.

5. Multi-Modal - Collaborative: The collaborative approach is less common in the literature. The purpose of this approach is to use information from one biometric modality to aid the processing of another modality. For example, in (Ding & Tao 2015), the 3D model was used to find fiducial points that were later used on the 2D face for biometric recognition.

To simplify the nomenclature, in this work we refer to all five categories described by Bowyer et al. (2006) simply as multimodal.

In addition to the different nomenclature, multimodal systems can operate in three modes: serial, parallel and hierarchical (Jain et al. 2004a). In parallel mode, all modalities are acquired a priori and the system makes use of all modalities simultaneously to perform the recognition. In serial mode, there are several unimodal systems connected in cascade. The output of one system is connected to the input of the subsequent one. Hierarchical mode considers a tree taxonomy of modalities. In the literature, the most common mode is parallel and therefore this is the mode explored in this work.

Learning relevant representations is a challenge when working with multimodality. Designing handcrafted features for each modality requires great effort and deep human knowledge of each specific modality. Currently, deep learning has been used to automate the representation learning process, and methods based on deep learning have proven to be efficient for several pattern recognition tasks and today represent the state-of-the-art for face recognition, even in uncontrolled or in-the-wild environments (Abdel-Hamid et al. 2014, Collobert et al. 2011, Parkhi et al. 2015, Schroff et al. 2015b). According to (Lumini & Nanni 2017), deep learning could bring the outstanding results from face recognition to the multimodal scenario.

Roughly speaking, the success of deep learning lies in two main aspects. First, given a large amount of data, the network is able to learn deep (several layers) and discriminative representations directly from the data themselves. Second, nowadays, we have the computational power to process such a large amount of data running parallel algorithms producing complex and powerful representations. However, an obstacle to applying deep learning in other modalities could be the shortage of data (Ghosh 2015) since there are few databases designed for multimodalities (Lumini & Nanni 2017).

Moreover, deep learning often provides a high-dimensional feature vector, and another major challenge is how to integrate (or fuse) information from many modalities.

Another complicating factor is that multimodal biometric systems can be merged at four levels: sensor level, feature extraction level, score level, and decision-making level (See Figure 2.3). It is worth mentioning that some authors consider fusion at the sensor level as a particular type of feature extraction level.

1. Sensor fusion: Data from multiple sensors are combined to generate a new input for the feature extraction phase (Hanif & Ali 2006, Singh et al. 2004, Wang, Liang, Hu & Li 2007).

2. Feature extraction fusion: The aim is to combine features extracted from two or more sensors in a single feature vector (Eskandari & Toygar 2015, Galdi et al. 2015, Ross & Govindarajan 2005, Shekhar et al. 2014).

3. Matching score level: Similarity scores provided by two or more classifiers are fused. The scores are normalized and then combined by means of some rule, for example, sum rule, product rule, maximum rule (Jain et al. 1999, Nguyen et al. 2015, Vajaria et al. 2007, Victor et al. 2002, Wang et al. 2003, Wild et al. 2016).

4. Decision making fusion: Classifiers perform the classification independently and a voting system is used for final decision (Bhatt et al. 2014, Li et al. 2015, Paul et al. 2014).

Which type of fusion works best with deep learning feature representations is also an open problem.
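As an illustration of fusion at the matching score level, the sketch below normalizes the scores of each modality and combines them with the sum, product, or maximum rule. Min-max normalization, the equal default weights, and the example scores are assumptions chosen for simplicity; the cited works use a variety of normalization and weighting schemes.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale the scores of one matcher to the [0, 1] range."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span

def fuse_scores(score_lists, rule="sum", weights=None):
    """Fuse per-modality score lists (one list per modality, same comparisons)."""
    normalized = np.vstack([min_max_normalize(s) for s in score_lists])
    if rule == "sum":
        if weights is None:
            # Equal weights by default; a weighted sum uses user-supplied weights.
            weights = np.full(len(normalized), 1.0 / len(normalized))
        return np.asarray(weights, dtype=float) @ normalized
    if rule == "product":
        return normalized.prod(axis=0)
    if rule == "max":
        return normalized.max(axis=0)
    raise ValueError(f"unknown fusion rule: {rule}")

# Hypothetical scores of two matchers for the same three comparisons.
face_scores = [0.2, 0.6, 1.0]
ecg_scores = [10.0, 30.0, 20.0]

print(fuse_scores([face_scores, ecg_scores], rule="sum"))
print(fuse_scores([face_scores, ecg_scores], rule="sum", weights=[0.7, 0.3]))
print(fuse_scores([face_scores, ecg_scores], rule="product"))
print(fuse_scores([face_scores, ecg_scores], rule="max"))
```

Normalization is needed because the matchers may produce scores on very different scales, as in the hypothetical face and ECG scores above.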

Figure 2.3: Most common fusion types: (a) feature-level fusion, (b) score-level fusion, (c) decision-level fusion. Source: Unar et al. (2014).

Table 2.1: Multimodal biometric databases in the literature: WVU (Crihalmeanu et al. 2007a); MBGC (Phillips et al. 2009); BiosecurID (Fierrez et al. 2010); BMDB (Ortega-Garcia et al. 2010); SDUMLA-HMT (Yin et al. 2011); MMU GASPFA (Ho et al. 2013); MobBIO (Sequeira et al. 2014a); gb2suMOD (Ríos-Sánchez et al. 2015); LEA (Bharadwaj et al. 2015). Source: (Lumini & Nanni 2017)

| Database | Year | Modality | # subjects |
|---|---|---|---|
| WVU | 2007 | fingerprint, face, iris, palmprint, hand, voice | 270 |
| MBGC | 2009 | face, iris | > 146 |
| BiosecurID | 2010 | face, voice, iris, signature, fingerprint, hand, keystroking | 400 |
| BMDB | 2010 | face, voice, iris, signature, fingerprint, hand | > 600 |
| SDUMLA-HMT | 2011 | face, iris, fingerprint, gait | 106 |
| MOBio | 2012 | face, voice | 152 |
| MMU GASPFA | 2013 | face, voice, gait | 82 |
| MobBIO | 2013 | face, iris, voice | 105 |
| gb2suMOD | 2015 | hand, iris, face | 60 |
| LEA | 2015 | fingerprint, iris, face | 18000 |

2.2 Biometrical Multimodal Datasets and Benchmarks

Despite the increase in the number of studies related to multimodal biometrics in recent years, it is difficult to measure advances, since it is difficult to identify the state-of-the-art methods when dealing with multimodality. According to (Lumini & Nanni 2017), few databases were designed specifically for multimodal biometrics evaluation; Table 2.1 lists the multimodal databases available so far. Moreover, most of them are difficult to access (not public). Thus, few studies take into account a benchmark or evaluation protocol. Since benchmarks and standardized evaluation protocols are rare for the multimodal biometric problem, it is infeasible to compare works. It is possible to infer which fusion methods are efficient for a given set of modalities; however, it is impossible to generalize conclusions.

As can be seen in Table 2.1, the multimodal databases cover only the most popular and common modalities. In order to analyze fusion involving other modalities, one must create a chimeric database.

2.3 Biometrical Chimerical Datasets

A chimeric dataset is built by creating a set of chimeric individuals where each modality comes from a different dataset. For instance, a new chimeric individual could be created by combining the fingerprint from the WVU dataset with the voice from the MOBio dataset, along with the face from MobBIO and the ECG from the CYBHi dataset. More specifically, a chimeric individual is created by selecting the samples of a given individual from each modality. The number of possible chimeric datasets (sets of chimeric individuals) grows factorially with the number of subjects in each unimodal dataset. Assuming the same number n of subjects in each of the k unimodal datasets, there are $(n!)^{k-1}$ possible chimeric datasets. There are few works in the literature using chimeric datasets, and below we present an overview of some representative works dealing with the problem.
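The construction of one chimeric dataset, and the count of all possible ones, can be sketched as follows. This is only an illustration under stated assumptions: the subject identifiers and modality names are hypothetical, and the pairing is a plain seeded random permutation, which is not necessarily the protocol proposed later in this thesis.

```python
import math
import random

def build_chimeric_subjects(datasets, seed=0):
    """Pair subjects across unimodal datasets to form chimeric individuals.

    datasets maps a modality name to a list of subject ids; all lists must
    have the same length n. The first modality keeps its order and every
    other modality is randomly permuted.
    """
    rng = random.Random(seed)
    names = list(datasets)
    n = len(datasets[names[0]])
    if any(len(datasets[name]) != n for name in names):
        raise ValueError("all unimodal datasets must have the same number of subjects")
    shuffled = {}
    for index, name in enumerate(names):
        ids = list(datasets[name])
        if index > 0:
            rng.shuffle(ids)
        shuffled[name] = ids
    return [{name: shuffled[name][j] for name in names} for j in range(n)]

def count_chimeric_datasets(n, k):
    """Number of distinct chimeric datasets: (n!)^(k-1)."""
    return math.factorial(n) ** (k - 1)

# Hypothetical subject ids, for illustration only.
unimodal = {"fingerprint": ["f1", "f2", "f3"], "ecg": ["e1", "e2", "e3"]}
for subject in build_chimeric_subjects(unimodal, seed=42):
    print(subject)
print(count_chimeric_datasets(n=3, k=2))
```

Fixing the seed makes a particular chimeric dataset reproducible, and repeating the construction with several seeds gives an estimate of the variability caused by the random pairing.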

In Komeili et al. (2018) a chimeric dataset with two modalities is proposed: fingerprint and ECG. The ECG signal is used both for liveness and recognition, although the focus of that work is liveness detection. One important contribution relies on the stop criterion proposed to reduce the size of the sample signal. The authors reported an EER of 3% for liveness detection with thirty seconds of acquired signal. The chimeric dataset is randomly built, pairing one subject from the ECG dataset with another subject in the fingerprint dataset. The ECG database was collected by the authors in the BioSec lab at the University of Toronto, while the fingerprints came from the LivDet2015 dataset (Mura et al. 2015). Five hundred unique subjects are created. The process is repeated five times due to the randomness in the dataset construction.

Another chimerical approach involving on-the-person ECG signal is the one proposed by Singh et al. (2012), which also employed fingerprint and face for the recognition task. The ECG data came from the Physionet dataset, while the fingerprint and face came from NIST. The authors investigate score-level fusion, focusing on the weighted sum rule. The reported EER, for the unimodal scenario, is 10.80% for the ECG, 4.52% for the face and 2.12% for the fingerprint. After fusion, the EER dropped significantly to 0.22%.

Barra et al. (2017) proposed a new chimeric dataset by merging the ECG signal with six bands of EEG signal. Their recognition approach is based on fiducial points of the ECG signal along with features extracted from the EEG spectrum. Five random non-overlapping segments from a total of 12 seconds of each signal are employed, in which one segment is randomly chosen for the probe set, while the remaining ones are used as the gallery. One individual from the ECG dataset is arbitrarily combined with another individual from the EEG dataset, resulting in 52 chimeric individuals. The authors also explored different EEG channels. The fusion is performed at the score level, with Euclidean distance as the distance metric. Three fusion rules are investigated: the sum, the product, and the weighted sum. The best result, reported on 5-fold cross-validation, is the one considering the weighted sum and the EEG Delta band, achieving an EER of 0.928%.

One behavioral biometric modality that is very common in the literature is the handwritten signature. In (Joshi & Kumar 2016), an approach for feature-level fusion of signature and face was proposed. Wavelet-based features are used to represent both modalities. The chimeric dataset is created by randomly associating a face with a signature, which resulted in 30 chimeric individuals. A Hamming distance classifier is employed for the genuine or impostor decision. The reported accuracy is 97.5% for the ORL dataset (face) plus the Caltech dataset (signature), and 98.88% for ORL plus the Ucoer dataset (signature). According to the authors, the fusion of the modalities provided better effectiveness than considering the modalities alone.

Two of the most promising modalities in the biometric scenario, the iris and the face, were merged in (Sim et al. 2014b) with score-level fusion using the weighted sum rule. The weights are empirically chosen and are specific for each subset of data. The authors reported an accuracy of 99.4% and used three well-established datasets in the literature: Universiti Teknologi Malaysia Iris and Face Multimodal Datasets (UTMIFM), UBIRIS version 2.0 (UBIRIS v.2) and ORL face. The UTMIFM is a dataset with both modalities, and the chimeric dataset is built stochastically from UBIRIS v.2 and ORL.

Due to the different nature of chimeric datasets, there is no standard protocol in the literature to assist in the creation of such datasets. Also, for the methods cited above, the process of creating the chimeric dataset is unclear and described with few details. Moreover, most of the authors did not perform a statistical test.

Thus, the reproducibility of the already published works is compromised and fair comparisons of the works in the literature are difficult to make. In that direction, we propose a clear and reproducible protocol for the creation and evaluation of a chimerical dataset in Chapter 4.

2.4 Deep Learning for Feature Representation

The term deep learning gained visibility in the scientific community after the publication of seminal works by Geoffrey Hinton (Hinton 2007, Hinton & Salakhutdinov 2006). In these works, they showed how a network of multiple weighted layers can be effectively pre-trained one layer at a time, treating each layer as a restricted Boltzmann machine under unsupervised learning. Then, supervised learning (autoencoder) by means of back-propagation of the errors (using gradient descent) (Rumelhart et al. 1986) is applied throughout the network for fine adjustment of the initially learned weights. Deep learning models are generative feature models, whereas a conventional classifier, for example, the Support Vector Machine (SVM) (Schölkopf & Smola 2002), is a discriminative model of classification or regression. There is a great diversity of deep representation techniques being proposed and studied by the community involved in deep learning. One of the most promising and popular is the convolutional network (CN) (LeCun et al. 1998) and for this reason, CNs will be the main architecture to be investigated in this thesis.

Convolutional Network

The network architecture is composed of the operations of convolution, activation (ReLU), pooling, normalization, dropout and finally fully connected layers.

The convolutional operation can be defined as:

$$s(t) = \int_{-\infty}^{\infty} x(a)\, w(t - a)\, da \qquad (2.3)$$

where x is referred to as the input and w as the kernel or filter, and both have the same dimensions. In this work, the output s is referred to as the feature map or feature vector.

Considering the discrete nature of digitalized signals, the computational implementation can be defined as:

$$s(t) = \sum_{n=1}^{m} x(n)\, w(t - n) \qquad (2.4)$$

where n is often smaller than the length of the input signal. There is also one hidden parameter associated with Equation 2.4: the stride. It defines the sliding step with which one signal is convolved with the other (or with a kernel). It is useful for controlling the downsampling rate of the output.
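A minimal one-dimensional sketch of the discrete convolution with a stride parameter is given below. It assumes a "valid" convolution (the kernel never slides past the borders of the input) and uses a hypothetical signal and kernel. Note that the kernel is flipped, as in the mathematical definition; many deep learning libraries actually compute the unflipped cross-correlation, which is equivalent for learned kernels.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1D discrete convolution of signal x with kernel w and a stride."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    flipped = w[::-1]  # convolution flips the kernel before sliding it
    starts = range(0, len(x) - k + 1, stride)
    return np.array([np.dot(x[s:s + k], flipped) for s in starts])

# Hypothetical signal and kernel, for illustration only.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

print(conv1d(signal, kernel))            # one output per valid position
print(conv1d(signal, kernel, stride=2))  # larger stride, fewer output samples
```

With stride 1 the result matches `np.convolve(signal, kernel, mode="valid")`, and a larger stride simply keeps every stride-th output, which is how the stride controls downsampling.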

The convolutional operation can promote sparse connectivity, parameter sharing, and equivariant representations (Goodfellow et al. 2016). These properties ensure a lower computational cost, in time and space, compared with networks with unshared weights. However, the convolution operation is not invariant to other transformations, such as changes in scale or object rotation inside images and signals. Thus, other operations, such as activation and pooling, must be used along the network.

An activation function is applied after the convolution to adjust the output values to a given range. The Rectified Linear Unit (ReLU) is the one adopted here:

$$\mathrm{activation}(s) = \begin{cases} 0, & \text{if } s < 0 \\ s, & \text{otherwise} \end{cases} \qquad (2.5)$$

This activation process aims at mimicking the function of a biological neuron allowing activation only when a certain amount of energy is reached.

The pooling operation highlights the output of the convolution step, aiming to keep relevant information and discard unnecessary details (Boureau et al. 2010). Similar to a convolution, the pooling operation is performed by passing a sliding window of size n × m over the signal. Within the window, a mathematical operation is performed, such as sum, average, or maximum. For instance, with the maximum operator, called max-pooling, the largest value contained in the window is preserved and used for the construction of the new post-pooling signal. It is useful to achieve invariance to signal transformations and robustness against noise. It is also used to represent the feature maps in a more compact shape when the stride parameter is larger than one. For example, a pooling operation with stride size two results in an output signal that is half the size of the input.
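The activation of Equation 2.5 and the max-pooling operation can be sketched for a one-dimensional feature map as follows. The window size, stride, and input values are assumptions for illustration; the two-dimensional case applies the same idea over an n × m window.

```python
import numpy as np

def relu(s):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return np.maximum(0.0, np.asarray(s, dtype=float))

def max_pool1d(s, window=2, stride=2):
    """Keep the largest value inside each sliding window."""
    s = np.asarray(s, dtype=float)
    starts = range(0, len(s) - window + 1, stride)
    return np.array([s[i:i + window].max() for i in starts])

# Hypothetical feature map, for illustration only.
feature_map = [-1.0, 5.0, 2.0, -3.0, 7.0, 3.0]

activated = relu(feature_map)
pooled = max_pool1d(activated, window=2, stride=2)
print(activated)
print(pooled)  # half the length of the input, since the stride is two
```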

The normalization operator is bioinspired from the competitive interaction observed in natural neural systems such as the gain control contrast mechanism in the cortical area (Geisler & Albrecht 1992, Rolls & Deco 2002).

Dropout is a regularization technique (Srivastava et al. 2014) used to prevent neural networks from overfitting. The dropout mechanism randomly selects a group of neurons to ignore during training, forcing the network to handle the flow of data through other paths. This intervention improves the capacity of the network to generalize.
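A minimal sketch of the dropout mechanism is shown below. It assumes the "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at test time; this is a common implementation choice and not necessarily the exact formulation of the cited work. The drop rate and input are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training."""
    a = np.asarray(activations, dtype=float)
    if not training or rate == 0.0:
        return a  # at test time all neurons are kept
    keep_probability = 1.0 - rate
    mask = rng.random(a.shape) < keep_probability
    # Rescale the kept activations so the expected value is unchanged.
    return a * mask / keep_probability

rng = np.random.default_rng(0)
layer_output = np.ones(1000)  # hypothetical activations, for illustration only

dropped = dropout(layer_output, rate=0.5, rng=rng)
print("fraction of neurons ignored:", np.mean(dropped == 0.0))
print("mean activation after dropout:", dropped.mean())
```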

Fully-connected layers are useful for controlling the final size of the feature vector and also for classification (Goodfellow et al. 2016).

The operations mentioned above form the layers and consequently the architecture of the CNs. From the architecture, a model is created through supervised learning. Often, stochastic gradient descent is employed to optimize the model. The newly trained model can then be used for feature representation.

2.5 Challenges of Using Deep Learning on Multimodal Biometric Systems

According to Ghosh (2015), there are no systematized studies in the literature involving deep learning and multimodalities. Ghosh claims that a number of issues need to be investigated, such as:

• Deep learning for feature representation or classification?

• How to fuse systems (or features extracted) with deep learning?

• How should data be synchronized?

• Are there large databases for the employment of deep learning?

Studies involving deep learning and biometrics are insufficient in the literature. We believe this is due to the problem of insufficient data, as already discussed in (Ghosh 2015). Also, the natural imbalance in biometrics problems, generated by the disparity between the number of intra-class and inter-class pairs, could be another obstacle. Besides, there are few genuine databases designed specifically for the multimodal problem, as surveyed in (Lumini & Nanni 2017).
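The imbalance between intra-class and inter-class pairs can be made concrete with a short calculation. The sketch assumes that every subject contributes the same number of samples and that all unordered pairs of samples are compared; the numbers of subjects and samples are hypothetical.

```python
from math import comb

def pair_counts(num_subjects, samples_per_subject):
    """Return (genuine_pairs, impostor_pairs) when all sample pairs are compared."""
    genuine = num_subjects * comb(samples_per_subject, 2)
    total = comb(num_subjects * samples_per_subject, 2)
    return genuine, total - genuine

# Hypothetical database: 100 subjects with 5 samples each.
genuine, impostor = pair_counts(num_subjects=100, samples_per_subject=5)
print("genuine pairs:", genuine)
print("impostor pairs:", impostor)
print("impostor pairs per genuine pair:", impostor / genuine)
```

The number of genuine pairs grows linearly with the number of subjects, whereas the number of impostor pairs grows quadratically, so the imbalance becomes more severe as databases get larger.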

The face modality is the most popular among the works involving biometrics and deep learning. Face recognition databases are more abundant in the literature and this may explain the popularity of this biometric source. For this modality, the best results reported in the literature are based on deep learning methods, specifically the convolutional networks (CN) (Parkhi et al. 2015, Schroff et al. 2015a, Sun et al. 2014, 2015, Taigman et al. 2014). In the following chapters of this work, we address the problem of multimodal biometric representation learning using deep learning and how to overcome the issues of small databases.

Biometric Modalities and Databases

For the development and evaluation of biometric systems, databases are of paramount importance. For this work, experiments on five modalities, from seven databases, were conducted: ECG, EEG, eye, iris, and face. In this chapter, all seven databases are explained in detail, and the most prominent techniques for each database are presented.

3.1 CYBHi and UofTDB (ECG)

The electrocardiogram (ECG) as an identification source for biometrics is a recent research topic and a trend in the literature (Da Silva et al. 2013, Luz et al. 2014, Merone et al. 2016, Wahabi et al. 2014). The ECG is a robust biometric modality from a security point of view, as it is difficult to fake or steal. However, it is also difficult to use in day-to-day biometric applications, since it often demands uncomfortable interventions for the signal acquisition process. Nonetheless, this reality can change due to new forms of capturing the ECG signal. According to da Silva et al. (2015), there are three types of acquisition categories being investigated nowadays: in-the-person, on-the-person, and off-the-person.

The majority of devices used for ECG measurements are in the on-the-person category. Devices in this category usually require some electrodes attached to the skin surface, which is greased with conductive gel. Nowadays, the standard devices used for medical heartbeat analysis come from this category, and a few databases acquired with those devices are used to evaluate methods that use the ECG as a biometric modality.

Figure 3.1: Example of commercial off-the-person ECG mobile equipment not yet used for biometrics. Source: https://www.alivecor.com/en/

Acquisition in the in-the-person category is even more invasive than in the previous one. In this category, there is equipment designed to be used inside the human body, such as surgically implanted devices, subdermal applications or even devices ingested in the form of pills. To the best of our knowledge, there is no database available in the literature built with such equipment.

Contrasting with the in-the-person category, there is the off-the-person category. Devices in this category are designed to measure ECG without skin contact or with minimal skin contact (e.g., through fingers).

According to da Silva et al. (2015), this category is aligned with future trends of medical applications in which pervasive computer systems are a reality. Furthermore, research on such equipment could bring heart biometrics to mass-produced equipment such as smartphones and tablets, thus leveraging the usage of ECG for biometrics on a daily basis. In this direction, one such device was released in 2014, making the acquisition of the ECG signal using mobile devices practical (see Fig. 3.1), but the application in focus is stroke detection, not biometrics. To the best of our knowledge, there are no databases acquired with this equipment. Therefore, our work considers publicly available databases acquired with prototype equipment that anticipates this new kind of device, as we discuss further.

A systematic review of databases for evaluating heart biometric systems is presented in (Merone et al. 2016).

According to that review, most of the ECG databases are in the on-the-person category and are not designed for biometrics. Two reasonably recent ECG databases were specifically designed for biometrics, i.e., the Check Your Biosignals Here initiative (CYBHi) (Da Silva et al. 2014) and the University of Toronto Database (UofTDB) (Wahabi et al. 2014). They belong to the off-the-person category and are publicly available, which makes them the most suitable for benchmarking heart biometric systems nowadays. Based on results reported to date on those databases, biometrics in the off-the-person category is an open problem, especially when one considers multiple acquisition sessions in the analysis. The ECG signal can morph due to cardiac frequency variations, and this effect is more apparent when multiple sessions are employed. Designing ECG representations or extracting good features to overcome such problems is a challenging task.

Off-the-person ECG databases are the new trend for biometrics with ECG signals. In this category, the acquisition of the signal is feasible (easy and comfortable), without the inconvenience of placing electrodes on the chest or the need for conductive gel. Although there are some issues in these new trends, such as excessive noise, off-the-person acquisition can boost the usage of ECG as a day-to-day biometric modality. Nowadays, it is possible to find commercial off-the-shelf ECG monitoring equipment integrated into mobile phones, as already mentioned and exemplified in Fig. 3.1¹.

The off-the-person databases considered in this work are the CYBHi (Da Silva et al. 2014) and UofTDB (Wahabi et al. 2014), and the prototypes used for their acquisition are shown in Fig. 3.2. We present a brief description of the databases in the following sections.

CYBHi

The Check Your Biosignals Here initiative Database is a public off-the-person database. Data acquisition was performed using two differential lead electrodes at the hand palms and fingers and a virtual ground (Silva et al. 2011). The acquisition was made at a 1 kHz sampling frequency and 12-bit resolution. The acquisition device used is the biosignalsPLUX research². The database is divided into two types of experimental protocols: short-term and long-term.

The short-term protocol is composed of ECG data collected from 65 people. Sessions were taken at intervals of two days, and demographics showed 65 healthy subjects, of which 49 are male and 16 are female, with an average age of 31.1 ± 9.46 years. In the proposed experimental protocol (Da Silva et al. 2014), participants were stimulated by specific audio and video content.

¹ http://www.alivetec.com/store/alivecor-heart-monitor-for-ios-and-android

²

Figure 3.2: Database acquisition hardware prototypes. (a) CYBHi prototype. Source: Da Silva et al. (2014). (b) UofTDB prototype. Source: Wahabi et al. (2014).

The long-term acquisition has data acquired from 63 healthy subjects in two distinct two-minute sessions, collected 3 months apart. Among the subjects, there are 14 males and 49 females, with an average age of 20.68 ± 2.83 years. The ECG signal of both sessions was collected on the fingers (one finger of the right hand and the other of the left hand) and in the sitting position.

The long-term protocol is more challenging from the biometric point of view and is therefore chosen for the experiments in the present work. We refer to the first session as T1 and to the second session, recorded three months later, as T2.

UofTDB

The University of Toronto Database is the largest off-the-person database publicly available in the literature, containing data collected from 1020 individuals. The data were acquired with dry Ag/AgCl electrodes placed on the thumbs of both hands, at a sampling rate of 200 Hz and 12-bit resolution. Note that this sampling rate is five times lower than the one used in CYBHi, so each heartbeat in this database is represented by fewer samples. The Vernier EKG Sensor and the Vernier Go!Link interface were used as the acquisition device. The length of the recordings varies between 2 and 5 minutes.

Table 3.1: UofTDB subject distribution by acquisition condition and session.

Session    sit  stand  exercise  supine  tripod
1         1012      0         0       0       0
2           72     72         0       0       0
3           76      5        71       0       0
4           63      0         0       0       0
5            0      0         0      63      63
6           65     65         0       0       0
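To make the effect of the two sampling rates concrete, the sketch below counts how many samples cover one heartbeat window at 1 kHz (CYBHi) and at 200 Hz (UofTDB). The 0.8 s window length and the helper name `samples_per_window` are assumptions chosen for illustration only; the actual segment length used in this work is not stated in this passage.

```python
def samples_per_window(sample_rate_hz: int, window_s: float) -> int:
    """Number of samples covering a window of `window_s` seconds."""
    return round(sample_rate_hz * window_s)


# Hypothetical heartbeat window length (assumption, not taken from the text).
WINDOW_S = 0.8

cybhi_len = samples_per_window(1000, WINDOW_S)   # CYBHi: 1 kHz
uoftdb_len = samples_per_window(200, WINDOW_S)   # UofTDB: 200 Hz

# prints: 800 160 5
print(cybhi_len, uoftdb_len, cybhi_len // uoftdb_len)
```

Whatever window length is chosen, the same heartbeat is described by five times fewer samples in UofTDB than in CYBHi.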

This database was proposed to allow investigation of three aspects that influence the quality of the ECG signal as a biometric:

1. Large number of individuals. The database contains data from 1020 subjects with ages ranging from 18 to 52 years.

2. Posture. Five postures/conditions were used during the acquisition: sitting, standing, supine, tripod, and exercise.

3. Multiple sessions. Few databases in the literature perform acquisition in multiple spaced sessions. To address this issue, the authors of the database collected the data in six sessions (S1, S2, S3, S4, S5, and S6) spanning a period of six months.

Although the database has data from 1020 subjects, not all subjects have data in all sessions. The first session is the largest, with data from 1012 subjects, and only 100 subjects were selected to participate in the other five sessions (see Table 3.1). Among the 100 selected, only 46 participated in all subsequent sessions. As suggested by the creators of the database (Wahabi et al. 2014), in order to perform an inter-session evaluation, we only considered individuals whose heartbeats were acquired in the seated condition and who participated in all five sessions (S1, S2, S3, S4, and S6), totaling 46 subjects.

Figure 3.3: Histogram of heartbeat distribution before (a) and after (b) the outlier removal algorithm for CYBHi DB.
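The subject selection rule for the inter-session evaluation (seated recordings in S1, S2, S3, S4, and S6) amounts to a set intersection. The following is a minimal sketch under stated assumptions: the function name `subjects_in_all_sessions`, the dictionary layout, and the toy subject IDs are hypothetical and do not reflect the actual UofTDB file format or identifiers.

```python
from typing import Dict, Sequence, Set

# Sessions that contain seated recordings (session 5 has none, see Table 3.1).
REQUIRED_SESSIONS = ("S1", "S2", "S3", "S4", "S6")


def subjects_in_all_sessions(
    seated: Dict[str, Set[int]],
    sessions: Sequence[str] = REQUIRED_SESSIONS,
) -> Set[int]:
    """Return the IDs of subjects with a seated recording in every listed session."""
    groups = [seated.get(session, set()) for session in sessions]
    return set.intersection(*groups) if groups else set()


# Toy data: hypothetical subject IDs, not taken from UofTDB.
seated_by_session = {
    "S1": {1, 2, 3, 4, 5},
    "S2": {1, 2, 3},
    "S3": {1, 2, 4},
    "S4": {1, 2},
    "S5": set(),
    "S6": {1, 2, 3},
}

selected = subjects_in_all_sessions(seated_by_session)
print(sorted(selected))  # prints: [1, 2]
```

Applied to the real database, a rule of this kind would yield the 46 subjects mentioned above.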

Main Published Works

The majority of works published in the literature are carried out with private on-the-person databases or public databases designed for medical investigation. To the best of our knowledge, the first work using the ECG as a biometric was proposed by Biel et al. (Biel et al. 2001), in which the authors proposed methods to identify 22 individuals using ECG fiducial points, principal component analysis (PCA), and a generative model classifier (GMC). After that seminal work, other methods have been proposed (Chiu et al. 2008, Fanga & Chanb 2013, Irvine et al. 2008, 2001, Luz et al. 2014, Odinaka et al. 2012).

Irvine et al. (2001) use 15 fiducial points chosen by a feature selection method (the Wilks' Lambda approach) and a linear discriminant analysis classifier to identify 104 subjects. Using an approach based on wavelet coefficients extracted from the raw ECG waveform, Chiu et al. (Chiu et al. 2008) achieved a 100% verification rate for 35 healthy individuals, and they found that adding heartbeats from arrhythmic individuals significantly degraded the verification rate.

In (Irvine et al. 2008), the authors focused their method exclusively on PCA and showed that PCA is an effective tool for removing noise from the ECG signal without loss of relevant information. However, before the PCA, all heartbeats are subsampled by a factor of four and then filtered to eliminate baseline wander.
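The preprocessing order described above (subsample by four, then remove baseline wander) can be sketched as follows. This is only an illustration of the two steps, not the method of Irvine et al. (2008): the text does not specify the filter, so a centered moving-average subtraction is used here as a placeholder, the subsampling applies no anti-aliasing filter, and the function names and the `half_width` parameter are assumptions. The PCA step itself is not shown.

```python
from typing import List


def subsample(beat: List[float], factor: int = 4) -> List[float]:
    """Keep every `factor`-th sample (naive decimation, no anti-aliasing)."""
    return beat[::factor]


def remove_baseline(signal: List[float], half_width: int = 2) -> List[float]:
    """Subtract a centered moving average as a crude baseline-wander estimate."""
    n = len(signal)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        local_mean = sum(signal[lo:hi]) / (hi - lo)
        out.append(signal[i] - local_mean)
    return out


def preprocess(beat: List[float]) -> List[float]:
    """Apply the two steps in the order given in the text."""
    return remove_baseline(subsample(beat, factor=4))


# Toy heartbeat: a constant offset, which the baseline step should cancel.
toy_beat = [3.0] * 32
print(preprocess(toy_beat))
```

A constant offset is the simplest form of baseline drift, so the toy input is mapped to a vector of zeros; slowly varying drift is attenuated in the same way, at the cost of also distorting low-frequency components of the heartbeat.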