FEDERAL UNIVERSITY OF TECHNOLOGY – PARANÁ
GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING

NELSON MARCELO ROMERO AQUINO

DEEP LEARNING APPROACHES FOR SOFT BIOMETRICS CLASSIFICATION IN VIDEOS

DISSERTATION

CURITIBA
2018

NELSON MARCELO ROMERO AQUINO

DEEP LEARNING APPROACHES FOR SOFT BIOMETRICS CLASSIFICATION IN VIDEOS

Dissertation presented to the Graduate Program in Electrical and Computer Engineering of the Federal University of Technology – Paraná as partial fulfillment of the requirements for the degree of "Master of Science (M.Sc.)" – Area of concentration: Computer Engineering.

Advisor: Prof. Dr. Heitor Silvério Lopes

CURITIBA
2018

Dados Internacionais de Catalogação na Publicação

A657d   Aquino, Nelson Marcelo Romero
2018    Deep learning approaches for soft biometrics classification in videos / Nelson Marcelo Romero Aquino. -- 2018. 88 f.: il.; 30 cm.

Disponível também via World Wide Web. Texto em inglês com resumo em português. Dissertação (Mestrado) – Universidade Tecnológica Federal do Paraná. Programa de Pós-graduação em Engenharia Elétrica e Informática Industrial. Área de Concentração: Engenharia de Computação, Curitiba, 2018. Bibliografia: f. 83-88.

1. Biometria. 2. Computação flexível. 3. Aprendizado do computador. 4. Processamento de imagens – Técnicas digitais. 5. Identificação biométrica. 6. Segmentação de imagens. 7. Visão por computador. 8. Métodos de simulação. 9. Engenharia elétrica – Dissertações. I. Lopes, Heitor Silvério, orient. II. Universidade Tecnológica Federal do Paraná. Programa de Pós-graduação em Engenharia Elétrica e Informática Industrial. III. Título.

CDD: Ed. 23 -- 621.3

Biblioteca Central do Câmpus Curitiba – UTFPR. Bibliotecária: Luiza Aquemi Matsumoto CRB-9/794.

Ministry of Education
Federal University of Technology – Paraná
Directorate of Research and Graduate Studies

DISSERTATION APPROVAL TERM No. 786

The master's dissertation entitled "Deep Learning Approaches for Soft Biometrics Classification in Videos", defended in a public session by the candidate Nelson Marcelo Romero Aquino on March 2, 2018, was examined for the degree of Master of Science, concentration area Computer Engineering, and approved in its final form by the Graduate Program in Electrical and Computer Engineering.

EXAMINATION BOARD:
Prof. Dr. Heitor Silvério Lopes – President (UTFPR)
Prof. Dr. André Eugênio Lazzaretti (UTFPR)
Prof. Dr. Hugo Alberto Perlin (IFPR)

The original copy of this document is on file at the Program Secretariat, bearing the Coordination's signature, added after the delivery of the corrected version of the work.

Curitiba, March 2, 2018.

ACKNOWLEDGEMENTS

During the two years that the master's degree lasted, with its several disciplines, many course assignments, innumerable literature reviews, endless experiments, and the occasional writing of articles, I had the support of people without whom this work would never have reached a good port. This text is dedicated to them, who were by my side unconditionally and never stopped having faith in me, even in those moments when even I did not. I am infinitely grateful to each one of you.

To my parents, Nelson and Eva, and my siblings Guido and Nathalia, who gave me all their support to overcome the adversities I faced during this time and who enjoyed with me every little achievement. They are everything a son or a brother could ask for. I am who I am because of you.

To the Aguilera family. It is often said that home is wherever the heart is, and thanks to you I was blessed to feel at home during this time. I will be eternally grateful for the unconditional support you have given me. Much of this work is yours.

To Professor Heitor Silvério Lopes, for trusting me and giving me the opportunity to work in an area I am passionate about. Thanks for the advice, the continuous support and all the teachings, which I will carry with me forever.

To the tchos from the LABIC laboratory: Manassés Ribeiro, Matheus Gutoski, Leandro Hattori, César Vargas Benítez, André Lazzaretti, Lia Ayumi Takiguchi, Bruna De Paula, Lucas Albini, Fernando Carvalho. The amount of coffee, pizza and sfiha that I consumed because of you during this time is not normal. Thanks for the good talks, the laughter and the continuous support.

To all those who, in one way or another, contributed to this work. Thank you.

RESUMO

ROMERO AQUINO, Nelson Marcelo. DEEP LEARNING APPROACHES FOR SOFT BIOMETRICS CLASSIFICATION IN VIDEOS. 88 f. Dissertation – Graduate Program in Electrical and Computer Engineering, Federal University of Technology – Paraná. Curitiba, 2018.

O número de câmeras de vigilância instaladas em locais públicos cresceu enormemente nos últimos anos devido à necessidade de aumentar a segurança pública, permitindo obter uma grande quantidade de imagens e vídeos em tempo real sem muito esforço. Diferentes tipos de problemas podem ser resolvidos através do processamento dos dados obtidos por estas câmeras, como a identificação de indivíduos. As biometrias fracas podem ser úteis para executar esta tarefa, uma vez que elas fornecem informações que podem ser usadas para diferenciar uma pessoa de outra sem exigir a cooperação direta delas. No entanto, isso exige uma tarefa exaustiva de análise a ser feita por observadores humanos. Dependendo da quantidade de câmeras, isso pode até se tornar uma tarefa impossível. Métodos de visão computacional podem ser uma alternativa válida para realizar classificação de biometrias fracas em imagens ou vídeos. Os métodos de Deep Learning (DL) têm alcançado desempenhos muito bons em tarefas de visão computacional, como reconhecimento e detecção de objetos, ou segmentação de imagens. Seguindo esta linha, este trabalho tem como objetivo estudar a adequação de métodos de DL para classificar biometrias fracas em imagens ou vídeos. Três contribuições são apresentadas sobre este tema nesta dissertação. Primeiro, realizou-se um estudo sobre o efeito do aumento de dados no desempenho de redes neurais convolucionais para classificação de biometrias fracas em imagens. A segunda contribuição está relacionada com a transferência de informação de um conjunto de imagens a outro. Este processo se baseia em treinar um modelo com dados de uma distribuição e testá-lo em dados de outra distribuição. Finalmente, foi avaliado o uso de modelos de DL para realizar a classificação em vídeos. Para este propósito, foi proposta uma nova abordagem baseada no uso de redes de memória bidirecionais de longo e curto prazo. Resultados para os experimentos de aumento de dados mostram que grandes aumentos não induzem ao sobre-ajuste e que balancear um conjunto de dados antes do treino requer menor aumento para que o desempenho do modelo melhore. Quanto à transferência de informação, os resultados mostram que pode haver uma correlação entre a complexidade e similaridade dos conjuntos de dados que são utilizados para treinar e testar um modelo. Assim, se esta técnica for aplicada, o conjunto de treinamento deve preferencialmente ser muito semelhante ao do teste e deve ser de maior complexidade, embora isso não seja definitivo, já que pode haver exceções dependendo da biometria fraca a classificar. Em termos de classificação de vídeo, em geral, nossas abordagens baseadas em uma rede neural recorrente e um modelo DL que representa dependências temporais através de um filtro passa-baixas produziram melhores resultados, em termos de acurácia geral e balanço de classificação, que uma abordagem baseada em classificar um vídeo usando apenas um de seus quadros.

Palavras-chave: Aprendizado de Máquina, Aprendizado Profundo, Biometrias Fracas.

ABSTRACT

ROMERO AQUINO, Nelson Marcelo. DEEP LEARNING APPROACHES FOR SOFT BIOMETRICS CLASSIFICATION IN VIDEOS. 88 p. Dissertation – Graduate Program in Electrical and Computer Engineering, Federal University of Technology – Paraná. Curitiba, 2018.

The number of surveillance cameras installed in public places has grown enormously in recent years due to the necessity to increase public security, making it possible to obtain a large amount of images and videos in real time without much effort. Different types of problems can be solved by processing the data obtained by security cameras, such as the identification of individuals. Soft biometric attributes can be useful for this task, since they provide information that can be used to differentiate one person from another without requiring their direct cooperation. However, this demands an exhaustive process of analysis carried out by one or more human observers; depending on the number of cameras, it could even become an impossible task for humans. Hence, computer vision methods could be a valid alternative to perform soft biometric classification in images or videos. Within this scope, Deep Learning (DL) methods have risen recently, achieving state-of-the-art performance on several computer vision tasks such as object recognition, object detection and image segmentation. This is possible due to their capability to learn both features and classifier at once in order to solve a particular problem. Following this line, this work aims at empirically studying the suitability of DL methods for classifying soft biometrics in images or videos. We present three contributions regarding this subject in this dissertation. First, we perform a study on the effect of data augmentation on the performance of convolutional neural networks for soft biometrics classification. The second contribution is related to transferring information from one soft biometric dataset to another to perform classification; this is achieved by training a model with data from one dataset in order to test it on data from another. Finally, we evaluate the use of DL models to represent or learn temporal dependencies, so as to perform soft biometrics classification in videos. For this task, we propose a novel approach based on the use of bidirectional long short-term memory networks. Results for the data augmentation experiments show that large augmentation sizes do not induce overfitting and that balancing a dataset before performing on-line data augmentation reduces the augmentation size needed to start improving the performance of the networks. As for transfer learning, results show that there may be a correlation between the complexity and the similarity of the datasets used for training and testing a model. Thus, if this technique is applied, the training set should preferably be very similar to the test data and should have a higher complexity, although this is not definitive, since there may be exceptions depending on the soft biometric attribute to classify. Regarding video classification, in general, our approaches based on a recurrent network and on a DL model that represents temporal dependencies through a low-pass filter yielded better results, in terms of overall accuracy and classification balance, than the baseline, which classifies a video using only one of its frames.

Keywords: Machine Learning, Deep Learning, Soft Biometrics.

LIST OF FIGURES

FIGURE 1 – The process of image classification.
FIGURE 2 – Sample image with objects detected by an algorithm. For this particular case, the objects are detected using the Single Shot MultiBox Detector (SSD) network.
FIGURE 3 – Feed forward neural network composed of one hidden layer.
FIGURE 4 – Architecture of a simple CNN composed of a Convolutional layer followed by a Pooling layer. Fully Connected layers compose the last part of the architecture.
FIGURE 5 – Simple RNN architecture.
FIGURE 6 – Internal structure of a LSTM Memory Cell.
FIGURE 7 – Bidirectional Recurrent Network composed of LSTM cells.
FIGURE 8 – Overview of the methodology followed to study the effect of data augmentation on the performance of two CNN architectures.
FIGURE 9 – Sample image with its transformed versions.
FIGURE 10 – Methodology proposed in this work. The performance of a model k, with its fully connected layer trained with a dataset k, is evaluated on other n datasets. The feature extractor part of the model is pre-trained with the ImageNet dataset.
FIGURE 11 – Discretization process of a video dataset using the Single Shot MultiBox Detector.
FIGURE 12 – An Inception module used in the Inception-v3 architecture.
FIGURE 13 – Overview of our approach for video classification using a retrained CNN with low-pass filter to represent temporal dependencies.
FIGURE 14 – Inference process for a video containing 10 frames using a window size equal to 5. The score yt for a window t is calculated through the mean of the scores of the frames within the window. The final classification y is obtained by averaging the scores of all windows.
FIGURE 15 – Classification process for the sequence k of n frames. The output yk is the score for the sequence. The final score is obtained by averaging the scores of all sequences of the video.
FIGURE 16 – Sample images from the LABICV1 dataset.
FIGURE 17 – Distribution of the instances of the LABICV1 dataset considering each attribute.
FIGURE 18 – Sample frames from the LABICV2 dataset. Each frame corresponds to one of the four walking angles of the individuals.
FIGURE 19 – Distribution of the instances of the LABICV2 dataset considering each attribute.
FIGURE 20 – Sample images from the VQH dataset. Sample a was drawn from the H3D dataset, sample b was drawn from HATdb and sample c was drawn from ViPer.
FIGURE 21 – Results obtained for the label Gender by the architectures Arch#1 and Arch#2 for each augmentation size. Figure 21a presents the results considering the metric Accuracy, whilst Figure 21b shows the results considering the metric Se×Sp.
FIGURE 22 – Results obtained for the label Lower Clothes by the architectures Arch#1 and Arch#2 for each augmentation size. They were obtained without balancing the training dataset before applying the augmentations.
FIGURE 23 – Results obtained for the label Lower Clothes by the architectures Arch#1 and Arch#2 for each augmentation strategy. They were obtained balancing the training dataset before applying the augmentations.
FIGURE 24 – Loss of the models during the training phase for the attribute Gender considering all augmentation sizes.
FIGURE 25 – Relationship between the train loss and the test Se×Sp for the attribute Gender considering all augmentation sizes.
FIGURE 26 – Structural Similarity Index (SSIM) between the datasets.
FIGURE 27 – Relation between the performances (mean AUC) obtained by the models and the complexities (SI) of the datasets.
FIGURE 28 – ROC curve for the attribute Gender. The left side of the label refers to the dataset used to train the model and the right side refers to the dataset used to test the model.
FIGURE 29 – ROC curve for the attribute Lower Clothes. The left side of the label refers to the dataset used to train the model and the right side refers to the dataset used to test the model.
FIGURE 30 – ROC curve for the attribute Hat. The left side of the label refers to the dataset used to train the model and the right side refers to the dataset used to test the model.
FIGURE 31 – Organization of Section 4.5.
FIGURE 32 – Boxplot of the accuracies obtained for the attribute Hat, considering different frame positions to perform the classification.

LIST OF TABLES

TABLE 1 – Layers of Architecture #1.
TABLE 2 – Layers of Architecture #2.
TABLE 3 – Complexity of each dataset considering different metrics.
TABLE 4 – Mean AUC obtained by Inception-v3 for each dataset without transfer learning.
TABLE 5 – Results obtained using transfer learning with different datasets for the attribute Gender.
TABLE 6 – Results obtained using transfer learning with different datasets for the attribute Lower Clothes.
TABLE 7 – Results obtained using transfer learning with different datasets for the attribute Hat.
TABLE 8 – Mean accuracies obtained by Inception-v3 for discrete versions of LABICv2.
TABLE 9 – Mean Se×Sp obtained by Inception-v3 for discrete versions of LABICv2.
TABLE 10 – Mean and standard deviations for the Accuracy and Se×Sp metrics, considering each attribute, achieved by our approaches and the baseline.
TABLE 11 – TP, FP, FN and TN obtained when each sample was part of the test set during the cross-validation.
TABLE 12 – Parameters that led to the best results for each attribute.

LIST OF ACRONYMS AND ABBREVIATIONS

DL – Deep Learning
CNN – Convolutional Neural Network
RNN – Recurrent Neural Network
BLSTM – Bidirectional Long Short-Term Memory Network
ANN – Artificial Neural Network
LSTM – Long Short-Term Memory Network
SVM – Support Vector Machine
YOLO – You Only Look Once
SSD – Single Shot MultiBox Detector
MSE – Mean Squared Error
CEL – Cross Entropy Loss

CONTENTS

1 INTRODUCTION
1.1 MOTIVATION
1.2 OBJECTIVES
1.2.1 General Objective
1.2.2 Specific Objectives
1.3 STRUCTURE OF THE DISSERTATION
2 THEORETICAL ASPECTS AND RELATED WORK
2.1 COMPUTER VISION
2.1.1 Image and Video Classification
2.1.2 Object Detection
2.2 DEEP LEARNING
2.2.1 Artificial Neural Networks
2.2.2 Convolutional Neural Network (CNN)
2.3 RECURRENT NEURAL NETWORK (RNN)
2.3.1 Long Short Term Memory (LSTM)
2.3.2 Bidirectional Long Short Term Memory (BLSTM)
2.4 TRANSFER LEARNING
2.5 RELATED WORK
3 METHODOLOGY
3.1 THE EFFECT OF DATA AUGMENTATION
3.1.1 The Method
3.2 TRANSFER LEARNING AMONG SOFT BIOMETRICS DATASETS
3.2.1 The Method
3.2.1.1 Inception-v3
3.3 VIDEO CLASSIFICATION USING DEEP LEARNING METHODS
3.3.1 Video Classification using CNN with Low Pass Filter
3.3.2 Video Classification using BLSTM
3.4 EVALUATION METRICS
3.4.1 Classification Metrics
3.4.2 Complexity Metrics
3.4.3 Similarity Metric
4 RESULTS AND DISCUSSION
4.1 HARDWARE SETTINGS
4.2 DATASETS
4.2.1 LABICV1
4.2.2 LABICv2
4.2.3 VQH
4.3 THE EFFECT OF DATA AUGMENTATION
4.3.1 Implementation
4.3.2 Experiment #1: Gender
4.3.3 Experiment #2: Lower Clothes
4.3.4 Analysis of the Learning Process
4.3.5 Discussion
4.4 TRANSFER LEARNING AMONG SOFT BIOMETRICS DATASETS
4.4.1 Implementation
4.4.2 Datasets Evaluation
4.4.3 Inception-v3 without transfer learning
4.4.4 Experiment #1 - Gender
4.4.5 Experiment #2 - Lower Clothes
4.4.6 Experiment #3 - Hat
4.4.7 Discussion
4.5 VIDEO CLASSIFICATION
4.5.1 Implementation
4.5.2 Baseline: Discrete Classification
4.5.3 Experiments
4.5.4 Discussion
5 CONCLUSION
5.1 FUTURE WORKS
REFERENCES

1 INTRODUCTION

Humans have an inherent ability to perceive their surrounding environment through their visual system. Visual perception involves several physiological components that receive visual stimuli in the form of light, within a certain spectrum, reflected by the surrounding objects. Those stimuli are later interpreted through a very sophisticated process. The main biological components that enable human vision are the eyes and the neurological connections capable of interpreting the signals received.

The ability to see is one of the most important human senses: most activities that humans perform on a daily basis rely on it. Likewise, most aspects of human life demand, or are designed to be experienced through, this sense, such as work, entertainment, arts and sports, among others.

Although visual perception may seem innate and static, human vision is always evolving and improving. This is accomplished by a learning process that starts at the beginning of our lives and is based on gaining knowledge by experiencing visual contact with the environment. This knowledge can be acquired in a supervised manner, when prior information regarding a scene is available to the individual, or in an unsupervised way, in which there is no previous information related to the scene and the individual must draw conclusions and interpret the underlying aspects of the scene.

Vision can be divided into two main steps. The first is the reception of the signal provided by the external world, which is accomplished by the eyes. In this step, low-level features are perceived, such as textures, shapes, colours and dots. The second phase comprises the interpretation of the semantics underlying the signal, a task accomplished by the neurological system, which receives the initial low-level information and processes it to generate the proper meaning of the stimulus or a high-level conclusion (PALMER, 1999).

The interpretation of images, videos or scenes depends on previous experiences that range from knowledge acquired in an unsupervised manner to definitions that were previously learned and that are modulated by external factors such as society, culture and prejudices, among others.

This is a common issue: individuals may produce different interpretations of the same scene or image, because the learning process each experienced during their lives leads them to different high-level conclusions even though the initial low-level information is the same.

Since vision is very important for obtaining information in order to make decisions, the research community has been attempting to develop systems able to perform this task automatically. Computer vision is a research area that focuses on developing methods for giving computers the ability to replicate visual perception. Analogous to human vision, the computer vision process, in its most basic structure, receives as input an image or a sequence of images and interprets their meaning through some heuristic. In other words, the main idea is to create computational methods for extracting high-level information from low-level features, similarly to what humans are able to do. This is a very hard and complex task, considering that even the interpretations of certain scenes may differ from one human to another.

Advances in computer vision may improve people's lives in different aspects. For instance, self-driving cars may incorporate computer vision modules to improve their awareness of the external world; malign changes such as tumours or arteriosclerosis could be automatically detected in medical images; controlling processes through industrial robots or organizing information by classifying images or sequences of images belonging to a database could also be achieved.

Computer vision may also help improve the security level in public places that have cameras installed in their surroundings. For instance, methods to perform person re-identification using the footage obtained by the cameras could be applied. It may also be possible to extract frames containing only individuals with certain physical characteristics, known as soft biometrics, such as gender, clothes length and hair length, among others.

Following this line, this work explores methods focused on processing low-level features extracted from images or videos in order to obtain a final high-level interpretation of the input. More concretely, this work aims at exploring approaches to perform soft biometrics classification in images and videos.

1.1 MOTIVATION

This work addresses the problem of automatically classifying soft biometric traits in a set of images or in a sequence of temporally consecutive images that compose a video.

Soft biometrics are physiological or behavioural characteristics of humans that provide information useful to differentiate one individual from another. Although soft biometrics are usually not unique for each individual, they can provide some prior information about the subjects. Furthermore, using just one soft biometric may not be enough to identify a particular individual; however, a combination of them can achieve satisfactory results for that task. Soft biometrics can also be used to complement primary biometric identifiers such as fingerprints and faces. Some common soft biometrics are: gender, age, height, weight, clothes colour, tattoos, and hair length.

Over the years, the necessity to increase public security has driven the growth of the number of surveillance cameras installed in public places. They allow images and videos to be obtained in real time without much effort. Hence, a lot of information can be extracted from these resources to solve different types of problems. The identification of individuals in the images obtained by surveillance cameras is one of those problems. For this task, soft biometrics are useful, since they provide information that can be used to differentiate the subjects present in the frames of the videos. Nonetheless, this is an exhaustive process to be carried out by a human observer. Therefore, computer vision plays an important role in tackling this problem.

The classification of soft biometrics is a relevant research topic, since it may reduce the need to manually analyse footage from surveillance cameras. As stated before, obtaining soft biometrics does not require the direct cooperation of the individuals, and footage containing people is very easy to gather due to the presence of cameras in almost all public places nowadays. These factors represent a major motivation for exploring the automatic classification of soft biometrics. For instance, a computer may be able to interpret in real time that an assault is in course and alert the authorities only by analysing the images obtained by a surveillance camera. The extraction of soft biometric attributes from the footage could be a relevant part of the process, helping to identify the assailant through appearance aspects such as the colour of the shirt, whether the individual has long hair, or whether they carried a bag during the assault. Furthermore, this task is not limited to real-time applications, since the footage can also be analysed after the assault in order to identify suspects.

Traditional image or video classification is based on a series of steps from pattern recognition that include data pre-processing, feature extraction, in which the low-level information contained in the data is obtained, and, finally, training a classifier using the extracted features as input (DUDA; HART; STORK, 1973). The classifier is then tested with unseen data in order to measure its classification performance. The main issue of this approach is that there are many feature extractors proposed in the literature, and finding the appropriate extractor

requires deep knowledge about the problem to be solved, the feature extractor and the classifier.

Beyond the traditional pattern recognition process, Deep Learning (DL) methods have arisen recently, achieving state-of-the-art performance on several computer vision tasks such as image recognition, object detection and image segmentation (RUSSAKOVSKY et al., 2015; LIU et al., 2015; CHEN et al., 2016). For classification, these methods differ from the traditional ones in their learning process, since they are end-to-end classifiers. This implies that they learn both the features and the classifier during the learning phase of the model.

This work focuses on empirically studying the suitability of DL methods for classifying soft biometrics, such as gender, long hair, and upper and lower clothes lengths, in images or videos. We address three main topics related to soft biometrics classification using DL methods.

The first refers to the use of data augmentation, a technique based on creating new training samples from the available data, with the purpose of improving the generalization capability of the trained model. Since this technique is widely used to train DL models such as Convolutional Neural Networks (CNNs), which require large amounts of samples, an analysis of its effect is worth carrying out, in order to study which augmentation sizes improve the performance of the models. We also aim at evaluating the influence of the technique when using balanced and unbalanced classes.

The second topic regards transferring information from one soft biometric dataset to another. In this work, we consider transfer learning as the process of training a classifier with data from a certain dataset and testing it with samples from a different dataset, which we regard as a transductive transfer learning process (WEISS; KHOSHGOFTAAR; WANG, 2016). Although this process could also be considered as testing the generalization capability of the classifier, we aim at studying the influence of factors such as complexity and similarity in the training phase, so as to aid in choosing the proper dataset (considering those factors) for training a classifier that will later be applied to a dataset that may not have labelled information.

Finally, we evaluate the use of DL models to represent or learn temporal dependencies, so as to perform soft biometrics classification in videos. We aim at studying whether the classification performance of a DL model improves when it receives as input a sequence of frames from a video instead of only a single frame. We also aim at comparing the results obtained when only representing the time variable through a filter with those achieved by a model specifically devised to learn temporal dependencies. For this purpose, we use a CNN combined with a low-pass filter and a type of Recurrent Neural Network (RNN) known as the Bidirectional Long

Short-Term Memory Network (BLSTM).

1.2 OBJECTIVES

1.2.1 GENERAL OBJECTIVE

The objective of this work is to automatically describe soft biometric traits in images and videos through deep learning methodologies.

1.2.2 SPECIFIC OBJECTIVES

The following points are derived from the main objective, and each corresponds to a contribution of this work:

• To study the performance of Deep Learning approaches when representing or learning temporal dependencies to perform soft biometrics classification in videos.
• To study the effect of data augmentation on the performance of Convolutional Neural Networks for soft biometrics classification with balanced and unbalanced classes.
• To analyse the effect of the complexities and similarities of datasets in a transfer learning process.
• To introduce a labelled image dataset for soft biometrics classification.
• To introduce a labelled video dataset for soft biometrics classification.

1.3 STRUCTURE OF THE DISSERTATION

This work is structured as follows. Theoretical aspects related to the methods approached in this work are presented in Chapter 2, where the related literature is also introduced. The methods proposed in this work, as well as the metrics used to evaluate the approaches, are described in Chapter 3. Chapter 4 presents the experiments along with their results and the respective discussions. Finally, general conclusions regarding the work and proposals for future work are presented in Chapter 5.

2 THEORETICAL ASPECTS AND RELATED WORK

This chapter introduces basic concepts used in this work. First, we describe computer vision concepts such as image and video classification, and object detection, which is used to perform pedestrian detection in this work. Next, concepts regarding Deep Learning (DL) are introduced, such as Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We also describe two types of RNNs, the Long Short-Term Memory Network (LSTM) and the Bidirectional Long Short-Term Memory Network (BLSTM), the latter of which is used in this work. Next, a brief introduction to transfer learning is presented. Finally, the relevant literature related to this work is described.

2.1 COMPUTER VISION

Human vision gives individuals the ability to process visual information, allowing them to build representations of their surrounding environment in order to interact with it. The first step of the vision process is to receive stimuli from the external world. Then, it is possible to categorize the objects within the vision range, as well as to estimate the distances between them. Human vision does not only perceive signals; its main component is the interpretation of those signals in order to obtain descriptions that can be used to make decisions. Although human vision is considered an innate process, since humans perform it automatically and instinctively, researchers from the neuroscience field are still focused on uncovering its underlying mechanisms.

Image analysis is a very complex task; it is based on interpreting not only the current stimulus but also previous knowledge acquired through different means. For instance, a child may never have seen an airplane in person, but at some point in his life a drawing of a plane was shown to him, which could allow him to identify that object even though it is the first time he sees it.

The interpretation of an image or a scene is defined as high-level information (PERLIN et al., 2015). Basically, it is the final description of one or more objects within a

scene. In order to obtain this description, the brain processes low-level features of an image, such as edges, shapes and textures, which do not provide much information when treated separately. High-level information is obtained when the low-level artefacts are used in conjunction to determine the context of the scene. This interpretation process is very difficult to replicate, and it even differs from one person to another, since their point of view or the way they interpret the low-level features may differ.

Computer vision is a research field that pursues the replication of the vision process by means of computational tools. It aims at developing algorithms and methods capable of achieving visual understanding of the content of digital images or sequences of images. The objective is to automatically describe the world contained in one or more images and to reconstruct its properties, such as shape, illumination and colour, among others (SZELISKI, 2010). Methods for acquiring, processing and analysing images are used within the scope of computer vision applications. There are several applications, such as motion analysis, image restoration, scene reconstruction and image recognition, the last being the most common; it is based on determining the objects, features or activities present within an image or video.

2.1.1 IMAGE AND VIDEO CLASSIFICATION

A digital image is a 2-dimensional matrix in which each position corresponds to a pixel. A pixel is the smallest individual element of an image. The intensities of the pixels are variable, and the different levels of pixel intensity, as well as their spatial arrangement, produce the final representation of an image. An image can be represented in grey levels or in colours. Images in grayscale are composed of pixels represented by 1 byte (8 bits), which leads to 256 possible grey intensities. Colour images are represented by pixels composed of 3 components, which are interpreted as coordinates in some colour space. The most common colour space is RGB, in which a pixel encodes three colours: red, green and blue. The RGB colour space is commonly used in computer displays. There are other colour spaces, such as YCbCr or Hue Saturation Value (HSV); however, they are used in different contexts.

Image classification aims at separating images according to their content through different techniques. Traditionally, the major steps of image classification include image pre-processing, feature extraction, feature selection, application of the classification approach, and performance assessment (LU; WENG, 2007). The traditional image classification process is presented in Figure 1.

Figure 1: The process of image classification.

Image pre-processing includes restoration, detection and geometric rectification techniques. The images can also be normalized and converted to different colour spaces at this stage. If the images come from different sources, their formats and qualities must also be evaluated before being incorporated into the classification procedure.

Feature extraction consists in describing an image through numerical vectors known as feature vectors or descriptors. There are three categories of image descriptors: texture, colour and shape (WANG; LIU; HUANG, 2000). Feature vectors are commonly used as input to a classifier. However, they are often very high-dimensional, which may degrade the performance of the classifier, since most of the information provided by the descriptor may not be relevant for the classification task. Therefore, a feature selection process may be applied, which consists in selecting only the most relevant variables, returning a subset of the original features.

Features must be extracted, and afterwards selected, by hand-crafted methods chosen according to the problem to be solved. This implies that human expertise is necessary in order to choose the proper tools. This setback is tackled by methods that combine the feature extractor with the classifier, learning not only the discriminant surfaces that separate the classes but also the features that optimize those surfaces. Later sections of this chapter describe some of those methods in detail.

The classification process is based on a learning procedure. There are three types of learning: supervised, unsupervised and semi-supervised. All learning types seek to perform data separation according to some metric (PERLIN et al., 2015).

In supervised learning there is prior information related to the data. Let the matrix $X = (x_1, x_2, \ldots, x_n)$ represent a set of $n$ samples in $\mathbb{R}^d$. The prior information is represented by a ground-truth label $y_i \in Y$ for each point $x_i$, where $Y$ is a set of labels; a label is basically the actual class of the data. The objective of supervised learning algorithms is to obtain a function $f(x_i) = \hat{y}_i$, such that $\hat{y}_i$ is an estimate of the true label $y_i$. There are many algorithms to perform this task, such as Decision Trees, Support Vector Machines (SVMs), Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs), among others.
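As a minimal illustration of the supervised setting just described, the sketch below fits a classifier $f$ on labelled pairs $(x_i, y_i)$ and produces estimates $\hat{y}_i$ for unseen points. It assumes scikit-learn is available; the toy data and the choice of an SVM are illustrative and not part of the experiments in this dissertation.

```python
# Minimal sketch of supervised learning: learn f(x_i) = y_hat_i from labelled pairs.
# Assumes scikit-learn; the toy data below is purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # 200 samples in R^2
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # ground-truth labels y_i

# Hold out part of the data to evaluate the trained model on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf")      # the classifier f
clf.fit(X_train, y_train)    # adjust f on the training set
y_hat = clf.predict(X_test)  # estimates y_hat_i for unseen points x_i
print("test accuracy:", (y_hat == y_test).mean())
```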

In unsupervised learning there is no initial information regarding $y_i$. The objective is to find some type of hidden structure in the unlabelled data. Some unsupervised learning approaches include clustering algorithms such as K-Means and X-Means, neural networks such as Generative Adversarial Networks and Autoencoders, and statistical methods such as Principal Component Analysis or Linear Discriminant Analysis.

Semi-supervised learning combines both supervised and unsupervised learning. It is based on using a small amount of labelled data combined with a large amount of unlabelled data to train a model. Unsupervised and semi-supervised learning are beyond the scope of this work; thus, the rest of this chapter discusses only aspects related to the supervised learning process.

In order to assess the performance of a classifier, a dataset $X$ is usually divided into train and test sets: the first is used to adjust the algorithm and the second to evaluate the classification capability of the trained model. If the dataset contains a very large number of instances, $X$ can be divided into three subsets: train, validation and test. Train and test sets are used for the purposes mentioned before, whilst the validation set is used during training to perform hyperparameter adjustment and to detect signs of overfitting, which occurs when the model becomes over-adjusted to a particular set of data, losing its capability to generalize to new data.

As for videos, which are sequences of consecutive images, or frames, obtained at specific moments, the classification process is analogous to image classification. However, there are some differences derived from the temporal variable: automatically separating videos according to their content is directly related to learning temporal dependencies. The most basic video classification method consists in treating the problem as image classification, discretely classifying each frame of the video and applying a voting scheme to obtain the final classification (KARPATHY et al., 2014); a minimal sketch of this baseline is given below. However, this approach does not preserve the sequential information of videos. Other approaches seek to obtain spatio-temporal features to improve the classification performance of the models (XU; LI, 2003).
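The frame-level voting baseline mentioned above can be sketched in a few lines. Here `classify_frame` is a hypothetical per-frame classifier standing in for any image model, and majority voting is one common instance of such a scheme; neither name comes from the dissertation itself.

```python
from collections import Counter

def classify_video(frames, classify_frame):
    """Baseline video classification: classify each frame independently,
    then apply a majority-vote scheme over the per-frame labels.
    `classify_frame` is a hypothetical function mapping a frame to a label."""
    votes = [classify_frame(frame) for frame in frames]
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Illustrative usage with a dummy per-frame classifier.
frames = ["f0", "f1", "f2", "f3", "f4"]
dummy = lambda f: "male" if f != "f2" else "female"
print(classify_video(frames, dummy))  # -> "male" (4 votes against 1)
```

Note that, as stated above, any temporal ordering among the frames is ignored by this scheme.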

2.1.2 OBJECT DETECTION

Object detection encompasses algorithms and methods focused on detecting objects that correspond to certain classes, such as pedestrians, cars or animals, in images or videos. The goal of object detection is to obtain regions of interest containing the target objects. The region of interest may be a bounding box containing the object, including pixels from the background, or only the pixels that correspond to the object, with the background discarded. The second option is known as object segmentation, and its regions of interest are called segments.

For videos, object detection enables object tracking, by applying a detection method at each frame of the video and returning a bounding box containing the proper object. Object tracking is useful to measure the motion, positioning and occlusion of the object of interest (PAREKH; THAKORE; JALIYA, 2014). Figure 2 presents an image with the objects detected by a detection algorithm.

Figure 2: Sample image with detected objects by an algorithm. For this particular case, the objects are detected using the Single Shot MultiBox Detector (SSD) network.

Traditionally, object detection starts by extracting features from the input image, such as Haar (PAPAGEORGIOU; OREN; POGGIO, 1998), Histogram of Oriented Gradients (HOG) (DALAL; TRIGGS, 2005) or Scale Invariant Feature Transform (SIFT) (LOWE, 1999). Some approaches rely on features obtained by training Deep Neural Networks (SZEGEDY; TOSHEV; ERHAN, 2013). The features are then fed to a classifier, such as a Support Vector Machine (SVM), an Artificial Neural Network (ANN) or AdaBoost, which performs the proper identification of the objects within the feature space. The identification process is accomplished through a window that slides over the whole image, running the classifier at each position (VIOLA; JONES, 2004); a schematic version of this procedure is sketched at the end of this subsection. A well-known family of sliding-window-based detectors is the Deformable Parts Models (DPM) (FELZENSZWALB et al., 2010). Other approaches run the classifier on subregions of the image instead of using a sliding window (GIRSHICK et al., 2014).

Recent methodologies take a different approach, in which feature extraction, classification and identification are carried out by a single network. You Only Look Once (YOLO) (REDMON et al., 2016) and Single Shot MultiBox Detector (SSD) (LIU et al., 2015) follow this line of single-network methods. Both are based on optimizing CNNs, in an end-to-end manner, to predict bounding boxes and class probabilities for an entire image in a single evaluation. They are much faster and more accurate than DPM-based approaches (REDMON et al., 2016).
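A schematic version of the sliding-window identification procedure described above is given below, assuming NumPy. The window size, stride, threshold and the `score_window` classifier are illustrative placeholders; practical detectors additionally search over multiple scales and apply non-maximum suppression to the resulting boxes.

```python
import numpy as np

def sliding_window_detect(image, score_window, win=64, stride=16, threshold=0.5):
    """Run a window classifier at every position of a single-scale grid and
    return the bounding boxes (x, y, w, h) whose score exceeds the threshold.
    `score_window` is a hypothetical function returning an object probability."""
    height, width = image.shape[:2]
    boxes = []
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            if score_window(patch) > threshold:
                boxes.append((x, y, win, win))
    return boxes

# Illustrative usage with a dummy scorer: bright patches count as "objects".
img = np.zeros((128, 128))
img[32:96, 32:96] = 1.0
print(sliding_window_detect(img, score_window=lambda p: p.mean()))
```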

2.2 DEEP LEARNING

As presented in the previous sections, traditional machine learning approaches for image or video classification consist of a series of steps in which feature extraction and classification are performed separately. These methods require knowledge about the candidate feature extractors in order to select the proper ones and achieve satisfactory classification performance.

Deep Learning (DL) is a sub-area of machine learning comprising algorithms capable of performing representation learning with multiple levels of abstraction, obtained by combining non-linear modules that transform a low-level representation into a higher and more abstract one (LECUN; BENGIO; HINTON, 2015). Since DL methods use raw data as input and automatically discover the representations needed to solve problems such as detection or classification, they avoid the need for domain expertise in designing feature extractors, as required by the traditional pattern recognition approaches presented in Section 2.1.1 for the particular case of image classification. These methods have achieved state-of-the-art results in several domains, such as speech recognition, image recognition and object detection, among others.

A typical DL architecture is based on Artificial Neural Networks (ANNs) composed of multiple layers that stack simple modules, which compute non-linear input-output mappings and are subject to learning. With this structure, a system can achieve a high level of generalization capability by implementing very complex functions of its inputs that are sensitive to the relevant details and insensitive to the irrelevant ones (LECUN; BENGIO; HINTON, 2015).

2.2.1 ARTIFICIAL NEURAL NETWORKS

Artificial neural networks are computational models inspired by the functioning of the biological networks that make up animal brains, which are composed of sets of processing units (neurons) that transmit signals through connections called synapses (BISHOP, 2007). Analogous to biological neural networks, artificial neural networks are composed of layers that contain processing units, or artificial neurons. Each layer is connected to the next one, and the layers propagate their input in order to generate a final output. The connections between the neurons (synapses) have different values that define the relevance of each particular connection. Those relevance values are known as the weights of the network.

A feed-forward neural network contains an input layer composed of neurons, which receive the data introduced to the network and forward it to the next layer. The input layer is connected to the hidden layer, where the actual processing is performed: the dot product between the output of the previous layer and the weights connected to its neurons is computed, a bias term is added to the result, and, finally, a non-linear activation function is applied. The output of a hidden layer can be propagated to another hidden layer or to the final layer of the network, which provides the result of the network. This structure is also known as a Dense or Fully Connected layer, in which all neurons of two consecutive layers are fully connected. Figure 3 presents a simple feed-forward neural network architecture composed of an input layer with three neurons, a hidden layer containing five units, and an output layer with two units. The layers are fully connected, with a weight for each connection between neurons.

Figure 3: Feed forward neural network composed of one hidden layer.

For a given artificial neuron $k$, its output $y_k$ is defined by Equation 1, where $m$ is the number of inputs of the neuron, $w_{kj}$ is the $j$-th weight connecting to neuron $k$ and $x_j$ is the $j$-th input of the unit:

$$y_k = \phi\left(\sum_{j=0}^{m} w_{kj}\, x_j\right). \qquad (1)$$

The activation function is represented by $\phi$. Some common activation functions are the sigmoid and the hyperbolic tangent, presented in Equations 2 and 3, respectively, where $z$ is the input value for the function. Another frequently used function is the Rectified Linear Unit (ReLU) (NAIR; HINTON, 2010), shown in Equation 4, which converts negative inputs into zeros and otherwise returns its original input:

$$\mathrm{sig}(z) = \frac{1}{1 + e^{-z}}, \qquad (2)$$

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad (3)$$

$$\mathrm{ReLU}(z) = \max(0, z). \qquad (4)$$
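Equations 1 to 4 translate directly into a few lines of array code. The sketch below is a minimal illustration assuming NumPy, with the bias folded into the weight $w_{k0}$ via a constant input $x_0 = 1$; the sample weights and inputs are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # Equation 2

def tanh(z):
    return np.tanh(z)                # Equation 3

def relu(z):
    return np.maximum(0.0, z)        # Equation 4

def neuron_output(x, w, phi):
    """Equation 1: y_k = phi(sum_j w_kj * x_j), with the bias handled
    as w[0] by prepending the constant input x_0 = 1."""
    x = np.concatenate(([1.0], x))
    return phi(np.dot(w, x))

w = np.array([0.1, 0.5, -0.3])  # bias plus two input weights (illustrative)
x = np.array([2.0, 1.0])
for phi in (sigmoid, tanh, relu):
    print(phi.__name__, neuron_output(x, w, phi))
```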

The weights of a neural network are usually initialized with random values. Afterwards, they must be optimized so as to generate the proper decision boundaries to separate the instances of a given problem. This process is carried out iteratively by feeding a set of samples to the network and measuring its classification error with a cost function. Two common cost functions are the Mean Squared Error (MSE) and the Cross Entropy Loss (CEL), presented in Equations 5 and 6, respectively, where \hat{y} is the ground truth label and y is the output of the network for a pattern i, considering n patterns fed to the network.

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,  (5)

\mathrm{CEL} = -\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \cdot \log(y_i),  (6)

The error of the network is propagated backwards to adjust the values of the weights in order to minimize the cost function. A widely used algorithm for this task is Gradient Descent (GD), which is based on calculating the partial derivative of the cost function with respect to the weights of the network. GD minimizes the cost by taking steps in the opposite direction of the gradient. The weight vector w at the current epoch i of the training process is given by Equation 7, where w_{i-1} is the weight vector at the previous epoch, C is the cost function (such as CEL or MSE), and \eta is the learning rate, which represents the size of the step taken in the opposite direction of the gradient.

w_i = w_{i-1} - \eta \frac{\partial C}{\partial w},  (7)

GD updates the weights of the network using every training sample fed to the model. This implies that the derivatives must be calculated for each sample and averaged before the weights are updated, which can take a long time depending on the size of the dataset. Another approach addresses this issue by updating the weights after seeing only a randomly chosen subset of samples drawn from the dataset. The entire dataset is still used at each training epoch, and each weight update is called an iteration. This method is known as Stochastic Gradient Descent (SGD).
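As a rough illustration of Equations 5 and 7, the sketch below performs gradient descent for a single linear neuron under the MSE cost, with the gradient derived analytically. The data, the learning rate, and the absence of an activation function are simplifying assumptions made here for brevity.

```python
import numpy as np

def mse(y_true, y_pred):
    # Equation 5
    return np.mean((y_true - y_pred) ** 2)

def gd_step(w, X, y_true, lr=0.1):
    # One update of Equation 7 for a linear neuron y = X w:
    # dC/dw = -(2/n) * X^T (y_true - X w) under the MSE cost
    n = len(y_true)
    grad = -(2.0 / n) * X.T @ (y_true - X @ w)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 toy samples with 3 features each
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # synthetic targets

w = np.zeros(3)
for epoch in range(100):             # one full-batch update per epoch
    w = gd_step(w, X, y)
print(mse(y, X @ w))                 # cost shrinks towards zero
```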

Some optimization methods, such as AdaGrad (DUCHI; HAZAN; SINGER, 2011) and Adam (KINGMA; BA, 2014), are modified versions of SGD that dynamically adjust the learning rate \eta during training in order to avoid local minima or to converge more rapidly.

2.2.2 CONVOLUTIONAL NEURAL NETWORK (CNN)

Convolutional Neural Networks (CNNs) (LECUN et al., 1998) are feed-forward neural networks designed to deal with n-dimensional input data, such as images or videos. CNNs learn hierarchies of features, obtaining increasingly abstract representations of the original data. This allows the model to be fed with raw data, without the need to previously extract features. CNNs are end-to-end classifiers: the features and the classifier are jointly learnt, which yields a feature extractor built specifically for the data used to train the classifier.

A CNN contains Convolutional, Pooling, and Fully Connected layers. The first two appear in the initial part of the network and act together as the feature extractor of the model, whilst the Fully Connected layers perform the actual classification. Convolutional layers are composed of a set of learnable filters, also known as kernels, that slide across the input during the forward pass of the network, computing dot products between their entries and the input at each position. The input of this layer can be n-dimensional, such as grayscale or colour images. The output is composed of a set of concatenated activation maps, one produced by each filter.

Three parameters define the size of the output of a convolutional layer: depth, stride, and zero-padding. The depth, or number of output channels, defines the number of filters to be used, each learning different characteristics from the input. The stride is the size of the step with which the kernels slide over the input; a stride equal to one means that the kernel moves one pixel at a time, whereas a stride equal to three makes the kernel move three pixels at a time, producing a smaller output. Finally, the zero-padding parameter defines whether the border of the input is padded with zeros and the size of that padding, which allows the size of the output of the layer to be controlled.
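The interplay between input size, filter size, stride, and zero-padding determines the spatial size of a convolutional layer's output. A minimal sketch of this computation follows, using the usual formula (W - F + 2P)/S + 1; the example values are hypothetical.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Spatial output size of a convolution:
    # (W - F + 2P) / S + 1, which must be an integer
    size = (input_size - filter_size + 2 * padding) / stride + 1
    assert size.is_integer(), "hyperparameters do not fit the input"
    return int(size)

# Hypothetical layer: 32x32 input, 5x5 kernels
print(conv_output_size(32, 5, stride=1, padding=2))  # 32 (size preserved)
print(conv_output_size(32, 5, stride=3, padding=0))  # 10 (smaller output)
```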

Since convolutions are linear transformations, a non-linearity is applied to their output. The most used function is ReLU (Equation 4), which presents some advantages compared to other non-linearities: it is computationally very efficient, making the training phase of the network faster, and it helps to mitigate the vanishing gradient problem (HOCHREITER et al., 2001).

Pooling layers are inserted between convolutional layers to progressively downsample the data representation, reducing the number of parameters of the model. This is done to reduce the computation required to train the network and to control overfitting, which occurs when the model is over-trained and loses its generalization capability (HUNG; HUNG, 2014). Pooling operations are performed by a (non-learnable) filter that slides across the input and outputs the maximum value of each subregion onto which it is applied; this is known as Max Pooling. The hyperparameters of this operation are the size of the filter and the stride, which are usually set to the same value. Other functions can also be applied by a pooling filter, such as taking the average of a subregion instead of the maximum value, an operation called Average Pooling.

The Fully Connected layers of a CNN behave as described in Section 2.2.1, with the first dense layer receiving as input the output of the last layer of the feature extractor part of the network. Often this output is a set of feature maps, which are flattened before being fed to the first Dense layer. The final layer of a CNN is also a Dense layer. For classification tasks, the non-linearity of the last layer is usually the Softmax function, which squashes a K-dimensional vector z into a vector \sigma(z) of real values in the range [0, 1] that add up to 1. The label of the class with the highest probability is taken as the final classification. Equation 8 presents \sigma(z_j) for the j-th position of z.

\sigma(z_j) = \frac{e^{z_j}}{\sum_{i=1}^{K} e^{z_i}},  (8)
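A direct implementation of Equation 8 can overflow for large inputs, so implementations usually subtract the maximum of z before exponentiating, which leaves the result unchanged. The following NumPy sketch illustrates this; the example score vector is chosen arbitrarily.

```python
import numpy as np

def softmax(z):
    # Equation 8, shifted by max(z) for numerical stability;
    # the shift cancels out in the ratio and does not alter the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])    # arbitrary class scores
probs = softmax(scores)
print(probs, probs.sum())             # probabilities summing to 1
print(np.argmax(probs))               # index of the predicted class
```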

Figure 4 presents a simple CNN architecture with a convolutional layer and a pooling layer followed by fully connected layers. The output layer of the model contains only one processing unit.

Figure 4: Architecture of a simple CNN composed of a Convolutional layer followed by a Pooling layer. Fully Connected layers compose the last part of the architecture.

CNNs are very expensive models to train, and they are able to learn very complex relationships between their inputs and outputs (SRIVASTAVA et al., 2014). However, overfitting is a frequent issue, especially when the network is deep and very few training samples are available. There are several methods to avoid this phenomenon, the most common being regularization (NG, 2004) and dropout (SRIVASTAVA et al., 2014). The former is based on adding an extra term to the cost function of the network in order to “force” it to preferably learn small weights. L2 regularization (CORTES; MOHRI; ROSTAMIZADEH, 2009), for instance, adds the squared magnitude of all parameters of the network to the cost function. Thus, for a weight vector w of the network, the term \frac{1}{2}\lambda w^2 is added to the loss function, where \lambda represents the regularization strength. It is worth mentioning that \lambda is a user-defined parameter.

Dropout (SRIVASTAVA et al., 2014), on the other hand, consists of deactivating a percentage of randomly chosen units of the network (along with their weights) during each epoch of the training stage. This technique prevents the units of the network from co-adapting too much. Dropout is not applied during the testing phase. It has been shown to improve the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification, and computational biology.

Local Response Normalization (LRN) (KRIZHEVSKY; SUTSKEVER; HINTON, 2012) is another strategy to aid generalization. The method implements a form of lateral inhibition by normalizing over local input regions. It is usually employed after a convolution operation and before the pooling layer stacked after it, when one is used. LRN is particularly useful with the ReLU activation function, which produces unbounded activations.
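To illustrate the two regularizers above, the sketch below adds an L2 penalty to a cost value and applies a dropout mask to a layer's activations. The penalty strength, the dropout rate, and the use of "inverted" rescaling at training time are assumptions of this sketch rather than details fixed by the papers discussed here.

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    # Adds (1/2) * lambda * ||w||^2 to the cost function
    return 0.5 * lam * np.sum(weights ** 2)

def dropout(activations, rate=0.5, training=True):
    # Randomly deactivates a fraction `rate` of the units during training;
    # at test time the layer is left untouched
    if not training:
        return activations
    rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= rate
    # "Inverted" dropout: rescale so the expected activation is preserved
    return activations * mask / (1.0 - rate)

h = np.array([0.5, 1.2, -0.3, 0.8])
print(dropout(h, rate=0.5))           # roughly half the units zeroed
print(l2_penalty(np.ones(10)))        # 0.5 * 1e-3 * 10 = 0.005
```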

2.3 RECURRENT NEURAL NETWORK (RNN)

Recurrent Neural Networks (RNNs) are a type of neural network with cyclic connections between their processing units. These recurrent connections allow the network to store information related to past inputs for an amount of time that is not fixed beforehand; instead, it depends on the weights of the network and on the input data. The structure of RNNs allows them to operate over sequences of inputs and outputs, unlike feed-forward neural networks, which can only receive fixed-sized vectors as input and produce fixed-sized vectors as output.

A standard RNN receives as input a vector x and produces an output vector y, which is influenced not only by the current input but also by the history of inputs previously fed to the model. This is accomplished by a hidden state vector. Figure 5 shows a simple RNN architecture.

Figure 5: Simple RNN architecture.

Given the input x_t at a time t, the hidden state h_t of the RNN is defined by Equation 9, where W_{hx} and W_{hh} are the weights connected to the input of the network and to the previous hidden state, respectively, and h_{t-1} is the hidden state at the time t-1.

h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x_t),  (9)

The output y_t, shown in Equation 10, is defined by the dot product of the output weights W_{hy} and the current hidden state h_t. A non-linearity is often applied to generate the final output, such as the Softmax function, very common in classification problems.

y_t = \mathrm{softmax}(W_{hy} h_t)  (10)
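Equations 9 and 10 translate almost directly into code. The sketch below unrolls a vanilla RNN over a short input sequence; the dimensions and random weights are illustrative assumptions, and the softmax function is as defined in Equation 8.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_forward(xs, W_hh, W_hx, W_hy):
    # Unrolls Equations 9 and 10 over a sequence of input vectors
    h = np.zeros(W_hh.shape[0])           # initial hidden state
    outputs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)  # Equation 9
        outputs.append(softmax(W_hy @ h)) # Equation 10
    return outputs, h

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 2             # hypothetical sizes
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
W_hx = rng.normal(scale=0.1, size=(hidden, n_in))
W_hy = rng.normal(scale=0.1, size=(n_out, hidden))

sequence = [rng.normal(size=n_in) for _ in range(5)]
ys, h_final = rnn_forward(sequence, W_hh, W_hx, W_hy)
print(ys[-1])                             # class probabilities at the last step
```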

2.3.1 LONG SHORT TERM MEMORY (LSTM)

Theoretically, RNNs should be able to learn long-term dependencies; that is, an RNN should be capable of taking into account a relevant input presented to the network a relatively long time in the past when generating the current output. In practice, however, RNNs fail at this task, mainly when the time gap is very large. This occurs because the parameters of RNNs settle in sub-optimal solutions in which only short-term dependencies are learnt (BENGIO et al., 1992). Hence, learning to store information for a long time is a very difficult task (BENGIO; SIMARD; FRASCONI, 1994).

Long Short Term Memory networks (LSTMs) were introduced to address this issue (HOCHREITER; SCHMIDHUBER, 1997). They are capable of learning those long-term dependencies through a more complex structure based on special hidden units known as memory cells. The internal structure of memory cells makes it possible to overcome the vanishing (or exploding) gradient problem, allowing the network to retain dependencies from very early inputs (HOCHREITER et al., 2001). A LSTM memory cell is shown in Figure 6.

Figure 6: Internal structure of a LSTM Memory Cell.

LSTM cells can optionally remove or save information, which is accomplished through a mechanism based on gates. A LSTM cell contains three gates: one to forget information, another to update the previous cell state, and a last one to generate the output of the cell. Gates are composed of a sigmoid neural network layer and a pointwise product operation; the sigmoid layer defines how much information is let through by generating values ranging from zero to one. The inputs of a cell at the time t are the previous internal long-term cell state C_{t-1}, the previous output h_{t-1}, and the current input vector x_t. The cell generates an output vector h_t, considering k hidden units. Equations 11-16 present the formal functioning of the LSTM cell, where f_t, i_t, o_t, and C_t are the forget gate, the input gate (also known as update gate), the output gate, and the internal cell state, respectively. The weight matrices are W_f, W_i, W_c, and W_o, whilst the bias terms are b_f, b_i, b_c, and b_o. The sigmoid and hyperbolic tangent activation functions are represented by \sigma and \tanh, respectively. The symbol \circ represents the Hadamard product operation.

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),  (11)

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),  (12)

\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c),  (13)

C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t,  (14)

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),  (15)

h_t = o_t \circ \tanh(C_t),  (16)
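A single step of Equations 11-16 can be sketched in NumPy as follows. The weight matrices are randomly initialized and the dimensions are hypothetical, so the sketch illustrates the data flow through the gates rather than a trained cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the parameters of the four gates;
    # [h_{t-1}, x_t] denotes the concatenation of both vectors
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # Equation 11: forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # Equation 12: input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # Equation 13: candidate state
    c_t = f_t * c_prev + i_t * c_tilde        # Equation 14: new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])        # Equation 15: output gate
    h_t = o_t * np.tanh(c_t)                  # Equation 16: new output
    return h_t, c_t

rng = np.random.default_rng(0)
k, n_in = 4, 3                                # hypothetical sizes
W = {g: rng.normal(scale=0.1, size=(k, k + n_in)) for g in "fico"}
b = {g: np.zeros(k) for g in "fico"}

h, c = np.zeros(k), np.zeros(k)
for x in [rng.normal(size=n_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, b)
print(h)                                      # output after five time steps
```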

2.3.2 BIDIRECTIONAL LONG SHORT TERM MEMORY (BLSTM)

Although LSTM networks are a widely used DL method due to their capability to learn problems with long-term dependencies, they have a weakness: they consider only previous information, without exploring future contexts (NIE et al., 2016). Thus, for an input at the time t belonging to a sequence of length n, the LSTM will consider only the inputs from times t-1, t-2, ..., t-n. If information from the future is available, for instance inputs at the times t+1, t+2, ..., t+n, the network will not consider it when producing the output. Including information from the future could lead to better performance in particular tasks, such as video classification, which is explored in this work.

Bidirectional Long Short Term Memory networks (BLSTM) were designed to take advantage of both previous and future contexts by using a structure based on two separate hidden layers to process the data. Figure 7 presents a simple BLSTM architecture. Basically, it is a RNN with two LSTM cells: one performs a forward propagation, whilst the other propagates the data backwards. The outputs of the two cells are concatenated to produce the final output of the layer. More backward-forward layers can be stacked to form a deep BLSTM, with the output of each layer used as input for the next one.

Figure 7: Bidirectional Recurrent Network composed of LSTM cells.

2.4 TRANSFER LEARNING

Transfer learning is a learning paradigm based on employing previously acquired knowledge to solve new problems (PRATT, 1993). Unlike traditional machine learning algorithms, which assume that the data used to initially obtain the knowledge lies within the same feature space or shares the same distribution as the test data (SHAO; ZHU; LI, 2015), transfer learning considers that the domains of the training and testing data may be different (FUNG et al., 2006). It allows the domains, tasks, and distributions used in training and testing to be distinct (LU et al., 2015). Although the domains of the datasets may differ, transfer learning applications are able to solve problems that are sufficiently similar.

For instance, a model capable of classifying images of guitars should also be able to classify basses. This makes it possible to overcome a very common problem in real-world applications: the lack of available labelled data.

In deep learning, transfer learning is becoming common practice, since the models contain a very large number of parameters to be optimized, requiring a large amount of training data, which must also be labelled for supervised learning. For some tasks, the available data is not sufficient to properly train deep models, making it necessary to gather more data or to produce new samples by applying transformations to the original data, a process known as data augmentation. However, data augmentation is computationally expensive, mainly when applied on-line during the training phase of a model. Moreover, training a model with hundreds of thousands or even millions of samples (original or produced by data augmentation) demands high computational resources. Transfer learning is an alternative to tackle this issue. In this scenario, transfer learning is based on reusing models that were pre-trained on a dataset created for a particular purpose and applying them to other, yet similar, tasks. The models can be used without modifications, or they may be fine-tuned by retraining the classifier part of the network. Some deep learning networks commonly used for transfer learning are the VGG architectures (SIMONYAN; ZISSERMAN, 2014), Inception (mainly versions v3 and v4) (SZEGEDY et al., 2015), and ResNet (HE et al., 2016).
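As a sketch of the fine-tuning strategy just described, the following PyTorch snippet loads a ResNet pre-trained on ImageNet, freezes its feature extractor, and replaces the classifier layer. The choice of ResNet-18, the number of classes, and the optimizer settings are illustrative assumptions, not choices made in this work.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet
model = models.resnet18(pretrained=True)

# Freeze the feature extractor so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical
# 2-class soft biometrics task (e.g., gender classification)
model.fc = nn.Linear(model.fc.in_features, 2)

# Optimize only the parameters of the new classifier layer
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```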
