FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatic Emotion Identification: Analysis and Detection of Facial Expressions in Movies

João Carlos Miranda de Almeida

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Paula Viana

Co-Supervisor: Inês Nunes Teixeira
Co-Supervisor: Luís Vilaça


Abstract

The bond between spectators and films takes place in the emotional dimension of a film. The emotional dimension is characterized by the filmmakers’ decisions on writing, imagery and sound, but it is through acting that emotions are directly transmitted to the audience. Understanding how this bond is created can give us essential information about how humans interact with this increasingly digital medium and how this information can be integrated into large film platforms.

Our work represents another step towards identifying emotions in cinema, particularly from camera close-ups of actors that are often used to evoke intense emotions in the audience. During the last few decades, the research community has made promising progress in developing facial expression recognition methods but without much emphasis on the complex nature of film, with variations in lighting and pose, a problem discussed in detail in this work.

We start by focusing on the social sciences' understanding of the state-of-the-art models for emotion classification, discussing their strengths and weaknesses. Secondly, we introduce Facial Emotion Recognition (FER) systems and automatic emotion analysis in movies, analyzing unimodal and multimodal strategies. We present a comparison between the relevant databases and computer vision techniques used in facial emotion recognition, and we highlight some issues caused by the heterogeneity of the databases, since there is no universal model of emotions.

Built upon this understanding, we designed and implemented a framework for testing the feasibility of an end-to-end solution that uses facial expressions to determine the emotional charge of a film. The framework has two phases: firstly, the selection and in-depth analysis of the relevant databases to serve as a proof-of-concept of the application; secondly, the selection of the deep learning model to satisfy our needs, by conducting a benchmark of the most promising convolutional neural networks when trained with a facial dataset.

Lastly, we discuss and evaluate the results of the several experiments made throughout the framework. We find that current FER approaches insist on using a wide range of emotions, which hinders the robustness of the models, and, based on the evidence obtained, we propose a new way to look at emotions computationally by creating possible clusters of emotions that do not diminish the collected information. Finally, we develop a new database of facial masks and discuss some promising paths that may lead to the implementation of the aforesaid system.

Keywords: Facial Expression Recognition, Multimodal Sentiment Analysis, Emotion Analysis, Movie Databases, Machine Learning, Deep Learning, Computer Vision


Resumo

A ligação estabelecida entre espetadores e filmes ocorre na dimensão emocional de um filme. A dimensão emocional é caracterizada pela decisão dos cineastas na escrita, imagem e som, mas é através da atuação dos atores que as emoções são diretamente transmitidas para a plateia. Perceber como esta ligação é criada pode dar-nos informações essenciais acerca de como os humanos interagem com este meio digital e como esta informação poderá ser integrada em grandes plataformas de filmes.

O nosso trabalho é mais um passo nas técnicas de identificação de emoções no cinema, particularmente através de close-ups aos atores muitas vezes utilizados para evocar emoções intensas na audiência. Nas últimas décadas, a comunidade científica tem feito progressos promissores no desenvolvimento de métodos para o reconhecimento de expressões faciais. No entanto, não é dado grande ênfase à complexa natureza dos filmes, como variações na iluminação das cenas ou pose dos atores, algo que é explorado mais em detalhe neste trabalho.

Começamos por uma explicação extensiva dos modelos mais modernos de classificação de emoções, no contexto das ciências sociais. Em segundo lugar, apresentamos os sistemas de Reconhecimento Facial de Emoções (FER, do inglês Facial Emotion Recognition) e a análise automática de emoções em filmes, explorando estratégias unimodais e multimodais. Apresentamos uma comparação entre as bases de dados relevantes e as técnicas de visão computacional usadas no reconhecimento de emoções faciais e destacamos alguns problemas causados pela heterogeneidade das bases de dados, uma vez que não existe um modelo universal de emoções.

Com base neste conhecimento, projetámos e implementámos uma framework para testar a viabilidade de uma solução end-to-end que utilize as expressões faciais para determinar a carga emocional de um filme. A framework é composta por duas fases: em primeiro lugar, a seleção e análise aprofundada das bases de dados relevantes para servir como uma prova de conceito da aplicação; e em segundo lugar, a seleção do modelo deep learning mais apropriado para alcançar os objetivos propostos, conduzindo um benchmark de redes neurais convolucionais mais promissoras quando treinadas com uma base de dados contendo faces.

No final, discutimos e avaliamos os resultados das diversas experiências feitas ao longo da framework. Aprendemos que as actuais abordagens de FER insistem na utilização de uma vasta gama de emoções que dificulta a robustez dos modelos e propomos uma nova forma de olhar para as emoções computacionalmente, criando possíveis grupos de emoções que não diminuem a informação recolhida com base na evidência obtida. Finalmente, desenvolvemos uma nova base de dados de máscaras faciais, discutindo alguns caminhos promissores que podem levar à implementação do referido sistema.

Keywords: Reconhecimento de Expressões Faciais, Análise Multimodal de Sentimento, Análise de Emoção, Bases de dados cinematográficas, Aprendizagem Computacional, Inteligência Artificial, Visão por Computador


Acknowledgements

First and foremost, I would like to thank my Supervisors Professor Paula Viana and Inês Nunes Teixeira for giving me the opportunity to participate both in an internship and in a dissertation about two of the subjects I am more passionate about. Luís Vilaça, thank you for bringing me the calmness and the necessary clear view on the problems and how to tackle them. To all three supervisors, a big thank you for your support, guidance and feedback throughout this project and for making me grow as an academic and as a person.

Secondly, I am eternally grateful to my parents, brother, girlfriend and friends for endlessly supporting me during this coursework and for always being my Home regardless of how far I fly.

Finally, I would like to thank everyone I have met through the years for making me who I am today and for showing me that love and tolerance are the only values that should really matter.

Obrigado.

João Almeida


“If we opened people up, we’d find landscapes.”

Agnès Varda


Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Document Structure

2 The Human Emotion
  2.1 Discrete emotion model
  2.2 Dimensional emotion model
  2.3 Facial Action Coding System (FACS)
  2.4 Mappings between Emotion Representation Models
  2.5 Discussion

3 Facial Expression Recognition
  3.1 Historical Overview
  3.2 Deep Learning current approach
    3.2.1 Data Preprocessing
    3.2.2 Convolutional Neural Networks
  3.3 Model Evaluation and Validation
  3.4 Datasets
  3.5 Open-source and commercial solutions

4 Automatic Emotion Analysis in Movies
  4.1 Unimodal Approach
    4.1.1 Emotion in text
    4.1.2 Emotion in sound
    4.1.3 Emotion in image and video
  4.2 Multimodal approach
    4.2.1 Feature fusion techniques
  4.3 Commercial Solutions

5 Methodology, Results and Evaluation
  5.1 Problem Definition and Methodology
  5.2 Implementation details
  5.3 Approach
    5.3.1 Datasets Exploration
    5.3.2 Facial Detection in Movies
    5.3.3 CNNs Baseline and Benchmark
    5.3.4 Deep Learning Modeling and Optimization
    5.3.5 FER2013 dataset balancing
    5.3.6 Reducing the dimensionality of the problem
    5.3.7 Combining Facial Landmark masks with the original face image as input

6 Conclusions
  6.1 Contributions
  6.2 Future work

A FER Datasets Source

B Movies Present in AFEW and SFEW Datasets

C API response examples from commercial solutions
  C.1 Amazon Rekognition
  C.2 Google Vision API

D Multimodal Movie Datasets and Related Studies


List of Figures

1.1 L’arrivée d’un train en gare de La Ciotat (1895)
2.1 Circumplex Model of Emotion
2.2 Upper Face Action Units (AUs) from the Facial Action Coding System
2.3 Spatial distribution of NAWL word classifications in the valence-arousal affective space
2.4 Distribution of emotions in the valence-arousal affective map
3.1 Facial Expression Recognition stages outline
3.2 Histogram of Oriented Gradients of Barack Obama’s face
3.3 Examples of facial landmarks with their respective faces
3.4 Image geometric transformation examples
3.5 Example of feed-forward neural networks
3.6 Biological neuron and a possible mathematical model
3.7 Flow of information in an artificial neuron
3.8 Most common activation functions
3.9 Example of a 3x3 convolution operation with the horizontal Sobel filter
3.10 Average-pooling and max-pooling operation examples
3.11 Convolutional Neural Network example
3.12 An example of a residual block
3.13 Evolution of the diverse CNN architecture milestones
3.14 Transfer learning process
3.15 Possible queries accepted by the EmotionNet database
3.16 Examples of the FER2013 database
3.17 Class distribution per dataset
3.18 Face-api.js emotion recognition examples [37]
3.19 Deepface facial attribute analysis examples [24]
3.20 An example of the Azure Face API response
3.21 An example of the Algorithmia API response
3.22 Amazon Rekognition face detection and analysis example
3.23 Google Vision API attributes example
3.24 Face++ Emotion Recognition API response example
4.1 The girl in the red coat from Schindler’s List
4.2 Video Indexer functionalities overview
4.3 Video Indexer insights from the 1999 movie The Matrix
5.1 FER2013 content overview
5.2 FER2013 class distribution
5.3 FER2013 image examples
5.4 SFEW face aligned samples
5.5 Dlib face bounding box
5.6 Baseline Network Architecture diagram
5.7 Baseline network accuracy, loss and confusion matrix
5.8 VGG16 accuracy, loss and confusion matrix on FER2013
5.9 InceptionV3 accuracy, loss and confusion matrix on FER2013
5.10 Xception accuracy, loss and confusion matrix on FER2013
5.11 MobileNetV2 accuracy, loss and confusion matrix on FER2013
5.12 ResnetV2 accuracy, loss and confusion matrix on FER2013
5.13 DenseNet accuracy, loss and confusion matrix on FER2013
5.14 Final model
5.15 Accuracy and loss learning curves and confusion matrix of the model when training with FER2013
5.16 Confusion matrix of testing with the SFEW dataset
5.17 Learning curve of the model when training with class weights
5.18 Learning curve of the model when training with the top-4 performing emotions
5.19 Validation confusion matrix of FER2013 top-4 emotions
5.20 Confusion matrix of testing on SFEW
5.21 Learning curve of the model when training with the possible clusters of emotion
5.22 Angry, Happy and Neutral validation confusion matrix on FER2013
5.23 Angry, Happy and Neutral testing confusion matrix on SFEW
5.24 Facial mask example from FER2013


List of Tables

2.1 Basic and compound emotions with their corresponding AUs
3.1 Confusion Matrix
3.2 Principal databases used in FER systems
3.3 FER approaches and results on widely evaluated datasets
3.4 Comparison between the different open-source and commercial solutions
5.1 Versions of the software used in this work
5.2 FER2013 number of samples per class
5.3 SFEW aligned face samples per class
5.4 Facial Recognition in movies experiment
5.5 Benchmark of CNN architectures
5.6 Final configuration of the network
5.7 Precision, Recall and F1-score of the FER2013 validation set
5.8 Precision, Recall and F1-score of the SFEW test set
5.9 Training class weights
5.10 Precision, Recall and F1-score of the balanced FER2013 validation set
5.11 Precision, Recall and F1-score of the top-4 performing emotions during training on FER2013
5.12 Precision, Recall and F1-score of the top-4 performing emotions during testing on SFEW
5.13 Precision, Recall and F1-score with possible clustered emotions in the validation phase
5.14 Precision, Recall and F1-score with clustered emotions in SFEW
A.1 FER datasets sources
B.1 Movie sources for the SFEW and AFEW databases
D.1 Multimodal movie datasets
D.2 Some multimodal relevant studies


Abbreviations

AI     Artificial Intelligence
API    Application Programming Interface
AU     (Facial Action Coding System) Action Unit
BLSTM  Bidirectional Long Short Term Memory
CNN    Convolutional Neural Network
FACS   Facial Action Coding System
FER    Facial Expression Recognition
GAN    Generative Adversarial Network
LSTM   Long Short Term Memory
ML     Machine Learning
NLP    Natural Language Processing
OCC    Ortony, Clore and Collins’s (Model of Emotion)
PA     Pleasure-Arousal
PAD    Pleasure-Arousal-Dominance
PCA    Principal Component Analysis
RNN    Recurrent Neural Network
SaaS   Software as a Service
SDK    Software Development Kit
SGD    Stochastic Gradient Descent
SoA    State-of-the-art
SVM    Support Vector Machine


Chapter 1

Introduction

1.1 Context
1.2 Motivation
1.3 Goals
1.4 Document Structure

The context, motivation, main objectives and document structure of this dissertation are introduced in this chapter.

1.1 Context

In the early showings of "L’arrivée d’un train en gare de La Ciotat", shown in Figure 1.1, by Louis and Auguste Lumière, credited as among the first inventors of the technology for cinema as a mass medium, there are reports that the audience panicked, thinking that the silent, grainy, black-and-white train pictured in the movie was going to drive right into them. Although there is insufficient evidence that such a panicked rush ever occurred [88], this "Cinema’s Founding Myth" still serves its purpose of showing the power of film, in particular its ability to elicit strong sentiments in audiences. The tale began to surface when people tried to describe the emotional power inherent to the convincing three-dimensional effect of the emerging medium.

Since then, films have been used not only as entertainment, but also as a means of communication, provoking feelings in their audience and making them relate to the story being told. Although the human expression of emotion when watching movies is nowadays more restrained, it is impossible to dissociate the emotional dimension from this kind of activity. In particular, there are studies suggesting that it is possible to extract emotion from the audience’s facial expressions [34], or even from an analysis of its physiological signals [132].

Figure 1.1: L’arrivée d’un train en gare de La Ciotat (1895)

Furthermore, the way people watch films has changed significantly in recent years. With the advent of the internet, online streaming services and movie databases, video platforms have been developing increasingly complex video indexing and summarization tools, as well as personalized content delivery. Problems still persist in current solutions, with many of the difficulties lying specifically in the state-of-the-art cinematographic databases, since there is a large quantity of undocumented media content and an underlying difficulty in annotating it. Consequently, without information on the contents, establishing relationships between them is a complex task, making dataset access and retrieval inefficient and overlooking the information potential that the content itself is able to provide.

The work developed is framed in the scope of activity B.1 (Plataforma Tecnológica de apoio ao Plano Nacional do Cinema) of the CHIC project (Cooperative Holistic View on Internet and Content — POCI-01-0247-FEDER-024498). One of the contributions of this project will be to allow a simple and intuitive consultation of the contents stored in the vast database of the Cinemateca Portuguesa, a public institution dedicated to the dissemination and preservation of cinematic art. For this purpose, it is intended to develop new forms of navigation, research and interaction with the contents that consider aspects typically not contained in the description metadata. As there is evidence that 7% of human communication is verbal, 38% is vocal and 55% is visual [99], in this project we try to enrich the platform with the information provided by the emotions portrayed by the actors in movies.

1.2 Motivation

Affective Computing is an interdisciplinary field that studies and develops systems able to recognize, interpret, process and simulate human affect. Understanding the emotional state of people is a subject that has attracted and fascinated researchers from different branches, combining findings from the computer science, psychology and cognitive science communities.

One of the challenges of this field is to try to answer the highly subjective question "What emotion does this particular content convey?", which is studied in detail in the subfield of Emotion and Sentiment Analysis. Several studies have tried to answer this question by analyzing different modalities of content.

For a long time, text-based sentiment analysis has been the reference in this area, with the use of Natural Language Processing (NLP) and text analysis computational techniques for the extraction of the sentiment that a text conveys. These techniques are commonly used in the analysis of texts from social networks or online reviews found on e-commerce platforms, due to the proven value that is added to companies and organizations [111].

However, in recent years, due to the advances observed in the Artificial Intelligence and Computer Vision fields, other media modalities have been considered. In the speech recognition field, relevant correlations have been found between statistical measures of sound and the mood of the speaker. Additionally, approaches try to recognize different facial expressions from video sequences, as it is suggested by some authors that they might have a close relationship with emotions.

The advantage of analyzing videos over text is the behavioural contextualization of the individuals being studied: it is possible to combine visual and sound cues to better identify the true affective state of the videos. Additionally, when analyzing movies, other stylistic characteristics can be used to improve the accuracy of emotion recognition, such as shot length, number of shots, dominant colours or colour histograms. Even though these developments are not film-oriented, combined with state-of-the-art machine learning algorithms and ever-growing computational power, they could be very valuable in the context of this work, providing crucial cues on how emotion is detected efficiently across several fields.

Therefore, what we will investigate in this work is the applicability of FER solutions in the field of films. We intend to gather a solid understanding of how emotions are addressed in the social and human sciences and discuss possible adaptations of the emotional theories to computational models. Additionally, we shall explore the various datasets available and try to understand their behaviour when confronted with the difficult conditions of a film. Furthermore, we will look for the most suitable computational models for this task and evaluate their reliability. Currently, there is a vast amount of solutions available but still no consensus on the best solution when applied to films, an uncertainty that we will embrace. In particular, we will seek to understand the results of multi-class classification models, their limitations, and what adjustments can be made to achieve the desired goals.

The benefits of this study could serve a vast range of applications: personalized content distribution and better recommendation systems, able to suggest new content based on the characteristics that a user generally connects with, may arise from finding new emotion-based relationships between media content. Additionally, the conclusions of this study may also be used to enhance information retrieval from large movie platforms, and emotion-aware content clustering may greatly improve the organization of such systems.


1.3 Goals

This dissertation has the following goals:

• Facial expression analysis — evaluate the contribution of facial expression analysis in the context of the affective dimension of a movie.

• Database evaluation — evaluate the best databases that could be applied in facial expression recognition tasks in the movie domain.

• Machine Learning evaluation — evaluate the most appropriate machine/deep learning techniques to achieve the best possible result.

The ultimate objective of this work is to create a framework that evaluates the feasibility of using faces to determine the emotional charge of a movie, discussing the limitations and promising paths of such a system.

1.4 Document Structure

This document is divided into the following six chapters:

• Chapter 1 — Introduction
This chapter introduces the work, its context, motivation and goals.

• Chapter 2 — The Human Emotion
This chapter provides sufficient knowledge about the models that have been developed to classify human emotion. It also shows the limitations of current research, given the non-universal nature of the subject.

• Chapter 3 — Facial Expression Recognition
In this chapter, Facial Expression Recognition is introduced along with an in-depth review of the current deep learning approach.

• Chapter 4 — Automatic Emotion Analysis in Movies
This chapter provides an overview of the current approaches for unimodal and multimodal automatic emotion analysis in movies.

• Chapter 5 — Methodology, Results and Evaluation
This chapter includes a detailed definition of the problem and the developed framework, with a description of the technologies used to address it. In the second part of the chapter, the experimentation done throughout this work is presented, accompanied by the evaluation and discussion of the results that were achieved.

• Chapter 6 — Conclusions
Finally, the last chapter provides a synthesis of the main ideas presented in this work; conclusions are drawn and promising paths for future work are pointed out.


Chapter 2

The Human Emotion

2.1 Discrete emotion model
2.2 Dimensional emotion model
2.3 Facial Action Coding System (FACS)
2.4 Mappings between Emotion Representation Models
2.5 Discussion

This chapter describes the background and related work in the social sciences related to human emotion. It starts by providing an in-depth description of how emotion classification models have evolved, from a historical and psychological point of view. In the end, mappings between these models are discussed.

2.1 Discrete emotion model

Understanding the emotional state of people has been a complex topic over the years. The studies date back to classical antiquity, when Cicero first described emotions as a set of four basic states: metus (fear), aegritudo (pain), libido (lust), and laetitia (pleasure) [47]. Years later, Darwin argued that emotions evolved via natural selection, hence being independent of the individual's culture [13]. In this context, the first scientific models for emotion were developed, starting by dividing emotions into a limited and discrete set of basic emotions — the discrete model of emotions. Paul Ekman proposed in 1970 that facial expressions are universal and provide sufficient information to predict emotions. His studies suggest that our emotions evolved through natural selection into a limited and discrete set of basic emotions: anger, disgust, fear, happiness, sadness, and surprise [34]. Each emotion is independent of the others in its behavioural, psychological and physiological manifestations, and each of them is born from the activation of unique areas in the central nervous system. The criterion used was the assumption that each primary emotion has a distinct facial expression that is recognized even across different cultures [108].


Other studies tried to expand the set of emotions to non-basic ones, such as fatigue, anxiety, satisfaction, confusion, or frustration [62,144]. The Ortony, Clore and Collins’s Model of Emotion (OCC Model) [107] is popular amongst systems that incorporate emotions in artificial characters. It is a hierarchical model that classifies 22 emotion types that might emerge as a consequence of events (e.g., happiness and pity), actions of agents (e.g., shame and admiration), or aspects of objects (e.g., love and hate).

Despite the six basic emotion model being the dominant theory of emotion in psychiatric and neuroscience research, recent studies have pointed out some limitations. Certain facial expressions are associated with more than one emotion, which suggests that the initially proposed taxonomy is not adequate [113]. Other studies suggest that there is no correlation between the basic emotions and the automatic activation of facial muscles [14]. Further claims suggest that this model is culture-specific and not universal [60]. These drawbacks caused the emergence of additional methods that intend to be more exhaustive and universally accepted regarding emotion classification.

2.2 Dimensional emotion model

Some studies have assessed people’s difficulty in evaluating and describing their own emotions, which points out that emotions are not discrete and isolated entities, but rather ambiguous and overlapping experiences [120]. This line of thought reinforced a dimensional model of emotions, which describes them as a continuum of highly interrelated and often ambiguous states.

The model that gathered the most consensus among researchers — the Circumplex Model of Emotion — argues that there are two fundamental dimensions: valence, which represents the hedonic aspect of emotion (that is, how pleasurable it is for the human being), and an enthusiasm or tension dimension, which represents the energy level [119]. Each emotion is represented using coordinates in a multi-dimensional space. Figure 2.1 shows a two-dimensional visualization of the model. Valence spans from negative (e.g., depressed) to positive (e.g., content), whereas arousal ranges from inactive (e.g., tired) to active (e.g., excited). Some predefined points represent a categorical definition of emotion to facilitate its interpretation.

Another bi-dimensional model was proposed by Whissell, which defines an emotion by the pair of values <activation, evaluation>: the emotion is rated as active or passive in the activation dimension, and as positive or negative in the evaluation dimension [146]. This is the particular case of polarity, which is very popular in NLP studies.

Figure 2.1: Circumplex Model of Emotion [87]

Other approaches proposed tri-dimensional models. The most famous one, the pleasure-arousal-dominance (PAD) model, adds a third dimension to the arousal-valence scale called dominance, representing the controlling and dominant nature of the emotion on the individual. For instance, anger is considered a dominant/in-control emotion, while fear is considered submissive/dominated [98]. The utility of the third dimension remains unclear, as several studies revealed that the valence and arousal axes are sufficient to model emotions, particularly when handling emotions induced by videos [93]. More recently, Fontaine concluded that a four-dimensional model (valence, arousal, dominance and predictability) best describes the variety of emotions, but the optimal number of dimensions depends on the purpose of the study [59].

The advantages of a dimensional model compared with a discrete model are the accuracy in describing emotions, by not being limited to a closed set of words, and a better description of emotion variations over time, since the variation in emotion is not realistically discrete, from one universal emotion to another, but rather continuous.

2.3 Facial Action Coding System (FACS)

The Facial Action Coding System (FACS) [33] is an anatomically-based system used to describe all visually discernible movements of the face muscles. FACS is able to objectively measure the frequency and intensity of facial expressions using Action Units (AUs), i.e., the smallest distinguishable units of measurable facial movement, such as brow lowerer, eye blink or jaw drop. The system has a total of 46 action units, with some of the AUs having a 5-point ordinal scale used to measure the degree of muscle contraction. Figure 2.2 represents the action units of the upper face region.

FACS is strictly descriptive and does not include an emotion correspondence. The same authors of the system proposed an Emotional Facial Action Coding System (EMFACS) [39], based on the six basic emotions of the discrete model described in Section 2.1, that makes a connection between emotions and facial expressions. Recent studies have proposed a new classification system based on simple and compound emotions [8], as shown in Table 2.1.

Figure 2.2: Upper Face Action Units (AUs) from the Facial Action Coding System. The AUs marked with * are graded according to muscle contraction intensity [145]

Table 2.1: Basic and compound emotions with their corresponding AUs [8]

Category            AUs              Category               AUs
Happy               12, 25           Sadly disgusted        4, 10
Sad                 4, 15            Fearfully angry        4, 20, 25
Fearful             1, 4, 20, 25     Fearfully surprised    1, 2, 5, 20, 25
Angry               4, 7, 24         Fearfully disgusted    1, 4, 10, 20, 25
Surprised           1, 2, 25, 26     Angrily surprised      1, 2, 5, 10
Disgusted           9, 10, 17        Disgustedly surprised  1, 2, 5, 10
Happily sad         4, 6, 12, 25     Happily fearful        1, 2, 12, 25, 26
Happily surprised   1, 2, 12, 25     Angrily disgusted      4, 10, 17
Happily disgusted   1, 4, 15, 25     Awed                   1, 2, 5, 25
Sadly fearful       1, 4, 15, 25     Appalled               4, 9, 10
Sadly angry         4, 7, 15         Hatred                 4, 7, 10
Sadly surprised     1, 4, 25, 26

2.4 Mappings between Emotion Representation Models

Some studies suggest that there are neurological variations within the processes of emotion categorization and assessment of emotion dimensions [65]. Understanding what can bring these models together is just as relevant as understanding what separates them, if not more so. Motivated by the dispersion of classification methods across emotional databases, some studies have investigated a potential mapping between discrete/categorical and dimensional theories.

A first linear mapping between the emotions anger, disgust, fear, happiness and sadness and a dimensional space was proposed in 2011 [128]. This linear mapping is based on the matrix of coefficients shown in Equation 2.1.

\[
\mathrm{PAD}[\text{Anger}, \text{Disgust}, \text{Fear}, \text{Happiness}, \text{Sadness}] =
\begin{bmatrix}
-0.51 & -0.40 & -0.64 &  0.40 & -0.40 \\
 0.59 &  0.20 &  0.60 &  0.20 & -0.20 \\
 0.25 &  0.10 & -0.43 &  0.15 & -0.50
\end{bmatrix}
\tag{2.1}
\]


Considering that the matrix was obtained following the quantitative relationship between three-dimensional mood Pleasure-Arousal-Dominance (PAD) space and OCC emotions [41], it might be considered a theoretical model instead of an evidence-based one, since no data source was used to come up with the mapping.

In 2018, a new study proposed a method and evaluation metrics to assess the mapping accuracy, and elaborated a new mapping between the six basic emotions and the PAD model [79]. The new mapping in the three-dimensional PAD space is shown in Equation 2.2, and a new mapping in the two-dimensional Pleasure-Arousal (PA) space is shown in Equation 2.3.

\[
\mathrm{PAD}[\text{Happy}, \text{Sad}, \text{Angry}, \text{Scared}, \text{Disgusted}, \text{Surprised}, 1] =
\begin{bmatrix}
 0.46 & -0.30 & -0.29 & -0.19 & -0.14 & 0.24 &  0.52 \\
 0.07 & -0.11 &  0.19 &  0.14 & -0.08 & 0.15 &  0.53 \\
 0.19 & -0.18 & -0.02 & -0.10 & -0.02 & 0.08 &  0.50
\end{bmatrix}
\tag{2.2}
\]

\[
\mathrm{PA}[\text{Happy}, \text{Sad}, \text{Angry}, \text{Scared}, \text{Disgusted}, \text{Surprised}, 1] =
\begin{bmatrix}
 0.54 & -0.14 & -0.21 & -0.06 & -0.16 & 0.00 &  0.46 \\
 0.50 &  0.06 &  0.37 &  0.36 &  0.12 & 0.00 & -0.01
\end{bmatrix}
\tag{2.3}
\]

These linear representations were obtained by cross-referencing the information of lexicons annotated in both models. The PAD mapping was obtained by pairing the Affective Norms for English Words (ANEW) [9] and Synesketch [74] lexicons. The compound set is composed of English words annotated in the three-dimensional PAD model (originally in the ANEW lexicon) and in Ekman's six basic emotions plus a neutral state (originally in the Synesketch lexicon). The PA mapping was derived using two sentiment annotations of the same lexicon, the Nencki Affective Word List (NAWL) [116,147]. The NAWL dataset contains 2902 Polish words, annotated in the valence-arousal space and in five basic emotions (happy, sad, angry, scared and disgusted).

During the construction of the NAWL dataset, each subject provided a vector of five individual ratings, one for each emotion. Later, the Euclidean distance to six "extreme" points, representing pure basic emotions, was calculated: (7,1,1,1,1) for happiness, (1,7,1,1,1) for anger, (1,1,7,1,1) for sadness, (1,1,1,7,1) for fear, (1,1,1,1,7) for disgust and (1,1,1,1,1) for the neutral state. The average distance was calculated to estimate how close the word is to each of the extreme points. To assign a word to one of the six classes, a threshold must be established, defining the maximum distance a word may be from an emotion in order to be classified as such. Additionally, a word should only lie in a single category region, i.e., if it falls in an intersection area between two categories it remains unclassified. An interactive analysis of the NAWL database is available online, where various combinations of parameters and the consequent results can be tested.
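As a concrete illustration of this classification rule, the sketch below computes the Euclidean distance of a word's mean rating vector to the six extreme points and assigns a class only when exactly one region matches. The function name, data structures and threshold argument are assumptions for illustration; the concrete thresholds used by the NAWL authors are given in the next paragraph.

```python
import numpy as np

# "Extreme" points representing pure basic emotions and the neutral state on
# the 1-7 rating scale, in the order [happiness, anger, sadness, fear, disgust].
EXTREMES = {
    "happiness": (7, 1, 1, 1, 1),
    "anger":     (1, 7, 1, 1, 1),
    "sadness":   (1, 1, 7, 1, 1),
    "fear":      (1, 1, 1, 7, 1),
    "disgust":   (1, 1, 1, 1, 7),
    "neutral":   (1, 1, 1, 1, 1),
}

def classify_word(mean_ratings, thresholds):
    """Assign a word to at most one emotion class.

    mean_ratings: the word's average 5-dimensional rating vector.
    thresholds:   dict mapping class name -> maximum allowed distance.
    Returns the class name, or None if the word falls in no region or in the
    intersection of two or more regions (and thus stays unclassified).
    """
    ratings = np.asarray(mean_ratings, dtype=float)
    hits = [name for name, extreme in EXTREMES.items()
            if np.linalg.norm(ratings - np.asarray(extreme, dtype=float)) <= thresholds[name]]
    return hits[0] if len(hits) == 1 else None
```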


Using the Euclidean-distance-based classification method with thresholds of 2.5 for happiness, 5.5 for the remaining basic emotion classes and 2.5 for the neutral class, the researchers of the NAWL lexicon were able to classify 739 out of the 2902 available words. Figure 2.3 illustrates the spatial distribution of these classifications in the valence-arousal affective space. A significant finding was the apparent formation of emotion clusters: happiness is a high-valence, medium-arousal emotion, neutral is low in both dimensions, and the remaining emotions seem to overlap, particularly anger and sadness.

Figure 2.3: Spatial distribution of NAWL word classifications in the valence-arousal affective space [147]

Further interesting insights have been emerging from the music research field. In 2010, a study was conducted to investigate a possible correlation between these theories using musical stimuli. Firstly, a validation experiment was carried out involving twelve music experts who had studied a musical instrument for at least ten years. Each member of the study was given five different movie soundtracks and asked to classify them according to the target emotions. Half of the panel focused on discrete emotions (happiness, sadness, fear, anger, surprise and tenderness) and the other half on the dimensional emotions (high-low valence, tension arousal and energy arousal). This phase was important to validate the choice of the emotional frameworks, since the selected stimuli needed to represent emotion concepts. Secondly, the experiment was conducted involving 116 university students, aged between 18 and 42 years. Similar to the validation experiment, the participants were organized into two blocks. The members of the first block were asked to rate each musical excerpt on a 1-to-9 intensity scale for each discrete emotion, and the participants of the second block were asked to do the same on a (negative) 1-9 (positive) scale for each emotion dimension. A more detailed description of the participants, stimuli, apparatus and procedure can be found in the corresponding study [32].

Some results of the study are shown in Figure 2.4. Despite the different scales, a clustering process similar to the previous study appears to have happened, particularly for the happiness emotion and in the overlap of fear and anger. However, the same did not happen with sadness, which appears in a different region of the valence-arousal affective map.

Figure 2.4: Distribution of emotions in the valence-arousal affective map. Left: Variance of the five discrete target emotions — capital letters represent well-defined areas (high intensity and low deviation) whereas the small letters represent less clearly defined areas (lower intensity and higher variation). Right: Mean rating per discrete emotion represented as marker type in the valence-arousal space [32]

2.5 Discussion

Humans have a complex nature, and the study of the social and psychological motivations behind their emotions could not be different. At the moment, there is no truly universal model of emotion. The models covered in this chapter should not be seen as competitors for a single universal truth, but rather as different angles from different academic fields (biology, psychology, psychiatry and social sciences) that, with different experimental methods, try to analyze and describe a convoluted subject. The discrete emotion model, being categorical, has the advantage of the population's broad familiarity with the concepts, which may be useful in film rating or recommendation systems. The dimensional model, on the other hand, may offer greater granularity in tracking the evolution of emotion throughout a film, which may be particularly interesting in systems that need more comprehensive information about emotion at a given time.

However, trying to create a computational model of a non-universal theory is an arduous task, which becomes more evident with the dispersion of models used in the various databases, as will be shown in the following sections. Even so, what might initially seem like a limitation could become a valuable asset if we consider a possible hybrid model involving classification and linear regression tasks, as indicated by the results of the mappings between emotion representation models.


Chapter 3

Facial Expression Recognition

3.1 Historical Overview
3.2 Deep Learning current approach
3.3 Model Evaluation and Validation
3.4 Datasets
3.5 Open-source and commercial solutions

Like many other computer vision subjects, face-related studies have moved from hand-engineered features to deep learning, which surpassed the state-of-the-art (SoA) approaches of the time.

This chapter introduces Facial Expression Recognition (FER). It starts with a brief overview of the historical context of the subject. Then, it explores the current deep learning approach, from the data pre-processing stage to the construction and optimization of a CNN model.

3.1 Historical Overview

Facial Expression Recognition (FER) systems use biometric markers to detect emotion in human faces. Despite the complexity of emotion classification models, the discrete model is the most popular perspective in facial expression recognition algorithms, given its pioneering investigations and the intuitive definition of emotions.

FER systems can be branched into two principal categories:

• Static-based methods — the feature representation takes into account only spatial information from the current single image.

• Dynamic-based methods — the temporal relation between contiguous frames in a facial expression sequence is taken into consideration.


Figure 3.1: Facial Expression Recognition stages outline [56]

Figure 3.1 illustrates the principal stages of a static-based FER system. In the image pre-processing step, several transformations are applied to the image, such as noise reduction (using a Gaussian filter, for instance), normalisation and histogram equalisation. Face-related image transformations are further explored in Section 3.2.1. Initially, most traditional methods were based on shallow learning for feature extraction — unlike deep learning, features are drawn by hand based on heuristics related to the target problem, like Local Binary Patterns (LBP) [127], Non-Negative Matrix Factorization [161] and Sparse Learning [162]. However, the increasing computational power and the emergence of higher-quality databases made deep learning models stand out from the SoA techniques of the time, and they are the current focus of FER researchers. A typical deep-learning-based FER pipeline is discussed in the forthcoming section.

3.2 Deep Learning current approach

Since 2013, international competitions (like FER2013 [45] and EmotiW [31]) have changed the paradigm of facial expression recognition by providing a significant increase in training data. These competitions introduced more unconstrained datasets, which led to the transition of the study from controlled simulations in the laboratory to the more unpredictable real-life environment.

Current approaches use deep learning techniques, due to the increased processing power of modern chips as well as the availability of more detailed datasets, with results that far exceed traditional methods [75,50]. The general pipeline of deep facial expression recognition systems includes a pre-processing phase of the input image, which incorporates face alignment, data augmentation and face normalization techniques. Then, the processed data is passed to a neural network, commonly a Convolutional Neural Network (CNN), Deep Belief Network (DBN), Recurrent Neural Network (RNN), Generative Adversarial Network (GAN) or Deep Autoencoder (DAE). In the end, the features extracted by the neural network are passed to a classifier that will label the image. The following sections delve into the details of each of these procedures, with special emphasis on CNNs, since they are the algorithm of choice for image classification challenges [82].

3.2.1 Data Preprocessing

Data quality and quantity have a great impact on the results of deep learning tasks. In deep learning algorithms, feature extraction is done throughout the inner layers of the model, so it is crucial to have an initial data processing phase that brings the data closer to the goals we intend to achieve. It is unrealistic to expect that a given dataset will be perfect, so it is fundamental to evaluate its quality. The general methods used to preliminarily process image data containing faces are explained in the following sections.

Data quality assessment

This step is essential for determining the quality of a given dataset. It includes checking for missing or duplicate images, examining images whose labels appear inconsistent, and analyzing class imbalance, i.e., the distribution of images per class in the dataset.

Face detection

Face detection is used to detect inconsistent images that do not contain faces. Dlib is a toolkit, written in C++, for making real-world machine learning and data analysis applications, and it allows face detection in two distinct ways (a minimal usage sketch follows the list):

• Histogram of Oriented Gradients (HOG) combined with a linear classifier [23]: HOG is a feature descriptor often used to extract features from image data, focusing on the shape of the object. By counting the occurrences of gradient orientations in localized portions of an image, it allows a trained linear classifier (such as an SVM) to use this information and classify whether an image contains a face or not. Figure 3.2 contains an example of this gradient applied to a face.

• Convolutional Neural Network: it uses a pre-trained CNN model to find faces in an image. The model was trained using images from ImageNet [27], AFLW [96], Pascal VOC [36], VGG-Face [110], WIDER [153], and FaceScrub [105]. It is more accurate than the HOG-based model, but at the cost of much more computational power to run.
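The sketch below shows how the two dlib detectors might be used side by side; the image path and the CNN weights file name are placeholders, and the CNN model file has to be downloaded separately from the dlib model repository.

```python
import cv2
import dlib

# HOG + linear SVM frontal face detector bundled with dlib.
hog_detector = dlib.get_frontal_face_detector()

# CNN-based detector; the weights file is distributed separately.
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

image = cv2.imread("movie_frame.jpg")   # placeholder path
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

hog_faces = hog_detector(rgb, 1)   # second argument: upsample once to find smaller faces
cnn_faces = cnn_detector(rgb, 1)   # slower, but more robust to pose and lighting

print(f"HOG detections: {len(hog_faces)}, CNN detections: {len(cnn_faces)}")
```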

Face alignment

Although face detection is the only necessary method to train a deep neural network to learn meaningful features, there are natural variations in pictures that are irrelevant to facial expressions, such as background or head poses. A technique to overcome these variations is to perform face alignment using the localization of facial landmarks, proven to enhance the performance of FER systems [104].

Figure 3.2: Histogram of Oriented Gradients of Barack Obama’s face

The detection of facial landmarks is a subset of the shape prediction problem. Given an input image of a face, the shape predictor tries to localize key points of interest throughout the face’s shape, namely the mouth, the right and left eyebrows, the right and left eyes, the nose and the jaw. The different positions of the facial landmarks and the face contour indicate facial deformations due to head movements and facial expressions. This information is later used to establish a feature vector of the human face that can be used to obtain a normalized rotation, translation and scale representation of the face for all images in the database.

There are several open-source solutions for estimating facial landmarks. Dlib uses an ensemble of regression trees (a predictive model composed of a weighted combination of multiple regression trees) trained on the iBUG 300-W dataset [121] to determine the positions of the facial landmarks directly from the pixel intensities [64]. Another strategy proposes a deep neural network to estimate facial landmarks in the two-dimensional and three-dimensional spaces [12], with a PyTorch implementation available on GitHub. Figure 3.3 contains a visual representation of 68 two-dimensional facial landmarks on their respective faces.

Figure 3.3: Examples of facial landmarks with their respective faces [12]
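A minimal sketch of landmark extraction with dlib is shown below; the predictor weights file (the 68-point model trained on iBUG 300-W) and the image path are placeholders that must exist locally.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# 68-point shape predictor; the .dat file is distributed separately by dlib.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")   # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    shape = predictor(gray, rect)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    # The eye landmarks, for instance, can be used to estimate the in-plane
    # rotation angle and derive a similarity transform that normalizes
    # rotation, translation and scale across the whole dataset.
    print(f"Found {len(landmarks)} landmarks for this face")
```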


Data augmentation

Data augmentation is a method used to enlarge the size of training or testing data, by applying several transformations to real face samples or simulated virtual face samples. This technique is particularly useful in unbalanced databases since we can generate new images to balance the number of samples per class in a given dataset, reducing overfitting. Image data augmentation is usually only applied in the training dataset, and not in the validation or test datasets. In the context of facial expressions, data augmentation techniques can be divided into three groups [143]:

• Generic transformations: includes geometric and photometric transformations to the geometry or colour of the image. The most frequently used operations include rotation, skew, shifting, scaling, noise, shear, horizontal or vertical flips, contrast and colour jittering. Examples of geometric transformations can be found in Figure 3.4.

• Component transformations: includes face specific transformations, such as changing the persons’ hairstyle, makeup or accessories.

• Attribute transformations: includes face specific transformations, such as changing the persons’ pose, expression or age.

Figure 3.4: Image geometric transformation examples [143]

Component and attribute transformations are generally done using Generative Adversarial Networks (GAN) that are able to learn disentangled representations of the face and modify its characteristics [143]. Besides the manipulation of human faces, there are also studies that point out that the construction of virtual faces based on the facial structure can be advantageous to overcome the problem of small sample size [81].

The specific data augmentation techniques applied to the training dataset must be chosen carefully, taking into account the dataset and the domain knowledge of the problem. For instance, a vertical flip of a face may not be particularly advantageous, since it is quite unlikely that the model will ever see a picture of an upside-down face.
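The sketch below configures a set of generic (geometric and photometric) augmentations with Keras for the training set only, deliberately leaving out vertical flips for the reason given above; the parameter values and the commented directory layout are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# On-the-fly augmentation applied to the training set only.
train_augmenter = ImageDataGenerator(
    rotation_range=15,             # small random rotations
    width_shift_range=0.1,         # horizontal shifting
    height_shift_range=0.1,        # vertical shifting
    zoom_range=0.1,                # random scaling
    shear_range=0.1,               # shear transformation
    horizontal_flip=True,          # mirrored faces are still plausible faces
    brightness_range=(0.8, 1.2),   # photometric jittering
)

# Example usage (hypothetical directory layout and image size):
# train_flow = train_augmenter.flow_from_directory(
#     "data/train", target_size=(48, 48), color_mode="grayscale", batch_size=64)
```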


Face normalization

Face normalization techniques seek to normalize the illumination and pose in all the samples of a database. Related studies have shown that illumination normalization combined with histogram equalization can improve FER results [82]. On the other hand, pose normalization, which yields frontal facial views, reports promising performance, generally using facial landmarks [49] or GAN-based deep models [154] for frontal view synthesis, as discussed above.
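As an illustration of illumination normalization, the sketch below applies global histogram equalization (and, as a common alternative, CLAHE) to an aligned grayscale face with OpenCV; file names are placeholders.

```python
import cv2

# Global histogram equalization of an aligned grayscale face.
gray = cv2.imread("aligned_face.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
equalized = cv2.equalizeHist(gray)

# CLAHE (contrast-limited adaptive equalization) limits the over-amplification
# of noise in nearly uniform regions and is a frequent alternative.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
adaptive = clahe.apply(gray)

cv2.imwrite("equalized_face.png", equalized)
```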

3.2.2 Convolutional Neural Networks

A Convolutional Neural Network is a class of deep, feed-forward artificial neural network that has been proven to be very effective in areas such as image recognition and classification, particularly in FER systems. To understand these models, we firstly need to understand how a deep neural network works.

Deep feed-forward neural network

The deep feed-forward neural network (FNN), or multilayer perceptron (MLP), is the quintessential structure that supports deep learning models. The idea was inspired by the structure of the human brain, with the so-called perceptrons trying to recreate the firing of neurons [117].

In [44], the following definition of a FNN can be found:

The goal of a feed-forward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feed-forward network defines a mapping y = f(x;θ ) and learns the value of the parameters θ that results in the best function approximation.

Early neural networks were composed of three layers: the input layer, the hidden layer, and the output layer. Each layer of the network is connected to the previous layer in one direction only, so that nodes cannot form a cycle. The information in a feed-forward network only moves in one direction – from the input layer, through the hidden layers, to the output layer.

Figure 3.5 illustrates a simple and a deep feed-forward neural network.

Figure 3.5: Example of feed-forward neural networks. Left: One-layer feed-forward neural net-work. Right: Deep feed-forward neural network


The Universal Approximation Theorem states that "a perceptron with one hidden layer of finite width can arbitrarily accurately approximate any continuous function" [54]. However, a single-layer perceptron is a linear classifier and therefore is not able to distinguish data that is not linearly separable. Stacking hidden layers — which, combined with non-linear activation functions, creates a multilayer perceptron — increases the depth of the network, which requires fewer parameters to train and allows more complex functions to be approximated.

The goal of the training phase is to minimise some loss function. To do this, we apply the backpropagation algorithm: "The backpropagation algorithm allows the information from the loss to then flow backward through the network in order to compute the gradient" [44]. The negative gradient is the direction of steepest descent, so following it results in a decrease of the loss function. The size of the step taken by the algorithm is controlled by a parameter called the learning rate.

Usually, the gradient is computed over the entire dataset. When it is computed over a randomly selected batch of data, the algorithm is called Stochastic Gradient Descent (SGD); in its purest form, a single randomly shuffled sample is used in each iteration, while mini-batch variants use small batches. Small batches lead to a noisier gradient and can help escape local minima; on the other hand, large batches better represent the true direction of steepest descent, but at the cost of a more memory-intensive implementation. After each iteration, the algorithm updates the parameters, repeating the process until convergence or until the predefined maximum number of iterations is reached.
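To make the roles of the learning rate and the batch size tangible, the following is a minimal mini-batch SGD sketch on a synthetic linear least-squares problem; it is a toy illustration of the update rule described above, not the training procedure used later in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)     # noisy targets

w = np.zeros(5)
learning_rate = 0.05   # size of the step along the negative gradient
batch_size = 32        # smaller batches give noisier gradient estimates

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient of the MSE loss
        w -= learning_rate * grad                # step in the steepest-descent direction

print(np.round(w, 2))   # should be close to true_w after training
```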

Activation Function

The activation function is a mathematical function that determines whether there is enough informative input at a node to fire a signal to the next layer. This idea was inspired by the biological functioning of a neuron, illustrated in Figure 3.6.

Figure 3.6: Biological neuron and a possible mathematical model [135]

As stated previously, every neuron is connected to all the neurons in the previous layer. Each connection has an associated weight, and in each iteration a neuron adds all the incoming inputs multiplied by their corresponding connection weights, plus an optional bias. The sum of these inputs is then passed to the activation function, as illustrated in Figure 3.7.

Figure 3.7: Flow of information in an artificial neuron [148]

The most common activation functions used in computer vision and classification problems are listed below (a minimal NumPy sketch of these functions follows the list):

• Sigmoid function — is the standard logistic function and translates the input from [-∞,+∞] to [0,1], often used in binary classification problems. Being an exponential function, it is computationally expensive and can lead to a vanishing gradient during training, so it should not be used in hidden layers. The vanishing gradient problem is a consequence of the derivative of the sigmoid becoming very small in the saturating regions of the function, and therefore the updates to the weights of the network almost vanish, slowing down learning.

• Softmax function — is a generalization of the logistic function to multiple dimensions, being useful in multi-class classification problems. It is normally used as the last activation function of a neural network, normalizing its output to a probability distribution over the predicted output classes.

• Rectified Linear Unit (ReLU) function — is easy to compute, converges faster compared to Sigmoid, can be used in hidden layers and does not have the vanishing gradient problem. However, it can lead to the "dying ReLU" problem: since the function is zero for all negative values, once a neuron's output becomes negative it is unlikely to recover. Leaky ReLU is a variation that attempts to solve this problem by giving a small negative slope instead of zero for negative values, but most architectures covered in this work still prefer the ReLU function.
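A minimal NumPy sketch of these activation functions is given below; the helper names are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Normalizes a score vector into a probability distribution over classes."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small negative slope to avoid 'dying' units."""
    return np.where(x > 0, x, alpha * x)

scores = np.array([2.0, -1.0, 0.5])
print(sigmoid(scores), softmax(scores), relu(scores), leaky_relu(scores), sep="\n")
```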

A non-exhaustive list of activation functions can be found in Figure 3.8.

Figure 3.8: Most common activation functions [61]

Currently, image and video media files have high resolution. Consider a coloured RGB picture with a 250x250 resolution: a plain feed-forward neural network would have 250x250x3 = 187,500 input features, each holding a value from 0 to 255 that describes the pixel intensity at that point. If the hidden layer has 1,000 neurons, the network has 187,500x1,000 parameters to compute, which is a huge computational overhead. Moreover, simple networks are not translation invariant: if the same object appears at a different position in the image, the network may fail to recognize it. Convolutional Neural Networks overcome these problems and are therefore used extensively in diverse computer vision applications.

The basic building blocks of a CNN are the convolutional base, composed of convolutional layers and pooling layers and responsible for generating features from an image, and the classifier, usually composed of fully connected layers and responsible for classifying the image.

Convolutional layers

After the input layer, the convolution layer is the first layer of a CNN. The mathematical operation of convolution is applied to the input image: a kernel (or convolution filter) slides over the whole image, originating an activation map (or feature map). This mechanism performs an element-wise product between the kernel and the corresponding region of the previous layer, followed by a sum. Figure 3.9 illustrates a single convolution with a 3x3 filter.

A hyper-parameter is a value defined a priori to control the network learning process. In CNNs, the hyper-parameters with the greatest impact on the performance of the network are the following (a naive sketch illustrating them is given after this list):

• Kernel size — establishes the dimensions of the sliding window over the input. This hyper-parameter has a great impact on the image classification task: smaller kernel sizes extract useful information from local features, while larger kernel sizes can be relevant to extract larger, less detailed features.

• Padding — necessary when the kernel slides past the borders of the input, to prevent losing information at the corners of the image or shrinking the outputs. The most commonly used padding technique is the highly efficient zero-padding, which adds a border of zeros around the input, conserving the input's spatial size between convolutions.

• Stride — defines how many pixels the filter shifts at each step. Normally, a stride of one is used, which means that the filter slides over the input pixel by pixel.
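The naive sketch below (assuming NumPy) illustrates how kernel size, zero-padding and stride interact in a single convolution; in practice an optimised library routine would be used instead.

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    if padding > 0:
        image = np.pad(image, padding, mode="constant")  # zero-padding around the input
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product between the kernel and the current window, then summed
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # horizontal Sobel filter, as in Figure 3.9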

Figure 3.9: Example of a 3x3 convolution operation with the horizontal Sobel filter. This filter estimates the presence of light-dark transition zones, normally associated with the edges of an object [22]

As multiple convolutions are applied to the same input, the resulting feature maps are stacked to create the convolutional layer. Normally, the convolutional layers closer to the input learn basic features such as edges and corners, the middle layers learn filters that detect parts of objects (for faces, they might learn the representation of an eye or a nose) and the last layers hold a high-level representation of the object, being able to distinguish it in different shapes and positions. Considering that the convolution operation is a linear combination, a non-linear layer is normally added after each convolutional layer. Initially, Sigmoid functions were used, but recent developments showed that using ReLU decreases training time [122].

Pooling layers

Pooling is a non-linear downsampling operation that reduces the dimensionality of a feature map by applying an operation window of an arbitrary size. Figure 3.10 represents a 4x4 image with a 2x2 sub-sampling window that divides the image into four non-overlapping 2x2 matrices. In the case of average-pooling, the output is the average of the four values in each matrix; in the case of max-pooling, the maximum value of each matrix is taken. There are other non-linear functions to implement pooling that are not covered in this paper, since max-pooling is the most common. A pooling layer is commonly placed between convolutional layers (with the corresponding ReLU layer) and serves two principal purposes: the first is to reduce the number of parameters and the overall computational load of the network, and the second is to avoid overfitting. The pooling operation also helps CNNs achieve translation invariance [42].
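A naive sketch of both pooling operations over a 4x4 feature map, mirroring the example of Figure 3.10 (assuming NumPy; the names are illustrative):

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            # Non-overlapping window of the given size
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [3, 4, 1, 8]])
print(pool2d(fmap, mode="max"))      # [[6. 4.] [7. 9.]]
print(pool2d(fmap, mode="average"))  # [[3.75 2.25] [4.   4.5 ]]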

Figure 3.10: Average-pooling and max-pooling operation examples

Fully connected layers

Fully connected layers are often used as the final layers of a CNN. The neurons in these layers have full connections to all the activations in the previous layer. Frequently, they are preceded by a Flatten layer that transforms the three-dimensional data into the one-dimensional vector accepted by these layers. Combined with the Softmax function, the output of this layer (and of the overall network) is an N-dimensional vector, where N is the number of classes, and each value of the vector is the probability of the given picture belonging to the corresponding class.

A classical CNN will have a combination of these layers, as pictured in Figure 3.11 and sketched in code below. With the recent advances in the field, new and more efficient models have appeared with improved performance. Some of these models are discussed in the following section.

Figure 3.11: Convolutional Neural Network example [28]
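As an illustration, a small CNN of this classical form could be defined with tf.keras as sketched below; the input shape (48x48 grayscale face crops) and the seven output classes are assumptions made for a facial expression setting, not the exact configuration used in this work.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                               # grayscale face crop
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                              # 3D feature maps to 1D vector
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),                         # probabilities over 7 emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])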

Modern Architectures

Convolutional Neural Networks have been around since the 1990s, with the emergence of one of the very first CNNs, LeNet, in 1998 [80]. Between the late 1990s and early 2010s, as more data and computing power became available, CNNs began to produce increasingly interesting results. The year 2010 marks the first ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [118], which has since been the de facto benchmark for evaluating object detection and image classification models at large scale. ImageNet [27] is a large image database containing over 14 million hand-annotated images indicating which objects are present. Some of the most relevant architectures that emerged from this challenge are described below.


• AlexNet [76] (2012) — AlexNet was the first CNN to achieve a remarkable result in the competition, reaching a top-5 test error of 15.4%, whereas the second best result at the time had a top-5 test error rate of 26.4%. The network is composed of five convolutional layers, max-pooling layers, three fully connected layers and dropout layers. Dropout is a regularization technique that ignores (drops out) randomly chosen neurons (along with their connections) during training. With dropout, a network cannot fully rely on one feature, but rather has to learn robust features that are useful, preventing overfitting. The original work on AlexNet also included data augmentation techniques during the training phase and the use of the ReLU non-linear function.

• VGG [129] (2014) — The major improvement over AlexNet was the reduction of the convolution kernel size to 3x3. The VGG network, with nineteen convolution layers with a fixed filter size of 3x3 and stride and padding of one, interspersed with 2x2 max-pooling layers with stride two, achieved a 7.3% top-5 error rate in ILSVRC 2014.

• GoogLeNet/Inception [136] (2014) — The simplicity of the VGG network contrasts with the complexity of the GoogLeNet network, the winner of ILSVRC 2014 with a top-5 error rate of 6.7%. This network launched a new concept of CNN, proposing parallel modules called Inception modules. These modules allow the network to perform several convolution and pooling operations with different sizes in parallel, concatenating the results at the end. Another difference from AlexNet/VGG is the use of a Global Average Pooling layer instead of fully connected layers to reduce the spatial dimensions of a three-dimensional tensor to a one-dimensional tensor. Despite its complexity, with over a hundred layers, this network has around 12x fewer parameters than AlexNet, being computationally efficient.

• ResNet [50] (2015) — The winner of ILSVRC 2015 was a Residual Network with a top-5 error rate of 3.6%. As network depth kept increasing, new problems began to emerge, such as the vanishing gradient problem and the degradation problem, defined as "loss of meaningful information on the feed-forward loop as accuracy gets saturated and then degrades rapidly" [50]. To overcome these problems, residual blocks were proposed. In a network with residual blocks, each layer feeds into the next layer and, through skip connections, directly into layers 2 or 3 positions further away. Figure 3.12 illustrates a single residual block, and a minimal code sketch of such a block is given after this list. This architecture allows both the input and the loss to propagate much further through the network, making training remarkably more efficient. Combinations of different architectures have also begun to emerge, such as Xception, which merges concepts from the Inception and ResNet families into a new architecture.

• Efficient designs (2019) — the most recent trends in the field focus on highly parameter-efficient networks, with the advent of Neural Architecture Search (NAS) and the MobileNet families. MobileNet architectures strive to have the smallest possible number of parameters, allowing the networks to run on mobile devices, as the name suggests. MobileNet-V2 [123], for instance, has a top-5 error rate of 7.5% with only 6 million parameters. Another efficient design is NASNet [164], which has a defined search space of common CNN building blocks and tries to build the best child network using reinforcement learning. This network achieved a top-5 error rate of 3.8% with 88.9 million parameters. The most recent developments in the field focus on achieving the best possible accuracy with the smallest number of parameters [53].

Figure 3.12: An example of a residual block
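A minimal sketch of a residual block with an identity shortcut, written with the tf.keras functional API as an assumption; real ResNet blocks also include batch normalization and use projection shortcuts when the dimensions change.

from tensorflow.keras import layers

def residual_block(x, filters):
    # Assumes x already has `filters` channels, so the identity shortcut can be added directly
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])   # skip connection: add the block input back to its output
    return layers.Activation("relu")(y)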

Figure 3.13 illustrates the chronological path of the most relevant CNNs in terms of their accuracy, familiarity and size.

Figure 3.13: Evolution of the architecture milestones for the image recognition task, achieved using the ImageNet 2012 database. The size of the circles represents the number of parameters of the networks and the colours represent the familiarity between them [53]

Transfer Learning and model fine-tuning

Transfer Learning is a machine learning technique whereby a model trained and developed for a specific task is re-used on a similar task. Concretely, the weights and parameters of a network that has already gone through a training process with a large dataset (a pre-trained network) are used as the initialization for a new model being trained on a different dataset from the same domain, in a process called fine-tuning. Transfer learning and fine-tuning generally improve the generalization of a network (provided that there are sufficient samples) and often speed up training, yielding better performance than training the network from scratch [155].

The transfer learning process is represented in Figure 3.14.

Figure 3.14: Transfer learning process [58]

There are different transfer learning strategies that can improve the performance of a model, depending on the domain, the task at hand, and the availability of data. The most common strategy is to use the convolutional base of the pre-trained network as a feature extractor. The last layers of the network are removed and replaced by the classifier of the problem at hand, and the new network is then trained with the convolutional base frozen, so that the weights of the pre-trained network are not overwritten during the new learning process.

The pre-trained weights can provide excellent initial values and can still be adjusted by the training to better fit the problem. Therefore, training the entire model, or training only some specific layers while leaving the others frozen, are other alternatives for accomplishing transfer learning.
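A sketch of the feature-extraction strategy described above with tf.keras, assuming a MobileNetV2 base pre-trained on ImageNet and a seven-class emotion classifier on top; the choice of base network and classifier head is illustrative.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                           # freeze the pre-trained convolutional base

model = models.Sequential([
    base,                                        # feature extractor
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),       # new classifier for the problem at hand
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])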

To find the optimal model, other parameters of the network can also be fine-tuned. Adding batch normalization can accelerate the learning of a neural network, and weight regularization can prevent overfitting on unbalanced datasets. Automatically adjusting the learning rate upon plateauing can also improve the performance of a model. Finally, opting for an efficient optimization algorithm, such as Adam or RMSProp, can also lead to better results.
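These adjustments map onto standard tf.keras utilities, sketched below under the assumption that model, train_data and val_data already exist (for example, from the previous sketch).

from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import Adam

# `model`, `train_data` and `val_data` are assumed to be defined elsewhere
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),   # lower the learning rate upon plateauing
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_data, validation_data=val_data, epochs=50, callbacks=callbacks)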

3.3 Model Evaluation and Validation

In the case of a discrete model, the results can be organized in a Confusion Matrix, shown in Table 3.1, that compares the predicted values with the ground-truth values. True positives are data points classified as positive by the model that are indeed positive (correct predictions), and false negatives are data points the model identifies as negative that are in fact positive (incorrect predictions).


Table 3.1: Confusion Matrix

                            Predicted
                            Positive              Negative
  Actual    Positive        True Positive (TP)    False Negative (FN)
            Negative        False Positive (FP)   True Negative (TN)

The relevant evaluation measures are Accuracy, Precision, Recall, and F-Score.

Accuracy (Equation 3.1) refers to the fraction between the number of correct predictions and the total number of predictions. It is used to estimate the overall proportion of predictions that the model got right.

\text{Accuracy} = \frac{TP + TN}{\text{total}} \tag{3.1}

Precision (Equation 3.2) refers to the fraction between the number of correct positive predictions and the total number of positive predictions. It is used to estimate the proportion of correct predictions among those predicted as positive, and therefore the impact of false positives on the predictions.

\text{Precision} = \frac{TP}{TP + FP} \tag{3.2}

Recall (Equation 3.3) refers to the fraction between the number of correct positive predictions and the total number of predictions that are relevant to the model (i.e., all data points that should have been identified as positive). It is used to estimate the cost of false negatives in the model under consideration.

\text{Recall} = \frac{TP}{TP + FN} \tag{3.3}

F1 score (Equation 3.4) is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It represents how precise the classifier is (how many correct predictions), as well as how robust it is (by not missing a significant number of instances).

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.4}
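A short sketch computing these four measures directly from confusion-matrix counts in the binary case; the counts used in the example are illustrative.

def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))  # accuracy 0.85, precision 0.8, recall ~0.89, F1 ~0.84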

In the case of a dimensional model, the relevant measures are the Mean-Squared Error (MSE) and Pearson's r.

Mean-Squared Error (MSE) (Equation 3.5) measures the average squared difference between the estimated values and the actual values. It is a measure of the quality of an estimator, being always non-negative, with values closer to zero indicating a better estimator.
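With y_i denoting the actual values, \hat{y}_i the estimated values and n the number of samples, the MSE takes the standard form:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3.5}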
