FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Facial Expression Recognition:

Towards Meaningful Prior Knowledge

in Deep Neural Networks

Filipe Martins Marques

MASTER'S THESIS

Integrated Master in Bioengineering
Supervisor: Jaime dos Santos Cardoso
Second Supervisor: Pedro Miguel Martins Ferreira


Resumo

Facial expressions are, by definition, associated with how emotions are expressed and play a central role in communication. This makes facial expression an interdisciplinary domain spanning several sciences, such as behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as human-machine interaction or sign language recognition need an efficient facial expression recognition system.

Facial expression recognition methods have been studied and explored, demonstrating impressive performances in the detection of discrete emotions. However, these approaches are only notably effective in controlled environments, i.e., environments where illumination and pose are monitored. Facial expression recognition systems in computer vision applications, besides having room for improvement in controlled scenarios, need to be efficient in real scenarios, although the most recent methods have not reached a desirable performance in such environments.

Convolutional neural networks have been widely used in several computer vision and object recognition tasks. Recently, convolutional neural networks have also been applied to facial expression recognition. However, these methods have not yet reached their full potential in facial expression recognition, since training complex models on small databases, such as those available for facial expression recognition, usually results in overfitting. Taking this into account, it is necessary to study new neural network methods involving innovative training strategies.

In this dissertation, a new model is proposed in which different sources of domain knowledge are integrated. The proposed method aims to include information extracted from networks pre-trained on tasks from the same domain (image or object recognition), together with morphological and physiological information on facial expression. This inclusion of information is achieved by the regression of relevance maps that highlight key regions for facial expression recognition. It was studied to what extent the refinement of facial expression relevance maps and the use of features from other networks lead to better results in expression classification.

The proposed method achieved the best result when compared with the implemented state-of-the-art methods, thus showing the ability to learn expression-specific features. Moreover, the model is simpler (fewer parameters to be trained) and requires fewer computational resources. This demonstrates that an efficient inclusion of domain knowledge leads to more efficient models in tasks where the respective databases are limited.


Abstract

Facial expressions, by definition, are associated with how emotions are expressed. This makes facial expression an interdisciplinary domain transversal to behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as natural human-computer interaction or sign language recognition need an efficient facial expression recognition system.

Facial expression recognition (FER) methods have been deeply studied, and these methods show impressive performances on the detection of discrete emotions. However, these approaches only have remarkable efficiency in controlled environments, i.e., environments where illumination and pose are monitored. The integration of FER systems in computer vision applications needs to be efficient in real-world scenarios. However, current state-of-the-art methods do not reach accurate expression recognition in such environments.

Deep convolutional neural networks have been widely used in several computer vision tasks involving object recognition. Recently, deep learning methods have also been applied to facial expression recognition. Nonetheless, these methods have not reached their full potential in the FER task, as training high-capacity models on small datasets, such as the ones available in the FER field, usually results in overfitting. In this regard, further research on novel deep learning models and training strategies is of crucial significance.

In this dissertation, a novel neural network that integrates different sources of domain knowledge is proposed. The proposed method integrates knowledge transferred from networks pre-trained on similar recognition tasks with prior knowledge of facial expression. The prior knowledge integration is achieved by means of the regression of maps with meaningful spatial features for the model. Further experiments and studies were performed to assess whether refined regressed maps of facial landmarks and features transferred from other networks can lead to better performances.

The proposed method outperforms the implemented state-of-the-art methods and shows the ability to learn expression-specific features. Besides, the network is simpler and requires fewer computational resources. Thus, it is demonstrated that an effective use of prior knowledge can lead to more efficient models in tasks where large datasets are not available.


Acknowledgments

First of all, I would like to thank the Faculty of Engineering of University of Porto and all the people that I met in university, from teachers to my colleagues and friends, for all the education, the support and the strength to make me complete this course in these five years and, most importantly, for making me discover my potential.

To my supervisor, Professor Jaime Cardoso, for all the orientation, support and experience. To the INESC-TEC, for the facilities, kindness and networking provided. To my second supervisor, Pedro Ferreira, for all the patience, the availability to help me, the experience, dedication and motivation when I needed.

To Tiago, for bringing out the dedicated worker that was hidden in me. Thank you as well for all the motivation, time, patience and joy. To Inês, for being my voice of reason. To Joana, for bringing out my free spirit. To Rita, for growing up with me and helping me build my personality. To my parents, for being the best they can be every day, for all the things that I cannot enumerate here and, especially, for all the love.

To my siblings, for annoying me since forever, but most importantly, for being here for me all the time as well.

To my family, for teaching me moral values and lessons that I will carry all my life. To my grandfather that is looking down on me somewhere.

To all my friends who handle me and let me be just as I am.

Filipe Marques


Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Contributions
  1.5 Dissertation Outline

2 Background
  2.1 Facial Expressions
  2.2 Pre-processing
  2.3 Feature Descriptors for Facial Expression Recognition
    2.3.1 Local-Binary Patterns (LBP)
    2.3.2 Gabor Filters
  2.4 Learning and classification
    2.4.1 Support Vector Machine (SVM)
    2.4.2 Deep Convolutional Neural Networks (DCNNs)
  2.5 Model Selection

3 State-of-the-Art
  3.1 Face detection (from Viola & Jones to DCNNs)
  3.2 Face Registration
  3.3 Feature Extraction
    3.3.1 Traditional (geometric and appearance)
    3.3.2 Deep Convolutional Neural Networks
  3.4 Expression Recognition
  3.5 Summary

4 Implemented Reference Methodologies
  4.1 Hand-crafted based approaches
  4.2 Conventional CNN
    4.2.1 Architecture
    4.2.2 Learning
    4.2.3 Regularization
  4.3 Transfer Learning Based Approaches
    4.3.1 VGG16
    4.3.2 FaceNet
  4.4 Physiological regularization
    4.4.1 Loss Function
    4.4.2 Supervised Term
    4.4.3 Unsupervised Term

5 Proposed Method
  5.1 Architecture
    5.1.1 Representation Module
    5.1.2 Facial Module
    5.1.3 Classification Module
  5.2 Loss Function
  5.3 Iterative refinement

6 Results and Discussion
  6.1 Implementation Details
  6.2 Relevance Maps
  6.3 Results on CK+
  6.4 Results on SFEW

7 Conclusions

References


List of Figures

2.1 Study of FEs produced by electrically stimulating facial muscles.
2.2 Examples of AUs on the FACS.
2.3 Illustration of the neutral expression and the six basic emotions.
2.4 A typical framework of an HOG-based face detection method.
2.5 Example of LBP calculation.
2.6 Example of an SVM classifier applied to features extracted from faces.
2.7 Architecture of a deep network for FER.
2.8 Visualization of the activation maps for different layers.
2.9 Batch-Normalization applied to the activations x over a mini-batch.
2.10 Dropout Neural Network Model.
2.11 Common pipeline for model selection.
3.1 Multiple face detection in uncontrolled scenarios.
3.2 Architecture of the Multi-Task CNN for face detection.
3.3 Detection of AUs based on geometric features.
3.4 Spatial representations of the main approaches for feature extraction.
3.5 Framework of the curriculum learning method.
4.1 Illustration of the pre-processing.
4.2 Illustration of the implemented geometric feature computation.
4.3 Architecture of the conventional deep network used.
4.4 Examples of the implemented data augmentation process.
4.5 Network configurations of VGG.
4.6 Original image followed by density maps obtained by a superposition of Gaussians at the location of each facial landmark, with an increasing value of σ.
5.1 Architecture of the proposed network. The relevance maps are produced by regression from the facial component module, which is composed of an encoder-decoder. The maps x̂ are operated (⊗) with the feature representations (f) output by the representation module and then fed to the classification module, predicting the class probabilities (ŷ).
5.2 Pipeline for feature extraction from FaceNet. Only the layers before pooling operations are represented. GAP: Global Average Pooling.
5.3 Facial Module architecture.
5.4 Illustrative examples of the facial landmarks computation for the SFEW dataset.
5.5 Architecture of the proposed network for iterative refinement.
6.1 Samples from the CK+ and SFEW databases.
6.2 Examples of predicted relevance maps for the different methods used.
6.3 Frame-by-frame analysis of the relevance maps.
6.4 Class distribution on the CK+ database.
6.5 Confusion matrix for the CK+ database.
6.6 Class distribution for the SFEW database.


List of Tables

6.1 Hyperparameter sets.
6.2 Performance achieved by the traditional baseline methods on CK+.
6.3 CK+ experimental results.
6.4 SFEW experimental results.


Abbreviations

AAM Active Appearance Model

AFEW Acted Facial Expression In The Wild

AU Action Unit

BN Batch-Normalization

CNN Convolutional Neural Network

CK Cohn-Kanade

FACS Facial Action Coding System

FE Facial Expression

FER Facial Expression Recognition

GPU Graphics Processing Unit

HCI Human–Computer Interaction

HOG Histogram of Oriented Gradients

kNN k-Nearest Neighbors

LBP Local Binary Patterns

MTCNN Multi-Task Convolutional Neural Network

NMS Non-Maximum Suppression

P-NET Proposal Network

PCA Principal Component Analysis

R-NET Refine Network

ReLU Rectified Linear Units

SFEW Static Facial Expressions in the Wild

SVM Support Vector Machine


Chapter 1

Introduction

1.1 Context

In psychology, emotion refers to the conscious and subjective experience that is characterized by mental states, biological reactions and psychological or physiological expressions, i.e., Facial Expressions (FEs). It is common to relate FEs to affect, as affect can be defined as the experience of emotion and is associated with how the emotion is expressed. Together with voice, language, hands and body posture, FEs form a fundamental communication system between humans in social contexts.

Facial expressions were introduced as a research field by Charles Darwin in his book "The Expression of the Emotions in Man and Animals" [1]. Darwin questioned whether facial expressions had some instrumental purpose in evolutionary history. For example, lifting the eyebrows might have helped our ancestors respond to unexpected environmental events by widening the visual field and therefore enabling them to see more. Even though their instrumental function may have been lost, facial expressions remain in humans as part of our biological endowment, and therefore we still lift our eyebrows when something surprising happens in the environment, whether seeing more is of any value or not. Since then, FEs have been established as one of the most important features of human emotion recognition.

1.2 Motivation

Expression recognition is a task that human beings perform daily and effortlessly, but it is not yet easily performed by computers. By definition, Facial Expression Recognition (FER) involves the identification of cognitive activity, the deformation of facial features and facial movements. Computationally speaking, the FER task is performed on static images or sequences of images. The purpose is to categorize them into different abstract classes based on visual facts only.

In the last few years, automated FER has attracted much attention in the research community due to its wide range of applications. In technology and robotic systems, several robots with social skills, such as Sony's AIBO and ATR's Robovie, have been developed [2]. In education,


FER can also play a role in detecting students' frustration, thereby improving e-learning experiences [3]. The game industry is already investing in the expansion of the gaming experience by adapting difficulty, music, characters or missions according to the player's emotional responses [4, 5].

In the medical field, emotional assessment can be used in multiple conditions. For instance, pain detection is used for monitoring the patient's progress in clinical settings, and depression recognition from FEs is a very important application for the analysis of psychological distress [6, 7]. Facial expression also plays a significant role in several diseases. In autism, for instance, emotions are not expressed the same way, and understanding how basic emotions work and how they are conveyed in autism could lead to therapy improvements [8]. Deafness also leads to an adaptation of communication, with sign language being the common means of communication. In sign language, FEs can play a significant role. Facial and head movements are used in sign languages at all levels of linguistic structure. At the phonological level, some signs have an obligatory facial component in their citation form. Facial actions mark relative clauses, content questions and conditionals, amongst others [9]. Therefore, the integration of automated FER is essential for an efficient automated sign language recognition system.

Several automated FER methods have been proposed and have demonstrated remarkable performances in highly controlled environments (i.e., high-resolution frontal faces with uniform backgrounds). However, automatic FER in real-world scenarios is still a very challenging task. Those challenges are mainly related to the inter-individual variability of facial expressiveness and to the different acquisition conditions.

Most machine learning methods are task-specific: the representation (features) is first extracted and, then, a classifier is learned from it. Deep learning can be seen as the part of machine learning methods that is able to jointly learn the classification and the representation of data. Deep learning approaches learn data representations with multiple levels of abstraction, leading to features that traditional methods could not extract. The recent success of deep networks relies on the current availability of large labeled datasets and on advances in GPU technology. In some computer vision tasks, the availability of diverse and large datasets is scarce. To overcome this, dedicated deep training strategies are needed. State-of-the-art methods use strategies such as data augmentation, dropout and ReLU, and reach optimal results in most object recognition tasks.

FER is one of the cases where only small datasets are available. Current state-of-the-art strategies for deep neural networks achieved satisfactory results in controlled environments, but when applied to expressions in the wild the performance decays abruptly. Therefore, novel regularization strategies for deep neural networks are needed in order to develop a robust system that is able to recognize emotions in natural environments.

1.3 Goals

The purpose of this dissertation is the development of fundamental work on FER and the proposal of a novel method mainly based on deep neural networks. In particular, the main goal of this dissertation is the proposal and development of a deep learning model for FER that explicitly models the facial key-point information along with the expression classification. The underlying idea is to increase the discriminative ability of the learned features by regularizing the entire learning process and, hence, improve the generalization capability of deep models to the small datasets of FER.

1.4 Contributions

In this dissertation, our efforts were targeted towards the development and analysis of different deep learning architectures and training strategies to deal with the problem of training deep models on small datasets. In this regard, the main contributions of this work can be summarized as follows:

• The implementation of several baseline and state-of-the-art methods for FER, in order to provide a fair comparison and evaluation of different approaches. The implemented methods include traditional methods based on hand-crafted features and state-of-the-art methods based on deep neural networks, such as transfer learning approaches and methods that intend to integrate physiological knowledge into FER.

• The development of a novel deep neural network that, by integrating different sources of prior knowledge, achieves state-of-the-art performances. The proposed method integrates knowledge transferred from pre-trained networks jointly with physiological knowledge of facial expression.

1.5 Dissertation Outline

This dissertation starts with a historical overview of expression recognition, followed by the exposition of the FER pipeline, in Chapter 2. Chapter 3 looks over the state of the art on FER, presenting the relevant works proposed for each step of the FER pipeline (ranging from face detection to expression recognition). Chapter 4 details the methodology followed for the implementation of the baseline and state-of-the-art methods. Chapter 5 focuses on the proposed method. Chapter 6 describes the databases and the implementation details, followed by the results and a discussion of the findings. Finally, Chapter 7 draws the main conclusions of the performed study and discusses future work.


Chapter 2

Background

Human faces generally reflect inner feelings/emotions, and hence facial expressions are susceptible to changes in the environment. Expression recognition assists in interpreting states of mind and distinguishes between various facial gestures. In fact, FE and FER are interdisciplinary domains standing at the crossing of behavioral science, neurology, and artificial intelligence.

For instance, in early psychology, Mehrabian [10] found that only 7% of the whole information that a human expresses is conveyed through language, 38% through speech, and 55% through facial expression. FER aims to develop automatic, efficient and accurate systems to distinguish the facial expressions of human beings, so that human emotions, such as happiness, sadness, anger, fear, surprise and disgust, can be understood through facial expression. The developments in FER hold potential for computer vision applications, such as natural human-computer interaction (HCI), human emotion analysis and interactive video.

Section 2.1 starts with a historical overview of facial expressions, followed by how human emotion can be described. The default FER pipeline is then detailed, beginning with pre-processing in Section 2.2. A technical explanation of how the main feature descriptors work is presented in Section 2.3. These descriptors are then fed into a classifier for learning purposes: Section 2.4 covers how the learning is performed. The chapter ends with an overview of a model selection strategy in Section 2.5.

2.1 Facial Expressions

Duchenne de Boulogne believed that the human face works as a map whose features can be codified into universal taxonomies of mental states. This led him to conduct one of the first studies on how FEs are produced, by electrically stimulating facial muscles (Figure 2.1) [11]. At the same time, Charles Darwin also studied FEs and hypothesized that they must have had some instrumental purpose in evolutionary history. For instance, constricting the nostrils in disgust served to reduce the inhalation of noxious or harmful substances [12].


Following these works, Paul Ekman claimed that there is a set of facial expressions that are innate and that mean that the person making that face is experiencing an emotion [13], defending the universality of facial expression. Further studies support that there is a high degree of consistency in facial musculature among the peoples of the world. The muscles necessary to express the primary emotions are found universally, and homologous muscles have been documented in non-human primates [14] [15].

Figure 2.1: Study of FEs produced by electrically stimulating facial muscles [11].

Physiological specificity is also documented. Heart rate and skin temperature vary with the basic emotions. For instance, in anger, blood flow to the hands increases to prepare for a fight. Left-frontal asymmetry is greater during enjoyment, while right-frontal asymmetry is greater during disgust. This evidence supports the argument that emotion expressions reliably signal action tendencies [16] [17].

Facial expression signals emotion, communicative intent, individual differences in personality, and psychiatric and medical status, and helps to regulate social interaction. With the advent of automated methods of FER, new discoveries and improvements became possible.

The description of human expressions and emotions can be divided into two main categories: categorical and dimensional description.

It is common to classify emotions into distinct classes, essentially due to the studies of Darwin and Ekman [13]. Affect recognition systems aim at recognizing the appearance of facial actions or the emotions conveyed by those actions. The former set of systems usually relies on the Facial Action Coding System (FACS) [18]. FACS consists of facial Action Units (AUs), which are codes that describe facial configurations. Some examples of AUs are presented in Figure 2.2.

The temporal evolution of an expression is typically modeled with four temporal segments: neutral, onset, apex and offset. Neutral is the expressionless phase with no signs of muscular activity. Onset corresponds to the period during which muscular contraction begins and increases in intensity. Apex is a plateau where the intensity usually reaches a stable level, whereas offset is the phase of muscular relaxation [18]. Usually, the order of these phases is neutral-onset-apex-offset. The analysis and comprehension of AUs and temporal segments are studied in psychology, and their recognition enables the analysis of sophisticated emotional states, such as pain, and helps to distinguish between genuine and posed behavior [19].

Figure 2.2: Examples of AUs on the FACS [18].

Systems and models that recognize emotions can recognize basic or non-basic emotions. Basic emotions come from the affect model developed by Paul Ekman, which describes six basic and universal emotions: happiness, sadness, surprise, fear, anger and disgust (see Figure 2.3).

Figure 2.3: Illustration of the neutral expression and the six basic emotions. The images are extracted from the JAFFE database [20].

Basic emotions are believed to be limited in their ability to represent the broad range of everyday emotions [19]. More recently, researchers have considered non-basic emotion recognition using a variety of alternatives for modeling non-basic emotions. One approach is to define an extra set of emotion classes, for instance, relief or contempt [21]. In fact, the Cohn-Kanade database, a popular database for FE, includes contempt as an emotion label.

Another approach, which represents a wider range of emotions, is continuous modeling using affect dimensions [22]. These dimensions include how pleasant or unpleasant a feeling is, how likely the person is to take action under the emotional state, and the sense of control over the emotion. Due to the higher dimensionality of such descriptions, they can potentially describe more complex and subtle emotions. Nonetheless, the richness of the space is more difficult to use for automatic recognition systems, because it can be challenging to link such a described emotion to an FE [12].

For automatic classification systems, it is common to simplify the problem and adopt a categorical description of affect by dividing the space into the limited set of categories defined by Paul Ekman. This is the approach followed in this dissertation.

2.2 Pre-processing

The default pipeline of a FER system includes face detection and alignment as a first step. This is considered a pre-processing of the original image and is covered in this section. Face detection and posterior alignment can be achieved using classical approaches, such as the Viola & Jones algorithm and HOG descriptors, or by deep learning approaches.

The Viola & Jones object detection framework [23] was proposed by Paul Viola and Michael Jones in 2001 as the first framework to give competitive object detection rates. It can be used for detecting objects in real time, but it is mainly applied to face detection. Besides processing images quickly, another advantage of the Viola & Jones algorithm is its low false positive rate. The main goal is to distinguish faces from non-faces. The main steps of this algorithm can be summarized as follows:

(1) Haar Feature Selection: Human faces share similar properties (e.g., the eye region is darker than the nose bridge region). These properties can be matched using Haar features, also known as digital image features based upon Haar basis functions. A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums. This difference is then used to categorize subsections of an image.

(2) Creating an Integral Image: The integral image computes, at each pixel (x, y), the sum of the pixel values above and to the left of (x, y), inclusive. This image representation allows computing rectangular features such as Haar-like features, speeding up the extraction process. As each feature's rectangular area is always adjacent to at least one other rectangle, any two-rectangle feature can be computed in just six array references.

(3) AdaBoost Training: AdaBoost is a classification scheme that works by combining weak learners into a more accurate ensemble classifier. The training procedure consists of multiple boosting rounds. During each boosting round, the goal is to find a weak learner that achieves the lowest weighted training error. Then, the weights of the misclassified training samples are raised. At the end of the training process, the final classifier is given by a linear combination of all weak learners. The weight of each learner is directly proportional to its accuracy.

(4) Cascading Classifiers: The attentional cascade starts with simple classifiers that are able to reject many of the negative (i.e., non-face) sub-windows, while keeping almost all positive (i.e., face) sub-windows. That is, a positive response from the first classifier triggers the evaluation of a second, more complex classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window. A minimal usage sketch of such a cascade detector is shown after these steps.
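As an illustration of this framework in practice, the sketch below runs a pre-trained Viola & Jones cascade with OpenCV. The cascade file, image path and parameter values are assumptions for the example, not part of the original text.

```python
import cv2

# Load a pre-trained Viola & Jones face cascade shipped with OpenCV
# (any frontal-face cascade file works; this one is an assumption).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Read an input image and convert it to gray-scale, as the detector
# operates on intensity values only ("example_face.jpg" is hypothetical).
image = cv2.imread("example_face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Run the attentional cascade over a sliding window at multiple scales.
# scaleFactor controls the image pyramid; minNeighbors is the number of
# overlapping detections required to accept a window as a face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```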

Another common method for face detection is the extraction of HOG descriptors to be fed into a Support Vector Machine (SVM) classifier. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. The HOG representation has several advantages. It captures the edge or gradient structure that is very characteristic of local shape, and it does so in a local representation with an easily controllable degree of invariance to local geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the local spatial or orientation bin size [24].

In practice, this is implemented by dividing the image frame into cells and computing, for each cell, a local 1-D histogram of gradient directions or edge orientations over the pixels of the cell. The combined histogram entries form the representation [24]. The feature vector is then fed into an SVM classifier to determine whether there is a face in the image or not. A representation of this framework can be found in Figure 2.4.

Figure 2.4: A typical framework of an HOG-based face detection method [24].
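A minimal sketch of this HOG-plus-SVM pipeline, assuming face/non-face patches are already available as fixed-size gray-scale arrays; the variable names, patch size and placeholder data are illustrative, not taken from the original text.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(patch):
    # Divide the patch into cells, build a histogram of gradient
    # orientations per cell and normalize over blocks of cells.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# patches: 64x64 gray-scale arrays; labels: 1 = face, 0 = non-face
# (random placeholder data for illustration only).
patches = [np.random.rand(64, 64) for _ in range(20)]
labels = [1] * 10 + [0] * 10

X = np.array([hog_descriptor(p) for p in patches])
clf = LinearSVC(C=1.0).fit(X, labels)

# A new patch is classified as face/non-face from its HOG descriptor.
prediction = clf.predict([hog_descriptor(np.random.rand(64, 64))])
```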

More recently, deep learning methods have shown efficiency in most computer vision tasks and hold the state of the art in face detection as well. A detailed description of the fundamentals of deep learning can be found in Section 2.4.2, and the state of the art in deep learning methods for face detection in Section 3.1.

2.3 Feature Descriptors for Facial Expression Recognition

After face detection, the facial changes caused by facial expressions have to be extracted. This subsection presents two of the most widely used feature descriptors, namely Local Binary Patterns (LBP) and Gabor filters.

2.3.1 Local-Binary Patterns (LBP)

Local Binary Patterns (LBP) were first presented in [25] for use in texture description. The basic method labels each pixel with a decimal value, called the LBP or LBP code, which describes the local structure around that pixel. As illustrated in Figure 2.5, the value of the center pixel is subtracted from the values of its 8 neighbors; if the result is negative, the binary value is 0, otherwise 1. The calculation starts from the pixel at the top-left corner of the 8-neighborhood and continues in the clockwise direction. After computing all neighbors, an eight-digit binary value is produced. When this binary value is converted to decimal, the LBP code of the pixel is generated and placed at the coordinates of that pixel in the LBP matrix.


Figure 2.5: Example of LBP calculation, extracted from [26].
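The basic 3x3 LBP operator described above can be written in a few lines of NumPy; the sketch below assumes a 2-D gray-scale image and leaves the 1-pixel border at zero for simplicity (function and variable names are illustrative).

```python
import numpy as np

def lbp_code(image, r, c):
    """Compute the 8-bit LBP code of pixel (r, c) of a gray-scale image."""
    center = image[r, c]
    # 8 neighbors, starting at the top-left corner, clockwise.
    neighbors = [image[r-1, c-1], image[r-1, c], image[r-1, c+1],
                 image[r,   c+1], image[r+1, c+1], image[r+1, c],
                 image[r+1, c-1], image[r,   c-1]]
    bits = [1 if n >= center else 0 for n in neighbors]
    # Interpret the 8 thresholded neighbors as a binary number.
    return sum(b << i for i, b in enumerate(reversed(bits)))

def lbp_image(image):
    """LBP map of the whole image (border pixels left at 0)."""
    out = np.zeros_like(image, dtype=np.uint8)
    for r in range(1, image.shape[0] - 1):
        for c in range(1, image.shape[1] - 1):
            out[r, c] = lbp_code(image, r, c)
    return out
```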

2.3.2 Gabor Filters

The Gabor filter is one of the most popular approaches for texture description. Gabor filter-based feature extraction consists in the application of a Gabor filter bank to the input image, defined by its parameters, including frequency (f), orientation (θ) and the smoothing parameters of the Gaussian envelope (σ). This makes the approach invariant to illumination, rotation, scale and translation. Gabor filters are based on the following function [27]:

\Psi(u, v) = e^{-\frac{\pi^2}{f^2}\left(\gamma^2 (u' - f)^2 + \eta^2 v'^2\right)} \quad (2.1)

u' = u\cos\theta + v\sin\theta \quad (2.2)

v' = -u\sin\theta + v\cos\theta \quad (2.3)

In the frequency domain (Eqs. 2.1-2.3), the function is a single real-valued Gaussian centered at f. γ is the sharpness (bandwidth) along the Gaussian major axis and η is the sharpness along the minor axis (perpendicular to the wave). In the given form, the aspect ratio of the Gaussian is η/γ. Gabor features, also referred to as Gabor jets, Gabor banks or multi-resolution Gabor features, are constructed from the responses of Gabor filters by using multiple filters with several frequencies and orientations. The scales of a filter bank are selected with exponential spacing and the orientations with linear spacing. These filters are then convolved with the image, in order to obtain different representations of the image to be used as descriptors.
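A small multi-frequency, multi-orientation Gabor filter bank of the kind described above can be built with scikit-image; the specific frequencies and number of orientations below are illustrative choices, not values from the original text.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import gabor_kernel

def gabor_features(image, frequencies=(0.1, 0.2, 0.4), n_orientations=4):
    """Convolve an image with a Gabor filter bank and return the responses."""
    responses = []
    for f in frequencies:                      # exponentially spaced scales
        for k in range(n_orientations):        # linearly spaced orientations
            theta = k * np.pi / n_orientations
            kernel = np.real(gabor_kernel(f, theta=theta))
            # One filtered representation per (frequency, orientation) pair.
            responses.append(ndimage.convolve(image, kernel, mode="wrap"))
    return responses

# Usage with a dummy gray-scale face crop:
face = np.random.rand(64, 64)
bank_responses = gabor_features(face)
```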

2.4 Learning and classification

Given a collection of extracted features, it is necessary to build a model capable of correctly separating and classifying the expressions. Traditional FER systems use a three-stage training procedure: (i) feature extraction/learning, (ii) feature selection, and (iii) classifier construction. On the other hand, FER systems based on deep learning techniques merge these three stages into a single step. This section presents an overview of one of the most widely used traditional classifiers, the Support Vector Machine (SVM), as well as one of the most relevant deep learning approaches, Convolutional Neural Networks (CNNs).

2.4.1 Support Vector Machine (SVM)

The Support Vector Machine [28] performs an implicit mapping of the data into a higher (potentially infinite) dimensional feature space, and then finds a linear separating hyperplane with the maximal margin to separate the data in this higher dimensional space. Given a training set of labeled examples, a new test example x is classified by the following function:

f(x) = \mathrm{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b\right), \quad (2.4)

where the α_i are the Lagrange multipliers of a dual optimization problem that describe the separating hyperplane, K is a kernel function, and b is the threshold parameter of the hyperplane. A training sample x_i with α_i > 0 is called a support vector, and the SVM finds the hyperplane that maximizes the distance between the support vectors and the hyperplane. The SVM allows a domain-specific selection of the kernel function. Though new kernels are being proposed, the most frequently used kernel functions are the linear, polynomial, and Radial Basis Function (RBF) kernels. The SVM makes binary decisions, so multi-class classification is accomplished by using, for instance, the one-against-rest technique, which trains binary classifiers to discriminate one expression from all others and outputs the class with the largest binary classification output. The selection of the SVM hyper-parameters can be optimized through a k-fold cross-validation scheme. The parameter setting producing the best cross-validation accuracy is picked [29].

In general, SVMs exhibit good classification accuracy even when only a modest amount of training data is available, making them particularly suitable for expression recognition [30]. Figure 2.6 represents a possible pipeline for FER using feature descriptors along with an SVM classifier.
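As a concrete illustration, the sketch below trains a multi-class SVM with an RBF kernel in a one-against-rest scheme on pre-computed feature vectors (e.g., the LBP or Gabor descriptors from Section 2.3); the data here is random placeholder data and the hyper-parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Placeholder descriptors: 120 samples, 256-dimensional features,
# labels drawn from the six basic expressions plus neutral (7 classes).
X = np.random.rand(120, 256)
y = np.random.randint(0, 7, size=120)

# RBF-kernel SVM wrapped in a one-against-rest scheme: one binary
# classifier per expression, the class with the largest score wins.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)

predicted_expression = clf.predict(X[:1])
```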


2.4.2 Deep Convolutional Neural Networks (DCNNs)

Recently, deep learning methods have shown to be efficient in many computer vision tasks, such as pattern recognition problems, character recognition, object recognition or autonomous robot driving. Deep learning models are composed of consecutive processing layers that learn representations of data with multiple levels of abstraction, capturing features that traditional methods could not compute. One of the factors that allows the computation of complex features is the back-propagation algorithm, which indicates how a machine should change its internal parameters to compute new representations of the input data [31].

The emergent success of CNNs in recognition and segmentation tasks can be explained by three factors: (1) the availability of large labeled training sets; (2) the recent advances in GPU technology, which allow training large CNNs in a reasonable computation time; (3) the introduction of effective regularization strategies that greatly improve the models' generalization capacity. However, in the FER context, the availability of large training sets is scarce, raising the need for strategies to improve the models.

CNNs learn to extract features directly from the training database using iterative algorithms such as gradient descent. An ordinary CNN learns its weights using the back-propagation algorithm. A CNN has two main components, namely local receptive fields and shared weights. With local receptive fields, each neuron is connected to a local group of the input space. The size of this group is equal to the filter size, where the input space can be either pixels from the input image or features from the previous layer. In a CNN, the same weights and bias are used over all local receptive fields, which significantly decreases the number of parameters of the model. However, the increased complexity and depth of a typical CNN architecture make the network prone to overfit [32]. CNNs can have multiple architectures, but the standard is a series of convolutional layers that produce a certain number of feature maps, given by the number of filters defined for the convolutions, leading to different image representations. This is followed by pooling layers. Max pooling, the most common pooling layer, applies a max filter to (usually) non-overlapping subregions of the current representation, reducing its dimensionality. Then, these representations are fed into fully-connected layers, which can be seen as a multilayer perceptron that maps the activation volume, resulting from the combination of the previous layers, into a class probability distribution. The network ends with an affine layer that computes the scores [33] [34].

Figure 2.7 represents a possible network for facial expression recognition applying regulariza-tion methods.

It is useful to understand the features that are being extracted by the network in order to understand how training and classification are performed. Figure 2.8 shows the visualization of the activation maps of different layers. It can be seen that the deeper the layers are, the more sparse and localized the activation maps become [34].


Figure 2.7: The architecture of the deep network proposed in [34] for FER.
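A minimal sketch of such a convolution/pooling/fully-connected architecture, written with the Keras API; the layer sizes, input resolution and seven-class output are illustrative assumptions, not the network proposed in [34].

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Small CNN: stacked convolution + max-pooling blocks followed by a
# fully-connected classifier with a softmax over 7 expression classes.
model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),              # gray-scale face crop
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),                  # reduce spatial resolution
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),         # fully-connected layer
    layers.Dense(7, activation="softmax"),        # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```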

2.4.2.1 Activation Functions

An activation function is a non-linear transformation that defines the output of a specific node given a set of inputs. Activation functions decide whether a neuron should be activated, so they assume an important role in network design. The most commonly used activation functions are presented below (a short code sketch of the non-linear functions follows the list):

• Linear Activation: The activation is proportional to the input: the input x is transformed to ax. This can be applied to various neurons, and multiple neurons can be activated at the same time. The issue with a linear activation function is that the whole network becomes equivalent to a single layer with linear activation.

• Sigmoid function: In general, a sigmoid function is real-valued, monotonic, and differentiable, having a non-negative, bell-shaped first derivative. The function ranges from 0 to 1 and has an S shape. This means that small changes in x bring large changes in the output Y, which is desired when performing a classification task, since it pushes the predictions to extreme values. The sigmoid function can be written as follows:

Y = \frac{1}{1 + e^{-x}} \quad (2.5)

• ReLU: ReLU is the most widely used activation function, since it is known for having better fitting abilities than the sigmoid function [35]. The ReLU function is non-linear, so the error can be back-propagated through it. ReLU can be written as:

Y = \max(0, x) \quad (2.6)

It gives an output equal to x if x is positive and 0 otherwise. Only specific neurons are activated, making the network sparse and efficient to compute.

• Softmax: In classification problems, the output is commonly multi-class. The sigmoid function can only handle two classes, so softmax is used to output the probabilities of each class. The softmax function converts the output of each unit to a value between 0 and 1, just like a sigmoid function, but it also divides each output such that the total sum of the outputs is equal to 1. The output of the softmax function is therefore equivalent to a categorical probability distribution. Mathematically, the softmax function is shown below:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad (2.7)

where z is the vector of inputs to the output layer and j indexes the output units, so j = 1, 2, ..., K.
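The non-linear activations above translate directly into NumPy; this is a plain sketch of Eqs. 2.5-2.7 (the shift by the maximum in the softmax is a standard numerical-stability trick, not part of the original formulas).

```python
import numpy as np

def sigmoid(x):
    # Eq. 2.5: squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Eq. 2.6: identity for positive inputs, zero otherwise.
    return np.maximum(0.0, x)

def softmax(z):
    # Eq. 2.7: exponentiate and normalize so the outputs sum to 1.
    e = np.exp(z - np.max(z))          # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                 # class probabilities summing to 1
```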

2.4.2.2 Regularization

As mentioned previously, CNNs can easily overfit. To avoid overfitting, regularization methods can be applied. Regularization techniques can be seen as the imposition of certain prior distributions on the model parameters.

Batch-Normalization is a method known for reducing the internal covariate shift in neural networks [36]. To increase the stability of a neural network, batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. Figure 2.9 presents a representation of the Batch-Normalization transform.

Figure 2.9: Batch-Normalization applied to the activations x over a mini-batch. Extracted from [36]

In the notation y = BN_{γ,β}(x), the parameters γ and β have to be initialized and will be learned.

Dropout is widely used to train deep neural networks. Unlike other regularization techniques that modify the cost function, dropout modifies the architecture of the model, since it forces the network to drop different neurons across iterations. Dropout can be used between convolutional layers or only in the classification module. A dropped neuron's contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass [37].


Dropout reduces complex co-adaptations of neurons. Since a neuron cannot rely on the presence of particular other neurons, it is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons (see Figure 2.10).

Figure 2.10: Dropout Neural Net Model. Left: a standard neural net with 2 hidden layers. Right: an example of a thinned net produced by applying dropout to the network on the left. Extracted from [37].

Data augmentation is one of the most common methods to reduce overfitting on image data by artificially enlarging the dataset using label-preserving transformations. The main techniques are classified as data warping, an approach that seeks to directly augment the input data to the model in the data space [38]. The generic practice is to perform geometric and color augmentations. For each input image, a new image is generated that is shifted, zoomed in/out, rotated, flipped, distorted, or shaded with a hue. Both the image and its duplicate are fed into the neural network.

L1 and L2 regularization are traditional regularization strategies that consist in adding a penalty term to the objective function and controlling the model complexity through that penalty term. L1 and L2 are common regularization techniques not only in deep neural networks but in machine learning algorithms in general. L1 regularization uses a penalty term that encourages the sum of the absolute values of the parameters to be minimal. It has frequently been observed that, in many models, L1 regularization forces parameters to equal zero, so that the parameter vector is sparse. This makes it a natural candidate for feature selection.

L2 regularization can be seen as the adaptive minimization of the squared error with a penalization term, such that less influential features, i.e., features that have a very small influence on the dependent variable, undergo more penalization. A high penalization term can lead to underfitting. Therefore, this term needs an optimal value to prevent both overfitting and underfitting [39].

Early stopping is a strategy that, by monitoring the validation-set metrics, decides when the model stops training. An indicator that the network is overfitting to the training data is when the loss on the validation set does not improve for a certain number of epochs. To avoid this, early stopping is implemented: the network stops training when it reaches a certain number of epochs without improvement on the validation set.
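A sketch of how these regularizers are typically combined in practice, again using the Keras API; the particular dropout rate, L2 coefficient and patience value are illustrative assumptions, not values from the original text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty
    layers.BatchNormalization(),       # normalize activations per mini-batch
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),               # randomly drop half of the units
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Stop training when the validation loss has not improved for 10 epochs.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```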

2.5 Model Selection

Model selection is the task of selecting a model from a set of candidate models. Commonly, model selection strategies are based on validation. Validation consists in partitioning off a set of the training data and using this subset to validate the predictions of the training. It intends to assess how the results of a model will generalize to an independent data set, as illustrated in Figure 2.11.

Figure 2.11: Common pipeline for model selection. Extracted from [40].

Different pipelines can be designed using validation for model selection. Commonly, a data set is split into three subsets: the train set, where the model is trained, the validation set, on which the model is validated, and the test set, where the model performance is assessed. Different models, or models with different hyper-parameters, are validated on the validation set. The hyper-parameter optimization is typically carried out by a grid-search approach. Grid search is an exhaustive search through a manually specified subset of the hyper-parameter space. A step-by-step description of the grid-search pipeline for model selection is presented as follows (a code sketch is given after the steps):

1. The data set is split randomly, with user independence between the sets, P times into three subsets: train set, validation set and test set.

2. The sets of hyper-parameters to be optimized are defined. Let A and B be two hyper-parameter sets to be optimized, with each value of set A defined as a_i (i = 1, ..., I values) and each value of set B as b_j (j = 1, ..., J values).

3. The Cartesian product of the two sets, A and B, is computed, returning a set of pairs (a_i, b_j) on which a model will be trained. In the end, I × J models are trained.

4. Each model is evaluated on the validation set, returning a specific metric value.

5. The models are ordered by their performance on the validation set. The set of hyper-parameters that produces the best model is selected.


6. The model with the selected hyper-parameters is evaluated on the test set of split p (with p = 1, ..., P).

7. The performance of the algorithm corresponds to the average value of the performance of the selected model on the P splits.
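The sketch below mirrors one split of this pipeline with scikit-learn: a held-out test set, a grid search over two hyper-parameter sets (here the C and gamma of an RBF SVM, as illustrative stand-ins for A and B) validated by cross-validation, and a final evaluation of the selected model; the data and grid values are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Placeholder features/labels standing in for extracted FE descriptors.
X = np.random.rand(200, 64)
y = np.random.randint(0, 7, size=200)

# One of the P random splits: held-out test set, the rest for train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)

# Two hyper-parameter sets (A = C values, B = gamma values); the grid search
# trains and validates one model per pair of the Cartesian product.
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_trainval, y_trainval)

# The best hyper-parameter pair is then evaluated on the untouched test set.
print(search.best_params_, search.score(X_test, y_test))
```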


Chapter 3

State-of-the-Art

Automatic Facial Expression Recognition (FER) can be summarized in four steps: Face Detection, Face Registration, Feature Extraction and Expression Recognition. These four steps encompass methods and techniques that will be covered in the next sections.

3.1 Face detection (from Viola & Jones to DCNNs)

Face detection is usually the first step of all automated facial analysis systems, for instance face modeling, face relighting, expression recognition or face authentication systems. Given an image, the goal is to detect the faces in the image and return their locations in order to further process them. Some factors may compromise and determine the success of face detection. Beyond image conditions and acquisition protocols, different camera-face poses can lead to different views of a face. Furthermore, structural components such as beards or glasses introduce variability in the detection, leading to occlusions [41].

For RGB images, the algorithm of Viola & Jones [23] is still one of the most used face detection methods. It was proposed in 2001 as the first object detection framework to provide competitive face detection in real time. Since it is essentially a 2D face detector, it can only generalize within the pose limits of its training set, and large occlusions will impair its accuracy.

Some methods overcome these weaknesses by building different detectors for different views of the face [42], by introducing robustness to luminance variation [43], or by improving the weak classifiers. Bo Wu et al. [44] proposed the utilization of a single Haar-like feature to compute an equally binned histogram that is then used in a RealBoost learning algorithm. In [45], a new weak classifier, the Bayesian stump, is proposed. Features such as LBP can also be used to improve invariance to image conditions. Hongliang Jin et al. [46] apply LBP within a Bayesian framework, and Zhang et al. [47] combine LBP with a boosting algorithm that uses a multi-branch regression tree as its weak classifier. Another feature set can be found in [24], which applies an SVM over grids of histograms of oriented gradients (HOG) descriptors.


Convolutional Neural Networks (CNNs) have been widely used in image segmentation and classification tasks, as well as for face localization. Convolutional networks are specifically designed to learn invariant representations of images, as they can easily learn the type of shift-invariant local features that are relevant to face detection and pose estimation. Therefore, CNN-based face detectors outperform the traditional approaches, especially in unconstrained scenarios, in which there is a large variability of face poses, viewing angles, occlusions and illumination conditions. Some examples of face detection in unconstrained scenarios can be found in Figure 3.1. CNN-based detectors can also be applied to large images at a small computational cost when compared with the traditional methods mentioned before [48].

Figure 3.1: Multiple face detection in uncontrolled scenarios using the CNN-based method proposed in [49].

In [48], a CNN detects faces and estimates pose by minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. This way, the trained algorithm is capable of handling a wide range of poses without retraining, outperforming traditional methods. The work of Haoxiang Li et al. [49] also takes advantage of the CNN discriminative capacity by proposing a cascade of CNNs. The cascade operates at different resolutions in order to quickly discard obvious non-faces and carefully evaluate the small number of strong candidates. Besides achieving state-of-the-art performance, the algorithm is capable of fast face detection.

Another state-of-the-art approach is the work of Kaipeng Zhang et al. [50], in which a deep cascaded multi-task framework (MTCNN) is designed to detect faces and facial landmarks. The method consists of three stages: in the first stage, it produces candidate windows. Then, it refines the windows by rejecting a large number of non-face windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result again and output the positions of five facial landmarks. The schema of the MTCNN is illustrated in Figure 3.2. In particular, the input image is resized to different scales, forming the input to a three-stage cascaded framework:

Stage 1: First, a fully convolutional network, called the Proposal Network (P-Net), is applied to obtain the candidate facial windows and their bounding box regression vectors. Then, the candidates are calibrated based on the estimated bounding box regression vectors. Non-maximum suppression (NMS) is performed to merge highly overlapped candidates.

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.


Stage 3: This stage is similar to the second stage, but its purpose is to identify face regions with more supervision. In particular, the network outputs the positions of five facial landmarks.

Figure 3.2: Architecture of the Multi-Task CNN for face detection [50].

In this network, there are three main tasks to be trained: face/non-face classification, bounding box regression, and facial landmark localization. For the face/non-face classification task, the goal of training is to minimize the traditional loss function for classification problems, the categorical cross-entropy. The other two tasks (i.e., the bounding box and landmark localization) are treated as regression problems, in which the Euclidean loss has to be minimized [50].

Since this method holds the state of the art for face detection, the MTCNN is used as face detector in this dissertation.
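For reference, off-the-shelf MTCNN implementations exist; the sketch below assumes the third-party `mtcnn` Python package, and the package choice, its output fields and the image path are assumptions for illustration, not part of the original text.

```python
import cv2
from mtcnn import MTCNN  # third-party package implementing the MTCNN of [50]

detector = MTCNN()
# MTCNN implementations typically expect RGB input ("example.jpg" is hypothetical).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Each detection carries a bounding box, a confidence score and the
# five facial landmarks regressed by the third stage of the cascade.
for detection in detector.detect_faces(image):
    x, y, w, h = detection["box"]
    landmarks = detection["keypoints"]   # eyes, nose, mouth corners
    print((x, y, w, h), detection["confidence"], landmarks)
```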

3.2 Face Registration

Once the face is detected, many FER methods require a face registration step for face alignment. During registration, fiducial points (or landmarks) are detected, allowing the alignment of the face under different poses and deformations. These facial key-points can also be used to compute localized features. Interest points combined with local descriptors provide reliable and repeatable measurements from images for a wide range of applications, capturing the essence of a scene without the need for a semantic level [51]. Landmark localization is thus an essential step, as these fiducial points can be used for face alignment and to compute meaningful features for FER [12]. Key-points are mainly located around facial components such as the eyes, mouth, nose and chin. These key-points can be computed either using the scale-invariant feature transform (SIFT) [52] [53] or through a CNN in which facial landmark localization is treated as a regression problem [50].

3.3 Feature Extraction

Facial features can be extracted using different approaches and techniques, which are covered in this section. Feature extraction approaches can be broadly classified into two main groups: hand-crafted features and learned features, which can be applied locally or to the global image. Concerning temporal information, algorithms can be further divided into static or dynamic.

3.3.1 Traditional (geometric and appearance)

Hand-crafted features can be divided into appearance or geometric features. Geometric features describe faces using distances and the shapes of fiducial points (landmarks). Many geometric-based FER methods recognize expressions by first detecting AUs and then decoding a specific expression from them. As an example, [54] looks at the recognition of facial actions through landmark distances, taking as a prior the fact that the facial actions involved in spontaneous emotional expressions are more symmetrical, involving both the left and the right side of the face. Figure 3.3 represents the recognition of one AU.

Figure 3.3: Detection of AUs based on geometric features used in [54].

Geometric features can provide useful information when tracked on temporal axis. Such ap-proach can be found in [55], in which a model for dynamic facial expression recognition based on landmark localization is proposed. Geometric features can also be used to build an active appearance model (AAM), the generalization of a statistical model of the shape and gray-level ap-pearance of the object of interest [56]. AAM is often used for deriving representations of faces for facial action recognition [57] [58]. Local geometric feature extraction approaches aim to describe deformations on motions or localized regions of the face. It is the example of the work proposed

One such example is the work proposed by Stefano Berretti et al. [59], which describes local deformations (given by the key-points) through SIFT descriptors. Dynamic descriptors for local geometric features are based on landmark displacements coded with motion units [60], [61] or on the deformation of facial elements such as the eyes, mouth or eyebrows [62][63].

From the literature it is clear that geometric features are effective in describing facial expressions. However, effective geometric feature extraction depends heavily upon accurate facial key-point detection and tracking. In addition, geometric features are not able to encode relevant information caused by skin texture changes and expression wrinkles.

Appearance features are based on image filters applied to the image to extract the appearance changes of the face. Early global appearance methods focused on Gabor-wavelet representations [64] [65], and the most popular global appearance features remain Gabor filters and LBPs. In [66] and [67], the input face images are convolved with a bank of Gabor filters to extract multi-scale and multi-orientation coefficients that are invariant to illumination, rotation, scale and translation. LBPs are widely used for feature extraction in facial expression recognition due to their tolerance to illumination changes and their computational simplicity. Caifeng Shan et al. [29] implement LBPs as feature descriptors, using AdaBoost to learn the most discriminative LBP features and an SVM as discriminator. In general, such methods have limitations in generalizing to other datasets. Other global appearance methods are based on the Bag of Words (BoW) approach. Karan Sikka et al. [68] explore BoW for appearance-based discriminative feature extraction, combining highly discriminative Multi-Scale Dense SIFT (MSDF) features with spatial pyramid matching (SPM).
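To make the LBP-based pipeline concrete, the sketch below computes uniform LBP histograms over a grid of face regions and concatenates them into a single descriptor, which could then be fed to a classifier such as AdaBoost or an SVM. The grid size and LBP parameters are illustrative choices, not the exact settings of [29].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, radius=1, n_points=8, grid=(7, 7)):
    """Uniform LBP histograms computed on a grid of face regions and concatenated.

    gray_face: 2-D uint8 array containing an aligned face crop.
    """
    lbp = local_binary_pattern(gray_face, n_points, radius, method="uniform")
    n_bins = n_points + 2                         # number of uniform patterns
    h, w = gray_face.shape
    cell_h, cell_w = h // grid[0], w // grid[1]
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * cell_h:(i + 1) * cell_h, j * cell_w:(j + 1) * cell_w]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)
```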

Dynamic global appearance features are an extension to the temporal domain. In [69], local binary pattern histograms from three orthogonal planes (LBP-TOP) are proposed. Bo Sun et al. [70] use a combination of LBP-TOP and local phase quantization from three orthogonal planes (LPQ-TOP), a descriptor similar to LBP-TOP but more robust to blur. Often a combination of different descriptors is used to form hybrid models. One example is the work proposed in [71], where LPQ-TOP is used along with local Gabor binary patterns from three orthogonal planes (LGBP-TOP). Figure 3.4 shows the most commonly used appearance-based feature extraction approaches.

Local appearance features require previous knowledge of regions of interest such as the mouth, eyes or eyebrows. Consequently, their performance depends on the localization and tracking of these regions. In [72], the appearance of gray-scale frames is described by spreading an array of cells across the mouth and extracting the mean intensity from each cell. The features are then modeled using an SVM. A Gray Level Co-occurrence Matrix is used in [73] as a feature descriptor for specific regions of interest.

There is no definitive answer as to which feature extraction method is best; it depends on the problem and/or the AUs to detect. Therefore, it is common to combine both types of appearance-based methods, as this is usually associated with an increase in performance [74] [75]. FER methods based on these traditional descriptors achieve remarkable results, but mainly in controlled scenarios.

Figure 3.4: Spatial representations of the main approaches for feature extraction: (1) facial points, (2) LBP histograms, (3) LPQ histograms, (4) Gabor representation, (5) SIFT descriptors, (6) dense Bag of Words [19].

The performance of these features decreases dramatically in unconstrained environments, where face images cover complex and large intra-personal variations such as pose, illumination, expression and occlusion. The challenge is then to find an ideal facial representation which is robust for facial expression recognition in unconstrained environments. As described in the following subsection, the recent success of deep learning approaches, especially those using CNNs, has been extended to the FER problem.

3.3.2 Deep Convolutional Neural Networks

The state of the art for FER is mostly composed of deep learning-based methods. For instance, the works [76][77][78] are implementations of CNNs for expression recognition, holding state-of-the-art performance on a public dataset of uncontrolled environments (SFEW). Zhiding Yu et al. [77] use ensembles of CNNs, a commonly used strategy to reduce the model's variance and, hence, improve performance. Ensembles of networks can be seen as multiple networks initialized with different weights, leading to different responses. The output in this case is averaged, but it can be merged by other means, such as majority voting.
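A minimal sketch of ensemble averaging is given below; the `predict_proba` interface is assumed here for illustration, with each model returning one probability per expression class. The averaging step can be swapped for majority voting over the per-model predictions.

```python
import numpy as np

def ensemble_predict(models, batch):
    """Average the class probabilities of several independently trained CNNs
    and return the most likely expression class per sample."""
    probs = np.mean([m.predict_proba(batch) for m in models], axis=0)
    return probs.argmax(axis=1)
```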

Another commonly used technique is transfer learning: a CNN is previously trained, usually on some related dataset, and then fine-tuned on the target dataset. Mao Xu et al. [79] propose a facial expression recognition model based on features transferred from CNNs trained for face identification.

These high-level face identification features are then classified into one of seven discrete emotions with a multi-class SVM classifier. In [80], another transfer learning method is proposed. It uses as pre-trained network AlexNet [35], a well-known network trained on a large-scale database. Since the target dataset consists of 2D gray-scale images, the authors perform image transformations to convert them into three-channel inputs.
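The sketch below illustrates the general transfer learning recipe discussed above, using torchvision's pre-trained AlexNet (torchvision ≥ 0.13 API) as a stand-in: the convolutional features are optionally frozen, the final classification layer is replaced by a seven-class expression head, and only the trainable parameters are passed to the optimizer. The hyper-parameters and the freezing policy are assumptions, not the exact setup of [79] or [80].

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_expressions=7, freeze_features=True):
    """Pre-trained AlexNet with its 1000-class head replaced by an expression head."""
    net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
    if freeze_features:
        for p in net.features.parameters():
            p.requires_grad = False          # keep the low-level filters fixed
    net.classifier[6] = nn.Linear(4096, num_expressions)
    return net

model = build_finetune_model()
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=1e-3, momentum=0.9)
```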

Another strategy to work around limited datasets is the use of artificial data. Generative Adversarial Networks (GANs) [81] generate artificial examples that must be identified by a discriminator. GANs have already been used in face recognition. The method proposed in [82] generates meaningful artifacts (for example, glasses) for state-of-the-art face recognition algorithms, leading to more data and preventing real noise from influencing the network's final prediction. Such approaches can possibly be extended to FER as well.

To address the problem of model generalization in FER methods, curriculum learning [83] can be a viable option. Initially, the weights favor "easier" examples, i.e., examples illustrating the simplest concepts, which can be learned most easily. The next training criterion involves a slight change in the weighting of examples that increases the probability of sampling slightly more difficult examples. At the end of the sequence, the re-weighting of the examples becomes uniform and the model is trained on the full target training set. In [84], a meta-dataset is created with complexity values for the task of emotion recognition. This meta-dataset is generated through a complexity function that measures the complexity of image samples and ranks the original training dataset. The dataset is split into different batches based on complexity rank. The deep network is then trained with the easier batches and progressively fine-tuned with the harder ones until the network has been trained with all the data (see Figure 3.5).
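A possible way to organize such a curriculum is sketched below: samples are ranked by a per-sample complexity score (the actual complexity function of [84] is not reproduced here) and split into cumulative subsets, so the network is first trained on the easiest fraction and then fine-tuned on progressively larger, harder subsets until all data is used.

```python
import numpy as np

def curriculum_stages(samples, complexity_scores, n_stages=3):
    """Split a training set into progressively harder (cumulative) subsets.

    complexity_scores: one scalar per sample, higher means harder.
    Stage k contains the easiest k/n_stages fraction of the data.
    """
    order = np.argsort(complexity_scores)          # easiest samples first
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = int(len(order) * k / n_stages)
        stages.append([samples[i] for i in order[:cutoff]])
    return stages
```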

Figure 3.5: Framework of the curriculum learning method proposed in [84]. Faces are extracted, aligned and then ranked into different subsets based on the curriculum. A succession of deep convolution networks is applied, and the weights of the fully connected layers are then fine-tuned.

Regularization, as mentioned before, is also one of the factors that contribute to better deep models. More recently, new methods for regularization were introduced, such as dropout [37], drop-connect [85], max pooling dropout [86], stochastic pooling [87] and, to some degree, batch normalization [36]. H. Ding et al. [88] propose a probabilistic distribution function to model the high-level neuron responses based on an already fine-tuned face network, leading to regularization at the feature level and achieving state-of-the-art performance on the SFEW database.
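For illustration, the small convolutional block below combines two of the regularization techniques listed above, batch normalization and dropout; the layer sizes are arbitrary and unrelated to any of the cited architectures.

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),      # normalizes activations over each mini-batch
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(p=0.5),       # randomly zeroes activations during training
)
```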

Besides regularization and transfer learning approaches, the inclusion of knowledge about the problem domain is also a common approach, since it makes the feature space more discriminative.

The work of Liu et al. [78] explores the psychological theory that FEs can be decomposed into multiple action units and uses this domain knowledge to build an AU-oriented network for FER. The network can be divided into three stages. First, a convolutional layer and a max-pooling layer are built to learn the Micro-Action-Pattern (MAP) representation, extracting domain knowledge from the data, which can explicitly depict local appearance variations caused by facial expressions. Then, feature grouping is applied to simulate larger receptive fields by combining correlated MAPs adaptively, aiming to generate more abstract mid-level semantics. In the last stage, a multi-layer learning process is employed in each receptive field to construct group-wise sub-networks. Likewise, the method proposed by Ferreira et al. [89] builds a network inspired by the physiological evidence that FEs result from the motion of facial muscles. The proposed network is an end-to-end deep neural network with a well-designed loss function that forces the model to learn expression-specific features. The model can be divided into three main components: the representation component is a regular encoder that computes a feature space; this feature space is then filtered with a relevance map of facial expressions computed by the facial-part component; and the new feature space is fed to the classification component and classified into discrete classes. With this approach, the network increases its ability to compute discriminative features for FER.
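The core idea of filtering a feature space with a relevance map can be sketched as follows; this is a simplified illustration of the mechanism, not the authors' actual architecture or loss function. A spatial relevance map (one weight per location, ideally highlighting expressive facial parts such as the eyes, eyebrows and mouth) is predicted from the intermediate features and used to re-weight them before classification.

```python
import torch.nn as nn

class RelevanceFiltering(nn.Module):
    """Re-weight an intermediate feature map with a learned spatial relevance map."""

    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolution + sigmoid -> one relevance value per spatial location
        self.relevance = nn.Sequential(nn.Conv2d(in_channels, 1, kernel_size=1),
                                       nn.Sigmoid())

    def forward(self, features):          # features: (N, C, H, W)
        r = self.relevance(features)      # (N, 1, H, W) relevance map
        return features * r               # broadcast over the channel dimension
```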

3.4 Expression Recognition

Within a categorical classification approach, there are several works that tackle the categorical system in different ways: most works use feature descriptors followed by a classifier to classify expressions/emotions categorically. Principal Component Analysis (PCA) can be performed before classification in order to reduce feature dimensionality. The classification itself can be achieved using SVMs [90] [59], Random Forests [91] or k-nearest neighbors (kNN) [67]. More recently, deep networks are being used and, as mentioned in the previous section, perform feature extraction and recognition jointly; however, it is also possible to use the network as a feature extractor and feed the extracted features to a separate classifier [76][77][78].
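A typical instance of this pipeline, assembled here with scikit-learn for illustration (the exact components vary across the cited works), chains feature standardization, PCA and a multi-class SVM.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Any of the descriptors above (geometric, LBP, Gabor or CNN features) can be fed in.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=0.95),   # keep 95% of the variance
                    SVC(kernel="rbf"))        # multi-class via one-vs-one by default
# clf.fit(train_features, train_labels)
# predictions = clf.predict(test_features)
```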

3.5 Summary

Facial expression recognition follows a standard pipeline, starting with face detection/alignment, followed by feature extraction and expression recognition/classification. Feature extraction plays a crucial role in the performance of a FER system. It can be performed using traditional methods that take into account geometric or appearance features, or it can be more complex, using convolutional neural networks as feature descriptors. CNNs learn representations of data with levels of abstraction that traditional methods cannot reach, leading to new features. However, CNN performance depends on the dataset size and on the GPU technology that determines training speed. For FER, the available datasets are scarce, giving rise to the need for methods that regularize, augment and generalize the available data in order to improve performance. Novel FER methods aim to improve generalization and to mitigate dataset scarcity by incorporating prior knowledge into the networks.

Prior knowledge can consist of features transferred from other networks and domains, but it can also be domain knowledge that improves the discrimination in the feature space. Either way, the works that include prior knowledge indicate that this approach can achieve state-of-the-art results and can help generalization when only small datasets are available.


Chapter 4

Implemented Reference Methodologies

Several approaches, from traditional methods to state-of-the-art methods, were implemented. The methodology followed for each approach is covered in this chapter.

The implemented traditional approaches include hand-crafted based methods (geometric and appearance) as well as a conventional CNN trained from scratch. These methods were implemented as baselines for the proposed FER method. In addition, several state-of-the-art methods were also implemented (i.e., transfer learning and physiologically inspired networks) to serve as a starting point for the proposed method.

Figure 4.1: Illustration of the pre-processing, where (a) and (b) are instances from an unconstrained dataset (SFEW [92]) and (c) and (d) are instances from a controlled public dataset (CK+ [93]). The original images are fed to the face detector (an MTCNN framework [50]) and the detected faces are used as input to the implemented methods.

As a pre-processing step, all methods are preceded by face detection and alignment. Then, the images are normalized, cropped and resized. To jointly perform face detection and alignment, the MTCNN [50] is used as the face detector.

Some examples of pre-processed images are presented in Figure 4.1.
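For reference, the pre-processing step can be sketched as below, assuming the publicly available `mtcnn` Python package as the MTCNN implementation; the crop margins, output size and normalization details used in this dissertation are omitted.

```python
import cv2
from mtcnn import MTCNN   # assumes the publicly available `mtcnn` package

detector = MTCNN()

def preprocess(image_path, output_size=(96, 96)):
    """Detect the largest face in an image, crop it and resize it."""
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    detections = detector.detect_faces(img)
    if not detections:
        return None                                   # no face found
    x, y, w, h = max(detections, key=lambda d: d["box"][2] * d["box"][3])["box"]
    face = img[max(y, 0):y + h, max(x, 0):x + w]
    return cv2.resize(face, output_size)
```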

4.1 Hand-crafted based approaches

Concerning hand-crafted methods, several appearance-based methods as well as a geometric-based approach were implemented. The geometric approach is based on features computed from facial key-points (see Figure 4.2), such as:

1. Distances x and y of each key-point to the center point;
2. Euclidean distance of each key-point to the center point;

3. Relative angle of each key-point to the center point corrected by nose angle offset.

These features are then concatenated to form the geometric feature descriptor. Finally, this feature descriptor is fed into a multi-class SVM for expression classification.
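A minimal sketch of this geometric descriptor is given below; the choice of center point and the exact angle correction are assumptions made for illustration.

```python
import numpy as np

def geometric_descriptor(landmarks, nose_tip):
    """Concatenate (dx, dy), Euclidean distances and corrected angles of key-points.

    landmarks: (K, 2) array of facial key-point coordinates.
    nose_tip:  (2,) landmark used to estimate the face orientation offset.
    The center point is taken here as the mean of all landmarks.
    """
    center = landmarks.mean(axis=0)
    diffs = landmarks - center                       # per-point (dx, dy)
    dists = np.linalg.norm(diffs, axis=1)            # Euclidean distances
    ndx, ndy = nose_tip - center
    nose_angle = np.arctan2(ndy, ndx)                # orientation offset
    angles = np.arctan2(diffs[:, 1], diffs[:, 0]) - nose_angle
    return np.concatenate([diffs.ravel(), dists, angles])
```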

Figure 4.2: Illustration of the implemented geometric feature computation.

The implemented appearance-based FER methods are based on two commonly used techniques for texture classification, namely Gabor filter banks and LBP. Regarding the Gabor filters approach, a bank of Gabor filters with different orientations, frequencies and standard deviations was first created. Afterwards, the input images are convolved (or filtered) with the different Gabor filter kernels, resulting in several image representations of the original image. The mean and
