Spatio-temporal representation based on autoencoder for video action recognition : Representação espaço-temporal baseada em autoencoder para reconhecimento de ações em vídeos


COMPUTAÇÃO

Anderson Carlos Sousa e Santos

Spatio-Temporal Representation

Based on Autoencoder

for Video Action Recognition

Representação Espaço-Temporal

Baseada em Autoencoder

para Reconhecimento de Ações em Vídeos

CAMPINAS

2019


Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. Hélio Pedrini

This copy corresponds to the final version of the thesis defended by Anderson Carlos Sousa e Santos and supervised by Prof. Dr. Hélio Pedrini.


Ana Regina Machado - CRB 8/5467

Santos, Anderson Carlos Sousa e,
Sa59s Spatio-temporal representation based on autoencoder for video action recognition / Anderson Carlos Sousa e Santos. – Campinas, SP : [s.n.], 2019.

Supervisor: Hélio Pedrini.

Thesis (doctorate) – Universidade Estadual de Campinas, Instituto de Computação.

Subject headings: 1. Visão por computador. 2. Processamento de sinal de vídeo. 3. Aprendizado de máquina. 4. Redes neurais convolucionais. I. Pedrini, Hélio, 1963-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Information for the Digital Library

Title in another language: Representação espaço-temporal baseada em autoencoder para reconhecimento de ações em vídeos

Keywords in English: Computer vision; Video signal processing; Machine learning; Convolutional neural networks

Area of concentration: Computer Science

Degree: Doctor in Computer Science

Examining committee: Hélio Pedrini [Supervisor]; David Menotti Gomes; Ronaldo Cristiano Prati; Thiago Vallin Spina; Alexandre Mello Ferreira

Defense date: 30-09-2019

Graduate program: Computer Science

Author identification and academic information

- Author's ORCID: https://orcid.org/0000-0002-7806-3410
- Author's Currículo Lattes: http://lattes.cnpq.br/5385160404560982


Examining Committee:

• Prof. Dr. Hélio Pedrini, IC/Unicamp

• Prof. Dr. David Menotti Gomes, DINF/UFPR

• Prof. Dr. Ronaldo Cristiano Prati, UFABC

• Prof. Dr. Thiago Vallin Spina, LNLS

• Prof. Dr. Alexandre Mello Ferreira, IC/UNICAMP

The defense minutes, signed by the members of the Examining Committee, are on file in SIGA/Sistema de Fluxo de Dissertação/Tese and at the Program Office of the Unit.


I would like to thank my family for all the support throughout my life and for giving me the strength to move forward and overcome the obstacles. In particular, I thank my parents Machado and Conceição, my grandmother Jardelina (in memoriam) and my sister Jamille for the love, affection and trust needed to accomplish this.

I would like to thank Professor Hélio Pedrini for the opportunities and advice, for believing and trusting in my potential and for all the time and patience devoted to supervising this research.

I would like to thank the many friends I made during my academic life, for all the shared moments of joy that made the journey unforgettable. Special thanks to Jacqueline Midlej and Lucas Batista for all the care and concern and to Hanna and David Felix for the trust and support even at a distance.

I would like to thank my childhood friends, on whom I can always count and who are by my side even at a distance. Thank you for the happy moments and great memories that helped in the difficult times during this journey.

I would like to thank the members of the LIV (Laboratory of Visual Informatics) for the companionship and technical discussions that certainly influenced this work.

I would like to thank the faculty and staff members of the Institute of Computing at UNICAMP for the attention, operational support and precious learning, and also for creating a healthy and inspiring environment.

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

I am deeply grateful to CNPq (process 141647/2017-5 from 2017 to 2019) and CAPES (from 2015 to 2017) for their financial support.


Due to advances in the development of cameras with high sampling rates, low cost, small size and high resolution, a rapid growth in the acquisition and dissemination of videos has driven the development of several multimedia applications, such as interactive broadcasting, entertainment, telemedicine and surveillance, among others. To deal with the enormous amount of data and support human operators, it becomes necessary to introduce automatic mechanisms to process and understand video content. Despite recent advances in computer vision, more specifically in image analysis through deep neural networks, the action recognition problem remains challenging, since spatio-temporal relationships are difficult to model. In this thesis, we propose and evaluate a new video representation based on an autoencoder that employs a convolutional neural network, which receives a video sequence in the form of stacked frames, encodes it into a latent representation and decodes it back into a video sequence that resembles the original as closely as possible. The designed autoencoder architecture forces the encoder to generate an image that summarizes the entire video sequence. An analysis of different loss functions was carried out to evaluate the impact on the generated image and on the video reconstruction. The proposed transformation makes it possible to leverage image-based deep models, in addition to allowing simple visualization and compression. Unlike other video-to-image approaches, the proposed method provides end-to-end learning with any neural network model that expects an image as input and can be adapted to different video analysis problems. The use of our representation was demonstrated with a multi-stream approach, which also includes an RGB image and a stack of optical flow images. To combine all streams, we introduce the application of a fuzzy integral approach, which generalizes other common fusion operators, to improve on the individual scores. Experimental results on the challenging UCF101 and HMDB51 data sets validate the autoencoder-based spatio-temporal representation, since they show that our method is able to outperform a two-stream baseline and to achieve competitive accuracy rates compared with other approaches available in the literature.


Due to rapid advances in the development of cameras with high sampling rates, low cost, small size and high resolution, a fast growth in the acquisition and dissemination of videos has driven the development of diverse multimedia applications, for instance, interactive broadcasting, entertainment, telemedicine and surveillance, among others. To cope with the massive amount of data and support human operators, it is necessary to introduce automatic mechanisms to process and understand video content. Despite recent advances in computer vision, more specifically in image analysis through the use of deep neural networks, the problem of action recognition is still challenging, since spatio-temporal relationships are more difficult to model. In this thesis, we propose and evaluate a novel video representation based on a convolutional neural network autoencoder that takes a video sequence as a stack of frames, encodes it to a latent representation and decodes it back to a video that closely resembles the original. The specifically designed architecture of the autoencoder forces the encoder to produce an image that summarizes the entire video sequence. An analysis of different loss functions was carried out to evaluate the impact on the generated image and on the reconstruction of the video. The proposed transformation makes it possible to leverage image-based deep models and allows straightforward visualization and compression. Unlike other video-to-image approaches, it provides end-to-end learning with any neural network model that expects an image as input and can be adapted to different video analysis problems. We demonstrate the use of our representation using a multi-stream approach that also includes an RGB image and a stack of optical flow images. To combine all streams, we introduce the application of a fuzzy integral approach that generalizes other common fusion operators to improve on all individual scores. Experimental results on the challenging UCF101 and HMDB51 data sets validate the autoencoder-based spatio-temporal representation, demonstrating that our method is capable of surpassing a two-stream baseline and achieving competitive accuracy rates compared to other approaches available in the literature.


1.1 Basic illustration of the action recognition problem.

2.1 Example of a multilayer perceptron neural network.

2.2 Example of a convolutional neural network [6].

2.3 Example of an autoencoder structure [2].

2.4 Examples of Motion Energy Images (MEI) and Motion History Images (MHI). The first row shows the key frames of a video, the second shows the MEI representation, and the third shows the MHI representation [9].

2.5 Examples of visual rhythm images from different slicing directions [15]. (a) video sequences; (b) horizontal-mean strategy; (c) vertical-mean strategy.

2.6 Frames from a video sequence and Motion Stacked Difference Image (MSDI) generated from them [104].

2.7 Examples of dynamic images that summarize short video sequences as still images [8].

2.8 LSTM autoencoder model. Extracted from [87].

3.1 Our method is composed of an autoencoder-based representation for action recognition. The representation obtained from the encoder is used as an image that serves as input to a 2D CNN classifier.

3.2 Proposed video autoencoder architecture.

3.3 Proposed action recognition architecture composed of spatial, temporal and spatio-temporal streams.

3.4 Proposed architecture for action recognition using our spatio-temporal stream.

4.1 Examples of classes considered in the UCF101 action recognition data set.

4.2 Examples of classes considered in the HMDB51 action recognition data set.

4.3 Examples of images generated by the encoder. (a) shows the first and last frames of a video clip with ten frames. (b-h) show the representation of all the ten frames given by the encoder considering the loss function applied in its training. The individual channels are shown as gray-scale images and their composition as RGB images.

4.4 Examples of images generated by the autoencoder. The first column shows original frames of video clips, whereas the other columns show the respective frame reconstruction, considering the loss function applied in the training step.

… its training. The individual channels are shown as gray-scale images and their composition as RGB images.

4.7 Examples of images generated by the encoder. (a) shows samples of frames from the original video used as input. (b-d) show the representation of all the ten frames given by the encoder considering the activation used. The individual channels are shown as gray-scale images and their composition as RGB images.

4.8 Class-wise accuracy of each stream for the HMDB51 dataset (Split 1).

4.9 Class-wise accuracy of each stream for the UCF101 dataset (Split 1).

4.10 Examples of actions represented by the encoder. The first column shows the videos where the method reached the best accuracy rates, whereas the last column shows the videos where it achieved the worst accuracy rates.

4.11 Class-wise accuracy of stream combinations (without fuzzy fusion) for the HMDB51 data set (Split 1).

4.12 Class-wise accuracy of stream combinations (with fuzzy fusion) for the HMDB51 data set (Split 1).

4.13 Class-wise accuracy of stream combinations (without fuzzy fusion) for UCF101 dataset (Split 1).

4.14 Class-wise accuracy of stream combinations (with fuzzy fusion) for UCF101 …


2.1 Video representation methods.

4.1 Action classification accuracy for HMDB51 (Split 2) with different loss functions for the encoder.

4.2 Results over the frame reconstructions of the test set for HMDB51 (Split 1).

4.3 Action classification accuracy for HMDB51 (Split 2) with different autoencoder architectures.

4.4 Action classification accuracy for HMDB51 (Split 2) with different encoder-CNN connections.

4.5 Results over the frame reconstructions of the test set for HMDB51 (Split 1).

4.6 Accuracy of different input methods for CNN action classification.

4.7 Results for different stream fusion combinations.

4.8 Results for different stream fusion combinations using the proposed fuzzy fusion.

4.9 Individual accuracy results for different video-to-image representations on the HMDB51 and UCF101 data sets.

4.10 Comparison of accuracy rates (%) for UCF101 and HMDB51 datasets. Cells in bold represent the overall highest accuracy rates, whereas underlined cells consist of the best results using only ImageNet to pre-train the network.


2D Two-Dimensional
3D Three-Dimensional
AE Autoencoder
AAL Ambient-Assisted Living
BN Batch Normalization
CNN Convolutional Neural Network
DI Dynamic Images
DSSIM Structural Dissimilarity Index Metric
HOF Histogram of Optical Flow
HOG Histogram of Oriented Gradient
I3D Inflated 3D CNN
iDT Improved Dense Trajectories
LBP Local Binary Patterns
LSTM Long Short Term Memory
MBH Motion Boundary Histogram
MEI Motion Energy Image
MHI Motion History Image
MLP Multilayer Perceptron
MSDI Motion Stacked Difference Image
PCA Principal Component Analysis
PoTion Pose Motion
ReLU Rectified Linear Unit
RGB Red-Green-Blue
SGD Stochastic Gradient Descent
SSIM Structural Similarity Index Metric
STIP Space-Time Interest Points
SVM Support Vector Machine
Tanh Hyperbolic Tangent
TDD Trajectory-pooled Deep-convolutional Descriptors
TSN Temporal Segment Network
VAE Variational Autoencoder
VR Visual Rhythms


1 Introduction
  1.1 Motivation and Problem Characterization
  1.2 Research Questions
  1.3 Objectives and Contributions
  1.4 Publications
  1.5 Text Organization

2 Background
  2.1 Basic Concepts
    2.1.1 Actions
    2.1.2 Neural Networks
    2.1.3 Representation Learning
    2.1.4 Autoencoders
  2.2 Related Work
    2.2.1 Action Recognition
    2.2.2 Video Autoencoders

3 Proposed Spatio-Temporal Representation for Action Recognition
  3.1 Video Autoencoder
    3.1.1 Encoder
    3.1.2 Decoder
    3.1.3 Loss Function
  3.2 Multi-Stream Architecture
    3.2.1 Training Details
    3.2.2 Testing Details

4 Experimental Results
  4.1 Data Sets
  4.2 Implementation Details
  4.3 Ablation Tests
    4.3.1 Loss Functions
    4.3.2 Depth
    4.3.3 Connection
  4.4 Results


Chapter 1 Introduction

This chapter describes the problem under investigation in this thesis, as well as its motivations, challenges, objectives, contributions, research questions, and text organization.

1.1 Motivation and Problem Characterization

Videos have become an important medium in our daily lives and their volume is larger than ever. According to recent statistics, over one billion hours of video are watched daily on YouTube [3] and Facebook reaches an audience of up to 8 billion views per day [1]. In Brazil, the number of video cameras embedded in cell phones exceeds the number of inhabitants, 220 million smartphones against 207.6 million people [51]. In addition to social media, the use of videos has increased in diverse applications in the fields of robotics, human-computer interface, smart homes and surveillance. They all demand automatic and efficient video interpretation.

With the massive volume of data, retrieval and management of videos are huge challenges. Although they can be addressed with metadata and associated text such as title, subtitles and keywords, this information may be missing, incorrect or irrelevant. Therefore, semantic extraction of video content is required [44]. Ramezanei and Yaghmaee [65] reviewed several techniques that employ human action analysis for content-based video retrieval.

When interacting with humans, it is crucial for a robot to understand the actions taken in order to generate the appropriate reaction in a feasible time. For example, in a tennis match, a robot that predicts the player's action to hit the ball may respond faster than a robot that only tracks the ball after it has been hit. Another example is a humanoid robot that must interpret a handshake to provide a realistic interaction [112]. The same benefits apply to human-computer interfaces that intend to provide non-invasive and natural approaches through recognition of gestures and body movements [62, 78, 80]. This has become a requirement as the entertainment industry has moved to more interactive, full-body scenarios that do not require manual controls, such as video games involving dance and sports.

The development of smart homes involves automation in the control of house appliances, temperature and lighting. However, in recent years, technologies focused on Ambient-Assisted Living (AAL) have emerged and become of great importance due to the rapidly aging society [67]. Using videos allows non-invasive methods to detect falls and other abnormal situations [40] and even changes in social interaction [11].

The current development of autonomous vehicles, capable of assisting human drivers, helping to reduce fatalities and save fuel, is a major challenge that also requires understanding certain actions to react quickly and prevent accidents [25]. For instance, a car needs to be able to anticipate the trajectory and intentions of a running pedestrian or a child chasing a ball [23].

In the domain of surveillance, visual inspection has traditionally been performed by human operators to identify events of interest in video sequences, either live or recorded [43]. However, the massive amount of data involved in real-world applications makes such a recognition process impracticable, time-consuming and susceptible to failure under fatigue or stress. Therefore, automatic video interpretation systems are crucial for these monitoring tasks in complex scenarios. Such systems can help detect specific events, such as violence [22] or any behavior that might be considered abnormal in its context [64].

In this work, we do not address each problem individually, but focus on the broad area of action recognition through data sets with a diversified range of actions, so that the methods developed for this purpose can be applied in many cases. Given a video sequence of variable size, the task consists of generating a label that corresponds to the main action performed in it. Figure 1.1 shows the basic scheme for action recognition; only RGB or grayscale videos are considered here and only one label is inferred. The problem of action localization and detection is beyond the scope of this work: it is assumed that the video has been temporally segmented beforehand [71, 72, 73, 74, 75] and contains only one action per sequence of frames.

Figure 1.1: Basic illustration of the action recognition problem.

Although the concept of actions may differ in the literature, it generally refers to short-range movements and interactions. Section 2.1.1 develops a more detailed definition of the actions considered in this work.

The subject of action recognition inherits some of the basic issues of computer vision in general, such as:

• illumination conditions, shadows, light intensity variation and reflections can degrade the images, make boundaries harder to define and increase intra-class variation;

• scale variation arises because the objects¹ of interest can be close to or distant from the camera, causing divergence between different images;

• pose variation refers to the angle at which objects were captured, which yields different points of view and also causes mismatches;

• image resolution determines the number of pixels that represent the image. A low resolution means that more information is condensed into fewer pixels, causing a lack of detail.

On the other hand, new challenges are related to the classification of human actions in videos:

• camera motion introduces noise into the action movement and needs to be modeled and compensated;

• temporal scale variation can occur between classes of actions that have different ranges of complexity and movements, but also within the same class when actions are executed at different speeds;

• background clutter adds noise and makes it difficult to segment actions from other movements. Nevertheless, some background information may be important to provide context and, therefore, the background needs careful treatment;

• intra- and inter-class variation, in addition to the challenges mentioned previously, is also caused by the nature of human behavior, where the same action may be performed in a variety of ways. For example, a simple action such as eating can be performed with the left or right hand, using a fork, spoon and/or knife, and the food may vary. On the other hand, different actions may be very similar in terms of movement, for instance, jumping rope and jumping jack or hair cut and head massage.

To overcome these challenges, it is crucial to have a video representation that captures both spatial and temporal features and addresses all types of variations. For the common problems of computer vision using images, there has been a significant advance with the use of deep learning methods composed of convolutional neural networks and, therefore, it is essential to take advantage of these solutions.

1.2 Research Questions

We drive our research through a major investigative question:

• Is an autoencoder model capable of generating a meaningful spatio-temporal representation for videos using the internal code learned as an image?

¹ Objects in this context refer to any portion of the image that can be interpreted as a single unit, which […]

Complementary to this question and considering the specific case of the human action recognition problem, other research questions can be summarized as:

• Is the generated representation useful for action recognition?

• Does the use of end-to-end learning improve the representation for the action recognition problem?

• Does the representation possess complementary information with RGB and optical flow?

• Can the fusion of scores given by each different representation be improved with a fuzzy integral?

1.3 Objectives and Contributions

The main objective of this work is to propose, implement and analyze a learnable video representation in the format of a single 2D image. The representation should be compact and illustrative, capture both spatial and temporal information, and be relevant for action classification in a variety of classes. In order to do so, specific goals were defined:

• the definition of a video autoencoder architecture capable of generating internal code as a single image;

• the investigation of image restoration loss functions for providing appropriate internal code, since most approaches available in the literature focus on the output image;

• evaluation of the representation in the context of action recognition using 2D Convolutional Neural Network (CNN) models and standard data sets;

• combination with other types of representations: a sample RGB image from the video and a stack of optical flow images.

The main contribution of this thesis is a method that allows end-to-end learning for converting video sequences into an image representation using autoencoders. Unsupervised learning is used to initialize the weights of a convolutional autoencoder model and the encoder is then extracted and plugged into an image model, where it can be further trained with supervision considering the action recognition task.

Consequently, a novel video autoencoder architecture is developed, since no architecture available in the literature explores the internal code of an autoencoder as an image. In addition, a multi-stream approach that includes the proposed representation is developed for action recognition. Finally, the thesis contributes to the combination of multiple streams by introducing the fuzzy integral to fuse the scores of different models.
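The idea of using the autoencoder's internal code as an image can be illustrated with a deliberately tiny sketch. It is not the architecture proposed in this thesis: it assumes grayscale frames, a single per-pixel linear map across the stacked frames (equivalent to a 1×1 convolution) with a tanh activation, untrained random weights and a mean squared error loss. All names (`encode`, `decode`, `mse`, `enc_w`, `dec_w`) are hypothetical and only show how a stack of T frames can pass through an H × W × 3 bottleneck.

```python
import math
import random


def encode(frames, enc_weights):
    """Map T stacked grayscale frames (T x H x W) to an H x W x 3 code image."""
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    return [[[math.tanh(sum(enc_weights[c][k] * frames[k][i][j] for k in range(t)))
              for c in range(3)]
             for j in range(w)]
            for i in range(h)]


def decode(code, dec_weights):
    """Map the H x W x 3 code image back to T frames (T x H x W)."""
    h, w, t = len(code), len(code[0]), len(dec_weights)
    return [[[sum(dec_weights[k][c] * code[i][j][c] for c in range(3))
              for j in range(w)]
             for i in range(h)]
            for k in range(t)]


def mse(frames_a, frames_b):
    """Mean squared error between two videos of identical shape."""
    diffs = [(pa - pb) ** 2
             for fa, fb in zip(frames_a, frames_b)
             for ra, rb in zip(fa, fb)
             for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


random.seed(0)
T, H, W = 10, 4, 4  # ten frames per clip; the spatial size is an arbitrary toy value
video = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
enc_w = [[random.uniform(-0.1, 0.1) for _ in range(T)] for _ in range(3)]  # 3 x T, untrained
dec_w = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(T)]  # T x 3, untrained

code_image = encode(video, enc_w)           # the "image" that summarizes the clip
reconstruction = decode(code_image, dec_w)  # back to a T-frame video
loss = mse(video, reconstruction)           # quantity an optimizer would minimize
```

In a trained model the weights would be adjusted to minimize `loss`, after which only `encode` would be kept and its output fed to an image classifier.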


1.4 Publications

The following papers were derived from the development of this research work:

• A.C.S. Santos and H. Pedrini. Spatio-Temporal Video Autoencoder for Human Action Recognition. 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, volume 5, pages 114-123. SciTePress, 2019.

This paper presents the first version of the method described in this thesis and introduces the idea of an autoencoder-based representation using a shallow architecture.

• A.C.S. Santos and H. Pedrini. Human Action Recognition Based on a Spatio-Temporal Video Autoencoder. International Journal of Pattern Recognition and Artificial Intelligence, 2020 (accepted).

This paper presents an extended version of the proposed methodology, similar to the approach described in this thesis.

1.5 Text Organization

This thesis is organized as follows. Chapter 2 presents a description of relevant concepts and related work regarding action recognition, video representation and autoencoders in the video context. Chapter 3 describes our proposed video autoencoder architecture and the multi-stream action recognition approach. Implementation details and experimental results are provided in Chapter 4, as well as a comparison with state-of-the-art methods. Finally, some concluding remarks and directions for future work are presented in Chapter 5.


Chapter 2 Background

This chapter presents a brief description of some fundamental concepts necessary to comprehend the remainder of this thesis, relevant work related to action recognition with focus on different video representations, as well as approaches that employ video autoencoder architectures.

2.1 Basic Concepts

This section presents an introduction to relevant concepts used in this thesis. Initially, the definition of action considered in this work is presented in Section 2.1.1. Section 2.1.2 provides an explanation of neural networks focused on convolutional neural networks, as well as the fundamentals that precede them, related concepts and techniques. Section 2.1.3 discusses the idea of representation learning as a research field. Finally, autoencoders are defined in Section 2.1.4.

2.1.1 Actions

There are different definitions of what constitutes an action throughout the literature. Some authors refer to actions as primitive movements such as sitting, walking and running, and to activities as more complex sequences of these actions, involving interaction with objects and/or other people. However, these two concepts are often used interchangeably. The definition that most closely matches the approaches available in the literature and standard data sets is given by Herath et al. [31], who define action as the most elementary human-surrounding or human-object interaction with meaning.

This definition may be subjective and, in this work, the boundaries of what constitutes an action are data driven, that is, the categories present in the evaluated data set are always considered as actions, even though one might argue that some of them could be subdivided into more elementary interactions. The definition can also be made in contrast to the concept of activity, which is considered here as a complex sequence of long-term actions; therefore, any short-term human interaction with an objective is considered an action, even if it may be separated into more elementary actions.

For example, cutting in the kitchen is considered an action category, while cooking is an activity as it involves separate high-level steps. Section 4.1 describes a list of actions considered in our experiments.

2.1.2 Neural Networks

Neural network is a term that encompasses a large class of models and learning methods. Its most basic form is the Multilayer Perceptron (MLP), whose purpose is to approximate some function $g$ that maps an input $x$ to a class $y$, in the case of a classification problem.

The description as a network comes from the combination of neurons that can represent different functions. A chain structure such as $f(x) = g_3(g_2(g_1(x)))$ is normally used; the number of functions defines the depth of the network, which is where the term deep learning comes from [28]. Figure 2.1 illustrates an example of the basic scheme of the multilayer perceptron.

Figure 2.1: Example of a multilayer perceptron neural network.

More specifically, the function performed by a single neuron can be represented by $f(w^T x + b)$, where $w$ are the parameters, also called weights, $x$ is the input vector, $b$ is a bias term and $f$ is a non-linearity, also called activation function. The goal is to adjust the weights $w$ to obtain the desired output for a given input, which is done in the training stage. The test stage consists of applying the function, with the learned weights, to a new input.
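As a minimal sketch of the neuron function and of the chain structure, the following example evaluates a three-layer network by hand. The weights are hypothetical values chosen for illustration, not learned ones, and the helper names (`neuron`, `layer`) are assumptions of this example.

```python
def neuron(weights, bias, inputs, activation):
    """Compute f(w^T x + b) for a single neuron."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)


def layer(weight_rows, biases, inputs, activation):
    """A layer is a list of neurons that all read the same input vector."""
    return [neuron(w, b, inputs, activation) for w, b in zip(weight_rows, biases)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


# Hypothetical, hand-picked weights (illustration only).
x = [1.0, -2.0]
h1 = layer([[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0], x, relu)  # g1: two neurons
h2 = layer([[1.0, 1.0]], [0.5], h1, relu)                   # g2: one neuron
y = layer([[2.0]], [-1.0], h2, identity)                    # g3: output neuron
print(h1, h2, y)  # [0.0, 3.0] [3.5] [6.0]
```

Training would replace the hand-picked weights with values fitted to data; the forward computation stays the same.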

The most common activation functions are the sigmoid, Hyperbolic Tangent (Tanh) and Rectified Linear Unit (ReLU). The sigmoid, also known as the logistic function, transforms the data into a value between 0 and 1 according to Equation 2.1.

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2.1}$$

The sigmoid function has an S-shape centered at 0.5, its value at $x = 0$. This was the default activation used in the early development of neural networks. Equation 2.2 shows the hyperbolic tangent function, which has the same shape as the sigmoid but takes values in the range from -1 to 1.

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.2}$$

The tanh function usually provides better performance and easier training than the sigmoid. Nevertheless, both functions saturate at high and low input values and are sensitive only around their mid-points. This makes optimization more challenging, since it becomes difficult to adjust the weights when the outputs are always pushed close to 1 or 0 (-1, in the case of tanh).

Another challenge is that training is performed using stochastic gradient descent, which requires propagating the error through the network to update the weights using an estimated gradient. In deeper networks this leads to the problem of vanishing gradients, in which the propagated error keeps shrinking across the layers until it carries no useful information.
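The shrinking of the propagated error can be made concrete with a small numerical sketch. It assumes a chain of one-neuron sigmoid layers whose weights all equal 1, so that the only factor applied at each layer is the sigmoid derivative, which is at most 0.25. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagated_factor(pre_activations):
    """Product of sigmoid derivatives along a chain of one-neuron layers.

    Assumes every weight equals 1, so only the activations scale the signal.
    """
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_derivative(z)
    return factor


shallow = backpropagated_factor([0.0] * 2)     # best case: 0.25 per layer
deep = backpropagated_factor([0.0] * 20)       # same best case, 20 layers
saturated = backpropagated_factor([5.0] * 20)  # saturated units shrink it further
print(shallow, deep, saturated)
```

Even in the most favorable case, twenty layers scale the error by 0.25 to the power 20, which is below 1e-11; saturated units make the factor smaller still.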

To solve the problems mentioned, the ReLU activation function is more often used. It is a simple function, as shown in Equation 2.3, and solves most of the problems found in the other functions.

f (x) = max(0, x) (2.3)

ReLU has the advantage of computational simplicity, requiring no exponential calcu-lation and has a trivial implementation. In addition, since it outputs true zero values, in contrast to other functions that output approximately zero values, there is a representa-tional sparsity that is desirable in many cases.

Another important feature of ReLU is that it behaves similarly to a linear function: it is the identity for positive inputs and departs from linearity only by mapping values less than zero to zero. This allows the gradients to flow without the vanishing effect [26].
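The saturation behaviour discussed above can be checked numerically. The following is a minimal sketch, not taken from the thesis, that implements Equations 2.1 to 2.3 and their derivatives in plain Python; the helper names (`neuron`, `d_sigmoid`, `d_tanh`, `d_relu`) are illustrative assumptions.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function (Equation 2.1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Hyperbolic tangent (Equation 2.2)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x: float) -> float:
    """Rectified Linear Unit (Equation 2.3)."""
    return max(0.0, x)

def neuron(w, x, b, f):
    """Single neuron f(w^T x + b) for plain Python lists w and x."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def d_sigmoid(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x: float) -> float:
    t = tanh(x)
    return 1.0 - t * t

def d_relu(x: float) -> float:
    # Convention assumed here: the derivative at exactly 0 is taken as 0.
    return 1.0 if x > 0 else 0.0

# Derivatives at a low, a middle and a high input value.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:+.0f}  sigmoid'={d_sigmoid(x):.2e}  "
          f"tanh'={d_tanh(x):.2e}  relu'={d_relu(x):.0f}")
```

For large positive inputs, the sigmoid and tanh derivatives are close to zero, whereas the ReLU derivative stays at 1, which is the property the text associates with gradients that do not vanish.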

Convolutional neural networks (CNNs) refer to a special type of neural network that works best with grid-like data, such as images. They are characterized by the use of a convolution operation in place of the general matrix multiplication. The mathematical expression of a 2D convolution in discrete space is shown in Equation 2.4.

$$(K * I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n) \tag{2.4}$$

where $K$ is a filter of dimensions $m \times n$, $I$ is a 2D image, and $(i, j)$ is a pixel location.
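Equation 2.4 can be illustrated with a direct, unoptimized NumPy sketch. The function name `conv2d_valid` and the choice of computing only the positions where the filter fits entirely inside the image are assumptions made for this example.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Discrete 2D convolution (Equation 2.4), restricted to positions
    where the filter lies fully inside the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    # The index pattern I(i - m, j - n) K(m, n) is equivalent to sliding
    # the filter flipped along both axes over the image.
    flipped = kernel[::-1, ::-1]
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
result = conv2d_valid(image, kernel)
print(result.shape)
print(result)
```

Note that many deep learning libraries implement the unflipped variant (cross-correlation) under the name convolution; since the filter weights are learned, the distinction has no practical effect on training.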

The use of convolution allows for weight sharing, reducing the number of parameters. If an image of 256×256×3 is used as input to an MLP, 196,608 weights are necessary for each neuron of the first layer, due to its full connections. With a CNN, the connections are local and the weights are image filters whose height and width define the spatial extent over which the weights act. Figure 2.2 illustrates the outline of a basic CNN. The output dimensions of a convolution layer depend on the number of filters, which gives its depth. The stride, which defines the step with which the filters move over the input, reduces the output spatial dimensions if greater than 1. Another factor is the use of zero padding, which preserves the original input dimensions; without it, filters cannot be centered on boundary pixels and the output is reduced by trimming the perimeter.
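The effect of filter size, stride and padding on the output can be made concrete with the standard output-size formula, floor((W - F + 2P) / S) + 1. The sketch below is illustrative only; the function name and the 3×3 filter are assumptions chosen for the example.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Weights needed by one fully connected neuron on a 256x256x3 image.
mlp_weights_per_neuron = 256 * 256 * 3

# Weights of one 3x3 filter spanning the 3 input channels (bias excluded).
conv_weights_per_filter = 3 * 3 * 3

print(mlp_weights_per_neuron, conv_weights_per_filter)
print(conv_output_size(256, 3))                       # no padding: perimeter trimmed
print(conv_output_size(256, 3, padding=1))            # zero padding: size preserved
print(conv_output_size(256, 3, stride=2, padding=1))  # stride 2: size halved
```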

A common operation applied after convolution layers is pooling. A pooling layer is responsible for reducing the size of the input by aggregating neighboring values, typically through their average or maximum. It works similarly to a filter with spatial dimensions that moves through the input, whose output is the aggregation of the values it currently covers.
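A pooling layer can be sketched in a few lines of NumPy. The function `pool2d` and its default 2×2 window with stride 2 are assumptions for illustration, not a description of any specific architecture in this thesis.

```python
import numpy as np

def pool2d(x: np.ndarray, size: int = 2, stride: int = 2, mode: str = "max") -> np.ndarray:
    """Slide a size x size window over x and aggregate each window."""
    reduce = np.max if mode == "max" else np.mean
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="mean")
print(max_pooled)
print(avg_pooled)
```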

Reduction of spatial size is desirable in the case of classification because the final layer output should reflect a category. Thus, it is also common to have final fully connected layers.

Figure 2.2: Example of a convolutional neural network [6].

Another useful operation to be applied after the convolution (but not restricted to it) is Batch Normalization (BN). Batch normalization keeps the mean activation close to 0 and the activation standard deviation close to 1, which reduces the internal covariate shift, that is, the amount by which the hidden unit values shift around [35].

Standardizing the activations that serve as input to another layer means that their distribution does not change much during the weight updates. The consequence is faster training, because the stability of the activations allows higher learning rates and each layer learns more independently.
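The standardization step can be sketched as follows, assuming activations arranged as a (batch, features) array. The scale `gamma`, shift `beta` and stabilizer `eps` follow the usual formulation of batch normalization; the concrete values and the function name are assumptions for this example.

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Standardize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Activations with mean around 5 and standard deviation around 3.
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
y = batch_norm(x)
print(y.mean(axis=0), y.std(axis=0))
```

This sketch covers only the training-time computation; at inference, implementations typically replace the batch statistics with running averages collected during training.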

A more recent study [77] suggests that batch normalization is effective because it smooths the optimization function, which simplifies convergence, and that reducing internal covariate shift has little to do with it. Nevertheless, the positive impact of using such normalization is well accepted.

With the increasing complexity of networks, an additional issue needs to be managed: overfitting. When the network is too complex for the amount of available data, it can produce a model that fits the training data well, but does not generalize to other samples from the same distribution. One way to avoid overfitting is to apply a regularization that prevents the network from producing an overly specific model.

With neural networks of any type, the most widely adopted form of regularization is Dropout [86]. The main idea is to randomly eliminate units and their connections during training, which produces an exponential number of different networks that are thinned versions of the original. This prevents the units from co-adapting too much. At test time, the original network is used with scaled-down weights, approximating the effect of averaging the predictions of all thinned networks.
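The training and test behaviours of dropout can be contrasted with a small sketch. It assumes a keep probability of 0.5 and scales the activations at test time, which is equivalent to scaling the outgoing weights; function names are illustrative.

```python
import numpy as np

def dropout_train(h: np.ndarray, keep_prob: float, rng: np.random.Generator) -> np.ndarray:
    """Randomly zero units; each unit survives with probability keep_prob."""
    mask = rng.random(h.shape) < keep_prob
    return h * mask

def dropout_test(h: np.ndarray, keep_prob: float) -> np.ndarray:
    """Use every unit, scaled so the expected activation matches training."""
    return h * keep_prob

rng = np.random.default_rng(0)
h = np.ones(100_000)
train_mean = dropout_train(h, 0.5, rng).mean()
test_mean = dropout_test(h, 0.5).mean()
print(train_mean, test_mean)
```

Many current implementations use the "inverted" variant instead, dividing by the keep probability during training so that nothing needs to change at test time.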

The great success of CNNs achieved through these new techniques in many practical applications has led to the resurgence of neural network research. Although neural networks were proposed many years ago, they have now become essential in many problems. A detailed discussion of neural networks is beyond the scope of this thesis. For a more in-depth analysis, the reader can refer to [28, 57].


2.1.3 Representation Learning

The development of deep learning stems from a paradigm shift. Previously, the transformation of raw data into a more suitable representation was designed by a domain expert, and learning occurred only in the classification of data described by this representation. In the current paradigm, the representation and the classification are learned together. However, with supervised training, the representation is adapted to suit the classifier and no condition is explicitly imposed on the type of features learned. Similarly, unsupervised learning obtains a representation as a byproduct of the main optimization task.

Representation learning is defined as the transformation of raw data into a better representation that is not hard-coded, but learned from the data itself. More specifically, we refer here to methods that care about the type of features, adding constraints and designing specific architectures to obtain a representation that fits other tasks. Therefore, the primary goal shifts from solving some classification or regression problem to the extraction of useful information.

Deep learning models require a large amount of data to account for all variations and avoid overfitting to just a subset. However, the amount of labeled data is limited since it depends on manual labor. Reusing representations is an important strategy for addressing these issues.

Unsupervised learning provides a good initialization for a supervised task, where each layer can be pre-trained without supervision and, later, the entire network is trained using the labels [20]. This is called greedy layer-wise pre-training and was one of the techniques that boosted the revival of neural networks, as it made it possible to successfully train deep fully connected architectures. The algorithm comes down to each layer learning, in an unsupervised fashion, to output a representation from the input provided by the previous layers.

Transfer learning is a representation learning technique that is broadly used and can be successful [60]. It consists in adapting a model learned from one task to another, where the features captured in the first are still relevant. Typically, the architecture is preserved and the original weights serve as initialization. The major example of a task whose model is useful to many others is the ImageNet [69] object recognition challenge, which has 1,000 classes and an average of 1,000 images per class. This volume of annotated data enables the construction and effective training of very deep models that can be used for other image-related tasks [34].

In addition, the transfer may be partial and from an unsupervised task to a supervised one. For example, a model can be trained to predict the relative position of two patches extracted from the same image, which does not require supervision. The obtained features need to capture visual attributes related to the objects and their parts and, therefore, can be exploited to pre-train an object localization model. This was done by Doersch et al. [18], achieving better results than training without the unsupervised features of the first task. In this thesis, we use the video reconstruction task to obtain a representation for video action classification: pre-trained weights from the unsupervised task are used in the first part of the model and weights from ImageNet serve as initialization of the second part.


2.1.4 Autoencoders

An autoencoder (AE) is a specific neural network design in which the learning goal is to copy the input to the output, but with some constraint to internally obtain a compressed representation of the input. Since the objective is to retrieve the most useful information from the data, a perfect copy is not desirable and, therefore, it is necessary to ensure that the network is not just memorizing the input.

The basic structure of an autoencoder involves an encoder function $h = f(x)$ that transforms the input data $x$ into an internal representation $h$, also referred to as a code, and a decoder $\bar{x} = g(h)$ that uses the code to produce a reconstruction of $x$. Figure 2.3 illustrates the basic structure of an autoencoder.

Figure 2.3: Example of an autoencoder structure [2].

The autoencoder was originally developed as an approach to perform dimensionality reduction and, in fact, if linear activations are used, it can be similar to Principal Component Analysis (PCA), a classic method for reducing data dimensionality by transforming the data into a set of linearly uncorrelated variables that retain most of the data variance. The design of an autoencoder depends on the data on which it will be used, and the type of layers can vary freely, for instance, fully connected, convolutional or recurrent. Since it uses only input data, it is a form of unsupervised learning, and the optimization consists of minimizing a reconstruction error that measures the difference between the original data and the output.

If not carefully designed, the autoencoder may fail to learn a data transformation and merely copy the input to the output. One strategy to avoid this is to impose a bottleneck in the architecture by constraining the number of hidden units. Such models are referred to as undercomplete autoencoders, and the bottleneck directly limits the amount of information that flows from input to output.

Ideally, this forces the encoder to capture the most salient features, but theoretically, if there is enough capacity, it can learn to index the training data instead of extracting useful features. Therefore, it is important to control the capacity of the encoder in terms of size and depth.

Another class is that of sparse autoencoders, which apply a penalty to the cost function that imposes sparsity on the encoder output. This allows the number of hidden units to be equal to or greater than the input size, because only a small number of units will be activated depending on the input data.

This can be seen as regularization; although it is usual to regularize the weights to avoid overfitting, in this case the activations are the target. One way to regularize is to compute and minimize the sum of the absolute values of the activations, which leads to fewer activations. Glorot et al. [27] introduced the idea that using ReLU activations will produce a sparse code with actual zeros.

As with undercomplete autoencoders, it is expected that the sparse type will avoid the simple identity function and respond to unique statistical features in the data in which it was trained [5].

A different strategy for learning useful information is to change the reconstruction error. Denoising autoencoders minimize the reconstruction error between the original input and the output obtained from a version of that input corrupted by some form of noise. This is shown in Equation 2.5, where $\bar{x}$ is the noisy version of $x$.

$$L(x, g(f(\bar{x}))) \tag{2.5}$$

This process avoids an identity function because it changes the objective and the encoder does not have access to the original data. Bengio [7] showed that denoising forces the model to implicitly learn the structure of the data generating distribution.
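The objective in Equation 2.5 can be sketched as below. The Gaussian corruption, the mean squared error used for $L$, and the callables `encoder` and `decoder` are all assumptions for illustration; the text does not prescribe a particular noise model or error measure.

```python
import numpy as np

def corrupt(x: np.ndarray, noise_std: float, rng: np.random.Generator) -> np.ndarray:
    """Additive Gaussian noise (one possible corruption process)."""
    return x + rng.normal(0.0, noise_std, size=x.shape)

def denoising_loss(x, encoder, decoder, noise_std, rng) -> float:
    """L(x, g(f(x_noisy))): the target is the clean input x."""
    x_noisy = corrupt(x, noise_std, rng)
    reconstruction = decoder(encoder(x_noisy))
    return float(np.mean((x - reconstruction) ** 2))

rng = np.random.default_rng(0)
x = rng.random((64, 64))

def identity(v):
    return v

# An identity encoder/decoder pair simply copies the noisy input.
loss_with_noise = denoising_loss(x, identity, identity, noise_std=0.1, rng=rng)
loss_without_noise = denoising_loss(x, identity, identity, noise_std=0.0, rng=rng)
print(loss_with_noise, loss_without_noise)
```

With corruption, the identity mapping leaves a loss of roughly the noise variance, so copying the input is no longer an optimal solution; without corruption, the same mapping reaches zero loss.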

There is yet another type of autoencoder based on the idea that very similar inputs should learn very similar encodings and, to force this into training, a penalty is applied to the derivatives of the hidden layer. These are called contractive autoencoders and are similar to denoising in the sense that small changes are treated as noise and should be discarded.

It is also important to mention the Variational Autoencoder (VAE), although it has a different purpose that moves away from learning a representation. It fits into the category of generative models and is used to create new data by sampling the latent space learned from training data. This is possible because the model is constrained to produce the parameters of a probability distribution that models the data.

There are many applications of autoencoders, including data denoising [12], image inpainting [115], super-resolution [121], information retrieval [70] and anomaly detection [125]. In Section 2.2.2, we will expand the description of autoencoder applications in the video context.


2.2 Related Work

This section describes approaches available in the literature that are related to the research topic investigated in this thesis. Initially, we provide a summary of relevant action recognition methods, focusing on representations and highlighting the approaches that are more similar to ours. Next, we focus on the use of autoencoders for video analysis.

2.2.1 Action Recognition

The earliest approaches to action/motion representation and recognition date back to the 1980s and were based on hand-crafted features aimed at modeling an action by templates. Hogg [32] proposed a 3D model of a walking person, where overlapping the model with the images would provide the classification. An improvement of this method was given by Rohr [68], but the refinement was restricted to the model construction and classification. Model-based approaches are fundamentally naïve, as they make various assumptions and impose constraints on the data, as well as being expensive and difficult.

Subsequent methods opted for feature-based representations, where the action class has no specific model but is defined by a set of characteristics. Bobick and Davis [9] introduced a holistic approach using Motion History Image (MHI) and Motion Energy Image (MEI) representations that are based on cumulative operations on image sequences. MEI represents the location of the action, whereas MHI represents the history of silhouette changes over time. Figure 2.4 illustrates some examples of MEI and MHI representations. Although these representations showed promising results and inspired other refined approaches [29, 94, 113], the holistic approach is not able to recognize fine details and is degraded by background motion, viewpoint changes and occlusions.

These methods are related to this thesis since they transform a video into an image representation. Torres and Pedrini [95] used a video-to-image transformation called Visual Rhythms (VR) for action recognition. Visual rhythms [56] are built by joining slices of all video frames, where each slice can be a set of pixels or their average in any direction [85]. However, these approaches differ from ours, because the construction of the image is rigid and there is no learning associated with it. Figure 2.5 illustrates some examples of visual rhythms extracted from video sequences through two different construction strategies.
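The construction of a mean-based visual rhythm can be sketched as follows. This is only one plausible reading of the description above: the naming convention (a "horizontal" slice is obtained by averaging over the rows of a frame) and the function name are assumptions, not the definition used in [15, 95].

```python
import numpy as np

def visual_rhythm(video: np.ndarray, direction: str = "horizontal") -> np.ndarray:
    """Join one averaged slice per frame into a single image.

    video has shape (frames, height, width) and holds grayscale values.
    """
    # Assumed convention: "horizontal" averages over rows (axis 1), giving a
    # slice of length `width`; "vertical" averages over columns (axis 2).
    axis = 1 if direction == "horizontal" else 2
    slices = video.mean(axis=axis)   # shape (frames, width) or (frames, height)
    return slices.T                  # one column per frame

video = np.zeros((7, 4, 6))          # 7 frames of 4x6 pixels
horizontal = visual_rhythm(video, "horizontal")
vertical = visual_rhythm(video, "vertical")
print(horizontal.shape, vertical.shape)
```

One dimension of the resulting image equals the number of frames, so videos of different lengths yield images of different sizes.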

The use of local features was introduced by Laptev [46] with Space-Time Interest Points (STIP). The representation extends the notion of interest points present in static images and follows the pipeline of point detection, obtaining a description of the region around the points and performing aggregation of all descriptions, where the final product is a vector. Other point detectors were proposed later [42, 48, 79, 114, 118].

Among the descriptors for local regions, it is worth mentioning Local Binary Patterns (LBP) [59] and variations [39, 58, 97, 122], Histogram of Oriented Gradient (HOG) [47, 101], Histogram of Optical Flow (HOF) [16, 47] and Motion Boundary Histogram (MBH) [16, 101].

Space-time interest points act at a short range and are unable to capture long-term actions. A solution was to obtain trajectories by matching the points in many frames of the video [88]. Many approaches used the concept of trajectories to model and represent actions with an aggregation of descriptors [52, 66, 99, 100]. Wang et al. [100] introduced Improved Dense Trajectories (iDT), whose representation was the state of the art before deep learning solutions.

Figure 2.4: Examples of Motion Energy Images (MEI) and Motion History Images (MHI). The first row shows the key frames of a video, the second shows the MEI representation, and the third shows the MHI representation [9].

Figure 2.5: Examples of visual rhythm images from different slicing directions [15]. (a) video sequences; (b) horizontal-mean strategy; (c) vertical-mean strategy.

Due to the analogy between 3D images and videos, the first convolutional neural networks (CNNs) applied to the action recognition problem used 3D convolutions to explore spatio-temporal features [36]. Karpathy et al. [37] trained 3D networks from scratch using the Sports-1M data set, which contains more than 1 million videos. However, their method did not outperform traditional approaches due to the difficulty in representing motion. Tran et al. [96] proposed improvements with 3D CNN models to learn spatio-temporal features from video sequences and showed that they can be competitive and learn good features.

Simonyan and Zisserman [82] developed a two-stream method, where one network learns actions from still frames and another learns the actions from a stack of pre-computed optical flow images, both using 2D CNNs. These networks were trained separately and, at test time, a fusion of scores provides the final class. Optical flow consists of estimating the apparent motion of pixels between adjacent frames [92]. This strategy can be considered a video-to-image transformation; however, it works only with a two-frame input and results in an estimate for the x and y directions, and thus does not provide dimensionality reduction.

Wang et al. [106] further improved the method using more recent and deeper 2D CNN architectures and refining data augmentation and training parameters, but mainly by taking advantage of pre-trained weights for the temporal stream, averaging the weights of the input layer so that it supports the stack of images.

The two-stream scheme became the state of the art and many other approaches derived from it. The main issues to be addressed are the inability of the standard two-stream method to explore the correlation between spatial and temporal information and to capture long-term relationships. Among several methods that address this problem, two main strategies include: (i) working on the CNN output in order to explore the features from frames or snippets [17, 19, 49, 55, 98, 107] and (ii) introducing a different temporal representation [8, 33, 104, 108]. Our work falls into this latter category of methods.

An example of the first method category is given by Yue-Hei et al. [55], who proposed the use of feature pooling or recurrent networks, such as Long Short Term Memory (LSTM), to perform feature aggregation. In addition to using LSTM, Donahue et al. [19] used it connected to the CNN output as a single network, allowing end-to-end training. Varol et al. [98] proposed long-term temporal convolutions, working with 3D convolution and large temporal resolutions. Wang et al. [107] contributed with a Temporal Segment Network (TSN) that is based on sparse sampling and a segmental consensus at training time. Ma et al. [49] proposed a TSN-inspired approach, but with LSTMs as well as a temporal network that receives as input a matrix of features at different time steps.

Using an idea similar to the two-stream framework, Carreira and Zisserman [10] developed an action recognition approach based on 3D CNNs built as inflated versions of 2D CNNs, using pre-trained weights and also training the network with a huge action database, achieving high accuracy rates. These improvements reinforce the feasibility of using well-established 2D deep CNN architectures and their pre-trained weights from ImageNet [69].

The second category, to which this thesis belongs, focuses on improving or proposing a different video representation. Next, we will describe these methods and compare them to our approach.

A siamese network was proposed by Wang et al. [108] to model the action as a transformation from a preconditioned state to an effect. A siamese network consists of two copies of the same network, each receiving a different input, whose outputs are compared to enforce some property, usually a short distance between them. In the case of action transformations, the input of one network is a set of frames of the preconditioned state of the action and the input of the other network is a set of frames belonging to the effect. The objective is to enforce a small distance between the effect and the precondition output multiplied by a latent transformation matrix.

Thus, the representation proposed by Wang et al. [108] is learnable, as is ours, but it is a matrix specific to each action class present in the data set and is not intended to represent the video, but a transformation in the high-level feature space.

Wang et al. [104] developed a hand-crafted representation, called Motion Stacked Difference Image (MSDI), inspired by the Motion Energy Image (MEI) [4] as a global temporal stream. The representation is constructed by accumulating the absolute difference of consecutive gray-scale frames and applying a color map to it. Unlike ours, it cannot be learned and is based on naïve and constrained assumptions. Figure 2.6 illustrates an example of MSDI.

Figure 2.6: (a) Frames from a video sequence and (b) the Motion Stacked Difference Image (MSDI) generated from them [104].
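The accumulation step described for MSDI can be sketched as below. This is a hedged illustration rather than the method of [104]: the normalization to [0, 1] is an assumption added so the result can be displayed, and the color-map step is omitted.

```python
import numpy as np

def stacked_difference_image(frames: np.ndarray) -> np.ndarray:
    """Accumulate absolute differences of consecutive grayscale frames.

    frames has shape (num_frames, height, width).
    """
    # Cast first so that unsigned integer frames do not wrap around.
    frames = frames.astype(np.float64)
    diffs = np.abs(np.diff(frames, axis=0))   # one difference per frame pair
    accumulated = diffs.sum(axis=0)
    peak = accumulated.max()
    # Assumed normalization; a color map could be applied afterwards.
    return accumulated / peak if peak > 0 else accumulated

frames = np.zeros((3, 2, 2), dtype=np.uint8)
frames[1:, 0, 0] = 10     # pixel (0, 0) changes between frames 0 and 1
frames[2, 1, 1] = 5       # pixel (1, 1) changes between frames 1 and 2
msdi = stacked_difference_image(frames)
print(msdi)
```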

Some methods, instead of introducing a different representation, attempt to improve the optical flow stream, originally hand-crafted. Zhu et al. [127] introduced a generative network for computing the optical flow representation that can be learned in an end-to-end fashion with the action classes. Similarly, Fan et al. [21] developed a convolutional network that acts as an optical flow solver, even without training, but can be further trained for the action-specific task. Hommos et al. [33] proposed an alternative to optical flow with an Eulerian phase-based motion representation that can also be learned in an end-to-end scheme.

One of the main differences between these methods and our approach is that their target representation is known and they perform a regression to it, whereas the representation is latent in our scenario. In addition, these representations inherit the same peculiarities as the original optical flow, which represents only two frames and does not reduce dimensionality.

Concha et al. [15] utilized visual rhythm as a stream. The slices used can be either vertical or horizontal averages of the frames, and a decision algorithm based on the motion pattern selects the direction of the rhythm to serve as input. As mentioned previously, the visual rhythm is a hand-crafted feature and the generated image has variable dimensions, since it depends on the number of frames. Moreover, given the way it is built, the visual rhythm is not able to encode motion with a single image representation, because horizontal and vertical movements are separated.

The work that is closest to this thesis is that on Dynamic Images (DI) [8]. These are 2D image representations that are intended to summarize video motion. The image is produced by learning a ranking function to sort the frames temporally and retrieving its parameters in the form of an image. Figure 2.7 illustrates examples of Dynamic Images. This scheme proved to be an efficient representation for video compression; however, it acts as a temporal pooling layer when added to a CNN and the parameters are learned for each video individually.

Figure 2.7: Examples of dynamic images that summarize short video sequences as still images [8].

Table 2.1 summarizes some relevant characteristics of the main strategies for video representation.

Table 2.1: Video representation methods.

Method              Format          Data-Driven  Learnable  Main References
Templates           3D model        No           No         [32, 68]
MHI/MEI             Image           Yes          No         [9, 29, 94, 113]
Dense Trajectories  Feature Vector  Yes          No         [52, 99, 100]
Visual Rhythms      Image           Yes          No         [15, 56, 85, 95]
MSDI                Image           Yes          No         [104]
Dynamic Images      Image           Yes          Yes        [8, 102]

2.2.2 Video Autoencoders

In the context of video analysis, autoencoders are not widely used. Although there are methods that employ autoencoders for images and apply them individually to video frames [41, 116, 117], we limit the description to video-specific architectures that consider temporal relationships.


Srivastava et al. [87] presented a pioneering approach to video representation using unsupervised learning and recurrent neural networks. They proposed a multilayer LSTM autoencoder and predictor, where the hidden representation generated by the encoder is the learned representation of the entire input sequence. Inputs are either high-level features extracted from each frame using a 2D CNN or vectorized patches. In either case, the input consists of a 4096-dimensional vector and the internal feature obtained is a 2048-dimensional vector. Figure 2.8 illustrates an autoencoder architecture using LSTMs.

Figure 2.8: LSTM autoencoder model. Extracted from [87].

The main disadvantage of this model is that it does not take into account the spatial properties of frames, but only encodes the temporal changes between features. It is not designed for end-to-end learning; the main idea is to learn the representation and use it for other applications. The resemblance to our work lies in the use of the hidden state as representation, but the architecture differs and, consequently, so do the properties of the generated representation.

Patraucean et al. [61] conceived a temporal autoencoder nested in a spatial autoencoder. Basically, it is an image model with a memory module provided by a convolutional LSTM, where, at each step, a frame is given as input and the reconstruction of the next frame is expected. The intended product is the explicit internal representation, which is designed to be a dense optical flow map. This resembles the approaches described in the previous section that improved the optical flow stream [21, 33, 127], except that it uses the internal representation as a product and considers more than two frames thanks to the memory module. Nevertheless, it was not used for action recognition, but for future frame prediction and weakly-supervised video segmentation.

The field of anomaly detection in videos has made extensive use of autoencoders, where they are used to model the normal behavior and the anomaly is detected by a reconstruction error threshold. Hasan et al. [30] proposed an autoencoder that takes a stack of frames as input channels and uses three 2D convolution layers and two pooling layers in the encoder. The decoder has three deconvolution layers and two unpooling layers, and the latent representation is a 13×13 feature map with 128 channels.

Zhao et al. [124] used 3D convolutions with batch normalization [35], Leaky ReLU [50] activation and pooling. The encoder has 4 layers with these operations and the decoder has 3 layers with 3D deconvolution, batch normalization, Leaky ReLU and no pooling. A fourth and final layer has only the deconvolution with sigmoid activation. The internal representation takes the form of 2×16×16×64.

Chong and Tay [13] proposed the use of Convolutional LSTMs [81] to cope with the temporal aspect. The architecture consists of two convolution layers to reduce the spatial dimensions, then three convolutional LSTMs and, finally, two deconvolution layers to bring the frames back to their original dimensions. The internal representation takes the form of 10×32×26×26.

It is noteworthy that, in the case of anomaly detection, the compactness of the internal representation is not a concern. These methods follow the deep learning pattern of reducing the image spatial dimensions and increasing the number of channels. Our architecture is inspired by this strategy, but differs in shrinking the channels and maintaining the spatial dimensions.


Chapter 3 Proposed Spatio-Temporal Representation for Action Recognition

In this chapter, we describe the proposed action representation based on a video autoencoder that produces an image representation for a set of video frames [76]. This representation can be learned end-to-end for action classification. Furthermore, we coupled it as a stream in a multi-stream framework for action recognition. Figure 3.1 illustrates an overview of the proposed method pipeline.

Figure 3.1: Our method is composed of an autoencoder-based representation for action recognition. The representation obtained from the encoder is used as an image that serves as input to a 2D CNN classifier.


Initially, the autoencoder model is trained in an unsupervised fashion, that is, no labels are used in the training stage since the expected outputs are the clips of videos themselves. Once the autoencoder is trained, the encoder is extracted and linked to a 2D CNN, where they are trained together, as a single network, to obtain the scores of the action classes using a softmax layer.

The main advantage of our video representation as an image is that it can be used in any of the many well-established 2D CNN architectures with pre-trained weights from the ImageNet data set [69]. The use of these deep convolution networks achieved state-of-the-art results in many computer vision tasks.

Section 3.1 presents and discusses the configurations of our video autoencoder, which include the architecture, hyperparameters and loss functions. Section 3.2 details the use of the learned representation for action recognition and the multi-stream approach.

3.1 Video Autoencoder

An autoencoder is an unsupervised learning approach that aims to learn an identity function, that is, the input and the expected output are the same. The goal is to reveal interesting structures in the data by placing constraints in the learning process. It has a basic structure composed of input, encoder, decoder and output.

For our video autoencoder, the input and output are clips of videos. The main constraint is placed on the output of the encoder, which should generate an image-like representation. Figure 3.2 shows our proposed architecture for a video autoencoder, where a set of 10 grayscale frames is arranged as a 10-dimensional image (one channel per frame) for input. This image is passed through the encoder, whose output is a three-dimensional image. This image is then sent to the decoder, whose output is again 10-dimensional and represents the reconstructed video. The purpose of this autoencoder is to shrink the video to a single image representation by learning how to reconstruct a set of frames using only a 3-channel tensor.

Our architecture is inspired by the work of Hasan et al. [30] in the sense that it uses video frames as a volume of channels, allowing the use of 2D convolutions. Our design fits into the category of undercomplete architectures (see Section 2.1.4) because the goal is to perform a type of dimensionality reduction.

The number of input frames is fixed at 10 due to hardware limitations and following other works that use the same time window [8, 10, 106].

3.1.1 Encoder

The encoder is composed of one convolutional layer with 5 filters of 3×3 that reduce the 10-dimensional image to a 5-dimensional representation, which passes through batch normalization followed by a ReLU activation. This serves as input to a second convolutional layer, also with a 3×3 neighborhood but with only 3 filters, which produces a 3-dimensional output followed by batch normalization and a ReLU activation.

In order to maintain the image spatial size, zero padding is applied to avoid trimming the boundaries, a stride of size one is used and no pooling is considered.


Figure 3.2: Proposed video autoencoder architecture.

3.1.2 Decoder

The decoder is even simpler, consisting of a 3×3 convolutional layer with 5 filters followed by batch normalization and ReLU activation, that is, exactly like the first layer of the encoder. The last layer uses the same neighborhood (3×3) and 10 filters, returning to the dimensions of the input. In this layer, there is no batch normalization and a linear activation is applied, which forces the model to concentrate most of the reconstruction capacity on the intermediate representation.
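The size of the architecture described in Sections 3.1.1 and 3.1.2 can be tallied with a short sketch. It counts only convolution weights and assumes one bias per filter; whether biases are used is not stated in the text, and batch normalization parameters are left out. The layer names are labels introduced for this example.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int = 3,
                bias: bool = True) -> int:
    """Number of learnable parameters of one 2D convolution layer."""
    weights = in_channels * out_channels * kernel * kernel
    return weights + (out_channels if bias else 0)

# (name, input channels, output channels), all with 3x3 filters.
layers = [
    ("encoder conv 1", 10, 5),
    ("encoder conv 2", 5, 3),
    ("decoder conv 1", 3, 5),
    ("decoder conv 2", 5, 10),
]

counts = {name: conv_params(cin, cout) for name, cin, cout in layers}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n} parameters")
print("total:", total)

# With zero padding and stride 1 the spatial size is unchanged, so the
# reduction from input clip to code comes only from the channels.
channel_ratio = 10 / 3
print(f"channel reduction: {channel_ratio:.2f}x")
```

Under these assumptions, the whole autoencoder has on the order of a thousand convolution parameters, which is consistent with the text's emphasis on a very simple encoder and decoder.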

3.1.3 Loss Function

The loss function is one of the most important criteria to be defined since it guides the learning process. In the video autoencoder, the goal is to evaluate whether the output video is similar to the input video. A perfect score is not expected; however, a high similarity means that the internal representation is good enough for reconstructing the video.

We analyze some common types of loss functions used for image quality and fidelity assessment, extending them to videos by simply computing them for each pair of corresponding frames and taking their average.

The $\ell_2$ norm (mean squared error) is expressed in Equation 3.1 as

$$\ell_2 = \frac{1}{N} \sum_{i} \sum_{j} \left( f(i, j) - g(i, j) \right)^2 \tag{3.1}$$

where f and g are the images, i and j are the vertical and horizontal coordinates, respectively, and N is the total number of pixels in the images. This loss function is the standard for image reconstruction problems, such as image super-resolution, denoising, demosaicking and deblurring.

There are many advantages in using $\ell_2$ as a loss function. It is a simple metric with low computational cost, constitutes a valid distance metric and is easily differentiable. However, it has been empirically shown that this loss function is prone to getting stuck in local minima in image restoration tasks [123].

The $\ell_1$ norm (mean absolute error) was also investigated, expressed in Equation 3.2 as

$$\ell_1 = \frac{1}{N} \sum_{i} \sum_{j} \left| f(i, j) - g(i, j) \right| \tag{3.2}$$

The main difference between the two previous loss functions is that, while larger errors are penalized more heavily with $\ell_2$, the penalty is linear with $\ell_1$. Nevertheless, both of them fail to correlate with human perception of image fidelity or quality. They make a global assumption on the image, considering each pixel as independent of the others and ignoring their local relations [110].
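A minimal sketch of Equations 3.1 and 3.2 and of their extension to videos by frame-wise averaging is given below. Frames are represented as plain lists of rows, and the function names and toy values are hypothetical, chosen only to show how the two losses react to a small and a large error.

```python
def l2_loss(f, g):
    """Mean squared error between two frames given as lists of rows (Equation 3.1)."""
    n = sum(len(row) for row in f)
    return sum((a - b) ** 2 for rf, rg in zip(f, g) for a, b in zip(rf, rg)) / n


def l1_loss(f, g):
    """Mean absolute error between two frames given as lists of rows (Equation 3.2)."""
    n = sum(len(row) for row in f)
    return sum(abs(a - b) for rf, rg in zip(f, g) for a, b in zip(rf, rg)) / n


def video_loss(frame_loss, video_in, video_out):
    """Average a per-frame loss over corresponding input and output frames."""
    pairs = list(zip(video_in, video_out))
    return sum(frame_loss(f, g) for f, g in pairs) / len(pairs)


# Two toy 2x2 frames; the reconstruction has an error of 1 in the first
# frame and an error of 4 in the second one.
video_in = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
video_out = [[[1, 0], [0, 0]], [[4, 0], [0, 0]]]

print("l1:", video_loss(l1_loss, video_in, video_out))
print("l2:", video_loss(l2_loss, video_in, video_out))
```

In this toy case the single error of 4 contributes four times as much as the error of 1 under the absolute error, but sixteen times as much under the squared error, which is the quadratic versus linear penalty discussed above.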

Another commonly used loss function is based on the Structural Similarity Index Metric (SSIM) [111]. It is a quality measure computed on two image windows x and y of the same size, one taken from image f and the other from image g. Equation 3.3 expresses the SSIM metric as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{3.3}$$

where $\mu_x$ is the mean of x, $\mu_y$ is the mean of y, $\sigma_x^2$ is the variance of x, $\sigma_y^2$ is the variance of y, $\sigma_{xy}$ is the covariance of x and y, and $C_1$ and $C_2$ are constants that stabilize the equation ($C_1 = 0.01 \cdot 255^2$ and $C_2 = 0.03 \cdot 255^2$).

The final index between f and g is the average over all windows, one for each pixel. Since we optimize toward a minimum, the appropriate loss function is the dissimilarity, expressed simply as DSSIM = (1 − SSIM) / 2.

The DSSIM loss function corresponds more closely to the difference perceived by humans than $\ell_1$ and $\ell_2$. The latter two loss functions further penalize global differences in contrast and brightness, whereas the former focuses on the structure of the image, which is more relevant to our problem.
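The contrast between the two families of losses can be illustrated on a single window. The sketch below evaluates Equation 3.3 for one pair of windows (the full index would average over all windows), with default constants following the values as written in the text; other SSIM implementations commonly use (0.01 · 255)² and (0.03 · 255)², so they are left as parameters. The function names and the toy windows are hypothetical.

```python
def ssim_window(x, y, c1=0.01 * 255 ** 2, c2=0.03 * 255 ** 2):
    """SSIM (Equation 3.3) between two windows given as flat lists of pixels."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def dssim(x, y):
    """Structural dissimilarity, (1 - SSIM) / 2, for a single window."""
    return (1 - ssim_window(x, y)) / 2


def mse(x, y):
    """Mean squared error between two flat windows."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


window = [50, 150] * 4                 # alternating dark/bright pattern
brighter = [p + 100 for p in window]   # same pattern, uniform brightness shift
flipped = [150, 50] * 4                # same mean, pattern reversed

print("MSE   brighter / flipped:", mse(window, brighter), mse(window, flipped))
print("DSSIM brighter / flipped:", dssim(window, brighter), dssim(window, flipped))
```

Both distorted windows have the same mean squared error with respect to the original, yet the DSSIM of the reversed pattern is several times larger than that of the brightness-shifted one, which matches the claim that DSSIM emphasizes structure over global intensity changes.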

Unlike other image processing problems, such as image denoising or super-resolution, where these loss functions are employed to obtain an image of high quality and fidelity with respect to a ground truth, image quality aspects such as sharpness, low noise or color accuracy are not as important in our problem as preserving the general structure of the image. Notice that the goal is to compress a number of frames into a single 3-channel image and, with such compression, some aspects should be prioritized over others. Therefore, for the action recognition task, it is more important to preserve information about structural changes than pixel intensities. A comparative analysis is shown in Chapter 4.

3.2 Multi-Stream Architecture

To apply our video autoencoder-based representation to the action recognition task, we are inspired by the two-stream approach, designing our representation as a different stream and following the same practices used in common two-stream architectures for action recognition [24, 82, 106].

As shown in Figure 3.3, the encoder from the video autoencoder is connected to a 2D CNN for image classification. Many architectures have been proposed for the task of image recognition and we chose InceptionV3 [91] since it achieves state-of-the-art results in the ImageNet competition, is very compact and converges easily.

Figure 3.3: Proposed action recognition architecture composed of spatial, temporal and spatio-temporal streams.

The first issue to be solved is the connection between the architectures. To allow end-to-end learning, the model needs to be treated as a single network, meaning that it cannot consist of two separate steps in which the encoder generates the image, saves it, and the image is then loaded into the 2D CNN. The link between the models is very simple: everything is seen as a single model in which the encoder output is fed directly to the input of the 2D CNN and the input data now goes to the encoder, such that the InceptionV3 input layer becomes a hidden layer. Nevertheless, this raises the concern that InceptionV3 expects an image with a fixed size of 299×299 pixels and was pre-trained with the input scaled to [−1, 1]. Conveniently, the autoencoder was designed and trained to produce a 299×299 internal representation, which solves this aspect of the encoder-CNN connection. The normalization aspect can be solved in many ways. One option would be to design the autoencoder to directly produce an internal code within the desired range. To do that, it would be sufficient to replace the last activation of the encoder with a hyperbolic tangent function (tanh) [76]. The tanh activation can also be introduced only at the connection, either replacing the ReLU activation or being added on top of it. In addition to the activation, another option would be to apply a function that uses the minimum and maximum values to scale the output of the encoder to [−1, 1]. In Section 4.3, we explore these two options, initially performing the test without any modification, using the output of the encoder directly as the input to InceptionV3.
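The two connection options can be sketched on a handful of encoder outputs. The snippet below is an illustration under simple assumptions: the code values are made up, the function names are hypothetical, and the minimum and maximum are taken over the given values, whereas the text does not specify over which set they are computed.

```python
import math


def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly map the values so that their minimum becomes low and their maximum high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: map everything to the middle of the target range.
        return [(low + high) / 2 for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


def tanh_squash(values):
    """Apply the hyperbolic tangent element-wise."""
    return [math.tanh(v) for v in values]


# Hypothetical encoder outputs after the final ReLU (hence non-negative).
code = [0.0, 0.5, 2.0, 4.0]

print("min-max:", min_max_scale(code))
print("tanh   :", tanh_squash(code))
```

One detail worth noting is that tanh applied on top of a ReLU output can only produce values in [0, 1), since its input is never negative, whereas replacing the ReLU with tanh or using min-max scaling can cover the full [−1, 1] interval.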

Given the connected model of the video encoder and the image classification network, more modifications are required. InceptionV3 was designed for an image classification problem with 1000 categories and, to adapt it to the action recognition task, the softmax prediction layer needs to be replaced with one whose size matches the number of classes, which is specific to a given data set but is expected to be less than 1000.
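The effect of replacing the prediction layer can be quantified with simple arithmetic. The sketch below assumes that the softmax layer is a fully connected layer on top of the 2048-dimensional pooled feature vector of InceptionV3; the class count of 101 is a hypothetical example and is not taken from the text.

```python
FEATURES = 2048  # size of the pooled InceptionV3 feature vector (assumed head input)


def softmax_layer_params(num_classes, features=FEATURES):
    """Weights plus biases of a fully connected softmax layer."""
    return features * num_classes + num_classes


imagenet_head = softmax_layer_params(1000)  # original 1000-category layer
action_head = softmax_layer_params(101)     # hypothetical data set with 101 classes

print("ImageNet head parameters:", imagenet_head)
print("action head parameters  :", action_head)
```

Because the new layer has a different shape, its weights cannot be copied from the pre-trained network and would have to be learned from the action recognition data.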

Since the data sets considered for video action recognition are small compared to the image data set used to pre-train the network, overfitting is a critical problem. The inclusion of a dropout layer before the softmax is therefore required. Dropout [86] is a regularization method for neural networks in which some units are randomly dropped (their output is multiplied by zero) during the training stage. Its purpose is to reduce overfitting, and it is also a way to approximate an ensemble of subnetworks.
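As a rough illustration of the mechanism, the sketch below implements the common "inverted" variant of dropout, in which surviving units are rescaled during training so that nothing needs to change at test time. The rescaling, the rate of 0.5 and the function name are assumptions made for the example, not details given in the text.

```python
import random


def dropout(units, rate, training, rng):
    """Zero each unit with probability `rate` during training; identity otherwise."""
    if not training:
        return list(units)
    keep = 1.0 - rate
    # Surviving units are divided by the keep probability (inverted dropout).
    return [u / keep if rng.random() < keep else 0.0 for u in units]


rng = random.Random(0)  # fixed seed so that the example is repeatable
features = [1.0] * 1000

train_out = dropout(features, rate=0.5, training=True, rng=rng)
test_out = dropout(features, rate=0.5, training=False, rng=rng)

print("units dropped during training:", train_out.count(0.0))
print("units dropped at test time   :", test_out.count(0.0))
```

Each training step therefore sees a different random subnetwork, which is the sense in which dropout approximates an ensemble.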

Figure 3.4 illustrates the architecture of our spatio-temporal stream for action recognition. It is composed of the encoder extracted from a trained video autoencoder, an optional connection layer (tanh activation or min-max normalization) and an image classification architecture (InceptionV3) modified to include dropout and a softmax layer whose number of units corresponds to the number of action classes. Additional details about the InceptionV3 architecture can be found in the work of Szegedy et al. [91].

Figure 3.4: Proposed architecture for action recognition using our spatio-temporal stream.

Moreover, the standard streams for action recognition were designed and implemented. The spatial stream takes a single sampled RGB image as input to the image classification network. This network architecture is the same as InceptionV3, except for the modifications previously described (dropout and softmax) for the spatio-temporal stream.