Deep learning methods for detecting anomalies in videos: theoretical and methodological contributions

(1) FEDERAL UNIVERSITY OF TECHNOLOGY – PARANÁ. GRADUATE PROGRAM IN ELECTRICAL AND COMPUTER ENGINEERING. MANASSÉS RIBEIRO. DEEP LEARNING METHODS FOR DETECTING ANOMALIES IN VIDEOS: THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS. DOCTORAL THESIS. CURITIBA 2018.

(2) MANASSÉS RIBEIRO. DEEP LEARNING METHODS FOR DETECTING ANOMALIES IN VIDEOS: THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS. Doctoral Thesis presented to the Graduate Program in Electrical and Computer Engineering of the Federal University of Technology – Paraná as partial fulfillment of the requirements for the degree of "Doctor of Science (D.Sc.)" – Area of concentration: Computer Engineering. Advisor: Prof. Dr. Heitor Silvério Lopes. Co-advisor: Prof. Dr. André Eugênio Lazzaretti. CURITIBA 2018.

(3) International Cataloging-in-Publication Data. R484de 2018. Ribeiro, Manassés. Deep learning methods for detecting anomalies in videos: theoretical and methodological contributions / Manassés Ribeiro. -- 2018. 120 pages: ill.; 30 cm. Also available via the World Wide Web. Text in English with abstract in Portuguese. Thesis (Doctorate) – Federal University of Technology – Paraná. Graduate Program in Electrical and Computer Engineering. Area of concentration: Computer Engineering, Curitiba, 2018. Bibliography: pages 102-111. 1. Anomaly detection (Computer security). 2. Machine learning. 3. Neural networks (Computing). 4. Error-correcting codes (Information theory). 5. Convolutions (Mathematics). 6. Image processing – Digital techniques. 7. Pattern recognition systems. 8. Computer vision. 9. Simulation methods. 10. Electrical engineering – Theses. I. Lopes, Heitor Silvério, advisor. II. Lazzaretti, André Eugênio, co-advisor. III. Federal University of Technology – Paraná. Graduate Program in Electrical and Computer Engineering. IV. Title. CDD: Ed. 23 -- 621.3. Central Library of the Curitiba Campus – UTFPR. Librarian: Luiza Aquemi Matsumoto CRB-9/794.

(4) Ministry of Education. Federal University of Technology – Paraná. Directorate of Research and Graduate Studies. THESIS APPROVAL FORM No. 165. The Doctoral Thesis entitled "Deep Learning Methods for Detecting Anomalies in Videos: Theoretical and Methodological Contributions", defended in a public session by the candidate Manassés Ribeiro on March 5, 2018, was judged for the obtention of the title of Doctor of Science, area of concentration Computer Engineering, and approved in its final form by the Graduate Program in Electrical and Computer Engineering. EXAMINATION COMMITTEE: Prof. Dr. Heitor Silvério Lopes – President (UTFPR); Prof. Dr. Eros Comunello (UNIVALI); Prof. Dr. Carlos Eduardo Thomaz (FEI); Prof. Dr. Luiz Eduardo Soares de Oliveira (UFPR); Prof. Dr. Hugo Vieira Neto (UTFPR). The original copy of this document is filed at the Program Secretariat, bearing the Coordination's signature after submission of the corrected version of the work. Curitiba, March 5, 2018.

(5) In memoriam of my father João Maria, my grandmother Olinda, and my grandfather João Batista, who passed away during the period of my doctorate. I also dedicate this work to my family, especially to my mother Jurema, my brother Gamaliel, my daughter Elisa, my niece Julia, and my wife Marion.

(6) ACKNOWLEDGEMENTS. During my doctorate I was often put to the test, with more doubts than certainties. Over the past four years, several people contributed to the conclusion of this thesis, and I am sincerely grateful to them. First, I would like to thank my family, especially my wife Marion and my daughter Elisa, for their patience when I was absent and for their comfort when I needed it. In particular, I would like to thank my advisor, Professor Heitor Silvério Lopes, for his patience, confidence, and serious work, from which a mentoring relationship grew into a good friendship. In the same way, I would like to thank my co-advisor, Professor André Eugênio Lazzaretti, who has supported me over the past two years with productive and stimulating conversations. Thank you both for the lessons and support. I would also like to thank my colleagues at the LABIC Lab, Hugo Perlin, Cesar Benitez, Adriano Gabardo, Leandro Hatorri, Matheus Gutoski, Lia Takiguchi, Bruna de Paula, Lucas Albini, Fernando Carvalho, and Marcelo Romero, for their conversations, suggestions, gatherings, and constant help. A special thanks to my friend Marcelo Romero for his tireless willingness to review this document. Finally, I would like to thank the Catarinense Federal Institute, in particular my colleagues at the Videira campus, for their support in allowing me to pursue the doctorate with exclusive dedication, the IFC/CAPES/Prodoutoral program for the scholarship, and NVIDIA for the donation of the GPU boards used in this work.

(7) "However difficult life may seem, there is always something you can do and succeed at. It matters that you don't just give up." Stephen Hawking. "Everything has its apogee and its decline... It is natural that it be so; however, when everything seems to converge to what we suppose to be nothingness, behold, life reappears, triumphant and beautiful! New leaves, new flowers, in the infinite blessing of the resumption." Chico Xavier.

(8) ABSTRACT. RIBEIRO, Manassés. DEEP LEARNING METHODS FOR DETECTING ANOMALIES IN VIDEOS: THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS. 120 p. Doctoral Thesis – Graduate Program in Electrical and Computer Engineering, Federal University of Technology – Paraná. Curitiba, 2018. Anomaly detection in automated video surveillance is a recurrent topic in recent computer vision research. Deep Learning (DL) methods have achieved state-of-the-art performance for pattern recognition in images, and the Convolutional Autoencoder (CAE), which is capable of capturing the 2D structure of objects, is one of the most frequently used approaches. In this work, anomaly detection refers to the problem of finding patterns in images and videos that do not belong to the expected normal concept. Aiming at classifying anomalies adequately, methods for learning relevant representations were investigated. For this reason, both the capability of the model to learn features automatically and the effect of fusing hand-crafted features with raw data were studied. Indeed, for real-world problems, the representation of the normal class is an important issue for detecting anomalies, in which one or more clusters can describe different aspects of normality. For classification purposes, these clusters must be as compact (dense) as possible. This thesis proposes the use of the CAE as a data-driven approach in the context of anomaly detection problems. Methods for feature learning using both hand-crafted features and raw data as input were proposed, and how they affect classification performance was investigated. This work also introduces a hybrid approach combining DL and one-class support vector machine methods, named Convolutional Autoencoder with Compact Embedding (CAE-CE), for enhancing the compactness of normal clusters. Besides, a novel sensitivity-based stop criterion was proposed, and its suitability for anomaly detection problems was assessed.
The proposed methods were evaluated on publicly available datasets and compared with state-of-the-art approaches. Two novel benchmarks, designed for video anomaly detection in highways, were introduced. The CAE was shown to be promising as a data-driven approach for detecting anomalies in videos. Results suggest that the CAE can learn spatio-temporal features automatically, and that the aggregation of hand-crafted features is valuable for some datasets. Overall results also suggest that the enhanced compactness introduced by the CAE-CE improved classification performance in most cases, and that the sensitivity-based stop criterion is an interesting alternative. Videos were qualitatively analyzed at the visual level, indicating that the features learned by both methods (CAE and CAE-CE) are closely correlated with the anomalous events occurring in the frames. In fact, much remains to be done towards a more general and formal definition of normality/abnormality, so as to support researchers in devising efficient computational methods that mimic the human semantic interpretation of visual scenes. Keywords: Anomaly Detection, One-Class Classification, Deep Learning, Convolutional Autoencoder, Feature Extraction, Feature Learning, Compact Embedding, Dense Representation.

(9) RESUMO. RIBEIRO, Manassés. DEEP LEARNING METHODS FOR DETECTING ANOMALIES IN VIDEOS: THEORETICAL AND METHODOLOGICAL CONTRIBUTIONS. 120 f. Doctoral Thesis – Graduate Program in Electrical and Computer Engineering, Federal University of Technology – Paraná. Curitiba, 2018. A detecção de anomalias em vídeos de vigilância é um tema de pesquisa recorrente em visão computacional. Os métodos de aprendizagem profunda têm alcançado o estado da arte para o reconhecimento de padrões em imagens e o Autocodificador Convolucional (ACC) é uma das abordagens mais utilizadas por sua capacidade em capturar as estruturas 2D dos objetos. Neste trabalho, a detecção de anomalias se refere ao problema de encontrar padrões em vídeos que não pertencem a um conceito normal esperado. Com o objetivo de classificar anomalias adequadamente, foram verificadas formas de aprender representações relevantes para essa tarefa. Por esse motivo, estudos tanto da capacidade do modelo em aprender características automaticamente quanto do efeito da fusão de características extraídas manualmente foram realizados. Para problemas de detecção de anomalias do mundo real, a representação da classe normal é uma questão importante, sendo que um ou mais agrupamentos podem descrever diferentes aspectos de normalidade. Para fins de classificação, esses agrupamentos devem ser tão compactos (densos) quanto possível. Esta tese propõe o uso do ACC como uma abordagem orientada a dados aplicada ao contexto de detecção de anomalias em vídeos. Foram propostos métodos para o aprendizado de características espaço-temporais, bem como foi introduzida uma abordagem híbrida chamada Autocodificador Convolucional com Incorporação Compacta (ACC-IC), cujo objetivo é melhorar a compactação dos agrupamentos normais. Além disso, foi proposto um novo critério de parada baseado na sensibilidade e sua adequação para problemas de detecção de anomalias foi verificada.
Todos os métodos propostos foram avaliados em conjuntos de dados disponíveis publicamente e comparados com abordagens estado da arte. Além do mais, foram introduzidos dois novos conjuntos de dados projetados para detecção de anomalias em vídeos de vigilância em rodovias. O ACC se mostrou promissor na detecção de anomalias em vídeos. Resultados sugerem que o ACC pode aprender características espaço-temporais automaticamente e a agregação de características extraídas manualmente parece ser valiosa para alguns conjuntos de dados. A compactação introduzida pelo ACC-IC melhorou o desempenho de classificação para a maioria dos casos e o critério de parada baseado na sensibilidade é uma nova abordagem que parece ser uma alternativa interessante. Os vídeos foram analisados qualitativamente de maneira visual, indicando que as características aprendidas com os dois métodos (ACC e ACC-IC) estão intimamente correlacionadas com os eventos anormais que ocorrem em seus quadros. De fato, ainda há muito a ser feito para uma definição mais geral e formal de normalidade, de modo que se possa ajudar pesquisadores a desenvolver métodos computacionais eficientes para a interpretação dos vídeos. Palavras-chave: Detecção de Anomalia, Classificação de Uma Classe, Aprendizagem Profunda, Autocodificador Convolucional, Extração de Características, Aprendizagem de Características, Incorporação Compacta, Representação Densa.

(10) LIST OF FIGURES.
FIGURE 1 – Example of anomalies in a bi-dimensional didactic dataset ..... 25
FIGURE 2 – Example of local and global anomalies ..... 26
FIGURE 3 – Example of fully connected single-hidden-layer Autoencoder ..... 33
FIGURE 4 – Example of convolutional, pooling, deconvolutional and unpooling layers ..... 36
FIGURE 5 – Representation of instances in feature space for One-Class Classification ..... 40
FIGURE 6 – Overview of the approach using hand-crafted features ..... 54
FIGURE 7 – Overview of the approach for learning spatio-temporal features ..... 55
FIGURE 8 – Example of sliding windows approach for selecting frames to compose the cuboids ..... 56
FIGURE 9 – Architecture of the Convolutional Autoencoder proposed in this work, based on (HASAN et al., 2016) ..... 57
FIGURE 10 – Overview of the proposed Convolutional Autoencoder with Compact Embedding method ..... 61
FIGURE 11 – Normal distributions generated using mean vector m and covariance matrices S ..... 66
FIGURE 12 – Histogram of Q and P distributions ..... 67
FIGURE 13 – Values of probabilities for soft assignments and soft probabilistic targets ..... 69
FIGURE 14 – Examples of two feature space representations ..... 70
FIGURE 15 – Example of stop criterion based on sensitivity ..... 70
FIGURE 16 – Example of Caltech-256 dataset ..... 73
FIGURE 17 – Example of Coil100 dataset ..... 74
FIGURE 18 – Example of STL-10 dataset ..... 74
FIGURE 19 – Example of LABIC3 dataset ..... 76
FIGURE 20 – Example of LABIC4 dataset ..... 76
FIGURE 21 – Example of Avenue dataset ..... 77
FIGURE 22 – Example of UCSD Ped1 dataset ..... 77
FIGURE 23 – Example of UCSD Ped2 dataset ..... 78
FIGURE 24 – Example of UMN dataset ..... 78
FIGURE 25 – Normalized Reconstruction Error plotted for a video clip of Avenue dataset ..... 83
FIGURE 26 – Histogram of the Normalized Reconstruction Error for a video clip of Avenue dataset ..... 84
FIGURE 27 – Spatio-temporal Normalized Reconstruction Error plotted for an Avenue dataset video clip ..... 85
FIGURE 28 – Normalized Compression Rate results for training and test datasets ..... 87
FIGURE 29 – Correlation between Area under the ROC curve and Spatial Complexity Coefficient ..... 87
FIGURE 30 – Kullback-Leibler loss convergence, and both TPR and TPR×TNR curves ..... 91
FIGURE 31 – Stop criterion based on sensitivity ..... 92
FIGURE 32 – Smoothed Normalized Anomaly Score computed for Convolutional Autoencoder and Convolutional Autoencoder with Compact Embedding approaches on the LABIC3 dataset ..... 93
FIGURE 33 – Histograms of the Smoothed Normalized Anomaly Score computed for Convolutional Autoencoder and Convolutional Autoencoder with Compact Embedding approaches on the LABIC3 dataset ..... 95
FIGURE 34 – Another example of the Smoothed Normalized Anomaly Score computed for Convolutional Autoencoder and Convolutional Autoencoder with Compact Embedding approaches on the LABIC3 dataset ..... 96
FIGURE 35 – Smoothed Normalized Anomaly Score computed for Convolutional Autoencoder and Convolutional Autoencoder with Compact Embedding approaches on a fragment of the Avenue dataset ..... 96

(12) LIST OF TABLES.
TABLE 1 – Summary of the related work ..... 51
TABLE 2 – Dimensions of each layer of the Convolutional Autoencoder ..... 57
TABLE 3 – Convolutional Autoencoder layers and output sizes ..... 62
TABLE 4 – Summarization of all main dataset features ..... 79
TABLE 5 – Area under the ROC curve and Equal Error Rate results for the four case studies using appearance and motion features ..... 80
TABLE 6 – Confusion matrix for the Equal Error Rate result of all datasets ..... 81
TABLE 7 – Area under the ROC curve and Equal Error Rate results for the four case studies of automatic learning of spatio-temporal features ..... 84
TABLE 8 – Normalized Compression Rate for both training and test sets ..... 86
TABLE 9 – Area under the ROC curve and Spatial Complexity Coefficient for all datasets ..... 86
TABLE 10 – Convolutional Autoencoder with Compact Embedding results for all datasets ..... 89
TABLE 11 – Confusion matrix for the Equal Error Rate result of all video datasets ..... 89
TABLE 12 – Specifications table of LABIC3 and LABIC4 datasets ..... 118
TABLE 13 – Details table of LABIC3 and LABIC4 datasets ..... 119
TABLE 14 – Summary table of LABIC3 and LABIC4 datasets ..... 119

(13) LIST OF ACRONYMS AND ABBREVIATIONS.
AE – Autoencoder
AS – Anomaly Score
AUC – Area Under the ROC Curve
BA – Behavior Analysis
CAE – Convolutional Autoencoder
CAE-CE – Convolutional Autoencoder with Compact Embedding
CNN – Convolutional Neural Network
CV – Computer Vision
DAE – Denoising Autoencoder
DBN – Deep Belief Network
DEC – Deep Embedded Clustering
DL – Deep Learning
EER – Equal Error Rate
FPR – False Positive Rate
FRE – Frame Reconstruction Error
GD – Gradient Descent
GNG – Growing Neural Gas
HOF – Histogram of Optical Flow
HOG – Histogram of Oriented Gradients
KL – Kullback-Leibler
NAS – Normalized Anomaly Score
NCR – Normalized Compression Rate
NRE – Normalized Reconstruction Error
OCC – One-Class Classification
OC-SVM – One-Class Support Vector Machine
PCA – Principal Component Analysis
PoC – Proof of Concept
RCA – Relevant Component Analysis
RE – Reconstruction Error
ROC – Receiver Operating Characteristics
SCC – Spatial Complexity Coefficient
SDAE – Stacked Denoising Autoencoder
SGD – Stochastic Gradient Descent
SSD – Sum of Squared Differences
SVDD – Support Vector Data Description
TNR – True Negative Rate
TPR – True Positive Rate
t-SNE – t-Distributed Stochastic Neighbor Embedding

(14) CONTENTS.
1 INTRODUCTION ..... 15
1.1 OBJECTIVES ..... 19
1.2 STRUCTURE OF THE THESIS ..... 20
2 THEORETICAL BACKGROUND AND RELATED WORK ..... 21
2.1 COMPUTER VISION AND BEHAVIORAL ANALYSIS ..... 21
2.2 ANOMALY DETECTION ..... 23
2.3 ONE-CLASS SUPPORT VECTOR MACHINE ..... 27
2.4 PATTERN REPRESENTATION ..... 28
2.4.1 Representation Based on Spatio-Temporal Features ..... 29
2.4.2 Representation Based on Deep Learning ..... 30
2.5 AUTOENCODER ..... 32
2.5.1 Convolutional Autoencoder ..... 34
2.5.2 Stochastic Gradient Descent ..... 37
2.6 COMPACT REPRESENTATION ..... 39
2.6.1 Deep Embedded Clustering ..... 43
2.7 APPEARANCE AND MOTION FILTERS ..... 46
2.8 SPATIAL VIDEO COMPLEXITY ESTIMATOR ..... 47
2.9 SUMMARY OF THE RELATED WORK ..... 48
3 PROPOSED METHODS ..... 52
3.1 FEATURE LEARNING ..... 52
3.1.1 Appearance and Motion Filters For Anomaly Detection ..... 53
3.1.2 Convolutional Autoencoder for Automatically Learning Spatio-Temporal Features ..... 54
3.1.3 Model Architecture ..... 56
3.1.4 Model Training ..... 58
3.1.5 Normalization of Reconstruction Errors ..... 58
3.1.6 Classification and Evaluation ..... 59
3.1.7 Spatial Complexity Analysis ..... 59
3.2 CONVOLUTIONAL AUTOENCODER WITH COMPACT EMBEDDING ..... 60
3.2.1 Model Architecture ..... 62
3.2.2 Model Pretraining ..... 63
3.2.3 Model Optimization ..... 63
3.2.4 Model Stop Criterion ..... 64
3.2.5 Classification and Evaluation ..... 64
3.2.6 Compact Representation with Kullback-Leibler Divergence ..... 65
3.2.7 Proof of Concept ..... 68
4 COMPUTATIONAL EXPERIMENTS AND RESULTS ..... 71
4.1 DATASETS ..... 72
4.1.1 Caltech 256 ..... 72
4.1.2 Coil100 ..... 73
4.1.3 STL-10 ..... 73
4.1.4 LABIC3 and LABIC4 ..... 75
4.1.5 Avenue ..... 75
4.1.6 UCSD Ped1 and Ped2 ..... 76
4.1.7 UMN ..... 77
4.2 FEATURE LEARNING EXPERIMENTS ..... 78
4.2.1 Effect of Appearance and Motion Filters ..... 79
4.2.2 Reconstruction Error for Detecting Anomalies ..... 81
4.2.3 Automatic Learning of Spatio-Temporal Features ..... 83
4.2.4 Effect of Video Spatial Complexity in the Classification Performance ..... 86
4.3 COMPACT REPRESENTATION EXPERIMENTS ..... 88
4.3.1 Classification Performance ..... 88
4.3.2 Stop Criterion ..... 90
4.3.3 Qualitative Analysis ..... 92
5 CONCLUSIONS ..... 97
5.1 CONTRIBUTIONS ..... 100
5.2 FUTURE WORK ..... 100
REFERENCES ..... 102
Annex A -- PUBLICATIONS ..... 112
A.1 JOURNAL PUBLICATION ..... 112
A.1.1 A Study of Deep Convolutional Auto-encoders for Anomaly Detection in Videos ..... 112
A.2 CONFERENCE PUBLICATION ..... 113
A.2.1 A Clustering-Based Deep Autoencoder for One-Class Image Classification ..... 113
A.2.2 Detection of Video Anomalies Using Convolutional Autoencoders and One-Class Support Vector Machines ..... 113
A.2.3 A Gene Expression Programming Approach for Evolving Multi-Class Image Classifiers ..... 114
A.2.4 Multi-class Classification of Objects in Images Using Principal Component Analysis and Genetic Programming ..... 115
A.3 CHAPTER PUBLICATION ..... 116
A.3.1 Image Segmentation Methods ..... 116
Annex B -- VIDEO DATASETS FOR ANOMALY DETECTION IN HIGHWAYS ..... 117
B.1 AUTHOR INFORMATION ..... 117
B.2 ABSTRACT ..... 117
B.3 VALUE OF THE DATA ..... 117
B.4 SPECIFICATIONS TABLE ..... 118
B.5 DATA ..... 118

(16) 15. 1. INTRODUCTION. Vision is one of the five senses that humans use to guide themselves, and it is considered the main sense through which information from the outside world is acquired. Through vision, humans can obtain information about colors, shapes, and textures. Moreover, the human capacity for understanding visual content is fascinating: it starts in the first months of life and improves over time. In recent years, human vision has been studied in order to understand both human learning and lifelong learning processes, and there is an effort towards reproducing in machines the knowledge that has been unveiled concerning human vision. This science is called Computer Vision (CV). CV methods have become an important tool for solving problems in several areas of knowledge, such as medicine, industry, military applications, space exploration, and autonomous navigation, among others. Due to recent technological advances, the cost of hardware has significantly decreased and, as a consequence, surveillance cameras have become omnipresent in public and private spaces. Another factor contributing to this increase is the growing concern with public safety worldwide. As a result, a huge imbalance arises between the number of surveillance cameras and the number of human observers, since dozens of cameras can acquire videos very efficiently. Hence, the huge volume of video data recorded and stored daily cannot possibly be analyzed by humans, and the process is further hindered by human limitations, such as distraction and tiredness, despite human abilities to analyze and classify images. For this reason, the task of video analysis is tedious and exhausting, causing human observers to fail at detecting important unexpected events (i.e., anomalies). This problem has pushed the development of automatic video surveillance systems, since the human endeavor necessary to perform the observation task effectively is too great.
Hence, automatic video surveillance is a topic of great importance that has been intensely studied in the recent literature (LI et al., 2015b). In this context, it is possible to propose solutions to reduce the human effort in the analysis of videos, and CV can provide methods for tackling this issue. For instance, Behavior Analysis (BA) is a challenging task related to monitoring people, objects, and their relationships in video surveillance. Approaches based on BA are useful for monitoring people or objects in dynamic and busy environments, and for information retrieval in large volumes of video data. BA depends on the structural knowledge of the scene and, for this reason, can be an alternative approach to video anomaly detection problems (ZHU et al., 2014; PIMENTEL et al., 2014).

Anomaly detection in videos has been a subject of great interest in both academia and industry (SODEMANN et al., 2012). However, the definition of anomaly in video surveillance is not only context-dependent but also dependent on human-defined semantics. For instance, a truck crossing an avenue may be considered an anomalous or a normal event depending on the human-defined context. As a matter of fact, there is no general rule for such a definition, except for the qualitative observation that anomalies occur infrequently in comparison with normal events (JIANG et al., 2011).

Detecting objects from previously unknown classes is a recurrent subject in different pattern recognition problems, especially because new classes, anomalous behaviors, and concept drift frequently occur in real-world applications (SCHEIRER et al., 2013; AHMAD et al., 2017). The most common approach to classifying patterns in the presence of previously unknown classes is One-Class Classification (OCC). In the literature, OCC, novelty detection, and anomaly detection are considered synonyms because, to date, there is no universally accepted definition for these terms (PIMENTEL et al., 2014). Usually, authors define their approaches considering the specific problem addressed or the methods applied, so that different definitions sometimes exist for similar problems or methods.
In general, a one-class classifier (or novelty detector) can be defined as a classifier based on previously known patterns, which are arranged as one or a set of normal concepts (clusters), allowing the identification of patterns that were not present in the original training dataset, normally defined as novelties (TAX, 2001). Therefore, considering that there is no universally accepted definition for OCC and novelty detection, for the sake of standardization, "anomaly detection" is used as a generic term in this work. Hence, the anomaly detection definition proposed by Chandola et al. (2009) is adopted: it refers to the problem of finding patterns in data that do not conform to the expected normal concept, where anomalies, in their turn, are patterns that do not conform to the well-defined concept of normal behavior. In this sense, anomaly detection can be approached as an OCC problem, such that the normal class is assumed to be human-defined and has a large number of examples, whilst the other class corresponds to the abnormal class (i.e., samples that are not present, or are rarely present, in the normal class).

For instance, in the analysis of crowded pedestrian walkways, abnormal behaviors could be the circulation of non-pedestrians on the walkways (e.g., bikers), anomalous pedestrian motion and behavior patterns (e.g., walking in the wrong direction or on the grass), as well as abnormal objects in the scene (e.g., left baggage or thrown objects). Even if the detection of anomalies were restricted to pedestrian walkways, the corresponding anomalies might have quite different characteristics, requiring the extraction of particular features from the video frames to represent and automatically classify them. For instance, from the appearance point of view, a pedestrian walking in the wrong direction behaves similarly to a pedestrian walking in the right one; however, their motion patterns differ significantly regarding direction, which may characterize an anomaly. On the other hand, unusual pedestrians, such as people in wheelchairs, present different appearance patterns when compared to regular pedestrians walking on walkways, even though the motion patterns are similar. This classification is significantly more difficult in crowded scenes, as they present changes in subject size, shape, boundaries, and occlusions. For this reason, pattern representation is an important issue to be considered when modeling anomaly detection problems. In this work, the term pattern representation refers to the means of representing patterns, which can be obtained by feature extraction (hand-crafted features) or by feature learning methods.

The use of an anomaly detector for real-world problems, independently of its approach, encompasses another very important issue: the compact representation of the normal class in the feature space.
Keeping in mind that it is possible to have more than one cluster representing the normal class, the overall normal concept must be as dense as possible, so that the classifier can better discriminate anomalies (XU et al., 2014; BODESHEIM; FREYTAG, 2013). For the particular case of images and videos, the appearance and motion features extracted by standard hand-crafted methods may be inefficient when applied directly to anomaly detection problems (FENG et al., 2017; XU et al., 2015). Consequently, the mapping performed by hand-crafted descriptors does not guarantee a dense representation of the normal class in the feature space. In this work, compact representation refers to the dense representation of the latent space.

Recently, several approaches have focused on pattern representation. The most common pattern representation methods are based on hand-crafted features, such as 3D spatio-temporal gradients, Histogram of Optical Flow (HOF) and Histogram of Oriented Gradients (HOG), or their combination. On the other hand, Deep Learning (DL) methods, such as Convolutional Neural Networks (CNNs) and Convolutional Autoencoders (CAEs), have also been studied for feature learning. The CNN is a supervised approach considered the state-of-the-art for image and video classification problems (KRIZHEVSKY et al., 2012; SZEGEDY et al., 2015). A CAE is an unsupervised approach based on Autoencoders (AEs), capable of capturing the 2D structure of image and video sequences (MASCI et al., 2011). It is organized in encoder and decoder layers, producing a latent (lower-dimensional) representation of the input data in the middle of its architecture. Since training a CAE does not require label information, it can be useful for modeling anomaly detection problems.

In fact, the use of CAEs for feature learning in videos is still underexplored in the recent literature (see, for instance, Hasan et al. (2016) and Ribeiro et al. (2018)). In general, works that employ AEs propose their use as a fuser of hand-crafted features (extracted from frames or patches), together with a classifier to discriminate anomalies. This work proposes a different approach: not only entire frames (and packages of frames and hand-crafted features) are used, but also the reconstruction errors, to discriminate anomalies in videos of different levels of complexity. The working hypothesis is that a CAE is able to learn relevant features from normal events in videos, and it is hypothesized that the reconstruction error of a frame can be used for devising an anomaly score, thus allowing CAEs to be used for anomaly detection tasks. As a matter of fact, humans are very competent at intuitively combining different features, such as motion and appearance, in order to interpret the meaning of a video sequence. In this sense, this work also addresses the question: does fusing hand-crafted features (e.g., the above-mentioned features) with the input data improve the classification performance of a CAE?
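The reconstruction-error scoring idea can be sketched as follows. This is a toy NumPy illustration, not the thesis implementation: the `reconstruct` function is a hypothetical stand-in for a trained CAE's forward pass, and the frames, sizes, and threshold are made-up values.

```python
import numpy as np

def anomaly_scores(frames, reconstruct):
    """Per-frame anomaly score from reconstruction error.

    frames: array of shape (n_frames, height, width), pixel values in [0, 1].
    reconstruct: stand-in for a trained CAE's forward pass (hypothetical).
    """
    errors = np.array([np.mean((f - reconstruct(f)) ** 2) for f in frames])
    # Min-max normalization turns raw errors into scores in [0, 1]
    span = errors.max() - errors.min()
    return (errors - errors.min()) / span if span > 0 else np.zeros_like(errors)

# Toy demonstration: the "autoencoder" below only reproduces the normal
# appearance it has "memorized", so an unusual frame yields a large error.
rng = np.random.default_rng(0)
normal = rng.uniform(0.4, 0.6, size=(9, 8, 8))   # frames close to gray 0.5
abnormal = np.ones((1, 8, 8))                     # a frame unlike the others
frames = np.concatenate([normal, abnormal])
reconstruct = lambda f: np.full_like(f, 0.5)      # "memorized" normal scene
scores = anomaly_scores(frames, reconstruct)
threshold = 0.5
print(scores.argmax())             # index of the abnormal frame
print((scores > threshold).sum())  # number of frames flagged as anomalous
```

The min-max normalization is only one possible choice; the point is that frames poorly reconstructed by a model trained on normal events receive high scores.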
It is also worth mentioning that, until this point, hand-crafted appearance and motion features are combined with entire frames using a CAE as a fuser. However, the CAE can also be used for learning both appearance and motion features automatically from raw input data, where frames are fed to the CAE considering a spatio-temporal approach. The hypothesis is that a CAE is not only able to learn representations of normal events in videos by combining hand-crafted features, but can also learn spatio-temporal features automatically from raw frames.

Another issue to be considered with respect to classification performance is video complexity. Although it is a difficult factor to evaluate objectively, humans can successfully interpret videos within a large range of complexity. However, DL methods, such as a CAE, may have their performance influenced by the underlying spatial complexity of the input data. Therefore, a means of estimating spatial video complexity is proposed, and the possible relationship between it and the performance of a CAE for detecting anomalies in videos is investigated.
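One common way to build a spatio-temporal input for a convolutional model is to stack consecutive grayscale frames into a single multi-channel volume with a sliding window. The sketch below is illustrative only; the window depth, stride, and shapes are assumptions, not the configuration used in this thesis.

```python
import numpy as np

def temporal_windows(frames, depth=5, stride=1):
    """Stack `depth` consecutive frames into one multi-channel input.

    frames: (n_frames, height, width) grayscale video.
    Returns: (n_windows, depth, height, width) spatio-temporal volumes,
    so a convolutional autoencoder can see motion as well as appearance.
    """
    n = frames.shape[0]
    starts = range(0, n - depth + 1, stride)
    return np.stack([frames[s:s + depth] for s in starts])

video = np.zeros((20, 32, 32), dtype=np.float32)  # dummy 20-frame clip
volumes = temporal_windows(video, depth=5, stride=1)
print(volumes.shape)  # (16, 5, 32, 32)
```

Each volume then plays the role of a single training example, with time along the channel axis.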

Regarding the compact representation, the idea is to jointly learn features and a dense representation of the latent space using a CAE, by reinforcing compactness and increasing the separability between normal data and anomalies. An extension of the ideas presented by Xie et al. (2016) is proposed, introducing a Convolutional Autoencoder with Compact Embedding (CAE-CE) as a feature learning method specifically suited for anomaly detection problems. The working hypothesis is that the normal concept can be composed of more than one cluster, and it is desired that these clusters be as dense as possible in order to obtain a well-separated representation. Thus, the main idea is to increase the compactness of the normal class during the training step so as to achieve a better separation between normal and abnormal objects in the test set. However, indefinitely increasing the cluster compactness may cause excessive compression: although this seems appropriate for well-separated representations, it may lead to model overfitting. Therefore, a stop criterion is proposed in order to find the most adequate stopping point for the training phase, specifically suited for anomaly detection problems.

Accordingly, the problem addressed in this work consists in modeling a normal concept for discriminating anomalies, based on a set of images or videos containing events considered normal. Notwithstanding, the main focus is to provide some theoretical and methodological contributions towards the definition of a data-driven normal context in video analysis, which can be promptly applied to real-world anomaly detection problems. In this sense, aiming at contributing to the modeling of anomaly detection in videos, this thesis has two specific focuses: feature learning and compact representation.

1.1. OBJECTIVES

The objective of this work is to propose methods to model normal concepts for anomaly detection in images and videos using a deep convolutional autoencoder.
The objective is subdivided into the following specific objectives, to clarify the approaches and contributions:

1. to devise methods for feature learning (aggregating hand-crafted spatial and temporal features, and learning spatio-temporal features automatically) in anomaly detection using a deep convolutional autoencoder;

2. to devise a method to compute an anomaly score for discriminating anomalies in videos;

3. to investigate the possible relationship between spatial video complexity and the performance of a convolutional autoencoder for anomaly detection;

4. to propose a new method for dense representation (enhancing the compactness of normal clusters in the feature space) based on a deep convolutional autoencoder, as well as a novel stop criterion based on sensitivity, specifically suited for anomaly detection problems;

5. to validate the proposed approaches on publicly available anomaly datasets;

6. to assess and compare the proposed approaches with the state-of-the-art, thus highlighting the main differences.

1.2. STRUCTURE OF THE THESIS

This thesis is organized as follows. Chapter 2 presents the theoretical background and some related work found in the recent literature. Chapter 3 describes the proposed methods in detail. Chapter 4 presents the computational experiments, their results, and a short discussion. Finally, Chapter 5 reports the general conclusions drawn and suggests future research directions.

2. THEORETICAL BACKGROUND AND RELATED WORK

This chapter presents the main theoretical background with respect to the problem addressed in this work. First, a brief review of computer vision (CV), behavioral analysis (BA), and anomaly detection is presented, contextualizing them. A key issue in anomaly detection is pattern representation, which can significantly affect the classification performance. In this work, Deep Learning (DL) methods are used for both feature learning and dense representation (compact representation) of the feature space (latent space). Subsection 2.4.2 regards feature learning with DL and Subsection 2.5.1 regards the Convolutional Autoencoder (CAE), since they support this work. Since a dense representation of the latent space is desirable for increasing the classification performance, this work introduces a contribution with respect to compact representation, later presented in Section 2.6. Another important issue is the spatial complexity of videos, which could influence the classification performance; thus, the Kolmogorov approach as a spatial complexity estimator is presented. Finally, a short summary of the work related to this thesis is presented in Table 1.

2.1. COMPUTER VISION AND BEHAVIORAL ANALYSIS

CV is the science that provides the means for machines to see and interpret the outside world (GRANLUND; KNUTSSON, 1995). CV can also be understood as a complement of biological vision, in which human and animal visual perception are studied so that their concepts can be applied to artificial vision systems (COX; DEAN, 2014). In short, it provides methods for interpreting images and videos, whose interpretation starts with the transformation of raw images and videos into structures that can semantically describe a context. Raw images are organized as matrices of pixels, whilst raw videos are composed of a sequence of images per second, usually named frames.

It is also worth noticing that, whilst there is a wide variety of methods, to date there is no generic model in artificial vision systems that could be used for any image understanding problem. In general, CV methods are proposed to accomplish specific vision tasks with limited performance, and can rarely be directly applied to other problems. This issue can be explained by our limited knowledge with respect to the mechanisms of visual perception in animals (NIXON; AGUADO, 2008; COX; DEAN, 2014). Typically, CV problems are solved by a composition of filters, detectors, descriptors, and classifiers (WEINLAND et al., 2011). Actually, a significant part of Computational Intelligence is closely related to CV problems; in particular, machine learning and pattern recognition methods have been used (WANG et al., 2013a; JOHN et al., 2015; CIRESAN et al., 2012; GUHA; WARD, 2012).

CV has been used in several areas of knowledge, such as medicine, industry, and autonomous navigation, among others. A subject of great interest is anomaly detection in video surveillance, for which, in recent years, many efforts have been made to detect abnormal behaviors efficiently (BERTINI et al., 2012; CHEN et al., 2015; CHENG et al., 2015; KIM; GRAUMAN, 2009; LI et al., 2015a, 2014; MEHRAN et al., 2009; YUAN et al., 2015; XU et al., 2015). However, anomaly detection is context-dependent and also depends on the structural knowledge of the scene. An interesting approach for tackling this issue is Behavioral Analysis (BA), which is related to the monitoring of individuals, objects, and their interactions (ZHU et al., 2014). BA is the science of behavior change introduced by Watson's cause-and-effect model and popularized by Skinner (1935) with the antecedent-behavior-consequence study (DIXON et al., 2012). Skinner's model was proposed in the 1930s,
and he was interested in measuring observable behavior to discover functional relationships between the environment and the behavior (DIXON et al., 2012; FARHOODY, 2012). BA concepts are useful in the monitoring and identification of objects (or people), especially in busy environments (for instance, crowded scenes), because this requires monitoring a large number of individuals and their relationships, as well as retaining structural information about the scenes. Recently, BA has been proposed to model problems in soft biometrics (PERLIN; LOPES, 2015), human behavior (LIAN et al., 2017), human action recognition (JAIN et al., 2017), and also anomaly detection (LI et al., 2014; ZHU et al., 2014; RIBEIRO et al., 2018).

2.2. ANOMALY DETECTION

As mentioned before, One-Class Classification (OCC) is an important concept for approaching problems of automatic video surveillance characterized by busy environments with large movement of people and objects, also known as crowded scenes. Since there are no universally accepted definitions for novelty detection and OCC (PIMENTEL et al., 2014), "anomaly detection" is adopted as a generic term throughout this work.

Anomaly detection has been proposed for a variety of purposes, including the detection of malicious activities, such as fraud (AKHILOMEN, 2013; PHUA et al., 2010), intrusion (SEN; CLARK, 2011; GARCIA-TEODORO et al., 2009; WU; BANZHAF, 2010) and terrorist activities (ROHN; EREZ, 2013). There is also detection of anomalies in bioinformatics (GEORGE et al., 2015; SUGIMOTO et al., 2012), weather forecasting (catastrophe detection) (OTSUKA et al., 2014; OHBA et al., 2015) and health systems management (SALEM et al., 2013; ORDONEZ et al., 2015). Regarding visual surveillance in videos, anomaly detection has attracted research due to the growing concern with safety in both public and private places (ANDRADE et al., 2006; MAHADEVAN et al., 2010; LI et al., 2014; HU et al., 2015).

Because anomaly detection is context-dependent, authors often model their approaches subjectively, according to the problem they wish to solve. Gonbadi et al. (2015) defined anomaly detection as a supervised approach to finding a set of patterns that deviate from an expected behavior. For Hu et al. (2015), anomaly detection is equivalent to predicting whether an event is normal or not, ignoring its specific category labels. Both Denning (1987) and Fuse and Kamiya (2017) define anomalies using statistical modeling methods, which can learn the normal state from regularly acquired data and automatically detect anomalies that differ from the normal model. For Li et al. (2014), anomalies are defined as low-probability events related to a probabilistic model of the normal concept. Since the definition of normal behavior is context-dependent, it may be influenced by the problem or by the methods to be used.

As mentioned before, in anomaly detection, known patterns are grouped according to the normal concept, whilst unknown patterns are classified as anomalies. Depending on the application domain, these patterns may also be known as abnormal events, outliers, exceptions, aberrations, surprises, peculiarities, or contaminant samples, among others. In particular, anomalies and abnormal events are the terms used preferably in the context of anomaly detection (TAX, 2001; CHANDOLA et al., 2009; BODESHEIM; FREYTAG, 2013; SCHEIRER et al., 2013; PIMENTEL et al., 2014;

AHMAD et al., 2017).

Mathematically, anomaly detection is usually formulated as an outlier detection problem, where anomalies are patterns that do not belong to the defined normal model (MAHADEVAN et al., 2010; LI et al., 2014). In the classical formulation, a statistical model P(x|X) is postulated for the measurement distribution X (under normal conditions), where x is the pattern under test. Hence, anomalies are defined as measurements whose probability of belonging to the model P is lower than a previously defined threshold. This formulation is equivalent to the statistical test of hypotheses, where the null hypothesis H_0 is rejected if P(x|X) < \varepsilon, with \varepsilon representing a threshold on the probability that the pattern belongs to the distribution X.

In this work, a similar formulation is adopted: measurements are taken from known patterns to model the normality context, and a threshold is used to separate new observations into normal and abnormal patterns. In the proposed formulation, a normalized score function NS_X(z) is postulated for the set of measurements X (normal model). Therefore, anomalies are defined as measurements whose normalized error is higher than the threshold \varepsilon. The binary classification decision is performed according to Equation 1:

f = \begin{cases} \text{normal}, & \text{if } NS_X(z) \leq \varepsilon \\ \text{abnormal}, & \text{if } NS_X(z) > \varepsilon, \end{cases}    (1)

where X is considered the normal model and z is the pattern under classification.

A didactic example showing abnormal events in a two-dimensional dataset is presented in Figure 1. In this example, the data have two normal regions, N_1 and N_2, which contain most of the observations. Observations that are distant from these regions, such as the objects o_1, o_2, and o_3, are considered anomalies. It is worth mentioning that this didactic example is limited to exemplifying the anomaly detection concepts.
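The decision rule of Equation 1 amounts to simple thresholding of the normalized score. A minimal sketch follows; the score values and the threshold are made up for illustration only.

```python
import numpy as np

def classify(scores, eps):
    """Binary decision of Equation 1: a pattern is abnormal when its
    normalized score NS_X(z) exceeds the threshold eps."""
    scores = np.asarray(scores)
    return np.where(scores <= eps, "normal", "abnormal")

# Illustrative normalized scores for five test patterns (made-up values)
ns = [0.05, 0.20, 0.92, 0.10, 0.75]
labels = classify(ns, eps=0.5)
print(list(labels))  # ['normal', 'normal', 'abnormal', 'normal', 'abnormal']
```

In practice, eps would be chosen on validation data, e.g. by sweeping it and inspecting an ROC curve.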
In fact, in real-world problems the scenarios are crowded scenes that involve high complexity, with many individuals and objects and their relationships. Due to the context-dependency, the diversity of behaviors, and the complex interactions between individuals, detecting anomalies in crowded scenes is a challenging task (ZHU et al., 2014). For this reason, anomaly detection in crowded scenes is a subject of great interest, also because it is related to information retrieval in busy environments.

Modeling normal behavior in crowded scenes may require several normal concepts. For instance, on an avenue where the traffic of trucks is restricted to some periods of the day, but the traffic of buses is not, two concepts would be necessary: one for the truck constraints and another for the buses and other vehicles.

Figure 1: Example of anomalies in a bi-dimensional didactic dataset.

Also, behaviors considered normal in a specific context (or scale) can be considered abnormal when transported to another context. For instance, bikers may be considered anomalies on park walkways, but not on streets. Hence, normal behaviors must be modeled in both multi-scale and multi-context settings, increasing the problem complexity. Another important issue regards the models' robustness, since the scenes involve objects that move independently, causing occlusions between themselves (CONG et al., 2013b; LI et al., 2014; ZHU et al., 2014).

Anomaly detection in crowded scenes can be subdivided into two main levels, according to what is focused on. First, detection focused on local events, with respect to the behavior of an object and the behavior of its neighbors: anomalies are detected by comparing the behavior of an object with that of its neighbors. A didactic example is shown in Figure 2 (a), which presents an abnormal appearance and motion pattern (red highlighted object) when compared to its neighbors (blue highlighted objects). Second, anomalies detected at the global level, considering the whole context, thus analyzing the behavior of each object with respect to everything else. Figure 2 (b) shows an example of this type of anomaly, which could be, for instance, people quickly evading the scene by running radially outwards in a panic situation. Here, anomalies are detected considering the observed panic situation with respect to the previously defined normality context (CONG et al., 2013b).

Video anomaly detection methods can be categorized according to the surveillance target, type of sensors, feature extraction (pattern representation) process, and modeling methods (SODEMANN et al., 2012).
Regarding the surveillance target, anomaly detection can be performed on traffic, individuals, crowds, and single or multiple objects. As for the types of sensors, visible-spectrum cameras are the most frequently used; the limitation of this type of sensor is the field of view and resolution of the camera (ADAM et al., 2008).

Figure 2: Example of local and global anomalies. (a) shows the different behavior of the red highlighted object with respect to its neighbors (blue highlighted objects). (b) shows the global behavior of all objects in the scene considering all context changes.

Methods for feature extraction depend on the surveillance target. There are two main groups: those that first perform target tracking by analyzing individual moving objects in the scene (extracting complex motion and appearance features), and those that extract features directly from the image at the pixel level (PENNISI et al., 2016). Feature extraction will be detailed in Section 2.4.

Regarding the modeling methods, the most common approach for anomaly detection is based on OCC, and there are three main approaches to accomplish this task (TAX, 2001). The first is based on models that estimate the probability density function of the input patterns (density methods); from the probability density function, it is possible to establish whether a given input pattern is abnormal or not, based on its probability value. The second approach is related to reconstruction methods that use clustering to find out whether a given input pattern is an anomaly or not, based on the distance from the unknown input pattern to clusters previously defined in the training process. The last approach comprises models that impose boundaries upon the training dataset, assuming an unknown distribution; a boundary optimization problem is then solved in order to represent the data. The most popular methods that use this approach are the One-Class Support Vector Machine (OC-SVM) classifier and the Support Vector Data Description (SVDD), which can be identical under certain conditions (TAX, 2001). In the work of Xu et al. (2015), the OC-SVM was used.
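As a rough illustration of this boundary-based approach, scikit-learn's OneClassSVM (the Schölkopf formulation, which, as noted above, can be identical to the SVDD under certain conditions) can be fit on normal data only. The data and parameter values below are illustrative assumptions, not settings used in this thesis.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Train only on "normal" feature vectors clustered around the origin
X_train = rng.normal(0.0, 0.3, size=(200, 2))

# nu bounds the fraction of training points treated as outliers;
# gamma plays the role of the RBF kernel width parameter.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma=2.0).fit(X_train)

X_test = np.array([[0.0, 0.1],   # near the normal cluster
                   [3.0, 3.0]])  # far from anything seen in training
print(clf.predict(X_test))  # +1 = normal, -1 = anomaly
```

`decision_function` can be used instead of `predict` when a continuous anomaly score, rather than a hard label, is needed.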
Also, a similar one-class approach, named the space-time Markov Random Field model, was devised by Kim and Grauman (2009). There are some approaches that include both the feature extraction and the modeling method in a single step. In the work of Lu et al. (2013), a sparse combination is proposed, turning the original problem into a few costless, small-scale least squares optimization problems.

Following a similar idea, a sparse reconstruction and a novel dictionary selection are presented in the work of Cong et al. (2011). In the work of Xiao et al. (2015a), a probability model that takes the spatial and temporal contextual information into account is learned; the framework is unsupervised, without the need to label the training data to perform the anomaly detection task. It is also possible to include some a priori knowledge about the application. However, by using a separate step for the classification process, it is necessary to select the most appropriate classifier, in addition to specifying the descriptors in the feature extraction step. These issues, by themselves, are hard to address for some applications.

2.3. ONE-CLASS SUPPORT VECTOR MACHINE

In this work, the formulation proposed by Tax (2001) for the OC-SVM, i.e., the SVDD, is used. For a given input class with N examples (x_1, ..., x_N), x_i \in \mathbb{R}^d, it is assumed that there is a closed surface (hypersphere) that surrounds it. The hypersphere is characterized by its center a and radius R. In the original formulation, the SVDD model contains two terms: the first term (R^2) is related to the structural risk, and the second term penalizes objects located at a large distance from the edge of the hypersphere, keeping the trade-off between empirical and structural risks. The minimization problem can be defined using Equation 2:

\varepsilon(R, a, \xi) = R^2 + C_1 \sum_i \xi_i,    (2)

grouping almost all patterns within the hypersphere, according to Equation 3:

\|x_i - a\|_2^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0, \ \forall i,    (3)

in which C_1 gives the trade-off between the volume of the description and the errors, represented by the distances \xi_i between outliers and the edge of the hypersphere. The \|\cdot\|_2^2 represents the Sum of Squared Differences (SSD) in the l_2-norm. This optimization problem is usually solved through its Lagrangian dual problem, which consists of maximizing Equation 4:

L = \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j),    (4)

with respect to \alpha_i, subject to the constraint postulated in Equation 5:

0 \leq \alpha_i \leq C_1, \ \forall i.    (5)

The main feature of the SVDD model is the representation of the input data in a low-dimensional space without the need for a large additional computational effort (TAX, 2001). This

representation allows more flexible descriptors of the input data, following the same general idea of Support Vector Machines (VAPNIK, 1998). The RBF kernel (used in this work) is given by Equation 6:

K(x_i, x_j) = \exp\left( \frac{-\|x_i - x_j\|_2^2}{\upsilon^2} \right),    (6)

where \upsilon represents the kernel parameter (width). In this representation, a new pattern z is classified as an anomaly according to Equation 7:

-AS = \sum_i \alpha_i \exp\left( \frac{-\|z - x_i\|_2^2}{\upsilon^2} \right) - \frac{1}{2} \left[ 1 + \sum_{i,j} \alpha_i \alpha_j \exp\left( \frac{-\|x_i - x_j\|_2^2}{\upsilon^2} \right) - R^2 \right] < 0,    (7)

where AS is the Anomaly Score of the pattern z. With the RBF kernel, the formulation of the SVDD model is equivalent to the OC-SVM proposed in Schölkopf and Smola (2001), as discussed in Tax (2001).

2.4. PATTERN REPRESENTATION

A key issue for anomaly detection methods is pattern representation, and its choice is crucial, since it may significantly affect the classification performance. The extraction of relevant features from the raw data (i.e., images or videos) is important to enable a good classification of different types of anomalies. Several types of representations have been proposed to address this issue, but they typically present limitations: for instance, they focus on modeling the internal information of the patterns whilst the context is ignored, they are high-dimensional, and they are sensitive to noise or other variations. In an attempt to classify pattern representation approaches, Hu et al. (2015) suggested that methods for representing patterns can be grouped into four main categories: trajectory-based, spatio-temporal-interest-point-based, foreground-blob-based, and volume-based. However, the literature shows a tendency to use methods that can be grouped into one or more categories, and they usually have in common the use of hand-crafted features. Therefore, this work summarizes the pattern representation methods for anomaly detection into three groups: statistical-based, spatio-temporal-based, and based on DL. For statistical-based approaches (i.e., methods based on a statistical model postulated over the normal context, where anomalies are classified based on probabilities), there are: the Gaussian mixture model (YU, 2012), mixture principal component analysis (TIPPING; BISHOP, 1999), the hidden

Markov model (DORJ; ALTANGEREL, 2013; KRATZ; NISHINO, 2009; WANG et al., 2012), the Markov random field (BENEZETH et al., 2009), the sticky hierarchical Dirichlet process hidden Markov model (KAMIYA; FUSE, 2015; TANIGUCHI et al., 2011) and latent Dirichlet allocation (BLEI et al., 2003). This work focuses on spatio-temporal methods and (mainly) those based on DL, explained in Subsections 2.4.1 and 2.4.2, respectively.

2.4.1. REPRESENTATION BASED ON SPATIO-TEMPORAL FEATURES

The most common approach to pattern representation is the use of spatio-temporal-based methods. Such features are based on standard CV techniques and their variants, such as the Histogram of Oriented Gradients (HOG) (CHENG et al., 2015), the Histogram of Optical Flow (HOF) (WANG; SNOUSSI, 2014), multi-scale HOF (CONG et al., 2013b; ZHU et al., 2014), textures of optical flow (RYAN et al., 2011), tracking-based features (XIE; GUAN, 2015), the social force model (MEHRAN et al., 2009; RAGHAVENDRA et al., 2011; ZHANG et al., 2012), dense trajectories (WANG et al., 2013b), spatio-temporal textures (WANG; XU, 2015), sparse reconstruction cost (CONG et al., 2013a; ZHU et al., 2014) and dynamic textures (MAHADEVAN et al., 2010; LI et al., 2014).

Saligrama and Chen (2012) proposed a statistical approach focused on local spatio-temporal signatures based on 3D patches. Jiang et al. (2011) proposed extracting three different levels of spatio-temporal contexts in order to perform the tracking of all moving objects in a video. Brun et al. (2014) proposed a different approach, using a string kernel and a tracking-based method for evaluating the similarity between trajectories and defining a novelty score for different zones in a scene. Similarly, Yang et al. (2013) used trajectory segmentation to perform the tracking process and multi-instance learning to detect abnormal trajectories.
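To make the flavor of such descriptors concrete, the following is a deliberately simplified, NumPy-only sketch of the idea behind HOG: a histogram of gradient orientations weighted by gradient magnitude. A real HOG implementation additionally uses cells, block normalization, and orientation interpolation; nothing here corresponds to the exact descriptors cited above.

```python
import numpy as np

def orientation_histogram(image, bins=9):
    """Simplified HOG-style descriptor over a whole image:
    histogram of gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))   # gradients along rows, cols
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in the classic HOG
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0, 180), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A vertical step edge produces a horizontal gradient, so the orientation
# mass falls into the first bin (0-20 degrees).
image = np.zeros((16, 16))
image[:, 8:] = 1.0
h = orientation_histogram(image)
print(h.argmax())  # 0
```

HOF follows the same histogram idea, but over optical-flow vectors between consecutive frames instead of spatial image gradients.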
In general, the main limitation of tracking methods in complex and crowded scenes is the presence of occluded objects, which degrades the anomaly detection performance (XU et al., 2015). On the other hand, features based on appearance and motion are more robust to occlusion problems in videos (LI et al., 2014; XIAO et al., 2015a). The most common features are built using 3D spatio-temporal gradients, HOG and HOF. In the work of Mehran et al. (2009), a social force model was proposed in such a way that regions with anomalies are found in the abnormal frames by means of interaction forces and a bag-of-words approach. In Pennisi et al. (2016), a combination of visual feature extraction and image segmentation is presented, and the method works without the need for a training phase. In Kaltsa et al. (2015), histograms of oriented swarms are applied, together with HOG, to capture the dynamics of crowded environments. Such an appearance and motion model increases the detection accuracy of local anomalies and has a lower computational cost, compared to other state-of-the-art methods. Other spatio-temporal statistical measures to characterize the overall behavior of the scene are presented in the works of Xiao et al. (2015a), Kratz and Nishino (2009), Cheng et al. (2015), Cong et al. (2013b), and Wang and Xu (2016). However, hand-crafted descriptors usually require that some a priori knowledge be incorporated in the training step; this issue is discussed further in Subsection 2.4.2.

2.4.2. REPRESENTATION BASED ON DEEP LEARNING

DL is a subset of machine learning methods based on learning data representations, which are composed of both linear and non-linear transformations aiming to produce more abstract and useful representations (BENGIO et al., 2013). DL is also known as hierarchical learning, or deep machine learning, whose architecture is complex, composed of multiple layers and high-level abstraction algorithms. DL algorithms are based on distributed representations, and the idea is that the final information is generated from exhaustive iterations considering many units. It is assumed that these units are organized so that they correspond to different abstraction and composition levels. By combining different layer compositions, it is possible to obtain different levels of abstraction (BENGIO et al., 2013). DL methods have been investigated for CV problems, and they turn out to be very effective for visual recognition tasks. Several related works have appeared recently, and they are categorized according to the basic method they are derived from, that is: Convolutional Neural Network (CNN) (SZEGEDY et al., 2015), Autoencoder (AE) (VINCENT et al., 2008), restricted Boltzmann machines (SALAKHUTDINOV; HINTON, 2009), and sparse coding (GAO et al., 2010).
It is noticed that, in the context of anomaly detection, DL methods are still in early stages of development. Nowadays, there are several variants of DL methods, and new approaches have been emerging rapidly. Examples of such approaches are deep neural networks (CIRESAN et al., 2012), deep convolutional neural networks (SERMANET et al., 2013), deep Boltzmann machines (FISCHER; IGEL, 2012), Deep Belief Networks (DBNs) (HINTON et al., 2006), deep autoencoders (BENGIO et al., 2013), deep stacking networks (DENG et al., 2013) and tensor deep stacking networks (HUTCHINSON et al., 2013).

In the work of Sun et al. (2017), an unsupervised online Growing Neural Gas (GNG) was trained to learn motion features, and a Gaussian distribution was used to discriminate human anomalous events. A similar approach was presented by Erfani et al. (2016), where an unsupervised DBN was trained to learn a set of features in a relatively low-dimensional space, and a one-class classifier was trained with the features learned by the DBN. In general, one-class classifiers can be inefficient for modeling decision surfaces in large and high-dimensional datasets. However, by combining the one-class classifier with a DBN, it is possible to reduce redundant features and improve the performance for standard OCC datasets.

In the work of Xu et al. (2015), an appearance and motion Stacked Denoising Autoencoder (SDAE) was proposed to extract features from video surveillance datasets. Based on the learned features, multiple OC-SVM models were used to predict the anomaly scores and classify each frame. Despite the fact that AEs are very efficient methods for given applications, they cannot capture the 2D structure in image and video sequences, and the CAE architecture can be more appropriate (MASCI et al., 2011). A similar procedure was presented in the work of Hasan et al. (2016), in which two AEs (SDAE and CAE) were used to learn regular motion patterns from video sequences. The main advantage of this approach is the possibility of capturing regularities (degrees of normality) from multiple datasets jointly. Nevertheless, the anomalies may be characterized by motion and appearance features, thus requiring that the input of the CAE include such features. However, as pointed out by Perlin and Lopes (2015), those features, called hand-crafted descriptors, require that some knowledge be incorporated during the training step. As a result, some features may perform well in particular domains and drive classifiers to poor classification accuracy in others, even when combining motion and appearance features (XU et al., 2015). In this context, DL approaches (LEE et al., 2009; MASCI et al., 2011; SZEGEDY et al., 2015; HASAN et al., 2016; ERFANI et al., 2016; RIBEIRO et al., 2018) can be a good option for feature learning, because these methods can learn relevant features automatically from both raw images and hand-crafted features.
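The two-stage pipeline described above (deep features, then a one-class classifier) can be sketched as follows. This is a hypothetical sketch assuming scikit-learn is available; the random vectors merely stand in for features that an SDAE or DBN would extract from normal frames:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for features extracted by a trained SDAE/DBN on normal frames
normal_features = rng.normal(loc=0.0, scale=1.0, size=(200, 16))

# Train the one-class model on the normal class only
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
ocsvm.fit(normal_features)

# Score unseen frames: lower decision values indicate anomalies
test_normal = rng.normal(0.0, 1.0, size=(1, 16))
test_anomaly = np.full((1, 16), 8.0)  # far from the normal feature cloud
score_normal = ocsvm.decision_function(test_normal)[0]
score_anomaly = ocsvm.decision_function(test_anomaly)[0]
```

In the multi-model setting of Xu et al. (2015), one such OC-SVM would be fitted per feature stream (appearance, motion, joint) and the scores fused per frame.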
Focusing on feature learning, DL methods such as CNNs have recently achieved state-of-the-art performance for object recognition (KRIZHEVSKY et al., 2012; SZEGEDY et al., 2015). A possible reason for such high performance is that they can learn the feature extractor and the classifier automatically, at the same time. The former is accomplished by many convolution filters at successive layers, and the latter by adjusting the weights of the connections between neurons. This is done by minimizing the training error, considering that the class labels are given (KRIZHEVSKY et al., 2012; SZEGEDY et al., 2015). Such a characteristic can improve the inter-class separation, since both classifier and feature extractor are optimized to increase the overall accuracy. Thus, when there is a large amount of samples available for training, CNNs are able to achieve superior discriminatory power for image representation when compared to hand-crafted image descriptors (PERLIN; LOPES, 2015; HASAN et al., 2016).
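The convolution filters mentioned above can be illustrated with a minimal single-channel example. This is a plain numpy sketch with a fixed, hand-written edge filter; in a CNN, the filter weights themselves are learned by backpropagation:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' 2D cross-correlation, the basic
    operation performed by one filter in a CNN layer."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responds strongly at an intensity step
image = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
response = conv2d_valid(image, sobel_x)
```

Stacking many such filters per layer, interleaved with non-linearities and pooling, yields the hierarchy of increasingly abstract feature maps that gives CNNs their discriminatory power.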

However, CNNs are especially designed for supervised classification problems and are not directly applicable to anomaly detection tasks, where only the normal class is known. To overcome this issue, AEs can be an interesting option for feature learning in OCC problems, because they can be trained using only the normal class, and both the Reconstruction Error (RE) and the bottleneck (latent representation) can be used to compute classification scores. The AE model was proposed by Rumelhart et al. (1986) and later popularized by Vincent et al. (2010) with the SDAE, as well as by Krizhevsky and Hinton (2011). AEs were initially used in the image retrieval context but, very recently, their application for video anomaly detection has emerged (HASAN et al., 2016; XU et al., 2015). However, AEs are not capable of capturing the 2D structure in image and video sequences, because the input data is a 1D vector. To cope with this issue, the CAE architecture seems to be more appropriate (MASCI et al., 2011).

2.5. AUTOENCODER

The AE, also called autoassociator in the literature, has been studied for decades (BOURLARD; KAMP, 1988; HINTON; ZEMEL, 1994; VINCENT et al., 2008; HASAN et al., 2016). It is a fully connected one-hidden-layer neural network devised to learn from unlabeled data. The idea is that the AE is trained to reconstruct the input pattern at the output of the network. Internally, an AE has a hidden layer h that compresses the input data to represent it in a latent representation space. The latent representation aims at exploiting the closeness of input patterns, where a large number of inputs can be aggregated in a model to represent an underlying concept. The latent representation is useful to reduce the dimensionality, making it easier to understand the data.

Formally, the AE is composed of two parts: an encoder function h = f_Θ(x) and a decoder function y = f_Θ′(h). Thus, the AE takes an input x ∈ R^d (d ∈ N*, d > 1) and first maps it into the latent representation (hidden layer) h ∈ R^d′ (d′ ∈ N*, d′ > 1) using the mapping function h = f_Θ(x) = σ(Wx + b). The mapping function is parameterized by Θ = {W, b}, where W ∈ R^(d′×d) is the set of weights between neurons and b ∈ R^d′ is the bias vector. Both weights and biases are learned during the training process. For reconstructing the input, a reverse mapping y = f_Θ′(h) = σ(W′h + b′), parameterized by Θ′ = {W′, b′} with W′ ∈ R^(d×d′) and b′ ∈ R^d, is performed to reconstruct a vector y ∈ R^d (VINCENT et al., 2010; MASCI et al., 2011; GOODFELLOW et al., 2016). An example of AE architecture is shown in Figure 3.
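The encoder and decoder mappings above translate directly into code. This is a toy sketch with random, untrained weights to show the shapes and the RE computation; in practice W, b, W′, b′ are learned by minimizing the reconstruction error over the normal training data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_latent = 8, 3  # input dim d and latent dim d' (d' < d: undercomplete AE)
rng = np.random.default_rng(42)
W = rng.normal(scale=0.1, size=(d_latent, d))      # encoder weights W
b = np.zeros(d_latent)                             # encoder bias b
W_dec = rng.normal(scale=0.1, size=(d, d_latent))  # decoder weights W'
b_dec = np.zeros(d)                                # decoder bias b'

x = rng.random(d)                   # input pattern x in R^d
h = sigmoid(W @ x + b)              # encoder: h = f_Θ(x) = σ(Wx + b)
y = sigmoid(W_dec @ h + b_dec)      # decoder: y = f_Θ'(h) = σ(W'h + b')
reconstruction_error = np.sum((x - y) ** 2)  # RE, usable as an anomaly score
```

After training on the normal class only, frames whose RE exceeds a chosen threshold are flagged as anomalous, which is precisely why the RE serves as a classification score in OCC settings.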
