
Pós-Graduação em Ciência da Computação

Optimization of a Face Detection Algorithm for Real-time Mobile Phone Applications

Vítor Schwambach
Dissertação de Mestrado

Universidade Federal de Pernambuco
Recife, 6 de fevereiro de 2009


Universidade Federal de Pernambuco
Centro de Informática

Vítor Schwambach

Optimization of a Face Detection Algorithm for Real-time Mobile Phone Applications

Este trabalho foi apresentado à Pós-Graduação em Ciência da Computação do Centro de Informática da Universidade Federal de Pernambuco como requisito parcial para obtenção do grau de Mestre em Ciência da Computação.

Orientadora: Edna Barros

Recife, 6 de fevereiro de 2009

Schwambach, Vítor
    Optimization of a face detection algorithm for real-time mobile phone applications / Vítor Schwambach. - Recife: O Autor, 2009. xiv, 55 p. : il., fig., tab.
    Dissertação (mestrado) - Universidade Federal de Pernambuco. CIn. Ciência da Computação, 2009.
    Inclui bibliografia.
    1. Sistemas embarcados. 2. Processamento de imagem. 3. Ciência da Computação. I. Título.
    005.256    CDD (22. ed.)    MEI2010-055

Resumo

Desde equipamentos de vigilância por vídeo a câmeras digitais e telefones celulares, a detecção de rostos é uma funcionalidade que está rapidamente ganhando peso no projeto de interfaces de usuário mais inteligentes e tornando a interação homem-máquina cada vez mais natural e intuitiva. Com isto em mente, fabricantes de chips estão embarcando esta tecnologia na sua nova geração de processadores de sinal de imagem (ISP) desenvolvidos especificamente para uso em aparelhos celulares.

O foco deste trabalho foi analisar um algoritmo para detecção de rostos para suportar a definição da arquitetura mais adequada a ser usada na solução final. Um algoritmo inicial baseado na técnica de Cascata de Características Simples foi usado como base para este trabalho. O algoritmo inicial, como especificado, leva quase quarenta segundos para processar um único quadro de imagem no processador alvo, tempo este que inviabilizaria o uso desta solução. Focando na implementação de um novo ISP, o algoritmo foi completamente reescrito, otimizado e propriamente mapeado na plataforma alvo, ao ponto onde um fator de aceleração de 167x foi atingido e uma imagem de pior caso agora leva menos de 250 milissegundos para ser processada. Este número é ainda mais baixo se for considerada a média em um conjunto maior de imagens ou um vídeo, caindo para cerca de 100 milissegundos por quadro de imagem processado.

Não obstante, performance não foi o único alvo; também a quantidade de memória necessária foi dramaticamente reduzida. Isto tem um impacto direto na área de silício requerida pelo circuito e conseqüentemente menores custos de produção e consumo de potência, fatores críticos em um sistema para aplicações móveis. É importante ressaltar que a qualidade não foi deixada de lado e, em todas as otimizações realizadas, tomou-se o cuidado de verificar que a qualidade de detecção não tinha sido impactada.

Este documento apresenta a pesquisa feita e os resultados obtidos. Começa por uma breve introdução ao assunto de Visão Computacional e aos desafios de projetar uma solução de detecção de rostos. Após esta introdução, o algoritmo que serviu como base para este trabalho é apresentado juntamente com as otimizações mais relevantes ao nível algorítmico para melhorar a performance. Na seqüência, instruções customizadas desenvolvidas para acelerar a execução do algoritmo na solução final são apresentadas e discutidas.

Palavras-chave: Detecção de rostos, processamento de imagens, telefones celulares.


Abstract

From video surveillance equipment to digital cameras and cellular phones, face detection is a feature that is rapidly gaining ground in the design of intelligent user interfaces, making human-machine interaction ever more natural and intuitive. With this in mind, chip makers are embedding this technology directly into their new generation of image signal processors (ISPs) designed specifically for mobile phone applications.

The focus of this work was to analyze an algorithm for face detection in order to help define the best architecture to be used in the final solution. An initial algorithm based on the technique of Cascade of Simple Features was used as the basis for this work. The original algorithm as specified would take almost forty seconds to process a single image frame on the target processor, a time that would render the solution impractical. Focusing on the implementation of a new ISP device, the algorithm was completely rewritten, optimized and properly mapped to the target platform, to the point where an acceleration factor of 167x was achieved and a worst-case image now takes less than 250 milliseconds to be processed. That number is much lower if the average over a large set of images is considered, coming down to about 100 milliseconds per processed image.

Performance was not the only target, however: the amount of memory needed was also dramatically reduced. That has a direct impact on the silicon area required by the circuit and consequently on manufacturing costs and power consumption, factors that are critical in a system targeted at mobile applications. It is important to note that quality was never set aside; in all optimizations performed, care was taken to verify that the detection quality was not affected.

This document presents the research done and the results obtained. It starts with a brief introduction to the subject of Computer Vision and to the challenges of designing a face detection solution. After this introduction, the algorithm that served as the basis for this work is presented, along with some of the more relevant optimizations performed at the algorithm level to improve performance. Next, the customized instructions developed to accelerate algorithm execution in the final solution are described. Finally, the results and conclusions drawn from the optimization phases are presented and discussed.

Keywords: Face detection, image processing, mobile phones.

Contents

1 Introduction
    1.1 Challenges
    1.2 Problem Statement
    1.3 Objectives
    1.4 Proposal
    1.5 Structure of the Document
2 Background
    2.1 Basic Imaging Concepts
    2.2 Computer Vision
    2.3 Face Detection Applications
3 Related Work
    3.1 Geometry Based Detection
        3.1.1 Linear Features
        3.1.2 Component Analysis
    3.2 Template Based Detection
        3.2.1 Skin-tone
        3.2.2 Cascade of Simple Features
        3.2.3 Neural Network
    3.3 Embedded Software Optimization
        3.3.1 Source Code Optimization Methodology
        3.3.2 Instruction Set Extensions Design Flow
4 Detection Algorithm
    4.1 Algorithm Description
        4.1.1 Integral Image
        4.1.2 Simple Haar-like Features and Cascaded Classifier
        4.1.3 Detector Scanning
        4.1.4 Handling Face Sizes and Distances
        4.1.5 Handling Face Rotations
        4.1.6 Illumination Normalization
        4.1.7 Post-processing
5 Algorithm Optimization
    5.1 Optimizations Performed
        5.1.1 Statistical Analysis
        5.1.2 Optimization of Divisions
        5.1.3 Optimization of Square Root Operation
        5.1.4 Integral Image on-the-fly
        5.1.5 Rescan Avoidance
        5.1.6 Skin-tone Mask
        5.1.7 Multi-pass Detection
6 Cascade and Statistics Acceleration
    6.1 Cascade Custom Instructions
        6.1.1 XBLOCK BADDR
        6.1.2 XBLOCK VSTEP
        6.1.3 XFEAT CALC
        6.1.4 XNORM THRESH
        6.1.5 XTHRESH
        6.1.6 XTHRESH SIGMA
        6.1.7 XEXIT STATUS
    6.2 Statistics Custom Instructions
        6.2.1 XVAR
        6.2.2 XSTDDEV FPNORM
        6.2.3 XCHECK STAT
    6.3 Verification
        6.3.1 Functional Verification
        6.3.2 Architecture Verification Tests
7 Results
8 Conclusion
    8.1 Difficulties
    8.2 Opportunities of Future Work
References

List of Figures

1.1 Samsung's Anycall SCH-W480 phone with integrated face detection[6]
2.1 Imaging Areas and Related Fields
2.2 Smile detection example
3.1 Integral Projections on vertical and horizontal axis
3.2 Component Analysis
3.3 Component Analysis results using Support Vector Machines
3.4 Component Analysis color map
3.5 Component Analysis results using color information
3.6 Skin-tone Classifier
3.7 Neural Network Classifier
4.1 Integral image concept
4.2 Pixel sum calculation from integral image
4.3 Haar-like feature examples
4.4 Cascade execution flow
4.5 First features selected by AdaBoost
4.6 Final feature types defined for this work
4.7 Detector scanning
4.8 Face sizes and impact on detection
4.9 Scaling the image and applying the detector
4.10 Examples of rotated faces
5.1 Face with false positive inside
5.2 Images skin-tone filtered
5.3 Example of multi-pass detection
6.1 Application flow for verification of defined instructions
6.2 Example of an Architecture Verification Pattern

List of Tables

7.1 Results for Detection Rate and Feature Count
7.2 Results for Number of Cycles in Worst-case Image
7.3 Silicon Area Estimations in 65nm process

Chapter 1. Introduction

Every day, new mobile phones are released with a multitude of new features. In a competitive market such as the phone market, with many heavyweight global players and billions of dollars involved, if a company wants to enter or even stay in the market it must design innovative products that will 'stand out in the crowd'. Designers of such systems are constantly trying to improve them, focusing on adding ever more functionalities and improving the user experience with the product.

Surveys in Japan showed that already in 2005 more than 60% of the phones sold there had embedded cameras[27][18], and in 2008 that share surpassed 70% of the mobile phones worldwide[15]. Furthermore, another survey performed by Hewlett-Packard Labs and Microsoft showed that cost, reliability and interface complexity were the key factors determining how a user interacts with his phone[23]. To engage the user in a better experience, companies need to improve the user interface while keeping high reliability and low cost.

A way to make this increasing number of functionalities accessible to the end user and simplify their interaction with the system is to add intelligence to the user interface, and one such technology that is proving to be captivating to the public and improving the user experience is face detection. Although previously impractical due to the high computational needs of a reasonable face detection system, recent advances in the processing capacity of mobile devices, associated with novel approaches in computer vision, have made the implementation of such features in a mobile phone possible. Samsung has leaped ahead and already released a phone with such capabilities this year[6], as can be seen in Figure 1.1. For other companies, the challenge is still there. It is not simple to design a face detection system that has a good detection rate under the constraints on area, cost and power consumption of ICs for the mobile industry. A great deal of research and empirical testing is necessary to achieve a good compromise among detection quality, resources and performance.

Figure 1.1: Samsung's Anycall SCH-W480 phone with integrated face detection[6].

1.1 Challenges

An object detection algorithm deals with detecting instances of semantic objects of a certain class in digital images or video[28]. A classifier will identify characteristics common to a given class of objects and build a model that represents this class of objects. It will then check areas of an image or a video against this model to evaluate whether an area possibly holds an object of the target class. The more characteristics the objects in a given class share, and likewise the fewer characteristics they have in common with other object classes or the background, the easier the classifier's job is.

A few characteristics make the job of detecting faces on mobile phones a rather particular challenge. First of all, unlike other computer vision systems that are in use in photo booths or even in industrial applications, where there is a certain control over the environment, not much information about the operating environment of the phone can be assumed by the designer. In a photo booth like those commonly found at train stations and airports, the subject is always at a specific distance from the camera, centrally positioned, in an environment with controlled lighting and a clean background that does not interfere with the detection process. The subjects of a picture on a mobile phone can be near or far, centered or not, the lighting can have a cool tone or a warm tone[38], and the background is completely unpredictable.

Furthermore, human faces have a high degree of variability that complicates the task of finding common features[13]. There are different sizes from kids to grown-ups, different shapes, different skin colors, people with short hair, long hair, moustaches, beards, accessories such as hats, glasses, earrings, piercings, tattoos, and all of that represents a challenge to the classifier, which must try to correctly identify all these types of faces. Facial expressions and poses (frontal, inclined, profile) also add to the variability and contribute to the complexity of the problem.

There are a few known algorithms for face detection (presented in Chapter 3), each with its own characteristics, that try to solve this problem. However, there is a key aspect that is common to all of them: they are always processing-intensive tasks that require a great deal of computing power.

1.2 Problem Statement

The problem tackled in the course of this work is that of analyzing, optimizing and mapping an algorithm for face detection onto an IP-core¹ to be integrated into a System-on-Chip² platform to be used in a new generation of ISPs (Image Signal Processors).

¹ An IP-core is a logic block that implements a given functionality and has a well-defined interface that allows for easy reuse in Systems-on-Chip.
² A System-on-Chip consists of assembling various elements of a computer system, such as processor, bus, memory and peripherals, and implementing them all in a single silicon die[3].

1.3 Objectives

The goals set forth at the beginning of the work presented here were:

• Support the architecture definition of the face detection IP-core;
• Detection is to be done in less than 250 ms per frame for a QVGA (320x240 pixels) image;
• The final area of the IP-core should be below 1 mm² in a 65 nm process;
• The detection rate should not be negatively impacted by the optimizations.

The main purpose of the work was to support the definition of the architecture to be used in a solution for face detection in a new generation of ISPs. Some design choices, such as the best way to execute the classifier cascades, the usage of pre-processing filters, and the definition of the memory structures, scaler and integral engine, had not yet been made. This work was therefore done to validate that the proposed architecture would fulfill the necessary performance and quality requirements.

1.4 Proposal

In order to achieve the goals set forth in this work, a defined methodology and workflow was followed. First of all, an external team developed the initial algorithm based on the work of Viola and Jones[39] using a Cascaded Classifier. This code was developed in C++ using 64-bit data types and was therefore unsuitable for real-time implementation on a 32-bit processor. The proposal of this work was to first thoroughly study the algorithms presented here and the main characteristics that impact detection rate and performance. Based on this, various tests were done with the algorithm to find out which combination of algorithm optimizations worked best. Once the algorithm reached a stable version, focus was redirected towards migrating the code to the embedded platform. Profiling was used to detect the functions that consumed the most resources and, whenever

possible, these functions were optimized. Some operations that could be pipelined, like image scaling and integral image calculation, were moved to dedicated hardware so that they could be done in parallel with the processing of the cascades. A final step consisted of using profiling information to merge instructions often executed in the same order, in order to save cycles. The 32-bit RISC processor was customized to add these new instructions, which further increased performance, reaching the targets set at the start of the project.

1.5 Structure of the Document

In the next chapter, a brief introduction to computer vision systems will be presented. Its main areas are described and the distinction between image processing and computer vision is made. Further on, the concepts of face localization, detection, recognition and tracking are explained.

In the subsequent chapter, the approach used in the algorithm that served as the basis for this work is presented in detail. The first phase of the work was to analyze this algorithm, understand the basic principles behind its logic and optimize it before heading towards any hardware implementation related aspect. Some of the more interesting optimizations tried are presented.

In the sequence, the hardware engines developed to accelerate algorithm execution in the final solution are listed. One of these hardware engines is in the form of a processor instruction set architecture extension and, since it was the most significant, that section will focus on explaining its purpose, characteristics and development process.

After explaining all implementation related details and optimizations performed, a chapter is dedicated to presenting and discussing the results obtained. A comparison is made among some of the key milestones in project development and some comments are given.

To wrap up, the objectives established at the start are reviewed and the conclusions drawn from the project are listed. Final comments regarding difficulties found along the way are made and a few ideas on identified opportunities for future improvement are proposed.

Chapter 2. Background

Different readers may have varying degrees of expertise in the field, and it is thus necessary to provide a brief explanation of basic image processing terms and notions. With this in mind, the objective of this chapter is to expose the concepts used throughout this document in an easy and straightforward way, to allow a better understanding of the subsequent chapters. First, the general ideas of object localization, detection, recognition and tracking will be presented and the distinction among these concepts will be made. In the context of this document, object detection is of primary importance and will be explained in detail in the next chapter. While some of these concepts are not the absolute focus of this work, they aid in understanding the context of the project and the flow adopted.

2.1 Basic Imaging Concepts

The concept of imaging is very broad and involves many areas, from computer science and physics to biology. In this section, some of the basics of imaging and how it relates to other fields of human knowledge will be discussed. It is important to notice, though, that this is not intended to be a complete guide to imaging, but an introduction to the subject.

Much in a similar way to how the eyes and brain work to perform different tasks with the ultimate goal of making us see (capturing an image, processing it and then extracting shape and higher level information), traditional imaging approaches can be grouped into three main fields: Image Processing, Computer Vision and Machine Vision[14]. These fields, however, often need to be used together and there is a certain intersection among them. Figure 2.1 shows these three areas along with some of their applications.

Image Processing refers to a process which has an image as input, performs some operation over it and produces another, modified image as output[14]. The goal is usually to improve image details or apply effects to change how an observer or another system perceives the image.

Figure 2.1: Imaging Areas and Related Fields.

Computer Vision, unlike Image Processing, does not have producing another image as its goal. Its characteristic is that it extracts useful, higher level information from a given image[14]. To achieve such tasks it is often necessary to preprocess the input image, either to eliminate unwanted characteristics or to enhance other characteristics, in order to improve system accuracy and performance.

Finally, Machine Vision is nothing more than the application of Computer Vision systems to industry and manufacturing. While Computer Vision is more abstract, for a system to be considered a Machine Vision system it should have sensors or actuators such as pressure sensors, robotic arms, etc. In these systems there is usually a feedback loop[14], since decisions taken by the Machine Vision system may actually impact the environment and somehow modify the sensor inputs.

As one can notice, there is a large interaction among these fields and quite often their usage is combined in real systems. The face detection work presented herein falls into the category of a Computer Vision system and therefore this area will be emphasized.

2.2 Computer Vision

As described previously, Computer Vision systems focus on extracting information from images. Extracting this high level information from an image is a task that requires an incredible amount of operations to be performed, and only in recent times have computers been able to achieve the performance necessary to accomplish such tasks[14]. As a consequence, most focused work started only after the 1970s. Highly optimized and specialized systems had to be developed in order to achieve good performance, and therefore a standardized way of creating a Computer Vision system

is not always the most effective choice. Existing solutions are well adapted to very specific tasks and cannot be easily generalized to perform other tasks in a straightforward way. These solutions are available in the form of toolkits that ease the implementation of standard algorithms on standard architectures. But each new task requires extensive research, experimentation and optimization efforts to meet the required performance versus cost tradeoffs[14], especially if the target system is an embedded system with limited resources.

Some of the more common Computer Vision tasks involve object localization, detection, recognition and tracking. Each of these tasks has its own characteristics and associated techniques, and most of the design choices made to implement a Computer Vision system come from the knowledge the development team has of the environment and operating conditions of the system.

Object Localization refers to the task of finding the location of a particular object in an image[36]. In this case, one assumes that there is such an object in the image and that the number of objects is known. The system will then search for the object in the image and calculate the locations with the highest probability of matching the objects to be located. This can be used in systems for identification card photo alignment, for instance. In that case it is previously known that there is one and only one person in the picture, and the goal is to crop the image around the person's face so that it is placed in the center of the final image generated.

Object Detection is a little more complex than object localization in the sense that one does not know a priori the number of objects present in the scene, or even whether there is an object present at all[11][39]. It is up to the system to find out the possible number of objects in the image and tell their locations. This is rather challenging in the sense that, for each candidate found, the system must evaluate the chances of it being a real object or simply a false positive detection. There is also the chance that objects that are present in the image are not identified; in this case it is called a false negative detection.

Moreover, Object Recognition consists of matching detections against a database of possible individuals and telling which specific individual was detected[29][43]. Imagine software which would detect not only the faces of the people in the picture, but also who each person is. To accomplish this, one would perform a first pass with an object detection algorithm to search for all the faces in the picture and then a second pass to try to match each face detected against a database for face labelling.

Another important task in Computer Vision is Object Tracking. It consists of following an object's trajectory once it has been found[41][7][1][4]. For practical reasons this sort of algorithm makes no sense on still images, only on video feeds. It involves using localization or detection in a continuous manner, frame after frame, matching the results of current frames with those of previous ones to determine the object trajectory.

While there are usual steps to perform a given task, a developer may choose to bypass or customize some of the steps according to his needs. These steps consist basically of image acquisition, pre-processing, feature extraction, detection and post-processing[2].

In image acquisition, one will actually capture the images to be used during the detection phase. Depending on the sensor, these images might be 2D or 3D, grayscale or colored, taken with a regular

camera or with other types of sensors, such as sonic or electromagnetic ones. One should choose the sensor that is best adapted to the task at hand and that fulfills other criteria such as cost and robustness. After image acquisition, some pre-processing might be done to enhance certain characteristics of the image, which will reduce the computational burden of the detector.

Once the image has been pre-processed it is available for feature extraction[2]. This step collects information about the image and makes it available to the detector. This information might consist of lines, shapes, motion, or anything else that might be useful for the detector to decide whether a particular region of the image contains an object or not. The detector will then take this data, analyze it against its own models and, based on some criteria, decide if there is an object or not.

The post-processing part of the algorithm usually consists of two things. The first is drawing high-level conclusions based on the detector response, like defining an object's speed, size or shape. The other involves treating and presenting the results in some way to the user. It might involve some modification to be done on the input image to mark a particular object, or sending an alarm in case of a problem in a factory production line.

Computer Vision is being used in all sorts of applications around the world. Some of the fields that already use computer vision are: medical devices, where it is used to detect anomalies in exams and support the doctor; military, for missile guidance, surveillance, etc.; and sports, for tracking players and collecting statistics.

Localization, Detection, Recognition and Tracking, as shown, are general approaches that can be applied to any object. In the context of this work, these objects are people's faces.

2.3 Face Detection Applications

Face detection consists of finding an arbitrary number of faces and determining their locations in a given image. There are many approaches to this problem[31][42][19] and none is perfect; none achieves a 100% detection rate. What one should look for when designing a face detector is an algorithm that provides the best balance between detection rate, false positive rate and performance for the particular needs at hand, and this is the tricky part.

Application of face detection on mobile phones is gaining momentum[27][18][15]. Having fast and accurate face detection opens up a lot of opportunities for designers. Information about the faces in a given image is very important for image quality, as it can be used to determine the spot where the autofocus will work, the point which should be taken into account for exposure control, and other adjustments done in the camera. Knowing the location of the faces also enables one to later apply a recognition algorithm that can be used as an anti-theft measure. Another feature that might be implemented in the future is the recognition of facial expressions. Face detection will determine the location of the faces and then another algorithm will be applied at those spots to determine the expressions people have on their faces, to know whether they are

smiling or have their eyes closed when taking a picture, for instance. Figure 2.2 exemplifies smile detection, a possible application of a face detection algorithm.

Figure 2.2: Smile detection example.

Chapter 3. Related Work

Face detection is a subject that is quite recent due to the previous limitations in available computer processing power. Earlier studies focused on face recognition rather than face detection and date back to the 1970s, like the work done by Kanade[21][22], a reference in the field. He used a method known as integral projection to locate common facial features and compare them to those in a database in order to identify the person. The images used in this study, however, consisted of ID-like pictures where the background is white and there is only one subject, centered and well lit. Furthermore, it was not able to correctly locate facial features on people wearing glasses or with beards.

Work on face detection itself is much more recent, with more extensive research starting in the mid-1990s. New algorithms were developed that made it possible to eliminate restrictions on background, face rotations and lighting. Even faces with glasses and beards are now successfully treated by most algorithms.

There are mainly two approaches for detecting faces. The first one focuses on extracting facial characteristics and then establishing geometrical rules that define what the classifier should consider to be a face or not. The second one, most used nowadays, uses a template based approach where an image database is used as input in order to train a particular type of classifier. For each of these two approaches there are different detection techniques that can be used. In some cases, these techniques are combined to produce better results depending on the target application and its restrictions.

3.1 Geometry Based Detection

In early systems, many researchers used algorithms which extracted facial features and then tried to find geometrical relations among these facial features that would characterize a human face.

3.1.1 Linear Features

This technique was one of the first used in the field. Kanade was the first to demonstrate a working system using this technique in 1973[21], even though the idea itself already existed before. The first step of the algorithm he used was to apply an edge filter to the image in order to create a binary image, where each pixel can only have two possible values: black ('0') or white ('1'). After this first step, an integral projection of the image was done on both the vertical and horizontal axes, as shown in Figure 3.1. This projection is then analysed to extract the location of facial features such as eyes, mouth and nose. He would then analyse slices of the image radially, in stripes centered on the nose, to find the contour of the chin and cheeks. Altogether more than 30 points would be extracted and compared to those in a database in order to identify a person.

Figure 3.1: Integral Projections on the vertical and horizontal axes used by Kanade in [21] to extract facial features.

As can be seen, these projections eliminate a lot of image details, making it difficult to treat more complex images. The advantage of this algorithm is that it requires less memory and computing power than current algorithms. This is due to the fact that it only works with stripes of binary pixels at a time, and this was one of the reasons he was able to implement this algorithm successfully as early as the 1970s. However, this algorithm has very limited use, since it is incapable of dealing with more complex scenes where the background can interfere with the detection, or scenes where the subject's face is rotated or poorly illuminated.

3.1.2 Component Analysis

A component analysis face detector works by using independent classifiers for detecting eyes, nose and mouth, where such classifiers are usually built using template based techniques. An algorithm proposed by Heisele, Poggio and Pontil in 2000 [11][12][10] uses a Support Vector Machine (SVM)¹ for this first step.

¹ Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Treating input data as two sets of vectors in an n-dimensional space, an SVM will construct a separating hyperplane in that space, one which maximizes the margin between the two data sets.

The first step, or first level as in Figure 3.2, consists in sweeping the classifier window through the image and applying the classifier at each step of the process. The region where the classifier will be

applied at each iteration is delimited by the black outline in the top left corner of the face image in the first level of Figure 3.2. Applying the classifier to a given window produces a score, a value which rates the likelihood that the pixels in this region match the pattern used to train the classifier originally. Still in Figure 3.2, lighter areas in the result map of each classifier's output represent higher scores, whereas darker areas represent lower scores. One can clearly notice, for instance, that the nose classifier produced much higher scores in the center of the image, where there is a white spot. The process is the same for the other classifiers. For the eye classifier the highest-score regions are located in the top corners of the image, and for the mouth classifier the highest-score region is located at the bottom of the image, as expected.

The second step, or second level, consists in segmenting the highest-score regions. Two areas are selected for the eye classifier, one area is selected for the nose classifier and another area is selected for the mouth classifier. These areas can be seen in Figure 3.2 as dotted outlines.

Finally, after determining the locations of the eyes, nose and mouth, their geometry is analyzed to determine whether their positions are compatible with those of an actual face. If the locations of eyes, nose and mouth relative to each other are compatible with those of a real face extracted during the training process, the image is tagged as a face; otherwise it is tagged as a non-face. In this example only three components were used, but more components can be used to improve accuracy. Figure 3.3, for instance, presents the detection results for a few images where fourteen components are analyzed to detect the face.

Figure 3.2: Component Analysis of facial features used in geometry based detectors[11].

Another similar algorithm, proposed by Hsu, Abdel-Mottaleb and Jain [16], uses skin-tone information on color images to create eye and mouth maps. These maps (shown in Figure 3.4) represent the locations with a higher probability of detecting one of the components in a lighter tone, while dark regions of the map represent places with a low probability of finding a given feature. After the successful location of eyes and mouth, this information is checked to see if it is geometrically coherent with a real face. Results of their algorithm can be seen in Figure 3.5.

This strategy of using simpler detectors in a first stage and then clustering the responses of the different classifiers into solutions has some positive and negative points. The positive points when compared to

Figure 3.3: Component Analysis results using Support Vector Machines[11].

Figure 3.4: Component Analysis color maps for eyes and mouth. Lighter regions represent higher probability of detecting the component[16].

other template based techniques are the reduced complexity of the classifiers and their invariance to face orientation. The classifiers used in the first stage to detect eyes and mouth are much simpler than the full-face ones used in template based techniques. Also, the fact that there is a certain freedom of positioning for these facial features makes detecting different poses easier through the use of a 3D model of the face, while in other template based techniques that use full-face classifiers, a view-based classifier has to be used to treat rotated faces. View-based classifiers consist of using different sets of features, one set for each possible face orientation (or view), which dramatically increases the number of operations to be performed.

On the negative side, this algorithm is more sensitive to partial face occlusion². If some of the components checked are not found, because there is something else in front of them or because of face rotation, it will not be able to assemble the components into a positive detection.

Figure 3.5: Component Analysis results using color information[16].

² Partial face occlusion happens when the view of the subject's face is obstructed either by objects, such as glasses or hats, or by part of the subject's body, such as the hands, presenting a real challenge to the classifiers[8].

3.2 Template Based Detection

Unlike Geometry Based detectors, which rely on the previous extraction of face features and then the clustering of the detected features into solutions that respect certain geometry rules established from a face model, Template Based detectors work directly with the image pixels, comparing some characteristics of the image being analyzed with those of the reference images in their training set. A few of the most important Template Based detectors will be described in the sequence.

3.2.1 Skin-tone

One of the approaches to Template Based classification uses image color information to filter out regions that do not have skin-tone-like colors[34][20]. The resulting image is a black and white map where all pixels which were skin-tone like are white and all pixels which were not are black. This white region is then segmented, and another algorithm checks the proportions between the vertical and horizontal axes to determine whether that is a face or not. The face proportions, as well as the thresholds defining which colors will be accepted as skin-tone, are obtained through a previous training step where the algorithms process already-labeled images from a database and adjust their own parameters.

A positive point of this method is that it is extremely fast and deals very well with rotated faces or different face sizes. However, all this simplicity has a price. If the white balance of the input image is not right due to lighting conditions, this can render the detector useless. For instance, close to sunset, lighting is a lot warmer, with red tones. Subjects shot under these lighting conditions will most likely present very red skin, which the detector might discard for being too red and therefore outside the thresholds defined by processing the training set. An example of a skin-tone classifier working in different color spaces can be seen in Figure 3.6. A color space is simply a vector space basis, and to change from one color space to another one needs only to perform a vector basis transformation. Colors in Figure 3.6 are shown distorted simply because the images' color spaces have been changed to YCbCr and HSI, but they are still being displayed as if they were coded in the RGB color space.

Very often, however, simple skin-tone filters are used to pre-process images to be used in other types of classifiers. In the work developed and presented in this document, a very relaxed skin-tone filter was chosen, one that would only eliminate regions that are assuredly not skin-tone like. This does not eliminate as much of the image, not using the full potential of the skin-tone filter, but avoids discarding skin-tone areas due to poor white balance. Very often, white balance errors produce distortions in colors that can induce the filter to improperly eliminate large portions of the image, including the faces. If the filter eliminates the faces, it is impossible for subsequent stages to detect any faces at all, hence it impairs the whole detection process. This is the reason why a more relaxed filter was picked.

3.2.2 Cascade of Simple Features

The technique of Cascade of Simple Features was initially developed by Viola and Jones [39] and is one of the most robust and widely adopted techniques. A complete explanation of how this method works can be found in Chapter 4, since this is the algorithm used as the starting point for this work. A number of solutions use this algorithm, including the Intel OpenCV library. It is a very flexible algorithm that can be easily adapted to other kinds of tasks, such as detecting pedestrians or vehicles. It is capable of detecting faces that have partial occlusions, faces with glasses or beards, and even rotated faces through a view-based approach. There is ongoing research into rotation invariant features that would eliminate the need to repeatedly apply classifiers to the same area of the image, as is the case when the view-based approach is used.
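As a concrete illustration of the skin-tone filtering described in Section 3.2.1, the minimal sketch below builds the black and white skin map from the chrominance channels of a YCbCr image. The Cb/Cr ranges used here are generic placeholder values, not the trained thresholds of this project, whose filter was intentionally much more relaxed.

#include <cstdint>
#include <vector>

// Mark each pixel as possibly skin-tone when its chrominance (Cb, Cr) falls
// inside a rectangular region. Output: 255 = skin-tone like, 0 = rejected.
std::vector<uint8_t> skinToneMask(const std::vector<uint8_t>& cb,
                                  const std::vector<uint8_t>& cr,
                                  int width, int height) {
    std::vector<uint8_t> mask(static_cast<size_t>(width) * height, 0);
    for (int i = 0; i < width * height; ++i) {
        bool skin = (cb[i] >= 77 && cb[i] <= 127) &&   // example Cb range (placeholder)
                    (cr[i] >= 133 && cr[i] <= 173);    // example Cr range (placeholder)
        mask[i] = skin ? 255 : 0;
    }
    return mask;
}

A relaxed filter would simply widen these ranges so that only regions that are assuredly not skin-tone are removed.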

Figure 3.6: Skin-tone classifier using different color spaces[34].

The negative point is that this algorithm requires many cascades in order to achieve a good degree of detection of rotated faces. In the context of this project five cascades are used. This has an impact on performance, since five classifiers are executed on the same image area, and also on memory requirements, since each cascade has a set of features that must be stored in the memory of the device. This does not pose a problem for applications that run on a desktop computer, but for mobile phone applications it can be quite troublesome. That is where this work tries to come up with intelligent approaches to avoid executing too many features on a given image.

3.2.3 Neural Network

Another important contribution from Kanade and his colleagues Rowley and Baluja from Carnegie Mellon University was the work done on using neural network based classifiers for detecting faces[32]. The first step in their algorithm is to train a neural network using a database with labeled faces. After this initial learning phase, the algorithm is ready to be run. The input image is scaled to different sizes to treat the problem of distance from the camera, and at each iteration a patch of 20x20 pixels is scanned through the image. At each location, the patch is fed to the neural network, which will then analyze the patch in order to determine whether or not that particular patch contains a face. A general overview of the algorithm can be seen in Figure 3.7.

Similarly to the technique of Cascade of Simple Features, this technique is very flexible and can be adapted to other applications just by changing the images in the training set. It is also very robust and can detect faces with rotations through the view-based approach. Comparisons show that its performance and

detection rate are somewhat equivalent to those of the Cascade of Simple Features, leaving the designer free to choose the technique with which he/she is more comfortable.

Figure 3.7: Neural Network classifier overview. The image is scaled and a patch is scanned through the image. It is pre-processed to improve image quality and then fed to a neural network which will evaluate the patch to decide whether or not it includes a valid face[32].

3.3 Embedded Software Optimization

Another important aspect in the development of an embedded imaging solution is the overall performance of the system. Choosing the right algorithm is just the first step towards attaining strict performance and consumption targets. Source code optimization and hardware acceleration of critical portions of the code are key to pushing the performance of such systems even further. Therefore, in the next subsections, a couple of methodologies for embedded software optimization will be presented and discussed.

3.3.1 Source Code Optimization Methodology

In a collaborative work between researchers from Stanford University, DEIS University of Bologna and HP Labs, a methodology for source code optimization and profiling for performance and energy consumption has been proposed in [33]. This methodology organizes optimization activities into three categories or layers, namely: algorithmic changes, data representation changes and instruction-level optimization.

Algorithmic changes consist in analysing code execution through profiling tools to determine the critical portions of the code, the computation kernels. Alternatives to these computational kernels are sought and the most promising ones are analysed to determine which alternative is best suited. The chosen alternative can then be coded and tested. Algorithmic changes have the highest potential for gains. On the other hand, they can be quite risky, since it is difficult to guarantee that the application will behave exactly as it did before.

The next step is to perform data representation changes in order to match the operations performed by

the software with those available in the target architecture. Signal processing algorithms often assume double precision floating point data that is not always supported by the target architecture and must be emulated via a software library. Avoiding this type of operation can have a tremendous impact on system performance and power consumption. Another example is the usage of 64-bit variables on 32-bit processors. Accurately matching the width of the operands will have a significant impact on the overall system performance, with the benefit that it can be done without changing the algorithm itself too much, as long as the loss in precision is acceptable.

The third and final step is to perform instruction flow optimization, where, after extensive profiling, the most critical loops are thoroughly analysed. The source code is then rewritten to accelerate the loops using well-known techniques such as loop unrolling, merging and software pipelining, among others. If the loops are executed very often, each cycle gained at each iteration of the loop will have a significant impact on the final performance.

As will be seen in the subsequent chapters, a similar flow was used for the development of the Face Detection solution presented here. At first, algorithmic changes based on extensive exploration were made in parallel with data representation changes based on profiler results. Later, instruction flow optimization was performed, also based extensively on profiler results. Details of these two initial optimization steps can be found in Chapter 5. Furthermore, yet another form of instruction optimization was performed: the creation of customized instructions through the addition of specialized hardware in the form of a processor instruction set extension.

3.3.2 Instruction Set Extensions Design Flow

Another form of optimization of a given embedded solution is to create customized instructions that perform a number of operations in a single instruction. In a work published in 2006, Leupers[24] identifies the basic steps towards implementing custom instructions: application code analysis; custom instruction identification; custom instruction implementation; software adaptation and tools generation; and hardware architecture implementation.

Through the extensive use of profiling tools, the code is thoroughly analysed to identify frequently executed code regions where the same group of instructions is executed over and over. These are naturally the most suitable areas in which to search for opportunities to create a customized instruction, through instruction clustering, for instance. The goal is to find sections of the code that are repeated often enough that executing them in a single instruction would improve the overall system performance.

Once the code has been analysed and the opportunities for customized instructions identified, the next step is to actually implement these customized instructions and validate them. The protocol and the timing of the interfaces between the custom hardware extension and the processor must be respected. Latency and area constraints need to be met through proper pipeline stage balancing. At this stage, since synthesis has not yet been performed and no actual data on the performance of the final hardware implementation is available, estimations must be used.

After the custom instructions have been identified and implemented, they need to be inserted in the

original application and tested to verify that the application is still functionally correct. Support from the processor toolset is needed so that the new instruction set is recognized and correctly used by the compiler and profiling tools. The new executable is then used to functionally validate the application and to allow for more precise estimations through the use of the profiler.

The final step towards the implementation of the custom instructions is to synthesize the custom instruction hardware. Code for the instruction needs to be converted to a hardware description language and integrated into the processor description itself, so that the whole system can be synthesized and verified.

In his article, Leupers defines a series of algorithms for performing automated instruction generation and synthesis. Interesting results have been presented, with accelerations on the order of 40% on average for pure MIPS applications.

Various customized instructions were defined for the Face Detection solution. However, unlike the work described by Leupers, no automation was used in the process of identifying the custom instructions. The custom instructions were defined after a thorough analysis of the critical functions of the application and various trials and experiments. Finally, the instructions defined allowed an acceleration factor of 2.4x to be reached for the entire application. More details are given in Chapter 6.
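To make the data representation step from Section 3.3.1 concrete, the sketch below contrasts a double-precision scaling operation with an equivalent fixed-point version. It is a generic illustration under an assumed Q8.8 format; it is not code taken from the face detection project itself.

#include <cstdint>

// Floating-point version: on a 32-bit RISC core without a floating-point
// unit, the multiplication below is emulated by a software library call.
inline uint8_t scalePixelFloat(uint8_t pixel, double gain) {
    double v = pixel * gain;
    return static_cast<uint8_t>(v > 255.0 ? 255.0 : v);   // clamp to 8-bit range
}

// Fixed-point version: the gain is converted once to Q8.8 (8 fractional
// bits), so the per-pixel work becomes a single integer multiply and shift
// that the 32-bit integer datapath executes directly.
inline int32_t gainToQ8(double gain) {
    return static_cast<int32_t>(gain * 256.0 + 0.5);       // one-time conversion
}

inline uint8_t scalePixelFixed(uint8_t pixel, int32_t gainQ8) {
    int32_t v = (pixel * gainQ8) >> 8;                      // Q8.8 multiply, back to integer
    return static_cast<uint8_t>(v > 255 ? 255 : v);         // clamp to 8-bit range
}

As discussed above, such a change trades a small, controlled loss of precision for a large reduction in cycle count and in power on processors without hardware floating-point support.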

Chapter 4

Detection Algorithm

The detection algorithm used as the base for this project was created by an external algorithm team and is based on a technique originally introduced by Viola and Jones in [39]. This chapter focuses on explaining in detail how the original algorithm works. Although the description covers the base algorithm prior to the optimization work, some comments are made along the way on implementation-specific choices for the target application. The next chapter presents the modifications made to this base algorithm in order to improve its performance and quality.

4.1 Algorithm Description

The algorithm proposed by Viola and Jones represented a breakthrough in real-time face detection and is widely used in face/object detection solutions all over the world[43]. It relies on a few very interesting concepts to achieve a good detection rate while minimizing resource usage. Two of these key concepts are:

1. the use of an integral image instead of a raster image;
2. a classifier structure using a reduced set of simple Haar-like features.

4.1.1 Integral Image

Before explaining the algorithm flow, these two concepts are described. An integral image is an image in which each point, or integral coefficient, represents the sum of all pixels to the left of and above the equivalent point on the input image, including the point itself (from this point on, the images are actually the luminance (Y) channel of a YUV image and are therefore in grayscale). Assume the point at location (x, y):

    ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y')        (4.1)

where ii(x, y) is the integral image and i(x, y) is the input raster image.

Figure 4.1: The value of the integral image at ii(x, y) is the sum of all the pixels above and to the left, including i(x, y)[39].

Figure 4.1 represents the same integral image concept graphically. One might ask what the benefit is of working with an integral image instead of a regular raster image, and the answer is simply that it is much easier to obtain the sum of the pixels of a given region from the integral image. Imagine a raster image from which the sum of all pixels in a rectangular area needs to be calculated: the image would have to be read pixel by pixel and the results accumulated. Now suppose an integral image has been precalculated and the same pixel sum needs to be computed. Only four points need to be accessed to obtain this value, namely W, X, Y and Z. To understand how this is done, consider Figure 4.2.

In Figure 4.2, area D outlines the region for which the sum of all pixels needs to be found. To calculate this sum, a series of additions and subtractions is performed. Calculation starts by reading the value of point Z. This point represents the sum of the pixels in all rectangles A, B, C and D, so the contributions of A, B and C need to be removed so that only the sum over D is left. To accomplish this, the value of point Y is read and subtracted from the value of point Z. After this operation the current sum is B + D, and B still needs to be eliminated, so the value of point X is subtracted from Z − Y. The result of this last step is D − A, because A has now been subtracted twice; to compensate, the value of point W is added. Hence, the final expression is Z − Y − X + W, and its result is simply the sum of the pixels in area D.
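Both operations just described, building the integral image and reading a rectangle sum from four corner coefficients, can be sketched in C as below; the function names, types and row-major layout are illustrative assumptions, not the actual implementation.

    #include <stdint.h>

    /* Build the integral image: ii(x, y) is the sum of all input pixels above
     * and to the left of (x, y), including (x, y) itself (Eq. 4.1). A single
     * pass suffices using the recurrence
     *   ii(x, y) = i(x, y) + ii(x-1, y) + ii(x, y-1) - ii(x-1, y-1). */
    void integral_image(const uint8_t *in, uint32_t *ii, int w, int h)
    {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                uint32_t left    = (x > 0)          ? ii[y * w + (x - 1)]       : 0;
                uint32_t up      = (y > 0)          ? ii[(y - 1) * w + x]       : 0;
                uint32_t up_left = (x > 0 && y > 0) ? ii[(y - 1) * w + (x - 1)] : 0;
                ii[y * w + x] = in[y * w + x] + left + up - up_left;
            }
        }
    }

    /* Sum of the pixels inside the rectangle with top-left corner (x0, y0) and
     * bottom-right corner (x1, y1), both inclusive: only four coefficients
     * (W, X, Y, Z in Figure 4.2) are read. */
    uint32_t rect_sum(const uint32_t *ii, int w, int x0, int y0, int x1, int y1)
    {
        uint32_t Z = ii[y1 * w + x1];                                     /* A+B+C+D */
        uint32_t Y = (x0 > 0)           ? ii[y1 * w + (x0 - 1)]       : 0; /* A+C    */
        uint32_t X = (y0 > 0)           ? ii[(y0 - 1) * w + x1]       : 0; /* A+B    */
        uint32_t W = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0; /* A      */
        return Z - Y - X + W;
    }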

Figure 4.2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location W is the sum of the pixels in rectangle A. The value at location X is A + B, at location Y it is A + C, and at location Z it is A + B + C + D. The sum within D can be computed as Z + W − (X + Y)[39].

4.1.2 Simple Haar-like Features and Cascaded Classifier

Now that the integral image and the calculation of pixel sums over rectangular areas have been presented, let us move on to their application. This is where another key contribution of Viola and Jones comes in: the idea of using simple Haar-like features to analyze patterns in faces or objects, and of building the classifier on these features.

Haar-like features can be defined as differences of sums of pixels over rectangular areas, which can be at any position and scale within the original image[26]. They are digital image features commonly used in object detection and recognition schemes, and owe their name to their intuitive similarity with Haar wavelets[35]. Examples of the features proposed by Viola and Jones can be seen in Figure 4.3.

Figure 4.3: Example Haar-like features proposed by Viola and Jones in [39].

Each of the figures (A), (B), (C) and (D) represents a different feature that may be applied to the integral image. The result of each feature is a numerical value: the sum of all pixels enclosed in the white regions minus the sum of all pixels enclosed in the hatched regions. These features concentrate on a particular region of the image and try to identify differences in contrast in that region. When viewed in grayscale, a typical human face has a usually darker T-shaped region composed of the eyebrows, eyes and nose. The forehead, just above this T region, is usually lighter. The mouth, in turn, can be seen as a dark rectangle enclosed by lighter areas such as the cheeks and chin. The algorithm works by focusing on a given area, or window, applying the appropriate features to specific regions of this window in search of the differences in contrast that characterize a face.
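Building on the rect_sum() helper sketched in the previous section, a two-rectangle feature such as (A) in Figure 4.3 could be evaluated as in the fragment below; the feature descriptor layout is an assumption made purely for illustration.

    #include <stdint.h>

    /* Assumed descriptor of a two-rectangle Haar-like feature inside a patch:
     * a light (white) rectangle and a dark (hatched) one, patch-relative. */
    typedef struct {
        int wx0, wy0, wx1, wy1;   /* white rectangle corners  */
        int dx0, dy0, dx1, dy1;   /* hatched rectangle corners */
    } haar_feature_t;

    /* rect_sum() as sketched in Section 4.1.1. */
    uint32_t rect_sum(const uint32_t *ii, int w, int x0, int y0, int x1, int y1);

    /* Feature value: pixels under the white region minus pixels under the
     * hatched region. (px, py) is the top-left corner of the current patch. */
    int32_t haar_feature_value(const uint32_t *ii, int img_w, int px, int py,
                               const haar_feature_t *f)
    {
        uint32_t white = rect_sum(ii, img_w, px + f->wx0, py + f->wy0,
                                              px + f->wx1, py + f->wy1);
        uint32_t dark  = rect_sum(ii, img_w, px + f->dx0, py + f->dy0,
                                              px + f->dx1, py + f->dy1);
        return (int32_t)white - (int32_t)dark;
    }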

One of the difficulties of this algorithm is to determine which feature should be applied to which region, and what range of values to expect as a result. This is determined by a learning algorithm called AdaBoost[9]. The algorithm works on a database of previously hand-tagged images in which all faces are marked. It then analyses which features are common among these faces and which are not shared with regions of images that do not contain any faces (usually called non-faces), and selects the features that can best identify a face. The boosting in AdaBoost comes from the next step: a verification pass is performed in which the features are analysed and, for each false detection, a penalty is imposed and the importance of that sample is increased for the next learning round. This process can be repeated many times, until the algorithm is unable to find a better group of features or it has run for too long. To give an idea, this whole learning operation can take a few days or even more. In the context of this project, the learning phase was performed entirely by the external algorithm team and is therefore out of the scope of the work described in this document.

So, a feature is applied to a particular region of a specific window of the image, and the result is considered positive if the value obtained lies within a range delimited by two thresholds, namely threshold high and threshold low. The number of features necessary to identify a face can be in the thousands, so a strategy was needed to avoid executing too many features when there is little or no chance of a face being found.

To solve the problem of too many features being executed, Viola and Jones chose to use classifier cascades. A cascade groups the features to be executed into a sequence of stages[25]. Each stage can be composed of any number of features, from a single feature to thousands of features. The execution of a cascade starts at the first stage, applies its features to the image and checks the results; if the algorithm is certain there is no chance of a face being detected, it exits the cascade, otherwise it proceeds to the next stage, and so on (see Figure 4.4). When it finally reaches the last stage, it executes all of its features and, if their results are good, the cascade result is positive and a face is said to be detected.

A question that remains is how a stage result is computed from the feature results. Associated with each feature is a weight, which will be called alpha. Each successful feature has its weight added to an accumulator variable, called sigma, and at the end of the stage execution this sum is compared against a minimum expected value, the stage threshold.

The idea of grouping features in stages is particularly interesting because some simple features can tell right away that there is no chance of a face being found in the current window, so execution can stop at the very beginning. With this in mind, when a cascade is designed, only one feature is used in the earliest stages, and this number increases progressively until the last stage, where hundreds of features are common. This does not help the case where a face is found, since the same number of features will be executed, but it gives the algorithm a chance of exiting after executing just one feature.
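The stage and cascade logic just described can be summarized by the sketch below; the data layout (per-stage feature list, alpha weights, thresholds) is a plausible organization assumed for illustration, not necessarily the one used in the project.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        /* rectangle geometry omitted; see the feature sketch in Section 4.1.2 */
        int32_t threshold_low;
        int32_t threshold_high;
        int32_t alpha;            /* weight added to sigma when the feature passes */
    } feature_t;

    typedef struct {
        const feature_t *features;
        int              n_features;
        int32_t          stage_threshold;   /* minimum expected sum of alphas */
    } stage_t;

    /* Stands for the Haar-like feature evaluation sketched earlier (name assumed). */
    int32_t feature_value(const uint32_t *ii, int img_w, int px, int py,
                          const feature_t *f);

    /* One stage: accumulate the alphas of the features whose value lies between
     * the two thresholds, then compare the sum (sigma) to the stage threshold. */
    static bool run_stage(const stage_t *s, const uint32_t *ii, int img_w,
                          int px, int py)
    {
        int32_t sigma = 0;
        for (int i = 0; i < s->n_features; i++) {
            int32_t v = feature_value(ii, img_w, px, py, &s->features[i]);
            if (v >= s->features[i].threshold_low && v <= s->features[i].threshold_high)
                sigma += s->features[i].alpha;
        }
        return sigma >= s->stage_threshold;
    }

    /* One cascade on one patch: exit at the first failing stage, report a face
     * only if every stage passes (Figure 4.4). */
    bool run_cascade(const stage_t *stages, int n_stages,
                     const uint32_t *ii, int img_w, int px, int py)
    {
        for (int s = 0; s < n_stages; s++)
            if (!run_stage(&stages[s], ii, img_w, px, py))
                return false;            /* early exit: no face in this patch */
        return true;                     /* all stages passed: face detected  */
    }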
The area of the image without faces is usually much larger than the area that contains faces. From analysing code execution reports, it was possible to see that about 50% of all cascades exit at the first stage (first feature) and another 25% exit at the second stage (second feature). Together, the first two stages are responsible for eliminating around 75% of all possible executions, leaving only 25% to be analysed more carefully by the stages with more features. Examples of the first two features selected by the AdaBoost algorithm for the detection of upright frontal faces are shown in Figure 4.5.

Figure 4.4: This flow represents the execution of a cascade without any optimizations.

The feature types defined and used in this project are shown in Figure 4.6; their choice was based on previous work by Viola and Jones[39]. This is the basic set of features used in object detection algorithms, since these are relatively simple features, easily implemented in software or hardware.

4.1.3 Detector Scanning

The detector works on a tiny region of only 24x24 integral image coefficients in the algorithm proposed by Viola and Jones[39], and 20x20 in the work presented here. In the context of this work, to differentiate it from another concept that will be introduced in the optimization phase, this particular region will be called a patch from this point on, rather than a window.

To detect faces, the detector cascades are applied to a patch, and the result indicates whether or not there is a face in that patch. To be able to detect faces at all positions of the image, the image is scanned with the detector so that every possible patch is checked. This is done by placing the detector initially at the top leftmost patch of the image and moving it to the right, one pixel at a time; when the border of the image is reached, the detector returns to the left but is displaced one pixel down, and this process is repeated until the whole image has been covered. Figure 4.7 illustrates this concept of sweeping the image with the detector in order to search for faces at all possible positions.

Considering the number of possible patches in a QVGA image (320x240 pixels), the number of times the detector is applied to this single image is:

    N = (image width − patch width) × (image height − patch height)
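A sketch of this raster scan over all patch positions is shown below; the helper names and the detection-report callback are assumptions made for the sake of illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define PATCH_SIZE 20        /* detector patch side used in this work */

    /* Simplified signature of the cascade evaluation sketched in the previous
     * section; 'cascade' hides the stage array and its length. */
    bool run_cascade_at(const void *cascade, const uint32_t *ii, int img_w,
                        int px, int py);

    /* Slide the detector over every patch position: left to right one pixel at
     * a time, then one pixel down, until the whole image is covered (Figure 4.7). */
    void scan_image(const void *cascade, const uint32_t *ii, int img_w, int img_h,
                    void (*report)(int x, int y))
    {
        for (int y = 0; y + PATCH_SIZE <= img_h; y++)
            for (int x = 0; x + PATCH_SIZE <= img_w; x++)
                if (run_cascade_at(cascade, ii, img_w, x, y))
                    report(x, y);     /* a face-sized pattern found at (x, y) */
    }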

Figure 4.5: First two features selected by the AdaBoost algorithm, corresponding to the first and second stages of the cascaded classifier design[39]. The first feature checks the difference in luminosity between the darker region of the eyes and the lighter region of the upper cheeks, while the second feature checks the difference in luminosity between the lighter region at the center of the forehead and part of the nose against the darker regions of the eyebrows.

Figure 4.6: Final feature types defined for this work.

For a QVGA frame this evaluates to N = (320 − 20) × (240 − 20) = 66,000.

Figure 4.7: The image pixels are represented in green, the hypothetical face at the center of the image in yellow, and the image patch where the detector is applied in orange. The detector is scanned from left to right, then moved one pixel down and again scanned from left to right until the whole image has been processed. The image on the left represents the initial patch position and the scanning direction; the result of applying the detector on this patch is negative. The image on the right represents a patch where the detector result is positive.

4.1.4 Handling Face Sizes and Distances

The distance of the subject to the camera has a direct impact on the size of the face in the image: the closer the subject is, the larger the portion of the image its face will occupy. If the detector is scanned simply as described previously, only faces that fit the detector will result in positive detections. See Figure 4.8.

Figure 4.8: The image on the left represents the case where a face is too big to be detected by scanning the detector over a single image size; in this case the detector will not be able to detect it as a face. The image on the right shows the case where the face has the appropriate size and the detector works.

To overcome this problem there are two solutions: either scale the detector or scale the image. In the approach chosen by Viola and Jones[39] the detector is scaled, but in the approach presented here the method of scaling the image instead of the detector was chosen.
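The chosen approach can be sketched as the loop below, in which the input frame is repeatedly downscaled by a fixed factor and the same fixed-size detector is scanned over each copy. The scale factor, the helper names and the fact that a full integral image is rebuilt at every scale are simplifying assumptions; the memory organization actually used is discussed next.

    #include <stdint.h>

    #define PATCH_SIZE 20
    #define SCALE_NUM  4          /* assumed downscale factor of 4/5 = 0.8 per step */
    #define SCALE_DEN  5

    /* Helpers assumed to exist (sketched in earlier sections or elsewhere). */
    void downscale(const uint8_t *src, int sw, int sh, uint8_t *dst, int dw, int dh);
    void integral_image(const uint8_t *in, uint32_t *ii, int w, int h);
    void scan_image(const void *cascade, const uint32_t *ii, int w, int h,
                    void (*report)(int x, int y));

    /* Detect faces of different sizes by scaling the image, not the detector:
     * each downscaled copy is scanned with the same 20x20 patch, so a large face
     * eventually shrinks to the detector size (Figure 4.9). Reported positions
     * are in the coordinates of the current scale. */
    void detect_multi_scale(const void *cascade, uint8_t *img, uint32_t *ii,
                            int w, int h, void (*report)(int x, int y))
    {
        while (w >= PATCH_SIZE && h >= PATCH_SIZE) {
            integral_image(img, ii, w, h);     /* full integral image: simplification */
            scan_image(cascade, ii, w, h, report);

            int nw = w * SCALE_NUM / SCALE_DEN;
            int nh = h * SCALE_NUM / SCALE_DEN;
            downscale(img, w, h, img, nw, nh); /* in-place shrink; a real version
                                                  would use a scratch buffer */
            w = nw;
            h = nh;
        }
    }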

Scaling the detector is usually a good option. One just has to adjust the points where the pixel sums are taken to obtain a feature value. This value is then normalized with respect to the ratio between the area of the patch to which the detector was applied and the standard patch area of 20x20 pixels. However, this approach has an important drawback in terms of memory usage. At all times it is necessary to keep the full integral image in memory and access it randomly. Each integral image coefficient uses 3 bytes instead of 1 byte per pixel of the original image, which by itself would triple the memory needed to hold the working image.

The approach chosen, scaling the image and always applying the detector at the same size, represents a small overhead in terms of computation, because the whole image has to be scaled every time and the integral coefficients recalculated, but it is better in architectural terms. Only a few patches of the integral image are kept in memory at any time, and the integral image is calculated on the fly, which would not be possible with the other approach. To clarify how this works, Figure 4.9 shows the effect of applying the detector to the same image scaled to different sizes.

Figure 4.9: The first row represents an image with a big face that is scaled down a few times. The second row represents the size of the face relative to the standard detector patch. It is possible to see that, by changing the size of the input image, eventually the face size will match the size of the detector; this way the system is able to detect faces of different sizes.

4.1.5 Handling Face Rotations

Now that the classifier can handle multiple face sizes and different positions in the image, another aspect needs to be treated. One of the drawbacks of this type of classifier is that it does not handle faces that are rotated or tilted. Unless a person is posing for an ID photo, people are very often not in an exactly upright position. Faces appear tilted to the left or right, which is called an in-plane rotation, or rotated sideways towards a profile view, which is called an out-of-plane rotation.

There are different approaches to the object/face detection problem that use features invariant to rotation[32], but they are more complex and not as quickly computed. To enable the classifier described in this section to detect faces with different types of rotation, different cascades are needed. In the context of this project, five cascades are used, one for each type of rotation listed below:

1. frontal upright faces;
2. negative in-plane rotation close to −30°;
3. positive in-plane rotation close to +30°;
4. negative out-of-plane rotation close to −45°;
5. positive out-of-plane rotation close to +45°.

It is important to notice that these numbers represent the main orientation of each detector cascade, but each cascade has a detection range of about ±15° in both in-plane and out-of-plane rotations around this main orientation. Figure 4.10 illustrates these different types of rotation.

Figure 4.10: Examples of faces with in-plane and out-of-plane rotations. From left to right, top to bottom: in-plane rotation −30°; frontal upright; in-plane rotation +30°; out-of-plane rotation −45°; out-of-plane rotation +45°.

4.1.6 Illumination Normalization

Illumination may play an important role in the detection. If a face has too little contrast, or is either too dark or too bright, the detector might not be able to detect it. To minimize this problem, a simple solution is adopted. After a feature is applied to the patch and its value calculated, and before comparing this value to the thresholds to determine whether the feature result is positive or negative, the value of the feature is normalized. This technique is similar to the Local Normal Distribution normalization method,
