
2.4 Data set and features

In this section, we present the data set used for our experiments and our feature extraction procedure for describing faces in a vector space, as required by metric learning techniques. We also discuss several alternative face descriptors from the literature.

2.4.1 Labeled Faces in the Wild

The Labeled Faces in the Wild data set (Huang et al. [2007b]) originates from a multimodal data set named Yahoo! News, which was collected in 2002–2003 and introduced by Berg et al. [2004a]. The Yahoo! News data set consists of images and accompanying captions. A face detector and a named entity detector were applied to this large collection, and documents with no detected faces or names were removed, leading to a database of roughly 15000 image and caption pairs, with 15280 detected named entities and 22750 detected faces.

Using these documents as sets of faces and names, a constrained clustering technique was proposed to automatically assign names to the depicted faces. However, it yielded a high percentage of errors, and some pictures appeared to be duplicates. In Chapter 3, we consider alternative approaches for this clustering task. To obtain a labelled data set of face images, Huang et al. [2007b] have taken the approach of manually annotating a subset of the face images, using the captions as an aid for naming the person and a unique identifier for each individual. In the end, the Labeled Faces in the Wild data set contains 12233 face images with identity labels. In total, 5749 people appear in the images, 1680 of whom appear in two or more images.

The faces show a great variety in pose, expression, lighting, etc.; see Figure 2.2 for some examples. An aligned version of all faces is available, referred to as “funnelled”, which we use throughout our experiments. This data set can be viewed as a partial ground-truth for the Yahoo! News data set. Labeled Faces in the Wild has become the de facto standard data set for face verification, with active monitoring of the literature for improvements. As of today, more than 20 participants have submitted results to this challenge, even though the data has been available for less than 2 years.

The data set comes with a division into 10 parts called folds that can be used for cross-validation experiments. The folds each contain between 527 and 609 different people, and between 1016 and 1783 faces. From all possible pairs, a small selection of 300 positive and 300 negative image pairs is provided for each fold. Using only these pairs for training is referred to as the “image-restricted” paradigm; in this case the identity of the people in the pairs cannot be used. The “unrestricted” paradigm is used to refer to training methods that can use all available data, including the identity of the people in the images.
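To make the protocol concrete, the following is a minimal sketch of a leave-one-fold-out loop over the 10 folds in the image-restricted setting; `pairs[i]`, `train_fn` and `evaluate_fn` are hypothetical stand-ins for the annotated pairs of fold i, the training procedure (e.g. a learned metric) and its evaluation, respectively.

```python
# Minimal sketch of leave-one-fold-out cross-validation on the 10 LFW folds.
# `pairs[i]` is assumed to hold the 300 positive and 300 negative pairs of fold i;
# `train_fn` and `evaluate_fn` are hypothetical placeholders for the actual method.
def leave_one_fold_out(pairs, train_fn, evaluate_fn):
    fold_accuracies = []
    for test_fold in range(10):
        # Image-restricted paradigm: train only on the provided pairs, no identities.
        train_pairs = [p for i in range(10) if i != test_fold for p in pairs[i]]
        model = train_fn(train_pairs)
        fold_accuracies.append(evaluate_fn(model, pairs[test_fold]))
    return fold_accuracies  # one accuracy per fold
```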

Performance is measured using a ROC curve on the 6000 selected pairs, with classification scores obtained from classifiers trained with the corresponding fold excluded from training. A ROC curve plots the true positive rate versus the false positive rate. Since the positive and negative classes are balanced in the test set, we will also report the classification performance when there are as many false positives as false negatives. This corresponds to the operating point of the ROC curve with equal misclassification, or equal error rate. We therefore refer to this measure as the ROC-EMC accuracy:

\[
\nu = \frac{TP + TN}{P + N}, \tag{2.44}
\]

where TP is the number of true positives, TN the number of true negatives, P the true number of positives and N the true number of negatives. The recommended measures for comparing methods on the Labeled Faces in the Wild data set are the following: the mean accuracy µ and standard deviation σ over the folds, as given by:

\[
\mu = \frac{\sum_{i=1}^{10} \nu_i}{10}
\quad\text{and}\quad
\sigma = \sqrt{\frac{\sum_{i=1}^{10} (\nu_i - \mu)^2}{9}}, \tag{2.45}
\]

where ν_i is the accuracy for fold i alone.
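As a small numerical sketch of Equations (2.44) and (2.45), the per-fold ROC-EMC accuracy and its aggregation over the 10 folds can be computed as follows (the function names are ours, for illustration only):

```python
import numpy as np

def roc_emc_accuracy(tp, tn, p, n):
    """Equation (2.44): fraction of correctly classified pairs at the
    equal-misclassification operating point."""
    return (tp + tn) / (p + n)

def lfw_mean_std(fold_accuracies):
    """Equation (2.45): mean accuracy and standard deviation over the 10 folds."""
    nu = np.asarray(fold_accuracies, dtype=float)
    mu = nu.mean()
    sigma = np.sqrt(np.sum((nu - mu) ** 2) / (len(nu) - 1))  # divide by 9 for 10 folds
    return mu, sigma

# Example: with 300 positive and 300 negative test pairs in a fold,
# 280 + 250 correct decisions give an accuracy of (280 + 250) / 600.
```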

2.4.2 Face descriptors

In this section, we first describe existing descriptors for faces, and then present the one we used in our experiments.

Historically, a vectorial representation of faces is obtained from a holistic description of the face image. This includes recent successful approaches like Local Binary Patterns (LBP, Ahonen et al. [2004]) and later extensions, e.g. TPLBP and FPLBP (Wolf et al. [2008]).

As the name suggests, the idea behind LBP is to extract a binary code from the pattern locally surrounding a pixel. From the comparison between a pixel value and its eight neighbour values, an 8-bit code is obtained, as illustrated in Figure 2.9. The robustness of LBP to illumination changes comes from the invariance of the comparisons to monotonic transformations of the greyscale pixel values. Notably, gamma correction (x ↦ x^γ), brightness and contrast transformations are monotonic.
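To make the construction concrete, here is a minimal sketch of the basic 8-neighbour LBP code of a single pixel; the bit ordering and comparison convention below are one common choice, not necessarily the exact one of Ahonen et al. [2004].

```python
import numpy as np

def lbp_code(img, y, x):
    """Basic LBP: compare a pixel with its 8 neighbours and pack the
    8 binary outcomes into one integer code in [0, 255]."""
    centre = img[y, x]
    # Clockwise enumeration of the 8 neighbours, starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        code |= int(img[y + dy, x + dx] >= centre) << bit
    return code

# Because only comparisons are used, any monotonic greyscale transformation
# (gamma correction, brightness, contrast) leaves the code unchanged.
```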

We also consider the extension of LBP from Wolf et al. [2008], namely Three-Patch and Four-Patch LBP (TPLBP and FPLBP, respectively). As illustrated in Figure 2.10,


Figure 2.9: A local binary pattern (LBP) is a binary code formed by comparing a pixel value with its neighbours (left). The 8 comparisons are encoded with 0 and 1 (centre), and the pattern is serialized into an 8-bit code (right).

Figure 2.10: TPLBP (left) and FPLBP (right) are extensions of LBP. The drawings, from Wolf et al. [2008], show how codes are constructed by comparing pixel values over three and four patches, respectively, with the associated parameters w (patch size), r (radius) and α (angle).

they consist in sampling several patches in the neighbourhood of pixels and performing comparisons between several pixel values in order to obtain binary codes.8 Notably, these descriptors have many parameters to set. We will use the recommended parameter values provided by the authors of the original work.

Other global descriptors exist, for instance GIST (Oliva and Torralba [2001]) and Histograms of Oriented Gradients (HOG, Dalal and Triggs [2005]). Although they are not specifically designed for human face recognition, it is interesting to use them in combination with machine learning techniques like metric learning for the verification task. The work of Funes Mora [2010] shows that HOG descriptors can perform comparably to our SIFT descriptor described below, but the face images have to be correctly aligned first.

Another type of face description relies on facial feature detection. Facial feature detection, also known as fiducial point localization, has the goal of localizing specific points in face images, such as the corners of the eyes, mouth, nose and eyebrows.

8 Code available at: http://www.openu.ac.il/home/hassner/projects/Patchlbp/

Figure 2.11: On the left, the facial features are shown. The sketch in the middle illustrates the tree-like constellation model. On the right, the position uncertainties for parts 2, 3 and 4 are shown once part 1 is fixed. Courtesy of Felzenszwalb and Huttenlocher [2005].

This is an important tool for face recognition as it can intervene in several processing steps. The first is data alignment: it is straightforward to infer an affine transformation between two faces if at least three fiducial points are matched (see also Urschler et al. [2009]); a minimal sketch of this alignment step is given below.

Second, due to the flexibility of the individual localizations, the facial features can help build face descriptors that are invariant to non-linear transformations. Finally, if confidence scores are obtained for the localization of the fiducial points, it is possible to estimate the probability that the points are incorrectly localized, using for instance outlier detection techniques. An incorrectly localized feature could also be caused by an occlusion of the corresponding part of the face image. Systems can therefore build upon facial feature detection for handling occlusions in a robust manner, see e.g. Funes Mora [2010].
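As an illustration of the alignment step mentioned above, the following sketch estimates a 2×3 affine transform from three or more matched fiducial points by least squares; the function name and interface are ours, for illustration only.

```python
import numpy as np

def affine_from_matches(src_pts, dst_pts):
    """Estimate the 2x3 affine matrix M mapping src_pts to dst_pts.
    `src_pts` and `dst_pts` are (N, 2) arrays with N >= 3 non-collinear points."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    A = np.hstack([src, np.ones((len(src), 1))])   # rows [x, y, 1]
    # Solve A @ X = dst in the least-squares sense; X is 3x2, i.e. M transposed.
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T                                      # 2x3 affine transform

# With exactly three matched points (e.g. both outer eye corners and the
# tip of the nose), the system is exactly determined and the residual is zero.
```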

State-of-the-art facial feature detectors include that of Felzenszwalb and Huttenlocher [2005]. It is based on a constellation model which combines discriminative local appearance models with a generative model for spatial regularization. The system is made computationally efficient by using a tree-structured Gaussian mixture for the joint position of the features, as illustrated in Figure 2.11, and this structure is learned together with the appearance models at training time. Three mixture components are used, and they correspond approximately to frontal faces and faces oriented towards either side.

For the appearance models, score functions are learned from the training set using an AdaBoost framework, with weak classifiers operating on Haar features.

Everingham et al. [2006] improved the efficiency of the appearance model evaluation, and trained the system to detect 9 facial features: (1) left corner of the left eye, (2) right corner of the left eye, (3) left corner of the right eye, (4) right corner of the right eye, (5) left corner of the nose, (6) tip of the nose, (7) right corner of the nose, (8) left corner of the mouth, and (9) right corner of the mouth. Having detected the facial features, their description is usually obtained by extracting features from the pixel neighbourhood using local appearance descriptors, as illustrated in Figure 2.12.

Everingham et al. [2006] compared two descriptors. First, they used the pixel values of surrounding circular patches, including an alignment procedure, normalisation and noise reduction. The resulting descriptor is illustrated in Figure 2.13: greyscale values are serialized and concatenated into a 1937D descriptor. Second, they tried using SIFT descriptors (Lowe [1999]).



Figure 2.12: Illustration of the facial feature-based processing pipeline of Everingham et al. [2006]. After detecting faces in the images, face images are aligned and given to the facial feature detector. Then, local appearance descriptors of these fiducial points are extracted and concatenated to obtain a vectorial representation of the faces.


Figure 2.13: On the left, the circular patches describe the 9 facial features in the 1937D descriptor of Everingham et al. [2006]. In the centre and on the right, illustration of SIFT features (128D each, Lowe [1999]) for describing the patches.

SIFT features have proved very successful in many computer vision tasks such as image stitching, object recognition, robotic navigation, and action recognition in videos. The SIFT descriptor is composed of a grid of 4×4 = 16 histograms of 8 bins, and is therefore 128D. Each bin represents the gradient magnitude for a particular orientation within the cell being considered. This magnitude is weighted by a Gaussian function centred on the keypoint. Using the Euclidean distance between descriptors, the authors did not find the SIFT features to bring any improvement.

Relying on the success of using facial features for recognition, we will use the detector of Everingham et al. [2006] which is available on the web.9 In a setting where a keypoint provides a scale, the SIFT description is performed at that particular scale.

9 At: http://www.robots.ox.ac.uk/~vgg/research/nface/.

Figure 2.14: Illustration of our SIFT-based face descriptor. SIFT features are extracted at 9 locations and 3 scales, shown at the top. Each row represents a scale at which the patches are extracted: the top row is scale 1, the middle row is scale 2 and the bottom row is scale 3. The first column shows the locations of the facial features, and the remaining nine columns show the corresponding patches. Our descriptor is the 3456D concatenation of the 3×9 SIFT features describing the patches.

Here, although the face detector gives an approximate scale for the face, it is not precise, and the facial feature detector does not provide any scale either.

To overcome this potential issue, we propose (cf. Guillaumin et al. [2009b]) to use multiscale SIFT descriptors to describe the patches. Setting scale σ = 1 to represent a 16×16 patch in the 250×250 face images, we extract SIFT features at multiple scales, for σ ∈ {1, 2, 3}. Our 3456D SIFT-based face descriptor is the concatenation of the 9×3 = 27 SIFT features, as illustrated in Figure 2.14.
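The following sketch shows how such a descriptor could be assembled with an off-the-shelf SIFT implementation (here OpenCV's, as a stand-in for the implementation actually used in Guillaumin et al. [2009b]); the mapping from scale to keypoint size is an assumption for illustration.

```python
import cv2          # requires OpenCV >= 4.4, where SIFT is in the main module
import numpy as np

def multiscale_sift_descriptor(gray_face, fiducial_points,
                               scales=(1, 2, 3), base_patch=16):
    """Sketch of the 3456D face descriptor: one 128D SIFT feature per
    fiducial point and per scale, concatenated.
    `gray_face` is a 250x250 greyscale face image (uint8);
    `fiducial_points` is a list of 9 (x, y) locations."""
    sift = cv2.SIFT_create()
    # One keypoint per (scale, fiducial point); the keypoint size controls
    # the support region, here base_patch pixels at scale 1 (assumed mapping).
    keypoints = [cv2.KeyPoint(float(x), float(y), base_patch * s)
                 for s in scales for (x, y) in fiducial_points]
    _, descriptors = sift.compute(gray_face, keypoints)  # (27, 128) array
    return descriptors.reshape(-1)                        # 27 * 128 = 3456D
```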