
We have introduced two new methods for visual verification: Logistic Discriminant Metric Learning (LDML) and Marginalised kNN classification (MkNN). We note that LDML can be trained from labelled pairs as provided in the restricted paradigm of Labeled Faces in the Wild, whereas MkNN requires labelled training data and implicitly uses all pairs. With its log loss, LDML is a robust technique to learn a Mahalanobis metric in a supervised fashion. Additionally, LDML can perform dimensionality reduction and is kernelizable. The MkNN classifier is conceptually simple, but in practice it is computationally expensive, as we need to find nearest neighbours in a large set of labelled data. This computational cost can be alleviated by using efficient and/or approximate nearest neighbour search techniques.
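For concreteness, the following Python/NumPy sketch illustrates the two scoring rules described above: the LDML pair probability, a sigmoid of a bias minus the Mahalanobis distance induced by M = L^T L, and the MkNN pair score obtained by marginalising class agreement over the k nearest labelled neighbours of each face. The function and variable names are ours, and the training loop that fits L and b by maximising the log-likelihood of labelled pairs is omitted; this is a minimal sketch, not the exact implementation used in our experiments.

import numpy as np

def ldml_pair_probability(xi, xj, L, b):
    """LDML score: P(same identity) = sigmoid(b - d_M(xi, xj)),
    where the Mahalanobis matrix is M = L^T L (low-rank if L is rectangular)."""
    diff = L @ (xi - xj)
    d_m = float(diff @ diff)            # squared Mahalanobis distance
    return 1.0 / (1.0 + np.exp(-(b - d_m)))

def mknn_pair_score(xi, xj, X_train, y_train, k=10):
    """MkNN score: the probability that a random neighbour of xi and a random
    neighbour of xj (among their k nearest labelled neighbours) share a class."""
    def knn_labels(x):
        d = np.sum((X_train - x) ** 2, axis=1)   # plain L2 here; a learnt metric can be plugged in
        return y_train[np.argsort(d)[:k]]
    li, lj = knn_labels(xi), knn_labels(xj)
    classes = np.unique(np.concatenate([li, lj]))
    n_i = np.array([(li == c).sum() for c in classes])
    n_j = np.array([(lj == c).sum() for c in classes])
    return float(n_i @ n_j) / (k * k)

As the sketch makes explicit, MkNN touches the labelled training set for every test pair, which is where the nearest-neighbour cost discussed above comes from.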



Figure 2.23: Illustration of face recognition of 7 (out of 17) people using one training exemplar, with one person in each column. For each person we show: (a) the exemplar image, (b) a correctly recognised face of that person, (c) a non-recognised face of that person, and (d) another failure: an erroneously accepted face of another person.


LDML in combination with our descriptors yields a classification accuracy of 79.3% on the restricted setting of the Labeled Faces in the Wild data set, where the best reported results so far were 78.5% and 79.5% using the funnelled data (Wolf et al. [2008] and Pinto et al. [2009]) and 86.8% using improved alignment and background information (cf. Wolf et al. [2009]). LDML and MkNN yield comparable accuracies on the unrestricted setting, above 83%. Remarkably, the gain when using the unrestricted setting is not observed with other state-of-the-art methods such as Wolf et al. [2008]. We were the first to present results on the Labeled Faces in the Wild data that follow and make good use of the unrestricted paradigm. Combining our methods, the accuracy is further improved to 87.5%. Later work by Taigman et al. [2009] obtained 89.5%, which is the current best for the unrestricted setting.

We also showed that metric learning yields large improvements over a simple L2 metric for applications of face similarities such as clustering and recognition from a single exemplar.
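To illustrate how a learnt metric plugs into these two applications, the hypothetical sketch below projects the face descriptors once with the matrix L of the learnt metric, and then runs standard Euclidean tools (average-linkage clustering, nearest-exemplar assignment with a rejection threshold) in the projected space. The function names and thresholds are ours and purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, pdist

def project(X, L):
    """Map descriptors into the learnt space, where the squared L2 distance
    equals the Mahalanobis distance d_M with M = L^T L."""
    return X @ L.T

def cluster_faces(X, L, threshold):
    """Agglomerative (average-linkage) clustering of faces under the learnt metric."""
    Z = linkage(pdist(project(X, L)), method='average')
    return fcluster(Z, t=threshold, criterion='distance')

def recognise_from_exemplars(X_test, exemplars, exemplar_ids, L, reject_dist):
    """Single-exemplar recognition: assign each test face to the identity of
    its closest exemplar in the learnt space, or reject it if too far.
    exemplar_ids is an integer array of identities; -1 marks "unknown"."""
    D = cdist(project(X_test, L), project(exemplars, L))
    nearest = D.argmin(axis=1)
    ids = exemplar_ids[nearest]
    ids[D.min(axis=1) > reject_dist] = -1
    return ids

Using the plain L2 metric corresponds to replacing L by the identity matrix, which is the baseline against which the improvements above are measured.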

For face recognition, looking at the examples of failure cases of our method in Figure 2.20, pose changes remain one of the major challenges to be tackled in future work. Explicit modelling of invariance to pose changes, using techniques like those of Cao et al. [2010], is an option worth exploring.

Concerning metric learning, we plan to explore learning the optimal metric for MkNN classification directly. More generally, metric learning is a promising technique for other computer vision tasks, for instance retrieval. We intend to extend our study to other data sets and tasks, especially in weakly supervised settings.

3 Caption-based supervision for face naming and recognition

Contents

3.1 Introduction
3.2 Related work on face naming and MIL settings
3.3 Automatic face naming and recognition
3.4 Data set
3.5 Experiments
3.6 Conclusion

3.1 Introduction

In this chapter, we consider a first type of multimodal data: news images with captions. This data is typically published by news media agencies such as Agence France Presse, Associated Press, Reuters, or the Belga News Agency. The published documents consist of a short text describing a piece of news, and an image illustrating the event, as shown in Figure 3.1. Although they are not specifically created with this purpose, the textual parts of those documents describe the visual content of the images to some extent.

Using such data sets, we can consider many computer vision applications, as long as the image caption is sufficiently informative about the visual task at hand. Obviously, the origin of the data introduces a strong bias towards political, sports and social events.

Many news stories concern people and their actions: President Obama addresses the press, Roger Federer wins a tennis match, and so on.

An Iranian reads the last issue of the Farsi-language Nowruz in Tehran, Iran, Wednesday, July 24, 2002. An appeals court on Wednesday confirmed the sentence banning Iran’s leading reformist daily Nowruz from publishing for six months and its publisher, Mohsen Mirdamadi, who is President Mohammad Khatami’s ally, from reporting for four years. Mirdamadi is head of the National Security and Foreign Policy Committee of the Iranian parliament. (AP Photo/Hasan Sarbakhshian)

Chanda Rubin of the United States returns a shot during her match against Elena Dementieva of Russia at the Hong Kong Ladies Challenge January 1, 2003. Rubin beat Dementieva 6-4 6-1. (REUTERS/Bobby Yip)

Figure 3.1: Illustration of two multimodal documents from news agencies: they consist of a text illustrated by an image. The caption therefore partially describes the visual content of the image, especially in terms of human identities and actions.

It is quite natural to try to leverage the quantity of information that these data sets represent for tasks that relate to human properties such as their identity, their pose and appearance, or their actions. The applications are numerous, and given that the quality of amateur photography as found on Flickr and Facebook is converging towards that of professional photography, we expect that the face and action recognition systems we train today on news data will soon be available directly as tools for users of media sharing websites.

In Chapter 2, we showed how we can use a similar type of image with manual labels to train face recognition systems in uncontrolled settings. Here, we explore how we can exploit the weak supervision that the captions provide instead of using manual annotations. Specifically, we study the following two tasks.

First, we address the task of face naming. Face naming is the task of finding the correct associations between names that appear in the captions and faces that appear in the corresponding images, as illustrated in Figure 3.2. By obtaining a good naming of the faces, the burden of building a manually labelled data set of faces can be greatly alleviated. Equivalently, for the same annotation effort, much larger data sets can be obtained. We show that, given a simple model for documents, face naming is a constrained clustering problem, and propose adapted algorithms to perform it efficiently. We introduce a graph-based method to solve this task and compare it to previously proposed generative approaches.
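For a single document, the constraints are that each face matches at most one name from the caption and each name matches at most one face, with the possibility of leaving a face unnamed. The sketch below solves this per-document assignment with the Hungarian algorithm over hypothetical face-name affinity scores; it is only meant to make the constraints concrete, not to reproduce our graph-based method, which additionally couples assignments across documents through face similarities.

import numpy as np
from scipy.optimize import linear_sum_assignment

def name_faces_in_document(score, null_score=0.0):
    """Assign each face (row) to at most one caption name (column), and each
    name to at most one face, maximising the total face-name affinity.
    score[f, n] is a similarity between face f and name n; extra "null"
    columns allow a face to stay unnamed when no available name scores
    above null_score."""
    n_faces, n_names = score.shape
    # Augment with one null column per face so every face can opt out.
    augmented = np.hstack([score, np.full((n_faces, n_faces), null_score)])
    rows, cols = linear_sum_assignment(-augmented)  # maximise the total score
    return {f: (c if c < n_names else None) for f, c in zip(rows, cols)}

# Example: 2 faces, 3 candidate names extracted from the caption.
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.2, 0.1]])
print(name_faces_in_document(scores, null_score=0.25))
# {0: 0, 1: None}: face 0 takes the first name; once that name is taken,
# face 1's remaining scores fall below the null option, so it stays unnamed.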


Figure 3.2: Examples of typical image-caption pairs in the Yahoo! News data set, and the result of automatic face naming. In the first image, the faces are labelled Angela Merkel and Hu Jintao from the caption "German Chancellor Angela Merkel shakes hands with Chinese President Hu Jintao (...)"; in the second, Kate Hudson and Naomi Watts from the caption "Kate Hudson and Naomi Watts, Le Divorce, Venice Film Festival - 8/31/2003."


Second, we try to learn metrics for verification, as in Chapter 2, but using captions as weak supervision. As a baseline, we can use the automatic naming of faces as supervision to train traditional metric learning algorithms. However, the errors made by the naming process will degrade the predictive performance of the metric.

Therefore, we also explore an alternative approach to learn useful metrics from noisily annotated faces, or, more exactly, noisily annotated sets of faces. We formulate the problem in the Multiple Instance Learning (MIL) framework and introduce MildML for learning metrics in such a setting. MildML stands for Multiple Instance Logistic Discriminant Metric Learning. As we will see, with this approach, the learnt metrics for face recognition outperform the ones obtained from using automatically named faces, without any user intervention.
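To give a flavour of the bag-level formulation, the sketch below treats each document as a bag of face descriptors and declares a pair of bags positive when their captions share a name. As a simplifying assumption of ours, the bag-level probability is taken from the closest instance pair across the two bags, so the loss never needs to know which faces inside the bags actually correspond; the gradient-based optimisation of L and b is omitted, and the exact MildML objective differs in its details.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_pair_probability(bag_a, bag_b, L, b):
    """Bag-level probability that the two documents show a common person,
    taken here from the closest instance pair across the two bags.
    bag_a and bag_b are arrays of face descriptors, one row per face."""
    pa = bag_a @ L.T                      # project the faces into the learnt space
    pb = bag_b @ L.T
    d = ((pa[:, None, :] - pb[None, :, :]) ** 2).sum(-1)   # all pairwise distances
    return sigmoid(b - d.min())

def mil_log_loss(bag_pairs, pair_labels, L, b):
    """Negative log-likelihood over bag pairs; pair_labels[k] is 1 when the
    captions of the two bags share at least one name, and 0 otherwise."""
    loss = 0.0
    for (bag_a, bag_b), y in zip(bag_pairs, pair_labels):
        p = bag_pair_probability(bag_a, bag_b, L, b)
        loss -= y * np.log(p) + (1 - y) * np.log(1 - p)
    return loss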

In Section 3.2 we review work related to face naming and MIL metric learning. Then, in Section 3.3, we present generative and graph-based approaches for face naming, which are published in Guillaumin et al. [2008] and Guillaumin et al. [2010a]. In the same section, we also study the different means to learn a metric for face recognition automatically from the weak supervision of news documents, including MildML, which was published in Guillaumin et al. [2010c]. In Section 3.4, we describe techniques to extract names from captions and features for faces in images, and we also present the Labeled Yahoo! News data set, which is a subset of Yahoo! News that we have manually annotated. In Section 3.5, we evaluate and discuss the performance of the different methods for face naming and weakly supervised metric learning. We conclude in Section 3.6.