
To evaluate our models, we consider three publicly available data sets that have been used in previous work and allow for direct comparison on the tasks of image auto-annotation and keyword-based image retrieval. In this section, we first describe each of them, providing basic statistics and some example images. Then, in Section 4.4.4, we detail the feature sets that we extracted from the images, and the base distances we used for each feature.

4.4.1 Corel 5000

This data set was introduced and first used by Duygulu et al. [2002] and is available online1. Since then, it has become an important benchmark for keyword-based image retrieval and image annotation.

1At: http://kobus.ca/research/data/eccv_2002/


Figure 4.5: Examples of images from the Corel 5000 data set, with their associated keywords (e.g. "plane", "jet", "sky", "mountain", "swimmers", "pool", "fox", "water", "river", "people", "ice", "kauai", "coast").

It is a subset of around 5000 images of the larger Corel CD set, a large collection of manually annotated images. The images, all taken by professional photographers, have a standard size of 256×384 pixels.

Originally, the collection comes with a high-level structure which roughly indicates the scene type, such as "africa", "museum", "insects", "dogs", "orchids", "people", etc., and can be used for image categorisation. Instead, the Corel 5000 data set focuses on the manual annotations that have been given to the images in the form of keywords.

These keywords describe the content of the images to some extent, especially when objects are present in the images. The annotations consist of one to five keywords per image, and were originally assigned for indexing purposes.

The data set is split into training and testing subsets. The former contains 4500 images, and the latter 499. The training set contains 371 different keywords, of which only 260 also appear in the test set. It is therefore natural to restrict our vocabulary to these 260 words.
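To make this restriction concrete, the sketch below intersects the keyword sets of the two splits. It assumes annotations are stored as a dict mapping image ids to sets of keyword strings; this layout and all names are our own illustration, not the original implementation.

```python
# Minimal sketch of the vocabulary restriction described above: keep
# only keywords that occur in both the training and the test annotations.
def restrict_vocabulary(train_annotations, test_annotations):
    """Each argument maps an image id to a set of keyword strings."""
    train_words = set().union(*train_annotations.values())  # 371 words on Corel 5000
    test_words = set().union(*test_annotations.values())
    vocabulary = train_words & test_words                   # 260 words remain
    # Drop out-of-vocabulary keywords from every training annotation.
    restricted = {img: words & vocabulary
                  for img, words in train_annotations.items()}
    return restricted, sorted(vocabulary)
```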

Two example images have already been shown in Figure 4.1, and Figure 4.5 gives several additional ones. These examples show that the variety in the data is limited, and that the annotations do not cover the visual content comprehensively.

4.4.2 ESP Game

The ESP Game data set was built recently by von Ahn and Dabbish [2004]. The images were collected on the Internet and therefore show a very large diversity: among them we find logos, drawings, personal photos, web page decorations, etc., sometimes of very low quality. The data set is thus challenging.

Moreover, the annotations were obtained from an online game. In this game, two players, who cannot communicate outside the game, gain points by agreeing on words when simultaneously shown the same image. As a consequence, almost all agreed-upon words describe the image to some extent. Notably, players easily agree on words that are written in the images, a problem which is beyond the scope of this work. Showing the same image to several pairs of players helps identify the important and meaningful tags for each image. Contrary to the Corel 5000 data set, these annotations are therefore not specifically intended for indexing or retrieval.

There are 60000 images publicly available2, but, to ensure a fair comparison, we use the subset of around 20000 images used in Makadia et al. [2008], with a vocabulary of 268 words. As for Corel 5000, examples have already been provided in Figure 4.1, and we add a few more in Figure 4.6.

4.4.3 IAPR TC-12

This set of 20,000 images, accompanied by descriptions in several languages, was initially published for cross-lingual retrieval (Grubinger [2007]). It can be transformed into a format comparable to the other sets by automatically extracting common nouns using natural language processing techniques similar to those we used in Chapter 3 for extracting names of individuals. Given the length of the texts, this procedure typically yields a large number of keywords, but they are less descriptive and noisier. The noun extraction procedure also provides the number of occurrences of each noun in the description, although we do not exploit this additional information. As for the ESP Game data set, the annotations that we use as ground truth were not specifically created for the tasks we consider. We use the same resulting annotations as Makadia et al. [2008], which are publicly available3, with a vocabulary of 291 words.
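As a rough illustration of the noun extraction step, the sketch below pulls common nouns and their occurrence counts out of a description using spaCy; the library choice, model name, and function names are our assumptions, since this section does not specify the tools that were used.

```python
# Hypothetical sketch: extracting common nouns (and their counts) from
# an IAPR TC-12 style description, using spaCy for POS tagging.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_keywords(description: str) -> Counter:
    """Return common nouns and their occurrence counts in a description."""
    doc = nlp(description)
    # POS tag NOUN covers common nouns only; proper nouns are tagged PROPN.
    nouns = [tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN"]
    return Counter(nouns)

counts = extract_keywords(
    "A tourist stands in front of a white house with a red roof; "
    "mountains rise behind the house."
)
print(counts)  # e.g. Counter({'house': 2, 'tourist': 1, 'roof': 1, ...})
```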

The images are touristic photos from South America, and are of good quality. In this respect, they do not show as wide a variety as the ESP Game data set, as illustrated in Figure 4.1 and Figure 4.7.

2http://hunch.net/~learning/

3http://www.cis.upenn.edu/~makadia/annotation/


Figure 4.6: Examples of images from the ESP Game data set, with their associated keywords (e.g. "computer", "board", "people", "table", "window", "magazine", "cover", "snow", "river", "coin", "money", "box").

In Table 4.1, we summarise statistics of the three data sets, showing the average and maximum number of images assigned the same keyword, and the average and maximum number of keywords associated with an image. Below we describe our feature extraction procedure.
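For concreteness, the per-image and per-word statistics reported in Table 4.1 can be computed directly from the annotations. The sketch below again assumes a dict mapping image ids to sets of keywords; the layout and names are our own.

```python
# Sketch of how the Table 4.1 statistics can be derived from annotations.
from collections import Counter

def annotation_statistics(annotations):
    """annotations: dict {image_id: set of keyword strings}."""
    words_per_image = [len(ws) for ws in annotations.values()]
    images_per_word = Counter(w for ws in annotations.values() for w in ws)
    return {
        "avg words/image": sum(words_per_image) / len(words_per_image),
        "max words/image": max(words_per_image),
        "avg images/word": sum(images_per_word.values()) / len(images_per_word),
        "max images/word": max(images_per_word.values()),
    }
```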

4.4.4 Feature extraction

We extract different types of features commonly used for image search and categorisation. They cover all four combinations of global or local features with colour or texture descriptors.

The two types of global image descriptors that we use are Gist features (Oliva and Torralba [2001]) and colour histograms with 16 bins in each colour channel for the RGB, LAB and HSV representations. Our local features include SIFT (Lowe [2004]) as well as a robust hue descriptor (van de Weijer and Schmid [2006]), both extracted densely on a multi-scale grid or at Harris-Laplacian interest points. Each local feature descriptor is quantised using k-means, with 1000 bins for SIFT and 100 bins for Hue, learned on two million descriptors.
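As an illustration of these descriptors, the sketch below computes a colour histogram and a k-means bag-of-words quantisation using scikit-learn. Whether the three colour channels are binned jointly (16³ bins) or independently is not stated in this section, so we show the independent per-channel variant; all function names are ours, not the original code.

```python
# Hedged sketch of two descriptors: a global colour histogram with 16
# bins per channel, and bag-of-words quantisation of local descriptors.
import numpy as np
from sklearn.cluster import KMeans

def colour_histogram(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """image: H x W x 3 uint8 array. Returns an L1-normalised 3*bins vector,
    one histogram per channel, concatenated (assumption: independent binning)."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def build_codebook(descriptors: np.ndarray, k: int = 1000) -> KMeans:
    """descriptors: N x D array of local descriptors sampled from training
    images (e.g. SIFT with D=128); k=1000 for SIFT, 100 for Hue."""
    return KMeans(n_clusters=k, n_init=1).fit(descriptors)

def bag_of_words(codebook: KMeans, descriptors: np.ndarray) -> np.ndarray:
    """Normalised histogram of visual-word assignments for one image."""
    words = codebook.predict(descriptors)
    h = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return h / h.sum()
```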

Figure 4.7: Examples of images from the IAPR TC-12 data set, with their associated keywords (e.g. "cloud", "lake", "mountain", "desert", "landscape", "sky", "tourist", "house", "roof", "city", "building", "street", "car", "people").

                                     Corel 5000   ESP Game   IAPR TC-12
Image size                           256×384      variable   360×480
Vocabulary size                      260          268        291
Number of training images            4500         18689      17665
Number of test images                499          2081       1962
Average number of words per image    3.4          4.7        5.7
Maximum number of words per image    5            15         23
Average number of images per word    58.6         362.7      347.7
Maximum number of images per word    1004         4553       4999

Table 4.1: Statistics of the training sets of the three data sets. Average and maximum image and word counts are also provided. These statistics for the test sets are very similar to those of the training sets, except for the number of images and images per word.