Object recognition for semantic robot vision


Palavras-chave Representação de objectos, reconhecimento de objectos, classificação de objectos, detecção de objectos, extracção de objectos, segmentação de imagem, combinação de classificadores, The Semantic Robot Vision Challenge, procura de objectos numa cena.

Resumo Reconhecer todos os objectos presentes numa qualquer imagem do dia-a-dia será um importante contributo para a compreensão autónoma de imagens. Um agente inteligente, para perceber todas as dinâmicas do conteúdo semântico, precisa primeiramente de reconhecer cada objecto na cena. Contudo, a aprendizagem e o reconhecimento de objectos sem supervisão continuam a ser um dos grandes desafios na área da visão robótica. O nosso trabalho é uma abordagem transversal a este problema. Nós construímos um agente capaz de localizar, numa cena complexa, instâncias de categorias previamente requisitadas. Com o nome da categoria, o agente procura autonomamente imagens representativas da categoria na Internet. Com estas imagens aprende sem supervisão a aparência da categoria. Após a fase de aprendizagem, o agente procura instâncias da categoria numa fotografia estática do cenário.

Esta dissertação é orientada à detecção e ao reconhecimento de objectos numa cena complexa. São usados dois modelos para descrever os objectos: Scale Invariant Feature Transform (SIFT) e o descritor de forma proposto por Deb Kumar Roy. Para localizar diferentes objectos de interesse na cena efectuamos segmentação de cena baseada nas saliências de cor. Após localizado, extraímos o objecto da imagem através da análise dos seus contornos, para finalmente reconhecê-lo através da combinação de vários métodos de classificação.


Keywords Object representation, object recognition, object classification, object detection, object extraction, scene segmentation, ensemble of classifiers, the semantic robot vision challenge, environment search phase.

Abstract Recognizing objects in an everyday scene is a major step towards unsupervised image understanding. An intelligent agent needs to first identify each object in an environment scene, so that it can eventually understand all the dynamics of the semantic content. However, unsupervised learning and unsupervised object recognition remain a great challenge in the vision research area. Our work is a transverse approach to unsupervised object learning and object recognition. We built an agent capable of locating, in a complex scene, an instance of a requested category. The name of a category is uploaded to the agent's system and it autonomously learns the category's appearance by searching the Internet for category examples. Then it explores a static picture of the surrounding environment, looking for an instance of the previously learned category.

This dissertation focuses on object detection and object recognition in a complex picture scene. We use the Scale Invariant Feature Transform (SIFT) and Roy's Shape Representation (RSR) to represent an object, and an ensemble of several classification techniques to recognize it. To obtain an object's location in the complex scene we use scene segmentation, based on image colour saliencies, and object extraction, based on contour analysis.


Contents

1 Introduction
  1.1 Object representation
  1.2 Object detection and extraction
  1.3 Object classification
  1.4 Semantic Robot Vision Challenge - SRVC
    1.4.1 SRVC - Internet Search Phase
    1.4.2 SRVC - Environment Search Phase
    1.4.3 Objects to Find
    1.4.4 Scoring the Competition
  1.5 Objectives
  1.6 Related work
    1.6.1 Object representation
    1.6.2 Object detection and extraction
    1.6.3 Object classification
  1.7 Organization of the dissertation

2 The UA@SRVC Agent
  2.1 Main agent modules
    2.1.1 Object fetcher
    2.1.2 Object selection
    2.1.3 Category model building
    2.1.4 Object detection
    2.1.5 Object extraction
    2.1.6 Object Classification
    2.1.7 Object pointer
  2.2 Architecture of the UA@SRVC agent

3 Object representation
  3.1 Roy's shape representation - RSR
    3.1.1 The algorithm
    3.1.2 Estimating the tangent angle
  3.2 Shape context
    3.2.1 Shape context as a global descriptor
  3.3 Scale invariant feature transform - SIFT
  3.4 Category representation in the UA@SRVC agent

4 Object detection and extraction
  4.1 Colour saliency clustering
    4.1.1 Measuring colour saliency
    4.1.2 Detection of points of interest
    4.1.3 Clustering of points of interest
  4.2 Contour analysis for object extraction
  4.3 Object detection and extraction in the UA@SRVC agent

5 Object classification
  5.1 Classifiers
    5.1.1 Shape context match
    5.1.2 RSR match
    5.1.3 SIFT match
  5.2 Combination of classifiers
    5.2.1 Classification voting system - CVS
  5.3 Object classification in the UA@SRVC agent

6 Performance evaluation
  6.1 Controlled tests
    6.1.1 Dataset description
    6.1.2 Classifiers individual results
    6.1.3 The results of the classification voting system
  6.2 Evaluation in SRVC scenarios
    6.2.1 Object detection and extraction results
    6.2.2 The UA@SRVC agent results
    6.2.3 Chapter conclusions

7 Conclusions and future work


List of Figures

1.1 Ideal object detection
2.1 The UA@SRVC agent modules distributed by the two SRVC's phases
2.2 Categorized groups of images
2.3 Ideal object extraction
2.4 The UA@SRVC agent pointing to an object
2.5 The Internet search phase
2.6 The environment search phase
3.1 Canny's edge detector
3.2 One iteration of the RSR algorithm main cycle
3.3 Whole process to acquire an RSR histogram from a hammer
3.4 Linear regression window
3.5 Shape context basic idea
3.6 Global shape context
3.7 GSC histogram rotation
3.8 Illustration of the SIFT feature extraction process
3.9 Maxima and minima of the difference-of-Gaussian
3.10 Overview of category representation
4.1 The stages of the colour saliency clustering
4.2 The concentric window
4.3 Sub-image types
4.4 Extraction scenarios
4.5 Sub-image fragment selection
4.6 Saliency grid
5.1 SIFT keypoint match
5.2 The CVS classification process
6.1 RSR scenarios
6.2 RSR and GSC results
6.3 Results of the general categories
6.4 The time spent in the classification process
6.5 SIFT results
6.6 CVS scenarios
6.7 Specific categories results


Chapter 1

Introduction

Autonomous image understanding is the Holy Grail of computer vision research. Unsupervised learning and the ability of a computer to fully comprehend the semantic content of a scene may open a whole new level in the Artificial Intelligence area. However, true image understanding remains a great challenge for vision researchers. To recognize a scene's semantic content, it is first necessary to recognize each object present. A number of recent studies have presented new approaches for categorizing a scene [9, 39], although they classify the scene as a whole and not the individual objects within it. Our approach follows a different policy, which may be divided into two steps: first, recognize individually each object that composes the scene; second, understand the semantic dynamics between them, enabling autonomous image understanding. Our team developed an intelligent agent designed to fulfil the first step. From the name of a category, our agent autonomously learns its visual appearance by searching for example images on the Internet [31]. With the knowledge gathered from the Internet, it scans static pictures of the environment, looking for the precise location of an instance of the requested category. Our agent is a transverse solution for recognizing objects in a scene: it touches many vision fields, such as unsupervised learning [31], scene segmentation, object extraction and object classification. This dissertation reports and evaluates the work performed on the intelligent agent and also proposes some upgrades to the agent's first version to improve its performance.

1.1 Object representation

Object recognition is a fundamental task of computer vision: from a given image, we try to automatically recognize the identity of the object or category visually represented in it [9]. But for an agent to visually recognize an object, it has to learn the object's appearance in the first place. The agent then needs to associate the object descriptor with a category, so that other similar objects are recognized as members of the same category.

An unsupervised learning technique [31] selects which of the images retrieved from the Internet will represent the training categories. Each category is represented by a set of object representations. This way we provide the agent with the knowledge required to recognize an instance of a category in a scene.

An object image is basically an array with a colour code in each cell. This is certainly an object representation, but how can an agent extract information from such complex data? The agent can extract multiple kinds of information from a single image, for instance colour, shape or unique characteristics, and save them as features that represent the object. However, for this to happen, there are many technical issues to solve first, like creating models robust to scale, rotation and shear variations, or robust to noise. Many authors, like Lowe [26], Roy [34] and Belongie [3], have dealt with these problems and suggested strategies to solve them. In the chapter "Object representation" we present their approaches and propose some modifications for our specific problem.

1.2 Object detection and extraction

Object detection and extraction remain a great challenge in computer vision research. Correct image segmentation is a critical step towards autonomous image understanding by computers. An image scene is like a "jigsaw puzzle" of objects: each object is a puzzle piece and, when fit together, they form a semantic context. To achieve the main goal of computer vision, autonomous image understanding, an agent needs to first comprehend each object by itself and then conjugate them in a semantic context. However, to comprehend each object, the agent has to detect its location in the complex scene and then extract it from the rest of the scene. This scene segmentation sets the stage for object recognition by providing a higher-level representation of an image in terms of regions of uniform colour/intensity and structural relevance [32]. In our case, object detection consists of analysing the scene looking for plausible object regions (figure 1.1). However, decomposing a scene into regions of interest is not enough for recognizing an object. Typically, some regions of interest contain more than one object or are too big (have a larger area than required to fit the object). Therefore, after detecting a candidate object region, it is desirable to isolate the object(s) from the rest of the scene's environment noise. This task is known as object extraction. However, an unsupervised evaluation of what is background and what is part of the object is difficult. [14] and [1] worked around this issue, but their extraction methods were not completely unsupervised. We developed two new unsupervised extraction heuristics. The first one merges neighbouring detected contours into a unique object region. The second one extracts and re-bounds the hypothetical object region, based on the size of the resulting segmented region (sub-image).

Figure 1.1: Ideal object detection. All relevant objects in the semantic context were segmented with the minimal bounding box.


1.3 Object classification

Object classification is a challenging problem that requires strong generalization ability from the classification methods in order to handle variations in illumination, occlusions, viewpoint and noise from the background clutter [15]. Nowadays, object recognition has been heavily improved by local features invariant to common image transformations, like SIFT [26, 27]. They proved to be robust in recognizing instances of specific categories under different viewpoints or lighting conditions. Since the intelligent agent has the right information describing a specific category, local feature approaches are a powerful tool for its recognition [26]. However, to represent the concept of a generic category these approaches have lower performance. Shape analysis [34, 3] of an object has the advantage of a better generalization capability compared with local feature analysis (like SIFT). However, shape analysis has many limitations, especially when the object to classify does not have a clean background or is partially hidden in a scene. In this dissertation we try to take advantage of these factors: we built an agent that, according to the type of category to recognize, switches between shape analysis and local feature analysis. Furthermore, to improve the results and combine the classification methods' individual decisions into a final verdict, we implemented an ensemble of classifiers [10] approach.

1.4 Semantic Robot Vision Challenge - SRVC

A team from the University of Aveiro was formed with the objective of participating in the Semantic Robot Vision Challenge 2008 software league, under the scientific guidance of Professor Luis Seabra Lopes. SRVC'2008 took place in Anchorage, Alaska, in association with the IEEE conference Computer Vision and Pattern Recognition 2008. The event was sponsored by the US National Science Foundation and Google. The team from the University of Aveiro built an agent that was able to qualify for the international competition. This dissertation reports an integral part of the UA@SRVC agent. Therefore, to understand the research choices made in this dissertation, it is first necessary to know the SRVC specifications.

SRVC is an international research competition designed to push the state of the art in image understanding and automatic knowledge acquisition from large unstructured databases of images, such as those generally found on the Internet. The robots in this competition are completely autonomous. They receive a text list of object categories that must be located in the environment. To learn the objects' appearance, the robots automatically use the web to search for image examples of those object categories. With the knowledge gathered from the Internet, they must be able to locate the objects in a complex scene.

The Semantic Robot Vision Challenge is divided into two leagues: the robot league and the software league. In the robot league, besides the difficulty of recognizing objects in a complex scene, teams must provide a robot capable of navigating safely in the indoor scenario, without damaging the room or the objects within it. In the software league, teams do not need to care about the robot navigation system: the competition organizers provide the teams with a set of pictures taken from the environment. The downside of this league is that the teams do not have any control over the pose or position of the robot's cameras.


Each team must participate in two distinct phases of the competition: the Internet Search phase and the Environment Search phase. The work in this dissertation is more oriented to the Environment Search phase, while the Internet Search phase task was assigned to a fellow team member, Rui Pereira [31].

1.4.1 SRVC - Internet Search Phase

In the Internet Search phase the robots or intelligent agents may autonomously gather and select appropriate information from the Internet.

After receiving the file containing the list of the twenty category names, the agents must autonomously search public-domain image databases in order to get image examples of the target categories. For this purpose, our agent uses the Google Image Search engine, which proved to be, compared with other engines, the most robust and comprehensive for any type of object. However, image search engines have a big problem: when a keyword (e.g. airplane, hammer, etc.) is searched, they retrieve thousands of images, but only a small fraction is a good visual representation of the object. So, besides gathering the images, our agent must autonomously pick the appropriate images that truly represent the object. The image selection task is explained in detail in Rui Pereira's dissertation [31]. Furthermore, it is in this phase that our agent builds the object class database from the good images previously collected and selected. This database holds each object category name, associated with models built from the good images that represent the object. All these tasks should be accomplished within a maximum of 4 hours; for a set of twenty objects, our agent requires an average of 30 minutes to complete the Internet Search phase.

1.4.2 SRVC - Environment Search Phase

The Environment Search phase is where the robots or intelligent agents search the environment looking for the requested objects. In the robot league, robots navigate freely through the environment searching for the objects that best match the objects required to be found. This may be an advantage, because robots can learn to take good pictures that need little segmentation. In the software league, besides the textual file containing the object names, a set of pictures taken in the environment is provided. The pictures are not optimal, i.e. there are pictures without any objects and pictures that need plenty of segmentation. It is in these pictures that software league teams must recognize the required objects.

Detecting the relevant objects in the scene pictures is one of the main tasks in this phase. Using detection techniques based on colour saliency, our agent retrieves, from a single scene picture, smaller images with regions of interest that have a high probability of containing an object. Finally, these segmented images are compared with the images fetched in the Internet Search phase and the objects within are eventually recognized. When an object is identified, a bounding box marks its location in the scene picture. All these tasks should be performed within a maximum of 30 minutes. In our case, the Environment Search phase takes an average of 5 minutes to find 20 objects in the given set of 50 pictures.

1.4.3 Objects to Find

In the textual list of objects to find there are two distinct types of object categories: general categories, such as "hammer" or "bottle", and specific categories, such as a music CD with a specific title, e.g. "Hopes and Fears by Keane". This raises a new technical issue: which of the known models and classifiers suits each type of object better. In the chapter "Performance Evaluation" we test our classification methods to see which one fits each category type better.

The objects to find are randomly placed in the scenario. However, some of the requested objects are not present in the scenario and, to increase the difficulty, other objects that do not appear in the textual list are also placed in the scenario.

1.4.4 Scoring the Competition

At the end of the competition each team returns a picture from the environment for each object specified in the initial list of objects to be found. Every picture must have one, and only one, bounding box that clearly identifies the position of the object. However, if more than one object is found in a picture, that same picture may be reused for each of the different objects there. An object may be identified only once: if there are multiple pictures for the same object, none of those pictures will be scored. Finally, each picture is scored according to the following rules:

1. The match between the real bounding box and the returned bounding box is computed as the ratio of the two bounding boxes (intersection/union).

2. Based on the computed ratio values, points are assigned as follows:

   ratio ≥ 75% ⇒ 3 points
   ratio ≥ 50% ⇒ 2 points
   ratio ≥ 25% ⇒ 1 point
   ratio < 25% ⇒ 0 points; this also means that false hits score zero.

The final score of each team is given by the sum of all picture scores.
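To make the rule concrete, the minimal sketch below computes the intersection-over-union ratio of two axis-aligned bounding boxes and maps it to points. The function name and the (x1, y1, x2, y2) box format are assumptions for illustration, not part of the official SRVC tooling.

```python
def srvc_score(pred, truth):
    """Score one returned bounding box against the ground-truth box.
    Boxes are assumed to be (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(pred) + area(truth) - inter
    ratio = inter / union if union > 0 else 0.0   # intersection / union
    if ratio >= 0.75:
        return 3
    if ratio >= 0.50:
        return 2
    if ratio >= 0.25:
        return 1
    return 0                                      # below 25%, including false hits
```

Under this rule a loose bounding box still earns partial credit, which is why tighter sub-images tend to score better.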

1.5 Objectives

The objective of this work is to describe and evaluate the developed capabilities for autonomously detecting and classifying objects in complex pictures. This must be achieved with a small number of low-quality training examples. This main goal may be divided into the following sub-objectives:

- Design an object descriptor robust to scale and in-plane rotation.
- Find the objects relevant to the semantic context in a complex picture.
- Detect and extract an object from a cluttered background.
- Combine the decisions of different classification methods into a final classification verdict.
- Evaluate the performance of the adopted classification techniques.


1.6 Related work

In this section we give a quick overview of previous research related to our investigation themes: object representation, object detection and extraction, and object classification.

1.6.1 Object representation

Many different techniques have been developed for describing an object. The object is represented by computer models created from the object's image. The descriptors must be distinctive and, at the same time, robust to scale, rotation and viewpoint changes.

In 2003, Lazebnik et al. [22] developed a descriptor suitable for texture representation, the intensity-domain spin images descriptor, inspired by the spin images used by Johnson and Hebert [18]. An intensity-domain spin image is a two-dimensional histogram of brightness values in an affine-normalized patch. The two dimensions of the histogram are d, the distance from the centre of the normalized patch, and i, the intensity value.

Lowe [26, 27] proposed the scale invariant feature transform (SIFT), which combines a scale invariant region detector and a descriptor based on the image gradients of the detected regions. Mikolajczyk and Schmid [30] developed the gradient location and orientation histogram (GLOH), an extension of the SIFT descriptor that changes the location of SIFT's sampling grid and uses principal component analysis (PCA) to reduce the descriptor size. Another SIFT derivative is the PCA-SIFT algorithm [20]. It accepts the same input as the standard SIFT descriptor: the location, scale and dominant orientations of the points of interest. By applying PCA techniques, the resulting descriptor is smaller than the standard SIFT descriptor and may be used with the same matching algorithms [20]. Furthermore, Mikolajczyk and Schmid [30] evaluated a variety of local image descriptors and identified the SIFT-based algorithms as the most resistant to common image deformations. We use the SIFT algorithm in our object representation approach; SIFT is presented in the "Object representation" chapter.

1.6.2 Object detection and extraction

There is much work related to region of interest detection. Region of interest detectors use different image measurements and are either scale or affine invariant [30].

Lindeberg [23] has proposed a scale-invariant "blob" detector, where a "blob" is defined by a maximum of the normalized Laplacian measure in scale-space. Lowe [27] approximates the Laplacian with difference-of-Gaussian (DoG) filters and also detects local extrema in scale-space. Lindeberg and Garding [24] extended the blob detector to be affine-invariant by employing an affine adaptation process based on the second moment matrix. Mikolajczyk and Schmid [30] use a multi-scale version of the Harris interest point detector [17] to locate points of interest in space and then apply Lindeberg's strategy for scale selection and affine adaptation. A similar idea was explored by Schaffalitzky and Zisserman [35] as well as Baumberg [2]. Tuytelaars and Van Gool [40] built two types of affine-invariant regions, one based on a combination of interest points and edges and the other based on image intensities. Matas et al. [29] introduced a new set of distinguished regions, the so-called maximally stable extremal regions (MSER), which are extracted with a watershed-like segmentation algorithm.


The concept may be presented as follows. Imagine all possible thresholds of a grey-scale image I: th ∈ {0, 1, ..., 255}. The pixels in the image above th are considered "white" and the pixels below "black". If we watch the movie of the thresholded images I_th, where frame th corresponds to threshold th, in the first moments of the movie we should see a fully white image. Then black spots, corresponding to local intensity minima, start to appear and grow. Finally, in the last moments we should see a completely black image. The set of all connected components of all frames of the movie is the set of all extremal regions; the maximally stable extremal regions are chosen among these regions. Kadir et al. [19] have proposed a scale invariant detector. Their idea was to use entropy to measure local image attributes and estimate the local saliencies of the image. The method searches for regions with high entropy levels using a circular sliding window over multiple scales; the regions with the highest entropies are defined as salient regions. Our approach also locates salient regions, but instead of measuring entropy levels it measures colour saliency levels. These detection approaches detect regions of interest in an image to describe. However, our problem is more complicated: a detected region of interest may be relevant or inappropriate to our object detection and extraction:

- One object may be defined by several regions of interest.
- A detected region of interest may have one or more objects within.
- A detected region of interest may have no relevant object within.

Hence, besides colour saliency detection, our approach clusters points of interest into plausible object regions and, from those regions, extracts the objects within by analysing the region contours (more details in the "Object detection and extraction" chapter).

1.6.3 Object classification

Achievements in object classification have demonstrated that using local features or descriptors invariant to scale or affine changes tends to be an effective approach [12, 22, 38, 30]. At the same time, support vector machine (SVM) classifiers [37, 21, 11] have shown their promise for visual classification tasks, and the development of specialized kernels suitable for use with local features has emerged as a fruitful line of research [7, 15, 11, 28]. On the other hand, some vision researchers [4, 25] argue in defence of non-parametric classifiers (e.g. k-nearest neighbour or nearest neighbour classifiers), due to their simplicity, their capability of handling huge numbers of classes and the fact that they do not demand a training stage. Zhang et al. [43] proposed a hybrid approach combining a k-nearest neighbour approach with a support vector machine approach. The basic idea is to find the closest neighbours to the query sample and train a local support vector machine that preserves the distance function on the collection of neighbours. However, all these approaches relied on precise training models. In our case, the agent has to deal with inaccurate training models, since they are created without supervision from common Internet pictures. We believe that a non-parametric approach is more appropriate for dealing with our coarse training models.

1.7 Organization of the dissertation

In chapter 2 we will present the architecture of our learning and recognition approach as a whole. This way the reader is familiarized from the beginning with our strategy and will hopefully better understand our later decisions.

In chapter 3 we will explain how an object may be represented. We describe how the adopted models are built, analyse their specific characteristics and introduce some modifications to solve our specific problem.

In chapter 4 we will explain how an object may be detected and extracted from a picture with a cluttered background.

In chapter 5 we will present the classification methods used and explain how the several classification techniques co-exist in our recognition approach.

In chapter 6 we will evaluate the performance of our agent. First we will perform isolated tests on the agent's environment search phase modules, and finally we will test the agent's performance as a whole.

In chapter 7 we will summarize the conclusions drawn from the results of this dissertation. Finally, we will point out the future work required to improve our object recognition method.


Chapter 2

The UA@SRVC Agent

In this chapter we present the architecture of the UA@SRVC agent. This dissertation focuses on SRVC’s environment search phase. To learn more about our agent’s approach to SRVC’s Internet search phase consult Rui Pereira’s dissertation [31].

2.1 Main agent modules

The UA@SRVC agent may be described as a set of independent modules; this way the agent is more understandable as a whole. In its most basic form the UA@SRVC agent is divided into seven modules: the object fetcher, object selection, category model building, object detection, object extraction, object classification and object pointer. Figure 2.1 illustrates all the modules and the SRVC phase to which each one belongs.


Figure 2.1: The UA@SRVC agent modules distributed over the two SRVC phases: the Internet search phase and the environment search phase.

2.1.1 Object fetcher

The object fetcher module is fed with a text file containing the names of the specified categories to find. This module is responsible for searching those category names with the Google image search engine and organizing the retrieved images into categorized groups of images. Each group contains the image search result for a given category. This module just associates the group of fetched images with the corresponding category name; it performs a blind fetch without any selection capability, so at this point all images returned by the search engine are possible instances of the requested category. The object fetcher module is also responsible for labelling each group with the category type: general category or specific category. Once more, this task is executed not because of any selective process, but because the input textual file contains the required information: specific categories have quotes or capital letters in their names and generic categories have only lowercase letters. This differentiation is important because each category will be represented according to its type. Finally, the object fetcher module returns a set of image groups, as illustrated in figure 2.2.
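The naming convention the object fetcher relies on can be illustrated with a few lines of code. This is only a sketch of the rule described above (quotes or capital letters mark a specific category); the function name is hypothetical and this is not the agent's actual implementation.

```python
def category_type(name: str) -> str:
    """Label a category name from the input list: quotes or capital letters
    (proper nouns) mark a specific category, otherwise it is a general one."""
    if '"' in name or any(c.isupper() for c in name):
        return "specific"
    return "general"

# category_type("hammer")                  -> "general"
# category_type("Hopes and Fears by Keane") -> "specific"
```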

Figure 2.2: Categorized groups of images. Each group has a category and a type of category associated. This is an ideal scenario, because each group of images on this figure is entirely correct: every image truly belongs in the specified group.

2.1.2 Object selection

The object selection module is responsible for analysing the categorized groups returned by the object fetcher module and selecting the best images to represent each category. The two image groups shown in figure 2.2 are optimal groups; in other words, every image is a true member of its group. If this scenario were always guaranteed, the object selection module would not be necessary. However, such ideal image groups are rare: it is widely known that a common query to an image search engine, such as Google, usually returns an enormous set of images of which only a small fraction would be considered relevant. Therefore, the main task of this module is to analyse and refine the raw groups of images. This task may be carried out in an unsupervised way because good images of a category are normally similar to each other, while bad images are typically very dissimilar from one another. So, when the images are clustered, the biggest cluster in each image group tends to be the one that correctly represents the category. Our agent takes advantage of this to select good examples for each category [31].

2.1.3 Category model building

This module is the last module of SRVC's Internet search phase and is responsible for building the knowledge of the agent. It grabs the refined groups of images returned by the previous module, builds the respective descriptor models and stores them as training models. Each type of category may have the same or different types of category models. For instance, in the most recent version of the UA@SRVC agent, the specific categories are represented by SIFT-based models and the general categories by shape-based models. The specifications of this approach are presented in the "Object Representation" chapter.

2.1.4 Object detection

The object detection module is the first module of SRVC's environment search phase. It receives a set of images taken from the competition's environment. For each image, the object detection module is entrusted with detecting the location of possible relevant objects within the scene and returning those objects as new images. When feasible, the borders of the sub-images should be defined by the minimal bounding box surrounding a single object, as illustrated in figure 1.1. However, such optimal and clean image segmentation is not yet achievable. Current technology is able to segment the scene image into candidate object regions, according to the detected points of interest, but it produces many unsuitable images: sliced objects, several objects per image or sub-images without any relevant object within. Our agent's object detection generates many sub-images from a single scene image; some of them are unsuitable, but good ones are generated as well. Therefore, we took these limitations into account in the development of the next modules. For the full specification of our approach, see the "Object detection and extraction" chapter.

2.1.5 Object extraction

The goal of the object extraction module is to extract the detected objects from the cluttered background. Usually, the sub-images generated by the previous module have some background noise, even when the previous module returns good sub-images, as shown in figure 2.3. This module extracts the uncategorized objects from the generated sub-images and builds the respective object representations. This way, the uncategorized objects are ready to be classified by the next module.

Our object extraction approach was developed after SRVC'2008 took place; therefore, the agent that participated in SRVC'2008 did not have any object extraction method and its object representations were built directly from the sub-images produced by the object detection module. We added object extraction to the UA@SRVC agent to see whether the resulting effects were positive. The chapter "Object detection and extraction" presents the approach.



Figure 2.3: Ideal object extraction. This figure shows the object extraction goal. However, the performance illustrated is difficult to achieve autonomously.

2.1.6 Object Classification

The object classification module is responsible for recognizing the uncategorized objects of the environment. It is in this module that all the classification methods of the agent are applied. The object classification module is fed with the category models and the representations of the detected/extracted objects. Every uncategorized object representation is classified based on the knowledge acquired by the agent in the Internet search phase. In other words, the agent compares all uncategorized object representations with the category models and returns a classification verdict, according to its classification policy. In principle, every retrieved sub-image is assumed to contain a relevant object of the environment. However, this assumption is incorrect, because many sub-images are irrelevant; therefore, this module has to exclude them or be robust to them. Our agent treats the unsuitable sub-images as attractors and excludes them from the classification process. More details are given in the "Object classification" chapter.

2.1.7 Object pointer

The object pointer module is responsible for pointing out the recognized objects in the environment images. As explained in the introduction, for each requested category the agent must return an environment picture with a bounding box surrounding an instance of the category. Therefore, for each category, this module gathers the location in the picture, provided by the object detection module, and the category assigned to the recognized object, provided by the object classification module. With this information, each category is associated with a scene picture marked with a single bounding box showing the position where the agent found the object, as shown in figure 2.4.

Figure 2.4: The UA@SRVC agent pointing to an object. The left image shows a correct object location and classification; the right image shows an incorrect one. As referred previously, the same scene picture may be used several times, to recognize instances of different categories.

2.2

Architecture of the UA@SRVC agent

The UA@SRVC agent was the first system developed at the University of Aveiro to compete in the Semantic Robot Vision Challenge. It was developed by our team from scratch and ranked second in the SRVC'2008 software league. Figure 2.5 shows a diagram that illustrates the event flow of the Internet search phase. First, a textual list containing the requested categories is uploaded to the agent's system. The agent searches all category names in the Google image search engine and, for each category, downloads the retrieved images (consult the download process in [31]). All the images fetched are associated with the respective category and marked with the respective category type. This way the system creates, for each category, a raw group of training images. We say raw groups because at this stage there has been no image selection. The next step is the refinement of the groups of training images (note that the refinement is entirely autonomous). Before refining the groups of images, the agent creates descriptive models for each image: SIFT models for specific categories and global shape context (GSC) models for generic categories. Then, for each image group, our agent clusters the images by similarity. The most crowded cluster is assumed to be the cluster that best characterizes the image group; in other words, each training group keeps only the images of the predominant cluster [31]. After the image selection, the Roy's shape representation (RSR) models are built; they are built only at this stage because they do not enter the image selection algorithm. Finally, the refined groups of models are saved as training models and the rest of the models/images are discarded. Therefore, each category is represented by a group of training models.
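The image selection step mentioned above keeps only the predominant cluster of each image group. The sketch below illustrates the idea with SciPy's hierarchical clustering, assuming a precomputed pairwise dissimilarity matrix; the function name and the cut threshold are hypothetical, and the actual selection procedure is the one described in [31].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def keep_predominant_cluster(images, dissimilarity, cut=0.5):
    """Keep only the images of the most crowded cluster.
    'dissimilarity' is a square matrix of pairwise image dissimilarities (assumed precomputed)."""
    condensed = squareform(np.asarray(dissimilarity), checks=False)
    labels = fcluster(linkage(condensed, method="average"), t=cut, criterion="distance")
    biggest = np.bincount(labels).argmax()            # fcluster labels start at 1
    return [img for img, lab in zip(images, labels) if lab == biggest]
```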

Figure 2.5: The sequence of events of the UA@SRVC agent during the Internet search phase. (1) - The list of categories to find is loaded. (2) - Each category name is searched on Google's image search engine. (3) - Image results for each category are retrieved. (4) - Categorized groups of images: each category is associated with a group of images. (5) - Specific categories are represented with SIFT local features. (6) - General categories are represented with the Global Shape Context. (7) - SIFT models of each specific category are loaded. (8) - Global Shape Context models of each generic category are loaded. (9) - Bad models and images are deleted. (10) - SIFT models and GSC models are saved as training models. (11) - After the image selection, the generic categories are also described by Roy's Shape Representation. (12) - RSR models are saved as training models.

Figure 2.6 shows a diagram that represents the event flow of our agent's environment search phase. First, the competition pictures are loaded into the system. It is in these pictures that the agent must find instances of the requested categories. Therefore, it performs autonomous object detection/extraction on each picture. Our object detection parses all pictures looking for colour saliencies. For our agent, a region with high colour saliency is considered a candidate object region. Based on the detected regions, our agent segments the scene into multiple sub-images. At this stage, the agent considers that each sub-image contains a single relevant object of the environment. The agent also saves the location of each sub-image in the scene picture. After these steps, the agent creates three object representations for each sub-image: a SIFT representation, a global shape context representation and Roy's shape representation. These are the uncategorized object descriptors that will be matched with the training models. To match the object representations, the agent uses a SIFT-based match for the specific categories and, for the generic categories, a voting system combining the individual decisions of the GSC-based match and the RSR-based match. But how does our agent know whether an uncategorized object is an instance of a generic or a specific category? To solve this problem we took advantage of a challenge specification: for each category, all participating teams must return a scene picture containing the requested object surrounded by a bounding box. So instead of determining which training category is most similar to the uncategorized object representation, we perform the opposite: for each training category, we determine the uncategorized object most likely to belong to the category's training group. This way, it is possible to know which type of similarity measure to use: SIFT-based match for specific categories or the voting system for generic categories.

Each training model is matched with all uncategorized object representations. This process produces, for each category, a similarity ranking containing the similarity scores of each uncategorized object. The agent analyses these scores looking for attractors. An attractor is a sub-image that is predominantly highly scored in several different categories; attractors are bad sub-images that were incorrectly segmented, typically from the background clutter. The attractors are discarded and, for each category, the most similar sub-image (nearest neighbour) is selected. In the tests performed before the competition we noticed that, typically, the smaller sub-images give better results (according to SRVC's scoring policy), so we added a heuristic to the nearest neighbour selection that gives preference to the smaller nearest neighbour. Finally, the sub-images with the highest scores are gathered and, for each category, a scene picture is returned with the object surrounded by a bounding box (figure 2.4).
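The attractor-removal and smallest-nearest-neighbour heuristics could be sketched as follows. The data structures, the top-k window, the "appears in more than half of the categories" attractor criterion and the 90% near-best margin are illustrative assumptions, not the agent's exact rules.

```python
from collections import Counter

def pick_best_subimages(rankings, areas, top_k=3):
    """'rankings' maps each category name to a list of (sub_image_id, similarity) pairs,
    sorted by decreasing similarity; 'areas' maps each sub_image_id to its pixel area."""
    # Attractors: sub-images that rank near the top for many different categories.
    tops = Counter(sid for ranked in rankings.values() for sid, _ in ranked[:top_k])
    attractors = {sid for sid, hits in tops.items() if hits > len(rankings) // 2}

    best = {}
    for category, ranked in rankings.items():
        candidates = [(sid, sim) for sid, sim in ranked if sid not in attractors]
        if not candidates:
            continue
        top_sim = candidates[0][1]
        # Among the near-best matches, prefer the smallest sub-image (tighter bounding box).
        near_best = [sid for sid, sim in candidates if sim >= 0.9 * top_sim]
        best[category] = min(near_best, key=areas.get)
    return best
```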


Figure 2.6: The sequence of events of the UA@SRVC agent during the environment search phase: (1) - Scene pictures are loaded into the agent's system. (2) - For each sub-image, its location in the scene picture is saved. (3) - SIFT modelling is performed for every sub-image. (4) - Global shape context modelling is performed for every sub-image. (5) - SIFT object representations are saved as uncategorized objects. (6) - Global shape context object representations are saved as uncategorized objects. (7) - Uncategorized objects are loaded. (8) - Training models are loaded. (9) - For each category, all uncategorized objects are sorted according to their similarity to the category; specific categories use SIFT models for the similarity evaluation and generic categories use global shape context models. (10) - For each category the sorted list of uncategorized objects is loaded. (11) - Attractors are removed. (12) - The best uncategorized object to represent the category is loaded. (13) - Scene pictures are loaded into the agent's bounding box system. (14) - The locations of the sub-images in the scene picture are loaded. (15) - For each requested category the agent highlights, on a scene picture, the object that best represents the category.


Chapter 3

Object representation

The purpose of this chapter is to explain how the studied object representations are built, analyse their specific characteristics and propose some modifications for our specific problem. We studied and implemented three models for object representation. Two of them follow a shape descriptor approach (Shape Context [3, 31] and Roy's Shape Representation [34]) and the third is a very popular local interest point descriptor, the Scale Invariant Feature Transform (SIFT) [26, 27]. This chapter only explains how the objects are represented; object detection and extraction are addressed in chapter 4 and object classification in chapter 5.

3.1 Roy's shape representation - RSR

Roy's shape representation (RSR) was proposed in 1999 by the MIT researcher Deb Kumar Roy in his thesis "Learning words from sights and sounds" [34]. RSR is a shape descriptor approach that uses the edges of an object to build a two-dimensional histogram. Each histogram may be interpreted as the shape signature of an object. Therefore, if two objects have similar histograms, they have a similar shape as well.

The RSR algorithm only extracts information about the object's edge points; the rest of the information present in the object image is discarded. RSR measures distances d and angles δ between pairs of edge points, where d is the Euclidean distance and δ is the relative angle formed by the edge points' tangent lines. Figure 3.2 illustrates one iteration of this process.

If an object has n edge points, it will produce (n² − n)/2 distances and relative angles. RSR accumulates each of these (n² − n)/2 terms in a single two-dimensional histogram. Therefore, a RSR histogram may be used as a global shape representation of an object.

To compute the RSR approach the shape contours of the object are needed. Therefore, we pre-process the object image with Canny’s edge detector [6], that returns a black and white image with the internal and external shape contours (figure 3.1).

3.1.1 The algorithm

To build the RSR histogram we perform the following steps on the edge image:

Figure 3.1: Object appearance before and after applying Canny's edge detector.

1. Save the coordinates of every edge pixel.

2. Estimate α_i, the angle formed between the tangent line and a referential line, at each edge pixel i.

3. For each pair of edge pixels i, j:

   - With the pixel coordinates previously saved, calculate the Euclidean distance d_ij between the pixels.
   - Calculate the relative angle between the edges, δ_ij = |α_i − α_j|. Figure 3.2 shows one iteration example.

4. Normalize the inter-pixel distances by the maximal distance d_ij.

5. Build a two-dimensional histogram conjugating δ_ij with d_ij. Roy's advice is to use an 8x8 histogram, but in our case a 32x32 histogram worked better, as we show in the "Performance Evaluation" chapter.


Figure 3.2: One iteration of the main cycle of the RSR algorithm. At each iteration the inter-point distance d and the angle δ formed by the two tangent lines are calculated.

The RSR model is invariant to scale changes, since the inter-pixel distances are normalized. The model is also invariant to in-plane orientation changes, because RSR histograms are built with relative angles. Figure 3.3 presents the whole process of building the RSR histogram of a hammer.
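A minimal sketch of the RSR histogram construction follows, assuming the edge-pixel coordinates come from Canny's edge detector and the tangent angles from the estimation described in section 3.1.2. For long contours the O(n²) pairwise step may require subsampling the edge points.

```python
import numpy as np

def rsr_histogram(edge_points, tangent_angles, bins=32):
    """Roy's Shape Representation: a bins x bins histogram over all edge-pixel pairs of
    (normalized Euclidean distance d_ij, relative tangent angle delta_ij = |alpha_i - alpha_j|)."""
    pts = np.asarray(edge_points, dtype=float)        # (n, 2) edge-pixel coordinates (e.g. from Canny)
    alpha = np.asarray(tangent_angles, dtype=float)   # alpha_i at each edge pixel (section 3.1.2)
    i, j = np.triu_indices(len(pts), k=1)             # the (n^2 - n)/2 unordered pairs
    d = np.hypot(pts[i, 0] - pts[j, 0], pts[i, 1] - pts[j, 1])
    d /= d.max()                                      # scale invariance
    delta = np.abs(alpha[i] - alpha[j])               # in-plane rotation invariance
    hist, _, _ = np.histogram2d(d, delta, bins=bins, range=[[0.0, 1.0], [0.0, np.pi]])
    return hist / hist.sum()                          # the object's shape signature
```

Two objects can then be compared simply by comparing their histograms.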



Figure 3.3: Whole process to acquire an RSR histogram from a hammer. (a) Image of a hammer. (b) Image filtered with Canny's edge detector. (c) Gathering of the angle A and distance d of all pairwise ij combinations. (d) Quantization of the gathered data into an RSR histogram.

3.1.2 Estimating the tangent angle

To find the relative angle δ_ij, we first need to find the tangent angles α_i and α_j of the two edge pixels i, j. The angle α_i is the angle formed between the tangent line at pixel i and the horizontal axis. Therefore, we may use (3.1) to calculate α_i,

α_i = arctan(m_i),    (3.1)

where m_i is the slope of the tangent line at the edge pixel i.

To find m_i, the equation of the tangent line at the edge pixel i is needed. Calculating tangent lines on discrete data may be accomplished by approximation techniques, so we developed an effective and efficient approximation to calculate the tangent lines in our application. We consider the tangent line at the edge pixel i to be equal to the linear regression of its closest edge-pixel neighbours, as shown in figure 3.4. The closest-neighbour radius is a parameter that we adjust to improve the application performance: the smaller the neighbour radius, the better the local tangent line approximation at edge pixel i, but if we set the radius too small the resulting tangent line becomes too sensitive to local shape changes. Therefore, we need to strike a balance between tangent line approximation and final application performance. We tried to achieve this equilibrium with a large amount of tests (see the "Performance evaluation" chapter).

Using linear regression (3.2) we are able to obtain the slope m_i,

m_i = \frac{\sum_{k=0}^{n-1} (x_k - \bar{x})(y_k - \bar{y})}{\sum_{k=0}^{n-1} (x_k - \bar{x})^2},    (3.2)

where n is the number of edge neighbours considered, (x_k, y_k) are the coordinates of the edge pixels and (\bar{x}, \bar{y}) is their mean.

Finally, with the slope m_i found, we are capable of calculating the angle formed between the tangent line and the horizontal axis, i.e. the tangent angle α_i, using (3.1).
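A small sketch of this estimation, following equations (3.1) and (3.2); the edge pixels are assumed to be given as (x, y) coordinates and the neighbourhood radius is the tuning parameter discussed above.

```python
import numpy as np

def tangent_angle(edge_points, i, radius=5.0):
    """Estimate alpha_i (equations 3.1 and 3.2): fit a least-squares line to the edge
    pixels inside a window of the given radius around edge pixel i."""
    pts = np.asarray(edge_points, dtype=float)
    p = pts[i]
    nb = pts[np.hypot(pts[:, 0] - p[0], pts[:, 1] - p[1]) <= radius]
    x, y = nb[:, 0], nb[:, 1]
    if np.ptp(x) == 0.0:                  # vertical segment: the slope is undefined
        return np.pi / 2
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # (3.2)
    return np.arctan(m)                   # (3.1)
```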



Figure 3.4: Window W follows the shape edges. At each point an approximation of the tangent line is calculated by applying linear regression to the points inside window W. The amplified window shows a linear regression example.

3.2 Shape context

The shape context is a local shape descriptor, first introduced by Belongie, Malik and Puzicha [3]. The basic idea of the shape context approach is to represent the shape of an object through a discrete set of points sampled from the internal and external edges of the shape (figure 3.1). As in the RSR approach, the edge points do not correspond to keypoints (such as inflection points); they are simply the points of a shape contour.

Unlike the RSR approach, where a single histogram describes the entire shape, the shape context approach has a coarse histogram for each edge point of the shape. Let P = {p_1, ..., p_n}, p_i ∈ R², be the set of points on a shape's edges, and consider the n − 1 vectors obtained from one point to all other sample points. These n − 1 vectors represent the configuration of the entire shape contour relative to the referential point p_i. To retain this information, for each p_i, i = 1, ..., n, a coarse histogram h_i is created, quantifying the length of the n − 1 vectors (3.3) and their orientation (angle measured relative to the positive x-axis). Thus h_i is defined to be the shape context of the point p_i.

h_i(k) = \#\{\, q \neq p_i : (q - p_i) \in \text{bin}(k) \,\}    (3.3)

The histogram bins used are uniform in log-polar space. This way the descriptor is more sensitive to positions close to the sampled point than to those farther away. The basic idea of the shape context is illustrated in figure 3.5.

Figure 3.5: Shape context basic idea. (a,b) Sampled edge points of two shapes. (c) Log-polar histogram used to compute the shape context. (d,e,f) Shape contexts of the points marked with a circle, a diamond and a triangle, respectively.

Shape context is translation invariant by nature. However, to obtain uniform scale invariance of the shape as a whole, all radial distances are first normalized by the mean distance d, where d is the mean length of the (n² − n)/2 vectors. The shape context is not rotation invariant, because for some recognition tasks rotation is a characteristic to consider, e.g. distinguishing between a "9" and a "6". However, rotation invariance may be accomplished by measuring the angles at each point relative to the direction of the tangent at that point.

3.2.1 Shape context as a global descriptor

In the original shape context approach, a shape contour with n points requires n coarse histograms, h_1, ..., h_n, to be represented. Building these histograms requires calculating the length and orientation of (n² − n)/2 vectors, so we considered this technique computationally heavy for our application. In the global shape context (GSC)¹ we maintain the main procedures of the original shape context but, instead of calculating a coarse histogram for each edge point p_i ∈ P = {p_1, ..., p_n}, p_i ∈ R², we calculate, for each object shape, a single coarse histogram h_gc relative to the point P_gc, where P_gc is the geometric centre of the object. This idea is illustrated in figure 3.6. The coarse histogram h_gc is built in the same way as a normal coarse histogram h_i in the local shape context approach. After calculating the n vectors required to build the h_gc histogram, we normalize all distances by the maximal distance.

¹ The GSC descriptor and the respective classification method were implemented by my teammate Rui Pereira [31]. However, GSC is an integral part of the jointly developed system, so it is important to know its specifications.


Figure 3.6: Our shape context adaptation. Each object is represented by only one coarse histogram.

Figure 3.7: Shift of the angle columns. The histogram angle columns follow the rotation of the object. The left side of the figure shows an object and its histogram; the right side shows the effect of the object's rotation on the histogram.


Object descriptors need to be invariant to scale and rotation changes. To achieve scale invariance, we simply normalize the h_gc histogram by its maximal value. Obtaining rotation invariance is more complicated: the histogram itself is not invariant to rotation, although there is a pattern in the histograms of the same object at different rotations. When an object rotates, there is a shift in the angle columns, as illustrated in figure 3.7. Therefore, it is possible to achieve rotation invariance while comparing two shapes by shifting one of the histograms A times, where A is the number of angle columns of the histogram. The shift arrangement that results in the lowest cost is considered to be the right position to match the histograms.
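The sketch below illustrates the GSC construction and the shift-based, rotation-invariant comparison. The numbers of distance and angle bins and the innermost bin radius are assumptions, since the text does not fix them.

```python
import numpy as np

def global_shape_context(edge_points, n_dist=5, n_ang=12):
    """One log-polar histogram of all edge points relative to their geometric centre P_gc."""
    pts = np.asarray(edge_points, dtype=float)
    vec = pts - pts.mean(axis=0)                     # vectors from the geometric centre
    dist = np.hypot(vec[:, 0], vec[:, 1])
    dist = dist / dist.max()                         # normalize by the maximal distance
    ang = np.mod(np.arctan2(vec[:, 1], vec[:, 0]), 2 * np.pi)
    d_edges = np.logspace(np.log10(0.05), 0.0, n_dist + 1)   # log-spaced distance bins
    d_edges[0] = 0.0
    a_edges = np.linspace(0.0, 2 * np.pi, n_ang + 1)
    h, _, _ = np.histogram2d(dist, ang, bins=[d_edges, a_edges])
    return h / h.max()                               # normalize the histogram by its maximal value

def gsc_distance(h1, h2):
    """Rotation-invariant comparison: shift the angle columns and keep the lowest cost."""
    return min(np.abs(h1 - np.roll(h2, s, axis=1)).sum() for s in range(h2.shape[1]))
```

Shifting the angle columns with np.roll mirrors the column shift shown in figure 3.7.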

3.3 Scale invariant feature transform - SIFT

Nowadays SIFT is a very popular algorithm for describing local image features. It was published by David Lowe in 1999 [26, 27] and has since been widely used by vision researchers. This is because SIFT features are very robust to scale, rotation, illumination, noise and minor viewpoint changes. Furthermore, SIFT features are highly distinctive and relatively easy to compute. In this section we explain how the SIFT features are built to represent an object; object recognition will be addressed in the "Object classification" chapter.

SIFT represents an object by aggregating the object's unique characteristics. A unique characteristic is an area of the object that the SIFT algorithm considers interesting (figure 3.8 (a)). The SIFT author called these areas keypoint descriptors. Therefore, for SIFT an object is a concatenation of its most relevant keypoint descriptors. To create a keypoint descriptor, the SIFT algorithm gathers the orientations and magnitudes of the image gradient around each local interest point (keypoint). For this to happen, a 16 × 16 grid is centred on the keypoint location and the image gradient is computed for each grid cell. The gradient of an image measures how it is changing and provides two pieces of information: the magnitude of the gradient tells us how quickly the image is changing, while the direction of the gradient tells us the direction in which the image is changing most rapidly. In other words, the gradient vector points in the direction of the largest possible intensity increase and the length of the gradient vector corresponds to the rate of change in that direction.

The resulting gradients are sampled in 4×4 groups from the 16×16 grid (figure 3.8 (b)) and for each group the magnitude and the orientation are quantified into a histogram. Therefore, every keypoint is described by a set of 16 magnitude and orientation histograms (figure 3.8 (c)), where the length of each histogram arrow corresponds to the sum of the gradient magnitudes near that direction within the sub-region. Each histogram has 8 bins, so each keypoint descriptor has 128 dimensions.
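The construction of one descriptor can be summarized with the following sketch, which builds a 128-dimensional vector from the gradient magnitudes and orientations of a 16×16 patch. It is a simplification: the real SIFT implementation also applies Gaussian weighting, rotates the patch to the keypoint's dominant orientation, interpolates samples between bins and thresholds large gradient values.

```python
import numpy as np

def keypoint_descriptor(magnitude, orientation):
    """Build one simplified 128-dimensional SIFT-style descriptor.
    magnitude, orientation: 16x16 arrays, orientation in [0, 2*pi)."""
    descriptor = []
    for row in range(0, 16, 4):              # 4x4 grid of sub-regions
        for col in range(0, 16, 4):
            mag = magnitude[row:row + 4, col:col + 4].ravel()
            ori = orientation[row:row + 4, col:col + 4].ravel()
            bins = (ori / (2 * np.pi) * 8).astype(int) % 8   # 8 orientation bins
            hist = np.bincount(bins, weights=mag, minlength=8)
            descriptor.append(hist)
    descriptor = np.concatenate(descriptor)   # 16 histograms x 8 bins = 128
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor
```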

We have explained how the keypoint descriptor is formed, but we still need to explain how the SIFT algorithm detects a keypoint. The SIFT algorithm uses scale space extrema of the difference-of-Gaussian (DoG) function, proposed by Lowe [26] based on the work of Witkin [42], for detecting stable keypoint locations that are invariant to scale change of the image. Therefore, SIFT convolves the DoG with the image to get Di(x, y, σ):

D_i(x, y, \sigma) = \overbrace{(G(x, y, k\sigma) - G(x, y, \sigma))}^{\text{DoG}} \ast \underbrace{I(x, y)}_{\text{image}} \qquad (3.4)


Figure 3.8: Illustration of the SIFT feature extraction process: (a) original image with a local interest point to describe; (b) local interest area with gradient samples at each grid point; (c) local individual orientation histograms, which result from accumulating each sample into the corresponding bin; (d) final 128-dimensional keypoint.


Figure 3.9: Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with ) to its 26 surrounding neighbors, in 3×3 regions at the current and adjacent scales (marked with ×).


The Di(x, y, σ) convolution is an element of a finite set of convolutions D(x, y, σ), where k is the constant multiplicative factor.

Scale space extrema of the DoG function will produce a set of candidate keypoint locations. In order to detect the final keypoint location, each sample point (keypoint candidate) is compared with its eight neighbors in the current image convolution Di(x, y, σ), with its nine neighbors above, in Di+1(x, y, σ), and finally with its nine neighbors below, in Di−1(x, y, σ), i.e., the candidate keypoint location is compared with its twenty-six surrounding keypoint candidates, as illustrated in figure 3.9. The keypoint candidate will become a final keypoint if it is larger or smaller than all of its twenty-six surrounding keypoint candidates.
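A minimal sketch of this 26-neighbour test is given below, assuming the three DoG images for the adjacent scales are available as NumPy arrays (the names dog_below, dog_current and dog_above are hypothetical). Border handling and the contrast/edge rejection steps of the full SIFT algorithm are omitted.

```python
import numpy as np

def is_scale_space_extremum(dog_below, dog_current, dog_above, x, y):
    """Return True if the DoG value at (x, y) in the current scale is larger
    or smaller than all 26 neighbours in the 3x3x3 cube spanning the current
    and the two adjacent scales."""
    cube = np.stack([
        dog_below[y - 1:y + 2, x - 1:x + 2],
        dog_current[y - 1:y + 2, x - 1:x + 2],
        dog_above[y - 1:y + 2, x - 1:x + 2],
    ])                                          # shape (3, 3, 3)
    centre = dog_current[y, x]
    neighbours = np.delete(cube.ravel(), 13)    # index 13 is the centre sample
    return bool((centre > neighbours).all() or (centre < neighbours).all())
```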

After this process, all keypoint locations are defined. Finally, the object is represented by the concatenation of every final keypoint descriptor found in the image.
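For reference, such a representation can be obtained in practice as sketched below with OpenCV; this is only an illustrative example, not necessarily the SIFT implementation used by the UA@SRVC agent, and the file name is hypothetical. It assumes an OpenCV version (>= 4.4) in which SIFT is available in the main package.

```python
import cv2

# Extract a SIFT representation for one image.
image = cv2.imread("training_example.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
# 'descriptors' is an N x 128 array: the concatenation of all keypoint
# descriptors found in the image, which corresponds to the object
# representation described above.
```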

3.4 Category representation in the UA@SRVC agent

The category representations are built in the SRVC's Internet search phase. The UA@SRVC agent starts by searching the Internet for image examples of the required categories. Then, an unsupervised clustering technique selects the best image examples to represent each category [31]. The training categories are also divided into two group types: general categories and specific categories. This division is possible since the input textual file of the competition follows a special convention: the specific categories have capital letters or quotes in their names (indicating proper nouns) and the general categories have only lowercase letters in their names (indicating common nouns). The separation is important because our agent represents each type with different descriptors. If a category is specific, the agent follows a SIFT-based description; if the category is generic, it follows a shape-based description. The generic categories are described by RSR and GSC, so each training instance of the category will have two shape descriptors associated with it. On the other hand, the specific categories are described by the concatenation of all SIFT features of the training instances of the category.
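A sketch of this naming-convention check is shown below. The exact parsing performed by the agent may differ (for instance in how quotes are handled); this only illustrates the rule described above.

```python
def is_specific_category(name: str) -> bool:
    """SRVC naming convention: specific categories (proper nouns) contain
    capital letters or quotes; general categories contain only lowercase
    letters."""
    return any(ch.isupper() for ch in name) or '"' in name or "'" in name

# e.g. is_specific_category("Coca-cola can") -> True  (SIFT-based description)
#      is_specific_category("water cup")     -> False (shape-based description)
```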

Figure 3.10 illustrates the design of our categories. Once more, the figure shows an example where the instances of each category are completely right, i.e. each category is represented by true category instances. However, our unsupervised object selection does not retrieve such ideal image results, mainly because of the low quality of the images fetched from the Internet. As a consequence, categories with incorrect training instances will be wrongly interpreted by the upper modules of the agent, producing object recognition errors (for more about our unsupervised object selection, see [31]).


Figure 3.10: Overview of category representation. Specific categories (e.g. "Coca-cola can", "H&S shampoo") are represented by sets of SIFT local features. Generic categories (e.g. "camera", "water cup") are represented by the global shape context and Roy's shape representation.


Chapter 4

Object detection and extraction

For an intelligent agent to entirely understand a scene's picture, it first needs to recognize each individual item composing the scene. Only then may the agent combine all the individual information and eventually understand the semantic context of the scene. In our case, the agent must recognize all relevant objects in the scene. For instance, figure 1.1 (chapter 1) illustrates an optimal object detection/extraction: all objects relevant to the semantic context were correctly segmented. However, before recognizing any object in the scene, our agent must find plausible object regions in the picture. Our object detection approach follows the assumption that all objects relevant to the semantic content must stand out in the scene's picture. Therefore, our agent locates all colour-salient regions of the scene's picture. We consider the colour-salient areas of the picture as plausible object regions. A plausible object region becomes a candidate object region after an autonomous redefinition of the plausible region boundaries by a clustering technique. Finally, the candidate object region needs to be parsed to extract any possible object within.

The main objective of the object extraction is to isolate the objects existing in the candidate object regions. The reason for this extraction effort is to remove the surrounding noise, in order to improve the object representation and, this way, facilitate the object classification task. Figure 2.3 illustrates an ideal object extraction result: the relevant object within the candidate object region is retrieved without the surrounding noise. The extraction process should autonomously analyze the image and distinguish which regions are part of the object and which regions are part of the background. In [8] the authors join object extraction with image classification to distinguish features belonging to the object and then extract it from the picture. However, they rely on an accurate set of training categories. Unfortunately, in our case the training categories are autonomously created from Internet images, making it difficult to guarantee accurate training categories. Therefore, our extraction algorithm must extract the relevant objects without knowing what the object looks like. We developed two object extraction heuristics. The first aggregates the contours of the candidate region by analyzing their relative distances. The second heuristic extracts the object based on the contours' relative sizes.
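The relative-size idea can be sketched as follows with OpenCV; the actual UA@SRVC heuristics also exploit the relative distances between contours and further refinements described later, and the threshold parameter here is an illustrative assumption.

```python
import cv2
import numpy as np

def keep_large_contours(binary_mask, min_area_ratio=0.05):
    """Sketch of a contour-size heuristic: discard contours whose area is
    small relative to the largest contour of the candidate object region.
    'min_area_ratio' is an assumed illustrative parameter."""
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return binary_mask
    areas = [cv2.contourArea(c) for c in contours]
    threshold = max(areas) * min_area_ratio
    kept = [c for c, a in zip(contours, areas) if a >= threshold]
    mask = np.zeros_like(binary_mask)
    cv2.drawContours(mask, kept, -1, 255, thickness=cv2.FILLED)
    return mask
```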

The object detection and extraction modules were designed and implemented by the UA@SRVC team leader. The object detection module was used in SRVC'2008. The object extraction module was finalized during the competition, but could not be integrated in time to be used for competition purposes. The integration of object extraction into the agent and some innovations (redefining and discarding small contours), as well as the evaluation of these modules, were carried out by the author of this dissertation.


4.1 Colour saliency clustering

The main idea behind colour saliency clustering is to group points of interest into colour-salient regions. We consider these salient regions as candidate objects. In our approach, a point of interest is a point of the picture surrounded by a colour saliency area. In other words, the relevance of a point is measured by the colour saliency saturation of the neighbourhood points. According to their relative distances in the scene's picture, the most relevant points of interest are clustered into groups and these groups define the object region boundaries. The clustering process is completely autonomous and is incapable of recognizing objects; it only clusters colour-salient regions of the input picture.

Figure 4.1: The stages of the colour saliency clustering: (a) Input scene's picture. (b) Saliencies' image, where all colour saliencies are revealed. (c) Salient regions are detected through clustering of points of interest. (d) Salient regions after a clustering refinement. (e) Final detected regions: three candidate object regions were detected.

4.1.1 Measuring colour saliency

The purpose of the colour saliency detector is to identify the colour saliencies of a picture. The algorithm needs to first determine the dominant colours of the input picture before detecting the salient ones. Therefore, we create a global colour histogram of the entire picture. The histogram has 3 dimensions and each one represents an RGB component (8 bins per dimension, for a total of 512 bins). This way, we may quantify the number of occurrences of each colour in the picture.

The colour saliency detector generates a new greyscale picture representing the colour saliencies of the input picture. To build the saliency picture we perform the following steps:

1. Build the global colour histogram of the input picture.

2. Calculate the average, the maximum and the minimum histogram bin values: h_avg, h_max, h_min.

3. For each pixel i of the input picture:

   - Get the RGB components of pixel i.

   - Get the histogram bin value h_i according to the RGB components of pixel i.

   - If h_i < h_avg, the corresponding pixel of the greyscale image will have the grey intensity given by (4.1); otherwise, the grey intensity is given by (4.2):

G(i) = 255 \times \frac{h_i(h_{max} + h_{avg} - 2h_{min}) + h_{min}(h_{max} - h_{avg})}{2(h_{max} - h_{min})h_{max}} \qquad (4.1)

G(i) = 255 \times \frac{2h_i(h_{max} + h_{avg}) + (h_{max} - h_{avg})^2}{2(h_{max} - h_{avg})h_{max}} \qquad (4.2)

Figure 4.1 (b) illustrates the resulting saliencies’ image: a greyscale image where the darker regions correspond to the colour salient regions of the input picture.
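To make the procedure concrete, the following NumPy sketch builds the saliency picture from an RGB image. The bin-indexing scheme and the final clipping to [0, 255] are implementation assumptions, and equations (4.1) and (4.2) are used exactly as reconstructed above; the input is assumed not to be uniformly coloured (so that h_max > h_min and h_max > h_avg).

```python
import numpy as np

def colour_saliency_map(img_rgb):
    """Greyscale saliency picture for an H x W x 3 uint8 RGB image."""
    # 1. Global colour histogram: 8 bins per RGB channel (512 bins in total).
    bins = img_rgb.astype(np.int32) // 32            # 256 / 8 = 32 values per bin
    idx = bins[..., 0] * 64 + bins[..., 1] * 8 + bins[..., 2]
    hist = np.bincount(idx.ravel(), minlength=512)

    # 2. Average, maximum and minimum bin values.
    h_avg, h_max, h_min = hist.mean(), hist.max(), hist.min()

    # 3. Map each pixel's bin count to a grey intensity: rare (salient)
    #    colours become dark, frequent colours become bright.
    h_i = hist[idx].astype(np.float64)
    low = 255 * (h_i * (h_max + h_avg - 2 * h_min) + h_min * (h_max - h_avg)) \
          / (2 * (h_max - h_min) * h_max)                    # eq. (4.1)
    high = 255 * (2 * h_i * (h_max + h_avg) + (h_max - h_avg) ** 2) \
           / (2 * (h_max - h_avg) * h_max)                   # eq. (4.2)
    grey = np.where(h_i < h_avg, low, high)
    return np.clip(grey, 0, 255).astype(np.uint8)
```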

4.1.2 Detection of points of interest

In our approach, a point of interest is a point of the picture surrounded by a salient region. After measuring the colour saliency of the input picture, we may refer to the colour-salient regions as the darker regions of the saliencies' image (figure 4.1 (b)). Any pixel of the saliency image is a candidate point of interest. The pixels surrounded by the darkest regions will be the main candidates. Therefore, the algorithm scores each pixel of the saliencies' image according to the grey colour of its neighbours. The algorithm centres a window on each image pixel to measure the black saturation around the pixel. The concentric window is divided into L layers (figure 4.2). Each layer is ranked with a relevance factor, so that the closer layers weigh more in the final score of the point. The radius of each layer, r_i, and the number of points at each layer, N_i, have a geometrical growth:

r_i = \begin{cases} 1 & \text{if } i = 0 \\ \prod^{i} k_1 & \text{if } 1 \le i < L \end{cases} \qquad (4.3)

N_i = \begin{cases} 2 & \text{if } i = 0 \\ \prod^{i} k_2 & \text{if } 1 \le i < L \end{cases} \qquad (4.4)

and the relevance factor of each layer i is given by:

R_i = R_{max} - k_3\, i, \quad i = 0, 1, \ldots, (L - 1). \qquad (4.5)
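A small sketch of this layer geometry is given below, reading the repeated products in (4.3) and (4.4) as geometric growth (k^i); this reading, as well as the parameter values, are assumptions for illustration and not the values used by the UA@SRVC detector.

```python
def layer_geometry(L, k1=2.0, k2=2.0, k3=1.0, r_max=10.0):
    """Radii, point counts and relevance factors of the L concentric layers,
    following equations (4.3)-(4.5) as reconstructed above."""
    radii = [1.0 if i == 0 else k1 ** i for i in range(L)]       # eq. (4.3)
    counts = [2 if i == 0 else int(k2 ** i) for i in range(L)]   # eq. (4.4)
    relevance = [r_max - k3 * i for i in range(L)]               # eq. (4.5)
    return radii, counts, relevance
```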
