
Universidade de Aveiro
Departamento de Electrónica, Telecomunicações e Informática
2014

Patrícia Nunes Aleixo

Deteção e reconhecimento de objetos para aplicações robóticas

Object detection and recognition for robotic applications


“A picture is worth a thousand words.”

— Unknown


Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Mestre em Engenharia Electrónica e Telecomunicações (Master in Electronics and Telecommunications Engineering), carried out under the scientific supervision of Professor Doutor António José Ribeiro Neves and Professora Doutora Ana Maria Perfeito Tomé, professors at the Departamento de Electrónica, Telecomunicações e Informática of the Universidade de Aveiro.

o júri / the jury

presidente / president: Prof. Doutor José Luis Guimarães Oliveira, Associate Professor at the Universidade de Aveiro

vogais / examiners committee: Prof. Doutor Jaime dos Santos Cardoso, Assistant Professor at the Universidade do Porto (main examiner)

Prof. Doutor António José Ribeiro Neves, Assistant Professor at the Universidade de Aveiro (supervisor)

agradecimentos / acknowledgments: To my supervisor, Doutor António Neves, for believing in my work and for all the motivation and guidance throughout this past year. To my co-supervisor, Doutora Ana Tomé, for her availability and for reviewing this document. I also thank the CAMBADA team for the opportunity I was given to work on this project.

To my family, father and mother, for all the trust placed in me and in my work at every stage of my life. To my sister, for all the words of motivation and encouragement in the less good moments. To all the friendships made along my university path, thank you for all the companionship, help and good spirits. I also thank my long-standing friends who, even from afar, were present at this stage of my life, celebrated my small achievements with me, and are proud of my journey. A big thank you to all.

Palavras-chave: Visão Robótica, deteção de objetos, deteção de contornos, SIFT, SURF, transformada de Hough, Template Matching, OpenCV.

Resumo: A visão por computador assume uma importante relevância no desenvolvimento de aplicações robóticas, na medida em que há robôs que precisam de usar a visão para detetar objetos, uma tarefa desafiadora e por vezes difícil. Esta dissertação foca-se no estudo e desenvolvimento de algoritmos para a deteção e identificação de objetos em imagem digital, para aplicar em robôs que serão usados em casos práticos.

São abordados três problemas: deteção e identificação de pedras decorativas para a indústria têxtil; deteção da bola em futebol robótico; deteção de objetos num robô de serviço, que opera em ambiente doméstico. Para cada caso, diferentes métodos são estudados e aplicados, tais como Template Matching, transformada de Hough e descritores visuais (como SIFT e SURF).

Optou-se pela biblioteca OpenCV com vista a utilizar as suas estruturas de dados para manipulação de imagem, bem como as demais estruturas para toda a informação gerada pelos sistemas de visão desenvolvidos. Sempre que possível utilizaram-se as implementações dos métodos descritos, tendo sido desenvolvidas novas abordagens, quer em termos de algoritmos de pré-processamento quer em termos de alteração do código fonte das funções utilizadas. Como algoritmos de pré-processamento foram utilizados o detetor de arestas Canny, a deteção de contornos e a extração de informação de cor, entre outros.

Para os três problemas, são apresentados e discutidos resultados experimentais, de forma a avaliar o melhor método a aplicar em cada caso. O melhor método em cada aplicação encontra-se já integrado, ou em fase de integração, nos robôs descritos.

Keywords: Robotic vision, object detection, contour detection, SIFT, SURF, Hough transform, Template Matching, OpenCV.

Abstract: Computer vision assumes an important role in the development of robotic applications. In several applications, robots need to use vision to detect objects, a challenging and sometimes difficult task. This thesis focuses on the study and development of algorithms for the detection and identification of objects in digital images, to be applied on robots used in practical cases.

Three problems are addressed: detection and identification of decorative stones for the textile industry; detection of the ball in robotic soccer; detection of objects by a service robot that operates in a domestic environment. In each case, different methods are studied and applied, such as Template Matching, the Hough transform and visual descriptors (like SIFT and SURF). The OpenCV library was chosen in order to use its data structures for image manipulation, as well as other structures for all the information generated by the developed vision systems. Whenever possible, the available implementations of the described methods were used, and new approaches were developed, both in terms of pre-processing algorithms and in terms of modification of the source code of some of the functions used. Regarding the pre-processing algorithms, the Canny edge detector, contour detection and extraction of color information, among others, were used.

For the three problems, experimental results are presented and discussed in order to evaluate the best method to apply in each case. The best method for each application is already integrated, or in the process of being integrated, in the described robots.


Contents

1 Introduction
1.1 Objectives
1.2 Thesis structure

2 Object Detection on digital images
2.1 Basic concepts
2.1.1 Digital images
2.1.1.1 Types of images
2.1.1.2 Digital cameras
2.1.2 Image filtering
2.1.3 OpenCV Library
2.2 Template Matching
2.3 Hough transform
2.3.1 Linear Hough transforms
2.3.2 Circular Hough transforms
2.4 Visual descriptors
2.5 Descriptors matching
2.5.1 Matching strategy
2.5.2 Efficient matching
2.6 Location of object: the RANSAC algorithm
2.7 Final remarks

3 Feature-based descriptors
3.1 Scale Invariant Feature Transform descriptor - SIFT
3.1.1 Scale space construction
3.1.2 Keypoints detection
3.1.3 Keypoint descriptors
3.2 Speeded Up Robust Feature descriptor - SURF
3.2.1 Pre-processing
3.2.2 Box space construction
3.2.3 Keypoints detection
3.2.4 Keypoints descriptor
3.3 Features from Accelerated Segment Test - FAST
3.4 Binary Robust Independent Elementary Features - BRIEF
3.5 Final remarks

4 Decorative stones detection and recognition for a textile industry robot
4.1 Study of the appropriate methodology
4.2 Developed detection system
4.2.1 Grayscale
4.2.2 Edges
4.2.3 Contours
4.3 Results
4.4 Final remarks

5 Ball Detection for Robotic Soccer
5.1 Background implementation issues
5.2 Study of HoughCircles() parameters
5.2.1 Upper threshold for the internal Canny edge detector
5.2.2 Threshold for center detection
5.2.3 Maximum and minimum radius to be detected
5.3 Team play with an arbitrary FIFA ball
5.3.1 Validation process
5.4 Ball in the air
5.4.1 Validation process
5.4.2 Results
5.5 Final remarks

6 Objects Detection and Recognition for a Service Robot
6.1 RoboCup environment
6.2 Test scenario
6.3 Developed vision system
6.4 Efficiency of the visual descriptors
6.5 Results
6.6 Final remarks

7 Conclusions and future work
7.1 Future work

A Results: Decorative stones detection

Chapter 1

Introduction

In the last decades, technology has evolved at an exponential pace. Robotics is an example of this fast development: in a few decades, scientific fantasy has turned into reality.

A robot can be defined as an agent in a specific environment, from which it can take information and to which it can react with a specific goal. The information about the environment can be obtained by a variety of sensors. One of the richest types of sensors, and nowadays widely used, is the digital camera. Computer vision systems for robotic applications allow an agent to understand the environment around it, detect shapes as objects and classify them. Performing this task in a time-constrained manner requires an efficient vision system, and the processing time is a relevant issue to be considered when a vision system is implemented. Moreover, several parameters affect the performance of a vision system. The type of camera or lenses plays an important role in the vision system; however, some external parameters, such as the illumination conditions of the environment where the robot works and the correct calibration of the vision system, among others, impose additional constraints when a vision system is developed.

Since the beginning, researchers have been trying to develop computer vision systems that imitate human vision, which is a complex system that acquires images through the eyes and then processes them in the brain, in order to react to external stimuli. In a computer vision system, images are acquired by a digital camera and processed in a computational system with the same objective as the human brain. The main goal in computer vision has been to duplicate the abilities of human vision by electronically perceiving and understanding an image. However, it is accepted in the scientific community that this goal is still far from being reached.

Computer vision is a topic of study and research in several areas. Currently, vision systems are used in industrial environments, medicine, astronomy, forensic studies, biometrics, among others. This dissertation focuses on the use of vision systems for robotic applications, namely in three different applications: robotic soccer, an industrial environment and a domestic environment. All the applications focus on autonomous robots that perform behaviors or tasks with a high degree of autonomy, without human intervention.

1.1 Objectives

The objective of this work is to describe, understand and evaluate several vision algorithms to be used in robotic applications. In particular, we intend to study and develop algorithms for the analysis of shapes, features and color information in digital images in order to detect and classify objects. These algorithms are developed taking into consideration time constraints, so that they can be used in autonomous robots operating in an industrial or domestic environment.

This thesis focuses on solutions for three practical applications:

• Development of a decorative stones detector for the textile industry, for the Tajiservi company;

• Integration of a new approach to ball detection in the vision system of the CAMBADA soccer team of the University of Aveiro;

• Development of an object detector to integrate in the CAMBADA@Home autonomous service robot from the University of Aveiro.

When the development is concluded, each algorithm should be integrated and tested in each application.

1.2 Thesis structure

The remainder of the thesis is structured as follows:

• In Chapter 2, several known algorithms for object detection found in the literature are presented. Three different approaches are studied along this thesis in order to solve the several problems proposed. In this chapter, techniques used with the aim of object detection, like Template Matching and the Hough transform, are presented.

• In Chapter 3, a deep analysis of visual descriptors is presented. The SIFT, SURF, FAST and BRIEF algorithms are explained.

• In Chapter 4, several solutions to a practical problem are proposed. A decorative stones detector for the textile industry was developed, and its results are presented and discussed. In this chapter, several pre-processing steps are proposed and explained in order to achieve the best detection algorithm. Three algorithms are developed and their results are compared and discussed. Techniques like the Canny edge detector, dilation and erosion, and contour representation are used.

• In Chapter 5, the problem of ball detection in robotic soccer is discussed. The circular Hough transform is applied in order to solve several challenges related to the CAMBADA team.

• In Chapter 6, a solution for object detection by an autonomous service robot, operating in a domestic environment, is proposed. The SIFT and SURF algorithms are tested and evaluated.

• In Chapter 7, the conclusions of this thesis are described, as well as ideas for future work.

Chapter 2

Object Detection on digital images

An object recognition and detection system tries to find and classify objects of the real world in an image of the world, using object models which are known a priori. Humans recognize multiple objects in images with little effort. The appearance of an object may vary across viewpoints, sizes and scales, or when it is rotated, and objects can even be recognized when they are partially obstructed from view. In robotics, finding an object and classifying it is a non-trivial task in cluttered environments. The problem of object detection and recognition is a notoriously difficult one in the computer vision and robotics communities. Developing an autonomous robot with the ability to perceive objects in a real-world environment is still a challenge for computer vision systems. Many approaches to perform this task have been implemented over the last decades.

From robotics to information retrieval, many desired applications demand the ability to identify and localize categories, places and objects. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category. Object detection is commonly used in applications such as image retrieval and security and surveillance systems.

A variety of models can be used, including: image segmentation and gradient-based methods, through blob analysis and background subtraction; feature-based object detection, through detecting a reference object in a cluttered scene using feature extraction and matching; Template Matching approaches, when the scale factor is not important; among others.

2.1 Basic concepts

This section presents the basic elements of image processing needed to accomplish the most fundamental tasks developed along this thesis.

2.1.1 Digital images

An image may be defined as a two-dimensional function, f(x, y), where x and y are spatial (plane) coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity of the image at that point. When x, y, and the intensity values of f are all finite, discrete quantities, the image is called a digital image. A digital image is composed of a finite number of elements, each of which has a particular location and value. These elements are called pixels. Figure 2.1 shows a scheme of a digital image.

Figure 2.1: Digital image representation: rectangular matrix of scalars or vectors, where the f(i, j) are named pixels, Nc refers to the number of columns and Nl to the number of rows.
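As a side note, the sketch below shows how this definition maps to the basic image container used later in this thesis. It is a minimal example assuming the OpenCV 2.x C++ API; the file name example.png and the pixel coordinates are only illustrative.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <cstdio>

int main() {
    // Load an image as a grayscale matrix: each element f(i, j) is one pixel.
    cv::Mat img = cv::imread("example.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (img.empty()) return 1;

    // Nc (number of columns) and Nl (number of rows) of the matrix in Figure 2.1.
    std::printf("Nc = %d, Nl = %d\n", img.cols, img.rows);

    // Read the intensity at row y = 10, column x = 20 (assuming the image is large enough).
    unsigned char intensity = img.at<unsigned char>(10, 20);
    std::printf("f(x = 20, y = 10) = %d\n", intensity);
    return 0;
}
```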

2.1.1.1 Types of images

Digital images can be classified into four types: black and white, grayscale, color and indexed images. This section presents a brief explanation of each type of image.

• Black and white images, also called binary images, are images whose pixels have only two possible intensity values. These images have been quantized into two values, usually 0 and 1 or 0 and 255, depending on the numeric type used to represent the information. Black and white images need low storage or transmission capacity and simple processing. On the other hand, they have limited applications, restricted to tasks where internal details are not required as a distinguishing characteristic. Black and white images are sometimes used in intermediate steps of a computer vision system.

• Grayscale images, also called monochromatic, have only one value associated to each pixel. They are the result of measuring the intensity of light reflected by an object, or some other information, like temperature or distance, at each pixel in a single band of the electromagnetic spectrum. Grayscale images combine black and white in a continuum, producing a range of shades of gray. This range goes from 0 to $2^b - 1$, where b denotes the number of bits used to represent each pixel.

• Color images include several channels, normally three, to characterize each pixel. For example, the human eye has three types of receptors that detect light in the red, green and blue bands. For the brain, each color corresponds to electromagnetic waves with different wavelengths that are received by the eye. In the digital world, color can be represented by a set of numbers, which are interpreted as coordinates in a specific color space. RGB, YCbCr and HSV are the most commonly used color spaces in computer vision. They differ in the mathematical description of each color.

• Indexed images are a form of vector quantization compression of a color image. When a color image is encoded, color information is stored in a database denominated palette, and the pixel data only contain the number (the index) that corresponds to a color in the palette. This color table stores a limited number of distinct colors (typically 4, 16, ..., 256). Indexed images reduce the storage space and transmission time, but limit the set of colors per image.

In Figure 2.2 it is possible to see an example of an image represented in the different types described above.

2.1.1.2 Digital cameras

A digital camera can be seen as a device that captures the light from the environment and produces a digital image that can be stored internally or transmitted to another device.


Figure 2.2: Types of images: (a) Black and white; (b) Grayscale; (c) Color; (d), (e) and (f) Indexed images with different number of colors in the palette.

The electronic component responsible for this transformation is a chip composed of a light-sensitive area made of crystalline silicon, in which photodiodes absorb photons and release electrons through the photoelectric effect. The electrons are accumulated over the length of the exposure. The charge that is generated is proportional to the number of photons that hit the sensor. This electric charge is then transferred and converted to a voltage that is amplified and sent to an analog-to-digital converter, where it is digitized.

There are two main chip technologies in digital cameras: CCD (Charge Coupled Device) and CMOS (Complementary Metal Oxide Semiconductor). They perform similarly but differ in how the charge is transferred and where it is converted to a voltage. Depending on the application, one has advantages and disadvantages over the other.

When a camera takes a picture, the shutter opens briefly and each pixel on the image sensor records the light that falls on it by accumulating an electrical charge. To end the exposure, the shutter closes and the charge of each pixel is measured and converted into a digital number. Pixels on an image sensor only capture brightness, not color. In most color cameras, red, green, and blue filters are placed over individual pixels on the image sensor, making it possible later to reconstruct a color image. This filter is called the Bayer matrix [1] and is illustrated in Figure 2.3.

Figure 2.3: The Bayer arrangement of color filters on the pixel array of an image sensor (left). Cross section for the red, green and blue colors (right) [2].

A variety of cameras exists on the market to fulfill the requirements of a specific application. It is possible to find cameras with high resolution and long exposure times for scientific usage, infrared cameras for military and rescue applications, high dynamic range cameras for hard illumination scenarios, stereo and 3D cameras to extract spatial information from an environment, among others. Robots can use different types of cameras, according to the final application.

2.1.2 Image filtering

When an image is acquired, it is often not used directly for object detection. Sometimes it is necessary to eliminate or transform the information present in the image. This can be performed by an operation denoted filtering. When an image is filtered, its appearance, completely or only in a region, changes by altering the shades and colors of the pixels in some manner. Filters are commonly used for blurring, sharpening, edge extraction or noise removal. A filtering process is a convolution operation between an image and another matrix called a kernel. Each new pixel value is calculated as a function of the corresponding old pixel value and its neighborhood. The numbers in the kernel represent the weights by which each pixel of the neighborhood in the original image will be multiplied; the result of the convolution is the sum of these products. Figure 2.4 illustrates a schematic to help understand how the convolution occurs.
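As an illustration of this kernel-based filtering, the sketch below applies a 3 × 3 averaging kernel with OpenCV's filter2D. It is a minimal example assuming the OpenCV 2.x C++ API; blurWithKernel is only a hypothetical helper name.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Blur an image with a 3x3 averaging kernel: each output pixel is the
// weighted sum of the corresponding input pixel and its 8 neighbors.
cv::Mat blurWithKernel(const cv::Mat& src) {
    // All weights equal to 1/9 produce a simple smoothing (blurring) filter.
    cv::Mat kernel = cv::Mat::ones(3, 3, CV_32F) / 9.0f;
    cv::Mat dst;
    // filter2D slides the kernel over the image and computes the sum of products
    // between the kernel weights and the pixel neighborhood.
    cv::filter2D(src, dst, -1, kernel);
    return dst;
}
```

Changing the kernel weights (for example, to a Gaussian or a derivative mask) changes the effect of the filter, while the sliding-window mechanism stays the same.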

Figure 2.4: The center element of the kernel is placed over the source pixel. The source pixel is then replaced with a weighted sum of itself and nearby pixels [3].

2.1.3 OpenCV Library

OpenCV (Open Source Computer Vision) is a computer vision library launched by Intel and first released to the public at the IEEE Conference on Computer Vision and Pattern Recognition in 2000. Currently, OpenCV is owned by a nonprofit foundation called OpenCV.org.

It focuses mainly on real-time extraction and processing of meaningful data from images. OpenCV has several hundreds of image processing and computer vision algorithms, which make developing advanced computer vision applications easy and efficient. Its primary interface is in C++, but it still retains a less comprehensive, though extensive, older C interface. There are now full interfaces in Python and Java. The OpenCV library runs under Linux, Windows and Mac OS X.

OpenCV functionalities were subdivided and modularized, providing a set of modules that group related functionality. In this project several modules are used, such as: core, imgproc, highgui and calib3d.
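The hypothetical snippet below only illustrates how some of these modules cooperate in a typical pipeline (image loading and display with highgui, color conversion and edge extraction with imgproc); the file name and thresholds are arbitrary example values.

```cpp
#include <opencv2/core/core.hpp>       // core: basic data structures (cv::Mat)
#include <opencv2/imgproc/imgproc.hpp> // imgproc: filtering, color conversion, edges
#include <opencv2/highgui/highgui.hpp> // highgui: image loading and display

int main() {
    cv::Mat color = cv::imread("scene.png");   // highgui: load a color image
    if (color.empty()) return 1;

    cv::Mat gray, edges;
    cv::cvtColor(color, gray, CV_BGR2GRAY);    // imgproc: convert to grayscale
    cv::Canny(gray, edges, 100, 200);          // imgproc: Canny edge detector

    cv::imshow("edges", edges);                // highgui: display the result
    cv::waitKey(0);
    return 0;
}
```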

In this chapter, we present different approaches for object detection and recognition and introduce some techniques that have been proposed in the literature. We discuss the different recognition tasks that a vision system may need to perform, analyzing the complexity of these tasks and presenting useful approaches for the different phases of the recognition task.

Three approaches are considered, namely the Template Matching method, the Hough transform and visual descriptors based on the extraction of features.

2.2 Template Matching

Template matching is a technique in digital image processing used for finding small parts of an image which match a model image. This process involves calculating the correlation between the image under evaluation and the reference image, for all possible positions between the two images. They are compared using a technique called sliding window. This window moves over each pixel of the processed image (from left to right and top to bottom), and a result matrix is created, where each element contains the result of a proximity measure that represents how good or bad the similarity between them is. Mathematically, the source image I is the image in which the database image T is searched for. The result matrix R corresponds to the calculation carried out, and the position of each element of that matrix is designated (x, y). The image under evaluation I is of size W × H, and the reference image T is of size w × h. The result matrix R will have size (W − w + 1) × (H − h + 1), with the condition (size of I) > (size of T).

It is possible to calculate the similarity of two images in several ways. The OpenCV library has six different calculation methods available [4]:

• Squared difference: This method calculates the squared difference. The best match is found at the global minimum, so a perfect match will be 0:

$$R(x, y) = \sum_{x', y'} \left(T(x', y') - I(x + x', y + y')\right)^2 \qquad (2.1)$$

• Cross correlation: This method uses multiplication, so a perfect match will be large and bad matches will be small or 0:

$$R(x, y) = \sum_{x', y'} \left(T(x', y') \cdot I(x + x', y + y')\right) \qquad (2.2)$$

• Correlation coefficient: In this method the mean is subtracted from both the reference image and the image under evaluation before computing the inner product. A perfect match will be 1:

$$R(x, y) = \sum_{x', y'} \left(T'(x', y') \cdot I'(x + x', y + y')\right) \qquad (2.3)$$

• Normalized methods: Normalization is a process that changes the range of pixel intensity values. It is useful, for example, in images with poor contrast due to glare:

$$R(x, y) = \frac{\sum_{x', y'} \left(T(x', y') - I(x + x', y + y')\right)^2}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x', y + y')^2}} \qquad (2.4)$$

$$R(x, y) = \frac{\sum_{x', y'} \left(T(x', y') \cdot I(x + x', y + y')\right)}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x', y + y')^2}} \qquad (2.5)$$

$$R(x, y) = \frac{\sum_{x', y'} \left(T'(x', y') \cdot I'(x + x', y + y')\right)}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}} \qquad (2.6)$$

where in (2.3) and (2.6):

$$T'(x', y') = T(x', y') - \frac{1}{w \cdot h} \sum_{x'', y''} T(x'', y''),$$
$$I'(x + x', y + y') = I(x + x', y + y') - \frac{1}{w \cdot h} \sum_{x'', y''} I(x + x'', y + y'').$$

After the computation of the correlation between the database image and all possible regions of the image is finished, the best matches can be found. When Equation 2.1 or 2.4 is used, the match is found at the global minimum; for the remaining equations, the best matches correspond to a global maximum (Figure 2.5 illustrates the calculation of the result matrix). The location of this maximum or minimum value corresponds to the upper left corner of the matched region in the image under evaluation.

Figure 2.6 shows an example of applying the Template Matching algorithm.
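A possible implementation of this procedure with OpenCV's matchTemplate is sketched below; it assumes the OpenCV 2.x C++ API, the method constant corresponds to the normalized correlation coefficient (Equation 2.6), and the file names are only placeholders.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    cv::Mat I = cv::imread("scene.png");   // image under evaluation (W x H)
    cv::Mat T = cv::imread("model.png");   // reference image (w x h), smaller than I
    if (I.empty() || T.empty()) return 1;

    // Result matrix R of size (W - w + 1) x (H - h + 1).
    cv::Mat R;
    cv::matchTemplate(I, T, R, CV_TM_CCOEFF_NORMED);  // normalized correlation coefficient

    // For the normalized correlation coefficient, the best match is the global maximum.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(R, &minVal, &maxVal, &minLoc, &maxLoc);

    // maxLoc is the upper left corner of the best matching region.
    cv::rectangle(I, maxLoc, cv::Point(maxLoc.x + T.cols, maxLoc.y + T.rows),
                  cv::Scalar(0, 0, 255), 2);
    cv::imshow("match", I);
    cv::waitKey(0);
    return 0;
}
```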

2.3 Hough transform

The Hough transform is a feature extraction technique used in image processing. With this method it is possible to find lines, circles, or other simple forms in an image.

Figure 2.5: Result matrix calculation for method 1 (Equation 2.1). I denotes the image under evaluation, T the reference image and R the result of the correlation between both matrices. The best match position corresponds to a global minimum.

Figure 2.6: Example of applying the Template Matching algorithm. The image under evaluation is at the top right and, below it, the illustration of the result matrix with the correlation values for method 6 (Equation 2.6); the maximum corresponds to the location where the best match was found (marked with a red circle).

This transform, developed by Hough in 1959, was used in physics experiments [5], and its use in computer vision was introduced by Duda and Hart in 1972 [6].

The Hough transform algorithm, initially developed to detect lines in images, can also be extended to detect other simple image structures. The line transform is a relatively fast way of searching a binary image for straight lines. On the contrary, circular shape detection requires a large computation time and memory for storage, increasing the complexity of extracting information from an image. The Hough transform requires a pre-processing filtering step to detect edges in the image.

2.3.1 Linear Hough transforms

In the Cartesian coordinate system, lines can be represented using the following equation:

$$y = mx + b,$$

where (x, y) is a point through which the line passes, m is the slope of the line and b is the intercept value with the y axis. In the Hough transform, a different parametrization is used: instead of the slope-intercept form of lines, the algorithm uses the normal form. In the polar coordinate system, a line is described using two parameters: an angle θ and a distance ρ. ρ is the length of the normal from the origin (0, 0) onto the line and θ is the angle that this normal makes with the x axis (see Figure 2.7).

Figure 2.7: Hough space representation - ρ refers to the length of the normal from the origin (0, 0) onto the line and θ is the angle this normal makes with the x axis.

In this representation, the equation of the line is: ρ = x cos θ + y sin θ.

For each pixel at (x, y) and its neighborhood, the Hough transform algorithm determines if there is enough evidence of a straight line at that pixel. For each point in the original space, it considers all the lines which go through that point at a particular discrete set of angles, chosen a priori. For each angle θ, the distance ρ to the line through the point at that angle is calculated. Making a corresponding discretization of the Hough space results in a set of boxes in Hough space. These boxes are called the Hough accumulators. For each line, a count is incremented in the Hough accumulator at point (θ, ρ). After considering all the lines through all the points, a Hough accumulator with a high value will probably correspond to a line of points. Figure 2.8 shows an example of how the Hough transform works.

Figure 2.8: Hough transform process for three points and with six angle groupings. In the top, solid lines of different angles are plotted, all going through the first, second and third point (left to right). For each solid line, the perpendicular which also bisects the origin is found, these are shown as dashed lines. The length and angle of the dashed line are then found. The values are shown in the table below the diagram. This is repeated for each of the three points being transformed. The results are then plotted as a graph (bottom). The point where the lines cross over one another indicates the angle and distance of the line formed by the three points that were the input to the transform. The lines intersect at the pink point; this corresponds to the solid pink line in the diagrams above, which passes through all three points [7].
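The linear Hough transform described above is available in OpenCV as HoughLines; the sketch below shows one possible way to call it, where the Canny thresholds and the accumulator threshold of 100 votes are arbitrary example values rather than values prescribed by this thesis.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Detect straight lines in a grayscale image using the linear Hough transform.
// Each result is a (rho, theta) pair: rho in pixels, theta in radians.
std::vector<cv::Vec2f> detectLines(const cv::Mat& gray) {
    // The Hough transform needs a binary edge image as pre-processing.
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);

    std::vector<cv::Vec2f> lines;
    // Accumulator resolution: 1 pixel in rho, 1 degree in theta;
    // 100 is the minimum number of accumulator votes needed to accept a line.
    cv::HoughLines(edges, lines, 1, CV_PI / 180, 100);
    return lines;
}
```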


2.3.2 Circular Hough transforms

Circular shapes can be detected based on the equation of a circle:

$$r^2 = (x - a)^2 + (y - b)^2,$$

where (a, b) represents the coordinates of the center and r is the radius of the circle. The parametric representation of the circle is given by:

$$x = a + r \cos\theta, \qquad y = b + r \sin\theta.$$

For each edge point, a circle is drawn with that point as origin and radius r (if the radius is not known, then the locus of points in parameter space will fall on the surface of a cone as the radius is varied). In the circular case, the accumulator is a three-dimensional array, with two dimensions representing the coordinates of the circle center and the last one specifying the radius. The values in the accumulator are increased every time a circle with the desired radius is drawn over an edge point. The accumulator, which keeps count of how many circles pass through the coordinates of each edge point, then proceeds to a vote to find the highest count. The coordinates of the centers of the circles in the image are the coordinates with the highest counts. Figure 2.9 shows an example of the application of the Hough transform to circle detection.

Figure 2.9: Hough transform - example of circle detection. In an original image of a dark circle (radius r), for each pixel a potential circle-center locus is defined by a circle with radius r and center at that pixel. The highest-frequency pixel represents the center of the circle (marked with red color) [8].
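In OpenCV, the circular Hough transform is exposed through HoughCircles; the sketch below is one possible way to call it, assuming the OpenCV 2.x C++ API. The parameter values shown (Canny upper threshold, center-detection threshold, radius limits) are merely illustrative defaults, not the values used in the robots described later in this thesis.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Detect circles with the circular Hough transform (cv::HoughCircles).
// Each result is (a, b, r): center coordinates and radius.
std::vector<cv::Vec3f> detectCircles(const cv::Mat& gray, int minRadius, int maxRadius) {
    cv::Mat blurred;
    cv::GaussianBlur(gray, blurred, cv::Size(9, 9), 2);  // reduce noise before voting

    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(blurred, circles, CV_HOUGH_GRADIENT,
                     1,              // accumulator resolution (same as the image)
                     gray.rows / 8,  // minimum distance between detected centers
                     200,            // upper threshold for the internal Canny edge detector
                     100,            // threshold for center detection (accumulator votes)
                     minRadius, maxRadius);
    return circles;
}
```

The commented parameters correspond to the HoughCircles() parameters studied in Chapter 5 (internal Canny threshold, center-detection threshold, and minimum/maximum radius).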

2.4 Visual descriptors

The concept of feature detection refers to methods that aim at computing abstractions of image information and making local decisions at every image point.

This approach of performing matching using features usually involves three steps:

• Detection: Points of interest are identified in the image at distinctive locations.

• Description: The neighborhood of every point of interest is represented by a distinctive feature vector, which is later used for comparison.

• Matching: Feature vectors are extracted from two different images. Descriptor vectors between the images are matched on the basis of a nearest neighbor criterion.

Chapter 3 presents in detail several algorithms regarding the detection of points of interest and the extraction of feature vectors.

2.5 Descriptors matching

Image matching is a fundamental aspect of many problems in computer vision, including object or scene recognition. With the several methods to extract keypoints, features are found, constructing a set of feature vectors for an image. The matching process consists of finding matches (pairs of keypoints) between different views of an object or scene. Figure 2.10 shows the main principle of the matching approach.

There are different methods to calculate the distance between two vectors of the same length; the Euclidean, Manhattan and Hamming methods are explained below. The distance d between two n-dimensional vectors a and b can be expressed by:

• Euclidean: $d = \sqrt{\sum_{j=1}^{n} (a_j - b_j)^2}$;

• Manhattan: $d = \sum_{j=1}^{n} |a_j - b_j|$;

• Hamming: the distance d between two vectors A and B ∈ F^n is the number of coefficients in which they differ, where F is a finite field with q elements. It is usually used with binary descriptors.

Figure 2.10: Features matching process: on the left, the object to be recognized; on the right, the scene where the object will be found. In the middle, the features of the two pictures are represented, with the respective matches marked with black arrows.

The feature matching problem is usually divided into two separate components. The first is to select a matching strategy, which determines which correspondences are passed on to the next stage for further processing. The second is to devise efficient data structures and algorithms to perform this matching as quickly as possible.

2.5.1 Matching strategy

Three different methods are considered, illustrated in Figure 2.11.

• Fixed Threshold: This is the simplest matching strategy. Two regions are matched if the distance between their descriptors is below a threshold;

• Nearest Neighbor (NN): Match the nearest neighbor in feature space;

• Nearest Neighbor Distance Ratio (NNDR): Compare the nearest neighbor distance to that of the second nearest neighbor.

Figure 2.11: Fixed threshold, nearest neighbor, and nearest neighbor distance ratio matching. At a fixed distance threshold (dashed circles), descriptor F1 fails to match F2, and F3 incorrectly matches F4. If we pick the nearest neighbor, F1 correctly matches F2, but F3 still incorrectly matches F4. Using nearest neighbor distance ratio (NNDR) matching, the small NNDR d1/d2 correctly matches F1 with F2, and the large NNDR d'1/d'2 correctly rejects matches for F3.

It is difficult to set a good threshold value: a threshold that is too high results in too many false positives, i.e., incorrect matches being returned, while a threshold that is too low results in too many false negatives, i.e., too many correct matches being missed.

A descriptor can have several matches and several of them may be correct. In the case of nearest neighbor-based matching, two regions A and B are matched if the descriptor DB is the nearest neighbor to DA and if the distance between them is below a threshold. With this approach, a descriptor has only one match.

The third matching strategy is similar to nearest neighbor matching, except that the threshold is applied to the distance ratio between the first and the second nearest neighbor. The regions are matched if:

$$NNDR = \frac{d_1}{d_2} = \frac{\|D_A - D_B\|}{\|D_A - D_C\|} < th,$$

where $D_B$ is the first and $D_C$ is the second nearest neighbor to $D_A$.
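A possible implementation of the NNDR strategy with OpenCV's brute-force matcher is sketched below; it assumes the OpenCV 2.x C++ API and float descriptors (suitable for SIFT/SURF), and the ratio threshold of 0.8 is an illustrative value, not one prescribed by this thesis.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Nearest neighbor distance ratio (NNDR) matching between two descriptor sets,
// using brute-force search with the Euclidean distance.
std::vector<cv::DMatch> matchNNDR(const cv::Mat& descA, const cv::Mat& descB,
                                  float th = 0.8f) {
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(descA, descB, knn, 2);   // two nearest neighbors per descriptor

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knn.size(); ++i) {
        // Keep only matches whose ratio d1/d2 is small (unambiguous matches).
        if (knn[i].size() == 2 && knn[i][0].distance < th * knn[i][1].distance)
            good.push_back(knn[i][0]);
    }
    return good;
}
```

For binary descriptors (such as BRIEF, discussed in Chapter 3), cv::NORM_HAMMING would replace cv::NORM_L2.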

2.5.2 Efficient matching

Once we have decided on a matching strategy, we still need to efficiently search for potential candidates. It is necessary to compare all features against all other features in each pair of potentially matching images. As linear search is too costly for many applications, this has generated an interest in algorithms that perform approximate nearest neighbor search, in which non-optimal neighbors are sometimes returned. Such approximate algorithms can be orders of magnitude faster than exact search, while still providing near-optimal accuracy.

Several approaches have been studied to make matching more efficient. Devising an indexing structure is a better approach for rapidly searching for features near a given feature; multidimensional search trees and hash tables are examples of indexing structure implementations. In this thesis, only a brief explanation of the randomized kd-tree and hierarchical k-means tree algorithms is given. These algorithms are integrated in the FLANN library, explained below.

Fast Library for Approximate Nearest Neighbor - FLANN

FLANN is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

In their experiments [9], Marius Muja and David G. Lowe obtained the best performance with two algorithms, which used either the hierarchical k-means tree or multiple randomized kd-trees. A simple description of these algorithms is presented below.

The classical kd-tree algorithm [10] can be defined by a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each non-terminal node has two successor nodes, which represent the two sub-files defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. To find the nearest neighbor of a query point, a top-down searching procedure is performed from the root to the leaf nodes.

The nearest neighbor search in the high-dimensional case may require visiting a very large number of nodes, and the process may even cost linear time. Kd-trees split the data in half at each level, and the randomized trees are built by choosing the split dimension randomly from the first dimensions in which the data has the greatest variance. When searching the trees, a single priority queue is maintained across all the randomized trees, so that the search can be ordered by increasing distance to each bin boundary. The degree of approximation is determined by examining a fixed number of leaf nodes, at which point the search is terminated and the best candidates are returned.

The hierarchical k-means tree algorithm divides the dataset recursively into clusters (see Figure 2.12). The k-means algorithm is used by setting k to two in order to divide the dataset into two subsets. Then, the two subsets are divided again into two subsets by setting k to two. The recursion terminates when the dataset is divided into single data points or a stop criterion is reached.

Figure 2.12: Hierarchical k-means tree [11].
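In OpenCV, these FLANN indexes are wrapped by FlannBasedMatcher; the minimal sketch below assumes the OpenCV 2.x C++ API and float descriptors such as SIFT or SURF (by default the matcher builds randomized kd-trees over the train descriptors).

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Approximate nearest neighbor matching with FLANN.
std::vector<cv::DMatch> matchFlann(const cv::Mat& descA, const cv::Mat& descB) {
    cv::FlannBasedMatcher matcher;        // default index: randomized kd-trees
    std::vector<cv::DMatch> matches;
    matcher.match(descA, descB, matches); // one (approximate) nearest neighbor per descriptor
    return matches;
}
```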

2.6 Location of object: the RANSAC algorithm

The problem of object recognition typically assumes a classical approach: extract local features from images (such as SIFT or SURF) and match them using some decision criterion. A last step of this procedure usually consists in detecting groups of local correspondences under coherent geometric transformations. To estimate the homography, and the correspondences which are consistent with this estimate, a RANdom SAmple Consensus (RANSAC) algorithm can be used.

The algorithm was first published by Fischler and Bolles in 1981 [12]. RANSAC is an iterative method to estimate the parameters of a mathematical model from a set of observed data. It is a non-deterministic algorithm: it produces a reasonable result only with a certain probability, and this probability increases as more iterations are allowed.

The in-depth study of the RANSAC algorithm is not an objective of this thesis. More details about this algorithm are described in the original paper [12].
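For completeness, one possible way to combine the matched keypoint locations with RANSAC in OpenCV is findHomography (from the calib3d module mentioned earlier), sketched below; the reprojection threshold of 3 pixels is an arbitrary example value.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

// Estimate the homography between matched keypoint locations with RANSAC.
// Points in 'obj' come from the model image, points in 'scene' from the scene image.
cv::Mat estimateHomography(const std::vector<cv::Point2f>& obj,
                           const std::vector<cv::Point2f>& scene,
                           std::vector<unsigned char>& inlierMask) {
    // 3.0 is the maximum allowed reprojection error (in pixels) for a correspondence
    // to be counted as an inlier; matches outside it are rejected as outliers.
    return cv::findHomography(obj, scene, CV_RANSAC, 3.0, inlierMask);
}
```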

2.7 Final remarks

The Template Matching method tries to find small parts of an image which match a model image. The correlation process can be computed by different methods: squared difference, cross correlation, correlation coefficient and normalized methods. The normalization process divides each inner product by the square root of the energy of the reference image and of the image under evaluation. These methods are used because they can help reduce the effects of lighting differences between the images being compared. Regarding accuracy, the normalized methods give more accurate matches. Regarding processing time, the simple methods (squared difference) are faster, while the normalized methods are slower [13] [14].

The performance of the Hough transform is highly dependent on the results of the edge detector. This factor requires that the input image be carefully chosen to maximize the edge detection. The object size in the picture and the distance between objects cause variations in the results.

Chapter 3

Feature-based descriptors

In this chapter, the concept of feature detection will be studied in more detail, taking into consideration four well known algorithms: SIFT, SURF, FAST and BRIEF.

3.1 Scale Invariant Feature Transform descriptor - SIFT

Scale-invariant feature transform (or SIFT) is a popular image matching algorithm in computer vision to detect and describe local features in an image. The algorithm was published by David Lowe [15].

The SIFT descriptor is invariant to translations, rotations and scaling transformations in the image domain. It is robust to moderate perspective transformations and illumination variations.

The SIFT method operates on a stack of gray-scale images with increasing blur, obtained by the convolution of the initial image with a variable-scale Gaussian. A differential operator is applied in the scale-space, and candidate keypoints are obtained by extracting the extrema of this differential. The position and scale of the detected points are refined, and possibly unstable detections are discarded. The SIFT descriptor is built based on the local image gradient magnitude and orientation at each sample point in a region around the keypoint location. The descriptor encodes the spatial gradient distribution over a keypoint neighborhood using a 128-dimensional vector.

An illustrative diagram of the SIFT algorithm is represented in Figure 3.1. The details of the algorithm are presented as follows.

Figure 3.1: Description scheme of the SIFT algorithm.

3.1.1 Scale space construction

A scale-space is constructed from the input image by repeated smoothing and subsampling. This involves the convolution between a Gaussian kernel, G(x, y, σ), and the input image, I(x, y):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y),$$

where $*$ denotes the convolution operator and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)},$$

in which σ denotes the standard deviation and σ² the variance of the Gaussian kernel.

The scale-space of an image is constructed from the initial image, repeatedly convolved with Gaussians with increasing levels of blur, to produce images separated by a constant factor k. A stack of images is constructed, each image with a different level of blur and a different sampling rate. This set is split into subsets in which images share a common sampling rate; these subsets are called octaves. In the scale-space, the first octave is constructed by increasing the sampling rate by a factor of two. In the next octaves, the sampling rate is iteratively decreased by a factor of two and σ is doubled.

The standard value of scales per octave is $n_{spo} = 3$, but in practice the scale-space will also include three additional images per octave, necessary to apply the differential operator in the next step.

Figure 3.2: Each image is characterized by its level of blur and its sampling rate. Images are organized in octaves. In the same octave, images share a common sampling rate, with increasing levels of blur separated by a constant factor k. For each image in an octave, the level of blur doubles compared with the image at the same position in the previous octave, $\sigma_{o,1} = 2 \times \sigma_{o-1,1}$ with o = 1, ..., 4. $L_{in}$ refers to the initial image, with level of blur $\sigma_{in} = 0.5$ and sampling rate S = 1 (by default). $L_{o,s}$ denotes the localization of each image in the scale-space: the number of the octave, o, and the number of the scale inside the octave, s.

Figure 3.3 shows the scale-space construction applied to an image.

Difference of Gaussians

A Difference-of-Gaussians (DoG) is computed from the differences between the adjacent levels (separated by a constant multiplicative factor k) in the Gaussian scale-space.

DoG(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ).

The three additional images per octave are used in this step. The auxiliary images are necessary because the extreme levels of the DoG scale-space need another image for comparison (see Figures 3.4 and 3.5).

Figure 3.3: At the top right, the color image represents the input image. The first octave is represented at the top and successive octaves are presented below. The scale is reduced and the level of blur increases. These images were generated by a demo algorithm, available online and described in a paper [16], after adaptation.

Figure 3.4: Example of the DoG computation in the second octave. $D_{o,s}$ denotes the localization of each DoG image in the scale-space, computed by $D_{1,s} = L_{1,s+1} - L_{1,s}$ with s = 0, ..., 4. Pairs of consecutive images are subtracted, and the difference between the images at scales kσ and σ is attributed a level of blur σ (the procedure is not centered). Auxiliary images are represented in red.

Figure 3.5: Example of the DoG computation in the second octave. Auxiliary images are used in this step. These images were generated by the same demo algorithm used in Figure 3.3.

3.1.2 Keypoints detection

Detecting keypoints is not a trivial computation. After constructing the DoG scale-space, it is necessary to analyze all pixels to determine the extrema in each DoG image. Maximum and minimum values cannot be used directly; it is necessary to refine their location (via a quadratic interpolation). Finally, unstable candidate points are discarded. This process eliminates low-contrast extrema and candidate keypoints on edges. Details about these steps are presented below.

Extraction of candidate keypoints (Locate DoG extrema)

Interest points are obtained from the points at which the DoG values assume extrema with respect to both the spatial coordinates in the image domain and the neighboring pixels in the adjacent DoG scales. This step is illustrated in Figure 3.6.

The extrema detection is a rudimentary way to detect candidate keypoints of interest. This technique produces unstable and noise-sensitive detections.

Candidate keypoints refinement

The SIFT method uses a local interpolation to refine the position and scale of each sample point. Consider a point $D_{s,m,n}$ in the DoG scale-space, at the sample point (m, n) of octave o and scale s inside the octave, and the quadratic function $D_{s,m,n}(x)$ given by:

$$D_{s,m,n}(x) = D_{s,m,n} + x^T G_{s,m,n} + \frac{1}{2} x^T H_{s,m,n} x,$$

where $x = (x_1, x_2, x_3)$ is the offset from this point, and $G_{s,m,n}$ and $H_{s,m,n}$ are the gradient and the Hessian matrix, respectively:

$$G_{s,m,n} = \begin{pmatrix} (D_{s+1,m,n} - D_{s-1,m,n})/2 \\ (D_{s,m+1,n} - D_{s,m-1,n})/2 \\ (D_{s,m,n+1} - D_{s,m,n-1})/2 \end{pmatrix}, \qquad
H_{s,m,n} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}$$

with

$$h_{11} = D_{s+1,m,n} + D_{s-1,m,n} - 2D_{s,m,n}, \quad
h_{22} = D_{s,m+1,n} + D_{s,m-1,n} - 2D_{s,m,n}, \quad
h_{33} = D_{s,m,n+1} + D_{s,m,n-1} - 2D_{s,m,n},$$

$$h_{12} = h_{21} = (D_{s+1,m+1,n} - D_{s+1,m-1,n} - D_{s-1,m+1,n} + D_{s-1,m-1,n})/4,$$
$$h_{13} = h_{31} = (D_{s+1,m,n+1} - D_{s+1,m,n-1} - D_{s-1,m,n+1} + D_{s-1,m,n-1})/4,$$
$$h_{23} = h_{32} = (D_{s,m+1,n+1} - D_{s,m+1,n-1} - D_{s,m-1,n+1} + D_{s,m-1,n-1})/4.$$

Figure 3.6: Maxima and minima are computed in each DoG image by comparing a pixel (in red) with its 26 neighbors: 8 pixels in the same DoG image and 9 × 2 in the adjacent scales (marked in gray).

This quadratic function is an approximation of the second order Taylor expansion. The location of the extremum, x*, is determined by setting the derivative of this function to zero, giving:

$$x^* = -H_{s,m,n}^{-1} \, G_{s,m,n}.$$

If each component of the offset x* is less than 0.5, the extremum is accepted and the corresponding keypoint coordinates are recalculated:

$$(\sigma, x, y) = \left( \frac{S_o}{S_{min}} \, \sigma_{min} \, 2^{(x^*_1 + s)/n_{spo}}, \; S_o (x^*_2 + m), \; S_o (x^*_3 + n) \right),$$

where $S_o$ denotes the sampling rate of the octave, $S_{min} = 0.5$ and $n_{spo} = 3$ by default. If this condition does not hold, the extremum falls outside the domain of validity; the sample point is changed and the interpolation is performed around that new point instead. To get the interpolated estimate of the location of the extremum, the final offset x* is added to the location of its sample point, (s, m, n) + x*. This process is repeated until the interpolation is validated. If after five iterations the result is still not validated, the candidate keypoint is discarded.

This post-processing stage is, in particular, important to increase the accuracy of the scale estimates for the purpose of scale normalization.

Filter low contrast responses and edges

In order to discard low-contrast extrema, the SIFT method uses a threshold value applied to $D_{s,m,n}(x^*)$. In the standard case, with three scales per octave, the threshold value is $Th_{DoG} = 0.03$: if $|D_{s,m,n}(x^*)| < Th_{DoG}$, the candidate keypoint is discarded.

Some undesirable candidate keypoints may still subsist after the last two processes. The DoG function has a strong response along edges, so the interpolation refinement and the threshold on the DoG value may not be sufficient to discard them.

Undesirable keypoints can be detected by a Hessian matrix (2 × 2) computed at the location and scale of the keypoint:

$$H_{s,m,n} = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix},$$

where:

$$h_{11} = D_{s,m+1,n} + D_{s,m-1,n} - 2D_{s,m,n},$$
$$h_{22} = D_{s,m,n+1} + D_{s,m,n-1} - 2D_{s,m,n},$$
$$h_{12} = h_{21} = (D_{s,m+1,n+1} - D_{s,m+1,n-1} - D_{s,m-1,n+1} + D_{s,m-1,n-1})/4.$$

Edges present a large curvature orthogonal to the edge and a small one along the edge. By analyzing the eigenvalues of the Hessian matrix it is possible to detect whether a keypoint lies on an edge or not: a big ratio between the largest eigenvalue, $\lambda_{max}$, and the smallest one, $\lambda_{min}$, indicates the presence of an edge: $r = \lambda_{max}/\lambda_{min}$.

A second threshold value is considered: keypoints are discarded if the ratio between the eigenvalues, r, is greater than a threshold $Th_{Edge}$ (the standard value is $Th_{Edge} = 10$). The eigenvalues can be related to the determinant and trace of the Hessian matrix by:

$$tr(H_{s,m,n}) = h_{11} + h_{22} = \lambda_{max} + \lambda_{min}, \qquad det(H_{s,m,n}) = h_{11} h_{22} - h_{12}^2 = \lambda_{max} \lambda_{min}.$$

Then,

$$\frac{tr(H_{s,m,n})^2}{det(H_{s,m,n})} = \frac{(\lambda_{max} + \lambda_{min})^2}{\lambda_{max} \lambda_{min}} = \frac{(r + 1)^2}{r}.$$

This is known as the Harris-Stephen edge response [17].

Finally, to discard candidate keypoints on edges, the following test is applied: if

$$edgeness(H_{s,m,n}) = \frac{tr(H_{s,m,n})^2}{det(H_{s,m,n})} > \frac{(Th_{Edge} + 1)^2}{Th_{Edge}},$$

then the candidate keypoint is discarded.

3.1.3 Keypoint descriptors

Legitimate keypoints have now been found. They have been tested to be stable, and the scale at which each keypoint was detected is known (it is the same as the scale of the blurred image), so the algorithm has scale invariance. The next step is to assign an orientation to each keypoint in order to obtain rotation invariance.

Assign keypoints orientations

A Gaussian smoothed image, $L_{o,s}$, is selected according to the scale of the keypoint. The gradient magnitude, m(x, y), and orientation, θ(x, y), are computed for the keypoint considering:

$$m(x, y) = \sqrt{(L_{o,s}(x+1, y) - L_{o,s}(x-1, y))^2 + (L_{o,s}(x, y+1) - L_{o,s}(x, y-1))^2},$$

$$\theta(x, y) = \tan^{-1}\left(\frac{L_{o,s}(x, y+1) - L_{o,s}(x, y-1)}{L_{o,s}(x+1, y) - L_{o,s}(x-1, y)}\right).$$

An orientation histogram is created. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window (with standard deviation equal to 1.5 × σ), reducing the contribution of distant pixels. In the histogram, the 360 degrees of orientation are subdivided into 36 bins (10 degrees each). The amount that is added to a bin is proportional to the magnitude of the gradient at that point. Once this is done for all pixels around the keypoint, the histogram will have a peak at some point. In addition to the biggest mode, up to three other modes whose amplitude is within 80% of it can be considered. For this reason, a keypoint can generate several descriptors. Each new keypoint has the same location and scale as the original, but its orientation is equal to the other peak. Figure 3.7 shows an example histogram.

Figure 3.7: Histogram of gradient magnitudes and orientations. In this example, a keypoint will create two different orientations.

Keypoint descriptor

The goal of the SIFT algorithm is to generate a unique fingerprint for the keypoint. To do this, a 16 × 16 window around the keypoint is considered. This window is broken into 4 × 4 subregions. For each 4 × 4 region, the gradient magnitudes and orientations are calculated and a histogram is constructed (see Figure 3.8).

Figure 3.8: Computation of the keypoint descriptor. The keypoint is marked in red. The gradient magnitude and orientation are computed in a 16 × 16 window around the keypoint. These samples are accumulated into orientation histograms summarizing the contents over 4 × 4 subregions. The length of each arrow corresponds to the sum of all gradient magnitudes near that direction in the region. The amount added also depends on the distance from the keypoint (computed by the Gaussian weighting), so gradients that are far away from the keypoint add smaller values to the histogram [18].

The descriptor is a vector that contains the values of all orientation histogram entries. Each histogram has 8 orientation bins; therefore, the descriptor vector contains 8 × 16 = 128 elements.

In order to reduce the effects of illumination change, the descriptor vector is normalized to unit length. To reduce the effects of non-linear illumination a threshold of 0.2 is applied and the vector is again normalized.

Figure 3.9 shows all the keypoints detected by the SIFT algorithm.
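The following minimal sketch shows how keypoints and their 128-dimensional descriptors can be obtained with OpenCV's SIFT implementation; it assumes OpenCV 2.4, where SIFT lives in the nonfree module, and the input file name is only a placeholder.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SIFT is in the nonfree module in OpenCV 2.4
#include <vector>

int main() {
    cv::Mat img = cv::imread("object.png", CV_LOAD_IMAGE_GRAYSCALE);
    if (img.empty()) return 1;

    // Detector and extractor in one object, using OpenCV's default parameters
    // (3 octave layers, edge threshold 10).
    cv::SIFT sift;
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                    // one 128-dimensional row per keypoint
    sift(img, cv::noArray(), keypoints, descriptors);

    // Draw the detected keypoints, similarly to Figure 3.9.
    cv::Mat out;
    cv::drawKeypoints(img, keypoints, out);
    cv::imshow("SIFT keypoints", out);
    cv::waitKey(0);
    return 0;
}
```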

3.2 Speeded Up Robust Feature descriptor - SURF

Speeded Up Robust Feature (or SURF) is a fast and robust algorithm for local, similarity invariant representation and comparison, first presented by Herbert Bay (2006) [19]. Similarly to the SIFT approach, SURF is a detector and descriptor of local scale and rotation-invariant image features.

Figure 3.9: Keypoints detected by the SIFT algorithm of the OpenCV library.

The SURF method uses integral images in the convolution process, which is useful to speed up the method. The scale space is constructed using a box filter approximation, by convolving the initial image with box filters at several different discrete sizes; for this reason, the term box-space will be used in this document. To select interest point candidates, the local maxima of an approximated Hessian matrix are computed, and a quadratic interpolation is used to refine the location of the candidate keypoints. The contrast sign of each interest point is stored to help construct the keypoint descriptor. Finally, the dominant orientation of each keypoint is estimated and the descriptor vector is computed.

The scheme of the SURF algorithm is illustrated in Figure 3.10 and the details of the algorithm are presented as follows.
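As with SIFT, OpenCV 2.4 ships a SURF implementation in the nonfree module; the sketch below shows one possible call, where the Hessian threshold of 400 is an illustrative value rather than one prescribed by this thesis.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SURF is also in the nonfree module
#include <vector>

// Detect SURF keypoints and compute their 64-dimensional descriptors.
void detectSurf(const cv::Mat& gray,
                std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors) {
    // 400 is the threshold on the box-Hessian response: larger values keep
    // fewer, but stronger, keypoints.
    cv::SURF surf(400);
    surf(gray, cv::Mat(), keypoints, descriptors);
}
```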

3.2.1 Pre-processing

To understand how the SURF algorithm works, it is necessary to clarify some concepts. A brief explanation of integral images and of the box-Hessian approximation is presented below.

Integral images

An integral image is a data structure that allows rapid summing of subregions. The integral image, denoted $I_P(x, y)$, at location (x, y) contains the sum of the pixel values above and to the left of (x, y) in the input image I:

$$I_P(x, y) = \sum_{i=0}^{i \leq x} \sum_{j=0}^{j \leq y} I(i, j).$$

Using integral images, it takes only three additions and four memory accesses to calculate the sum of intensities inside a rectangular region of any size (see Figure 3.11).


Figure 3.11: Integral image computation. (a) integral image representation; (b) computation of area A using integral images: A = L4 + L1 − (L2 + L3) [20].
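The sketch below reproduces this computation with OpenCV's integral function; the helper name rectangleSum is hypothetical, and in a real system the integral image would of course be computed once and reused for many queries.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Sum of pixel intensities inside the rectangle 'roi', using an integral image:
// only three additions/subtractions and four memory accesses, whatever the region size.
double rectangleSum(const cv::Mat& gray, const cv::Rect& roi) {
    cv::Mat ip;
    cv::integral(gray, ip, CV_64F);  // ip(x, y) = sum of pixels above and to the left

    // Corners L1..L4 as in Figure 3.11: A = L4 + L1 - (L2 + L3).
    double L1 = ip.at<double>(roi.y, roi.x);
    double L2 = ip.at<double>(roi.y, roi.x + roi.width);
    double L3 = ip.at<double>(roi.y + roi.height, roi.x);
    double L4 = ip.at<double>(roi.y + roi.height, roi.x + roi.width);
    return L4 + L1 - (L2 + L3);
}
```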

Box-Hessian approximation

The SURF descriptor is based on the determinant of the Hessian matrix. The Hessian matrix, H, is the matrix of second order partial derivatives of the image I(x, y):

$$H(I(x, y)) = \begin{pmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial x \partial y} & \frac{\partial^2 I}{\partial y^2} \end{pmatrix}.$$

The second order scale normalized Gaussian is the chosen filter by convolution with the image I(x, y). With Gaussian function it is possible vary the amount of smoothing during the convolution process in scale space construction and construct kernels for the Gaussian derivatives in x, y and xy directions. The Hessian matrix can be calculated now, as function of space (x, y) and scale, σ:

H(x, y, \sigma) = \begin{pmatrix} L_{xx}(x, y, \sigma) & L_{xy}(x, y, \sigma) \\ L_{xy}(x, y, \sigma) & L_{yy}(x, y, \sigma) \end{pmatrix},

where L_{xx} is the convolution of the Gaussian second-order derivative with the image I at point (x, y), and similarly for L_{yy} and L_{xy}.

The respective kernels can be represented by using box filters. The masks used are a very crude approximation and are shown in Figure 3.12.

Figure 3.12: Laplacian of Gaussian approximation. Top: the Gaussian second-order partial derivative in the x (Lxx), y (Lyy) and xy (Lxy) directions, respectively; Bottom: box-filter approximation in the x (Dxx), y (Dyy) and xy (Dxy) directions [21].

The black and white sections in each filter represent the weights applied. In the Dxx and Dyy filters the black regions are weighted with a value of −2 and the white regions with 1. For the Dxy filter, the black regions are weighted with −1 and the white regions with 1.

The SURF algorithm uses the following approximation for the Hessian determinant: \det(H_{approx}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2.


Box filters are kernels with constant elements in rectangular regions, and integral images allow the sum of pixel values in a rectangular region to be computed very fast. The use of box filters therefore enables the use of integral images, which speeds up the convolution operation.
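Once the three box-filter responses are available (for example, computed with the masks of Figure 3.12 through rectangular sums like the ones above and normalized by the filter area), the approximated determinant is a simple element-wise expression; a minimal, illustrative sketch:

import numpy as np

def hessian_det_approx(dxx, dyy, dxy, w=0.9):
    # det(Happrox) = Dxx * Dyy - (0.9 * Dxy)^2, evaluated at every pixel.
    # The weight 0.9 compensates for the crude box-filter approximation.
    return dxx * dyy - (w * dxy) ** 2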

3.2.2 Box space construction

In the SIFT method, the scale-space is implemented by an iterative convolution between the input image and a Gaussian kernel, with the image repeatedly sub-sampled: the size is reduced and the blur level increases. This process is not computationally efficient, since the images need to be resized in each layer and each layer relies on the previous one. To improve this step, the SURF algorithm applies kernels of increasing size to the original image. All images in the box-space can be created simultaneously, with the same processing time (the calculation time is independent of the filter size). The original image remains unchanged and only the filter size varies.

The box-space is divided into octaves, similarly to SIFT. A new octave corresponds to a doubling of the kernel size and a sub-sampling by a factor of two. Each octave is also divided into several levels with increasing blur.
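As a concrete example, the sketch below lists the filter side lengths used in the original SURF formulation (the exact values are an assumption taken from the SURF paper and are shown only to illustrate that the filter grows across octaves while the image stays fixed):

def surf_filter_sizes(n_octaves=3, levels_per_octave=4):
    # Side length of the box filter at each (octave, level). The step
    # between levels doubles from one octave to the next, while the
    # input image itself is never resized.
    return [[3 * ((2 ** (o + 1)) * (l + 1) + 1)
             for l in range(levels_per_octave)]
            for o in range(n_octaves)]

# surf_filter_sizes() -> [[9, 15, 21, 27], [15, 27, 39, 51], [27, 51, 75, 99]]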

3.2.3 Keypoints detection

A candidate keypoint is obtained by computing the maxima of the determinant of the box-Hessian matrix in the box-space, comparing each point with its 26 neighbors in the box-space, just like in SIFT.

When a maximum is found, the area covered by the filter is rather large and this introduces a significant error in the position of the interest point. In order to extract the exact location of the interest point, a simple second-order interpolation is used.
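A minimal sketch of this 26-neighbor comparison, assuming det_stack is a NumPy array indexed as (level, row, column) holding the box-Hessian determinant responses of one octave (the threshold parameter is an assumption used to discard weak responses):

import numpy as np

def is_local_maximum(det_stack, s, y, x, threshold=0.0):
    # The candidate must exceed a minimum response and be strictly
    # larger than its 26 neighbors in the current, lower and upper
    # levels. Valid only for interior points (1 <= s, y, x < size - 1).
    value = det_stack[s, y, x]
    if value <= threshold:
        return False
    cube = det_stack[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2].copy()
    cube[1, 1, 1] = -np.inf  # exclude the candidate itself
    return value > cube.max()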

3.2.4 Keypoints descriptor

The purpose of a descriptor is to provide a unique and robust description of a feature. The SURF descriptor is based on Haar wavelet responses in the x and y directions and can be calculated efficiently with integral images. The SURF descriptor uses only a 64-dimensional vector to describe a feature.

In order to be invariant to rotation it is necessary to fix a reproducible orientation based on information from a circular region around the interest keypoint.


Assigning keypoint orientations

The area to be considered, in order to assign the keypoint orientation, is a circular region with radius 6s, where s denotes the scale at which the interest point was detected. At every point in this neighborhood, responses to horizontal and vertical box filters are computed. The Haar wavelet filters compute the responses in the x and y directions. The filters are illustrated in Figure 3.13. The black segment has a weight of −1 and the white segment +1.

Figure 3.13: Wavelet filters to compute the response in the x (left) and y (right) direction. The Haar wavelet filters have a side of 4s.

The computation of the gradient at this scale in this neighborhood is obtained using convolution with the box filters illustrated above. The horizontal and vertical results of the convolution are called dx and dy, respectively.

The interest area is weighted with a Gaussian (σ = 2s) centered at the interest point, to give some robustness to deformations and translations.

A histogram is built to estimate the dominant orientation. The Gaussian-weighted responses are represented as vectors in a space with the horizontal response strength along the x axis and the vertical response strength along the y axis. A sliding orientation window of size π/3 is considered and the horizontal and vertical responses inside the window are summed, constructing a new vector. This process is illustrated in Figure 3.14.

After calculating all the window vectors, the longest vector over all windows defines the orientation of the interest point. In the following step, the SURF descriptor is calculated.


Figure 3.14: Orientation assignment. A circular neighborhood around the interest keypoint is considered. A sliding orientation window computes the dominant orientation from the Gaussian-weighted Haar wavelet responses at every sample point in the region [22].
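A sketch of this sliding-window search, assuming dx and dy are NumPy arrays with the Gaussian-weighted Haar responses of all sample points inside the circular neighborhood (the number of window positions is an arbitrary choice):

import numpy as np

def dominant_orientation(dx, dy, window=np.pi / 3, steps=42):
    # Each sample contributes a vector (dx, dy); a window of size pi/3
    # slides around the circle and the responses inside it are summed.
    angles = np.arctan2(dy, dx)
    best_angle, best_norm = 0.0, -1.0
    for start in np.linspace(-np.pi, np.pi, steps, endpoint=False):
        diff = (angles - start) % (2 * np.pi)   # wrap-around at +/- pi
        mask = diff < window
        sx, sy = dx[mask].sum(), dy[mask].sum()
        norm = sx * sx + sy * sy
        if norm > best_norm:
            best_norm, best_angle = norm, np.arctan2(sy, sx)
    return best_angle   # orientation of the longest summed vector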

Keypoint descriptor

To extract the descriptor, a square region centered on the keypoint is constructed. This square has a size of 20s and is aligned with the orientation of the keypoint. The region is split up into 4 × 4 sub-regions. A set of four features is calculated at 5 × 5 regularly spaced grid points in each of these sub-regions. These four features include the Haar wavelet responses in the horizontal and vertical directions and their absolute values (see Figure 3.15).

Figure 3.15: The keypoint is marked in red. Descriptor region split up into 4 × 4 sub-regions (left). In each sub-region the sums dx, dy, |dx| and |dy| are calculated, relative to the orientation of the grid (right) [19].

Each sub-region contributes a set of four features, which results in a descriptor vector of length 64 (16 sub-regions × 4 features).
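The assembly of those 64 values can be sketched as follows, assuming dx and dy are 20 × 20 arrays with the (already oriented and Gaussian-weighted) Haar responses sampled on the grid around the keypoint; the final unit-length normalization follows the original SURF paper:

import numpy as np

def surf_descriptor(dx, dy):
    # 4 x 4 sub-regions, each with 5 x 5 samples; every sub-region
    # contributes (sum dx, sum dy, sum |dx|, sum |dy|) -> 64 values.
    desc = []
    for i in range(4):
        for j in range(4):
            bx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            by = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            desc.extend([bx.sum(), by.sum(),
                         np.abs(bx).sum(), np.abs(by).sum()])
    desc = np.asarray(desc)
    # Normalization to unit length gives invariance to contrast changes.
    return desc / (np.linalg.norm(desc) + 1e-12)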


In Figure 3.16 it is possible to see the keypoints detected by the SURF algorithm.

Figure 3.16: Keypoints detected - SURF algorithm by OpenCv library.
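For reference, keypoints such as those shown in Figures 3.9 and 3.16 can be obtained directly from the OpenCv library; the sketch below shows typical calls, but note that, depending on the OpenCv build, SIFT and SURF may only be available through the non-free xfeatures2d module, and the file names are arbitrary:

import cv2

img = cv2.imread('object.png', cv2.IMREAD_GRAYSCALE)

# SIFT (cv2.SIFT_create in recent versions; older contrib builds use
# cv2.xfeatures2d.SIFT_create).
sift = cv2.SIFT_create()
kp_sift, desc_sift = sift.detectAndCompute(img, None)

# SURF lives in the non-free xfeatures2d module.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp_surf, desc_surf = surf.detectAndCompute(img, None)

out = cv2.drawKeypoints(img, kp_surf, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite('surf_keypoints.png', out)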

3.3 Features from Accelerated Segment Test - FAST

Features from Accelerated Segment Test (or FAST) is an algorithm originally proposed by Rosten and Drummond [23] in 2006 for identifying corners in an image. The FAST algorithm is an attempt to solve a common problem in robotics, related with processing time. In real time, when a robot needs to extract information about the environment in order to react to it, the algorithms presented above (like SIFT and SURF) are effective, but too computationally intensive for use in real-time applications of any complexity. Thus, such detection schemes are not feasible in real-time machine vision tasks, despite the high-speed processing capabilities of today's hardware. Unlike SIFT and SURF, the FAST algorithm only detects corners/keypoints and does not produce descriptors. This detector can be used by other descriptor algorithms to detect keypoints.

The FAST detector uses a circle of 16 pixels with radius 3 to classify whether a candidate point ρ (with intensity I_ρ) is actually a corner. Each pixel in the circle is labeled with an integer number from 1 to 16, clockwise (illustrated in Figure 3.17). If a set of n contiguous pixels in the circle are all brighter than the intensity of the candidate pixel ρ plus a threshold value t, or all darker than the intensity of the candidate pixel ρ minus the threshold value t, then ρ is classified as a corner. n is usually chosen as 12 and all pixels in the circle are examined. Each pixel x can take three states: darker than ρ, similar to ρ or brighter than ρ:

Darker: I_x ≤ I_ρ − t,

Similar: I_ρ − t < I_x < I_ρ + t,

Brighter: I_ρ + t ≤ I_x,

where I_x is the intensity of the pixel under analysis.

Figure 3.17: FAST algorithm neighborhood: the pixel ρ is the center of a candidate corner. A circle of 16 pixels with radius 3 is considered. The intensity of each pixel is evaluated. The arc (dashed line) passes through 12 contiguous pixels which are brighter than ρ by more than the threshold.

To make the algorithm fast, it first compares the intensity of four pixels (as in Figure 3.17: pixels with numbers 1, 5, 9 and 13) with the intensity I_ρ of the central pixel. If at least three of these four pixels are neither above I_ρ + t nor below I_ρ − t, then ρ cannot be a corner. If this test is passed, all 16 pixels are then tested to check whether there are 12 contiguous pixels that satisfy the criterion.
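In practice the segment test does not need to be re-implemented, since the OpenCv library ships a FAST detector; a minimal usage sketch (the threshold value and file names are arbitrary):

import cv2

img = cv2.imread('scene.png', cv2.IMREAD_GRAYSCALE)

# threshold corresponds to t in the segment test; non-maximum
# suppression removes detections that are too close to each other.
fast = cv2.FastFeatureDetector_create(threshold=30, nonmaxSuppression=True)
keypoints = fast.detect(img, None)

out = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite('fast_corners.png', out)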

In this approach the detector has several weaknesses: the high-speed test does not generalize well for n < 12, generating a very large number of keypoints; the position and choice of the fast test pixels make certain implicit assumptions regarding the features; and many features are detected close to each other.

A machine learning approach has been added to the algorithm to deal with these issues, explained in the original paper [23].

3.4 Binary Robust Independent Elementary Features - BRIEF

The BRIEF algorithm was presented in 2010 [24] and was the first binary descriptor published, built on the basis of simple intensity difference tests. BRIEF takes only the information at single pixel locations to build the descriptor, so to reduce its sensitivity to noise the image is first smoothed by a Gaussian filter. The descriptor is built by picking pairs of pixels around the keypoint, according to a random or non-random sampling pattern, and then comparing the two intensities. If the intensity of the first pixel is smaller than that of the second, the test returns 1, and 0 otherwise. This is expressed by the following test τ for a patch p of size S × S:

\tau(p; x, y) = \begin{cases} 1 & \text{if } p(x) < p(y) \\ 0 & \text{otherwise}, \end{cases}

where p(x) is the smoothed intensity of the pixel at a sample point x = (u, v)^T. A set of n_d pairs is defined, so as to generate an n_d-dimensional bit string. The BRIEF descriptor is then taken to be the n_d-dimensional bit string:

\sum_{1 \le i \le n_d} 2^{i-1}\, \tau(p; x_i, y_i).

Frequently n_d = 128, 256 and 512 (good compromises between speed, space and accuracy). The authors of BRIEF consider five methods to determine the vectors x and y:

(a) x and y are randomly uniformly sampled;

(b) x and y are randomly sampled using a Gaussian distribution, meaning that locations that are closer to the center of the patch are preferred;

(c) x and y are randomly sampled using Gaussian distributions: first, x is sampled with a standard deviation of 0.04 S^2; then each y_i is sampled from a Gaussian with mean x_i and standard deviation 0.01 S^2;


(d) x and y are randomly sampled from discrete locations of a coarse polar grid;
(e) for each i, x_i is (0, 0) and y_i takes all possible values on a coarse polar grid.

Figure 3.18, which illustrates examples of the five sampling strategies, helps to clarify these definitions.


Figure 3.18: Choice of the test locations [24].

The BRIEF descriptor is compact, easy to compute and highly discriminative. The matching process uses the Hamming distance, which can be implemented very efficiently with a bitwise XOR operation between the two descriptors followed by a bit count.
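A small sketch of this XOR-based matching, assuming the two descriptors are packed as uint8 arrays (for example, 32 bytes for a 256-bit BRIEF descriptor); the helper name is illustrative:

import numpy as np

def hamming_distance(desc_a, desc_b):
    # XOR marks the differing bits; counting them gives the distance.
    xor = np.bitwise_xor(desc_a, desc_b)
    return int(np.unpackbits(xor).sum())

# With OpenCv, the same distance is used by the brute-force matcher:
#   bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
#   matches = bf.match(descriptors_a, descriptors_b)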

3.5 Final remarks

A SIFT keypoint is a selected image region with an associated descriptor. The descriptors are stored in vectors that contain the information necessary to characterize a keypoint, producing highly distinctive features that are useful in the matching process. The scale-space representation provides SIFT with scale invariance. In order to achieve rotation invariance, each keypoint is assigned a magnitude and an orientation. The keypoints extracted by SIFT therefore have a scale and an orientation associated with them, making this algorithm highly distinctive and robust.

The SURF algorithm works with the same objective as the SIFT descriptor: keypoints are assigned scale and rotation invariance in order to obtain distinctive features in an image. The SURF descriptor is an improvement over SIFT with respect to processing time. The use of integral images combined with the box-filter approximation of the Laplacian of Gaussian is an ingenious construction to speed up the convolution operation.


More visual descriptors exist in the literature, such as BRISK, FREAK and ORB, among others, but only SIFT and SURF are tested in this project.
