FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Odometria visual monocular em robôs para a agricultura com camara(s) com lentes "olho de peixe"

Monocular Visual Odometry in Robots for Agriculture using Fisheye Cameras

André Silva Pinto de Aguiar

Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Supervisor: Prof. Armando Jorge Sousa
Supervisor: Dr. Filipe Neves dos Santos

Resumo

As vinhas do Douro são caracterizadas por condições desfavoráveis para plataformas robóticas autónomas. Estas estão situadas em encostas íngremes, o que pode trazer diversos problemas, como a indisponibilidade ou imprecisão de sinais do Sistema de Navegação por Satélite, e apresentam irregularidades no terreno que podem causar imprecisão da odometria das rodas. Este trabalho pretende apresentar uma solução que não seja afetada por estes problemas, de modo a localizar uma plataforma robótica denominada RoMoVi: Robô Modular e Cooperativo para Vinhas de Encosta. Neste contexto surge a Odometria Visual como solução.

Odometria Visual constitui uma área de investigação apelativa na área da navegação em robótica. O seu principal objetivo é estimar o movimento do robô integrando a transformação relativa entre imagens consecutivas. Em relação a abordagens de localização comuns, a Odometria Visual tem a vantagem de não ser afetada pelo deslizamento das rodas como a odometria convencional, não depender da precisão dos sinais do Sistema de Navegação por Satélite como os sistemas baseados em GPS e não ter a necessidade de mapear o ambiente como as abordagens de SLAM. Contudo, o uso de câmeras para seguir o movimento de um robô é uma tarefa desafiante, especialmente em ambientes exteriores, devido a três razões essenciais: características da iluminação, elevada presença de objectos e/ou pessoas em movimento e elevada profundidade do ambiente. Estes desafios são ainda maiores no contexto da Odometria Visual monocular. O uso de uma única câmera para calcular a transformação homogénea relativa entre duas imagens consecutivas tem dois problemas inerentes. Primeiramente, não é possível calcular a escala do movimento usando uma única câmera sem nenhum conhecimento prévio do movimento e/ou configuração do robô ou do ambiente ou sem nenhuma fonte de informação externa tal como um sensor. Em segundo lugar, a maioria dos métodos de Odometria Visual monocular revelam imprecisão ao lidar com rotações puras. A mudança rápida de observação do ambiente causa erros consideráveis na maioria das abordagens presentes na literatura. Para reduzir os efeitos destes problemas, câmeras com ângulos de visão mais amplos podem ser usadas, nomeadamente câmeras com lentes em olho de peixe. Estas possuem a vantagem de capturarem mais informação instantânea sobre o ambiente envolvente em comparação com as câmeras de perspetiva convencionais. Além disso, permitem seguir pontos de referência sobre um número mais elevado de imagens consecutivas, o que pode trazer robustez ao método de Odometria Visual. Pelo contrário, apresentam distorção radial que pode corromper a estimação do movimento se não for devidamente modelada.

Nesta dissertação é proposta uma versão do Libviso2, um método de Odometria Visual da literatura, adaptado para o uso de uma câmera com uma lente em olho de peixe. Para ultrapassar os problemas inerentes à Odometria Visual monocular, duas abordagens foram desenvolvidas. A primeira é um Filtro de Kalman que funde um giroscópio com a rotação do método de Odometria Visual e estima o seu bias em tempo real. A segunda é o cálculo da escala do movimento baseado num LiDAR, onde o fator de escala é calculado associando as medidas de distância com pontos de referência 3-D triangulados. Finalmente, de modo a executar o sistema unificado num microprocessador, uma abordagem híbrida que apresenta otimizações em GPU e CPU foi desenvolvida.


Abstract

Douro vineyards are characterized by harsh conditions for autonomous robotic platforms. They are placed on steep slope hills, which brings issues such as the unavailability or inaccuracy of Global Navigation Satellite System signals, and they present terrain irregularities that can cause inaccuracy of wheel odometry. This work intends to present a solution that is not affected by these issues in order to localize a robotic platform named RoMoVi: Modular and Cooperative Robot for Vineyards. In this context, Visual Odometry comes up as a solution.

Visual Odometry constitutes an appealing research area in the robot navigation field. Its main goal is to estimate the robot motion by integrating the relative transformation between consecutive image frames. Compared with common localization approaches, Visual Odometry has the advantage of not being affected by wheel slippage as conventional odometry is, not depending on the accuracy of Global Navigation Satellite System signals as GPS-based methods do, and not needing to map the environment as SLAM approaches do. However, the use of cameras to track a robot's motion is a challenging task, especially in outdoor environments, due to three main reasons: illumination characteristics, high presence of moving objects and/or persons, and high depth of the scene. These challenges are even bigger in the context of monocular Visual Odometry. Using a single camera to compute the relative homogeneous transformation between two image frames has two main inherent issues. Firstly, it is not possible to calculate the motion scale using a single camera without any prior on the robot motion and/or configuration or on the environment, or without any external source of information such as a sensor. Secondly, most monocular Visual Odometry methods become inaccurate when dealing with pure rotations. The fast change of the scene view causes considerable errors in most approaches present in the literature. To reduce the effects of the mentioned issues, wider field of view cameras can be used, in particular fisheye cameras. They have the advantage of capturing more instantaneous information about the surrounding scenario in comparison with conventional perspective cameras. Also, they allow tracking individual image features or landmarks over a larger number of consecutive images, which can bring robustness to the Visual Odometry method. On the other hand, they present radial distortion that can degrade motion estimation if not properly modeled.

In this dissertation we propose a version of Libviso2, a state-of-the-art Visual Odometry method, suitable for the use of a single fisheye camera. To overcome the inherent issues of monocular Visual Odometry, two approaches were developed. The first is a Kalman Filter that fuses a gyroscope with the output rotation of the Visual Odometry method, estimating its bias online. The second is a LiDAR-based scale calculation where we compute the scale factor by associating the range measures with 3-D triangulated feature points. Finally, in order to execute the unified system on top of a microprocessor, a hybrid approach that presents both GPU and CPU optimizations was developed.


Acknowledgments

I would like to thank every person who contributed to the accomplishment of this work. To Dr. Filipe Santos and Prof. Armando Sousa for supporting, believing in and guiding me as a person and engineer.

To the team from the Centre for Robotics in Industry and Intelligent Systems laboratory for all the support, especially to Luis Santos for all the patience and friendship while recording data used in this dissertation.

To Prof. Miguel Riem Oliveira for his contribution.

To my friends, in special Hugo Antunes, Miguel Aguiar, Miguel Macedo and Pedro Leite, for all the friendship and constructive talks that helped me clear my mind.

To my family for all the support, love and guidance during this work and all my academic path.

To my lovely girlfriend and best friend for all the love and for being with me in every moment. Thank you.

André Aguiar


“With one kind gesture you can change a life. One person at a time you can change the world. One day at a time we can change everything.”

Steve Maraboli


Contents

1 Introduction
1.1 Context
1.2 Motivation and Scope
1.3 Goals
1.4 Contributions
1.5 Dissertation Rationale and Structure

2 Problem Statement and Direction
2.1 Problem Description
2.2 Proposed System Architecture

3 Fundamentals
3.1 Camera Models and Calibration
3.1.1 Fisheye Camera
3.1.2 Perspective Camera Model
3.1.3 Fisheye Camera Model and Calibration
3.2 Visual Odometry
3.2.1 Problem Statement
3.2.2 Feature-Based Methods
3.2.3 Appearance-Based Methods
3.2.4 Visual SLAM
3.2.5 Challenges in Visual Odometry
3.2.6 Sensors Integration
3.3 Parallel Computing
3.3.1 Main Architectures
3.3.2 Heterogeneous Computing

4 State of the Art
4.1 Omnidirectional Camera Calibration
4.1.1 Omnidirectional Camera Calibration Toolbox by Davide Scaramuzza
4.2 Visual Odometry
4.2.1 Main Visual Odometry Methods
4.2.2 Omnidirectional Cameras in Visual Odometry
4.2.3 Open Issues in Visual Odometry Field
4.2.4 Visual Odometry in Agriculture
4.3 Programming Tools
4.3.1 ROS
4.3.2 OpenCL

5 Visual Odometry Benchmarking
5.1 Dataset Sequences
5.2 Motion Estimation Performance
5.3 Runtime Performance
5.4 Discussion

6 Visual Odometry With a Fisheye Camera
6.1 Camera Calibration
6.2 Rectified Fisheye Images in Visual Odometry
6.3 An Omnidirectional Version of Libviso2

7 Turn Performance Optimization
7.1 System Model
7.2 Noise Covariance Approach

8 Motion Scale Determination
8.1 Method Overview
8.2 LiDAR and Camera Association
8.3 Scale Factor Calculation

9 Processing Time Optimization
9.1 GPU Parallel Computing
9.1.1 Broadcom VideoCore IV
9.1.2 GPU Access and Abstraction
9.1.3 Implementation
9.2 CPU Parallel Computing
9.3 Final Solution

10 Results
10.1 Motion Estimation
10.1.1 Turn Performance Optimization
10.1.2 Dissertation Walkthrough
10.1.3 Final Solution
10.2 Processing Time
10.3 Discussion
10.3.1 Kalman Filter Approach
10.3.2 Rectified Fisheye Images on Visual Odometry
10.3.3 Omnidirectional Libviso2
10.3.4 Scale Estimation
10.3.5 Processing Time

11 Conclusion
11.1 Main Conclusions
11.2 Future Work

A Attachments
A.1 Accepted Papers
A.2 Submitted Papers


List of Figures

1.1 AgRob V16 robot for agriculture and forestry.

2.1 Global system architecture.

3.1 Difference between projecting a scene point X into x′ using a perspective model and projecting it into x using a radially symmetric model [30].

3.2 Epipolar geometry setup [27].

3.3 Connection of a CPU and a GPU [1].

4.1 ORB-SLAM system overview [36].

4.2 LSD-SLAM system overview [12].

4.3 SVO system overview [15].

4.4 Probabilistic depth map based on Bayesian estimation to deal with pure rotations proposed by [57].

4.5 Camera configuration present on the agriculture field robot developed in [14].

4.6 Memory layout of an OpenCL application.

5.1 The route traveled by the vehicle when capturing the used dataset. The start and finish points are signaled by A and B [3].

5.2 Trajectory performance of Libviso2, SVO and DSO on the Kitti dataset and Libviso2 on Rustbot's.

5.3 Number of matches computed by Libviso2 and SVO in the two sequences and number of active points computed by DSO in the two sequences.

6.1 Calibration toolbox interface.

6.2 Omnidirectional camera calibration toolbox functionalities.

6.3 (a) Raw fisheye camera image and (b) rectified fisheye camera image [2].

6.4 Matches computed (a) before and (b) after applying the mask to them [2].

6.5 Set of image processing operations applied to the input rectified image to generate a mask [2].

6.6 Epipolar geometry configuration for central fisheye cameras considering the plane that intercepts the unit sphere center and the epipolar curve represented in red.

7.1 Camera (right) and inertial sensors (left) coordinate system relation.

7.2 Observations (a) and angular velocity process (b) covariance noise as a function of rotation.

8.1 (a) Side-view and (b) front-view of the projection of LiDAR measures in the fisheye image, definition of the neighborhoods and feature association.

8.2 Projection of LiDAR measurements in the fisheye image and 2-D feature association with them.

9.1 UML diagram of the OpenCL abstraction layer implemented.

9.2 Representation of the 16-way SIMD kernel architecture to extract the inliers from the total set of matches corresponding to 16 simultaneous RANSAC iterations.

9.3 (a) Serial and (b) parallel RANSAC configurations.

10.1 RoMoVi's setup - vision system containing a fisheye and perspective camera, and the LiDAR used.

10.2 Trajectory performance of Libviso2 under Rustbot and Kitti's sequences and bias estimation for each one.

10.3 Trajectory performance of SVO and DSO under Kitti's sequence and bias estimation for each one.

10.4 Comparison of the trajectory performance of Libviso2 on the Rustbot dataset using a (1) non-linear and (2) linear variation of KF noise coefficients.

10.5 Performance of the four described configurations standalone and with the gyroscope fusion on sequence A.

10.6 Performance of the two configurations executed on the PC standalone and with the gyroscope fusion on sequences B and C.

10.7 Performance of the two configurations executed on the RPi standalone and with the gyroscope fusion on sequences B and C.

10.8 Feature matches and inliers computed by the perspective and omnidirectional versions of Libviso2 on sequences B and C running on the PC considering a maximum

List of Tables

4.1 Key characteristics of the main VO methods present in the literature.

5.1 SVO parameter tuning.

5.2 Average processing runtime (sec) of the three methods.

10.1 Runtime performance (sec) of the omnidirectional version of Libviso2 on the RPi using both the serial and parallel configurations.


Abbreviations and Symbols

ALU Arithmetic Logic Unit

API Application Programming Interface

CPU Central Processing Unit

DSP Digital Signal Processor

EKF Extended Kalman Filter

FPGA Field-Programmable Gate Array

FoV Field of View

GFLOPS Giga Floating Point Operations Per Second

GNSS Global Navigation Satellite System

GPU Graphics Processing Unit

IMU Inertial Measurement Unit

KF Kalman Filter

OpenCL Open Computing Language

QPU Quad Processing Unit

RANSAC Random Sample Consensus

ROS Robot Operating System

RPi Raspberry Pi

SFU Special Functions Unit

SIMD Single Instruction, Multiple Data

SLAM Simultaneous Localization and Mapping

SVD Singular Value Decomposition

TMU Texture and Memory Unit

VO Visual Odometry

VPM Vertex Pipe Memory

vSLAM Visual Simultaneous Localization and Mapping


Chapter 1

Introduction

Giving a robot the autonomy to navigate in an unknown environment by itself has been a major milestone of the scientific world. The relevance of this task is visible in the large number of scientific works that aim to localize robots and make them autonomous while in movement. Performing this in outdoor environments is an even more challenging task due to the non-uniform characteristics of the ground, the higher density of surrounding movements of objects and/or persons, the illumination characteristics, the high depth of the scene and others [49]. VO is one of the fields of study in robotics used to localize robots. Its main goal is to track camera motion by analyzing consecutive image frames [47]. The use of a single camera or multiple cameras to perform this task is appealing due to the challenges associated with it, but also due to the low resource requirements.

1.1 Context

In general, using robots to perform repetitive tasks has some advantages in comparison with human performance. In particular, robots can work autonomously for a larger number of hours and, if they are robust, can offer better performance in terms of final results. Also, for dangerous tasks, human life is not put at risk if robots are allocated to them.

In this context, the number of robotic solutions in agriculture has increased in recent years. However, nowadays most agricultural processes are still performed by hand in Portugal. For example, Douro vineyards are places where the use of robots is appealing but really challenging. These vineyards are placed on steep slope hills, which brings inherent issues such as the inaccuracy or even unavailability of GNSS, the inaccuracy of wheel odometry due to harsh terrain conditions, as well as the unreliability of IMUs under these conditions [3].

Aiming to overcome these issues, the INESCTEC team from the Centre for Robotics in Industry and Intelligent Systems laboratory has had a research project in the robotics for agriculture field since 2014, named AgRob [43][44]. Their main focus is to develop solutions for crop monitoring, harvesting, precision pulverization and others. To do so, three main agricultural platforms were created in this context: AgRob V14, AgRob V15 and AgRob V16. The last one [41] is represented in Fig. 1.1.

Figure 1.1: AgRob V16 robot for agriculture and forestry.

In the same context, this team also works on the RoMoVi project: Modular and Cooperative Robot for Vineyards. This project's main goal is to develop robotic components and a robotic platform for commercial use in the context of hillside vineyards.

This dissertation is inserted in RoMoVi's context, namely in the localization of the agricultural platform. The aim is to localize the robot using a single camera, resorting to VO techniques and common sensors as support.

1.2 Motivation and Scope

The environment where the agricultural platform operates is the Douro vineyards. In these, the presence of high hills limits the reliability of the GNSS signals in terms of quantity and accuracy [43]. Also, GNSS receivers are susceptible to spoofing attacks, which is a safety issue that could endanger the correct behavior of the system. Thus, the use of satellite-based localization systems such as GPS is not a reliable standalone option in this kind of environment. In this context, a reliable localization system independent of satellite signals is required. Wheel odometry is also widely used in robotics. However, as mentioned, the environment where the robot operates is not favorable to this kind of approach due to terrain conditions that are propitious to wheel slippage. To overcome these issues, we propose the use of a monocular VO system that keeps track of the robot motion using a fisheye camera. The choice of this type of camera is based on the prior that its wider horizontal FoV, in comparison with perspective cameras, results in a deeper knowledge about the surrounding environment. Also, wide FoV cameras allow the tracking of image features over a larger set of consecutive images, which can have an impact on motion estimation [56]. Taking advantage of this and overcoming the inherent issues of the use of this type of camera, it is possible to compute a more robust motion estimation. To support and solve the main issues of monocular VO, we propose the use of common sensors.


1.3 Goals

The main goal of this dissertation is to estimate the robot pose in real time in an agricultural environment. To do so, we aim to achieve the following goals:

• Select a robust monocular VO method present in the literature that will constitute the basis of this dissertation.

• Adapt the method to the use of a fisheye camera, i.e., develop an omnidirectional version of the state-of-the-art method.

• Optimize the developed approach using sensors to solve the main issues associated with monocular VO.

• Achieve real time performance on a device that can be coupled to the agricultural platform, such as a microprocessor.

1.4 Contributions

During this dissertation, the following contributions to the state of the art were made.

• Accepted paper "Monocular Visual Odometry Benchmarking and Turn Performance Optimization" in the 19th IEEE International Conference on Autonomous Robot Systems and Competitions.

• Accepted paper "Monocular Visual Odometry Using Fisheye Lens Cameras" in the 19th EPIA Conference on Artificial Intelligence.

• Submitted paper "A Version of Libviso2 for Central Dioptric Omnidirectional Cameras with a Laser-based Scale Calculation" in ROBOT 2019: Fourth Iberian Robotics Conference [in reviewing process].

• Omnidirectional version of Libviso2 accepted by Andreas Geiger, which is publicly available at the official Libviso2 website¹.

• Accepted pull request on the official ROS wrapper for Libviso2² of the developed omnidirectional approach.

• Open source implementation of an OpenCL abstraction layer³.

• Open source datasets recorded in RoMoVi's context.

The main contribution of this dissertation is a system that fuses several data sources with VO in order to perform a more robust estimation of the robot motion.

¹ http://www.cvlibs.net/software/libviso/

² https://github.com/srv/viso2


1.5 Dissertation Rationale and Structure

After this introduction, chapter 2 specifies the problem description and the proposed solution of this dissertation. After that, chapter 3 presents a brief introduction to the main concepts used in this work and chapter 4 presents the literature review. The description of the implementation comes in chapters 5 to 9. Each implementation chapter details contributions to the global working system. The work started by performing a benchmark between three monocular VO methods in chapter 5, in order to start with the most adequate open source solution on the market. Then, the proposed VO approach is described in chapter 6 and the sensor-based approaches in chapters 7 and 8. Finally, we present a processing time optimization approach in chapter 9. These chapters are based on the articles presented in appendix A. Finally, the results are presented in chapter 10 and the conclusions in chapter 11.


Chapter 2

Problem Statement and Direction

The main statement that this dissertation intends to defend is:

It is possible to track the motion of a robot with moderate accuracy using a single fisheye camera and common sensors on top of a low cost microprocessor in an outdoor environment.

2.1 Problem Description

Associated with the given statement, several issues and questions arise, such as:

• Question 1: How to model the distortion imposed by the fisheye lens in order to use it in the VO context?

• Question 2: How to deal with challenging motions such as pure rotations?

• Question 3: How to compute the motion scale using a single camera? (or, in other words, how to calculate the robot velocity using a single camera?)

• Question 4: Is it possible to simultaneously obtain robustness and real time performance on a low cost microprocessor in order to execute the algorithm on the agricultural platform?

These four questions summarize the main problems that this dissertation intends to solve. Fisheye cameras offer a much wider FoV than perspective ones. However, the lens distortion, if not well modeled, can directly affect motion estimation. As will be described, elementary camera models such as the pinhole camera model are not suitable to deal with this kind of distortion. This constitutes a real issue that is still under development in the literature. Moving forward, monocular VO presents high difficulties computing motion accurately in pure rotations. The fast change of the scene view leads motion estimation to degenerate in many state-of-the-art algorithms. In fact, many of them have relocalization methods implemented in order to recover from this kind of motion. Also, the use of a single camera in VO has an inherent problem that is still an open issue in the literature: motion scale. With a single camera it is not possible to have the notion of scene depth, which makes it impossible to determine the scale factor without any prior assumption on the motion, robot configuration and/or surrounding environment, or without the use of external


sources of information such as sensors. Finally, in order to achieve high robustness against the mentioned issues of monocular VO, complex algorithms that are computationally expensive often have to be implemented. Thus, having a VO method running in real time on low cost devices such as microprocessors is a challenging task.

2.2 Proposed System Architecture

In order to solve the referred issues, the first step performed in this dissertation was a benchmark between three well known monocular VO methods present in the literature. In this way, it was possible to choose the method that constitutes the basis upon which all the developed approaches were implemented. After that, we were able to present a solution for each one of the raised issues.

Question 1: The standard and accepted approach to obtain a camera model is to perform its calibration. The use of omnidirectional cameras imposes the same need. Thus, in order to obtain a model of the fisheye camera that can be used in this dissertation, a state-of-the-art omnidirectional camera calibration toolbox suitable for fisheye cameras will be used. After obtaining the model of the camera, two approaches are proposed to use the fisheye camera in the VO method: the first uses rectified fisheye images in the original monocular VO method, and the second is an extension of the state-of-the-art method suitable for fisheye cameras.

Question 2: An outdoor environment is characterized by conditions that most monocular VO methods have difficulties dealing with. When these conditions are combined with challenging motions such as pure rotations, estimating camera motion with accuracy turns out to be a difficult task. In order to obtain a solution that minimizes the effects of illumination, moving objects and other conditions that characterize outdoor environments, and that solves the difficulties of monocular VO with pure rotations, we propose a KF-based approach that fuses a gyroscope with the rotation output by VO. To cancel the errors associated with the gyroscope we also estimate its bias online.

Question 3: The notion of depth is crucial in VO because it can be used to calculate depth maps, reconstruct 3-D scenes and calculate the motion scale. Stereo camera VO systems have the advantage of seeing the scene from two different perspectives, which can be used to compute its depth. Using a single camera to do so requires additional knowledge about the scene. In this context we propose a LiDAR-based scale calculation. In other words, we use the distance measures taken by a LiDAR sensor to compute the depth of interest points in order to calculate the scale factor.

Question 4: In order to have the unified system running on the agricultural platform in real time, a small device with at least one processing unit must be used. In this context we chose to use the RPi 3B due to its low cost and the fact that it possesses a CPU with 4 cores and a GPU. However, its processing capacities are limited when dealing with computationally expensive methods and algorithms. To overcome this limitation we propose a hybrid approach that uses both the RPi's CPU and GPU to optimize the VO method routines.

In short, this dissertation consists in the combination of these four approaches into a unified system: an omnidirectional version of a state-of-the-art monocular VO system, fused with a gyroscope and a LiDAR, running on top of a RPi 3B with both CPU and GPU optimizations. Figure 2.1 summarizes the described architecture.


Chapter 3

Fundamentals

This chapter introduces the main concepts that will be used to describe the processes involved in the implementation of this dissertation. Firstly, the camera models and the approaches to their calibration are described. Then, VO is introduced with more focus on feature-based methods. Lastly, a brief introduction to the main concepts of parallel programming is given.

3.1 Camera Models and Calibration

The pinhole camera model approximates conventional perspective cameras well. However, it is not suitable for fisheye lens cameras [30]. This type of lens achieves FoVs usually in the order of 180 degrees, so a different approach has to be used to model them. In this section, a brief description of the fisheye camera, the common models of both perspective and fisheye cameras, and the fisheye camera calibration is given.

3.1.1 Fisheye Camera

Fisheye cameras are considered omnidirectional due to the wide FoVs that they present. Omnidirectional cameras can be classified in two main groups: central and non-central. The former type satisfies the single effective viewpoint property. In other words, all the rays captured by the camera intersect at a single point. Many cameras using fisheye lenses are central, and the ones that are not can satisfy the single effective viewpoint property approximately [48]. This type of camera is also called dioptric. They are designed by combining a conventional camera with a fisheye lens and present several advantages relative to other cameras: they do not generate images with dead areas and do not increase the size of the original image [8]. However, the presence of a lens brings distortion to the image. There are two main types of distortion: tangential and radial. The latter is highly present in fisheye camera images, and so there is the need to model this type of distortion.


3.1.2 Perspective Camera Model

Pinhole cameras can be modeled using the following perspective projection

r = f tan(θ) (3.1)

where r is the distance between the image point x and the principal point, θ is the angle between the principal axis and the incoming ray, and f is the camera focal length [30]. This model assumes that the image is formed by the intersection with the focal plane of the incoming rays that pass through the camera center [47]. So, given a 3-D point in the camera reference frame X = [x y z]^T, the normalized image point x = [u v]^T is computed as follows.

$$
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = KX = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} \qquad (3.2)
$$

where λ is the scene depth, f_x and f_y are the focal length components and [c_x c_y]^T is the principal point. These last four are called the intrinsic camera parameters and form the so-called intrinsic camera matrix K, which relates 3-D points with normalized image coordinates. This model begins to degenerate when the FoV exceeds 45 degrees, where a non-linear model is needed to handle lens distortion.
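As a minimal illustration of equation 3.2, the sketch below projects a 3-D point expressed in the camera frame into pixel coordinates; the intrinsic values are placeholders, not those of any camera used in this work.

```python
import numpy as np

# Illustrative intrinsic parameters (placeholders).
fx, fy, cx, cy = 700.0, 700.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project_pinhole(X):
    """Project a 3-D point X = [x, y, z] (camera frame) to pixel coordinates [u, v] (eq. 3.2)."""
    p = K @ X                 # homogeneous image point, scaled by the depth lambda = z
    return p[:2] / p[2]       # divide by lambda to obtain the pixel coordinates

print(project_pinhole(np.array([0.5, -0.2, 2.0])))
```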

3.1.3 Fisheye Camera Model and Calibration

As the perspective camera model is not suitable for fisheye cameras, a different approach is used to model them. Usually, a radially symmetric model is considered [30]. This model assumes that these cameras present symmetric radial distortions due to the symmetry of fisheye lenses. There are many projection models that follow this assumption. The most common are the stereographic, equidistance, equisolid, and orthogonal projections, which are respectively represented as follows.

r = 2 f tan(θ/2) (3.3)

r = f θ (3.4)

r = 2 f sin(θ/2) (3.5)

r = f sin(θ) (3.6)
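For concreteness, the snippet below evaluates these radial projection models for a given incidence angle; the focal length is an arbitrary illustrative value.

```python
import numpy as np

# Radial projection models of equations (3.3)-(3.6): image radius r as a function
# of the incidence angle theta, for a focal length f (illustrative value).
f = 300.0

def stereographic(theta): return 2.0 * f * np.tan(theta / 2.0)   # eq. (3.3)
def equidistance(theta):  return f * theta                        # eq. (3.4)
def equisolid(theta):     return 2.0 * f * np.sin(theta / 2.0)    # eq. (3.5)
def orthogonal(theta):    return f * np.sin(theta)                 # eq. (3.6)
def perspective(theta):   return f * np.tan(theta)                 # eq. (3.1), for comparison

theta = np.deg2rad(80.0)
for model in (perspective, stereographic, equidistance, equisolid, orthogonal):
    print(model.__name__, round(float(model(theta)), 1))
```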

Fig. 3.1 represents the fisheye camera model. It also shows the difference between projecting a 3-D scene point using the perspective camera model of equation 3.1 and using a radially symmetric model.

Figure 3.1: Difference between projecting a scene point X into x′ using a perspective model and projecting it into x using a radially symmetric model [30].

In practice, manufactured fisheye lenses present distortions that do not fit exactly into the models 3.3-3.6 due to imperfections in the manufacturing process. In order to handle these imperfections, a calibration has to be performed. The calibration techniques for fisheye cameras can be divided in two main groups: marker-based calibration and autocalibration [40].

Marker-based calibration methods use a pattern to obtain the relationship between 3-D world points and 2-D projected image points. In most cases, this pattern is a checkerboard. By capturing the pattern from a few different perspectives, these methods can obtain the lens distortion function and other intrinsic parameters such as the image center and the focal length.

Autocalibration methods do not require any known pattern to calibrate the camera. Instead, they search for point correspondences across images to calculate the camera intrinsic parameters. A possible and general approach to autocalibration is:

1. Find sufficient point correspondences in multiple images;

2. Given these correspondences, obtain a distorted 3-D reconstruction that can be related with the true one by a homography transformation;

3. Compute the homography matrix that restores the real 3-D structure from the distorted one. This matrix encodes the camera positions and intrinsic parameters.

3.2 Visual Odometry

VO comes up as a good alternative in GPS-denied environments or to conventional wheel odometry, because it does not depend on satellite signals and is not affected by wheel slippage. In this work, as referred before, monocular VO will be used. This means that only one camera will be used to estimate its motion.


3.2.1 Problem Statement

The main goal of VO is to track camera motion between consecutive frames. This motion can be represented as a homogeneous transformation defined as T_{k,k−1} = [R_{k,k−1} | t_{k,k−1}] ∈ R^{4×4}, where R_{k,k−1} ∈ SO(3) is the rotation matrix between two consecutive frames and t_{k,k−1} ∈ R^{3×1} is the respective translation. To calculate the absolute pose in reference to an initial pose P_0 = [X_0 Y_0 Z_0 θ_{x0} θ_{y0} θ_{z0}]^T, the calculated relative pose is integrated for every frame as follows.

T_k = T_{k−1} T_{k,k−1} (3.7)

Two main approaches exist in the literature to estimate T_{k,k−1}. The first computes features from the scene that it tries to match between frames to compute the relative pose. The second uses pixel intensities directly. These two methods are detailed below. Since the method used in this dissertation is feature-based, a more detailed description of this class will be given.
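A minimal sketch of this integration step, assuming the relative transformations are available as 4×4 homogeneous matrices:

```python
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 homogeneous transformation from a rotation matrix and translation vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def integrate(relative_poses):
    """relative_poses: iterable of 4x4 matrices T_{k,k-1}; returns the list of absolute poses T_k."""
    T = np.eye(4)                 # T_0: the initial pose is taken as the origin
    trajectory = [T]
    for T_rel in relative_poses:
        T = T @ T_rel             # equation (3.7): T_k = T_{k-1} T_{k,k-1}
        trajectory.append(T)
    return trajectory
```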

3.2.2 Feature-Based Methods

Feature-based methods use features from the environment that repeat over frames to track the motion. The main steps of this approach are feature detection, feature matching, motion estimation and, in some cases, bundle adjustment [47].

3.2.2.1 Feature Detection

As [17] states, in the feature-detection step, keypoints that may be present in other images are searched for in the current one. These points represent local features and must differ from their neighbors in terms of intensity, color and texture. The features most commonly searched for in computer vision are corners and blobs. A corner is defined by the intersection of at least two edges, while a blob is a point that differs from its neighborhood in terms of intensity, color and texture. A good feature detector is characterized by:

• localizing features accurately both in position and scale;
• redetecting a large number of features in consecutive frames;
• computational efficiency;
• distinctiveness, in order to facilitate the matching procedure;
• robustness to harsh conditions like noise and blur;
• invariance to illumination, rotation, scale and perspective distortions.

There are many feature detectors used in this field, such as FAST, Harris, Shi-Tomasi, SURF, SIFT, CENSURE and others. The first three are corner detectors while the others are blob detectors, each one with its pros and cons. For example, a corner detector is faster but less distinctive compared to a blob detector.


Feature detection is composed of two main stages. In the first, a feature-response function is applied to the entire image. In the second stage, non-maxima suppression is applied to the output of the previous step. A feature is found if a local minimum or maximum is returned by this approach. After computing all the features of the image frame, a description of each one is performed. To do so, the region surrounding each feature is converted into a compact descriptor that will be used in the matching process.
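The sketch below illustrates this detect-and-describe pipeline using OpenCV's ORB detector; ORB is only a convenient stand-in to show the idea, it is not the detector used by Libviso2, and the image path is hypothetical.

```python
import cv2

# Hypothetical input frame read as a grayscale image.
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints (response function + non-maxima suppression) and compute their descriptors.
orb = cv2.ORB_create(nfeatures=1000)
keypoints, descriptors = orb.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)
```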

3.2.2.2 Feature Matching

In this step, feature correspondences are searched for in consecutive frames. The easiest way to do so is to compare all the feature descriptors present in both images. A match is found for the best correspondence of descriptors. Even though this procedure is efficient at finding matches, it is computationally heavy for scenarios with a large number of features. So, other approaches such as search trees or hash tables are commonly used in order to perform the matching search faster.

3.2.2.3 Motion Estimation

After computing the feature matches present in consecutive images, VO enters the motion estimation step. This is a key and critical stage of VO. The problem consists in: given two sets of features f_{k−1} and f_k, compute the transformation between the images I_{k−1} and I_k. In monocular VO there are two main approaches to perform this task, considering the dimensional representation of the features: 2-D-to-2-D and 3-D-to-2-D.

2-D-to-2-D

In the 2-D-to-2-D case, f_{k−1} and f_k are specified in 2-D image coordinates. In order to estimate the relative transformation between two consecutive frames, epipolar geometry is used. The setup of this approach is represented in Fig. 3.2. This configuration is composed of two camera views with centers C and C′ observing a scene point X. The projections of X into the two camera planes λ and λ′ are x and x′, respectively. The intersections of the line that joins the two camera centers with the image planes are the epipoles, represented by e and e′. The epipolar plane π is defined by the lines that join X, C and C′. Finally, the epipolar lines l and l′ join x with e and x′ with e′. These lines can also be interpreted as the projection of XC into λ′ and of XC′ into λ. So, there is a direct mapping

x′ ↦ l

from a point x′ located in λ′ that belongs to XC′ to the epipolar line l [27]. In the same way, a point x located in λ and XC can be mapped to l′. In other words, there exists a direct mapping function between a point located in one camera plane and the epipolar line located in the other. This relation is encoded in the so-called Fundamental Matrix F as follows.

l′ = Fx (3.8)

Figure 3.2: Epipolar geometry setup [27].

Since x′ lies on l′, x′^T l′ = 0 and so

x′^T F x = 0. (3.9)

Equation 3.9 is called the epipolar geometry constraint. F is a matrix of rank 2 since it represents a direct mapping from the 2-D plane of the first image into the 1-D epipolar line in the second one. This matrix can be estimated using the Eight-Point Algorithm proposed by Longuet-Higgins in 1981 and extended by Hartley in 1995. This method assumes that 8 pairs of matches between the two images are available. Given this set, represented as x_i = [u_i v_i 1]^T, x′_i = [u′_i v′_i 1]^T, and solving equation 3.9 for the 8 points results in

$$
\begin{bmatrix}
u_1 u'_1 & v_1 u'_1 & u'_1 & u_1 v'_1 & v_1 v'_1 & v'_1 & u_1 & v_1 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
u_8 u'_8 & v_8 u'_8 & u'_8 & u_8 v'_8 & v_8 v'_8 & v'_8 & u_8 & v_8 & 1
\end{bmatrix}
\begin{bmatrix}
F_{11} \\ F_{12} \\ F_{13} \\ F_{21} \\ F_{22} \\ F_{23} \\ F_{31} \\ F_{32} \\ F_{33}
\end{bmatrix} = 0 \qquad (3.10)
$$

which is an equation of the form AF′ = 0, where F′ stacks the entries of F. So it can be solved using SVD, computing A = USV^T. The matrix F corresponds to a reshape of the column of V with the smallest singular value. In order to reject the matches that present high error (outliers), the iterative method RANSAC is used. In each iteration, it selects a random sample of dimension 8 from the total set of matches, computes the matrix F and calculates x′^T F x for all the matches. For a single match, if the result presents a low error, it is considered an inlier. The criterion to evaluate the error is given by a certain metric; one of the most used is the Sampson Distance. After computing all iterations, the largest set of inliers is taken and the final Fundamental Matrix is computed using this set, instead of just 8 matches.
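A compact sketch of the linear part of this procedure (the solution of equation 3.10 by SVD, followed by the rank-2 enforcement); in a full pipeline this estimate would be wrapped inside the RANSAC loop described above, running it once per random sample of 8 matches.

```python
import numpy as np

def eight_point(x, xp):
    """Unnormalized eight-point estimate of F. x, xp: (N, 2) arrays of matched pixels, N >= 8."""
    u, v = x[:, 0], x[:, 1]
    up, vp = xp[:, 0], xp[:, 1]
    # One row of A per correspondence, following equation (3.10).
    A = np.column_stack([u * up, v * up, up,
                         u * vp, v * vp, vp,
                         u, v, np.ones_like(u)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)          # singular vector with the smallest singular value
    # Enforce rank 2 by zeroing the smallest singular value of F.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```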

If the camera matrix K is known, equation 3.9 can be rewritten as

x′^T K^{−T} E K^{−1} x = x̂′^T E x̂ = 0 ⇔ E = K^T F K (3.11)

where x̂′ and x̂ are normalized image coordinates. So, calculating F and knowing K allows the extraction of E, which is called the Essential Matrix and encodes R_{k,k−1} and t_{k,k−1}. These matrices, and consequently the camera motion, can be extracted once again using SVD. However, four different solutions exist, which are represented as follows.

R_{k,k−1} = {UWV^T, UW^T V^T} (3.12)

t_{k,k−1} = U [0 0 ±1]^T (3.13)

where

$$
W^T = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3.14)
$$
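A possible implementation of this decomposition is sketched below; it returns the four (R, t) candidates of equations 3.12-3.13, and the valid one is selected by the triangulation test described next.

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates encoded in the essential matrix E."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:          # keep proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt  # both rotation hypotheses of eq. (3.12)
    t = U[:, 2]                        # translation direction, up to scale and sign (eq. 3.13)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```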

The correct projection matrix T_{k,k−1} is chosen by triangulating the matches to obtain their 3-D coordinates up to a scale factor. The solution for which all the z components are positive is the one that correctly encodes the motion. It is worth noting that this z component is obtained up to a scale factor, since the motion is recovered from the essential matrix, which has only 5 degrees of freedom. This phenomenon represents the so-called scale ambiguity. The triangulation can be performed using a linear method. For example, considering the matrix P that projects a scene point X into a pixel point x = [u v]^T, and a matrix P′ that does the same for x′ = [u′ v′]^T in the next frame, we have

x = PX (3.15)

x′ = P′X (3.16)

which implies that x × (PX) = 0 and x′ × (P′X) = 0. Joining these two equations and solving the cross product, an equation of the form ΣX = 0 is obtained, with Σ = [x × P  x′ × P′]^T. So, one more time, it can be solved using SVD. Performing the cross product we get

$$
\Sigma = \begin{bmatrix} u P_3 - P_1 \\ v P_3 - P_2 \\ u' P'_3 - P'_1 \\ v' P'_3 - P'_2 \end{bmatrix} \qquad (3.17)
$$

where P_j represents the jth row of P, and likewise for P′_j. The 3-D point X is then computed by solving this system with SVD.
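The corresponding linear triangulation can be sketched as follows, with P and P′ the two 3×4 projection matrices:

```python
import numpy as np

def triangulate(P, Pp, x, xp):
    """Linear triangulation of one match: builds Sigma (eq. 3.17) and solves Sigma X = 0 with SVD."""
    u, v = x
    up, vp = xp
    Sigma = np.vstack([u * P[2] - P[0],        # P[2] is the third row P_3, etc.
                       v * P[2] - P[1],
                       up * Pp[2] - Pp[0],
                       vp * Pp[2] - Pp[1]])
    _, _, Vt = np.linalg.svd(Sigma)
    X = Vt[-1]                                 # right singular vector of the smallest singular value
    return X[:3] / X[3]                        # homogeneous -> Euclidean 3-D point
```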


3-D-to-2-D

In 3-D-to-2-D motion estimation methods, the features in the previous frames are specified in 3 dimensions while the ones in the current frame are in 2 dimensions. So, in this case T_{k,k−1} is computed by matching 3-D and 2-D features, represented as X_{k−1} and x_k, respectively [47]. In monocular VO, X_{k−1} is estimated by triangulation of the image points x_{k−1} and x_{k−2}, so this method requires three consecutive image frames to estimate motion. This approach computes T_{k,k−1} by minimizing the reprojection error defined as

arg min_{T_{k,k−1}} Σ_i ||x^i_k − x̂^i_{k−1}||² (3.18)

where x̂^i_{k−1} is the reprojection of X^i_{k−1} into the image I_k using the transformation T_{k,k−1} [47]. This problem is called perspective from n points (PnP). In the monocular case, normally 3 points are used, so the method is called P3P.
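As an illustration of the cost in equation 3.18, the function below evaluates the summed squared reprojection error of a candidate transformation T; a PnP/P3P solver searches for the T that minimizes it. The arguments K, X and x are assumed inputs (intrinsics, triangulated 3-D points and their observed pixels).

```python
import numpy as np

def reprojection_error(T, K, X, x):
    """Sum of squared reprojection residuals (eq. 3.18).
    T: 4x4 candidate pose, K: 3x3 intrinsics, X: (N, 3) 3-D points, x: (N, 2) observed pixels."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous 3-D points
    proj = (K @ (T @ Xh.T)[:3]).T                    # project into the current frame
    proj = proj[:, :2] / proj[:, 2:3]
    return np.sum(np.linalg.norm(proj - x, axis=1) ** 2)
```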

3.2.2.4 Bundle Adjustment

Many VO methods perform a refinement of the pose estimation by minimizing some cost function. In most cases this step is called Bundle Adjustment. As the name suggests, this optimization process considers the 'bundles' of rays coming from each 3-D feature and converging on the camera center of each image frame, which are 'adjusted' optimally taking into account both feature and camera positions [53]. Normally this process considers a set of n consecutive frames and performs a parameter optimization of camera poses and 3-D features for this set [17]. To do so, it minimizes the following cost function.

arg min_{X^i, T_k} Σ_{i,k} ||x^i_k − g(X^i, T_k)||² (3.19)

where x^i_k is the image position of the ith 3-D feature X^i measured in the kth image, and g(X^i, T_k) is the corresponding reprojection according to T_k. This non-linear cost function is usually solved using the Levenberg-Marquardt algorithm. Bundle Adjustment provides robustness to VO methods. In fact, those which use it present lower levels of drift due to the use of a window of images instead of only two of them.

3.2.3 Appearance-Based Methods

Direct methods use information directly from image pixels to estimate camera pose [51]. In other words, these methods take raw camera images as input and perform a photometric optimization by minimizing the error resulting from image alignment with respect to pixel intensities. As an example of a way to calculate this error, equation 3.20 represents the minimization of the disparity photometric error described in the Semi-Dense VO approach of Engel et al. [13].

λ* = min_λ (i_ref − I_p(λ))² (3.20)

Here, λ represents the disparity, i_ref the reference intensity and I_p(λ) the image intensity on the epipolar line for a given disparity.

As [11] describes, direct methods can be classified according to their map density as:

• Sparse: use only specific and independent pixels of the image, usually called keypoints;
• Dense: use and reconstruct all pixels present in the 2-D image;
• Semi-dense: do not use all image points, but a large subset of them.

Direct methods are, in general, less accurate than feature-based methods and are computationally more expensive [47]. However, many functional and robust appearance-based methods are present in the literature, as will be described in chapter 4.

3.2.4 Visual SLAM

VO is often confused with the concept of vSLAM. As referred before, VO tracks the camera motion by computing the relative homogeneous transformation between consecutive frames. Additionally, some VO methods also map small portions of the captured scene. In this context, there is a fine line between VO and vSLAM. SLAM is a technique that estimates a sensor's motion and reconstructs the structure of the surrounding environment. vSLAM uses a camera as a sensor to obtain visual information from the environment and perform SLAM tasks. The main difference between the two techniques is that vSLAM performs a larger scale mapping with a global map optimization [51]. This optimization consists in suppressing the accumulated error of the mapping process by considering the global consistency of the map. Also, the map is refined if a closed loop is observed, i.e., if a region of the map is revisited. If this happens, the error accumulated since the beginning of the mapping procedure can be calculated and a loop constraint is used to cancel it. This technique is called loop closing. To find a closed loop, the current image is matched with the previous ones. If a match is observed, the accumulated error can be calculated.

3.2.5 Challenges in Visual Odometry

Tracking motion with a single camera is a real challenge. In monocular VO two main problems appear: scale ambiguity and pure rotations. While stereo systems can estimate scale by using information from both cameras, knowing the baseline transformation between them, monocular VO systems are not able to do so when the camera motion presents no constraints [4]. So, if no external sources of information are used, motion is estimated only up to a scale factor. Also, most monocular VO methods present high difficulties dealing with pure rotations. Many of them implement relocalization algorithms to prevent definitive loss of motion tracking in these harsh rotations.


3.2.6 Sensors Integration

Typically, VO system accuracy increases when it is combined with sensors such as IMUs or distance sensors. So, a brief description of their integration in VO is given here.

3.2.6.1 Common Sensors

An inertial sensor is a device that exploits resistance to a change in momentum to sense some kind of variation in a body's motion [7]. An IMU is an electronic device composed of accelerometers, gyroscopes and, in most cases, a magnetometer. So, this device is able to sense and output two relative measures: linear acceleration using the accelerometers and angular velocity using the gyroscopes. In the presence of magnetometers, this device is also able to output its absolute orientation by measuring the surrounding magnetic field. IMUs are well known for being present in aircraft systems, measuring variations in terms of roll, pitch and yaw. They are also used as a support in GPS-based systems, where they continue estimating the object pose when it reaches environments where satellite signals are unavailable. This kind of sensor is widely used in the VO field to help solve scale ambiguity and pure rotations. Besides IMUs, planar lasers or 3-D laser scanners such as LiDARs are also often fused with VO. These distance sensors are usually used to determine the unknown scale factor associated with the camera motion. As sensors always have associated noise, it is convenient to have a model of this noise when fusing sensor data with VO. To do so, the KF is usually used.

3.2.6.2 The Kalman Filter

The KF is an estimator widely used to fuse several data sources. This algorithm estimates the value of unknown variables, called state variables, based on observed measurements that contain noise. It can be divided in two main phases: prediction and correction. In the prediction step, a state model is considered as follows:

x̂_{k+1|k} = A x̂_{k|k} + B u_{k+1} + ω_{k+1} (3.21)

where x̂_{k+1|k} is the state vector of the process, u_{k+1} the control vector and ω_{k+1} the state noise. Typically, in VO systems the state model is non-linear, because it is modeled using either odometry equations, homogeneous transformation matrices or quaternions. In these cases an EKF is required, which is similar to the KF but considers a non-linear state model. This can be represented as follows:

x̂_{k+1|k} = f(x̂_{k|k}, u_{k+1}) + ω_{k+1} (3.22)

with f non-linear. The correction step uses observations (usually sensor measurements) to correct the state model predictions. To do so, the observations have to be modeled, and the model can be either linear

ẑ_{k+1} = H x̂_{k+1|k} + v_{k+1} (3.23)

or non-linear.

ẑ_{k+1} = h(x̂_{k+1|k}) + v_{k+1} (3.24)

Once again, the EKF is used for non-linear models. Of course, the noise terms ω and v are unknown. So, in order to include both process and observation noise in the filter, two matrices are specified: Q and R. These matrices represent the process and observation noise covariance matrices, respectively. They are normally dynamic, because the noise is expected to be a function of some variable that depends on the instantaneous conditions of the system. That being said, the predictions are corrected via equation 3.25.

x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1}(z_{k+1} − ẑ_{k+1}) = x̂_{k+1|k} + K_{k+1} z̃_{k+1} (3.25)

Intuitively, this equation means that the prediction x̂_{k+1|k} is corrected more strongly the larger the discrepancy between the observations and the observation model. Here, the Kalman gain K imposes a weighted mean between these two factors. The way this matrix is calculated implies that the smaller the observation noise, the larger the magnitude of K. So, K can be viewed as how much the estimates are changed by the measurements, and this can be controlled by the noise covariance matrices.

In short, this filter allows the fusion of multiple sources in order to obtain more reliable results. It only considers two time instants and it is fast enough to run in real time. Thus, the KF is a reliable solution in localization and navigation systems.
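To make the prediction/correction cycle concrete, the sketch below implements a two-state linear KF that integrates a gyroscope angular velocity and corrects it with the rotation reported by VO while estimating the gyroscope bias; it is only an illustration of equations 3.21-3.25 under these assumptions, not the filter developed later in this dissertation.

```python
import numpy as np

def kf_step(x, P, omega_gyro, dt, z_vo, Q, R):
    """One prediction/correction cycle. State x = [yaw, gyro_bias], covariance P (2x2).
    omega_gyro: gyroscope reading, z_vo: yaw reported by VO, Q/R: process/observation noise."""
    A = np.array([[1.0, -dt],        # yaw_{k+1} = yaw_k + (omega - bias) * dt
                  [0.0, 1.0]])       # the bias is modeled as (nearly) constant
    B = np.array([dt, 0.0])
    # Prediction (eq. 3.21)
    x = A @ x + B * omega_gyro
    P = A @ P @ A.T + Q
    # Correction (eq. 3.25) with a linear observation model z = H x + v, H = [1 0]
    H = np.array([[1.0, 0.0]])
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ (np.array([z_vo]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P
```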

3.3 Parallel Computing

The default approach to software development is serial computing, in which instructions are computed sequentially, one at a time, in a single processor. However, many processes can be optimized if a parallel approach is adopted. This approach, called parallel computing, consists in using multiple resources simultaneously to solve a computational problem [5]. It breaks the problem into several parts that are executed concurrently in different processors, usually called cores. Current-day computers often allow simultaneous access to both CPU and GPU, which enables parallel computing. Parallel computing can be performed using general CPUs but also using GPUs, which are graphics processing units capable of executing multiple processes in parallel.

3.3.1 Main Architectures

There are many ways to classify parallel computing systems. However, the most used classification is Flynn's Taxonomy, in use since 1966. It classifies multi-processor computer architectures along two dimensions: instruction stream and data stream. This results in a total of 4 possible architectures, namely:

• SISD - Single Instruction Single Data;
• SIMD - Single Instruction Multiple Data;
• MISD - Multiple Instruction Single Data;
• MIMD - Multiple Instruction Multiple Data.

The first refers to traditional computers that perform serial computing: one instruction stream is executed by the CPU and one data stream is used as input per clock cycle. Single Instruction Multiple Data is a type of parallel computing where one instruction is executed by all the processing units, but each one can operate over a different data element simultaneously. Multiple Instruction Single Data is the most uncommon architecture; here, each processing unit executes separate instructions on the same data stream. Finally, Multiple Instruction Multiple Data refers to a type of parallel computing where multiple processing units operate on multiple data streams independently. This configuration is the most common parallel computing technique and is widely used in modern supercomputers.

Memory organization is also a criterion to classify computer architectures [1]. This classification distinguishes multi-node systems with distributed memory from multiprocessor systems with shared memory. In multi-node systems, distributed processors are connected over a network. Each one has its own local memory, which is shared over the network with the other nodes. The multiprocessor architecture is characterized by having multiple processors either physically connected to the same memory or sharing the memory using a low-latency link [1]. Usually, these multiprocessors consist of a single chip with multiple cores. This configuration is known as multicore. Derived from this concept arises many-core, a configuration with a very high number of cores. This architecture can be found in many GPUs. These devices support all the architectures previously described and are thus highly attractive in the parallel computing field.

3.3.2 Heterogeneous Computing

Nowadays, computers include processing elements complementary to CPUs. Typically, GPUs are used for this purpose. These devices are increasingly powerful, enabling their applicability in parallel computing processes. This allowed the rise of heterogeneous computing, which is characterized by a system that uses more than one type of processor. A standard example of a heterogeneous compute node is two multicore CPUs connected to two or more many-core GPUs. This configuration has the CPU as host and the GPU as a co-processor. The latter is never a standalone processing unit and is connected to the CPU through a PCI-Express bus, as represented in Fig. 3.3.


A heterogeneous application is constituted by two implementation blocks: host code and device code. The former runs on the CPU and the latter on the GPU. The application is initialized on the CPU, which is the main entity responsible for it and which provides the data that the co-processor uses to compute the processes in parallel.
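The sketch below shows this host/device split for a trivial vector addition, using the pyopencl bindings (an assumption made for illustration; the dissertation's GPU work targets the Broadcom VideoCore IV through a different access path, described in chapter 9).

```python
import numpy as np
import pyopencl as cl

# Device code: a kernel executed by many work-items in parallel on the GPU.
kernel_src = """
__kernel void vadd(__global const float *a, __global const float *b, __global float *c) {
    int i = get_global_id(0);   // one work-item per array element
    c[i] = a[i] + b[i];
}
"""

# Host code: the CPU sets up the context, buffers and kernel launch.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
program = cl.Program(ctx, kernel_src).build()

a = np.random.rand(1 << 20).astype(np.float32)
b = np.random.rand(1 << 20).astype(np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program.vadd(queue, a.shape, None, a_buf, b_buf, c_buf)   # enqueue the kernel on the device
c = np.empty_like(a)
cl.enqueue_copy(queue, c, c_buf)                           # copy the result back to the host
```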


Chapter 4

State of the Art

This chapter presents the state of the art of the main concepts of this dissertation. Firstly, it describes the main existing omnidirectional camera calibration tools. Then, VO is described in terms of the main methods present in the literature, the omnidirectional approaches to it, the main methods to solve the key problems of VO and, finally, VO in agriculture. Lastly, a brief description of the programming tools used is given.

4.1 Omnidirectional Camera Calibration

Omnidirectional camera calibration methods are not abundant in the literature. Kannala et al. proposed two different methods. The first is marker-based and is suitable for fisheye cameras [29]. They propose a generic model for this type of camera and a calibration procedure that can be performed using one or more views of a planar calibration object. They also propose an autocalibration method for central cameras that requires two views of a scene to compute point correspondences [31]. These correspondences are used to estimate the intrinsic and extrinsic camera calibration parameters by minimizing the angular error. It is applicable to fisheye and catadioptric cameras. OpenCV also has a toolbox to calibrate fisheye cameras¹ that uses images of a checkerboard in different views. In this dissertation, the focus will be on the omnidirectional camera calibration method proposed by Davide Scaramuzza et al., which is described below.

4.1.1 Omnidirectional Camera Calibration Toolbox by Davide Scaramuzza

Davide Scaramuzza et al. present the main omnidirectional camera calibration method in the literature [45], which is coupled with a Matlab toolbox [48]. The method is marker-based and only requires the camera to observe a planar pattern in different perspectives. It does not require prior knowledge of the camera motion nor a model of the omnidirectional sensor. It assumes that the image projection function can be described as a Taylor series expansion and uses a two-step least squares linear minimization technique to determine the coefficients of the series. The method is well suited to calibrating both catadioptric and dioptric cameras, since it considers a unified omnidirectional camera model that represents both camera types. The model can be described as follows. Let us consider a scene point X, its projection x'' = [u'' v'']^T on the sensor plane (a hypothetical plane orthogonal to the mirror/lens axis) and its projection x' = [u' v']^T on the camera plane (which coincides with the camera CCD). The last two are related by an affine transformation, x'' = Ax' + t. Considering also the image projection function g, which relates a point x'' in the sensor plane to the vector υ that emanates from the viewpoint towards the scene point X, the camera model is written as

\[ \lambda\,\upsilon = \lambda\, g(\mathbf{x}'') = \lambda\, g(A\mathbf{x}' + \mathbf{t}) = P\mathbf{X}, \qquad \lambda > 0 \tag{4.1} \]

where X ∈ R^4 is expressed in homogeneous coordinates, P ∈ R^{3×4} is the perspective projection matrix and λ is the depth factor. So, in order to be able to reconstruct from each pixel the direction of the respective scene point in the world, a calibration procedure is performed. This consists in estimating the matrices A and t and the non-linear function g so that all the vectors g(Ax' + t) satisfy the projection equation (4.1). The function g is assumed to take the following form.

\[ g(u'', v'') = \begin{bmatrix} u'' \\ v'' \\ f(u'', v'') \end{bmatrix} \tag{4.2} \]

Function f is designed so that it can describe different kinds of sensors. It is represented by the following polynomial.

\[ f(u'', v'') = a_0 + a_1 r + a_2 r^2 + \dots + a_N r^N \tag{4.3} \]

The coefficients a_i, i ∈ {0, 1, ..., N}, of the polynomial of order N are the model parameters that the calibration intends to determine, and r is the radial distance of x'' from the sensor axis. This being said, equation (4.1) can be rewritten as follows.

\[ \lambda\,\upsilon = \lambda\, g(A\mathbf{x}' + \mathbf{t}) = \lambda \begin{bmatrix} A\mathbf{x}' + \mathbf{t} \\ f(u'', v'') \end{bmatrix} = P\mathbf{X}, \qquad \lambda > 0 \tag{4.4} \]

In the calibration procedure, to reduce the number of parameters to be estimated, the matrices A and t are estimated up to a scale factor α. After determining these matrices, an image point x' in pixels is related to x'' as x'' = αx'. Substituting this relation in equation (4.4) yields the following projection equation

\[ \lambda\,\upsilon = \lambda\, g(\alpha\mathbf{x}') = \lambda\,\alpha \begin{bmatrix} u' \\ v' \\ a_0 + a_1 r' + \dots + a_N r'^N \end{bmatrix} = P\mathbf{X}, \qquad \lambda, \alpha > 0 \tag{4.5} \]

where f(u', v') = a_0 + a_1 r' + ... + a_N r'^N, [u' v']^T are the pixel coordinates of an image point with respect to the image center, and r' = \sqrt{u'^2 + v'^2} is the Euclidean distance of the pixel point to the center.
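To make the back-projection implied by equation (4.5) concrete, the following is a minimal sketch (not the actual code of the toolbox [48]) that maps a pixel to the unit ray leaving the viewpoint, under the simplifying assumption that the affine transformation A is the identity, so only the image center (c_u, c_v) is needed; the function and variable names are hypothetical.

#include <array>
#include <cmath>
#include <vector>

// Sketch only: back-project a pixel (u, v) to a unit direction vector using the
// polynomial model f(r') = a_0 + a_1 r' + ... + a_N r'^N of equation (4.5).
// Assumes A is the identity, so only the image center (c_u, c_v) is subtracted.
std::array<double, 3> pixelToRay(double u, double v, double c_u, double c_v,
                                 const std::vector<double> &a) {
    // Pixel coordinates relative to the image center.
    const double up = u - c_u;
    const double vp = v - c_v;
    const double r = std::sqrt(up * up + vp * vp);

    // Evaluate the polynomial f(r) with Horner's rule.
    double f = 0.0;
    for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i)
        f = f * r + a[i];

    // Direction of the ray that emanates from the viewpoint towards the scene
    // point, normalized to unit length.
    const double norm = std::sqrt(up * up + vp * vp + f * f);
    return {up / norm, vp / norm, f / norm};
}

With coefficients a_0, ..., a_N obtained from the calibration, this is the kind of pixel-to-ray mapping that the model makes available for every pixel of the image.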


Thus, the model parameters α and a_0, a_1, ..., a_N still need to be estimated. To determine these parameters, a planar pattern is captured by the camera in different views that are related by a rotation matrix R and a translation τ; these two matrices correspond to the extrinsic camera parameters. Let X_ij = [X_ij Y_ij Z_ij]^T be the 3-D coordinates of the calibration pattern in its own coordinate system and x_ij = [u_ij v_ij]^T the corresponding pixels in the image plane; since the pattern is planar, Z_ij = 0. Thus, equation (4.1) can be rewritten as follows.

\[ \lambda_{ij} \begin{bmatrix} u_{ij} \\ v_{ij} \\ a_0 + \dots + a_N r_{ij}'^N \end{bmatrix} = P_i \mathbf{X} = \begin{bmatrix} \mathbf{r}_{1,i} & \mathbf{r}_{2,i} & \mathbf{r}_{3,i} & \boldsymbol{\tau}_i \end{bmatrix} \begin{bmatrix} X_{ij} \\ Y_{ij} \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{r}_{1,i} & \mathbf{r}_{2,i} & \boldsymbol{\tau}_i \end{bmatrix} \begin{bmatrix} X_{ij} \\ Y_{ij} \\ 1 \end{bmatrix} \tag{4.6} \]

This last equation shows that the extrinsic parameters have to be determined for each calibration pattern pose in order to solve for the calibration parameters.

4.2 Visual Odometry

In this section, the main VO and vSLAM methods present in the literature are briefly described. Table 4.1 presents a summary of their main characteristics.

4.2.1 Main Visual Odometry Methods

Method            | Type             | Camera system | Mapping    | Optimization           | Loop closure | Relocalization
DSO [11]          | Appearance-based | Both          | Sparse     | Gauss-Newton           | No           | No
LSD-SLAM [12]     | Appearance-based | Monocular     | Semi-dense | Pose-graph             | Yes          | Yes
DTAM [37]         | Appearance-based | Monocular     | Dense      | No                     | No           | Yes
PTAM [33]         | Feature-based    | Monocular     | Sparse     | BA                     | No           | Yes
SVO [15] [16]     | Semi             | Monocular     | Sparse     | No                     | No           | Yes
LibViso2 [23]     | Feature-based    | Both          | Dense      | Gauss-Newton + RANSAC  | No           | No
ORB-SLAM [36]     | Feature-based    | Both          | Sparse     | BA and Pose-graph      | Yes          | Yes
MonoSLAM [10] [9] | Feature-based    | Monocular     | Sparse     | No                     | No           | No

Table 4.1: Key characteristics of the main VO methods present in the literature.

MonoSLAM, developed by Davison et al. [10], was the first successful vSLAM method and it was created in 2003. This approach is based on a probabilistic feature-based map that represents the estimates of the state of the camera, the features of interest and also the uncertainty of these estimates at a given instant. This map is continuously updated along with the camera motion using an EKF. The prediction model is based on uniform camera motion, so the state vector is composed of the camera's 3-D position, orientation, linear and angular velocity, and of the extracted feature points. This means that when new features are detected the state vector grows, which increases the computational cost proportionally to the size of the environment. This method looks for salient, repeatable image regions to use as landmarks. To initialize the map, a known object placed in front of the camera is observed and a global coordinate system is defined.
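As a purely illustrative sketch of how such a growing EKF state could be laid out (not MonoSLAM's actual data structures; the names are hypothetical):

#include <array>
#include <cstddef>
#include <vector>

// Illustration only: the filter state holds the camera pose and velocities plus
// one 3-D entry per mapped feature, so the state dimension (and the cost of the
// EKF covariance update) grows with the number of landmarks in the environment.
struct EkfState {
    std::array<double, 3> position;               // camera 3-D position
    std::array<double, 4> orientation;            // orientation quaternion
    std::array<double, 3> linearVelocity;         // assumed-uniform motion model
    std::array<double, 3> angularVelocity;
    std::vector<std::array<double, 3>> features;  // mapped landmarks
};

// 13 camera parameters plus 3 per feature.
inline std::size_t stateDimension(const EkfState &s) {
    return 13 + 3 * s.features.size();
}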


To reduce the computational cost present in MonoSLAM, PTAM [33] splits tracking and mapping into two different threads that run in parallel. This way, tracking is not affected by the higher computational cost of mapping. This method introduces the concept of keyframes in the SLAM field and uses it in the mapping thread. Here, an input frame is considered a keyframe if a large disparity between it and the existing keyframes is observed. Also, PTAM is the first SLAM algorithm that uses Bundle Adjustment as its optimization method. Bundle Adjustment is used here to optimize features locally with some keyframes and globally with all the keyframes associated with the map. Tracking is performed by matching the projected map feature points with the input image feature points. A relocalization system is also present in this method: when tracking is lost, recovery is done by a randomized tree-based search.

ORB-SLAM [36] is the most complete vSLAM system present in the literature, as can be observed in table 4.1. This method is based on PTAM's main ideas, so tracking and mapping are also executed in parallel CPU threads. This method uses the same type of feature, ORB features, for all the modules: tracking, mapping, loop closing and relocalization. This makes the approach reliable, simple and efficient. Tracking and mapping are independent of the global map size because a covisibility graph is used, which allows these two tasks to focus only on a (covisible) portion of the map. Loop closing is performed using pose-graph optimization, where the graph is built using a spanning tree. The relocalization algorithm is based on an interesting approach called bag-of-words place recognition. Here, a visual vocabulary is created offline as a descriptor of the environment. These words are stored in a database and associated with stored keyframes. The database returns the keyframes that are successfully matched. Figure 4.1 represents the ORB-SLAM system overview.

Figure 4.1: ORB-SLAM system overview [36].


Libviso2 [23] is a feature-based VO method that supports both monocular and stereo camera systems. The stereo version's main goal is to reconstruct the environment and track the motion in real time using a single CPU. It is divided into four stages: sparse feature matching, egomotion estimation, dense stereo matching and 3-D reconstruction. Feature matching is done between four images: the current and previous right and left images. The procedure consists in filtering the four images with corner and blob masks and performing non-maximum and non-minimum suppression on the filtered frames. The matching is done in a circle, i.e., it starts and ends in the current left image, passing through all the others. A circle match is accepted if the last feature of the circle is the same as the first one. Camera motion is computed from the circular feature matches by minimizing the reprojection error, and the resulting velocity is refined using a KF. The 3-D reconstruction is performed using a greedy approach which associates pixels by reprojecting them in the next frame. This method also has a monocular version. In this, the same feature matching approach is used but with only two frames: the previous and the current ones. It uses the epipolar geometry method described in section 3.2.2.3 to compute the relative motion between frames. To solve the monocular scale ambiguity it assumes that the camera height and pitch are fixed and uses these values, along with ground plane estimation, to calculate the scale.
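As a rough illustration of this scale recovery idea (a simplified sketch under the stated assumptions, not Libviso2's actual implementation; the names are hypothetical and the fixed pitch is ignored for brevity), the known camera height can be compared with the unscaled height of triangulated ground points:

#include <algorithm>
#include <array>
#include <vector>

// Sketch only: monocular reconstructions are known up to scale, so the ratio
// between the real, fixed camera height and the (unscaled) estimated height of
// the ground plane gives the missing scale factor.
// Assumes the camera Y axis points roughly towards the ground.
double recoverScale(const std::vector<std::array<double, 3>> &groundPoints,
                    double knownCameraHeight) {
    if (groundPoints.empty())
        return 1.0;  // no ground points: keep the unscaled estimate

    // Unscaled camera height: median Y coordinate of the points assumed to lie
    // on the ground plane (a crude stand-in for a proper plane fit).
    std::vector<double> heights;
    heights.reserve(groundPoints.size());
    for (const auto &p : groundPoints)
        heights.push_back(p[1]);
    std::nth_element(heights.begin(),
                     heights.begin() + heights.size() / 2, heights.end());
    const double estimatedHeight = heights[heights.size() / 2];

    // Scale that brings the reconstruction to metric units.
    return knownCameraHeight / estimatedHeight;
}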

Moving to appearance-based methods, DTAM [37], proposed by Newcombe et al., is a fully direct method. In this method the tracking is done by comparing the input frames with synthetic images generated from the reconstructed map. As described in table 4.1, the mapping is dense. Here, depth is estimated by minimizing a global energy which is the sum of the photometric error and robust spatial regularisation terms. It should also be emphasized that this algorithm has optimized versions that run on mobile phones.

Another well-known direct method is LSD-SLAM [12]. In contrast to DTAM, which reconstructs full areas, this algorithm is based on semi-dense VO, so the reconstruction is limited to certain areas, characterized by a high intensity gradient. As represented in figure 4.2

Figure 4.2: LSD-SLAM system overview [12].

this method contains three main modules: tracking, depth map estimation and map optimization. In the tracking component, new camera images are continuously tracked; the camera pose is estimated with respect to the current keyframe, taking the previous frame as initialization. The depth map estimation module takes the tracked frames and decides whether they are keyframes or not. If not, they are used to refine the current keyframe. When a keyframe is replaced it is stored into the


global map using the map optimization module. In this module, loop closures and scale drift are detected.

DSO [11] is another highly efficient fully direct VO method, proposed by Engel et al. This method is sparse, so it has a limited number of active points. This is due to the consideration of redundancy in images, i.e., image data is considered highly redundant, so there is no need to use an excess of image points. Here, the input image is divided into a grid and points with high intensity gradient are candidates for map reconstruction. The motion is tracked via minimization of the photometric error. In this approach, both geometric and photometric calibration are considered. The latter gives information about lens attenuation, gamma correction and exposure times, which considerably improves the odometry. The optimization is performed in a sliding window, so camera poses that are not in the window are marginalized.
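For illustration, a generic form of such a photometric error (not necessarily the exact weighted formulation used in [11]) can be written, for a set of selected points P of a reference image I_i observed again in an image I_j, as

\[ E_{\text{photo}} = \sum_{\mathbf{p} \in \mathcal{P}} \left\| I_j\!\left(\pi\!\left(T_{ji}\, \pi^{-1}(\mathbf{p}, d_{\mathbf{p}})\right)\right) - I_i(\mathbf{p}) \right\|_{\gamma} \]

where d_p is the depth associated with point p, π and π^{-1} are the camera projection and back-projection functions, T_ji is the relative camera pose and ||·||_γ is a robust norm (e.g. Huber). Minimizing E_photo with respect to T_ji (and, during mapping, with respect to the depths) is what tracking via photometric error minimization amounts to.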

A semi-direct approach, called SVO [15], is proposed by Forster et al. This method

Figure 4.3: SVO system overview [15].

is semi-direct because tracking is done by feature matching but mapping is performed using an appearance-based approach. Once the initial estimate of the camera pose and the feature extraction are established, the algorithm proceeds using only feature points. In this method, camera motion is calculated by minimizing the photometric errors around feature points. By analysis of Fig. 4.3

we can see that feature extraction is only done when a new keyframe is selected to initialize new 3-D points. This optimizes the runtime because feature extraction is not performed in every frame.
