A Multi-Frame Adaptive 3D Non-Rigid Registration for Augmented Reality


Universidade Federal da Bahia

Universidade Salvador

Universidade Estadual de Feira de Santana

DOCTORAL THESIS

An Adaptive Approach to Real-Time 3D Non-Rigid Registration

Antonio Carlos dos Santos Souza

Programa Multiinstitucional de Pós-Graduação em Ciência da Computação (PMCC)

Salvador

December 19, 2014

ANTONIO CARLOS DOS SANTOS SOUZA

AN ADAPTIVE APPROACH TO REAL-TIME 3D NON-RIGID

REGISTRATION

Thesis presented to the Programa Multiinstitucional de Pós-Graduação em Ciência da Computação of the Universidade Federal da Bahia, Universidade Estadual de Feira de Santana and Universidade Salvador, in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Advisor: Antônio Lopes Apolinário Júnior

Salvador


Cataloging record. Souza, Antonio Carlos dos Santos

An Adaptive Approach to Real-Time 3D Non-Rigid Registration / Antonio Carlos dos Santos Souza. Salvador, December 19, 2014.

65 p.: ill.

Advisor: Antônio Lopes Apolinário Júnior.

Thesis (doctorate), Universidade Federal da Bahia, Instituto de Matemática, December 19, 2014.

1. Non-rigid registration 2. Adaptive algorithms 3. Augmented reality.

I. Apolinario, Antônio Lopes. II. Universidade Federal da Bahia. Instituto de Matemática. III. Title.

CERTIFICATE OF APPROVAL

ANTONIO CARLOS DOS SANTOS SOUZA

AN ADAPTIVE APPROACH TO REAL-TIME

3D NON-RIGID REGISTRATION

This thesis was judged adequate for obtaining the degree of Doctor in Computer Science and was approved in its final form by the Programa Multiinstitucional de Pós-Graduação em Ciência da Computação of UFBA-UEFS-UNIFACS.

Salvador, December 19, 2014

Prof. Dr. Antônio Lopes Apolinário Júnior
Universidade Federal da Bahia

Prof. Dr. Vinicius Moreira Mello
Universidade Federal da Bahia

Prof. Dr. Thales Miranda de Almeida Vieira
Universidade Federal de Alagoas

Prof. Dr. Ricardo Farias
Universidade Federal do Rio de Janeiro

Prof. Dr. Luiz Marcos Garcia Gonçalves
Universidade Federal do Rio Grande do Norte

ACKNOWLEDGEMENTS

First, I would like to thank God for all the blessings given during my journey.

This work was made possible by the enthusiastic support, suggestions, encouragement, and guidance of many individuals. I am greatly indebted to my academic advisor, chair and director of this work, Prof. Dr. Antonio Lopes Apolinário Jr, for instilling in me the joy of conducting outstanding research in computer graphics. These six years have been intense in my career. Your support and vigilance have allowed me to achieve results that I couldn’t have thought of.

Thank you so much, Committee, for the direction, feedback, and all the enlightening advice. Thank you Prof. Dr. Gilson Giraldi, Prof. Dr. Vinícius Mello and Prof. Dr. Perfilino Ferreira.

Thank you Prof. Dra. Lynn Alves for the unforgettable times during my master’s degree. Furthermore, I would like to acknowledge my friend Márcio Cerqueira de Farias Macedo for the great partnership and his awesome markerless augmented reality environment for on-patient medical data visualization. Working with you has been a wonderful experience and a great source of inspiration. I really wonder how my thesis would be without this environment.

Many individuals also provided support in myriad ways. Special thanks go to Aline Machado, Sabrina, Osmar, Rosalba, Thalles Caribé, Prof. Dr. Eduardo Telmo, Prof. Dr. Lurimar, Prof. André, Prof. Jowaner, Prof. Cesar, Lilia, Prof. Dr. Marcelo Veras, Prof. Dr. Jairo Dantas, Bruno, Leo, Everton, Toninho, Dona Vilma, Marilene (Fortona), Dona Mary, Sr. Deja, Cita, Fabiana, Carol, Janio, Eliakin, Igor, Rodrigo, Jony, Katia, Anderson, Leandro, Edilson and Rita.

I am also much indebted to the insightful discussions and fun times with all the Labrasoft friends: Luiz Cláudio Machado, Valentim, Romilson, Ronaldo, Antonio Maurício, Simone, Josildo, Marcelo, Felipe, Amilton, Vanessa, Diego, Letícia, Luiz Henrique, Pedro, Jorge, Fabiano and Aderbal.

Finally, I would like to acknowledge my family for their constant support and encouragement during this graduate journey: Antonio Porfírio, Reinaldo, Ricardo, Maisa, Tissi, Lucia, Danilo, my uncles Zé Ribeiro, Manoel, Walter and Carlito, my aunts Belita, Esmera, Judith and Decinha, my cousins Fatim, Alva and Bel, Alan, Zeo, Caio, Vivian, Manoel, Jorge, Nandi, Renã, Luis and my family away from home Chicão, Rosinha, Juca, Leo, Mito and Kelly. I wish to thank Aline Requião for everything. Aline, I appreciate your motherly love.

I dedicate this work to the loving memory of my mother Antonia, to my son Arthur and to my nieces Bruna and Eduarda.


RESUMO

3D non-rigid registration is fundamental for tracking and/or reconstructing deformable three-dimensional models. However, most non-rigid registration algorithms are not as fast as those developed in the field of rigid registration. Fast methods for 3D non-rigid registration are particularly interesting for markerless augmented reality applications, in which an object used as a natural marker may undergo deformations while the application is running. This thesis presents a multi-frame adaptive algorithm, implemented on the GPU, for the non-rigid registration of deformable three-dimensional models captured by an RGB-D camera. Adaptive approaches tend to optimize algorithms by concentrating effort on the most relevant locations, which improves the solution globally. The proposed method uses adaptivity in three steps of the algorithm. First, to guide the distribution of regions of influence based on the deformation intensity computed over the object. Second, during constraint selection, where the sampling performed over the object for the optimization phase is based on the currently measured deformation. Third, to apply the algorithm in a multi-frame scheme only when the rigid tracking error exceeds a certain threshold, indicating that a rigid transformation no longer produces a satisfactory alignment. By combining adaptivity with the parallelism of the GPU implementation, the results obtained show that the proposed method can run in real time with an approach as accurate as those existing in the literature.

Keywords: Non-rigid registration, Adaptive algorithms, Augmented reality.

ABSTRACT

3D non-rigid registration is fundamental for tracking or reconstruction of 3D deformable shapes. However, the majority of non-rigid registration methods are not as fast as the ones developed in the field of rigid registration. Fast methods for 3D non-rigid registration are particularly interesting for markerless augmented reality applications, in which the object being used as a natural marker may undergo non-rigid user interaction. Here, we present a multi-frame adaptive algorithm for 3D non-rigid registration, implemented on the GPU, in which the 3D data is captured by an RGB-D camera. In general, adaptive algorithms optimize the solution by focusing on the most relevant aspects of the problem, which yields a global improvement in the final solution. Our approach uses adaptivity in three stages of the process. First, to guide the distribution of regions of influence based on the deformation intensity in each region of the shape. Second, during the selection of constraints, where the sampling done over the object for the optimization is based on the current deformation. Third, to apply the algorithm in a multi-frame manner only when the rigid tracking error is above a pre-defined threshold, indicating that a rigid transformation can no longer produce a satisfactory alignment. Taking advantage of this adaptivity and of the parallelism of the GPU, the results obtained show that the proposed algorithm is capable of achieving real-time performance with an approach as accurate as the ones proposed in the literature.

Keywords: Non-Rigid Registration, Adaptive Algorithms, Augmented Reality.


LIST OF FIGURES

2.1 Reality-Virtuality Continuum (Milgram; Kishino, 1994).

2.2 Marker-based (left image) and markerless (right images) augmented reality. Left image is courtesy of ARToolKit library (Kato; Billinghurst, 1999) and right images are courtesy of KinectFusion (Izadi et al., 2011).

3.1 Overview of the proposed approach from 3D reference model reconstruction to tracking solution. Adapted from (Souza; Macedo; Apolinario, 2014).

3.2 Overview of KinectFusion’s pipeline (Izadi et al., 2011).

3.3 Left image: The user translated his face fast. A small number of points were at the same image coordinate and the ICP failed. Right image: By using the pose estimation algorithm, the problem can be solved (Macedo; Apolinario; Souza, 2013).

4.1 Overview of the proposed approach from the depth map acquisition to the final non-rigid aligned surface.

4.2 Building of the deformation graph (right) over the source object (left) based on the residual error measured (center).

4.3 Refinement of the deformation graph (right) over the cheeks region of the source object (left) based on the residual error measured (center).

4.4 Collapsing of the deformation graph (right) over the cheeks region of the source object (left) after updating on the residual error (center).

4.5 Constraint selection based on the initial non-rigid error between source and target surfaces.

5.1 Overview of the libraries used for each step of our approach.

5.2 Datasets used for evaluation of the non-rigid registration algorithm. I - Synthetic dataset consisting of a deformed plane. II - Real dataset of a deforming hand. III-1 - Real dataset of a user smiling. III-2 - Real dataset of a user inflating his cheeks.

5.3 The resulting color-coded error from the registration between source and target surfaces. In all situations the proposed algorithm AdNodes + AdCons obtained an average accuracy below 3 mm and standard deviation below 3.5 mm. I - Synthetic dataset consisting of a deformed plane. II - Real dataset of a deforming hand. III-1 - Real dataset of a user smiling. III-2 - Real dataset of a user inflating his cheeks.

5.4 Accuracy (in mm) obtained by AdNodes and AdNodes + AdCons in comparison with the Embedded Deformation (ED) algorithm and the initial error for each one of the datasets used.

5.5 Accuracy comparison between ED algorithm and our adaptive approach with respect to the node selection for the dataset II.

5.6 Accuracy comparison between ED algorithm and our adaptive approach with respect to the node selection for the dataset III-1.

5.7 Accuracy comparison between different sampling schemes used to select constraints for optimization for the dataset II.

5.8 Accuracy comparison between different sampling schemes used to select constraints for optimization for the dataset III-1.

5.9 Accuracy (in mm) related to the parameter k for each one of the datasets used.

5.10 Accuracy (in mm) obtained for each level of the quadtree and for each one of the datasets used. The maximum number of nodes for a level l is $4^l$.

5.11 Performance (in FPS) obtained by AdNodes and AdNodes + AdCons in comparison with ED algorithm for each one of the datasets used.

5.12 Performance (in ms) obtained by our approach for each one of the most computationally expensive methods. MM - Matrix Multiplication ($A = J^T J$); Jacobian - computation of $J$; Cholesky - $LL^T$ decomposition; Solver - linear solver Strsm from CUBLAS library; ACS - Adaptive Constraint Selection; ANS - Adaptive Node Selection; Weights - computation of the influence of G on Ps; MV - Matrix-vector multiplication ($b = -J^T r$).

6.1 Neutral and deformed reference models based on user’s facial expression.

6.2 Neutral and deformed reference models for a different user.

6.3 Neutral and deformed reference models based on challenging deformation scenarios.

6.4 Cheeks tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.5 Color-coded cheeks tracking error measured for both rigid and non-rigid solutions.

6.6 Cheeks-2 tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.7 Color-coded cheeks-2 tracking error measured for both rigid and non-rigid solutions.

6.8 Smile tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.9 Color-coded smile tracking error measured for both rigid and non-rigid solutions.

6.10 Smile-2 tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.11 Color-coded smile-2 tracking error measured for both rigid and non-rigid solutions.

6.12 Kiss tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.13 Color-coded kiss tracking error measured for both rigid and non-rigid solutions.

6.14 Kiss-2 tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.15 Color-coded kiss-2 tracking error measured for both rigid and non-rigid solutions.

6.16 Open Mouth tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.17 Color-coded open mouth tracking error measured for both rigid and non-rigid solutions.

6.18 Angry tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.19 Color-coded angry tracking error measured for both rigid and non-rigid solutions.

6.20 Bag tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.21 Color-coded bag tracking error measured for both rigid and non-rigid solutions.

6.22 Limitation of the proposed method. User’s body (A) is reconstructed (B) and the algorithm cannot track user’s arms (C) integrating all the movement into the 3D reference model (D).

6.23 Body tracking error measured for both rigid and rigid + non-rigid solutions. Plot in red - rigid tracking. Plot in blue - non-rigid adaptive tracking. Dashed line - threshold.

6.24 Color-coded body tracking error measured for both rigid and non-rigid solutions.

LIST OF TABLES

5.1 Number of constraints (C), accuracy (A, given in mm), standard deviation (SD, given in mm) and performance (P, given in FPS) results according to the step size (from 1 to 32) or sampling scheme (Adap for adaptive) used to select constraints for optimization.

6.1 Average accuracy (A, given in mm) and Standard Deviation (SD, given in mm) results according to the weight used to update the 3D reference model.

6.2 Average accuracy (Avg., given in mm), Standard Deviation (Std. Dev., given in mm) and Performance (Perf., given in FPS) results for each one of the tracking algorithms tested in presence of specific user deformation. NRn: Non-Rigid Registration applied for every n frames (independent of rigid tracking fail); NRAdaptive: Non-Rigid Registration applied whenever the rigid algorithm fails.

6.3 Average accuracy (Avg., given in mm), Standard Deviation (Std. Dev., given in mm) and Performance (Perf., given in FPS) results for each one of the thresholds used to detect rigid tracking fail.

Chapter 1

INTRODUCTION

This first chapter briefly contextualizes the problem we want to solve, states the objectives and contributions of the proposed work, and describes the organization of the thesis.

Augmented Reality (AR) is a technology in which the view of a real scene is augmented with additional virtual information. As stated by Azuma (1997), an AR application must have three basic characteristics:

1. Combination of virtual object(s) into a real scene;

2. Real-time performance;

3. 3D registration for accurate tracking of the augmented scene.

Since the beginning, tracking has been one of the main problems limiting the development of successful AR applications. Virtual and real worlds must be properly aligned so that they seem to coexist at the same location for the user. For some applications, such as the ones proposed for medical AR in surgery environments, accurate registration of the virtual medical data onto the patient is especially important; otherwise, the success of the surgical operation may be compromised.

Tracking plays an important role not only in AR, but also for 3D reconstruction. Several viewpoints of the same object/scene of interest are captured by an appropriate sensor and these must be registered and aligned to the same coordinate system. After this registration step, the different viewpoints must be integrated into a single 3D model. Therefore, if the viewpoints are incorrectly aligned, visible artifacts will appear in the final reconstructed model.

Computer vision techniques have been proposed to solve the problem of registration; however, they are not robust enough under some illumination conditions (Teichrieb et al., 2007). With the availability of depth sensors, 3D registration techniques that use 3D information have been proposed to improve tracking robustness. For low-cost depth sensors, however, noise may affect the accuracy of the registration.

In scenarios such as on-patient craniofacial medical data visualization (Lee et al., 2012; Macedo et al., 2014), it is especially important for a markerless AR (MAR) environment to provide support for non-rigid tracking, which adds one level of interactivity for the user and improves the robustness of the tracking algorithm for rigid and non-rigid patient interactions. The main issue related to this support is that AR requires real-time interactivity, and most of the current state-of-the-art works in the field of 3D non-rigid registration do not provide such performance. Here, we assume that an application runs in real time if its performance is equal to or above 15 frames per second (Akenine-moller; Moller; Haines, 2002). This concept of real time is related to user interactivity: the user must interact with the application and receive feedback from it without noticeable delay.

Several approaches exist for accurate 3D non-rigid registration; however, few of them allow interactive registration. Apart from the real-time techniques that rely on strong priors about a specific scenario (Weise et al., 2011; Chen; Izadi; Fitzgibbon, 2012; Bouaziz; Wang; Pauly, 2013; Li et al., 2013), a few methods have been proposed for fast general-purpose non-rigid registration (Sumner; Schmid; Pauly, 2007; Nutti et al., 2014). Their common characteristic is the way they represent the deformation of a given surface: using a deformation graph. Each node of this graph has a 3D affine transformation, and together these transformations allow the source surface to be deformed onto a target surface. Deformation is modelled in terms of an energy function; by minimizing this energy with a non-linear optimization algorithm, the best affine transformation for each node of the graph can be found.
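As an illustration of how a deformation graph moves a surface point, the sketch below follows the blending rule of the embedded deformation model of Sumner, Schmid and Pauly (2007): each vertex is transformed by its k nearest nodes, and the results are averaged with distance-based weights. It is a minimal CPU sketch under stated assumptions, not the GPU implementation developed in this thesis; the function names, the value k = 4 and the array layout are hypothetical choices made for the example.

```python
import numpy as np

def node_weights(v, nodes, k=4):
    """Indices and normalized weights of the k graph nodes nearest to vertex v."""
    d = np.linalg.norm(nodes - v, axis=1)
    order = np.argsort(d)
    nearest = order[:k]
    d_max = d[order[k]]                      # distance to the (k+1)-th nearest node
    w = (1.0 - d[nearest] / d_max) ** 2      # weight falls to zero at d_max
    return nearest, w / w.sum()

def deform_vertex(v, nodes, A, t, k=4):
    """Blend the affine transformations (A[j], t[j]) of the nodes that influence v."""
    idx, w = node_weights(v, nodes, k)
    out = np.zeros(3)
    for j, wj in zip(idx, w):
        # Each node transforms v relative to its own position nodes[j].
        out += wj * (A[j] @ (v - nodes[j]) + nodes[j] + t[j])
    return out
```

With identity matrices and the same translation on every node, the blend reduces to a plain translation of the vertex, which is a quick sanity check for the weights summing to one. The graph needs more than k nodes for `d_max` to be defined.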

In this doctoral work, we want to address the problem of fast 3D non-rigid registration by applying adaptive techniques to reduce the computational cost of the registration while keeping it accurate.

1.1 HYPOTHESIS

Our main research question is: Is it possible to track, interactively and with sufficient accuracy, deformable objects which undergo deformation in sequential frames in a markerless augmented reality application?

To answer this question, we build upon an adaptive approach for fast non-rigid registration in scenarios where real noisy surfaces are captured from a low-cost depth sensor. This thesis aims to solve the problem of fast, interactive 3D non-rigid registration for MAR environments. In this sense, the proposed approach must be as accurate as state-of-the-art solutions, while supporting real-time performance and being robust under noisy and missing data.

1.2 CONTRIBUTIONS

The main contributions of this thesis are:

A markerless augmented reality environment based on a low-cost RGB-D sensor;

A dynamic subdivision approach for node selection on the source object;

An adaptive algorithm to select, for each iteration, samples from the source object to be used as constraints for optimization;

A multi-frame adaptive approach in which non-rigid registration is applied only when rigid tracking error is above a certain threshold and a 3D rigid representation of the object is updated to take into account the current deformation;

A full framework for non-rigid registration implemented entirely on the Graphics Processing Unit (GPU).
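The multi-frame adaptive contribution listed above can be pictured as a simple per-frame control loop. The sketch below is only an outline of that idea: `rigid_align`, `nonrigid_align` and `update_reference` are hypothetical callables standing in for the actual tracking stages, and `threshold_mm` is a placeholder rather than a value taken from this thesis.

```python
def adaptive_tracking(frames, rigid_align, nonrigid_align, update_reference, threshold_mm):
    """Run rigid tracking on every frame; fall back to non-rigid registration
    only when the rigid error exceeds the threshold."""
    log = []
    for frame in frames:
        error = rigid_align(frame)            # cheap rigid stage, always executed
        used_nonrigid = error > threshold_mm
        if used_nonrigid:
            error = nonrigid_align(frame)     # expensive non-rigid stage, on demand
            update_reference(frame)           # keep the reference model in sync
        log.append((used_nonrigid, error))
    return log
```

The design point is that the non-rigid stage costs nothing on frames where the rigid alignment is already good enough, so the average frame time stays close to that of rigid tracking.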

1.3 ORGANIZATION

This thesis is organized as follows:

Chapter 2, Fundamentals and Related Work. This chapter formalizes the concepts of augmented reality, 3D registration and their challenges. Also, it provides an extensive review of related work in the fields of rigid and non-rigid registration, focusing on the interactive methods developed so far.

Chapter 3, Markerless Augmented Reality Environment. The focus of this thesis is to add support for non-rigid tracking in a markerless augmented reality environment. Therefore, in this chapter we present the environment in which the proposed non-rigid registration was applied and validated.

Chapter 4, GPU-Based Adaptive Non-Rigid Registration. In this chapter we present the proposed adaptive non-rigid registration algorithm and its adaptation to take advantage of the parallelism of the GPU, as well as the multi-frame scheme adopted to improve the algorithm’s performance.

Chapter 5, Non-Rigid Registration Evaluation. In this chapter, non-rigid registration is evaluated in terms of accuracy and performance for several datasets.

Chapter 6, Non-Rigid Support Evaluation for a Markerless Augmented Reality Environment. In this chapter, non-rigid tracking is evaluated in the context of the markerless augmented reality environment in terms of accuracy, performance and tracking robustness for several datasets.

Chapter 7, Conclusion and Future Work. The thesis concludes with a summary and a discussion of future directions.

Chapter 2

FUNDAMENTALS AND RELATED WORK

This chapter formalizes the concepts of augmented reality, 3D registration and their challenges. Also, it provides a review of related work in the fields of rigid and non-rigid registration, focusing on the interactive methods developed so far.

2.1 AUGMENTED REALITY

The concept of virtual environments has been discussed since the 1990s. They can be defined as environments in which only virtual objects are present. Milgram and Kishino proposed a taxonomy to identify where applications are located within the so-called Reality-Virtuality Continuum (Milgram; Kishino, 1994) (Figure 2.1). The extremes of this taxonomy are the real world and Virtual Reality (VR). At the center are Augmented Reality and Augmented Virtuality. In the former, the real world predominates over the virtual one, while in the latter the virtual world prevails over the real one.

Figure 2.1 Reality-Virtuality Continuum (Milgram; Kishino, 1994).

Both AR and VR use virtual objects, but they have some differences. AR changes the real world by adding virtual elements. Thus, it is fundamental for an application to maintain contact with the view of the real world, which is the basis of an AR application. Although authors such as Vallino and Azuma state that the main goal of AR is the seamless integration of the virtual objects into the real scene (Vallino, 1998; Azuma et al., 2001), it is not mandatory for such systems to be realistic. Another central distinction between AR and VR is the registration or tracking problem. This process is crucial in AR: the combination of real and virtual objects into the augmented scene requires an accurate positioning of the virtual objects over the real world.

The motivation for developing applications and conducting research in the field of AR comes from the potential benefits that such techniques may bring to several other fields. Regarding registration in particular, AR methods have been attracting a lot of attention in Medicine, because they extend the possibilities of study and practice for many techniques and medical procedures that rely on medical images of the patient’s current condition, such as angiographic visualization (Wang et al., 2012), liver surgery (Haouchine et al., 2013, 2014) and uterine laparosurgery (Collins et al., 2014). However, registration is a crucial problem in AR applications: misplaced objects appear to float over the real scene. Accurate registration becomes even more crucial in applications which demand high precision, such as surgeries.

Tracking in AR is performed based on the color or depth intensity of the object being tracked by the application. For color-based tracking, features are computed from the color image of the scene captured by the sensor and tracked during the application’s live stream (Horn; Schunck, 1981; Lucas; Kanade, 1981). The first solution proposed for tracking was based on fiducial markers, used as points of reference positioned in the real scene (Figure 2.2-left image). Due to their intrusiveness (i.e. the marker is an artificial content introduced in the scene), methods for color-based tracking without markers were proposed. However, the main drawback of this kind of registration is still the same: the susceptibility to illumination conditions. To overcome this problem, depth-based tracking was proposed, which registers two surfaces of the real scene captured by a real-time 3D depth sensor (Besl; Mckay, 1992; Chen; Medioni, 1992). This kind of tracking has grown in popularity due to its accuracy, its robustness to illumination conditions and the recent availability of low-cost depth sensors.

In general, AR applications can be divided into two groups: marker-based and markerless. Marker-based AR uses a fiducial marker as a point of reference in the field of view to help the system estimate the camera pose (Figure 2.2-left image) (Kato; Billinghurst, 1999). Markerless AR (MAR) uses a part of the real scene as a natural marker (Figure 2.2-right images) (Izadi et al., 2011). When it is used as a point of reference for tracking, one can expect non-rigid motion of the marker if it consists of a deformable object (e.g. face, body, hand).

2.2 3D REGISTRATION

3D registration is a fundamental problem in fields such as 3D reconstruction and augmented reality. Most depth sensors provide partial surface data (i.e. acquired from one viewpoint) that must be aligned, to allow camera pose estimation, and merged, to obtain a complete digital representation of the object or scene of interest.

Some functional models have been proposed in the literature to solve the problem of registration with good performance:

1. Rigid Registration: In this kind of registration, a single Euclidean transformation is used to align two objects (Rusinkiewicz; Levoy, 2001). This transformation has the following properties: (1) It is global (i.e. remains the same for every point); (2) It can be uniquely defined by three non-collinear pairs of correspondences; (3) It is low-dimensional (i.e. only six degrees of freedom). Real-time performance is easily achieved due to the low number of parameters required to solve the rigid registration;

2. Articulated Deformation: For surfaces which are mainly characterized by articulations, a skeleton is typically used as the basis for deformation. In this representation, a skeleton is defined by a combination of bones and joints. Each joint is associated with some degrees of freedom (i.e. joint angles) and is related to other joints by rigid transformations (Allen; Curless; Popović, 2002). In an alternative representation, joint deformation is obtained by blending the transformations of two adjacent bones in the overlap regions (Chang; Zwicker, 2008, 2011). The advantage of this representation is that it requires a low number of parameters to be estimated, which depends on the number of available bones or joints;

3. Local Affine Deformation: For several real-world datasets, it is desirable for the non-rigid registration algorithm to support general deformations, without prior knowledge about the objects or the kind of deformation they undergo. To achieve real-time performance, models such as articulated deformation rely on prior knowledge about the scenario (e.g. skeleton tracking), losing generality. To keep non-rigid registration fast, accurate and general, solutions based on local affine transformations are frequently employed, as they preserve fine surface details while decoupling the complexity of the geometry from the complexity of the deformation by using a deformation graph as the basis representation (Sumner; Schmid; Pauly, 2007).

Figure 2.2 Marker-based (left image) and markerless (right images) augmented reality. Left image is courtesy of ARToolKit library (Kato; Billinghurst, 1999) and right images are courtesy of KinectFusion (Izadi et al., 2011).

Other functional models such as rigid registration with non-rigid correctives (Brown; Rusinkiewicz, 2007) and isometric deformation (Lipman; Funkhouser, 2009) have also been proposed in the literature; however, their computational cost is too high for them to be used in our approach.
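Property (2) of the rigid model, that three non-collinear pairs of correspondences define the transformation uniquely, can be checked with a short example. The sketch below builds an orthonormal frame from each triplet and maps one frame onto the other. It assumes noise-free correspondences; the function names are hypothetical, and this is not the estimator used inside ICP.

```python
import numpy as np

def frame_from_triplet(p):
    """Orthonormal frame (as matrix columns) built from three non-collinear points."""
    e1 = p[1] - p[0]
    e1 = e1 / np.linalg.norm(e1)
    normal = np.cross(e1, p[2] - p[0])
    length = np.linalg.norm(normal)
    if length < 1e-12:
        raise ValueError("points are collinear; the rotation is not unique")
    e3 = normal / length
    e2 = np.cross(e3, e1)
    return np.column_stack((e1, e2, e3))

def rigid_from_three_pairs(src, dst):
    """Rotation R and translation t such that dst[i] = R @ src[i] + t."""
    Fs = frame_from_triplet(src)
    Fd = frame_from_triplet(dst)
    R = Fd @ Fs.T                 # maps the source frame onto the target frame
    t = dst[0] - R @ src[0]
    return R, t
```

The result has six degrees of freedom (three for `R`, three for `t`), matching property (3). With collinear points the frame cannot be built, which is why the correspondences must be non-collinear.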


2.2.1 Rigid Registration

Rigid registration estimates a single transformation, composed of rotation and translation, to align two different viewpoints of the same object. It is a challenging problem because, in real-world scenarios, it must deal with noise, outliers and non-overlapping regions between two surfaces captured by commodity depth sensors. Noise refers to the presence of unwanted points near the captured surface. Outliers are noisy points far from the surface, which must be rejected because otherwise they may affect the optimization phase. As the object is captured from a single view of the camera, non-overlapping regions between two surfaces are expected; however, holes and other artifacts may significantly decrease the region of overlap (Tam et al., 2013).

To limit the search space for optimization and correspondence estimation, constraints must be defined. In the field of rigid registration, transformation-induced constraints such as the closest point criterion are commonly employed. This criterion constrains potential correspondences by computing and matching closest points at every iteration of the registration algorithm. It is used in the standard Iterative Closest Point (ICP) algorithm (Besl; Mckay, 1992; Chen; Medioni, 1992). To reduce the search space for correspondences, specific approaches have been proposed: the project and project-and-walk methods (Rusinkiewicz; Levoy, 2001) restrict the search for a new closest point to the same 2D projection (i.e. pixel) and to its local neighbourhood, respectively, avoiding a global exhaustive search. Other constraints such as features (Johnson; Hebert, 1999) and saliency (Gelfand et al., 2005) have also been proposed in the literature. They provide more reliable correspondences, and consequently more accurate convergence to the final result, but they require high processing time, which makes them unsuitable for our proposal.
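The project method mentioned above can be sketched as follows: each source point is projected into the target depth image with a pinhole camera model, and the target vertex stored at that pixel becomes its candidate correspondence. This is a minimal illustration, assuming a target vertex map of shape (height, width, 3) with NaN entries marking holes; the intrinsics `fx`, `fy`, `cx`, `cy` and the distance gate `max_dist` are hypothetical parameters, not values from this thesis.

```python
import numpy as np

def projective_association(src_points, target_vertex_map, fx, fy, cx, cy, max_dist):
    """Pair each source point with the target vertex at the pixel it projects to."""
    h, w, _ = target_vertex_map.shape
    pairs = []
    for i, point in enumerate(src_points):
        x, y, z = point
        if z <= 0:
            continue                                  # behind the camera
        u = int(round(fx * x / z + cx))               # pinhole projection to pixel
        v = int(round(fy * y / z + cy))
        if not (0 <= u < w and 0 <= v < h):
            continue                                  # projects outside the image
        q = target_vertex_map[v, u]
        if not np.all(np.isfinite(q)):
            continue                                  # hole in the target depth map
        if np.linalg.norm(q - point) <= max_dist:
            pairs.append((i, (v, u)))                 # source index, target pixel
    return pairs
```

Each lookup costs constant time, in contrast with a global nearest-neighbour search over the whole target surface, which is what makes this kind of association attractive for real-time use.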

Rigid registration has been researched for many years and is now a well-defined problem with a small number of parameters to be estimated. As a result, real-time, high-quality methods have already been proposed in the literature. The most popular algorithm for 3D rigid registration is the ICP. It consists of six steps:

Selection of Points: Points from source and target objects are selected as samples for the algorithm;

Matching of Points: Corresponding points from source and target objects are associated;

Weighting of Correspondences: Correspondences are weighted so that the most reliable ones receive the highest weights;

Rejecting of Correspondences: Outliers are rejected from the pairs of corresponding points;

Error Association: Point-to-point or point-to-plane error metric is defined for the optimization step;

Error Minimization: Energy function built from previous step is (commonly) minimized by solving a linear system.


As the ICP algorithm provides high accuracy and real-time performance for rigid registration (Rusinkiewicz; Levoy, 2001), it is used in our approach.
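The six steps can be illustrated with a minimal sketch. It is not the implementation used in this thesis: it works in 2D with a point-to-point metric, uses an exhaustive closest-point search, and the function name `icp_2d` and its parameters are hypothetical.

```python
import math

def icp_2d(source, target, iterations=10, max_dist=1.0):
    """Align 2D `source` points to `target` with point-to-point ICP (sketch)."""
    src = list(source)  # 1. selection: every point is used
    for _ in range(iterations):
        pairs = []
        for p in src:
            q = min(target, key=lambda t: math.dist(p, t))  # 2. matching
            if math.dist(p, q) <= max_dist:                 # 4. rejection
                pairs.append((p, q, 1.0))                   # 3. constant weight
        if len(pairs) < 2:
            break
        w_sum = sum(w for _, _, w in pairs)
        pcx = sum(w * p[0] for p, _, w in pairs) / w_sum
        pcy = sum(w * p[1] for p, _, w in pairs) / w_sum
        qcx = sum(w * q[0] for _, q, w in pairs) / w_sum
        qcy = sum(w * q[1] for _, q, w in pairs) / w_sum
        # 5. error metric: weighted squared point-to-point distances.
        # 6. minimization: closed-form 2D rotation, then translation.
        s_cos = s_sin = 0.0
        for p, q, w in pairs:
            px, py = p[0] - pcx, p[1] - pcy
            qx, qy = q[0] - qcx, q[1] - qcy
            s_cos += w * (px * qx + py * qy)
            s_sin += w * (px * qy - py * qx)
        theta = math.atan2(s_sin, s_cos)
        c, s = math.cos(theta), math.sin(theta)
        tx = qcx - (c * pcx - s * pcy)
        ty = qcy - (s * pcx + c * pcy)
        src = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in src]
    return src
```

In 3D the closed-form step is usually replaced by one of the solvers cited later in this section (e.g. Singular Value Decomposition or quaternions), and the exhaustive search by a cheaper association strategy.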

2.2.2 Non-Rigid Registration

Non-rigid registration requires more attention because it faces the issues of rigid registration and also the problem of deformation, which increases both the number of parameters to be estimated and the space of possible solutions. Unlike the rigid scenario, where every point of a given source object is moved by a single transformation estimated by the algorithm, in the non-rigid scenario every point may undergo a different, interconnected deformation. Therefore, more reliable correspondences must be computed for every region of the source object so that the registration is sufficiently accurate and realistic (Tam et al., 2013).

Traditionally, commercial systems have used markers to provide sparse, reliable correspondences for non-rigid registration; however, markers are intrusive in the scene (Bermano et al., 2014). Templates have been used in applications based on part-to-whole alignment, where they provide strong priors for the shape, helping to handle noise and missing data (Li et al., 2009). For scenarios such as facial non-rigid registration, blendshapes can be applied to capture a basis set of user expressions (Weise et al., 2011; Bouaziz; Wang; Pauly, 2013; Li et al., 2013). Other constraints induced by deformation, features, signature and saliency require too much processing time. The closest point criterion can be used for rigid and non-rigid registration in a similar way. However, regularization constraints are commonly employed to improve the optimization phase: they take advantage of a priori information to avoid local minima. Orthonormality (Sumner; Schmid; Pauly, 2007) and handling of holes (Li; Sumner; Pauly, 2008) are among the most used regularization schemes for non-rigid shapes.

In general, a non-linear optimization solver is employed for rigid and non-rigid registration. Many techniques have been used to find the best transformations and correspondences. Local deterministic optimization methods compute a solution that maximizes or minimizes an energy function locally. These techniques do not produce the most accurate solutions, but they are widely used due to their low processing time. Gradient descent, Newton, Gauss-Newton, quasi-Newton and Levenberg-Marquardt are often employed for non-rigid registration (Madsen; Bruun; Tingleff, 2004). Singular Value Decomposition, quaternions, orthonormal matrices and dual quaternions are the most frequently used for rigid registration (Lorusso; Eggert; Fisher, 1995). As local optimization techniques may find only a local minimum, global deterministic optimization addresses this problem by searching for a global solution. As an alternative, stochastic optimization addresses it by using statistics and probabilistic models. While stochastic and global deterministic optimization seem to be more accurate, in this thesis we use a technique based on local optimization because of its low running time. Moreover, as we assume that there is spatial and temporal coherence between the sequential frames used for registration, local optimization converges after a few iterations (Sumner; Schmid; Pauly, 2007).

Surfaces such as the face, hand and body may undergo deformation during a process such as 3D reconstruction, and rigid registration is not able to handle it. A solution for this issue is to apply non-rigid registration to align those deformable objects.

One of the first works in the field of fast non-rigid registration applied to computer graphics is Embedded Deformation (ED), a real-time deformation algorithm for object manipulation and the creation of 3D animation (Sumner; Schmid; Pauly, 2007). The goal of this technique is to allow the user to edit a surface intuitively while preserving the surface's features. Deformation is represented by a graph. Each node of this graph is associated with an affine transformation that influences the deformation of the nearby space. The great advantage of this approach is that it can be applied to a wide range of objects, articulated or not.

Although its main goal is object manipulation by the user, the algorithm proposed by Sumner et al. can also be seen as a non-rigid registration algorithm in which the source and target surfaces are the objects before and after user manipulation. In this sense, many other works have used or improved this approach for the specific problem of non-rigid surface registration.

Li et al. adapted Sumner's algorithm to the registration of partial range scans acquired from a 3D scanner (Li; Sumner; Pauly, 2008). They augmented the ED algorithm with a rigid registration and designed an energy function to penalize unreliable correspondences. Later on, Li et al. extended this approach (Li; Sumner; Pauly, 2008) with an algorithm for high-quality template-based non-rigid surface registration and reconstruction using dynamic graph refinement and multi-frame stabilization (Li et al., 2009).

Li et al. presented a method for temporally coherent completion of surfaces captured from real-time dynamic performances (Li et al., 2012). They extended the non-rigid registration proposed in their previous work (Li et al., 2009) by adding texture constraints to the optimization. Dou and colleagues proposed an algorithm to track dynamic objects acquired from real-time commodity depth cameras, such as the Microsoft Kinect sensor (Dou; Fuchs; Frahm, 2013). Basically, they extended the KinectFusion algorithm (Izadi et al., 2011) to deal with non-rigid registration. Their non-rigid registration is based on the ED algorithm, with color consistency and dense point cloud alignment added to the original energy function. All these approaches improve the accuracy of the ED algorithm, but they require execution times on the order of minutes to register two point clouds. Thus, they are not suitable for an AR application.

Few methods have achieved real-time performance in 3D non-rigid registration. Chen et al. proposed a method for non-rigid registration of skeletons captured from the user's body (Chen; Izadi; Fitzgibbon, 2012). Their approach runs at 30 frames per second (FPS), but it uses a small number of constraints for registration and depends on a skeleton definition.

Nutti et al. proposed a method to track tumors based on the patient's body position, which presumes prior knowledge about the scenario (Nutti et al., 2014). Their algorithm runs at 10 FPS by using a multi-threaded CPU implementation of the method of Li et al. (2009).

Zollhöfer et al. proposed a method for real-time non-rigid registration of arbitrary meshes captured from the real scene (Zollhöfer et al., 2014). Based on hardware specialized for high-quality surface acquisition, their approach generates a 3D template model of the object of interest and uses a hierarchical non-rigid registration algorithm fully implemented on the GPU. The implementation runs at 30 FPS with high accuracy.

In this work, we present an approach also based on the ED algorithm which shares some characteristics with (Zollhöfer et al., 2014), such as requiring no special configuration or prior knowledge of the object and using GPU parallelism to achieve real-time performance. However, no special hardware is required; on the contrary, our approach is based on a simple off-the-shelf RGB-D sensor, with noise and low accuracy. As proposed in related work (Li et al., 2009), we use an adaptive graph refinement to improve non-rigid registration accuracy. Unlike other approaches, the algorithm proposed here runs entirely on the GPU and is based on a quadtree which operates over the 2D projection of the object to be registered. Also, the main goal of our algorithm is to be incorporated into a MAR environment, as a tool to improve the tracking of the deformable object.

2.3 SUMMARY

Augmented reality is a technology which has been used in several fields, such as medicine and entertainment. For some applications, markerless technology is useful to remove the intrusiveness of traditional marker-based approaches. When the object used as the basis for markerless tracking is deformable, it is desirable for the application to support non-rigid motion to improve tracking robustness.

Many methods inspired by the ED algorithm have been proposed for accurate 3D non-rigid registration; however, only a few of them achieve real-time performance, and those still require prior knowledge about the scenario. To overcome this situation, in this thesis we propose an alternative method for fast 3D non-rigid registration which extends the ED algorithm by using a three-level adaptive approach implemented entirely on the GPU.


Chapter 3

MARKERLESS AUGMENTED REALITY ENVIRONMENT

The focus of this thesis is to add support for non-rigid tracking in a markerless augmented reality environment. Therefore, in this chapter we present the environment in which the proposed non-rigid registration was applied and validated.

In this chapter we present the MAR environment on which this work is based. An overview of the proposed MAR environment can be seen in Figure 3.1. An RGB-D sensor is used to capture color and depth information of the scene. The object of interest is localized, segmented from the scene and reconstructed in real-time. Then, real-time tracking is performed by using the 3D reference model previously reconstructed and the current 3D object captured by the sensor. The final registered 3D object is integrated into the 3D reference model to account for new viewpoints or changes in the object's shape due to deformations. A detailed explanation of the environment is given in the next sections of this chapter.

3.1 SURFACE ACQUISITION

In this environment, an RGB-D sensor is used to capture color and depth information from the real scene for every input frame (Figure 3.1). Color information is encoded as a color map, an image which stores for each pixel the red, green and blue intensities of the captured scene. Depth information is encoded as a depth map (D), an image which stores for each pixel the measurement of distance (i.e. depth) from the corresponding 3D point on the scene to the depth sensor.

Our approach is based on a low-cost RGB-D sensor which provides noisy depth data. As described in Section 2.2, unwanted points on the captured surface may reduce registration accuracy. To minimize this problem, a bilateral filter is applied over D (Tomasi; Manduchi, 1998), as shown in Equation 3.1. To reduce noise while preserving the features (i.e. discontinuities) of the raw depth data, this technique uses a non-linear combination of nearby image intensities based on geometric proximity and photometric similarity.

Figure 3.1 Overview of the proposed approach from 3D reference model reconstruction to tracking solution. Adapted from (Souza; Macedo; Apolinario, 2014). [Diagram stages: RGB-D sensor live stream; object segmentation; source and target depth maps; source and target surfaces; registration and tracking; TSDF; 3D reference model reconstruction.]

\[ D_f(p) = \frac{1}{W(p)} \sum_{q \in S} G_{\sigma_d}(\|p - q\|)\, G_{\sigma_c}(\|D(p) - D(q)\|)\, D(q), \qquad W(p) = \sum_{q \in S} G_{\sigma_d}(\|p - q\|)\, G_{\sigma_c}(\|D(p) - D(q)\|) \tag{3.1} \]

where D(p) and D(q) correspond to the pixel values at positions p and q in image D. σd and σc are the standard deviations of the Gaussian functions G for the space (i.e. distance) and range (i.e. color) domains, respectively. W(p) is a normalization factor, S is the neighbourhood of pixel p and Df is the filtered depth map. From empirical tests, we have set σd = 4.5 and σc = 30.
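A direct, sequential reading of Equation 3.1 is sketched next. It is only an illustration: the window `radius`, the convention that a depth of zero marks a missing measurement, and the function name are assumptions, and a practical version would run one thread per pixel on the GPU.

```python
import math

def bilateral_filter(depth, radius=2, sigma_d=4.5, sigma_c=30.0):
    """Filter a depth map (list of rows) following Equation 3.1 (sketch)."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            centre = depth[y][x]
            if centre <= 0:          # assumption: zero means "no depth data"
                continue
            num = den = 0.0
            for qy in range(max(0, y - radius), min(h, y + radius + 1)):
                for qx in range(max(0, x - radius), min(w, x + radius + 1)):
                    d = depth[qy][qx]
                    if d <= 0:
                        continue
                    # Spatial Gaussian on pixel distance, range Gaussian on depth difference.
                    g_space = math.exp(-((qx - x) ** 2 + (qy - y) ** 2) / (2 * sigma_d ** 2))
                    g_range = math.exp(-((d - centre) ** 2) / (2 * sigma_c ** 2))
                    num += g_space * g_range * d
                    den += g_space * g_range
            out[y][x] = num / den    # den >= 1: the centre pixel always contributes
    return out
```

Because the range term collapses to nearly zero across a large depth jump, pixels on either side of a discontinuity barely influence each other, which is what preserves edges.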

Unwanted points are also located in the background of the scene; they can be removed from Df by using a depth threshold. In the experiments conducted, we used a threshold of 1.3 meters, assuming that the object of interest is near the depth sensor.

To detect and segment the object of interest in the scene (Figure 3.1), two methods can be used. The first method relies on a classifier to detect the object on the appropriate map. If it is applied to the color map, intrinsic and extrinsic calibrations must be performed to allow the mapping of the segmented region from the color map to the depth map. In practice, we have tested the approach in scenarios where the object is the user's head. In these cases, the Viola-Jones face detector (Viola; Jones, 2004), implemented on the GPU, is used to locate and segment the face in the color map (Figure 3.1). This detector takes advantage of a representation called the integral image to compute Haar-like features quickly. In an integral image, each pixel contains the sum of the pixels above and to the left of the original position. After the computation of the Haar-like features, a combination of simple classifiers built using the AdaBoost learning algorithm is employed to detect faces in color images (Freund; Schapire, 1995). If the classifier is not available, an alternative method can be used: a 2D bounding box that contains the foreground object is computed from D, and every position outside the bounding box is discarded from memory.
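The integral image idea can be sketched as follows. This is a generic illustration, not the detector's code: the table is padded with a leading row and column of zeros (a common variant), and the helper names are hypothetical.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] sums img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img over the half-open rectangle [x0, x1) x [y0, y1) using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

def haar_two_rect(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half (w must be even)."""
    half = w // 2
    return box_sum(ii, x, y, x + half, y + h) - box_sum(ii, x + half, y, x + w, y + h)
```

Any rectangular sum costs four table lookups regardless of the rectangle size, which is why Haar-like features can be evaluated quickly at many positions and scales.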

By applying the intrinsic calibration, a point cloud P is computed from D. The normal vector (n) of each point p ∈ P is the eigenvector associated with the smallest eigenvalue of a covariance matrix built for that point (Holzer et al., 2012).
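The conversion from D to P can be sketched with the usual pinhole model. The intrinsic parameters `fx`, `fy`, `cx`, `cy` and the dictionary layout are assumptions made for illustration; normal estimation is omitted.

```python
def back_project(depth, fx, fy, cx, cy):
    """Map each valid pixel (u, v) with depth z to a 3D point in camera coordinates."""
    points = {}
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # assumption: zero means "no depth data"
                points[(u, v)] = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    return points
```

Keeping the pixel coordinate as the key makes it cheap to look up the 3D point stored at a given image position, which is convenient for the projective association used later in this chapter.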

Once the 3D object is obtained for every frame, markerless rigid registration is performed based on the interactive alignment of two consecutive source (Ps) and target (Pt) point clouds captured from the real scene. In fact, Ps is represented by a 3D reference model generated from the object of interest in a previous pose and Pt is the current point cloud acquired by the depth sensor.

To achieve real-time performance, all the steps of this MAR environment must run on the GPU. Therefore, all the algorithms were carefully designed and implemented in parallel to exploit the parallelism provided by the hardware.

3.2 3D REFERENCE MODEL RECONSTRUCTION

To reconstruct the 3D reference model from the object of interest in real-time (Figure 3.1), the KinectFusion algorithm is employed (Izadi et al., 2011; Newcombe et al., 2011). An overview of this algorithm can be seen in Figure 3.2.

Figure 3.2 Overview of KinectFusion’s pipeline (Izadi et al., 2011).

Once the object is detected in the scene, the region that contains it is fixed. The object is then constrained to move only inside this region. From the different viewpoints captured of the same object, a single 3D reference model can be generated. To do so, KinectFusion integrates raw depth data captured from an RGB-D sensor into a 3D grid to produce a high-quality 3D reconstruction of the object of interest. The grid stores for each voxel the signed distance to the closest surface, truncated to a narrow region around it (i.e. TSDF - Truncated Signed Distance Function), and a weight that indicates the uncertainty of the surface measurement. This volumetric representation and its integration scheme are based on the VRIP algorithm (Curless; Levoy, 1996). To extract the implicit surface of the 3D reconstructed model, zero-crossings (i.e. positions where the TSDF sign changes) are detected on the grid through the raycasting algorithm.
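The per-voxel update and the zero-crossing test can be sketched as below, assuming the weighted running average of the volumetric integration scheme cited above. The truncation distance `mu`, the weight cap and the sign convention (positive in front of the surface) are illustrative assumptions.

```python
def truncated_sdf(signed_distance, mu):
    """Scale a signed distance by the truncation band mu and clamp it to [-1, 1]."""
    return max(-1.0, min(1.0, signed_distance / mu))

def integrate(tsdf, weight, signed_distance, mu, new_weight=1.0, max_weight=64.0):
    """Fuse one new measurement into a voxel; returns the updated (tsdf, weight)."""
    d = truncated_sdf(signed_distance, mu)
    fused = (weight * tsdf + new_weight * d) / (weight + new_weight)
    return fused, min(weight + new_weight, max_weight)

def zero_crossing(samples):
    """Return the interpolated index where TSDF samples along a ray change from + to -."""
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        if a > 0 and b <= 0:
            return i + a / (a - b)
    return None
```

Averaging many noisy measurements per voxel is what lets the reference model become smoother than any single depth frame.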

By extracting the reference model in a previous pose, and aligning it to the current 3D model captured by the depth sensor, the incremental motion (Trigid) between frames can be estimated. This solution allows accurate markerless tracking without error accumulation, as the high-quality 3D reference model is used as the basis for tracking.

3.3 TRACKING

Rigid motion is estimated by the ICP algorithm described in Section 2.2. Each of the ICP steps was designed to achieve real-time performance while providing good accuracy for the rigid registration. This real-time variant of the algorithm is described as follows:

Selection of Points: All the points from Ps and Pt are selected for optimization;

Matching of Points: Corresponding points between Ps and Pt are associated by using projective data association (i.e. reverse calibration) (Rusinkiewicz; Levoy, 2001), which matches the points that are located at the same 2D projection position (i.e. the same pixel in Ds and Dt);

Weighting of Pairs: A constant weight is assigned to each association;

Rejection of Pairs: Pairs are rejected if the Euclidean distance between corresponding points is greater than 10 mm or the angle between corresponding normals is greater than 20 degrees;

Error Metric: The point-to-plane metric (Equation 3.2) is used to guide optimization;

\[ \operatorname{argmin} \sum_{p\ \mathrm{selected}} \| (T_{rigid}\, p_s - p_t) \cdot n_t \|^2 \tag{3.2} \]

Error Minimization: The error metric is minimized by applying the Cholesky decomposition to Equation 3.2 (Chen; Medioni, 1992).
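The rejection test and the point-to-plane metric of Equation 3.2 can be sketched as follows. Coordinates are assumed to be in millimetres, the source points are assumed to be already transformed by Trigid, and the function names are hypothetical.

```python
import math

def accept_pair(ps, pt, ns, nt, max_dist_mm=10.0, max_angle_deg=20.0):
    """Keep a pair only if the points are close and their normals are nearly parallel."""
    if math.dist(ps, pt) > max_dist_mm:
        return False
    dot = sum(a * b for a, b in zip(ns, nt))
    norm = math.hypot(*ns) * math.hypot(*nt)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angle <= max_angle_deg

def point_to_plane_error(pairs):
    """Sum of squared distances from each source point to the tangent plane at its target.

    pairs: iterable of (transformed source point, target point, target normal).
    """
    total = 0.0
    for ps, pt, nt in pairs:
        residual = sum((a - b) * n for a, b, n in zip(ps, pt, nt))
        total += residual * residual
    return total
```

Note that a source point sliding along the target's tangent plane contributes no error, which is one reason this metric tends to converge in fewer iterations than the point-to-point one on smooth surfaces.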

The real-time variant of the ICP algorithm uses projective data association to find correspondences. As a consequence, the ICP fails, or does not converge to a correct registration, when there is high pose variation between consecutive frames. To improve tracking robustness, a real-time pose estimator is used to give a new initial guess to the tracking algorithm when it fails (Figure 3.3). For the situations where the object is the user's head, the head pose estimation algorithm proposed by Fanelli et al. was used (Fanelli et al., 2011). However, even with this algorithm, the tracking may fail if the user interacts non-rigidly with the application. Non-rigid tracking support can be added by applying a real-time non-rigid surface registration algorithm to align the 3D reference model and the current model captured, as will be discussed in the next chapter.


Figure 3.3 Left image: The user translated his face fast. A small number of points were at the same image coordinate and the ICP failed. Right image: By using the pose estimation algorithm, the problem can be solved (Macedo; Apolinario; Souza, 2013).

3.4 SUMMARY

One solution to provide accurate markerless tracking for an augmented reality environment is to generate a 3D reference model of the object of interest and track it in real-time. The KinectFusion algorithm is used to reconstruct such a model in real-time and the ICP algorithm is used to track it in the scene by registering the 3D reference model in a previous pose with the current 3D model captured from a depth sensor. To add support for non-rigid tracking, a real-time non-rigid registration algorithm is necessary to maintain user interaction with the application.


Chapter 4

GPU-BASED ADAPTIVE NON-RIGID REGISTRATION

In this chapter we present the proposed adaptive non-rigid registration algorithm and its adaptation to take advantage of the parallelism of the GPU. Our approach is evaluated in terms of accuracy and performance for several datasets. Some of the content described in this chapter appears in our publication (Souza; Macedo; Apolinario, 2014).

In this chapter we present the adaptive non-rigid registration algorithm. An overview of the full process to register two point clouds can be seen in Figure 4.1. The non-rigid algorithm builds a deformation graph (G) on Ps to allow its iterative deformation to Pt. Each node g ∈ G consists of a point of Ps associated with a 3D affine transformation (i.e. a 3D rotation matrix R and a 3D translation vector t) which influences the deformation of the nearby space. The current deformation between Ps and Pt is modelled in terms of an energy function, and a non-linear optimization algorithm is applied to minimize this energy based on the affine transformations of G. To reduce the computational cost of the non-linear solver, a sub-sample of Ps is selected as constraints to be used during optimization. Next, the algorithm iteratively refines G according to the energy function measured previously. This refinement is based on a quadtree. The registration stops when the residual error between the deformed Ps and Pt is sufficiently low. To achieve good performance, the full pipeline runs entirely on the GPU and the non-rigid registration algorithm is applied in a multi-frame manner only when rigid tracking fails.

Our deformation model is inspired by the ED algorithm (Sumner; Schmid; Pauly, 2007). However, we have added a three-level adaptive approach to improve the accuracy and performance of the original solution. Moreover, we have implemented it on the GPU to boost performance even more. The proposed algorithm consists of several stages (Figure 4.1), which are described in the next sections of this chapter.

4.1 DEFORMATION MODEL

By using the deformation graph, a point p can be deformed by G according to the following equation:

\[ p' = \sum_{j=1}^{k} w_j(p) \left[ R_j (p - g_j) + g_j + t_j \right] \tag{4.1} \]

where the sum runs over the k nearest nodes of p and wj is a weight that measures the influence of each node on the point.

Figure 4.1 Overview of the proposed approach from the depth map acquisition to the final non-rigid aligned surface. [Flowchart stages: source and target depth maps; cropped source and target depth maps; source and target surfaces; matching of points; building of quadtree; selection of nodes; weighting the influence of nodes; selection of constraints; error minimization; updating the source object; error > threshold: adapting quadtree; error ≤ threshold: deformed source surface.]
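Equation 4.1 can be read as a weighted blend of the positions predicted by each nearby node. A minimal sequential sketch is given below; the tuple layout `(g, R, t)` and the function names are assumptions, and the weights are taken as already normalized.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def deform_point(p, nodes, weights):
    """Apply Equation 4.1: nodes is a list of (g, R, t); weights are the w_j(p)."""
    out = [0.0, 0.0, 0.0]
    for (g, R, t), w in zip(nodes, weights):
        # Rotate p about the node position g, then translate by t.
        local = mat_vec(R, tuple(p[i] - g[i] for i in range(3)))
        for i in range(3):
            out[i] += w * (local[i] + g[i] + t[i])
    return tuple(out)
```

When every node carries the same transformation, the blend reduces to that single transformation, so a rigid motion is reproduced exactly.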

To solve the problem of non-rigid registration using this representation, we use three energy functions - Erot, Ereg, Econ (Sumner; Schmid; Pauly, 2007):

Energy function for rotation (Erot): In order for a 3 × 3 rotation matrix to represent a rotation in SO(3), it must satisfy six conditions: each of its three columns must be unit length, and all columns must be orthogonal to one another (Grassia, 1998). The squared deviation of these conditions is given by the function Rot(R):

\[ Rot(R) = (c_1 \cdot c_2)^2 + (c_1 \cdot c_3)^2 + (c_2 \cdot c_3)^2 + (c_1 \cdot c_1 - 1)^2 + (c_2 \cdot c_2 - 1)^2 + (c_3 \cdot c_3 - 1)^2 \tag{4.2} \]

where c1, c2 and c3 are the column vectors of a given rotation matrix.

The term Erot is defined by the sum of the rotation error over all affine transformations of G:

\[ E_{rot} = \sum_{j=1}^{m} Rot(R_j) \tag{4.3} \]

where m is the number of nodes in G.
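Equations 4.2 and 4.3 translate almost literally into code. The sketch below assumes each matrix is stored as a tuple of rows; the function names are hypothetical.

```python
def rot_error(R):
    """Rot(R): squared deviation of the columns of R from an orthonormal basis."""
    c1, c2, c3 = (tuple(R[i][j] for i in range(3)) for j in range(3))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return (dot(c1, c2) ** 2 + dot(c1, c3) ** 2 + dot(c2, c3) ** 2
            + (dot(c1, c1) - 1) ** 2 + (dot(c2, c2) - 1) ** 2 + (dot(c3, c3) - 1) ** 2)

def e_rot(matrices):
    """E_rot: sum of Rot(R_j) over the matrices of all graph nodes."""
    return sum(rot_error(R) for R in matrices)
```

The term is zero for any orthonormal matrix and grows as the matrix starts to scale or shear, which is what keeps the per-node transformations close to rotations.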

Energy function for regularization (Ereg): To apply a sufficiently smooth deformation, the affine transformations of adjacent nodes in G must be consistent with each other. Ereg is the sum of the squared distances between each node's transformation applied to its neighbours and the actual transformed neighbour positions:

\[ E_{reg} = \sum_{j=1}^{m} \sum_{k \in N(j)} \| R_j (g_k - g_j) + g_j + t_j - (g_k + t_k) \|_2^2 \tag{4.4} \]

where N(j) consists of all nodes connected to the node gj.
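A sequential sketch of the regularization term follows. It assumes the ED formulation, in which the actual transformed position of a neighbour gk is gk + tk; the data layout and the function names are hypothetical.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def e_reg(nodes, neighbours):
    """nodes: list of (g, R, t); neighbours[j]: indices of the nodes connected to node j."""
    total = 0.0
    for j, (gj, Rj, tj) in enumerate(nodes):
        for k in neighbours[j]:
            gk, _, tk = nodes[k]
            # Where node j predicts its neighbour should move to.
            rotated = mat_vec(Rj, tuple(gk[i] - gj[i] for i in range(3)))
            for i in range(3):
                predicted = rotated[i] + gj[i] + tj[i]
                actual = gk[i] + tk[i]   # assumption: ED form of the neighbour position
                total += (predicted - actual) ** 2
    return total
```

The term vanishes when neighbouring nodes agree on how the space between them moves, for instance under a common translation.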

Energy function for constraints (Econ): This energy function deals directly with Ps and Pt. It measures how distant they are from each other. Econ is the sum of the squared Euclidean distances between the deformed source points and their correspondents on the target object:

\[ E_{con} = \sum_{i=1}^{n} \| p'_i - q_i \|_2^2 \tag{4.5} \]

where qi is the target point corresponding to pi, p'i is pi after deformation (Equation 4.1) and n is the total number of points in Ps.

The total energy function Etot is defined by the following equation:

\[ E_{tot} = w_{rot} E_{rot} + w_{reg} E_{reg} + w_{con} E_{con} \tag{4.6} \]

We used wrot = 1, wreg = 10 and wcon = 100 in all our experiments, as suggested in related work (Sumner; Schmid; Pauly, 2007). We also tested other weights and alternative strategies for relaxing them at each iteration, but we did not obtain better results.
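Equations 4.5 and 4.6 can be sketched as two small functions. The default weights are the values reported above; the function names and the argument layout are assumptions.

```python
def e_con(deformed, targets):
    """E_con: sum of squared distances between deformed source points and their targets."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(deformed, targets))

def e_tot(rot, reg, con, w_rot=1.0, w_reg=10.0, w_con=100.0):
    """E_tot: weighted sum of the three energy terms (Equation 4.6)."""
    return w_rot * rot + w_reg * reg + w_con * con
```

With these weights, a small misalignment in Econ outweighs a comparable violation of the rotation or smoothness terms, so the solver prioritizes fitting the target.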

4.2 MATCHING OF POINTS

After object detection and segmentation, points from Ps and Pt are associated. In the MAR environment described in the previous chapter, it is assumed that there is temporal and spatial coherence between frames: the rigid registration has already been applied and, as a result, Ps and Pt are relatively close to each other. Hence, projective data association (Section 3.3) is used to match the points.

As an adaptation for GPU processing, each GPU thread transforms a single point ps into image coordinates and associates it with the point pt at the same image coordinate.
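The per-thread work can be sketched sequentially as below, with one loop iteration standing in for one GPU thread. The pinhole parameters `fx`, `fy`, `cx`, `cy`, the dictionary that maps a pixel to its target point, and the function names are assumptions made for illustration.

```python
def project(p, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to integer pixel coordinates."""
    x, y, z = p
    return (int(round(fx * x / z + cx)), int(round(fy * y / z + cy)))

def projective_association(source_points, target_map, fx, fy, cx, cy):
    """Pair each source point with the target point stored at the same pixel, if any."""
    pairs = []
    for ps in source_points:          # on the GPU: one thread per source point
        if ps[2] <= 0:
            continue
        pt = target_map.get(project(ps, fx, fy, cx, cy))
        if pt is not None:
            pairs.append((ps, pt))
    return pairs
```

No search is involved: the cost per point is one projection and one lookup, which is why this association only works well when the two clouds are already close.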

4.3 SELECTION OF NODES

After the matching of points, the nodes of G are selected. A quadtree is built on the GPU to perform the selection of nodes based on the 2D projection of G. As the nodes of G are also points in Ps, we can convert them from world to image coordinates by using the same process used to reproject Ps into Ds. Ps may be an object with holes distributed along the surface. In this case, selecting nodes based only on the 2D space may cause nodes to be selected in regions where there is no depth data. To solve this problem, we use what we call virtual nodes to represent the space where there is no depth data. Virtual nodes favor the expansion of the quadtree in regions where depth data is available, although not at the specific position of the node. It is worth mentioning that virtual nodes do not have an affine transformation; they are just leaves of the quadtree that can be refined to generate real leaf nodes if necessary. Therefore, we restrict the use of virtual nodes to the first two levels of the quadtree.

To build the quadtree, some information must be stored in the GPU memory space: the level of each node in G and, for each position, whether it holds a node of G, whether that node has children (i.e. is a parent node) and whether it holds a virtual node.

The algorithm can be divided in three steps: the building of the quadtree (Algorithm 1), the adaptive refinement (Algorithm 2) and collapse (Algorithm 3) of nodes in G.


Algorithm 1 Building a quadtree
 1: for each thread of index idx in parallel do
 2:   u ← getPixel(idx, currentLevel)
 3:   if depth(v(u)) > 0 then
 4:     insertNodeInGraph(u)
 5:     setLevel(u, currentLevel)
 6:   else if currentLevel <= 2 then
 7:     insertVirtualNodeInGraph(u)
 8:     setLevel(u, currentLevel)
 9:   end if
10:   if currentLevel > 1 and hasNode(u) then
11:     parentIdx ← idx/4
12:     u ← getPixel(parentIdx, currentLevel − 1)
13:     removeNodeFromGraph(u)
14:     removeVirtualNodeFromGraph(u)
15:     insertNodeInParentList(u)
16:   end if
17: end for

Figure 4.2 Building of the deformation graph (right) over the source object (left) based on the residual error measured (center).
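Algorithm 1 can be emulated sequentially to make the indexing concrete. The sketch rests on several assumptions that the pseudocode does not state: level l is taken to split the image into 2^l × 2^l cells, thread indices are taken to follow a Z-order (Morton) layout so that integer division by 4 yields the parent index, and hasNode is taken to refer to real nodes only. All names are hypothetical.

```python
def morton_decode(idx):
    """Split an interleaved (Z-order) index into cell coordinates (x, y)."""
    x = y = bit = 0
    while idx:
        x |= (idx & 1) << bit
        y |= ((idx >> 1) & 1) << bit
        idx >>= 2
        bit += 1
    return x, y

def get_pixel(idx, level, width, height):
    """Centre pixel of the cell handled by thread idx at the given level."""
    cells = 1 << level
    cx, cy = morton_decode(idx)
    return (int((cx + 0.5) * width / cells), int((cy + 0.5) * height / cells))

def build_quadtree(depth, max_level, virtual_levels=2):
    """Return (nodes, virtual, parents), keyed by (level, idx)."""
    h, w = len(depth), len(depth[0])
    nodes, virtual, parents = {}, {}, set()
    for level in range(1, max_level + 1):        # one kernel launch per level
        for idx in range(4 ** level):            # one GPU thread per cell
            u = get_pixel(idx, level, w, h)
            is_real = depth[u[1]][u[0]] > 0
            if is_real:
                nodes[(level, idx)] = u
            elif level <= virtual_levels:
                virtual[(level, idx)] = u
            if level > 1 and is_real:
                parent = (level - 1, idx // 4)   # four children share one parent
                nodes.pop(parent, None)
                virtual.pop(parent, None)
                parents.add(parent)
    return nodes, virtual, parents
```

In this emulation a cell without depth data at its centre stays virtual until one of its children lands on valid data, which mirrors the role described for virtual nodes.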

We build the quadtree in the first iteration of our algorithm. This building is shown as pseudocode in Algorithm 1 and one result is illustrated in Figure 4.2. First, we iteratively call the GPU kernel that selects the nodes, from the first level to the level required by the user. Each GPU thread computes, in parallel, the position u at which to select the node (line 2). To compute u, we need the thread id and the current level of the quadtree being iterated. The method getPixel shifts the position of the thread id to the center of the 2D space that will be represented by the node. If the point is visible, it becomes a new node in G (lines 3-5). Otherwise, it can become a new virtual node (lines 6-9). In this way, we allow the quadtree to be refined even in regions where there are just a few points. If the node was selected but it is not in the first level (line 10), the thread removes the parent node from G, whether it is a real or a virtual node, and inserts it into a parent list, indicating that it has already been expanded (lines 13-15). In this case, getPixel computes the position of the parent node based on the previous level in the quadtree hierarchy and the parent thread id (as the parent is expanded into four children, we simply divide the current thread id by 4 to obtain the parent id).

Algorithm 2 Refinement of nodes
 1: for each thread of index idx in parallel do
 2:   u ← getPixel(idx, currentLevel)
 3:   if (hasNode(u) or hasVirtualNode(u)) and getLevel(u) = currentLevel then
 4:     evaluateEcon(u)
 5:     if region around u must be refined then
 6:       for each child node at pixel uc do
 7:         if depth(v(uc)) > 0 then
 8:           insertNodeInGraph(uc)
 9:           setLevel(uc, currentLevel + 1)
10:         end if
11:       end for
12:       removeNodeFromGraph(u)
13:       removeVirtualNodeFromGraph(u)
14:       insertNodeInParentList(u)
15:     end if
16:   end if
17: end for

Figure 4.3 Refinement of the deformation graph (right) over the cheeks region of the source object (left) based on the residual error measured (center).

After the building of the quadtree, the nodes of G can be refined or collapsed according to the residual error measured in the previous iteration. The algorithm that refines the nodes is shown as pseudocode in Algorithm 2 and one result is illustrated in Figure 4.3. Again, we iteratively call the GPU kernel that refines the nodes. We iterate from the first level of the quadtree to the maximum level in order to refine the nodes in a top-down fashion. For each GPU thread in parallel, we compute the position of the thread in the 2D space and check whether there is a node at this position and whether it is at the current level being iterated (lines 2-3). If the thread satisfies these conditions, we compute the average of the error Econ (Equation 4.5) over a region C around u (line 4). If the average is above a certain threshold, the node must be refined. For each child node computed from the node position (line 6), we check whether there is a point at the child position (line 7). If there is, it becomes a new child node in G (line 8). The thread then removes the node from G (lines 12-13) and inserts it into a parent list, indicating that it has already been expanded (line 14).

The algorithm that collapses the nodes is shown as pseudocode in Algorithm 3 and one result is illustrated in Figure 4.4. Again, we iteratively call the GPU kernel that collapses the nodes. We iterate from the maximum level of the quadtree to the root node in order to collapse the nodes in a bottom-up fashion. For each GPU thread in parallel, we compute the position of the thread in the 2D space and check whether the node has children and whether it is at the current level being iterated (lines 2-3). If the thread satisfies these conditions, given a region C around u, we compute the average of the error Econ (Equation 4.5) over each ps ∈ C (line 4). If the average is below a certain threshold, the child nodes in C must be collapsed. To collapse the nodes, we check whether child nodes exist and whether they are leaf nodes (line 6). In this case, they are collapsed (lines 7-9) and C is represented by the old parent node (lines 10-11).

Algorithm 3 Collapsing of nodes

1: for each GPU thread in parallel do
2:     u ← position of the thread in the 2D space
3:     if hasChildren(u) and getLevel(u) = currentLevel then
4:         evaluateEcon(u)
5:         if region around u must be collapsed then
6:             if child nodes exist and they are leaves then
7:                 for each child node at pixel uc do
8:                     removeNodeFromGraph(uc)
9:                 end for
10:                insertNodeInGraph(u)
11:                removeNodeFromParentList(u)
12:            end if
13:        end if
14:    end if
15: end for
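A sequential Python counterpart of Algorithm 3 may help to make the bottom-up order concrete. This is a sketch under assumptions, not the GPU implementation: the function names, the dictionary that stores G, the set used as the parent list, the cell size `root_size >> level`, and the child positions at the quadrant corners are hypothetical choices. The region C is taken here to be the cell covered by the parent node.

```python
def mean_error(error, valid, x, y, size):
    """Average residual error over the valid pixels of a size x size cell."""
    rows, cols = len(error), len(error[0])
    vals = [error[j][i]
            for j in range(y, min(y + size, rows))
            for i in range(x, min(x + size, cols))
            if valid[j][i]]
    return sum(vals) / len(vals) if vals else 0.0


def collapse(graph, parents, error, valid, root_size, max_level, threshold):
    """Bottom-up collapsing; each level stands in for one kernel launch."""
    for level in range(max_level - 1, -1, -1):
        size = root_size >> level          # assumed cell size at this level
        half = size // 2
        if half == 0:
            continue
        # Only expanded nodes (those in the parent list) have children.
        for (x, y, _) in [p for p in parents if p[2] == level]:
            if mean_error(error, valid, x, y, size) >= threshold:
                continue                   # error still high: keep the detail
            slots = [(cx, cy) for cy in (y, y + half) for cx in (x, x + half)]
            if any((cx, cy, level + 1) in parents for cx, cy in slots):
                continue                   # a child is expanded, so not a leaf
            leaves = [s for s in slots if graph.get(s) == level + 1]
            if not leaves:
                continue
            for s in leaves:
                del graph[s]               # remove the child nodes from G
            graph[(x, y)] = level          # the old parent represents C again
            parents.discard((x, y, level))
    return graph, parents
```

Because the loop runs from the deepest level upwards, a region whose error has dropped everywhere can collapse through several levels in a single call.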

4.4 WEIGHTING THE INFLUENCE OF NODES

In this step, the influence of the k-nearest nodes on each point ps is computed. The weight wj is computed by Equation 4.7 and then normalized so that the weights sum to one. Here, distmax is the distance from p to its (k + 1)-nearest node. Equation 4.7 guarantees that the nearest nodes have more influence on the deformation of p. Also, since the nodes are themselves points of Ps, they are deformed by other nodes of G.

Figure 4.4 Collapsing of the deformation graph (right) over the cheeks region of the source object (left) after updating on the residual error (center).

To compute the weights efficiently on the GPU, we create an array that contains only the selected nodes. Direct access to this array spares us from checking explicitly on the surface whether a point is also a node. Then, each GPU thread computes the influence for a specific node in G.
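The weighting step can be sketched on the CPU as follows. The function name `node_weights` is hypothetical, and the squared falloff `(1 - d / dist_max) ** 2` is an assumption: it is the form commonly used with embedded deformation graphs and matches the properties stated above (nearer nodes weigh more, distmax is the distance to the (k + 1)-nearest node, and the weights are normalized), but the exact expression of Equation 4.7 may differ.

```python
import math


def node_weights(p, nodes, k):
    """Normalized influence of the k nearest nodes on point p.

    Returns a dict {node index: weight} whose values sum to one.
    """
    if len(nodes) <= k:
        raise ValueError("need at least k + 1 nodes")
    # Rank all nodes by Euclidean distance to p.
    ranked = sorted((math.dist(p, g), j) for j, g in enumerate(nodes))
    dist_max = ranked[k][0]                # distance to the (k+1)-nearest node
    if dist_max == 0.0:
        return {j: 1.0 / k for _, j in ranked[:k]}
    # Assumed falloff: weight decreases quadratically towards dist_max.
    raw = {j: (1.0 - d / dist_max) ** 2 for d, j in ranked[:k]}
    total = sum(raw.values())
    if total == 0.0:                       # all k nodes lie at dist_max
        return {j: 1.0 / k for j in raw}
    return {j: w / total for j, w in raw.items()}
```

The list `nodes` plays the role of the compact array of selected nodes mentioned above: the function never has to test whether an arbitrary surface point is a node.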

4.5 SELECTION OF CONSTRAINTS

To compute the best affine transformations that align Ps and Pt, we must:

1. Select the constraints (i.e. points from Ps that will be used during the optimization phase);

2. Convert the affine rotations from Euler to quaternion representation;

3. Compute the energy function Etot (Equation 4.6) that models the constraints to guide the proper registration of the objects;

4. Use a non-linear solver to minimize Etot.
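Step 2 of the list above can be illustrated with a short conversion routine. The Euler convention is not stated here, so this sketch assumes the common ZYX (yaw, pitch, roll) order with angles in radians and a quaternion stored as (w, x, y, z); the function name is hypothetical.

```python
import math


def euler_to_quaternion(roll, pitch, yaw):
    """Convert ZYX Euler angles (radians) to a unit quaternion (w, x, y, z)."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return (w, x, y, z)
```

A different angle order would permute the terms, so the convention has to match the one used when the rotations were produced.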

Instead of using the full dense point cloud as constraints for the optimization, or asking the user to select the constraints, we use an adaptive algorithm that selects the constraints based on the residual error previously measured (Equation 4.6). Given a region on the source surface, the higher the error, the higher the number of points selected as constraints for the optimization, as can be seen in Figure 4.5.

Figure 4.5 Constraint selection based on the initial non-rigid error between source and target surfaces. Panels: Source Surface, Target Surface, Initial Error, Constraints; the error color scale runs from 0 to max.

In the first iteration of the optimization algorithm, when the residual error has not yet been measured, a uniform sampling is used to select the constraints. To do that, an n × n mask, with step n, is scanned over the 2D projection of Ps at the xy coordinates. The point at the center of this mask is selected as a constraint if it exists in Ps (i.e. it is not in a hole). From empirical tests, n = 4 produced the best results. A discussion about the most appropriate value for n is given in Chapter 5, Section 5.2.
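The uniform sampling of the first iteration can be sketched as below. The name `uniform_constraints` and the boolean grid `valid` (true where a point of Ps projects onto the pixel) are hypothetical, and taking the pixel at offset `n // 2` as the center of an even-sized mask is an assumption.

```python
def uniform_constraints(valid, n=4):
    """Pick the center pixel of every n x n mask position that holds a point."""
    rows, cols = len(valid), len(valid[0])
    picked = []
    for y in range(0, rows, n):            # the mask advances with step n
        for x in range(0, cols, n):
            cx, cy = x + n // 2, y + n // 2
            # Skip centers that fall outside the image or inside a hole.
            if cy < rows and cx < cols and valid[cy][cx]:
                picked.append((cx, cy))
    return picked
```

On an 8 × 8 projection without holes and n = 4, this yields the four points (2, 2), (6, 2), (2, 6) and (6, 6).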

In the remaining iterations of the optimization, we use the same n × n mask to scan the 2D projection of Ps and its residual error Etot (Equation 4.6). First, the algorithm evaluates the average residual error over the n × n region being scanned. Based on this average error, which we call Eavg, and a predefined threshold thc, the number of points selected in that region is determined. There are three cases:

1. Eavg > thc: all n² points are selected;

2. thc/2 ≤ Eavg ≤ thc: n points uniformly distributed over the mask are selected;

3. Eavg < thc/2: only the point at the center of the mask is selected.

Therefore, we select more constraints in the regions where the deformation is high and must be minimized, while still representing the regions where the deformation is small or absent with a small number of constraints. From empirical tests, setting thc to half of the average root mean squared error measured for the dataset produced the best results.
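The three cases above translate into a simple per-mask decision, sketched here sequentially. The names `adaptive_constraints`, `valid` and `error` are hypothetical. The text does not specify how the n uniformly distributed points of the second case are placed, so this sketch assumes the main diagonal of the mask; any other even spread would serve equally well.

```python
def adaptive_constraints(valid, error, th_c, n=4):
    """Select constraints per n x n mask according to its average error."""
    rows, cols = len(valid), len(valid[0])
    picked = []
    for y in range(0, rows, n):
        for x in range(0, cols, n):
            # Pixels of this mask position that hold a point of the surface.
            cell = [(i, j)
                    for j in range(y, min(y + n, rows))
                    for i in range(x, min(x + n, cols))
                    if valid[j][i]]
            if not cell:
                continue
            e_avg = sum(error[j][i] for i, j in cell) / len(cell)
            if e_avg > th_c:
                picked.extend(cell)        # case 1: every point in the mask
            elif e_avg >= th_c / 2:
                # Case 2: n points; the diagonal placement is an assumption.
                picked.extend((x + d, y + d) for d in range(n)
                              if (x + d, y + d) in cell)
            else:
                center = (x + n // 2, y + n // 2)
                if center in cell:
                    picked.append(center)  # case 3: only the center point
    return picked
```

With n = 4, a high-error mask contributes 16 constraints, a medium-error mask 4, and a low-error mask a single one, so the constraint density follows the residual error.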

4.6 ERROR MINIMIZATION

In this stage, the affine transformation A = [R|t], where R is a 3 × 3 rotation matrix and t is a 3D translation vector, is estimated for each node by a non-linear Gauss-Newton

