Parallel Repairing 3D Fuzzy Images into Well-Composed Images

Federal University of Rio Grande do Norte Exact and Earth Sciences Center

Department of Informatics and Applied Mathematics Bachelor in Computer Science

Rafael Lucena Germano

Natal-RN June 2018

Bachelor’s dissertation presented to the Department of Informatics and Applied Mathematics at the Exact and Earth Sciences Center at the Federal University of Rio Grande do Norte in partial fulfillment of the requirements for the degree of bachelor in Computer Science.

Advisor

Dr. Bruno Motta de Carvalho

Federal University of Rio Grande do Norte – UFRN Department of Informatics and Applied Mathematics – DIMAp

Germano, Rafael Lucena.

Parallel repairing 3D fuzzy images into well-composed images / Rafael Lucena Germano. - 2018.

51 leaves: ill.

Monograph (undergraduate) - Federal University of Rio Grande do Norte, Exact and Earth Sciences Center, Department of Informatics and Applied Mathematics, Bachelor in Computer Science. Natal, 2018.

Advisor: Bruno Motta de Carvalho.

1. Computing. 2. Well-composed images. 3. Digital topology. 4. CUDA. I. Carvalho, Bruno Motta de. II. Title. RN/UF/CCET CDU 004

Federal University of Rio Grande do Norte - UFRN, Library System - SISBI

Cataloging-in-Publication. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET

Bachelor dissertation under the title of Parallel Repairing 3D Fuzzy Images into Well-Composed Images presented by Rafael Lucena Germano and accepted by the Department of Informatics and Applied Mathematics of the Exact and Earth Sciences Center of the Federal University of Rio Grande do Norte, being approved by all the members of the thesis committee specified below:

Dr. Bruno Motta de Carvalho

Advisor

Department of Informatics and Applied Mathematics Federal University of Rio Grande do Norte

Dr. Rafael Beserra Gomes

Dr. Selan Rodrigues dos Santos

Acknowledgements

I would like to first thank my advisor, Bruno Motta de Carvalho, for his guidance and patience during the development of this work.

I also thank the members of the examination committee for their comments on this work and for agreeing to participate on such short notice.

I thank my parents, Fernando and Heliana, and my sisters, Amanda and Camila, for their support.

It is not knowledge, but the act of learning, not possession but the act of getting there, which grants the greatest enjoyment.

Abstract

This work presents a parallel version, implemented in the Compute Unified Device Architecture (CUDA), of the algorithm presented in (SIQUEIRA et al., 2008) for repairing 3D images into well-composed ones, as well as a comparison between heuristics for obtaining the well-composed image that minimizes the difference from the original image. The algorithm is based on successively changing points from one object to another until the image becomes well-composed. Well-composed images are images in which the intersection of the voxels of an object with its complement forms a topological surface. Such images enjoy very useful properties that reduce the processing time of algorithms such as thinning and surface extraction. Lastly, the performance of the parallel and sequential versions is compared and the well-composed images produced are analyzed.

List of figures

1 A voxel centered in p.
2 Types of critical configurations
3 Hardware model of a GPU
4 Repairing a critical configuration
5 Possible data access conflicts of threads visiting close points
6 Visitation order of points and synchronization calls of point based algorithm.
7 Visitation order of points and synchronization calls of cube based algorithm.

List of tables

1 Time spent finding critical configurations for the sequential (S) and parallel (P) algorithms.
2 Comparison between the developed algorithms for I1
3 Comparison between the developed algorithms for I2

List of Abbreviations

CUDA – Compute Unified Device Architecture
GPU – Graphics Processing Unit
SM – Streaming Multiprocessor
API – Application Programming Interface
CPU – Central Processing Unit

3 Repairing Algorithm
3.1 Sequential Version
3.1.1 Binary Images
3.1.2 Non-binary images
3.2 Parallel Version
3.2.1 Binary Images
3.2.1.1 Finding critical configurations
3.2.1.2 Thread per point
3.2.1.3 Thread per block
3.2.1.4 Further insight
3.2.2 Non-binary images
3.2.3 Extending to Fuzzy Images
3.3 Heuristics to choose points
4 Results
5 Final Remarks
5.1 Conclusion
5.2 Future Works

1 Introduction

In several applications it is desirable to work with well-composed 3D images: discretizations of real space into a cubic grid that have the well-composedness property, which solves the problem of close points in cubic grids being connected only through an infinitesimal area, that is, through a vertex or an edge. Informally, a well-composed image is an image such that, if two cubes in the image share only an edge, then there is a cube that shares a face with both of them, and, if two cubes share only a vertex, then there are two cubes that share a face with each other such that each of them shares a face with one of the cubes sharing the vertex.

Images of this kind have several interesting properties, such as a version of the Jordan-Brouwer separation theorem for 2D (LATECKI; ECKHARDT; ROSENFELD, 1995) and 3D (LATECKI, 1997) well-composed images, as well as the fact that the Euler characteristic is locally computable (LATECKI, 1997). These properties allow the application of simpler and faster algorithms, such as thinning (MARCHADIER; ARQUÈS; MICHELIN, 2002), (XIE; THOMPSON; PERUCCHIO, 2003) and surface extraction (SIQUEIRA; LATECKI; GALLIER, 2005), which would not work, or would require extra processing, on images that are not well-composed.

However, although well-composedness offers useful properties, many problems that work with 3D images usually generate an image by some digitizing process followed by a segmentation step to identify the objects of interest. For example, in medicine it is part of the first steps to create cross-sectional profiles of organs which may be used to assess diseases, followed by a thinning step, which could be faster if the image is guaranteed to be well-composed, and some other steps.

There are several segmentation algorithms which produce a fuzzy map for the image, a map that associates to each point in the image values of pertinence to the objects in the image, thus, coding the uncertainty inherent to a particular segmentation task. However, since the segmentation algorithm may not generate a well-composed image, a repairing algorithm is necessary to obtain a well-composed image.

Therefore we extend the definition of well-composedness as well as the algorithm presented by (SIQUEIRA et al., 2008) to fuzzy images and develop a parallel algorithm to repair fuzzy images into well-composed images.

The algorithm presented by (SIQUEIRA et al., 2008) consists of systematically choosing a point from the background and inserting it into the foreground, or conversely moving a point from the foreground to the background, until the image becomes well-composed. It should be noted that this algorithm is randomized, because more than one point could be changed to make the image well-composed, and the rule used is to choose any of these points at random.

Hence, we propose a parallel algorithm that changes multiple points at the same time, aiming to make the algorithm several times faster, and we use the pertinences of the points (a fuzzy concept) to remove the randomness of the algorithm, applying a rule that minimizes the number of changes while trying to obtain the well-composed image most similar to the segmented image. We also compare heuristics that change points both from foreground to background and from background to foreground, to observe how similar the resulting image is to the original, at the cost of the possibility that the algorithm does not terminate.

Changing points in both directions may cause the algorithm to never stop, as it may oscillate between inserting a point into the foreground and removing it, as pointed out by (BOUTRY; GÉRAUD; NAJMAN, 2015). However, if the algorithm does not finish after a set number of iterations, we may switch to only inserting points into (or only removing points from) the foreground, which guarantees that the algorithm eventually terminates.

1.1 Outline

Chapter 2 is divided into four sections. In Section 2.1 we introduce the concepts and definitions of digital image processing necessary to understand this work. In Section 2.2 we introduce the Compute Unified Device Architecture (CUDA), which was chosen to develop the parallel repairing algorithm. In Section 2.3 the related works in the area of repairing images into well-composed images are presented, and in Section 2.4 we explain the segmentation algorithm used to generate fuzzy images.

Chapter 3 is divided into three sections. In Section 3.1 we explain a sequential algorithm to repair images into well-composed images, in Section 3.2 we present a parallel version based on the sequential repairing algorithm, and in Section 3.3 we discuss heuristics to find well-composed images with fewer iterations and fewer changes.

In Chapter 4 we present and discuss the obtained results. In Chapter 5 we present our conclusion and future works that may stem from this work.

2 Basic Concepts and Related Works

In this chapter we define the most important concepts needed to understand this work in Section 2.1, introduce CUDA in Section 2.2, present related works in Section 2.3 and explain the segmentation algorithm in Section 2.4.

2.1 Definitions

The objects of study are going to be digital images which are a discretization of the real space. Thus, a definition of the discrete space we are working with must be given. This space will be called a 3D grid and it is defined as:

Definition 1. A 3D grid is a set of points of the form $\delta\mathbb{Z}^3 = \{ (\delta z_1, \delta z_2, \delta z_3) \mid (z_1, z_2, z_3) \in \mathbb{Z}^3 \}$, where $\delta \in \mathbb{R}^+$ is a constant.

This definition describes a regular cubic grid. There are other ways to define 3D grids. However, for this text we focus on the study of cubic grids, thus we define a voxel as a cube and say that:

Definition 2. Given a point $p = (p_1, p_2, p_3) \in \delta\mathbb{Z}^3$, the voxel centered in p, denoted by $V(p)$, is the cube in $\mathbb{R}^3$ with sides of length $2\delta$ and center p, as shown in Figure 1, or, more formally,

$$V(p) = \{ (q_1, q_2, q_3) \in \mathbb{R}^3 \mid |p_i - q_i| \le \delta,\ i \in \{1, 2, 3\} \}.$$

Given this definition of voxel, we have that:

Definition 3. The neighborhood of a point $p \in \delta\mathbb{Z}^3$ is the set of points $q \in \delta\mathbb{Z}^3 - \{p\}$ such that $V(p) \cap V(q)$ contains a face, an edge or a vertex of a cube; more specifically, these neighborhoods are called the 6-neighborhood, the 18-neighborhood and the 26-neighborhood, respectively. Moreover, we denote by $N_6(p)$, $N_{18}(p)$ and $N_{26}(p)$ the 6-, 18- and 26-neighborhood of p and say that a pair of points p and q are adjacent, or neighbors, if each is in the neighborhood of the other.

Figure 1: A voxel centered in p.
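The three neighborhoods of Definition 3 can be told apart by how many coordinates differ between two grid points. The following is a minimal sketch, assuming points are stored by their integer coordinates $(z_1, z_2, z_3)$ with the spacing $\delta$ factored out; the names `Point3`, `smallest_neighborhood` and `in_neighborhood` are hypothetical and do not come from the implementation described in this work.

```cpp
#include <cstdlib>

// Hypothetical integer grid coordinates (z1, z2, z3); delta is factored out.
struct Point3 { int x, y, z; };

// Returns 6, 18 or 26 for the smallest neighborhood of p that contains q,
// or 0 if q is p itself or is not a neighbor of p at all.
int smallest_neighborhood(Point3 p, Point3 q) {
    int dx = std::abs(p.x - q.x);
    int dy = std::abs(p.y - q.y);
    int dz = std::abs(p.z - q.z);
    if (dx > 1 || dy > 1 || dz > 1) return 0;  // the voxels do not touch
    switch (dx + dy + dz) {
        case 1: return 6;    // the voxels share a face
        case 2: return 18;   // the voxels share only an edge
        case 3: return 26;   // the voxels share only a vertex
        default: return 0;   // p and q are the same point
    }
}

// True if q belongs to the n-neighborhood of p, for n in {6, 18, 26}.
// The neighborhoods are nested: N6(p) is contained in N18(p), which is
// contained in N26(p).
bool in_neighborhood(Point3 p, Point3 q, int n) {
    int s = smallest_neighborhood(p, q);
    return s != 0 && s <= n;
}
```

For instance, under these assumptions the point (1, 1, 1) is in the 26-neighborhood of the origin but not in its 18-neighborhood.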

Throughout this text, the neighborhood of a point always refers to its 26-neighborhood unless stated otherwise. From the definition of neighborhood, the concept of a path follows directly:

Definition 4. A path is a finite ordered tuple of points $\langle p_0, \dots, p_m \rangle$ such that, for all $p_n$ with $0 \le n < m$, we have $p_{n+1} \in N(p_n)$ for some neighborhood $N$.

Finally we define a 3D digital image as:

Definition 5. A 3D digital image, or simply image, is a function $I : G \to L$, where $G$ is a 3D grid and $L \subset \mathbb{Z}$ is a finite set of labels for the points. These labels may be interpreted in many ways, such as the object that contains that point in space, in which case we say that the n-th object of the image is the set of points p for which $I(p) = n$, or as a binary representation of a color, such as a 24-bit representation of the RGB space.

This definition comes with the following notation:

Definition 6. A point p is in $I : G \to L$ if $p \in G$, and we say that we change a point p in I to $l \in L$ when we change the value of I at p to l, that is, $I(p) = l$.

A binary image is the particular case of 2 labels that is defined as:

Definition 7. A binary image is any image of the form $I : G \to \{0, 1\}$, where the background points are the points $p \in G$ such that $I(p) = 0$ and the foreground points are the complement of the background points.

Furthermore, we define a fuzzy image as an image where each point has a pertinence value associated with each object, or more formally:

Definition 8. A fuzzy image is a function $I : G \to [0, 1]^n$, where $G$ is a 3D grid and $n$ is a positive integer.

Finally, well-composedness is defined following the original definition of (LATECKI, 1997), but extended from binary images to images with multiple objects.

Definition 9. An object in an image is said to be well-composed if and only if the intersection of the voxels in that object with its complement forms a 2D-manifold in R3 and an image is said to be well-composed if and only if all the objects in the image are well-composed.

This definition, however, is not very practical to work with, which is why the following theorem is used to develop the repairing algorithm. Before stating the theorem, it is important to define what a critical configuration is.

Definition 10. An instance of a critical configuration is a set of 4 (respectively 8) adjacent points such that two of the points are both in the foreground or both in the background, they are not in the 6-neighborhood (respectively 18-neighborhood) of each other, and the remaining 2 (respectively 6) points are in the complement. Figure 2 illustrates the 2 types of critical configurations that may occur, modulo reflections and rotations.

In this text these two critical configurations are called shared edge and shared vertex, respectively. The theorem follows directly from the definition of a critical configuration:

Theorem 1. (LATECKI, 1997) An image is well-composed if and only if no instance of a critical configuration (modulo reflections and rotations) occurs for any point in the image.

Corollary 1. If an image is well-composed then its complement and its rotations are also well-composed.

Proof. Let $I$ be a well-composed image and $I'$ a rotation or the complement of $I$. By Theorem 1, no critical configuration shown in Figure 2, nor any complement or rotation of such a configuration, occurs in $I$. Therefore no critical configuration can occur in $I'$: applying the inverse operation would yield a critical configuration in $I$, but by hypothesis $I$ is well-composed. We conclude that $I'$ does not contain a critical configuration and is therefore well-composed, by Theorem 1.
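To make Definition 10 concrete for binary images, the sketch below tests the two kinds of critical configuration on the four values of a 2 × 2 face and on the eight values of a 2 × 2 × 2 cube. It is only an illustration under stated assumptions: the function names are hypothetical, values are 0 or 1, and the cube corners are assumed to be indexed so that corner n has offset (n % 2, (n / 2) % 2, (n / 4) % 2), which makes corners n and 7 - n diagonally opposite.

```cpp
#include <array>

// Shared edge: p1..p4 are the values of a 2 x 2 face, with p1 and p4 on one
// diagonal and p2 and p3 on the other. The face is critical when one diagonal
// lies in an object and the other diagonal lies in its complement.
bool is_shared_edge(int p1, int p2, int p3, int p4) {
    return p1 == p4 && p2 == p3 && p1 != p2;
}

// Shared vertex: cube[n] is the value of the corner with offset
// (n % 2, (n / 2) % 2, (n / 4) % 2). The cube is critical when two
// diagonally opposite corners have one value and the other six corners
// have the other value.
bool is_shared_vertex(const std::array<int, 8>& cube) {
    for (int n = 0; n < 4; ++n) {
        int m = 7 - n;  // corner diagonally opposite to n
        if (cube[n] != cube[m]) continue;
        bool rest_is_complement = true;
        for (int k = 0; k < 8; ++k) {
            if (k != n && k != m && cube[k] == cube[n]) {
                rest_is_complement = false;
            }
        }
        if (rest_is_complement) return true;
    }
    return false;
}
```

Because both tests compare values rather than checking for a fixed label, they detect a configuration and its complement alike, in line with Corollary 1.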

2.2 An introduction to CUDA

This section gives a brief introduction to CUDA, presenting its architecture, some of the concepts that come with it, and CUDA C, an extension of the C programming language that will be used to develop the parallel version of the repairing algorithm.

Figure 2: Types of critical configurations, where only points in the foreground are shown. (a) Shared edge. (b) Shared vertex.

2.2.1 Architecture

Compute Unified Device Architecture (CUDA) is a hardware and software architecture developed by NVIDIA to issue and manage computations on Graphics Processing Units (GPUs). It will be used with the C++ programming language to develop our parallel repairing algorithm. In this section we present the CUDA concepts necessary to understand this text.

The first concept to be introduced is the kernel: a function that is executed in parallel across a grid of blocks of threads, where each block is independent of the others. Blocks are executed by Streaming Multiprocessors (SMs), each of which is a set of processors sharing on-chip memory and an instruction unit, as illustrated in Figure 3. When a kernel is launched, each block of the grid is enumerated and distributed to an SM with available resources. The threads of a block execute in parallel on the same SM, and multiple blocks may execute on the same SM. When enough threads terminate, a new block is launched on the SM.

Threads of the same block are grouped together in warps, and each warp executes the same instruction for all of its threads. A warp is said to be active when at least one of its threads is being executed. If threads take different paths at a conditional branch, the warp executes each path taken in turn, while the threads that did not take that path stay idle. When this happens we say that the threads diverged and that a warp divergence occurred.

In a GPU, threads have access to several kinds of memory, as shown in Figure 3. The device memory, also called global memory, is the slowest memory in a GPU but is capable of storing large amounts of data. The shared memory is a memory block shared by all threads in the same thread block; it is much faster than global memory but can only store small amounts of data. Finally, each thread has a very small number of registers, accessible only by that thread, which are the fastest memory in a GPU.

Figure 3: Hardware model of a GPU. Image taken from NVIDIA CUDA toolkit documentation.

There are also two other read-only cached memories: the constant memory, which can be as fast as one clock cycle when the data is in cache, and the texture memory, which is faster than global memory and is used by graphics Application Programming Interfaces (APIs), such as the Open Graphics Library and DirectX, to store textures, although it may also be used for general computing.

Two important concepts for increasing performance that arise from GPU memory and warps are coalesced memory access and occupancy. The first comes from the fact that several bytes are transferred in a single memory transaction from global memory. Thus, to increase performance, we should reduce the number of memory transactions by storing data in such a way that the bytes used by the same warp are close together in global memory. The second consists of maximizing the ratio between the number of active warps and the maximum number of warps that can be active in an SM. In general, to maximize the number of active warps, an SM must have enough resources for each warp to be active; these resources include the registers per thread and per block, the total number of registers in an SM, the shared memory per block and the total amount of shared memory, and the total number of active threads, among others.

Two other important concepts from parallel programming are work complexity and step complexity. The work complexity refers to the total number of operations, across all the processors in all SMs, necessary to execute the program, while the step complexity refers to the total number of parallel steps necessary to execute the program, assuming that there are enough processors to perform all the operations of a step at once.
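As an illustration of the difference between the two measures, consider summing N values with a pairwise (tree) reduction. This example is not part of the repairing algorithm; it is a sketch that simulates the parallel rounds sequentially on the CPU and counts both quantities, with `ReductionCost` and `tree_sum` being hypothetical names. The work is N - 1 additions, the same as a sequential sum, while the number of steps is only ⌈log2 N⌉, since all additions of a round are independent of each other.

```cpp
#include <cstddef>
#include <vector>

struct ReductionCost {
    long long sum;       // result of the reduction
    std::size_t work;    // total number of additions performed
    std::size_t steps;   // number of parallel rounds
};

// Pairwise reduction: in each round, element i accumulates element i + half.
// The additions inside one round are independent, so with enough processors
// each round would count as a single parallel step.
ReductionCost tree_sum(std::vector<long long> data) {
    ReductionCost cost = {0, 0, 0};
    std::size_t n = data.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
            ++cost.work;
        }
        n = half;
        ++cost.steps;
    }
    cost.sum = data.empty() ? 0 : data[0];
    return cost;
}
```

For 8 values this sketch performs 7 additions in 3 rounds.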

2.2.2 CUDA C

To end this brief introduction to CUDA, we present CUDA C, an extension of C that allows the user to communicate from the host, that is, the Central Processing Unit (CPU) and its memory, with the device, that is, the GPU and its memory. CUDA C consists of a minimal set of extensions and a runtime library. The extensions necessary to fully understand this work are:

• Function execution space specifiers
  – __device__ denotes a function executed in the device, callable only from the device;
  – __host__ denotes a function executed in the host, callable only from the host;
  – and __global__ denotes that the function is a kernel, which is executed in the device, is callable from either the host or the device, and does not return a value;
• Variable memory space specifiers
  – __device__ denotes that the variable is in global memory and may be addressed by any thread as well as by the host;
  – __shared__ denotes that the variable is in the shared memory of a thread block and may only be accessed by threads in that block;
• Built-in vector types
  – A signed primary type followed by a number 1 ≤ n ≤ 4 denotes a vector type composed of n variables of the specified type;
    ∗ For example, int3 denotes a vector of 3 integers;
    ∗ The components are accessible, from the first to the last, by the fields x, y, z and w;
    ∗ Vector type variables are created by functions named make_ followed by the vector type, for example make_int3;
  – The same goes for unsigned primary types, but the type names are preceded by a ‘u’ character; for example, uint3 denotes a vector of 3 unsigned integers;
  – dim3 is a type based on uint3 used to specify dimensions;
• Built-in variables
  – threadIdx is a uint3 variable that contains the thread index in a block;
  – blockIdx is a uint3 variable that contains the block index in a grid;
  – blockDim is a dim3 variable that contains the dimensions of the block.

It is also important to know the synchronization functions __syncthreads and cudaDeviceSynchronize. The first is a barrier at which each thread waits until all the other threads of its block arrive at the same point, while the second waits until all the work previously issued to the device, and hence every thread of a launched kernel, has finished. The memory manipulation functions cudaMalloc, cudaMemcpy and cudaFree are used to allocate, copy and free memory in the GPU from the CPU, respectively.

Lastly, the code in Algorithm 1 illustrates how to call a kernel with 1 block and N threads per block using the operator <<< ... >>>. The operator receives two arguments, both of type dim3: the size of the grid and the size of each block.

Uninitialized coordinates of a dim3 variable are initialized with 1. Thus the size of the grid in the example is 1 × 1 × 1, which means that only one block is executed, on a single SM, and that block contains N × 1 × 1 threads.
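The meaning of the launch configuration can also be pictured with a CPU-only emulation, in which two nested loops play the role of the grid and of the block and pass the built-in variables to the kernel body explicitly. This is only a sketch under stated assumptions: `Dim3`, `sum_arrays_thread` and `emulate_launch` are hypothetical stand-ins written for this illustration, only the x dimension is emulated, the threads run one after another rather than in parallel, and the global index `blockIdx.x * blockDim.x + threadIdx.x` generalizes the single-block indexing of Algorithm 1 to several blocks.

```cpp
// Hypothetical stand-in for dim3: unspecified coordinates default to 1.
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Body of the kernel for one thread; the built-in variables of CUDA are
// passed explicitly because this is plain C++ running on the CPU.
void sum_arrays_thread(Dim3 blockIdx, Dim3 blockDim, Dim3 threadIdx,
                       const float* V1, const float* V2, float* R,
                       unsigned n) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n) {  // threads beyond the end of the arrays do nothing
        R[i] = V1[i] + V2[i];
    }
}

// Emulates a launch with the given grid and block sizes: every thread of
// every block executes the kernel body once.
void emulate_launch(Dim3 grid, Dim3 block,
                    const float* V1, const float* V2, float* R, unsigned n) {
    for (unsigned b = 0; b < grid.x; ++b) {
        for (unsigned t = 0; t < block.x; ++t) {
            sum_arrays_thread(Dim3(b, 0, 0), block, Dim3(t, 0, 0),
                              V1, V2, R, n);
        }
    }
}
```

Calling `emulate_launch(1, N, V1, V2, R, N)` corresponds to the configuration of Algorithm 1, with the integer arguments converted to 1 × 1 × 1 and N × 1 × 1 sizes in the same way as described for dim3.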

Algorithm 1 Example of a kernel call in CUDA to sum two arrays.

    // Kernel definition
    __global__ void sum_arrays(float* V1, float* V2, float* R)
    {
        int i = threadIdx.x;
        R[i] = V1[i] + V2[i];
    }

    int main()
    {
        ...
        // Calling a kernel with 1 block of N threads
        sum_arrays<<<1, N>>>(V1, V2, R);
        ...
    }

2.3 Related Works

The first definition of well-composed images was given by (LATECKI; ECKHARDT; ROSENFELD, 1995) for 2D images and later (LATECKI, 1997) extended the definition to 3D images, although other authors such as (MA; SONKA, 1996) and (TUSTISON et al., 2011) have given other definitions for well-composed images.

Later, (LATECKI, 1998) introduced the idea of modifying binary 2D images to produce well-composed images, and (ROSENFELD; KONG; NAKAMURA, 1998) presented an image operator capable of producing binary 2D and 3D well-composed images, but with 9 and 27 times more points for 2D and 3D images, respectively.

Another algorithm, based on changing the image by inserting or removing points instead of applying an operator, was presented in (SIQUEIRA et al., 2008). Unlike the one presented by (ROSENFELD; KONG; NAKAMURA, 1998), it is a randomized algorithm that preserves the number of points in 3D images. (BOUTRY; GÉRAUD; NAJMAN, 2015) later extended the algorithm of (SIQUEIRA et al., 2008) to nD images.

Other algorithms were also developed to produce well-composed images; however, those of (LATECKI, 2000), (STELLDINGER; LATECKI, 2006) and (GÉRAUD; CARLINET; CROZET, 2015) are based on some kind of interpolation that preserves the topology in the digitizing process.

Following the work of (SIQUEIRA et al., 2008) we present a parallel version of the repairing algorithm and we also propose a deterministic version by working with fuzzy images.

2.4 The Fuzzy Segmentation Algorithm

The multi-object fuzzy segmentation algorithm used to generate fuzzy images for this work was introduced in (HERMAN; CARVALHO, 2001) and improved in (CARVALHO, 2003). This algorithm is based on the concept of fuzzy connectedness introduced by (ROSENFELD, 1979) and on the work of (UDUPA; SAMARASEKERA, 1996). It takes as inputs a 2D or 3D image $I$, $M \in \mathbb{Z}^+$ sets of seeds, where $M$ is the number of objects in the fuzzy image, and affinity functions $\Psi = \{\psi_1, \psi_2, \dots, \psi_M\}$, and returns a fuzzy segmentation $\sigma$ of the image.

The fuzzy segmentation is a function $\sigma : G \to [0, 1]^{M+1}$, similar to a fuzzy image. However, it maps to one more value and has the restriction that the first element of the mapped vector is positive and each of the other elements is equal either to 0 or, if the pertinence to the corresponding object is positive, to the first element of the vector. Given this definition of segmentation, a fuzzy image may be generated by ignoring the first element of the vector mapped by $\sigma$, as it only keeps a copy of the maximum pertinence.

The fuzzy segmentation algorithm is a semi-automatic algorithm that receives the seeds and affinity functions for the objects to be segmented as input parameters. A seed s is a point in $I$ such that, if it is in the m-th set of seeds, then s is in the m-th object of the fuzzy image to be generated, while an affinity function is a function $\psi : (G, I(G))^2 \to [0, 1]$ that takes a pair of elements, each consisting of a point and its value in $I$, and associates a similarity value with the pair. For example, the function shown in Equation 2.1 was used in (CARVALHO, 2003).

$$\psi((c, c'), (d, d')) = \begin{cases} \left[ g_1(c' + d') + g_2(|c' - d'|) \right], & \text{if } c \in N_6(d), \\ 0, & \text{otherwise.} \end{cases} \tag{2.1}$$

The idea of this algorithm is that the seeds selected in the m-th set are completely inside the m-th object; thus they have pertinence 1 to that object and 0 to the remaining ones. This information is stored in $\sigma$ by setting the first and the (m + 1)st elements of the vector of the seed s, $\sigma^s_0$ and $\sigma^s_{m+1}$, respectively, to 1.

Then a point p is added to the m-th object by setting $\sigma^p_{m+1}$ to the strength of the strongest chain from a seed of the m-th object to that point, where the strength of the strongest chain is the maximum affinity of any path from the seed to the point and the affinity of a path is the minimum value that the affinity function assigns to a pair of consecutive points in the path. Furthermore, if p is a point in a path of the strongest chain, then $\sigma^p_{m+1} > 0$ must hold for it to be inserted in the priority queue used by the algorithm.

This is done for all points. If a point has chains to the seeds of more than one object, only the strongest of them is stored in $\sigma$; however, if the strongest chains are equally strong, the point keeps the same value for both objects. After all points have a chain to a seed, the segmentation is finished and the fuzzy image $I$ with $M$ objects is created by ignoring the first value that a point is mapped to in $\sigma$, that is, $I(p)_m = \sigma^p_{m+1}$ for all $m < M$ and all p in $I$.
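The notion of strongest chain can be sketched with a variant of Dijkstra's algorithm that maximizes the minimum affinity along a path. The code below is a simplified illustration, not the multi-object algorithm of (HERMAN; CARVALHO, 2001): it assumes a single seed and a single object, and it works on a small hypothetical graph whose links carry precomputed affinities instead of on an image with an affinity function. The names `Graph`, `add_link` and `strongest_chains` are introduced only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical graph: g[u] lists (neighbor, affinity) pairs, with each
// affinity in [0, 1].
typedef std::vector<std::vector<std::pair<int, double> > > Graph;

// Adds a symmetric link between u and v with the given affinity.
void add_link(Graph& g, int u, int v, double affinity) {
    g[u].push_back(std::make_pair(v, affinity));
    g[v].push_back(std::make_pair(u, affinity));
}

// For every point, computes the strength of the strongest chain from the
// seed: the maximum, over all paths, of the minimum affinity between
// consecutive points of the path. The seed itself has strength 1.
std::vector<double> strongest_chains(const Graph& g, int seed) {
    std::vector<double> strength(g.size(), 0.0);
    strength[seed] = 1.0;
    // Max-heap ordered by strength, so stronger chains are expanded first.
    std::priority_queue<std::pair<double, int> > frontier;
    frontier.push(std::make_pair(1.0, seed));
    while (!frontier.empty()) {
        double s = frontier.top().first;
        int u = frontier.top().second;
        frontier.pop();
        if (s < strength[u]) continue;  // outdated queue entry
        for (std::size_t k = 0; k < g[u].size(); ++k) {
            int v = g[u][k].first;
            double candidate = std::min(s, g[u][k].second);
            if (candidate > strength[v]) {
                strength[v] = candidate;
                frontier.push(std::make_pair(candidate, v));
            }
        }
    }
    return strength;
}
```

In a graph with links 0-1 (0.9), 1-2 (0.4), 0-3 (0.5) and 3-2 (0.6), the path through point 1 reaches point 2 with affinity 0.4 and the path through point 3 reaches it with affinity 0.5, so the strongest chain from seed 0 to point 2 has strength 0.5.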

As was mentioned before, in this work we will use the connectedness maps, the output of the fuzzy segmentation algorithm as an input for one of the repairing algorithms. The idea is to maximize the total sum of connectedness values of the well-composed 3D images, i.e., to take into account the connectedness values of the voxels when deciding which voxels will change pertinence.

3 Repairing Algorithm

In this chapter, the algorithm for repairing 3D images into well-composed images will be described. In Section 3.1 the sequential version presented in (SIQUEIRA et al., 2008) will be explained, while Section 3.2 presents a parallel version implemented on CUDA and Section 3.3 proposes the heuristics used to choose which points to change.

3.1 Sequential Version

The sequential algorithm presented in (SIQUEIRA et al., 2008) was first described for binary images and then, still in the same article, it was extended for non-binary images. Following the same approach this section presents first the version for binary images and then the version for non-binary images in sections 3.1.1 and 3.1.2, respectively.

3.1.1 Binary Images

The binary version first creates a list of the points that are instances of critical configurations, by visiting each point and inserting it in the list if a critical configuration is found. By convention, the point p = (x, y, z) inserted in the list is always the same corner of the cube {x, x + 1} × {y, y + 1} × {z, z + 1} that contains the points of the critical configuration, with p itself belonging to the critical configuration. Algorithm 2 shows how to iterate over the image to find all the critical configurations and insert them in a list.

The algorithm then iterates over the list of critical configurations until the image is well-composed, applying three steps to each of them, as illustrated by Figure 4:

1. Find any point that, when changed to foreground, will eliminate the critical configuration;

2. Change that point from background to foreground;

3. Insert into the list of critical configurations all the critical configurations, if any, generated by changing the point found in the first step.

Figure 4: Repairing a critical configuration. The black point represents the point currently being visited, the red point represents a critical configuration that was generated, and the dashed cube represents the point to be changed from background to foreground. (a) Repairing a shared edge critical configuration. (b) Repairing a shared vertex critical configuration.

To better illustrate how to implement this algorithm, Algorithm 3 shows how to eliminate the critical configurations at a point by calling Algorithm 4, which executes these three steps for a critical configuration generated by a shared edge. The implementation for a shared vertex is very similar, but it is necessary to check more candidate points to insert into the foreground.

The problem with this algorithm is that changing a point may create new critical configurations. However, when working with finite images it is guaranteed that the algorithm eventually returns a well-composed image: in the worst case, the points added to the image fill the whole grid, forming a cube, which is well-composed because no critical configuration occurs in it. This guarantee is the reason why points must be changed exclusively from background to foreground or, conversely, exclusively from foreground to background.

It is important to make three remarks about this algorithm. First, finding all the critical configurations to eliminate may be done in O(N), where N is the number of points in the image. To verify whether a point is in a critical configuration it is only necessary to inspect the neighborhood of the point, which may be done in O(1); since there are N points and each must be visited at least once to find all the critical configurations, the total cost is O(N).


Algorithm 2 Finding all critical configurations

/* Inputs:
 * image - an array containing the image in row-major order
 * critical_configurations - the list of critical configurations
 * size - the sizes of the image
 */
void find_critical_configurations(int * image, list<int3> & critical_configurations,
                                  int3 size){
    for(unsigned z = 0; z < size.z; z++){
        for(unsigned y = 0; y < size.y; y++){
            for(unsigned x = 0; x < size.x; x++){
                //the position of the point in the array
                int pos = x + y * size.x + z * size.x * size.y;

                //the values of the points in the critical configuration
                //pn is the value in image of the point
                //(x+((n-1)%2), y+(((n-1)/2)%2), z+(((n-1)/4)%2))
                int p1 = image[pos];
                int p2 = (x < (size.x - 1)) ? image[pos + 1] : 0;
                ...

                //if there is a critical configuration insert it in the list
                //is_critical_configuration returns true if the points are
                //a critical configuration by sharing an edge
                if (is_critical_configuration(p1,p2,p3,p4)) {
                    critical_configurations.push_back(make_int3(x,y,z));
                }
                else if (is_critical_configuration(p1,p2,p5,p6)) {
                    critical_configurations.push_back(make_int3(x,y,z));
                }
                else if (is_critical_configuration(p1,p3,p5,p7)) {
                    critical_configurations.push_back(make_int3(x,y,z));
                }
                //the function is overloaded to also look for critical configurations
                //created by points sharing a vertex
                else if (is_critical_configuration(p1,p2,p3,p4,p5,p6,p7,p8)) {
                    critical_configurations.push_back(make_int3(x,y,z));
                }
            }
        }
    }
}


Algorithm 3 Eliminating critical configurations in a point.

void eliminate_critical_configuration(int * image, list<int3> & critical_configurations,
                                      int3 size){
    //the points in the critical configuration
    int3 p1 = critical_configurations.front();
    int3 p2 = make_int3(p1.x+1, p1.y, p1.z);
    ...

    //is_critical_configuration is overloaded to also receive the points
    //each call to random_change receives the two diagonals of the face
    if (is_critical_configuration(image,size,p1,p2,p3,p4)) {
        random_change(image,size,p1,p4,p2,p3,critical_configurations);
    }
    if (is_critical_configuration(image,size,p1,p2,p5,p6)) {
        random_change(image,size,p1,p6,p2,p5,critical_configurations);
    }
    //third face through p1, with the same points as in Algorithm 2
    if (is_critical_configuration(image,size,p1,p3,p5,p7)) {
        random_change(image,size,p1,p7,p3,p5,critical_configurations);
    }
    if (is_critical_configuration(image,size,p1,p2,p3,p4,p5,p6,p7,p8)) {
        random_change(image,size,p1,p2,p3,p4,p5,p6,p7,p8,critical_configurations);
    }
}


Algorithm 4 Changing a point.

void random_change(int * image, int3 size, int3 p1, int3 p2, int3 p3, int3 p4,
                   list<int3> & critical_configurations){
    // p is the point to be changed and pos its position in image
    int3 p; int pos;
    //randomly chooses one of the 2 points in the background
    if(rand()%2 == 0){
        pos = p1.x+(p1.y+p1.z*size.y)*size.x; p = p1;
        //if p1 is in the foreground then changes p3
        if(image[pos] == 1){
            pos = p3.x+(p3.y+p3.z*size.y)*size.x; p = p3;
        }
    }else{
        pos = p2.x+(p2.y+p2.z*size.y)*size.x; p = p2;
        //if p2 is in the foreground then changes p4
        if(image[pos] == 1){
            pos = p4.x+(p4.y+p4.z*size.y)*size.x; p = p4;
        }
    }
    //changes the chosen point to the foreground
    image[pos] = 1;

    //visits the blocks of 2x2x2 points that contain the changed point
    for (int k = -1; k < 1; k++) {
        for (int j = -1; j < 1; j++) {
            for (int i = -1; i < 1; i++) {
                int3 temp = make_int3(p.x + i, p.y + j, p.z+k);
                //i1..i8 are the values of the points of the block starting at temp
                int i1 = image[pos + i + (j + k * size.y) * size.x];
                ...
                int i8 = image[pos + i + 1 + (j + 1 + (k + 1) * size.y) * size.x];
                //checks the critical configurations involving the point changed
                //and inserts them in the list of critical configurations
                if (j == 0 && is_critical_configuration(i1, i2, i3, i4)) {
                    critical_configurations.push_back(temp);
                }
                ...
                else if(is_critical_configuration(i1, i2, i3, i4, i5, i6, i7, i8)){
                    critical_configurations.push_back(temp);
                }
            }
        }
    }
}

Second, it is not necessary to search the whole image again after removing all the critical configurations, because a critical configuration generated by changing a point must be in the neighborhood of that point. Therefore a list may be used to keep track of the critical configurations generated by changing a point, and these critical configurations are then eliminated by repeating the process until removing the existing ones generates no new critical configurations.

The third remark follows from the second: when a critical configuration is eliminated by changing a point, more than one critical configuration may be eliminated at different points in the image. This does not affect the correctness of the algorithm, as it is possible to simply move on to the next critical configuration.

From these three remarks it follows that the algorithm is linear in the number of points, because each point is visited at most twice: once to verify whether a critical configuration occurs in its neighborhood and once to eliminate the critical configuration, if it exists. Both verifying and eliminating a critical configuration can be done in O(1): the former by the first remark, and the latter because it is only necessary to randomly choose a neighbor such that changing it to foreground (or to background) eliminates the critical configuration.
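The whole sequential procedure (find, change, re-check only the neighborhood) can be sketched in a compact form. This is a minimal illustration under stated assumptions, not the code of Algorithms 2 to 4: `Volume`, `Point`, `corner`, `point_to_change` and `repair` are hypothetical names, the point to change is chosen deterministically rather than at random, and points are only ever added to the foreground so that termination is guaranteed.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical minimal binary volume; reads outside the grid return background.
struct Volume {
    int nx, ny, nz;
    std::vector<int> v;  // row-major: x varies fastest, then y, then z
    Volume(int nx_, int ny_, int nz_) : nx(nx_), ny(ny_), nz(nz_), v(nx_ * ny_ * nz_, 0) {}
    bool inside(int x, int y, int z) const {
        return x >= 0 && y >= 0 && z >= 0 && x < nx && y < ny && z < nz;
    }
    int get(int x, int y, int z) const { return inside(x, y, z) ? v[x + (y + z * ny) * nx] : 0; }
    void set(int x, int y, int z) { assert(inside(x, y, z)); v[x + (y + z * ny) * nx] = 1; }
};

struct Point { int x, y, z; };

// Corner n (0..7) of the 2x2x2 block with origin o; corner n is p(n+1) in the text.
static Point corner(Point o, int n) { return {o.x + n % 2, o.y + (n / 2) % 2, o.z + n / 4}; }

// Returns the corner index of a background point whose change to foreground
// breaks a critical configuration anchored at o, or -1 if there is none.
static int point_to_change(const Volume& img, Point o) {
    int c[8];
    for (int n = 0; n < 8; ++n) {
        Point p = corner(o, n);
        c[n] = img.get(p.x, p.y, p.z) != 0;
    }
    // The three faces through the origin; entries 0/3 and 1/2 are the diagonals.
    static const int faces[3][4] = {{0, 1, 2, 3}, {0, 1, 4, 5}, {0, 2, 4, 6}};
    for (const auto& f : faces) {
        if (c[f[0]] && c[f[3]] && !c[f[1]] && !c[f[2]]) return f[1];
        if (c[f[1]] && c[f[2]] && !c[f[0]] && !c[f[3]]) return f[0];
    }
    int sum = 0;
    for (int n = 0; n < 8; ++n) sum += c[n];
    for (int n = 0; n < 4; ++n) {  // corners n and 7 - n are diagonally opposite
        if (sum == 2 && c[n] && c[7 - n]) return n ^ 1;  // a background neighbour of corner n
        if (sum == 6 && !c[n] && !c[7 - n]) return n;
    }
    return -1;
}

// Repairs img in place by adding foreground points; returns the number of changes.
int repair(Volume& img) {
    std::deque<Point> work;
    for (int z = 0; z < img.nz; ++z)
        for (int y = 0; y < img.ny; ++y)
            for (int x = 0; x < img.nx; ++x)
                if (point_to_change(img, {x, y, z}) >= 0) work.push_back({x, y, z});
    int changes = 0;
    while (!work.empty()) {
        Point o = work.front();
        work.pop_front();
        int n = point_to_change(img, o);
        if (n < 0) continue;  // already eliminated by an earlier change
        Point p = corner(o, n);
        img.set(p.x, p.y, p.z);
        ++changes;
        // Only blocks containing p can hold new critical configurations;
        // this also re-queues o, which may still hold another one.
        for (int m = 0; m < 8; ++m) {
            Point q = {p.x - m % 2, p.y - (m / 2) % 2, p.z - m / 4};
            if (img.inside(q.x, q.y, q.z)) work.push_back(q);
        }
    }
    return changes;
}
```

For a 2 × 2 × 2 image with only two opposite corners in the foreground, this sketch needs two changes: the first turns the shared-vertex configuration into a shared-edge one, and the second removes it, which matches the observation that the shared-vertex case requires more points to be inserted.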

3.1.2 Non-binary images

The technique for non-binary images consists of applying the binary version to an object and its complement, and then successively applying it to the union of the objects already repaired together with one more object from the complement, until all the objects are well-composed. However, some changes are needed for this process to work. First, the critical configurations being eliminated are not those of the union, but only those of the object that is not yet well-composed. The points to be changed, however, are still taken from the complement of the union, not only from the complement of the object, because removing a point from an already repaired object may generate critical configurations in the objects that are already well-composed.

Lastly, there are situations in which a critical configuration exists but there is no point to change in the complement of the union. When that happens, a point from the object is chosen instead of a point from the complement. Then, if no critical configuration is generated in a well-composed object, the algorithm continues from the same object; otherwise, it continues from the object in which the critical configuration was generated.

It is worth noting that this algorithm has a different complexity: in the worst case, for an image with k objects, a point would have to be changed k − 1 times, from the first to the last object, for all the N points, giving a complexity of O(kN).

3.2 Parallel Version

This section presents a parallel version of the sequential algorithm, implemented in CUDA. As in the previous section, a version for binary images is introduced first, together with 3 techniques to repair images, followed by a version for non-binary images and, lastly, a version for fuzzy images along with the definition of well-composed fuzzy images.

3.2.1 Binary Images

As in the sequential algorithm for binary images, we first have to find all the critical configurations in order to create the list of critical configurations. This can be done by simply creating one thread per point and inserting the point in the list if a critical configuration occurs at that point.

However, inserting multiple points into a list at the same time means that several write operations are performed simultaneously, so some kind of data access control technique must be used. Furthermore, a linked list is not very efficient on a GPU, as it requires dynamically allocating more memory every time a new element is inserted, and GPUs are slow at allocating memory, especially global memory, where the list would have to be located so that all blocks could access it. Therefore an array implementation should be used.
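One common way to combine both requirements (an array instead of a linked list, and controlled concurrent insertion) is to reserve slots in a preallocated array with an atomic counter. The sketch below shows the idea in host C++ with `std::atomic`; in a CUDA kernel the same role would be played by `atomicAdd` on a counter in global memory. The class and its capacity policy are assumptions made for illustration, not the structure used in this work.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity list of point positions. Each insertion
// reserves a distinct slot with an atomic increment, so concurrent
// writers never write to the same element.
class CriticalConfigurationList {
public:
    explicit CriticalConfigurationList(std::size_t capacity)
        : items_(capacity), count_(0) {}

    // May be called from several threads at once.
    // Returns false when the preallocated array is already full.
    bool insert(int position) {
        std::size_t slot = count_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= items_.size()) return false;
        items_[slot] = position;
        return true;
    }

    // Number of stored elements (valid once all writers have finished).
    std::size_t size() const {
        std::size_t n = count_.load();
        return n < items_.size() ? n : items_.size();
    }

    int operator[](std::size_t i) const { return items_[i]; }

private:
    std::vector<int> items_;
    std::atomic<std::size_t> count_;
};
```

Because the capacity is fixed up front, no memory is allocated during insertion; an image with N points never needs more than N slots if each point is inserted at most once.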

Having created the list of critical configurations, let us analyze the 3 steps to be applied to each critical configuration. The first consists of reading 8 points to verify whether changing any of them would eliminate the critical configuration, and counting the number of critical configurations generated by the change. The second chooses a point and changes it, based on the critical configurations generated by this change. Lastly, the points are inserted in the list of critical configurations.

The first two steps are problematic, because the neighborhood of a point being visited by a thread may change while another thread is executing the second step. Thus it is not possible to guarantee that the chosen point will not generate critical configurations, nor to know which critical configurations will be generated.

However, this is a local problem: only threads visiting sufficiently close points may conflict due to memory access, as illustrated in Figure 5.


Observing the figure, it is clear that if two threads are visiting points with at least two points between them along some axis, then they will not read and write a common point. Knowing that, two techniques were developed to repair an image in parallel.

3.2.1.1 Finding critical configurations

Before explaining how to repair an image, this section presents a step common to all the repairing algorithms: how to find, in parallel, the critical configurations initially present in the image.

To find the critical configurations, a kernel is launched in blocks of size x × y × z. Each thread of a block visits the point given by its index and copies the value of that point from global memory to a block of size (x + 1) × (y + 1) × (z + 1) in shared memory. The shared block is slightly bigger because, to verify whether a critical configuration occurs at a point p = (p1, p2, p3), all the points in {p1, p1 + 1} × {p2, p2 + 1} × {p3, p3 + 1} must be read. The points at the end of the block therefore use values that are not visited by any thread of the block, and these must also be allocated and copied. These remaining points are filled by the thread visiting the closest point. After that, a block-level barrier is issued by calling the CUDA function __syncthreads() to make sure all points have been copied from global memory to shared memory.
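The index arithmetic of the enlarged shared-memory block can be checked on the host with a small sketch. The names `Dim3`, `shared_bytes` and `copy_index` are hypothetical; the layout (x varies fastest) and the corner numbering pn follow the convention used in the listings of this chapter, which is an assumption for the corners elided there.

```cpp
#include <cstddef>

// Stand-in for the CUDA block dimensions and thread indices.
struct Dim3 { int x, y, z; };

// Bytes of dynamic shared memory needed for the (x+1) x (y+1) x (z+1) copy,
// i.e. the value that would be passed as the third kernel launch parameter.
std::size_t shared_bytes(Dim3 block) {
    return static_cast<std::size_t>(block.x + 1) * (block.y + 1) * (block.z + 1) * sizeof(int);
}

// Position inside the shared copy of corner pn (n = 1..8) of the 2x2x2
// neighbourhood anchored at the point visited by the given thread.
int copy_index(Dim3 block, Dim3 thread, int n) {
    int copy_x = block.x + 1;
    int copy_y = block.y + 1;
    int dx = (n - 1) % 2;
    int dy = ((n - 1) / 2) % 2;
    int dz = ((n - 1) / 4) % 2;
    return (thread.x + dx) + (thread.y + dy) * copy_x + (thread.z + dz) * copy_x * copy_y;
}
```

For an 8 × 8 × 8 block the copy holds 9 × 9 × 9 = 729 values, and the last thread's corner p8 maps to index 728, the final element of the copy, so the extra border is exactly what the neighborhood test needs.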

After copying the data from global memory, the threads use the block of data in shared memory to detect critical configurations and insert those detected into an auxiliary structure used to keep track of all the critical configurations. This structure may change depending on the algorithm.

In this way all the critical configurations are stored in the auxiliary structure without race conditions on the image, as no thread writes to the image. Only the auxiliary structure is written, and inserting a new critical configuration into it may involve some kind of race condition.

(a) Two threads accessing neighbor points in parallel

(b) Two threads accessing points with a distance of 2 in parallel

(c) Two threads accessing points with a distance of 3 in parallel

Figure 5: Illustration of threads accessing points close to each other. The black points represent points being visited by a thread, green represents a possible read access, blue a possible read and write access, yellow two possible read accesses by different threads, orange a read and write conflict, red a write and write conflict, and gray means that no thread accesses that point.

Algorithm 5 Finding all critical configurations in parallel.

__global__ void find_critical_configurations_parallel(int * image,
        auxiliary_structure critical_configurations, int3 image_size) {
    //image_copy stores a copy of a block of image in shared memory
    //extern declares it as dynamically sized shared memory, whose size
    //is given when the kernel is launched
    extern __shared__ int image_copy[];
    int3 copy_size = make_int3(blockDim.x + 1,
                               blockDim.y + 1,
                               blockDim.z + 1);

    //the coordinates in the original image
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
    int z = (blockIdx.z * blockDim.z) + threadIdx.z;
    //the position of the point in the image array
    int p = x + y * image_size.x + z * image_size.x * image_size.y;

    //p1 is the position of the point being visited by the thread in image_copy
    int p1 = threadIdx.x + threadIdx.y * copy_size.x + threadIdx.z * copy_size.y * copy_size.x;
    ...
    int p8 = p1 + 1 + copy_size.x + copy_size.y * copy_size.x;

    //here is where we store the value of the image in the copy
    if (x < image_size.x && y < image_size.y && z < image_size.z) {
        bool in_image_x = x < image_size.x-1;
        ...
        bool on_border_copy_z = threadIdx.z + 1 == blockDim.z;
        image_copy[p1] = image[p];
        if(on_border_copy_x){
            //p2 is the neighbor in x, so only the x bound matters
            image_copy[p2] = (in_image_x) ? image[p+1] : 0;
        }
        ...
        if(on_border_copy_x && on_border_copy_y && on_border_copy_z){
            image_copy[p8] = (in_image_x && in_image_y && in_image_z) ?
                image[p+1+image_size.x+image_size.x*image_size.y] : 0;
        }
    }else{
        //we might have writing conflicts, however we are storing
        //the same value so we can let it happen without worrying
        //about errors
        image_copy[p1] = 0;
        ...
        image_copy[p8] = 0;
    }

    __syncthreads();

    if(x >= image_size.x || y >= image_size.y || z >= image_size.z){
        return;
    }

    if (is_critical_configuration(image_copy[p1], image_copy[p2], image_copy[p3],
                                  image_copy[p4])) {
        critical_configurations.insert(p);
    }
}

3.2.1.2 Thread per point

Now that we know how to find the critical configurations in parallel, we will present the first technique to repair images.

The first technique is to launch a kernel whose threads visit points that are two points apart from each other along each axis, as illustrated in Figure 6. All threads in all blocks are then synchronized by waiting for the kernel execution to end, which is done by calling the CUDA function cudaDeviceSynchronize(). A new kernel is then launched, this time with the threads visiting points that have not been visited yet. This is done 27 times so that every point is visited, and the whole process is repeated until the image becomes well-composed, as shown in Algorithm 6.
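The reason 27 launches are enough can be checked with a short host-side sketch: each launch uses one of the 3 × 3 × 3 offsets and visits every third point along each axis. The offset formula is the one used in Algorithm 6; the helper names `Offset`, `pass_offset` and `visit_counts` are hypothetical.

```cpp
#include <vector>

struct Offset { int x, y, z; };

// Offset of the i-th launch (0 <= i < 27), as computed in Algorithm 6.
Offset pass_offset(int i) { return {i / 9, (i / 3) % 3, i % 3}; }

// Simulates the 27 launches on an nx x ny x nz image and returns how many
// times each point is visited. Within one launch the visited points are
// 3 apart on every axis, so two points lie between neighbouring threads.
std::vector<int> visit_counts(int nx, int ny, int nz) {
    std::vector<int> counts(static_cast<std::size_t>(nx) * ny * nz, 0);
    for (int i = 0; i < 27; ++i) {
        Offset o = pass_offset(i);
        for (int z = o.z; z < nz; z += 3)
            for (int y = o.y; y < ny; y += 3)
                for (int x = o.x; x < nx; x += 3)
                    ++counts[x + (y + z * ny) * nx];
    }
    return counts;
}
```

Every point (x, y, z) is visited exactly once per round, in the launch whose offset is (x mod 3, y mod 3, z mod 3), regardless of whether the image sizes are multiples of 3.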

However, checking whether a critical configuration exists every time a point is visited is not necessary: if a point did not have a critical configuration and none of its neighbors was changed, then no new critical configuration can appear at that point. Thus, to speed up this process, a 3D matrix is used as the auxiliary structure to store the critical configurations at each point.

The matrix stores bitmasks using a nibble of a byte, where each bit corresponds to one of the 4 possible critical configurations being eliminated. If all the bits of the bitmask are 0, the point is skipped. Otherwise, the critical configuration corresponding to each set bit is eliminated in the same way as in Algorithm 3, but without having to determine its type, which is already recorded in the matrix.
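A minimal sketch of such a bitmask is given below. Bits 1 and 2 follow the tests `type & 1` and `type & 2` of Algorithm 6; the meaning assigned to bits 4 and 8, as well as all the names, are assumptions made for illustration.

```cpp
#include <cstdint>

// One bit per kind of critical configuration anchored at a point.
enum CriticalBit : std::uint8_t {
    CC_FACE_XY = 1,  // face p1, p2, p3, p4
    CC_FACE_XZ = 2,  // face p1, p2, p5, p6
    CC_FACE_YZ = 4,  // face p1, p3, p5, p7 (assumed bit)
    CC_VERTEX  = 8   // block p1..p8 sharing a vertex (assumed bit)
};

// Builds the nibble stored in the matrix for one point.
std::uint8_t make_mask(bool xy, bool xz, bool yz, bool vertex) {
    return static_cast<std::uint8_t>((xy ? CC_FACE_XY : 0) | (xz ? CC_FACE_XZ : 0) |
                                     (yz ? CC_FACE_YZ : 0) | (vertex ? CC_VERTEX : 0));
}

// A point is skipped when none of the four bits is set.
bool needs_visit(std::uint8_t mask) { return (mask & 0x0F) != 0; }

// Clears one bit once the corresponding configuration has been eliminated.
std::uint8_t clear_bit(std::uint8_t mask, CriticalBit bit) {
    return static_cast<std::uint8_t>(mask & ~bit);
}
```

Since only four bits are used, two points could in principle share one byte, although storing one mask per byte keeps the indexing identical to that of the image.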

To initialize the matrix, it is necessary to first visit every point, test which critical configurations occur at the point and then store the corresponding bitmask. Once the matrix is created, the generated critical configurations are recorded by updating the corresponding point in the matrix instead of updating a list.

However, there are problems with this algorithm, especially warp divergence and idle threads. The former follows directly from the fact that each point may have different critical configurations, so threads may have to deal with different critical configurations. In addition, testing for generated critical configurations requires 20 tests; each executes only a small amount of code, but they still cause warp divergence. The latter arises in images with a very small number of critical configurations, in which case it would probably be better to use a sequential implementation, as many threads would stay idle waiting for a very small number of threads to finish eliminating a critical configuration.

Algorithm 6 Kernel and kernel call of the thread per point version.

__device__ bool d_has_critical_configuration;

//the kernel is defined after main
__global__ void eliminate_critical_configurations_parallel(
        int * image, char * critical_configurations, int3 size, int3 offset);

int main() {
    ...
    /* data allocation and initialization */
    ...
    //has_critical_configuration is a host function that reads and returns
    //the value of d_has_critical_configuration, which stores true if a
    //critical configuration was inserted in the last iteration
    while(has_critical_configuration()){
        for (int i = 0; i < 27; i++) {
            offset.x = i / 9; offset.y = (i / 3) % 3; offset.z = i % 3;
            eliminate_critical_configurations_parallel<<<gridSize, blockSize>>>
                (d_img, d_critConfig, size, offset);
            cudaDeviceSynchronize();
        }
    }
    ...
    /* data deallocation */
    ...
}

__global__ void eliminate_critical_configurations_parallel(
        int * image,
        char * critical_configurations,
        int3 size, int3 offset){
    int x = (threadIdx.x+blockIdx.x*blockDim.x) * 3 + offset.x;
    int y = (threadIdx.y+blockIdx.y*blockDim.y) * 3 + offset.y;
    int z = (threadIdx.z+blockIdx.z*blockDim.z) * 3 + offset.z;
    //the points in the image
    int3 p1 = make_int3(x, y, z);
    ...

    //bitmask with the critical configurations stored for this point
    char type = critical_configurations[x + (y + z * size.y) * size.x];
    if(type == 0){
        return;
    }
    if (type & 1) {
        random_change(image,size,p1,p4,p2,p3,critical_configurations);
    }
    if (type & 2) {
        random_change(image,size,p1,p6,p2,p5,critical_configurations);
    }
    ...
}

Figure 6: Visitation order of points and synchronization calls (cudaDeviceSynchronize()) of point based algorithm.

There are also problems with memory access, as 64 memory locations are accessed to eliminate a single critical configuration.

Lastly, it is important to notice that this algorithm has a work complexity of O(N^2), where N is the number of points in the image: in the worst possible case, all the N points are visited and only one point is changed in each iteration, until all N points have been changed, filling the grid and generating a well-composed image. However, an image requiring N^2 operations would have to be very specific, with so few critical configurations per iteration that it would be better to use the sequential version.

3.2.1.3 Thread per block

The second technique is to launch 8 kernels, with each block eliminating the critical configurations of every point in a cube with side greater than or equal to 2. A gap is left between blocks so that no thread writes to points visited by another thread; then all blocks are synchronized and another point is visited. After all points are visited, another kernel is launched to visit the gaps in a similar manner, as illustrated in Figure 7.

Note that a matrix of bitmasks is also used, and that visiting two different points in a block still requires keeping the threads far enough apart to avoid conflicts. To do so, the strategy of the previous algorithm may be used: visiting equally spaced points, synchronizing the threads and then visiting another point until all points are visited, as shown in Algorithm 7.

(a) Synchronization between threads, using __syncthreads()

(b) Synchronization of blocks, using cudaDeviceSynchronize()

Figure 7: Visitation order of points and synchronization calls of cube based algorithm.

The advantage of this method over the previous one is that a thread visits more points, so the memory read for one point may be reused for another. This reduces the total number of reads, which may reach 4^3 if the memory accesses are not coalesced, as a thread can read up to 4^3 points.

Lastly, this algorithm has the same work complexity as the previous one. However, the multiplying constant could be reduced by a factor of 27 by tracking which blocks contain a critical configuration and visiting a block only if it contains one. This would require managing the threads launched in each iteration.


Algorithm 7 Kernel and kernel call of the thread per block version.

__device__ bool d_has_critical_configuration;

//the kernel is defined after main
__global__ void eliminate_critical_configuration_parallel_2(
        int * image, char * critical_configurations, int3 size, int3 offset);

int main() {
    ...
    /* data allocation and initialization */
    ...
    while(has_critical_configuration()){
        for(int i = 0; i < 8; i++){
            offset.x = i%2; offset.y = (i/2)%2; offset.z = i/4;
            eliminate_critical_configuration_parallel_2<<<gridSize_wc, blockSize>>>
                (d_img, d_critConfig, size, offset);
            cudaDeviceSynchronize();
        }
    }
    ...
    /* data deallocation */
    ...
}

__global__ void eliminate_critical_configuration_parallel_2(
        int * image, char * critical_configurations, int3 size, int3 offset){
    //first point of the 3x3x3 group of points visited by this thread
    int x0 = (((blockIdx.x * 2) + offset.x) * blockDim.x + threadIdx.x) * 3;
    int y0 = (((blockIdx.y * 2) + offset.y) * blockDim.y + threadIdx.y) * 3;
    int z0 = (((blockIdx.z * 2) + offset.z) * blockDim.z + threadIdx.z) * 3;

    for(unsigned i = 0; i < 27; i++){
        int x = x0 + (i%3);
        int y = y0 + ((i/3)%3);
        int z = z0 + (i/9);
        ...
        /* eliminate critical configurations */
        ...
        //all threads of the block move to the next point together
        __syncthreads();
    }
}


3.2.1.4 Further insight

It is worth noting that we could simply choose, in parallel, a point that eliminates the critical configuration and then verify whether critical configurations were generated by the change. One might think that this would still create problems when counting the generated critical configurations, because the point may still be changed, but this is not the case.

This is because the point that is visited later sees the correct neighborhood when inserting the critical configurations. It therefore detects that the point generates a critical configuration and correctly inserts all the critical configurations into the list of critical configurations.

False positives may occur, however, as critical configurations may be eliminated by the point inserted by the thread executed afterwards. It is also necessary to notice that, when choosing a point to change, the critical configuration may already have been eliminated by another thread, meaning that more changes than necessary may be made.

Another observation is that the performance of all these algorithms is much better than that of the sequential algorithm when many critical configurations exist, but it decreases as the number of critical configurations in the image gets lower. It is thus possible to devise a hybrid approach that switches to a sequential algorithm when the number of critical configurations is too low, in order to repair the image faster.

3.2.2 Non-binary images

To extend this algorithm to non-binary images, it is necessary to notice that the objects may be changed in any order, as long as a priority for choosing which object is changed is followed: points are only changed from an object of lower priority to an object of higher priority.

This increases the complexity of the algorithms by a factor equal to the number of objects in the image, similarly to the sequential version, giving a complexity of O(kN^2), where k is the number of objects and N the number of points in the image.

3.2.3 Extending to Fuzzy Images

Finally, to work with fuzzy images, the concept of well-composed images must be extended to fuzzy images. The most direct approach would be to choose the object with the highest pertinence at a point to represent that point in the image. However, when working with fuzzy images it is also possible to attribute two or more objects to a single point.

For this text we choose the second approach and say that:

Definition 11. A fuzzy image $I$ with $m$ objects is well-composed if and only if the binary image $I_n$ formed by the points with nonzero pertinence in the $n$-th object is well-composed, that is:

$$I_n(p) = \begin{cases} 1, & I(p)_n > 0, \\ 0, & \text{otherwise}, \end{cases}$$

is well-composed for $1 \le n \le m$.

According to this definition, the repairing algorithm is extended to fuzzy images by repairing the images In. However, the algorithm now changes the pertinence from I(p)n to 0 when removing p from the foreground and to a nonzero value when adding to the foreground.
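The binary images of Definition 11 can be obtained by thresholding the pertinences at zero, as in the following sketch. The storage layout (one vector of pertinences per point) and the names `FuzzyImage` and `binary_image_of_object` are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: image[p][n] is the pertinence of point p in object n.
using FuzzyImage = std::vector<std::vector<double>>;

// Builds the binary image of the n-th object: a point is foreground (1)
// when its pertinence in that object is nonzero, and background (0) otherwise.
std::vector<int> binary_image_of_object(const FuzzyImage& image, std::size_t n) {
    std::vector<int> binary;
    binary.reserve(image.size());
    for (const std::vector<double>& pertinences : image) {
        binary.push_back(pertinences[n] > 0.0 ? 1 : 0);
    }
    return binary;
}
```

A fuzzy image is then well-composed when each of these binary images is, so a binary repairing routine can in principle be applied to one object at a time.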

The value assigned when adding a point can be any value in (0, 1]. Possible strategies include setting it to 1, or to the minimum, maximum or mean of the nonzero pertinences in the critical configuration.
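These strategies can be sketched as a single function. The names `Strategy` and `new_pertinence` are hypothetical, pertinences are assumed to be stored as values in [0, 1], and falling back to 1 when the configuration has no nonzero pertinence is an assumption of this sketch.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

enum class Strategy { One, Min, Max, Mean };

// Returns the pertinence to assign to a point added to an object, given the
// pertinences (in that object) of the points of the critical configuration.
double new_pertinence(const std::vector<double>& configuration, Strategy strategy) {
    std::vector<double> nonzero;
    for (double value : configuration) {
        if (value > 0.0) nonzero.push_back(value);
    }
    // Assumed fallback: with no nonzero pertinence available, use 1.
    if (strategy == Strategy::One || nonzero.empty()) return 1.0;
    if (strategy == Strategy::Min) return *std::min_element(nonzero.begin(), nonzero.end());
    if (strategy == Strategy::Max) return *std::max_element(nonzero.begin(), nonzero.end());
    return std::accumulate(nonzero.begin(), nonzero.end(), 0.0) /
           static_cast<double>(nonzero.size());
}
```

Whatever the strategy, the result is nonzero, so the point becomes foreground in the corresponding binary image of Definition 11.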

It should be noted that, after a removal, the point may no longer be in any object. In this case, a strategy similar to the one used when adding a point is applied to add the point to an object, by changing its pertinence in one or more objects. In the implemented algorithm, the objects chosen are all the objects containing a randomly selected neighbor in the eliminated critical configuration that does not generate the critical configuration we just eliminated; the objects of another point are selected if there is no such object.

3.3 Heuristics to choose points

We now discuss heuristics developed to generate an image as close as possible to the original input image.

There are several ways to define the closest image, and heuristics with different approaches to finding an optimal solution: minimizing the differences between the pertinences and maximizing the pertinences of the points, among others. However, a more general approach is to reduce the total number of changes made while repairing the image.


However it is important to first determine which parameters can be used to make a choice of which point to change to what object. There are for example:

1. the pertinence of a point;

2. whether to only add or remove points;

3. the number of critical configurations eliminated or generated.

The pertinence is not really important on its own, because choosing a point with higher pertinence does not necessarily mean that fewer changes will be made. It is expected, however, that the generated image is more similar to the segmented image if the points changed have lower pertinence in an object, as we have lower confidence that such points should be in that object. Therefore, the pertinence is used to decide which point to change when there is more than one candidate.

As for whether to add or remove points, this is mostly important to guarantee that the algorithm terminates. If there is an expected volume for an object, however, the changes may bring the volume of the object closer to the expected value, by only adding points if the current volume is lower than expected and only removing points if it is greater, which helps generate a more similar image. It is unlikely, though, that an expected volume is available. Furthermore, this parameter is only necessary to guarantee termination and does not seem to be related to the total number of changes, which we are trying to minimize. It may thus be a good idea to give this parameter a lower priority than another that seems to generate better results, while being careful to stop after too many iterations, as the algorithm may enter a loop.

Minimizing the number of critical configurations generated is the strategy used in the original sequential algorithm. It may not be the best strategy, however, as the critical configurations generated are close to each other, so changing a common neighbor may eliminate all of them. Still, it is important to verify whether no critical configuration is generated and to maximize the number of critical configurations eliminated, as this means that no more changes are necessary to eliminate those critical configurations.


4 Results

This chapter presents a comparison between the running times of the sequential and parallel algorithms, as well as the number of changes made when using different heuristics.

The first noteworthy result was the speed-up observed in finding the critical configurations. The parallel version was about 100 times faster than the sequential version when run on two images: I1, of size 256 × 256 × 140, created to have a high number of critical configurations, so that when segmented it has 884855 critical configurations; and a more natural image I2, of size 256 × 256 × 101, which when segmented has 152956 critical configurations. The speed-up was over 100 times on I1 and slightly over 90 times on I2, as shown in Table 1.

As for the repairing algorithm itself, the results were only slightly better in terms of speed: both parallel versions are faster than the sequential version in almost every case (P2 with the random heuristic on I2 is the exception), although only by a few times. The number of changes, on the other hand, varied considerably between algorithms depending on the heuristic used, with many more changes when running the algorithms with the random heuristic than with the others. Some surprising facts can be observed in Table 2 and Table 3, which were created by executing each algorithm on I1 and I2, respectively. The time was measured using CUDA events for the device code and the chrono library for the host code.

image   size              algorithm   time(ms)   speed up
I1      256 × 256 × 140   S           701.438    -
                          P           5.778      121.398
I2      256 × 256 × 101   S           390.452    -
                          P           4.222      92.480

Table 1: Time spent finding critical configurations for the sequential (S) and parallel (P) algorithms.

The algorithm column indicates which algorithm the row refers to: S for the serial algorithm, P1 for the thread per point parallel algorithm and P2 for the thread per block parallel algorithm. The choice column indicates which heuristic is used to choose the point to be changed: random chooses any point that solves the critical configuration; min CC chooses the point that generates the minimum number of critical configurations while obeying the termination condition of only adding points to the object; and min CC* ignores that condition to further reduce the number of critical configurations and thus the number of changes.

Note that all algorithms use the pertinence of a point in an object as the tie-breaker when choosing which point will be changed. The RMSE column is the average root mean squared error between the original image and the generated image, calculated from the differences between the pertinences of the points, and the changes column is the number of points that were changed. The values of the image range from 0 to 10000, as they were discretized, which explains the high RMSE values.
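For reference, the two measures can be computed as in the sketch below, for a single object whose discretized pertinences are stored as integers. This is an assumption-laden illustration: how the errors of several objects or several runs are averaged is not detailed here, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Root mean squared error between the pertinences of the original and the
// repaired image (both discretized, e.g. to the range 0..10000).
double rmse(const std::vector<int>& original, const std::vector<int>& repaired) {
    assert(original.size() == repaired.size() && !original.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        double difference = static_cast<double>(original[i]) - repaired[i];
        sum += difference * difference;
    }
    return std::sqrt(sum / static_cast<double>(original.size()));
}

// Number of points whose pertinence differs between the two images.
std::size_t count_changes(const std::vector<int>& original, const std::vector<int>& repaired) {
    assert(original.size() == repaired.size());
    std::size_t changes = 0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        if (original[i] != repaired[i]) ++changes;
    }
    return changes;
}
```

Because a single point changed from 0 to the maximum value contributes 10000^2 to the sum, even a moderate fraction of changed points yields an RMSE in the thousands.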

One would expect min CC* to be the fastest heuristic, provided it terminates, as it tries whenever possible not to generate new critical configurations. In Table 2, however, it lost (for P1) to the random heuristic, which one would expect to be the slowest.

A possible explanation is that, when inserting in parallel at random, we do not check which points are a good choice, which may decrease the thread divergence and the amount of work done to change a point, so that it is still faster even though it changes more than twice as many points. It should be noted, however, that the random heuristic changed about 10^6 more points than the min CC* heuristic and decreased the time by only a few hundredths of a second, which is not enough to justify using that heuristic.

Another surprising fact is that algorithm P2 was slower than P1. This may be because, although the memory is better used, there is an increase in the number of barrier calls, even if most of them are at block level, unlike in the P1 algorithm. A further surprising fact is that the parallel algorithms obtained very similar errors with the min CC and min CC* heuristics, while the sequential algorithm had an error more proportional to the number of changes. Finally, the sequential algorithm makes considerably fewer changes, except with the min CC* heuristic.


algorithm   choice    running time(s)   RMSE      changes
S           random    3.3116            3012.98   1065787
P1          random    0.9059            3051.71   1687587
P2          random    1.5561            3047.99   1684283
S           min CC    3.0637            2633.61   814123
P1          min CC    1.2835            2223.58   897929
P2          min CC    1.3267            2222.36   896963
S           min CC*   3.1512            2429.09   687713
P1          min CC*   0.9996            2349.45   687708
P2          min CC*   1.0586            2352.70   689682

Table 2: Average running time, error and number of changes for running the algorithm on I1.

algorithm choice running time(ms) RMSE changes

S random 659.972 1295.58 209062 P1 392.095 1894.72 320721 P2 707.388 1882.16 316877 S min CC 558.454 1086.98 145271 P1 379.970 1314.56 162038 P2 380.086 1310.8 161044 S min CC* 565.306 1039.09 120896 P1 290.023 1329.12 123377 P2 312.503 1329.49 123468

Table 3: Average running time, error and number of changes for running the algorithm on I2.
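The RMSE and changes columns of Tables 2 and 3 compare the repaired image with the original one. The sketch below shows one way such measures could be computed on the host. It is an illustration under stated assumptions, not the code used in the experiments: the names `RepairStats` and `compare_volumes` are hypothetical, voxels are assumed to be 16-bit unsigned values in a flattened array, and the error is averaged over all voxels, a normalization the text does not restate here.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container for the two quality measures reported in the tables.
struct RepairStats {
    double rmse;          // root-mean-square error over all voxels
    std::size_t changes;  // number of voxels whose value differs
};

// Compares the original and the repaired volume, both flattened to 1D.
// Assumption: voxel values are stored as 16-bit unsigned integers.
RepairStats compare_volumes(const std::vector<std::uint16_t>& original,
                            const std::vector<std::uint16_t>& repaired) {
    if (original.size() != repaired.size() || original.empty()) {
        throw std::invalid_argument("volumes must be non-empty and of equal size");
    }
    double sum_sq = 0.0;
    std::size_t changes = 0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double diff = static_cast<double>(repaired[i]) -
                            static_cast<double>(original[i]);
        if (diff != 0.0) {
            ++changes;  // this voxel was modified by the repair
        }
        sum_sq += diff * diff;
    }
    return {std::sqrt(sum_sq / static_cast<double>(original.size())), changes};
}
```

Averaging over all voxels means that many small changes and few large changes can give the same RMSE, which is why the two columns are reported side by side.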


5 Final Remarks

In this work, two parallel versions of a repairing algorithm were created and extended to fuzzy images, making the algorithm a few times faster. A parallel algorithm to build the auxiliary structure that stores the critical configurations was also developed, with much better performance than the original.

Surprisingly, the random heuristic proved to be faster than the others, though at the cost of a large increase in the number of changes. Although the second algorithm (P2) made better use of memory, it used more barriers, which slightly reduced its speed.

5.1 Conclusion

Our approach to speeding up the repairing algorithm did not yield large improvements. However, the algorithm developed to identify critical configurations was much faster than the original, the repairing algorithm still became a few times faster, and the heuristics showed that the resulting image may be more similar to the original if a good enough heuristic is used.

5.2 Future Works

There is still room for improvement in the algorithms presented, as well as for further studies on heuristics that may improve both the speed and the quality of the solution found. The following topics may be studied in future work:

• Rearranging memory to make better use of coalescing;
• Making better use of shared memory when repairing;
• Increasing occupancy;
• Reducing thread divergence by launching threads to eliminate the same critical configuration;
• Exploring hybrid algorithms that use serial and parallel code;
• Using cooperative thread groups to synchronize all threads.


References

BOUTRY, N.; GÉRAUD, T.; NAJMAN, L. How to make nD images well-composed without interpolation. In: IEEE. Image Processing (ICIP), 2015 IEEE International Conference on. [S.l.], 2015. p. 2149–2153.

CARVALHO, B. M. Cone-beam helical CT virtual endoscopy: reconstruction, segmentation and automatic navigation. Thesis (PhD), University of Pennsylvania, 2003.

GÉRAUD, T.; CARLINET, E.; CROZET, S. Self-duality and digital topology: links between the morphological tree of shapes and well-composed gray-level images. In: SPRINGER. International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing. [S.l.], 2015. p. 573–584.

HERMAN, G. T.; CARVALHO, B. M. Multiseeded segmentation using fuzzy connectedness. IEEE Trans. Pattern Anal. Mach. Intel., IEEE Computer Society, Washington, DC, USA, v. 23, n. 5, p. 460–474, 2001. ISSN 0162-8828.

LATECKI, L.; ECKHARDT, U.; ROSENFELD, A. Well-composed sets. Computer Vision and Image Understanding, Elsevier, v. 61, n. 1, p. 70–83, 1995.

LATECKI, L. J. 3d well-composed pictures. Graphical Models and Image Processing, Elsevier, v. 59, n. 3, p. 164–172, 1997.

LATECKI, L. J. Discrete representation of spatial objects in computer vision. [S.l.]: Springer Science & Business Media, 1998.

LATECKI, L. J. Well-composed sets. Advances in Imaging and Electron Physics, Elsevier, v. 112, p. 95–163, 2000.

MA, C. M.; SONKA, M. A fully parallel 3d thinning algorithm and its applications. Computer vision and image understanding, Elsevier, v. 64, n. 3, p. 420–433, 1996.

MARCHADIER, J.; ARQUÈS, D.; MICHELIN, S. Thinning grayscale well-composed images: A new approach for topological coherent image segmentation. In: SPRINGER. Discrete Geometry for Computer Imagery. [S.l.], 2002. p. 360–371.

ROSENFELD, A. Fuzzy digital topology. Inform. and Control, v. 40, n. 1, p. 76–87, 1979.

ROSENFELD, A.; KONG, T. Y.; NAKAMURA, A. Topology-preserving deformations of two-valued digital pictures. Graphical Models and Image Processing, Elsevier, v. 60, n. 1, p. 24–34, 1998.


SIQUEIRA, M.; LATECKI, L. J.; GALLIER, J. Making 3d binary digital images well-composed. In: INTERNATIONAL SOCIETY FOR OPTICS AND PHOTONICS. Vision Geometry XIII. [S.l.], 2005. v. 5675, p. 150–164.

SIQUEIRA, M. et al. Topological repairing of 3d digital images. Journal of Mathematical Imaging and Vision, Springer, v. 30, n. 3, p. 249–274, 2008.

STELLDINGER, P.; LATECKI, L. J. 3d object digitization: Majority interpolation and marching cubes. In: IEEE. Pattern Recognition, 2006. ICPR 2006. 18th International Conference on. [S.l.], 2006. v. 2, p. 1173–1176.

TUSTISON, N. J. et al. Topological well-composedness and glamorous glue: A digital gluing algorithm for topologically constrained front propagation. IEEE Transactions on Image Processing, IEEE, v. 20, n. 6, p. 1756–1761, 2011.

UDUPA, J. K.; SAMARASEKERA, S. Fuzzy connectedness and object definition: theory, algorithms, and applications in image segmentation. Graphical models and image processing, Elsevier, v. 58, n. 3, p. 246–261, 1996.

XIE, W.; THOMPSON, R. P.; PERUCCHIO, R. A topology-preserving parallel 3d thinning algorithm for extracting the curve skeleton. Pattern Recognition, Elsevier, v. 36, n. 7, p. 1529–1544, 2003.
