Multiple tracklet matching under severe occlusions = Emparelhamento de múltiplas trajetórias em casos severos de oclusão


Karina Olga Maizman Bogdan

Multiple Tracklet Matching under Severe Occlusions

Emparelhamento de Múltiplas Trajetórias em Casos Severos de Oclusão

CAMPINAS

2016


Dissertação apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Mestra em Ciência da Computação.

Thesis presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Master in Computer Science.

Supervisor/Orientador: Prof. Dr. Siome Klein Goldenstein

This copy corresponds to the final version of the dissertation defended by Karina Olga Maizman Bogdan and supervised by Prof. Dr. Siome Klein Goldenstein.


Cataloging record

Universidade Estadual de Campinas

Biblioteca do Instituto de Matemática, Estatística e Computação Científica

Ana Regina Machado - CRB 8/5467

Maizman Bogdan, Karina Olga.

M288m Multiple tracklet matching under severe occlusions / Karina Olga Maizman Bogdan. - Campinas, SP: [s.n.], 2016.

Supervisor: Siome Klein Goldenstein.

Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação.

1. Automatic tracking. 2. Computer vision. 3. Image processing. I. Goldenstein, Siome Klein. II. Universidade Estadual de Campinas. Instituto de Computação. III. Title.

Information for the Digital Library

Title in another language: Emparelhamento de múltiplas trajetórias em casos severos de oclusão

Keywords: Automatic tracking; Computer vision; Image processing

Area of concentration: Computer Science

Degree: Master in Computer Science

Examination committee: Siome Klein Goldenstein [Supervisor]; Milton Shoiti Misuta; Ricardo da Silva Torres

Defense date: 21-06-2016

Graduate program: Computer Science


Examination committee:

• Prof. Dr. Siome Klein Goldenstein, Instituto de Computação - UNICAMP

• Prof. Dr. Milton Shoiti Misuta, Faculdade de Ciências Aplicadas - UNICAMP

• Prof. Dr. Ricardo da Silva Torres, Instituto de Computação - UNICAMP

The defense minutes, with the respective signatures of the committee members, are filed in the student's academic record.


I would like to thank Prof. Siome for his guidance and patience throughout my master's studies, and Erikson Morais for his essential help in the first steps of the project.

To my parents, Marina and Oscar, thank you for your support and encouragement at all times. Special thanks to my mother, who, without noticing, encouraged me to follow this path and is my inspiration: thank you, Mom. I also thank my brother Marcelo, Liza, Natasha, and my dear grandparents, Iocif and Galina, who, even from far away, constantly support me. Liza, Natasha, and Grandpa: thank you very much for being my family. Grandpa, I have finally finished my "War and Peace"! Thank you for your love and tenderness, and for never letting me forget who I am.

Atílio, thank you for all your love and affection, and for your encouragement at all times. To my friends Gisele and Lucilene, thank you for always being with me and for being so understanding. To my dear friends Daniel, Elias, Hilário, Priscila, and Tom (in alphabetical order, to avoid internal intrigue), thank you for your friendship and for our incredible sessions at the Clube do Bolo (yes, we have one).

To my colleagues Javier Vargas I and Ricardo, thank you for your friendship and for the many fun moments. To my colleagues at Recod: Alberto, Ramon, Diego, Jaime, Pablo, Javier Vargas II, Dani, Marina; and from elsewhere: Roberto, Lucas Ismaily, Henrique, Alice, Andrei, David; thank you all very much for your company and for making my daily life at Unicamp much more fun.

I also thank Francisca and Chris for their direct and indirect support and encouragement. And, of course, the wonderful people of Tae Kwon Do at Unicamp: thank you for all the incredible moments inside and outside the dojang. I must also thank Fábio Súnica and Gustavo Henrique for giving me the opportunity to be part of this team, and for the many unexpected lessons that will stay with me for life.

Finally, I would like to thank CAPES for the financial support, which was fundamental to carrying out this research.


Object tracking is an important area of Computer Vision and, despite the progress of several recently proposed methods, many challenges inherent to the problem remain unsolved. One of them is occlusion, which affects the correctness and continuity of the object's trajectory, both in single-object and in multiple-object tracking environments. In the latter scenario, tracking becomes more difficult, since the method must also maintain the correct identities of the objects. Because the vast majority of current methods are based on detections of the objects of interest, an occlusion event has a large impact on tracking. Occlusion obstructs the original appearance of the object, which is the main type of data used for association. The main consequence is the introduction of information gaps in tracking during the period in which the object was occluded, since the object may not be detected or incorrect detections may be obtained. This lack of information fragments the trajectories into tracklets in methods based on data association, that is, methods that perform consecutive associations over the set of object detections. Some classes of this event, such as full and long-term occlusions involving several objects, make the problem even more challenging. Moreover, in complex scenarios such as sports, tracking multiple objects means dealing with similar appearances and motions, in addition to the highly dynamic interaction among the objects that is a natural trait of the sports environment. To address this problem, we propose a trajectory (tracklet) association method modeled as a graph matching problem. Different representations and approaches are evaluated for the three main components of the proposed model: the representation of the trajectories in the graph, which uses the appearance models of the corresponding objects; the association cost, which determines the compatibility between two sets of trajectories belonging to two distinct objects; and the matching algorithm, which performs a global association of the trajectories. Unlike the state of the art, the proposed model is evaluated on a new dataset with multi-camera information that includes highly complex occlusion occurrences in a sports scenario. Encouraging results were obtained, considering that our tracking environment contains ambiguous data. In addition, different levels of contribution to the association were obtained with the different models for each component of our method, some of which proved to be important factors for the final trajectory association method.


Tracking is an important area of computer vision and, despite the progress made by several recently proposed trackers, many challenges remain unsolved. One of them is occlusion, which affects the correctness and continuity of the object's trajectory in both single and multiple object tracking environments. In the latter, tracking is more difficult, since the method must also maintain the object identities. Since most current methods rely on detection responses, an occlusion event has a large impact on tracking. The event obstructs the original appearance of the target, which is the principal data used for association. The major consequence is the introduction of information gaps in tracking during the period in which the object is occluded, since either the object cannot be detected or incorrect detection responses are obtained. This lack of information fragments the trajectories into tracklets in methods based on data association, which perform successive associations over the detection responses of the targets. Particular classes of the event, such as full and long-term occlusions involving several targets, make the problem even more challenging. Furthermore, in complex scenarios such as sports, tracking multiple objects involves dealing with similar appearances and motion, in addition to the highly dynamic nature of the environment. To address this problem, we propose a tracklet matching method formulated as a graph-based model. We evaluate different representations for the three main components of our model: the tracklet representation in our graph, which uses the appearance models of the corresponding targets; the association cost, which defines the affinity between two sets of tracklets corresponding to two different targets; and the matching algorithm, which performs a global association of the trajectories. Unlike the current literature, we evaluate our model on a new tracking dataset with a multi-camera setup that includes severe occlusion cases in a sports context. Encouraging results were obtained, considering that our tracking environment contains ambiguous data. Moreover, the different models evaluated for each component of our method contributed to the association to different degrees, and some of them proved to be important factors in the final data association model.


List of Figures

1.1 Examples of partial and full occlusion between two targets
1.2 Different viewpoints for an example of collision
1.3 Illustration of an occurrence of identity switch
1.4 Illustration of a rupture in the trajectory caused by information loss
1.5 Illustration of the sets of trajectories generated by a collision event
1.6 Graph formulation for the tracklets association problem
2.1 Overview of the principal methods in each category of tracking
3.1 Overview of our data association method
3.2 Color hexacone for the HSV representation
3.3 Example using the HSV color space
3.4 Part-based appearance model
3.5 Tracklet representation in G based on the average templates
3.6 Tracklet representation in G based on clusters
3.7 Similarity between vα and vβ considering the cluster-based model
3.8 An example of a weighted bipartite matching problem
3.9 The initial matching M for G_l and the creation of an alternating path
3.10 The feasible vertex labeling l′ for G and the equality subgraph
3.11 Extending the matching by finding an augmenting path in G_l
3.12 Perfect matching found and the final result in G
3.13 Example of the greedy algorithm for the matching problem of Figure 3.8b
4.1 Position of all four cameras in the tracking scenario
4.2 Examples of the viewpoints in Mode 1 (M1)
4.3 Examples of the viewpoints in Mode 2 (M2)
4.4 The division of the dataset: A and B, by their difficulty level
4.5 The division of the dataset: A and B, by their camera viewpoint mode
4.6 Expected distance values for different histogram dimensions
4.7 Examples of occlusion events with players of the same team
4.8 Examples of occlusion events with players of different teams
4.9 Matching for players: ID-2 and ID-3 of the Case 3 of the BraArg game


List of Tables

1.1 Properties of an occlusion event
2.1 Principal features of detection and tracking of the presented approaches
2.2 Principal categories of tracking of the presented approaches
2.3 Appearance models used in the presented approaches for object tracking
2.4 Motion models adopted in the presented approaches for object tracking
2.5 Evaluation framework of the presented approaches for object tracking
2.6 Current datasets for multiple object tracking
4.1 Principal features of the dataset
4.2 Collision dataset
4.3 Number of bins considered for the object representation
4.4 Results for the color representation
4.5 Parameters and properties for HOG calculation
4.6 Block configurations considered for HOG calculation
4.7 Results for the local shape representation
4.8 Results for the color + local shape representation
4.9 Overview of the results for the appearance model component
4.10 Results for the cluster-based approach using color features
4.11 Results for the cluster-based approach using local shape features
4.12 Results for the cluster-based approach using color + local shape features
4.13 Overview of the results for the cluster-based approach in the dataset A
4.14 Results for the average-based approach in the dataset B
4.15 Results for the cluster-based approach in the dataset B
A.1 Substitution events in our dataset
A.2 Other events not considered in the experiments
B.1 Results per case in the dataset A of the cluster-based approach


List of Symbols

$G[\Gamma_\alpha, \Gamma_\beta]$: Bipartite graph with parts $\Gamma_\alpha$ and $\Gamma_\beta$.
$K_\alpha$: Number of clusters for a trajectory of $\Gamma_\alpha$.
$K_\beta$: Number of clusters for a trajectory of $\Gamma_\beta$.
$T$: Trajectory (or tracklet).
$\Gamma_\alpha$: Set of trajectories up to $t_s$ (before the collision).
$\Gamma_\beta$: Set of trajectories obtained after $t_e$ (after the collision).
$\bar{x}_G$: Average accuracy (range [0,1]) using the Greedy algorithm.
$\bar{x}_H$: Average accuracy (range [0,1]) using the Hungarian algorithm.
$\sigma_G$: Standard deviation using the Greedy algorithm.
$\sigma_H$: Standard deviation using the Hungarian algorithm.
$d_{Batt}$: Bhattacharyya distance function.
$d_{L2}$: Euclidean distance function.
$d_{\chi^2}$: $\chi^2$ (Chi-square) distance function.
$n_\alpha$: Number of trajectories in $\Gamma_\alpha$.
$n_\beta$: Number of trajectories in $\Gamma_\beta$.
$r_{A,B}$: Proportion of the area of target A that is occluded by target B.
$t_e$: End (time step) of a collision event.
$t_s$: Start (time step) of a collision event.


Contents

1 Introduction
2 Object Tracking
2.1 Multiple Object Tracking
2.1.1 Bayesian Approaches
2.1.2 Data Association Approaches
2.1.3 Mid-Level Features-based Approaches
2.2 Occlusion Models
2.3 Summary
3 Data Association
3.1 Appearance Model
3.1.1 Color Representation
3.1.2 Local Shape Representation
3.2 Graph-based Model for Data Association
3.2.1 Tracklet Representation
3.2.2 Association Cost Between Tracklets
3.2.3 Matching Algorithms
4 Experimental Protocol and Results
4.1 Collision Dataset
4.1.1 Environment
4.1.2 Collision Features
4.1.3 Data Selection
4.2 Appearance Model
4.3 Tracklets Association
5 Conclusions
Bibliography
A Dataset: Substitutions and Other Events


Introduction

Tracking is the problem of identifying an object in each frame of a video sequence in order to infer its trajectory over a period of time. This problem naturally comprises two tasks: identification and association. First, the object has to be detected in each frame, which generates a new response; then, using its location and other types of information, this new response is associated with its corresponding trajectory. An object being tracked is usually referred to as the object of interest or target. A tracking system can be applied to a single object or to several objects, the latter commonly referred to as Multiple Object Tracking (MOT), which adds a challenge to tracking: maintaining the object identities.

Tracking is an important area of computer vision, as shown by the many different types of tracking applications, from the well-known robotics and surveillance scenarios to underwater [77] and biologically inspired scenarios [48]. Furthermore, tracking is an essential part of high-level tasks such as action recognition and behavior analysis, and plays an important role in the sports context, where it can aid tactical and technical analysis of the players [30, 31].

Despite the recent progress and numerous methods in the field, mostly in the pedestrian walking scenario (a sub-area of surveillance), many challenges remain open. Some examples include robust occlusion handling, abrupt motion, and appearance changes [32, 83]. In fact, tracking itself has many interfering factors [83], and most of them are related to specific scenarios that make the problem more difficult, for instance cluttered and crowded environments [24] and highly dynamic scenarios such as sports.

Occlusion is one of the principal factors that interfere with the tracking process. Most tracking methods rely on the object appearance, i.e., on a template of how an object can visually appear in an image. In this scenario, occlusion affects the correctness and continuity of the object's trajectory, since it obstructs the original appearance of the target. This event generally breaks the trajectory at the time when the object is occluded, requiring further acquisition and re-identification. This behavior is typical of detection-based methods, which are not suitable for scenarios that include occlusion for two principal reasons: they are generally trained with occlusion-free samples [76], or they are based on data association approaches that rely essentially on successive associations of the detection responses. Therefore, the detector's inability to deal with occlusions increases the rate of missed or incorrect detections, fragmenting the trajectories at these events.

Furthermore, particular classes of the event make an occlusion more difficult to handle. One of them is a special case of occlusion referred to as collision or confusion [30, 31]. This case is characterized by close interactions of the targets (almost sharing the same space), which induce full and long-term occlusions. Since the targets involved in the event are very close to each other, any viewpoint of the scene captures an occlusion, and this is the major difference from other occlusion types.

Although some contributions have already been proposed to handle occlusion in tracking [27, 82, 87], most of them consider partial occlusions in the pedestrian walking scenario. In the sports context, other challenges are naturally added to the problem, such as similar appearances, due to the players' uniforms, and highly dynamic interactions that include abrupt motion and similar directions of the targets. As for the consequences in tracking, the gap in detection responses grows with the difficulty of the occlusion, which increases the number of fragmented trajectories or leads to poor associations.

Problem Definition

In this work, we focus on the problem of collision in multiple object tracking, when multiple trajectories collapse or are lost (such as in a goal celebration). Our objective is to globally perform re-identification after the confusion scenario is over, matching trajectories before and after the gap due to these occlusion events.

In the following, we present a formal definition of this problem. Occlusion occurs when a target (object) A being tracked is, for some reason, visually overlapped by another target B. The proportion of the area of target A that is occluded by target B can take any value, from a very slight occlusion of A to the state where target A is no longer visible; either way, target A is occluded by B.

There is no definitive definition of the proportion classes of a target being occluded. As an example, Yang et al. [87] defined three degrees of occlusion for a target: slight, heavy, and fully occluded. Let $r_{A,B}$ be the proportion of the area of A that is occluded by target B. An occlusion is slight when $0.2 < r_{A,B} < 0.5$, heavy when $0.5 \le r_{A,B} < 0.8$, and the target is fully occluded when $r_{A,B} \ge 0.8$. Similarly, Tang et al. [76], in their occlusion-aware person detector, also used three occlusion levels determined by the ranges $0.05 < r_{A,B} < 0.25$, $0.25 < r_{A,B} < 0.55$, and $0.55 < r_{A,B} < 0.85$, the last for heavy occlusions. Nevertheless, most tracking methods do not follow a standard definition, and two basic terms are usually used for the degree of the event: partial and full occlusion. The first usually covers occurrences in which the occluded target has at least half of its total area still visible, whereas in a full occlusion the occluded target is no longer visible or its visible area is negligible. Figure 1.1 shows examples of both partial and full occlusions in our tracking environment.
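The thresholds quoted from Yang et al. [87] can be illustrated with a minimal sketch. It assumes, purely for illustration, that both targets are given as axis-aligned bounding boxes `(x1, y1, x2, y2)` and that B is the target in front, so the overlap area approximates the occluded area of A. The function names and the `"unlabeled"` value for $r_{A,B} \le 0.2$ are hypothetical, since the quoted ranges do not cover that case.

```python
def occlusion_ratio(box_a, box_b):
    """Fraction of box_a's area covered by box_b.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (an assumption of
    this sketch; the text does not prescribe a target representation).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    if area_a <= 0:
        raise ValueError("box_a must have positive area")
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    return (inter_w * inter_h) / area_a


def occlusion_degree(r):
    """Map the ratio r_{A,B} to the three degrees quoted from Yang et al. [87]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if r >= 0.8:
        return "full"
    if r >= 0.5:
        return "heavy"
    if r > 0.2:
        return "slight"
    # r <= 0.2 is outside the quoted ranges; this label is a placeholder.
    return "unlabeled"
```

For example, `occlusion_degree(occlusion_ratio((0, 0, 10, 10), (5, 0, 15, 10)))` falls on the lower bound of the heavy range, because half of A is covered.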

Figure 1.1: Examples of partial occlusion, (a) and (b), and full occlusion, (c), (d), and (e).

An occlusion occurrence is also characterized by another property: the elapsed time. Two categories are usually employed for this property: short and long-term occlusions. There is no standard (or minimum) value that defines each category regarding the elapsed time of the occurrence. However, some events can be placed within each category. Back to the example, recall that target A is being occluded by target B. Short occlusions are essentially occurrences in which both targets are involved in the occlusion for a small amount of time, typically when they cross each other's paths. For instance, consider a situation where target A is walking in one direction and target B is walking in the opposite direction, and at some moment B passes by A, causing an occlusion. The targets can also have similar directions; either way, the event lasts a few seconds, and in most cases the target is occluded for less than a second. On the other hand, long-term occlusions are occurrences where both targets remain in the occlusion situation for a significant amount of time. This property usually covers occlusions where the targets remain in a static position (A is being occluded by B and they remain in that position) or where the targets move in similar directions, extending the event (A and B are walking, and as they walk B keeps occluding A).

Another property is the type of the occlusion, which can be a simple occlusion or a collision of targets. Both are occlusion events where at least one target is being occluded, the latter being a special case. As an example, consider two targets, A and B, that are walking and, when they meet at a point of their trajectories, hug each other for a long period of time. The targets in this case collide with each other. A collision occurs when both targets reach the same space or very close positions, usually with both not properly visible because of the occlusion. The main difference is that in a simple occlusion the targets are in different locations but, from the camera viewpoint, they are observed as being in an occlusion event. If we take another camera with a different viewpoint of the scene, these targets will probably no longer be involved in an occlusion. In a collision, by contrast, the targets reach the same space or very close positions in the real plane of the scene (e.g., the court plane), and therefore any viewpoint of the scene will observe the same event: an occlusion. The most apparent target (or targets) changes with each viewpoint of the scene, but the occlusion still occurs.
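The distinction between a simple occlusion and a collision can be sketched as a test on court-plane positions. This is only an illustration under stated assumptions: the function name, the boolean `overlaps_in_view` flag, and the `collision_radius` threshold (here 1.0, in the same unit as the positions) are hypothetical and are not taken from the text.

```python
import math


def classify_occlusion(pos_a, pos_b, overlaps_in_view, collision_radius=1.0):
    """Label an event as 'none', 'simple', or 'collision'.

    pos_a, pos_b: (x, y) positions of the two targets on the court plane.
    overlaps_in_view: True if the targets overlap in the camera image.
    collision_radius: hypothetical distance below which the targets are
        considered to share (almost) the same space.
    """
    if not overlaps_in_view:
        return "none"
    # A collision is an occlusion whose targets are also close on the real
    # plane, so it would be observed from any viewpoint.
    distance = math.dist(pos_a, pos_b)
    return "collision" if distance <= collision_radius else "simple"
```

With this rule, two targets that overlap in one image but stand several units apart on the court are labeled as a simple occlusion, which another camera would likely not observe.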

The last property of an occlusion case is the number of targets involved. So far, cases with two targets were described, but these occurrences can involve more than two targets. Yang et al. [87] consider events with at most three targets in an occlusion cascading model. In this work, we consider collisions with four targets in a full (at least one target is fully occluded) and long-term occlusion event. In our tracking environment, these cases represent goal celebrations in a Futsal match. The majority of the dataset is composed of occurrences with four players involved, and some cases have five players in the principal collision event of the scene. Other occlusion occurrences may be active during the team celebration, but for data selection purposes we considered only this principal collision for the number-of-targets property. An example of the occlusion occurrence that we are interested in is illustrated in Figure 1.2, and an overview of the properties of an occlusion is presented in Table 1.1.

Figure 1.2: Different viewpoints for an example of collision with four targets: (a) viewpoint 1, (b) viewpoint 2, (c) viewpoint 3, (d) viewpoint 4.

Table 1.1: Properties of an occlusion event. The last row lists the properties of the events considered in this work.

                        Type                $r_{A,B}$        Elapsed time        Number of targets
Possible values         simple, collision   partial, full    short, long-term    2, ..., 4, ..., N
Considered in this work collision           full             long-term           4

Each property can make the occlusion more difficult to solve. An occlusion mostly generates a gap in detection-based methods, since one (or more) of the targets cannot be detected during the occlusion. Besides the inability to correctly detect the targets, when the method identifies the most apparent target, or the current visual image of the occlusion, the subsequent step of associating the new set of responses with the trajectories generally results in incorrect associations or identity switches. Since the targets involved in the event share appearance features and remain in a common region, the visual image of the current state of each target is similar and can therefore be incorrectly associated with any of them.

The associations where the tracking method incorrectly assigns new detection responses to the current target trajectories are also referred to as identity (ID) switches. The trajectories generated by a rupture in association are commonly referred to as tracklets. They are subsets of the correct (or complete) trajectory and are represented by the consecutive data of their corresponding time span, which can be defined as a set of models (e.g., image observations, court plane positions), one for each frame included in the tracklet. An identity switch and a rupture in association are illustrated in Figures 1.3 and 1.4, respectively. Therefore, occlusion usually introduces incorrect associations or ruptures of the trajectories, caused by large gaps in the detection responses that prevent their association. In [30, 31], the tracking method presents these two behaviors when dealing with collisions.

Figure 1.3: Illustration of an identity switch (each circle represents the detection response for the target at a particular time): (a) trajectories incorrectly associated with new responses at time t, (b) the correct trajectories for both targets A and B.

Figure 1.4: Illustration of a rupture in the trajectory caused by information loss: (a) complete (ground truth) trajectory, (b) tracklets generated by information loss in tracking.
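The way a gap in detection responses fragments a trajectory into tracklets can be sketched as follows. The sketch assumes that a target's detections are summarized by the frame indices in which it was detected and that a rupture occurs whenever consecutive detections are more than `max_gap` frames apart; both the function name and the `max_gap` parameter are hypothetical.

```python
def split_into_tracklets(frames, max_gap=1):
    """Split the frames in which a target was detected into tracklets.

    frames: iterable of integer frame indices with a detection response.
    max_gap: largest step between consecutive detections that still keeps
        them in the same tracklet (hypothetical parameter).
    Returns a list of tracklets, each a list of consecutive frame indices.
    """
    tracklets = []
    current = []
    for frame in sorted(frames):
        # A step larger than max_gap is treated as a rupture in association.
        if current and frame - current[-1] > max_gap:
            tracklets.append(current)
            current = []
        current.append(frame)
    if current:
        tracklets.append(current)
    return tracklets
```

For instance, detections at frames 1, 2, 3, 7, 8 and 12 yield three tracklets with the default `max_gap`, and a single one if gaps of up to four frames are tolerated.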

For collision cases with full and long-term properties, the probability of generating incorrect associations is even higher than in other occlusion cases. The collision makes the inference of the targets' positions less discriminative, since they are positioned close to each other and present similar motion; in addition, the targets present similar appearance models, which makes them harder to distinguish. An appearance model is the object representation related to its appearance features. In our tracking environment, the appearance model corresponds to the target's aspect, primarily the features of the uniform.

The incorrect associations introduced by the gaps in detection responses caused by collisions are the main problem explored in this work. Since this event usually causes missing or unreliable responses, both are treated in this work in the same way: as gaps in the sets of detection responses.

In a detection-based environment with a multi-camera view, gaps in tracking due to severe occlusions can be handled by a global data association method. This method is applied to the tracklets generated from the detection responses before and after the event. The proposed solution thereby overcomes the drawbacks of occlusion by trying to associate the tracklets generated up to the beginning of the occurrence with the ones generated from reliable detections after its end. The interval corresponding to the occlusion itself is currently not considered for association purposes, since some of the targets are not detected or present poor responses. Following the problem definition, the formulation of the data association problem is described next.


Problem Formulation

As a priori knowledge for this work, we assume that the sets of trajectories (the tracking of the players) before and after the collision event are already available, i.e., these sets are the input data for our method. Thus, the tracking method is assumed to be robust enough up to the moment when collisions occur in the tracking environment. Let $t_s$ be the start of a collision between some targets and $t_e$ its end. Up to $t_s$ the tracking method is able to correctly build the trajectories of the targets; therefore, until the start of the collision, the set of trajectories of these targets is available, along with the detections of each target in each frame. Once the collision starts, the method is no longer capable of tracking the targets involved, so the period during which the collision occurs cannot be taken into account. As for the period before the collision, the trajectories after the event (the period starting at $t_e$) are inferred from the tracking results after $t_e$. Note that a detection-based tracking method still detects the targets even if some specific event generates incorrect associations or ruptures in the continuity of the trajectories. Consequently, the set of trajectories after the event can be obtained by applying the tracking method with $t_e$ as the start time or by associating the detection responses after the collision. This approach, considering these sets of trajectories, is illustrated in Figure 1.5.

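Figure 1.5: Illustration of the sets of trajectories generated by a collision event.

Let $\Gamma_\alpha$ be the set of trajectories up to $t_s$ and $\Gamma_\beta$ the set of trajectories obtained after $t_e$, and let $|\Gamma_\alpha| = n_\alpha$ and $|\Gamma_\beta| = n_\beta$, where $\Gamma_\alpha = \{T_1^\alpha, \ldots, T_{n_\alpha}^\alpha\}$ and $\Gamma_\beta = \{T_1^\beta, \ldots, T_{n_\beta}^\beta\}$.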

Each trajectory $T_i$, corresponding to target $i$, is composed of the detection responses obtained for that target. However, the only known information is the trajectories themselves. The occlusion in the tracking process generates tracklets whose IDs are no longer known. Therefore, to overcome the information missing from tracking because of the occlusion, an association between the tracklets of $\Gamma_\alpha$ and $\Gamma_\beta$ is required.
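The construction of the two sets can be sketched in a few lines. The sketch assumes the tracker output is a list of `(frame, track_id, observation)` tuples and that the boundary frames $t_s$ and $t_e$ belong to the sets before and after the collision, respectively; the function name and the data layout are hypothetical. Track identifiers after $t_e$ are treated as anonymous labels, since the identities are unknown after the occlusion.

```python
def split_by_collision(detections, t_s, t_e):
    """Build the tracklet sets before and after a collision.

    detections: iterable of (frame, track_id, observation) tuples.
    t_s, t_e: start and end frames of the collision event.
    Returns (gamma_alpha, gamma_beta): two dicts mapping a track label to
    its time-ordered list of (frame, observation) pairs.
    """
    gamma_alpha, gamma_beta = {}, {}
    for frame, track_id, obs in sorted(detections, key=lambda d: d[0]):
        if frame <= t_s:
            gamma_alpha.setdefault(track_id, []).append((frame, obs))
        elif frame >= t_e:
            gamma_beta.setdefault(track_id, []).append((frame, obs))
        # Responses with t_s < frame < t_e fall inside the collision and are
        # ignored, since they are missing or unreliable.
    return gamma_alpha, gamma_beta
```

The association problem is then to decide which label in `gamma_beta` continues which label in `gamma_alpha`.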

Since both sets of trajectories represent distinct periods of tracking, the problem can be represented as a bipartite graph $G[X, Y]$, where the nodes of part $X$ are the trajectories of $\Gamma_\alpha$ and the nodes of part $Y$ are the trajectories of $\Gamma_\beta$. In order to associate the trajectories of part $X$ with those of $Y$, we first have to define an evaluation metric over a possible association, which corresponds to the accuracy and feasibility of assigning tracklet $T_i^\alpha$ to tracklet $T_j^\beta$.

Let $w : E(G) \to \mathbb{R}$ be this evaluation function, defined over the edge set $E(G)$, i.e., over the association of $T_i^\alpha$ and $T_j^\beta$, denoted $w(T_i^\alpha, T_j^\beta)$ (or $w_{i,j}$). As all possible associations are considered in order to obtain the correct assignments, leading to $|\Gamma_\alpha| \times |\Gamma_\beta|$ edges, the problem is formulated as a complete weighted bipartite graph $G$, in which we are interested in obtaining a subset of non-adjacent edges $M \subseteq E(G)$, i.e., a matching that includes $n = |X| = |Y| = |M|$ associations between $\Gamma_\alpha$ and $\Gamma_\beta$. In addition, since we are considering a multi-camera environment, this pairwise association is performed over the data of each camera independently, and the results are used to generate the final association, which considers the information from all cameras. The graph formulation is illustrated in Figure 1.6.

Figure 1.6: Graph formulation for the tracklet association problem: bipartition of the tracklets; definition of the association cost $w_{i,j}$; generation of the complete graph $G[\Gamma^\alpha, \Gamma^\beta]$; and the matching obtained by the global data association method.
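To make the formulation concrete, the following minimal Python sketch selects a perfect matching of minimum total cost on a small complete bipartite graph by enumerating all assignments. The cost values are made up for illustration, the function name `best_matching` is hypothetical, and treating $w_{i,j}$ as a cost to be minimized is an assumption. Exhaustive enumeration is only practical for a handful of tracklets and merely stands in for a polynomial-time matching algorithm.

```python
from itertools import permutations


def best_matching(cost):
    """Return (pairs, total) for the minimum-cost perfect matching.

    cost[i][j] is an illustrative association cost between tracklet i of the
    "before" set and tracklet j of the "after" set (assumed square matrix).
    Brute force over all n! assignments: a sketch, not an efficient solver.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        # perm[i] is the "after" tracklet assigned to "before" tracklet i.
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm)), best_total


# Made-up costs for three tracklets on each side of a collision.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.6, 0.8, 0.3],
]
pairs, total = best_matching(cost)
# pairs == [(0, 0), (1, 1), (2, 2)]
```

Each returned pair `(i, j)` is one edge of the matching $M$; no two pairs share a tracklet, which is the non-adjacency condition stated above.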

This dissertation makes three main contributions to address and evaluate the problem of collisions in multiple object tracking. First, we model the problem as a data association method formulated as a weighted bipartite graph, and we evaluate different choices in our graph model, including the tracklet representation, the association cost between tracklets, and the matching algorithms that generate the final data association. Second, regarding the appearance model, we evaluate different representations and metrics for comparing two models. Finally, we provide a new dataset for the tracking area that includes severe cases of occlusion in a challenging sport-based, highly dynamic tracking scenario, and that also provides a multi-camera setup.

Different from current datasets, our dataset provides severe cases of occlusion and annotations of these events. Such annotations are essential to properly analyze the robustness of a tracker with respect to occlusions; this dataset is one of the few that provide occlusion annotation and, to the best of our knowledge, the only one that provides collision information. In terms of size, it is currently the second-largest dataset in the tracking area that provides occlusion information when the separate videos, one per camera of our environment, are not counted individually, and it is the largest one when all sequences from all cameras are counted independently.

This dissertation is organized as follows: in Chapter 2, we present and discuss the current state of the art in the tracking area; the data association method, including the appearance models employed, is presented in Chapter 3; the experimental protocol and evaluation, including our dataset features, are presented in Chapter 4; and the conclusions of the work are presented in Chapter 5.


Object Tracking

Object tracking can be separated into two main areas: single-object tracking, commonly referred to as visual tracking, and multiple object tracking (or multi-target tracking) [73, 83, 86]. They share the structure of a general tracking approach, but the tracking process usually differs due to the scenarios in which they are applied.

Tracking an object can be outlined in terms of several modules: object representation, search process, and model update [73, 83, 86]. Two other important parts are also included: motion models, which are generally embedded in the search process and provide information for a more confident estimation; and occlusion models, which handle close target interactions that generate occlusion between targets.

These tracking modules are easily seen in online methods, which handle the tracking problem by processing one frame at a time. An object representation is defined to visually detect the target in the image; this representation is the basis of the tracking process that, given the current object state, infers the next position of the object of interest. As a new frame is processed, the object model is updated to deal with appearance changes and motion. In batch or offline methods, forward and backward scanning over the frames is employed in order to generate the object's trajectory. Generally, these methods first perform an association of the object states over the frames, which generates intermediary tracklets that are afterward globally associated in order to produce longer trajectories. Despite this distinction, the tracking modules presented are still the essential core of offline approaches, although they are used differently than in online ones. The object model, which integrates motion and appearance information, is used to associate the object states obtained in all frames; the search process is now an association step over the object states or intermediary tracklets, and the model is updated according to these associations.

Despite the distinction between visual and multiple object tracking, it is important to view tracking as a wider area with similar challenges. One of these challenges is the object representation, which has been well explored in the visual tracking context. Besides, as visual tracking methods can be integrated into a multi-target tracking framework [86], it is important to know their structure and approach to the problem as well. In this regard, for detailed information on current visual tracking approaches, we refer the reader to [69, 83, 86], especially the review by Wu et al. [83], which presents an evaluation over sequences with attributes related to the many factors that can affect tracking performance, and the work of Salti et al. [69], which reviews the current state of the art in visual tracking regarding adaptive appearance models.

In the following sections, we present approaches in the multiple object tracking area. We describe the main current categories of approaches to the multi-object problem, and the tracking modules proposed in the state of the art. As occlusion is a challenge for both pedestrian and sport-based scenarios, and because most contributions focus on pedestrian tracking, we review methods from both sub-areas. Lastly, we present a concise review of the principal differences between these approaches.

2.1 Multiple Object Tracking

Multiple object tracking methods can be characterized by several attributes, which aid in classifying them into main categories. As in visual tracking, these attributes help to identify the context or scenario in which an approach can be applied. The first aspect to be considered is the identification of the target's location in image coordinates, for which detection-based or segmentation-based methods are usually applied. Once the image location of the targets is defined, online or batch methods are used to process the detection hypotheses. Another aspect of a tracking method is the number of targets considered in the scenario. In approaches considering a fixed number of targets, the targets are generally present throughout the whole video; in approaches considering a variable number of targets, the scenario has several targets that enter and leave the tracked area at different times.

As one of the principal modules of a tracking method, the object representation, or appearance model, is well explored, with schemes ranging from low-level models, such as gradient, color, texture, and spatio-temporal models, to temporal mid-level features, such as dense point trajectories [21] and supervoxels [84]. Furthermore, the appearance model can be represented by a regular model based on the object's entire bounding box or by a part-based model [31, 65] represented by regions. Although appearance is an important feature for discriminating the targets, some methods [26, 48, 66] do not rely on it to track the players. In the sports context, classifiers can learn the appearance model of different teams [59, 60], and digit recognition of the players' jersey numbers can augment the model [53, 72]. Adaptive appearance models are also used in order to learn robust appearance models; they can be based on machine learning, classified into online [8, 9, 19] and offline [7] approaches, or on frameworks using a model composed of previous responses up to the current time [30, 31].

Considering the search process model, or the main tracking approach, there are two principal categories: Bayesian and data association-based approaches. In the former, tracking is formulated with a Bayesian structure which includes three fundamental models: observation, motion, and sampling models. This category includes Bayesian methods such as Kalman filtering [60], particle filtering [9, 30, 31] with Markov Chain Monte Carlo (MCMC) and Reversible-Jump MCMC (RJMCMC) sampling strategies [27, 48, 87], and the Bootstrap Filter [19]. In the data association category, there are two principal approaches for tracking: applying hierarchical association models or performing a global association considering all information of the targets from all frames. For both sub-classes, a graph-based model is generally applied, for which several methods have been proposed, such as modeling the tracking problem as a maximum weight independent set problem [20], as a network-flow problem [24, 59, 72], as a convex optimization problem [53], and as constrained label propagation [25]. Besides that, there are also tracking methods that use both Bayesian and data association models [9, 27, 66]. In these methods, data association is used to adjust some phase of the Bayesian model. Lastly, the environment considered for tracking can also include a multi-camera setup, as in [31, 40, 41, 72], where information from different viewpoints aids in the estimation of the target's position.

In the following, we present the three tracking categories that best cover the current literature: Bayesian, data association, and mid-level features-based approaches. The last of these models a data association problem using features different from those of the previous categories. An overview of the principal methods in each of the tracking categories is presented in Figure 2.1. We start each section by analyzing some methods of the corresponding category and close with a final overview of each sub-area.

Figure 2.1: Overview of the principal methods in each category of tracking.

2.1.1 Bayesian Approaches

Khan et al. [48] presented a Markov Chain Monte Carlo (MCMC)-based and a Reversible-Jump MCMC (RJMCMC)-based particle filtering approach to track a fixed and a variable number of targets, respectively. A Bayesian approach is adopted, which offers a systematic way to combine prior knowledge of target positions, modeling assumptions, and observation information in the problem of tracking multiple targets [48, 74]. The primary goal of the approach is to determine the posterior distribution $P(X_t \mid Z_t)$ over the current joint configuration of the targets $X_t$ at the current time step $t$, given all observations $Z_t = \{z_1, \ldots, z_t\}$ up to that time. The posterior distribution is then primarily determined by the prior distribution, the observation likelihood, and the motion model. For the motion model, a Markov Random Field (MRF) motion prior is introduced, which is constructed at each time step to address interactions between nearby targets. To overcome the high-dimensional state spaces induced by the MRF formulation, MCMC sampling is used, based on a Monte Carlo approximation of the posterior in terms of unweighted samples obtained via the Metropolis-Hastings (MH) algorithm [44]. This is used to generate a set of samples from the approximate posterior over the joint target configurations $X_t$ at each time step $t$. To handle a variable number of targets, the MCMC sampling is extended with an RJMCMC sampling step, an extension to variable-dimensional state spaces. In this case, the posterior $P(K_t, X_{K_t} \mid Z_t)$ is a distribution over a variable-dimension space induced by the set of identifiers $K_t$ of the targets. The variable-target motion model is partitioned into probabilities of targets entering, leaving, and staying in the tracking area, and samples are drawn from this model in the RJMCMC sampling procedure. This sampling procedure is performed by changing the dimensionality of the state space. For each change considered, referred to as a jump, a reverse jump is defined in order to potentially move the chain back to a previous hypothesis.
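As an illustration of the sampling step only, the following is a minimal random-walk Metropolis-Hastings sketch in Python for a one-dimensional target density. It is not the joint multi-target sampler of [48]: the Gaussian proposal, the step size, the toy target, and the name `metropolis_hastings` are all assumptions chosen to keep the example small.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Draw unweighted samples from a density known up to a constant.

    log_target: function returning the log of the (unnormalized) density.
    Uses a symmetric Gaussian random-walk proposal (an assumption), so the
    acceptance ratio reduces to target(proposal) / target(current).
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        # The current state is recorded whether or not the move was accepted.
        samples.append(x)
    return samples


# Toy target: a unit-variance Gaussian centered at 2 (illustrative only).
samples = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2, 0.0, 20000)
kept = samples[2000:]  # discard an initial burn-in portion
estimate = sum(kept) / len(kept)
```

The retained samples are unweighted, so expectations under the target are approximated by plain averages, as with `estimate` above.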

Yang et al. [87] presented a tracking method based on the probabilistic framework of Khan et al. [48], introducing an occlusion variable as part of the solution, which was not considered in the latter. A vector variable expressing the occlusion relationships among targets at each time step is combined with the current joint configuration of targets, producing a new hypothesis for the posterior in the Bayesian framework.

The new posterior distribution is then adapted, specifically in the observation likelihood and the motion prior model. In addition, an MRF occlusion prior model is incorporated into the framework. The observation likelihood model is defined considering appearance features by incorporating a global mask and a similarity function with occlusion relationships. If a target $i$ is occluded, instead of calculating the similarity over the new image observation, the function is applied over a fusion of the features of both the occluded and the occluding targets. This function also considers the current degree of occlusion $r_{i,j}$, defined as the proportion of the area of $i$ that is occluded by $j$. Three degrees are considered: slight, heavy, and full occlusion. The MRF occlusion prior model is obtained by defining a prior probability of a target $i$ being occluded or not by another target, independently of the current situation of other targets. This probability is defined by the proportion of targets in the neighborhood of target $i$ that occlude it. Their interactions are also incorporated in the MRF model by expressing cascading occlusion relationships, i.e., target A is occluded by target C while C simultaneously occludes target B.
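The degree of occlusion $r_{i,j}$ can be sketched in Python under the simplifying assumption that each target is an axis-aligned bounding box and that $j$ lies in front of $i$. The function names and the thresholds separating slight, heavy, and full occlusion are hypothetical; the text above does not state the values used in [87].

```python
def occlusion_ratio(box_i, box_j):
    """Fraction of the area of box_i covered by box_j.

    Boxes are (x_min, y_min, x_max, y_max). Using bounding boxes as the
    target area is an assumption made for this sketch.
    """
    inter_w = max(0.0, min(box_i[2], box_j[2]) - max(box_i[0], box_j[0]))
    inter_h = max(0.0, min(box_i[3], box_j[3]) - max(box_i[1], box_j[1]))
    area_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
    if area_i <= 0:
        return 0.0
    return (inter_w * inter_h) / area_i


def occlusion_degree(ratio, heavy_from=0.5):
    """Map a ratio to a coarse degree; heavy_from is a hypothetical threshold."""
    if ratio <= 0.0:
        return "none"
    if ratio < heavy_from:
        return "slight"
    if ratio < 1.0:
        return "heavy"
    return "full"


r = occlusion_ratio((0, 0, 10, 10), (8, 0, 20, 10))  # 20% of box i is covered
degree = occlusion_degree(r)
```

Note that the ratio is not symmetric: `occlusion_ratio(a, b)` normalizes by the area of `a`, matching the definition of $r_{i,j}$ as a proportion of the area of $i$.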

Bae and Yoon [9] proposed an online Bayesian approach. The overall structure of this tracking process consists of two main parts, visual tracking and track management, together with an online discriminative appearance learning method. In the visual tracking part, association is performed with a Bayesian approach over the current state of the tracklets and the new observation set. This association is primarily performed over the affinity score between a track and a new observation. The posterior data association probability generated for the current frame is then used to compute an association score matrix over the current tracklets and observations, which is solved by the Hungarian algorithm [2]. Once the association for each observation is determined, the track states are estimated with a particle filtering method. The track management part determines whether a track is complete (terminated) or not, finds new object hypotheses, and links fragmented tracklets through an association model that generates a cost matrix representing the feasibility of associating those tracklets, considering an affinity score as in the visual tracking part.

To learn discriminative models of the trajectories, an ensemble learning method [7] is used, where the appearance model is generated over color, shape, and texture properties. The track affinity and observation models, which compute the track-to-track and track-to-observation affinity scores, are based on appearance, shape, and motion. For the appearance affinity, only descriptors from the tail and head of the tracklets being compared are considered. The shape affinity uses their heights and widths, and the motion affinity uses a Kalman filtering method to estimate the positions and velocities in the time gap between two tracklets.

Collins and Carr [27] presented a method coupling the detection and association phases. A Bayesian formulation is adopted in which detections and associations are random variables, and likelihood functions in the model measure how well the detection set and data association explain the observation data. The overall solution maximizes a joint posterior distribution over the detections and associations given the observations, decomposed into a stochastic proposal of detections and a deterministic solution for data association. A stochastic search over the space of detection configurations is performed, where the initial set of detections is perturbed to find new detection hypotheses. This step is interleaved with a deterministic algorithm for the global association of the hypothesized detections.

The data association is represented as a trellis graph in which each stage corresponds to one frame, with its detection set as the nodes and with edges defined between each pair of detections in adjacent frames of the graph. Each frame is represented with color (YCbCr space), a foreground mask, and an occupancy map. Each edge cost is determined by the distance between the detections and by their color similarity, calculated with the Earth Mover's Distance. Then, the data association is addressed as a minimization of the sum of costs in the graph, in a multi-dimensional assignment problem based on network flow [26], which is solved by the Shafique and Shah algorithm [70]. An RJMCMC sampler is used to search over the detection configurations. In the RJMCMC algorithm, a new detection set is proposed by perturbing the initial detection data, and for each new state an optimal data association is computed. This new state is accepted or rejected according to a Metropolis-Hastings-Green (MHG) ratio.

Another online tracking method proposed with a Bayesian formulation is the particle filter framework presented by Morais et al. [31]. The observation model, different from the aforementioned methods, uses the estimated positions from the detections in a multimodal function corresponding to a mixture of Gaussian functions, each one obtained from the projected position of the detection of each camera in the court plane. A part-based appearance model for the targets using color and gradient information is adopted. This model is represented by three regions corresponding to the upper, middle, and lower body parts. To reinforce the discriminative property of the method, an adaptive appearance model is included. For each target, a set of previously used appearance models is maintained by the method. Then, at each time step, the new detection is projected onto the court plane, and the resulting Gaussian function is weighted by the similarity between the current appearance set of that target and the new observation. As new and better representations are processed for the target, this set of appearance models is updated.
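A multimodal observation function of this kind can be sketched in Python as a weighted sum of two-dimensional Gaussians, one per camera detection projected onto the court plane. This is only an illustration of the idea described above: the isotropic covariance, the value of `sigma`, the weights, and the function names are assumptions, not the model of [31].

```python
import math


def gaussian_2d(point, mean, sigma):
    """Isotropic 2D Gaussian density (isotropy is an assumption of this sketch)."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    norm = 2.0 * math.pi * sigma ** 2
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / norm


def observation_likelihood(point, projected_detections, weights, sigma=0.5):
    """Weighted mixture evaluated at a court-plane position.

    projected_detections: one (x, y) court-plane position per camera.
    weights: one appearance-similarity weight per detection (hypothetical values).
    """
    return sum(
        w * gaussian_2d(point, mu, sigma)
        for mu, w in zip(projected_detections, weights)
    )


# Two cameras see the same player at slightly different court positions.
detections = [(3.0, 4.0), (3.4, 4.1)]
similarities = [0.9, 0.6]
score = observation_likelihood((3.1, 4.0), detections, similarities)
```

A particle located near several projected detections with high appearance similarity receives a larger score than one far from all of them, which is how the function would rank particles in a particle filter.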

Using a classic particle filter approach, Pádua et al. [66] model the tracking problem from a calibrated camera, as in [30, 31]. Different from the former, however, they do not use appearance models for the objects, and they use a camera located at the top view of the court plane, which makes the environment less affected by occlusion and clutter. Besides, in the observation model, the Hungarian algorithm [2] is applied to the position of the detection and the current position of the tracker, compared by the Euclidean distance function, and the prediction phase of the particle filter is adjusted by this association.

As Bayesian frameworks are widely adopted in multiple object tracking, we can see that the main differences between the current methods in the literature lie in the following aspects: observation model, motion model, and sampling strategy. Also, there is a predominance of particle filter-based approaches over, for example, the Kalman filter-based approach of Lu et al. [60], which includes a Conditional Random Field in the posterior configuration and a Linear Programming Relaxation algorithm in the prediction phase. For the sampling strategy, classic methods [9, 30, 31, 60, 66], MCMC [48, 87], and RJMCMC [27, 48, 87] techniques are employed. For the motion model, an MRF-based model [48], a Discretized Wiener velocity-MRF-based model [87], and the constant velocity model [9, 19, 30, 31, 66] are embedded in the framework. Lastly, for the observation model, there are classic approaches with or without [48, 66] appearance information, approaches using a fusion function over the data provided by multiple cameras [30, 31], approaches based on and weighted by a detector confidence model [19], and approaches assigning the target filter to new observations by a matching algorithm [9, 19, 66].

2.1.2 Data Association Approaches

Liu et al. [59] proposed a hierarchical data association approach based on context-conditional motion models in a team sport environment. The hierarchical association is composed of three levels, in which low-level, mid-level, and high-level trajectories are obtained. Each level further extends the trajectories over time by incorporating more context information, which bridges previous gaps between the generated trajectories. Position, time, and appearance information for each detection is the base data used in the first level. Appearance information corresponds to the output of RGB-based color classifiers which estimate whether a tracklet belongs to a specific team.

In the low-level and mid-level association steps, tracklets are associated by identifying local spatial-temporal common properties and by subsequent association up to a time threshold, respectively. The high-level trajectories are obtained by finding the minimum cost path for each final trajectory in a cost-flow network. The cost per unit flow, from one mid-level trajectory to another, is decomposed into probabilities of continuity in appearance, time, and motion, where the motion probability is estimated by a model over game context information.


Relying on the fact that player locations are strongly related to the current situation of the game, the authors presented a conditional motion model based on the players' motion features, namely Game Context Features. Four game context features are presented: Absolute Occupancy, Relative Occupancy, Focus Area, and Chasing. The Absolute Occupancy map describes the distribution of players during a time interval in the absolute game field, using an occupancy map that presents the spatial quantization of the detected players. The second feature, the Relative Occupancy map, relies on the assumption that the relative distribution of players is in some way indicative of the identities of the players [59, 64]. A relative occupancy map is calculated based on the relative position of the players in a specific shape context representation, where the surrounding area of the player is separated into regions according to the proximity to the center location and the direction from it. It is calculated between the end-tracklet of a trajectory $\Gamma_i$ and the beginning of another trajectory $\Gamma_j$.
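The spatial quantization behind an absolute occupancy map can be sketched in Python as a simple grid count over player positions. The field dimensions, grid resolution, and function name below are illustrative assumptions and are not taken from [59].

```python
def occupancy_map(positions, field_size, grid_shape):
    """Count player positions per grid cell over the absolute game field.

    positions: iterable of (x, y) in field coordinates.
    field_size: (width, height) of the field (illustrative units).
    grid_shape: (cols, rows) of the quantization grid (an assumption).
    """
    width, height = field_size
    cols, rows = grid_shape
    grid = [[0] * cols for _ in range(rows)]
    for x, y in positions:
        if not (0 <= x <= width and 0 <= y <= height):
            continue  # ignore detections outside the field
        # Positions on the far border are clamped into the last cell.
        col = min(int(x / width * cols), cols - 1)
        row = min(int(y / height * rows), rows - 1)
        grid[row][col] += 1
    return grid


# Positions accumulated over a short time interval on a 40 x 20 field.
positions = [(1.0, 1.0), (1.5, 1.0), (39.0, 19.0)]
grid = occupancy_map(positions, field_size=(40.0, 20.0), grid_shape=(4, 2))
# grid == [[2, 0, 0, 0], [0, 0, 0, 1]]
```

Accumulating the counts over all frames of a time interval gives the distribution of players over that interval; a relative occupancy map would instead bin positions relative to a reference player.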

The third feature is the Focus Area, based on the assumption that in team sports there is often a local region with high player density motion, which can also be correlated with the movement of individual players [59]. The location and movement of the focus area are estimated by applying mean-shift tracking on the local center of the players' detections.

The last feature is the Chasing Feature, which is based on the fact that some players often mark a particular opponent by following their movements in order to possibly interfere in their actions. This is analyzed over trajectories that have similar movements and close positions over time, a relation referred to as a chasing link. The chasing feature is identified by searching, for each pair of trajectories ($\Gamma_i$ and $\Gamma_j$), for a possible third trajectory $\Gamma_k$ that has chasing links with the pair, i.e., a third trajectory that is possibly being followed by the pair. Assuming that the third trajectory is continuous during the time gap between $\Gamma_i$ and $\Gamma_j$, this may indicate that $\Gamma_j$ is the subsequent trajectory of $\Gamma_i$. Finally, a conditional motion model with kinematic features, such as temporal gap, position, velocity, acceleration, and change in velocity, together with the game context features, is obtained using a Random Decision Forest.

Bae and Yoon [8] proposed an online data association approach with a discriminative appearance learning method. Local and global associations are performed, based on the confidence value of tracklets. The tracklet confidence measures the detectability and the continuity of a tracklet, and is defined over its length, missing detections, and the affinity between the tracklet and its associated observations. The similarity between two tracklets or detections, referred to as the affinity score, is defined over three principal cues: appearance, shape, and motion models.

In the local association, tracklets with high confidence are associated with detections provided online. Reliable tracklets tend to produce reliable detections, and based on this assumption high confidence tracklets are associated first. On the other hand, missing detections decrease the confidence of a tracklet, which usually creates fragmented trajectories. Hence, these tracklets are globally associated with detections and other tracklets. Local association of high confidence tracklets is performed by a pairwise assignment of the detection responses and the tracklets. A cost matrix over the affinity score of each possible assignment is defined, and the final association is obtained by the Hungarian algorithm [2]. The tracklet is then updated with the information of the associated detection, such as position, velocity, and size, and a new confidence value is calculated. In the global association, three types of association for the low confidence tracklets are modeled: association with a high confidence tracklet, no association (denoting the end of the tracklet), and association with a detection response. As in the local step, a cost matrix considering the three types of association is defined, and the optimal association pairs are determined with the Hungarian algorithm. The same threshold and information update procedures as in the local association are applied.

For the appearance modeling of tracklets, an online appearance learning method is used. Samples are extracted from high confidence tracklets at different locations and scales, and are represented by color histograms. To overcome the high dimensionality of the appearance model, the feature vector is projected onto a low-dimensional subspace, which is updated using a method based on Incremental Linear Discriminant Analysis (ILDA) [49].

Kumar and Vleeschouwer [53] proposed a graph formulation with label propagation to track multiple objects. A graph construction over spatio-temporal and appearance cues is employed to assign similarity over the detections, and from this association the correct identities are determined by a graph-based label propagation framework.

Three association graphs are defined: the spatio-temporal, appearance, and exclusion graphs. The spatio-temporal and appearance graphs are derived from the Locally Linear Embedding (LLE) technique, which computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data [68]. The algorithm defines a more significant neighborhood over a data point and its set of neighbors, based on the Euclidean distance and the K-nearest neighbors algorithm. Finally, the graph construction is formulated as an optimization problem over the reconstruction weights obtained by [68]. For the spatio-temporal graph, the data point is defined as the time instant and the location information; for the appearance graph, it is the appearance feature (e.g., a color histogram). The exclusion graph associates detections coexisting at the same time in order to prevent them from being assigned the same label in the label propagation step.
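Of the three graphs, the exclusion graph is the simplest to illustrate. The Python sketch below connects every pair of detections that share a frame, assuming each detection is given as an identifier with a frame index; the data layout and the name `exclusion_edges` are hypothetical and not taken from [53].

```python
from itertools import combinations


def exclusion_edges(detections):
    """Return the set of detection pairs that coexist in the same frame.

    detections: iterable of (detection_id, frame_index) pairs.
    Two detections in the same frame cannot belong to the same target,
    so each such pair becomes one edge of the exclusion graph.
    """
    by_frame = {}
    for det_id, frame in detections:
        by_frame.setdefault(frame, []).append(det_id)

    edges = set()
    for ids in by_frame.values():
        # Sorting gives each undirected edge a single canonical form.
        for a, b in combinations(sorted(ids), 2):
            edges.add((a, b))
    return edges


detections = [("a", 0), ("b", 0), ("c", 1), ("d", 1), ("e", 2)]
edges = exclusion_edges(detections)
# edges == {("a", "b"), ("c", "d")}
```

During label propagation, any two detections joined by one of these edges would be prevented from receiving the same identity.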

To measure the consistency of the assigned labels in a graph $G$, a labeling error is defined by a harmonic function. The tracking problem is then formulated as a minimization/maximization problem over the labeling errors of each graph. This enables the use of more than one appearance feature, such that $n$ appearance features generate $n$ appearance graphs. The optimal solution is generated by a convex optimization model based on the difference of convex (DC) algorithm with a projected gradient method.

The data association category has the most diversified sub-categories in the multiple object tracking state of the art. Although some online approaches [8] gradually associate the trajectories with new observations, most proposed approaches are offline [5, 20, 26, 40, 41, 53, 59], following the natural view of data association as an optimization problem. As the association itself can be separated into two classes (hierarchical [8] and global association [5, 20, 26, 40, 41]), some methods rely on models based on both, i.e., hierarchical steps interleaved with a global optimization that generates the final trajectories [53, 59]. Regarding the model, a plain graph representation is generally employed, modeling multiple object tracking as a maximum weight independent set problem [20]; as a search for minimal paths in the graph, generating the trajectories by a shortest path algorithm [40, 41]; as a multidimensional assignment problem [26]; as a discrete-continuous optimization problem [5]; as a generalized maximum multi clique problem [32]; or as a network flow problem [12, 24, 59, 72].

2.1.3 Mid-Level Features-based Approaches

In this last section, we present two tracking approaches based on the recently addressed mid-level features. These features are based on spatio-temporal segmentation over optical flow techniques and represent a more coherent basis for handling occlusion and pose variations [25]. The overall solution in this category is essentially to employ an association method over these features in order to obtain the object's trajectory. The first method, proposed by Fragkiadaki et al. [42], uses a foreground-background segmentation method, and the second, proposed by Chen et al. [25], uses a Constrained Labeling approach in which the constraint model between connected components of [42] is used.

Fragkiadaki et al. [42] proposed a spatio-temporal segmentation approach to obtain the targets' trajectories; by incorporating affinities based on motion similarity, the method identifies the correct trajectories of each player via spectral clustering. The basic units used are the Dense Point Trajectories calculated by [75], which computes point trajectories based on an optical flow algorithm that tolerates fast motion.

The set of trajectories obtained is then classified as either foreground or background, since points of both can possibly be tracked. The foreground-background segmentation is based on motion contrast from a large temporal context, encoded in the motion saliencies of the trajectories. Non-salient trajectories are assigned to the background, and salient ones form a foreground map for each frame. Then, connected components are defined by regions of close points in the foreground map with no free space between them. In the following step, each point of the salient trajectories at each frame is associated with a connected component generated from the foreground maps.

In a graph representation, with the foreground trajectories being the graph nodes, relationships (edges) are defined based on motion similarity. Attractive weights are introduced between trajectories having similar motion, and repulsive weights between trajectories that belong to different connected components in any frame of their time overlap. The trajectory graph is then segmented by a normalized cut criterion, where the goal is to obtain a graph partitioning that maximizes within-group attraction and between-group repulsion. In order to circumvent the problem of noisy affinities generated by less reliable short trajectories, the method follows a three-step clustering. In the first step, a set of trajectory clusters representing the basic moving entities is generated by spectral clustering. In the second step, the remaining trajectories are associated, up to a threshold, with the closest cluster of the first step. In the third step, additional clusters are computed centered on long trajectories. Clusters with interior repulsion and without time continuity are discarded. Finally, a greedy approach that chooses clusters from largest to smallest is performed until most of the trajectories are covered.

Chen et al. [25] proposed a constrained sequential labeling (CSL) approach. Using Supervoxels [84] and Dense Point Trajectories [21], the tracking approach sequentially assigns labels (object identifiers) to the features that represent the foreground object activity. In the initial frame, the visible objects are labeled manually. In the following frames, new objects are assigned all possible labels. Using a linear cost function and inequality constraints, refinement iterations are performed to obtain more reliable assignments. The cost of assigning a label to a feature is expressed by a cost function that takes into account a weight vector w, obtained by a Perceptron-style learning algorithm, and a cost function descriptor.

The cost function descriptor encodes two types of information: the neighborhood system, which measures how reliably a feature can be labeled considering its surroundings; and the object template, which is defined as the union of all features labeled with l, corresponding to the object l, and is updated after each iteration of the CSL method. Visual dissimilarity is calculated considering location, optical flow, and color properties.
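To make the role of the weight vector w concrete, the following is a minimal sketch of a linear assignment cost and a perceptron-style update. It is not the learning procedure of [25]: the three-component descriptor (location, optical flow, and color dissimilarity), the learning rate `rate`, and all function names are assumptions.

```python
def assignment_cost(w, descriptor):
    """Linear cost w . phi of assigning one label to one feature."""
    return sum(wi * di for wi, di in zip(w, descriptor))

def best_label(w, descriptors):
    """Label with the lowest cost; `descriptors` maps label -> descriptor phi."""
    return min(descriptors, key=lambda label: assignment_cost(w, descriptors[label]))

def perceptron_update(w, descriptors, true_label, rate=1.0):
    """On a mistake, raise the cost of the predicted label and lower the true one."""
    predicted = best_label(w, descriptors)
    if predicted == true_label:
        return list(w)  # correct prediction: weights are left unchanged
    return [wi + rate * (p - t)
            for wi, p, t in zip(w, descriptors[predicted], descriptors[true_label])]
```

Because the label is chosen by minimizing the cost, the update adds the descriptor of the wrongly predicted label and subtracts that of the true label, which is the mirror image of the usual score-maximizing perceptron rule.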

The inequality constraints, the second component of the iterative process, are defined to prevent incorrect labelings. A set of inequality constraints E is associated with each video, each constraint indicating two mid-level features that cannot be assigned the same label. The goal is to define a consistent labeling, that is, one that does not violate any constraint. Based on the idea of constraints between connected components introduced by [42], an SVM classifier is trained for the constraint generator, which returns the inequality constraints over mid-level features. To derive further implications of the current labeling and determine whether it is consistent, a constraint propagation method iteratively computes the implicit consequences, generating a refined labeling consistent with E.
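The text does not detail the propagation method of [25], so the sketch below uses a generic pruning scheme as a stand-in: whenever a feature has a single remaining label, that label is removed from every feature it is constrained against. The candidate-set representation and the names `propagate` and `is_consistent` are assumptions.

```python
def propagate(candidates, constraints):
    """Propagate 'must differ' constraints over candidate label sets.

    `candidates` maps feature -> set of still-possible labels;
    `constraints` is a list of feature pairs that cannot share a label.
    Returns the refined candidate sets, or None if some feature has no label left.
    """
    refined = {f: set(labels) for f, labels in candidates.items()}
    changed = True
    while changed:
        changed = False
        for f, g in constraints:
            for a, b in ((f, g), (g, f)):
                if len(refined[a]) == 1:
                    (label,) = refined[a]
                    if label in refined[b]:
                        refined[b].discard(label)
                        changed = True
                        if not refined[b]:
                            return None  # no consistent labeling exists
    return refined

def is_consistent(labeling, constraints):
    """True if no constrained pair of features received the same label."""
    return all(labeling[f] != labeling[g] for f, g in constraints)
```

For instance, if feature `f1` is fixed to label 1 and E contains the pairs (`f1`, `f2`) and (`f2`, `f3`), then label 1 is removed from `f2`; if that leaves `f2` with a single label, the same label is in turn removed from `f3`.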

2.2 Occlusion Models

Although occlusion is a common difficulty in most dynamic tracking scenarios, few methods address it. Some methods assume an environment where these events do not occur, so tracking fails if they are present (Khan et al. [48]). Other methods do not mention occlusion as an interfering factor (Kumar and Vleeschouwer [53]). The environment is also an indicator of potential occlusions. In the sports context, high-density scenarios are more likely to introduce difficult occlusion cases. But even in sports, some environments are less affected by this occurrence. In the work of Pádua et al. [66], a camera is placed above the court in a top view, which prevents occlusions from occurring: in the worst case the targets will be very close to each other, but never occluding each other.

There are two topics regarding occlusion to be considered: its detection and its handling. The former is less explored, as the challenge is even greater. One exception is the approach of Collins and Carr [27], which tries to both detect and solve this problem by using methods for locating and counting targets in groups. However, the authors concluded that the method can successfully count the number of targets only under partial occlusion; full occlusion events were not solved by the method.

Data association-based approaches can probably handle the occlusion problem better. These methods usually include steps capable of handling time gaps between the target trajectories, which are generally caused by occlusion. In these approaches, hierarchical association is performed to construct longer and more reliable trajectories, thereby overcoming the interruptions that occlusion introduces in tracking. Besides, methods that rely on more information than the appearance and position features can also solve more cases of occlusion. One example is the approach of Liu et al. [59], which includes game context information in the model. The context information can reinforce the association when appearance or position are no longer discriminative, as in collision cases where the targets have similar appearance and motion.

Given the large number of current multi-object tracking methods, we can expect some occlusion events to be handled by them, as presented by Chen et al. [25] (partial occlusions), Fragkiadaki and Shi [42] (short and partial occlusions), and Andriyenko et al. [5] (short occlusions). However, most of them present an approach that does not explicitly model this event. The only approach that directly handles occlusion by modeling the event in the tracking process is that of Yang et al. [87]. But the authors evaluated the occlusion model in a context different from that of the tracking method's evaluation. The occlusion model in [87] is evaluated on a sequence of 180 frames with two rigid objects as the objects of interest; more specifically, two paper cards of different colors (black and white), representing the interacting objects, are tracked. Unlike the occlusion model evaluation, the complete evaluation of the tracking approach is performed in a pedestrian context without any specific information on the occlusion events in the environment.

This is the principal problem in evaluating robustness to occlusion in multiple object tracking. The majority of the datasets currently used have occlusion events embedded in the scenario, but without any particular information about these occurrences. The only dataset that has occlusion annotations is the UCI [80] dataset (see Section 2.3 for more details). The information that is important for a proper evaluation considering occlusion is: the number of events, the targets involved in each event, the duration, and the degree of occlusion (partial or full).
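The annotation items listed above can be gathered in a small record type. The sketch below is only an illustration of that list and does not reproduce the annotation format of the UCI dataset or of any other dataset; the field names, the inclusive frame convention, and the `summarize` helper are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OcclusionEvent:
    targets: tuple    # ids of the targets involved in the event
    start_frame: int  # first frame of the event
    end_frame: int    # last frame of the event (inclusive, by assumption)
    degree: str       # "partial" or "full"

    @property
    def duration(self):
        """Duration of the event in frames."""
        return self.end_frame - self.start_frame + 1

def summarize(events):
    """Number of events, count per degree, and mean duration in frames."""
    by_degree = {"partial": 0, "full": 0}
    for event in events:
        by_degree[event.degree] += 1
    mean = sum(e.duration for e in events) / len(events) if events else 0.0
    return {"events": len(events), "by_degree": by_degree, "mean_duration": mean}
```

A sequence annotated this way would allow a tracker's identity errors to be grouped by the degree and duration of the occlusion in which they occur.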

Therefore, since few datasets have information about occlusion, it is difficult to evaluate the robustness of a method regarding this aspect. Beyond that, there is no standard evaluation framework for these events, which makes it difficult to compare the current state of the art on that property. Indeed, there is currently no evaluation framework that includes all the important aspects of tracking [62].

2.3 Summary

In this section, we briefly review the main points of each tracking method presented in this chapter. The principal differences are presented in modules such as: appearance and motion models; detection or segmentation-based method; tracking technique (category, sub-area, and training algorithm, if used); and the evaluation framework of the method, which includes metrics, context/scenario of tracking, and the datasets themselves. The principal features of detection and tracking are presented in Tables 2.1 and 2.2; those of the appearance and motion models in Tables 2.3 and 2.4, respectively; those of the evaluation framework in Table 2.5; and those of the datasets in Table 2.6.


Table 2.1: Principal features of detection and tracking of the presented approaches. Object location (L): if the method uses a detection (d) or segmentation-based (s) technique. Location method: the location method or indication of the publicly available detections (p.a.d.) used. Online approach (On): indicates if it is an online approach (Y) or not (-). Number of targets (T): tracking of a fixed (F) or variable (V) number of targets. Plane coordinates (PC): if the position obtained in the image coordinates is projected into the real plane (e.g., court plane for sports) (Y) or not (-).

| Reference | Year | Extension of | L | Location method | On | T | PC |
|---|---|---|---|---|---|---|---|
| Khan et al. [48] | 2005 | | s | Color image segmentation [22] | Y | F/V | - |
| Breitenstein et al. [19] | 2011 | Breitenstein et al. [18] | d | ISM [57] and HOG-based [29] detector | Y | V | - |
| Brendel et al. [20] | 2011 | | d | Deformable part-based detector [38] | - | V | - |
| Fragkiadaki and Shi [42] | 2011 | | s | Dense point trajectories [75] | - | V | - |
| Morais et al. [30, 31] | 2012 | | d | Viola and Jones detector [79] | Y | V | Y |
| Kumar and Vleeschouwer [53] | 2013 | | d | p.a.d. APIDIS: [33], TUD: [4] | - | V | - |
| Liu et al. [59] | 2013 | | d | Occupancy map [23] | - | V | Y |
| Lu et al. [60] | 2013 | | d | Deformable part-based detector [38] | Y | V | Y |
| Bae and Yoon [8] | 2014 | | d | p.a.d. by [55, 61, 85] | Y | V | - |
| Bae and Yoon [9] | 2014 | | d | p.a.d. for PETS and ETHMS: [55, 61, 85]; for CAVIAR and Hockey: multiscale detector [34] | Y | V | - |
| Chen et al. [25] | 2014 | | s | Supervoxels [84] and dense point trajectories [21] | - | F/V | - |
| Collins and Carr [27] | 2014 | | s | Background subtraction and occupancy map [23] | - | V | - |
| Shitrit et al. [72] | 2014 | Berclaz et al. [12] | d | Probability occupancy maps | - | V | Y |
| Yang et al. [87] | 2014 | Khan et al. [48] | d | Background subtraction [58] | Y | F/V | - |
| Pádua et al. [66] | 2015 | | d | GMM-based background subtraction [88] | Y | V | Y |

