Pivot-based approaches for Movelets and MASTERMovelets optimizations

Camila Leite da Silva

Pivot-based approaches for Movelets and MASTERMovelets Optimizations

Dissertation submitted to the Programa de Pós-Graduação em Ciência da Computação to obtain the degree of Master in Computer Science.

Advisor: Prof. Vania Bogorny, Dr.

Florianópolis 2020

Cataloging record of the work, prepared by the author through the Automatic Generation Program of the Biblioteca Universitária da UFSC.

Silva, Camila Leite da

Pivot-based approaches for Movelets and MASTERMovelets optimizations / Camila Leite da Silva; advisor, Vania Bogorny, 2020.

88 p.

Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2020.

Includes references.

1. Computer Science. 2. Trajectory Classification. 3. Multiple Aspect Trajectory Classification. 4. Movelets. 5. MASTERMovelets. I. Bogorny, Vania. II. Universidade Federal de Santa Catarina. Programa de Pós-Graduação em Ciência da Computação. III. Title.

Camila Leite da Silva

Pivot-based approaches for Movelets and MASTERMovelets Optimizations

This master's-level work was evaluated and approved by an examining committee composed of the following members:

Prof. Iraklis Varlamis, Dr. Harokopio University of Athens

Prof. Mauro Roisenberg, Dr. Universidade Federal de Santa Catarina

Prof. José Macêdo, Dr. Universidade Federal do Ceará

We certify that this is the original and final version of the final work, which was judged adequate for obtaining the degree of Master in Computer Science.

Prof. Vania Bogorny, Dr. Program Coordinator

Prof. Vania Bogorny, Dr.

Advisor

Florianópolis, 2020.

Digitally signed by Vania Bogorny. Date: 09/04/2020 11:13:25-0300

Digitally signed by Vania Bogorny. Date: 09/04/2020 11:13:50-0300

This work is dedicated to all the women who blaze their own trails and fight for their ideals. And also to those who have not yet found the strength to do so.

ACKNOWLEDGEMENTS

I am very thankful to my supervisor, Vania Bogorny, for all the patience and guidance through this amazing trajectory named Academic Life. To the love of my life, Diego Cortez: without his constant support I would not have had this much strength to go further. I would also like to declare my deepest love and gratitude to my parents, Zaira Marliza and Josemir Delmiro, who have always believed in and encouraged my growth.

To my dream team and best friends forever, Patrícia Veron, Gabriela Balduíno and Jéssica Mioto, for this amazing friendship on which I can rely. I also want to thank the friends and colleagues I have at UFSC, especially those made on the fourth floor of INE; you were all crucial to my journey.

I want to thank all those who contributed to this research, notably the Brazilian agency Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), for financing this work with more than a scholarship: a life-changing opportunity for personal growth.

After all, what we love will always be part of us. -Harry Potter.

SUMMARY

Trajectory data mining has become an important research topic in recent years owing to the large volume of movement data collected from a wide variety of sources and formats. Of particular interest is classification, which aims to identify the class that produced a given trajectory. Works in this scope focus on extracting the best attributes to describe the classes, or the relevant subtrajectories capable of characterizing the trajectories of a class. Initially, trajectories were represented by the position of a moving object in space over time, the so-called raw trajectories, and works on raw trajectory classification extracted numerical features based on mathematical formulas that used the spatio-temporal information. However, as sensors and the social networks available on mobile devices became cheaper and more widespread, trajectories could be enriched with more data, producing the so-called multiple aspect trajectories. The classification of this new type of trajectory data is still in its early days: some works in the literature use text mining techniques to extract features from the semantic dimension of trajectories, while others rely on extracting the relevant subtrajectories. The Movelets and MASTERMovelets techniques are based on extracting the relevant subtrajectories, the movelets, and have outperformed the other works in the literature in both raw and multiple aspect trajectory classification. The ability of these techniques to handle multiple and different subtrajectories and their dimensions makes them unfeasible for large datasets, since they exhaustively explore and evaluate all possible subtrajectories.

In this work, we propose the MASTER-Pivots and SUPER-Pivots methods, which are strategies for reducing the generation of subtrajectories in both the Movelets and MASTERMovelets techniques, in order to speed up their processes while maintaining classification accuracy. MASTER-Pivots is an unsupervised method that limits the size of the movelets and the place from which they are extracted. It is based on identifying the pivot points, which are the best movelets of size one, and on limiting the extraction of movelets to the pivot points and their directly reachable neighbour points. SUPER-Pivots is a supervised strategy that consists of identifying the SUPER-Pivots, which are the subtrajectories, together with their number of dimensions, that occurred most frequently in the trajectories of a given class, and of extracting movelets only from the SUPER-Pivots. Both methods were evaluated through a series of experiments, in which we present an extensive experimental evaluation and comparison that brings together nine state-of-the-art techniques and six well-known, publicly available datasets. MASTER-Pivots and SUPER-Pivots were evaluated considering the number of generated subtrajectories, their scalability and the classification accuracy: the former reduced the processing time with respect to MASTERMovelets by more than 70% in all datasets, while the latter reduced it by at least 80% in every dataset. All code and results used in this work are provided as a benchmark, with the aim of facilitating the analysis and comparison of trajectory classification techniques.

Keywords: Trajectory Classification. Multiple Aspect Trajectory Classification. Movelets. MASTERMovelets. Movelet Pivots.

The Movelets and MASTERMovelets techniques, which are based on the extraction of movelets, can handle all the dimensions present in a trajectory (space, time and semantics), outperforming all other methods in the literature in classification accuracy for both raw and multiple aspect trajectories. However, to identify the relevant subtrajectories, these techniques need to evaluate all possible subtrajectories present in a trajectory, a process that becomes excessively costly when applied to large datasets.

Objectives

The main objective of this dissertation is the detection of relevant subtrajectories, called movelets, for trajectory classification in an accurate manner and with less computational time than the Movelets and MASTERMovelets methods. Another objective is to make the trajectory classification algorithms and datasets available to new researchers in the area. To achieve the general objective, the following specific objectives are defined:

1. Identify the best way to divide and present the trajectory classification methods, proposing a conceptual hierarchy of the state of the art.

2. Propose and implement new pivot-based techniques for trajectory classification, aiming to reduce the computational time of the MASTERMovelets method while maintaining classification accuracy. The pivots will define the search space for movelets in the dataset.

3. Propose a set of public datasets and algorithms as a benchmark for trajectory classification.

4. Evaluate the proposed methods using several datasets and compare them with other methods.

Methodology

The methodology adopted to fulfil the objectives proposed in this work is based on three main stages. The first consists of an extensive literature review on trajectory classification, considering works that perform classification on raw or multiple aspect trajectories. It is followed by the proposal of a taxonomy capable of dividing the works in the literature by identifying their main characteristics. This stage also includes the organization, pre-processing and grouping of the main trajectory datasets in the form of a benchmark, and the implementation of the main trajectory classification techniques, which are those that achieved the highest academic impact.

The second stage consists of developing two solutions to reduce the computational time and the search space of the algorithms based on movelet extraction. The first technique, MASTER-Pivots, uses the best-evaluated movelets of size one, which are trajectory points, as seeds, or pivots, for extracting and evaluating subtrajectories of size greater than one.

This process is summarized in Figure 2: given a trajectory T1 formed by 12 points (A), its subtrajectories of size one are evaluated using the quality functions employed by the Movelets or MASTERMovelets techniques. The best movelets of size one are used as pivots; in the example, these are the points p5, p9 and p10. From the pivots, their reachable neighbours are identified, that is, the points directly to the left and to the right of the pivots, resulting in the neighbourhoods N1 and N2 shown in the figure. The subtrajectories to be evaluated are extracted only from the neighbourhoods, thus limiting the subtrajectory search space.
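The neighbourhood step of this example can be illustrated with a minimal Python sketch. It assumes 1-based point indices and that neighbourhoods sharing a point are merged into one; the function name `pivot_neighbourhoods` and the merging rule are assumptions made for illustration, not the implementation evaluated in this work.

```python
from typing import List


def pivot_neighbourhoods(pivots: List[int], n_points: int) -> List[List[int]]:
    """Group each pivot with its direct left/right neighbours (1-based indices).

    Neighbourhoods that share at least one point are merged; this merging
    rule is an assumption of the sketch.
    """
    neighbourhoods: List[List[int]] = []
    for p in sorted(set(pivots)):
        start = max(1, p - 1)        # point directly to the left, if it exists
        end = min(n_points, p + 1)   # point directly to the right, if it exists
        if neighbourhoods and start <= neighbourhoods[-1][-1]:
            # Overlaps the previous neighbourhood: extend it instead of opening a new one.
            last = neighbourhoods[-1]
            last.extend(range(last[-1] + 1, end + 1))
        else:
            neighbourhoods.append(list(range(start, end + 1)))
    return neighbourhoods


# Trajectory T1 with 12 points and pivots p5, p9 and p10, as in the example.
print(pivot_neighbourhoods([5, 9, 10], n_points=12))  # [[4, 5, 6], [8, 9, 10, 11]]
```

For the pivots p5, p9 and p10 the sketch yields two neighbourhoods, which agrees with the two neighbourhoods N1 and N2 mentioned in the text; the exact boundaries drawn in the figure are not stated here, so they should not be read off this output.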

The third stage of the methodology consists of evaluating and comparing the performance of the proposed methods using real and synthetic datasets, in order to measure their scalability, the computational time spent and the total number of subtrajectories generated and qualified. Based on the conclusions drawn, the writing of articles is planned to describe: (i) the review of the state of the art comparing the main works in trajectory classification; (ii) the unsupervised method for reducing the search space for movelet extraction; and (iii) the supervised method for identifying the movelets.

Results and Discussion

The results obtained in this work show that both proposed techniques, MASTER-Pivots and SUPER-Pivots, are faster yet accurate alternatives for extracting movelets for trajectory classification. Both methods were evaluated to measure their behavior according to metrics such as processing time, accuracy and scalability. MASTER-Pivots proved to be at least 74.0% faster than MASTERMovelets, while SUPER-Pivots proved to be at least 80.00% faster. Despite the difference in computational time, the classification accuracy of both methods does not decrease to an equivalent degree in comparison with Movelets and MASTERMovelets, and in some analyses the accuracy is higher when the optimized methods are chosen.

Keywords: Trajectory Classification. Multiple Aspect Trajectory Classification. Movelets. MASTERMovelets. Movelet Pivots.

ABSTRACT

Trajectory data mining has become an important research topic in the last few years, especially trajectory classification, which aims to identify which class of moving object produced a given trajectory. Works in this scope focus on extracting the best set of features to describe the classes, or the relevant subtrajectories that are capable of characterizing the trajectories of a class. Initially, trajectories were represented by the position of a moving object in space through time, the so-called raw trajectories. Works on raw trajectory classification extracted numeric features based on the spatio-temporal information. Later, as sensors became cheaper and more widespread and social media became available on smartphones, trajectories could be enriched with more data, producing multiple aspect trajectories. The classification of this new type of trajectory data is still in its early days: some works in the literature use text mining techniques to extract features from the semantic information of trajectories, while other works are based on extracting the relevant subtrajectories. The Movelets and MASTERMovelets techniques are based on extracting the relevant subtrajectories, or movelets, and have surpassed the other works in the literature for both raw and multiple aspect trajectory classification, since they can handle multiple and different subtrajectories and their dimensions. However, they are unfeasible for high-dimensional and large datasets, as they exhaustively explore and evaluate all possible subtrajectories. In this work we propose the MASTER-Pivots and SUPER-Pivots methods, which are strategies for reducing the generation of subtrajectories in both the Movelets and MASTERMovelets techniques, in order to speed up their processes while maintaining classification accuracy. MASTER-Pivots is an unsupervised method that limits the size of the movelets and the place from which they are extracted. It is based on identifying the pivot points, which are the best movelets of size one, and on limiting the extraction of movelets to the pivot points and their directly reachable neighbour points. SUPER-Pivots is a supervised strategy that consists of identifying the SUPER-Pivots, which are the subtrajectories that occurred most often in the trajectories of a given class, and of extracting movelets from the SUPER-Pivots only. We also present an extensive experimental evaluation and comparison that brings together nine state-of-the-art techniques and six well-known and publicly available datasets. MASTER-Pivots and SUPER-Pivots were evaluated considering the number of generated subtrajectories, the scalability behavior and the accuracy: the former reduced the processing time w.r.t. MASTERMovelets by more than 70% in all datasets, while the latter reduced it by at least 80% in every dataset. All code and results used in this thesis are provided as a benchmark, aiming to facilitate further analysis and comparison of trajectory classification techniques.

Keywords: Trajectory Classification. Multiple Aspect Trajectory Classification. Movelets. MASTERMovelets. Movelet Pivots.

LIST OF FIGURES

Figure 4 – Examples of trajectory data types of a given person: (A) raw trajectory: trajectory points are represented by the spatio-temporal information only, (B) multiple-aspect trajectory: each point has the weather condition, the social media data, the humor and the visited place, for example.

Figure 5 – Examples of animal trajectories: T1 is the trajectory of a cow, T2 is the trajectory of a deer and T3 is an unlabeled trajectory.

Figure 6 – Example of a Decision Tree for classifying whether an example e1 is "YOUNG" or "ADULT".

Figure 7 – Exemplification of the structure of a Multilayer Perceptron, which has the input, the hidden and the output layers.

Figure 8 – Example of classification task with two classes: yes and no. The real class of each element is written above them, while the class given by the classifier is given by the color of each element.

Figure 9 – Exemplification of the best alignment of a given movelet candidate c1 in a trajectory T1. The best alignment is highlighted by a rectangle.

Figure 10 – Examples of movelet candidate extraction and distance calculator.

Figure 11 – Example of splitpoint selection in the Movelets technique. The distances in the best alignment are in ascending order, and the splitpoint is given in the position where the class changes.

Figure 12 – Example of splitpoint selection in the MASTERMovelets technique, for a movelet candidate with the dimensions of time and rating of visited place.

Figure 13 – Taxonomy that summarizes the related works in trajectory classification.

Figure 14 – (a) Trajectory T1, (b) The Pivots of trajectory T1, and (c) The Pivots neighborhood.

Figure 15 – Running time comparison between the Movelets, the MASTERMovelets, the MASTERMovelets-Log, the MASTER-Pivots and the Movelet-Pivots by varying (Fig A) the size of the trajectories, (Fig B) the number of trajectories and (Fig C) the number of dimensions.

Figure 16 – Diagram of the SUPER-Pivots, with each main step of the method: (a) a trajectory dataset with different classes, (b) the selection and finding of SUPER-Pivots in one class, (c) the λ threshold learning and, finally, (d) the movelets extraction by using only the SUPER-Pivots and using up to λ dimensions.

Figure 17 – Quality proportion of the pivot candidates of: (a) two points and occurring in four of the five trajectories of Class 4, resulting in 80% of occurrence, (b) three points and occurring in two of the five trajectories of Class 4, a proportion of 40%.

Figure 18 – Filters used to prune the number of pivot candidates.

Figure 19 – Example of two redundant pivot candidates c_piva and c_piva, where both have the same size and the same trajectory dimensions.

Figure 20 – Identification of SUPER-Pivots given the pivot candidates.

Figure 21 – Identification of λ threshold, which determines the maximum number of dimensions for the movelet candidates extracted from a class.

Figure 22 – Number of movelets outputted by each technique in grey scale.

Figure 23 – Running time comparison between the MASTERMovelets, MASTERMovelets-Log, SUPER-Pivots and SUPER-Pivots-Log. Running time varying (Fig A) the size of the trajectories, (Fig B) the number of trajectories and (Fig C) the number of dimensions.

Figure 24 – Number of movelets outputted by each technique in grey scale.

Figure 25 – Running time comparison between the MASTER-Pivots, the SUPER-Pivots and the SUPER-Pivots-Log. Running time varying (Fig A) the size of the trajectories, (Fig B) the number of trajectories and (Fig C) the number of dimensions.

LIST OF TABLES

Table 1 – Datasets and classifiers used by each state of the art method.

Table 2 – Datasets with the trajectories sizes, the total number of trajectories and points, the dimensions and the number of classes. The Geolife, Animals and Hurricanes are raw trajectory datasets, while the Brightkite, Gowalla and Foursquare are multiple aspect trajectory datasets.

Table 3 – Trajectory dimensions and features used as input for the classifiers by each technique.

Table 4 – Classification accuracy achieved with the MLP, SVM and RF classification models over six different datasets. * The technique does not support this evaluation. ** Partial result.

Table 5 – Classification accuracy of the Movelet-Pivots by using different percentage of points selected.

Table 6 – Comparison between the accuracy (ACC), computational time (HH:MM:SS) and movelet candidates generated by the Movelets, the Movelet-Pivots, the MASTERMovelets, the MASTERMovelets-Log and the MASTER-Pivots techniques. *Partial result.

Table 7 – Summary of the used datasets.

Table 8 – Comparison between the number of candidates, computational time, accuracy and F-Score of each technique.

Table 9 – Comparison between accuracy, F-Score and time consumption of the methods

CONTENTS

1 INTRODUCTION AND MOTIVATION

1.1 OBJECTIVES AND CONTRIBUTION

1.2 METHODOLOGY

1.3 THESIS OUTLINE

2 BASIC CONCEPTS AND STATE OF THE ART

2.1 BASIC CONCEPTS

2.1.1 Trajectory Data

2.1.2 Trajectory Classification

2.2 STATE OF THE ART

2.2.1 Raw Trajectory Classification

2.2.1.1 Method that supports only space

2.2.1.2 Methods that support spatio-temporal features

2.2.2 Multiple Aspect Trajectory Classification

2.2.2.1 Methods based on Semantic Dimensions

2.2.2.2 Methods based on Space, Time, and Semantic Dimensions

2.2.3 Summary of Related Works

3 UNSUPERVISED APPROACH: MASTER-PIVOTS

3.1 MASTER-PIVOTS

3.1.1 Pivot Points

3.1.2 Pivot Neighbourhood

3.2 EXPERIMENTAL EVALUATION

3.2.1 Benchmark

3.2.2 Evaluated Techniques

3.2.3 Experimental Setup

3.2.4 Results and Discussion

3.2.5 Pivot Evaluation

3.2.5.1 Pivot Selection Analysis

3.2.5.2 Comparison between Pivots and Movelets

3.2.5.3 Scalability/Performance Evaluation

3.3 DISCUSSION

4 SUPERVISED APPROACH: SUPER-PIVOTS

4.1 SUPER-PIVOTS

4.1.1 Learning SUPER-Pivots (LSP)

4.1.2 Learning the Number of Dimensions (SND)

4.1.3 Movelets Extraction

4.2 EXPERIMENTAL EVALUATION

4.2.1 Datasets

4.2.2 Evaluated Techniques and Setup

4.2.3 Results and Discussion

4.2.3.1 Candidate Generation Analysis

4.2.3.2 Scalability Performance Comparison

4.2.3.3 Comparing MASTER-Pivots and the SUPER-Pivots

4.3 DISCUSSION

5 CONCLUSION

1 INTRODUCTION AND MOTIVATION

Understanding the environment, the habits, and the different behaviors of people has been of interest to several companies and for many purposes. In recent years we have witnessed a massive increase in different types of data collection, enabled by the advances in and falling cost of mobile devices and sensors. Since these devices are capable of collecting many kinds of information, it is possible to track the position of a given object in space and time by using the Global Positioning System (GPS), or the social information and feelings of a user by accessing his/her social media data. This information originates a new type of data, called trajectories.

In the beginning, the literature considered trajectories as the traces of a moving object with only the spatio-temporal dimensions, which were called raw trajectories. A raw trajectory consists of a sequence of points, each with x and y coordinates in space, such as latitude and longitude, at a given timestamp. Figure 4 (A) presents an example of a raw trajectory with n points, where each point has spatio-temporal information only.

Figure 4 – Examples of trajectory data types of a given person: (A) raw trajectory: trajectory points are represented by the spatio-temporal information only, (B) multiple-aspect trajectory: each point has the weather condition, the social media data, the humor and the visited place, for example.

With the increasing use of devices that enable the collection of more data dimensions about the movement of an object than just the spatio-temporal ones, trajectories can be enriched with more data. Spaccapietra et al. (SPACCAPIETRA et al., 2008) proposed the concept of semantic trajectories, which are an attempt to aggregate more information into a trajectory by identifying the stops and moves made along its movement. Bogorny and Wachowicz (BOGORNY; WACHOWICZ, 2009) and Mello et al. (MELLO et al., 2019) proposed a formal definition of multiple aspect trajectories, which are spatio-temporal sequences enriched with any sort of information.

In the example of Figure 4 (B), each point has the spatio-temporal information, the name of the visited place, the weather condition, the user humor and the transportation mode.
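To make the distinction between the two trajectory types concrete, the following Python sketch shows one possible in-memory representation. The class names, the field names and the example values are hypothetical and chosen only for illustration; they are not taken from the datasets or the code of this thesis.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class RawPoint:
    """A raw trajectory point: a position in space at a given timestamp."""
    lat: float
    lon: float
    time: datetime


@dataclass
class MultipleAspectPoint(RawPoint):
    """A raw point enriched with any number of additional aspects."""
    aspects: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """A sequence of points plus the class label of the moving object."""
    points: List[RawPoint]
    label: str = ""  # left empty when the class is unknown


# Hypothetical multiple aspect point with the aspects named in the text.
point = MultipleAspectPoint(
    lat=-27.60, lon=-48.52, time=datetime(2020, 4, 9, 11, 13),
    aspects={"place": "University", "weather": "Sunny",
             "humor": "Happy", "transport": "Bus"},
)
trajectory = Trajectory(points=[point], label="user_1")
```

Keeping the extra aspects in a dictionary reflects the idea that a multiple aspect trajectory can carry any sort of information, without fixing the set of dimensions in advance.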

Trajectory analysis is important for many real-world applications, using either raw or multiple aspect trajectories. For example, in smart cities the trajectories collected from citizens, streets and transportation modes can be used to detect, avoid and predict the occurrence of traffic jams. In biology, animal tracking can be used to understand the impact of the ecosystem on animal behavior.

One of the most important and explored topics in trajectory data mining is classification. It consists of extracting features from the trajectories in a dataset and training a classification model to identify the label of the moving object of a given trajectory (LEE et al., 2008). In the example of Figure 5 we can visualize the trajectories of three animals, and the classification task consists of identifying the label, or the animal species, of the unlabeled trajectory T3, by considering the labeled trajectories T1 and T2.

Figure 5 – Examples of animal trajectories: T1 is the trajectory of a cow, T2 is the trajectory of a deer and T3 is an unlabeled trajectory.

Many works in the literature perform trajectory classification, and they differ from each other in the trajectory dimensions they support and the features they extract. Most works focus on raw trajectories with spatio-temporal dimensions, extracting features such as the average speed, direction variation, maximum acceleration, etc. (BOLBOL et al., 2012), (SOLEYMANI et al., 2014), (DABIRI; HEASLIP, 2018), (ZHENG et al., 2010), (SHARMA et al., 2010), (JÚNIOR; RENSO; MATWIN, 2017), (DODGE; WEIBEL; FOROOTAN, 2009), (PATEL et al., 2012), (XIAO et al., 2017), (ETEMAD; JÚNIOR; MATWIN, 2018).
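As an illustration of this kind of feature, the sketch below computes the average speed and the maximum acceleration of a raw trajectory. The point layout (latitude, longitude, time in seconds), the use of the haversine distance and the function names are assumptions of this example; each cited work defines its own feature set and formulas.

```python
import math
from typing import List, Tuple

# Hypothetical point layout: (latitude in degrees, longitude in degrees, time in seconds).
Point = Tuple[float, float, float]

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def segment_speeds(traj: List[Point]) -> List[float]:
    """Speed (m/s) on each segment; timestamps are assumed strictly increasing."""
    return [
        haversine_m(la1, lo1, la2, lo2) / (t2 - t1)
        for (la1, lo1, t1), (la2, lo2, t2) in zip(traj, traj[1:])
    ]


def average_speed(traj: List[Point]) -> float:
    """Total travelled distance divided by total elapsed time."""
    if len(traj) < 2:
        return 0.0
    distance = sum(
        haversine_m(la1, lo1, la2, lo2)
        for (la1, lo1, _), (la2, lo2, _) in zip(traj, traj[1:])
    )
    return distance / (traj[-1][2] - traj[0][2])


def max_acceleration(traj: List[Point]) -> float:
    """Largest change in segment speed per second between consecutive segments."""
    speeds = segment_speeds(traj)
    accelerations = [
        (v2 - v1) / (traj[i + 2][2] - traj[i + 1][2])
        for i, (v1, v2) in enumerate(zip(speeds, speeds[1:]))
    ]
    return max(accelerations, default=0.0)


# Three points along the equator, 10 seconds apart.
traj = [(0.0, 0.0, 0.0), (0.0, 0.001, 10.0), (0.0, 0.003, 20.0)]
features = [average_speed(traj), max_acceleration(traj)]
```

A vector such as `features` would then be given to a conventional classifier; features of this kind summarize the whole trajectory, in contrast to the subtrajectory-based methods discussed later.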

Although trajectories can be enriched with a number of dimensions, some works focus on using only the semantics, such as (GAO et al., 2017) and (ZHOU et al., 2018), which apply Natural Language Processing (NLP) techniques such as word embeddings (MIKOLOV et al., 2013) to extract the features that represent the trajectories.

Only a few works are able to extract features from all dimensions of a multiple aspect trajectory, and most of them were proposed by the MASTER group. Some are based on [...] semantic features, such as whether the trajectories are near tourist places or the day of the week when the data were collected (TRAGOPOULOU; VARLAMIS; EIRINAKI, 2014), (VARLAMIS, 2015). The works of Petry (PETRY et al., 2019a) and Ferrero (FERRERO et al., 2018), (FERRERO et al., 2020) are the only ones specifically developed for multiple aspect trajectories. Movelets (FERRERO et al., 2018) and MASTERMovelets (FERRERO et al., 2020) extract movelets, which are the relevant trajectory parts, or relevant subtrajectories, for describing and discriminating the trajectory classes. Finally, MARC (PETRY et al., 2020) extracts features from trajectories by using word embeddings in all dimensions of multiple aspect trajectories.

Movelets (FERRERO et al., 2018) and MASTERMovelets (FERRERO et al., 2020) are the methods based on the extraction of relevant subtrajectories, and they have outperformed all previous methods for trajectory classification. Both methods consist of extracting and evaluating every possible subtrajectory from the database, in order to find the best ones for describing the classes. Movelets and MASTERMovelets qualify a subtrajectory by calculating the distance between it and every trajectory in the dataset, where a higher quality is given to the subtrajectories that are closer to trajectories of the same class and farther from trajectories of different classes. Only the best qualified subtrajectories are considered movelets. The main difference between the two methods is that Movelets considers the distances of all trajectory dimensions together in each subtrajectory, while MASTERMovelets combines the dimensions to find the best dimension combination for each subtrajectory. The process of extracting and qualifying each possible subtrajectory is costly for both approaches, and the dimension combination is even more costly in MASTERMovelets, which makes the techniques unfeasible for large datasets.
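The qualification idea described above can be sketched in Python as follows. The sketch assumes trajectories with a single numeric dimension, a Euclidean distance over the best alignment of the candidate, and a deliberately simple quality score (mean distance to other classes minus mean distance to the candidate's own class). The names `best_alignment_distance` and `separation_quality` and the score itself are illustrative assumptions; they are not the quality functions used by Movelets or MASTERMovelets.

```python
import math
from typing import List, Sequence


def best_alignment_distance(candidate: Sequence[float], trajectory: Sequence[float]) -> float:
    """Smallest distance between the candidate and any equally long window of the trajectory."""
    m = len(candidate)
    if m == 0 or m > len(trajectory):
        return math.inf
    best = math.inf
    for start in range(len(trajectory) - m + 1):
        window = trajectory[start:start + m]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, window)))
        best = min(best, dist)
    return best


def separation_quality(candidate: Sequence[float],
                       trajectories: List[Sequence[float]],
                       labels: List[str],
                       target_label: str) -> float:
    """Toy score: higher when the candidate is close to its class and far from the others."""
    same = [best_alignment_distance(candidate, t)
            for t, lab in zip(trajectories, labels) if lab == target_label]
    other = [best_alignment_distance(candidate, t)
             for t, lab in zip(trajectories, labels) if lab != target_label]
    if not same or not other:
        return 0.0
    return sum(other) / len(other) - sum(same) / len(same)


# Hypothetical one-dimensional trajectories of two classes.
trajectories = [[1, 2, 3, 4], [2, 3, 4, 5], [9, 8, 7, 6]]
labels = ["cow", "cow", "deer"]
quality = separation_quality([2, 3], trajectories, labels, target_label="cow")
```

In this toy dataset the candidate `[2, 3]` aligns exactly with both trajectories of class "cow" and poorly with the "deer" trajectory, so it receives a high score. Every candidate requires one best-alignment search per trajectory in the dataset, which is the source of the cost discussed in the text.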

As previously said, the techniques based on movelet extraction have outperformed the other works in the literature for trajectory classification, but their main problem is the high computational cost of extracting the movelets. Moreover, although trajectory classification is an important research topic, we could not find an easy and complete source with the commonly used datasets and algorithms for trajectory classification, nor a simple taxonomy to organize the works on this topic.
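A rough way to see where this cost comes from is to count the candidates of an exhaustive search. The sketch below assumes that every contiguous subtrajectory of every size is a candidate and, for the dimension-combining case, that every non-empty subset of dimensions is explored. Both are simplifying assumptions of this example; the actual methods may bound sizes or combinations differently.

```python
from typing import Iterable


def count_subtrajectories(n_points: int) -> int:
    """Number of contiguous subtrajectories of a trajectory with n_points points."""
    return n_points * (n_points + 1) // 2


def count_dimension_combinations(n_dims: int) -> int:
    """Number of non-empty subsets of n_dims dimensions."""
    return 2 ** n_dims - 1


def count_candidates(trajectory_lengths: Iterable[int], n_dims: int,
                     combine_dimensions: bool) -> int:
    """Total candidates in a dataset under the assumptions stated above."""
    per_subtrajectory = count_dimension_combinations(n_dims) if combine_dimensions else 1
    return sum(count_subtrajectories(n) for n in trajectory_lengths) * per_subtrajectory


# Hypothetical dataset: 100 trajectories of 50 points each, with 4 dimensions.
lengths = [50] * 100
print(count_candidates(lengths, n_dims=4, combine_dimensions=False))  # 127500
print(count_candidates(lengths, n_dims=4, combine_dimensions=True))   # 1912500
```

Each of these candidates must additionally be compared with every trajectory in the dataset, so the number of candidates grows quadratically with the trajectory length and, under the stated assumption, exponentially with the number of dimensions.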

1.1 OBJECTIVES AND CONTRIBUTION

The main objective of this thesis is to discover movelets for trajectory classification accurately and faster than Movelets and MASTERMovelets. We also aim to facilitate the access of new researchers to trajectory classification algorithms and datasets by providing all the datasets and algorithms used as a benchmark. To achieve the proposed goals, we present the following specific objectives:

1. Propose a taxonomy for dividing and presenting the trajectory classification methods;

2. Propose and implement pivot-based techniques for trajectory classification based on the [...]

• limit the trajectory search space for finding movelets, by using an unsupervised approach that finds a subset of trajectory points for extracting relevant subtrajectories for classification;

• identify the movelets in a supervised manner with a solution that searches for relevant subtrajectories only in trajectories of one class before evaluating them with trajectories of other classes;

3. Evaluate the proposed techniques on several datasets and compare them to several state of the art methods.

The main contributions of this thesis are: (i) a new taxonomy to divide the works in the literature for trajectory classification, based on the trajectory type and the trajectory dimensions they support, (ii) two new methods for extracting movelets for classifying raw and multiple aspect trajectories, in a faster and still accurate manner, (iii) a benchmark for trajectory classification providing the main datasets and the most relevant techniques.

1.2 METHODOLOGY

We adopted the following methodology to accomplish the objectives proposed in this thesis, which is composed of eleven steps in total:

1. Perform an extensive literature review in trajectory classification, by considering the works that perform classification in both raw and multiple aspect trajectory data, and evaluate them under the following metrics: the compared datasets, the classification models, the classification purpose and the comparison with other works in the literature;

2. Propose a taxonomy for dividing the works in trajectory classification, by identifying their main characteristics and a way to present them;

3. Identify, organize and pre-process the main real trajectory datasets and assemble them as a benchmark;

4. Implement the main trajectory classification methods, which are those that produced the works of higher academic impact;

5. Structure and generate synthetic trajectory datasets based on real data, with different sizes and characteristics, for evaluating the computational time spent by each of the algorithms;

6. Develop and implement an algorithm for reducing the movelets search space by using the best evaluated trajectory points;

7. Develop and implement a method for extracting the movelets, by evaluating the subtrajectories in the classes before evaluating them in the rest of the dataset;

8. Evaluate the behavior of the proposed methods using real and synthetic trajectory datasets, evaluating their scalability, computational time spent and the total number of generated and evaluated subtrajectories;

9. Compare the results with the main methods in the literature, evaluating their behavior over the real trajectory datasets, under the main metrics for trajectory classification;

10. Write articles describing: (i) the review of the state of the art comparing the main works for trajectory classification, (ii) the unsupervised method for reducing the search space for extracting the movelets and (iii) the supervised method for identifying the movelets;

11. Write the thesis by describing the main necessary concepts of trajectory data, the trajectory classification problem, the state of the art, the description of both proposed solutions and the conclusions obtained in the work.

1.3 THESIS OUTLINE

This thesis presents a new representation of the related works in trajectory classification and presents two new methods for trajectory classification based on the movelets extraction. The remainder of this thesis is organized as follows:

• Chapter 2 presents the basic concepts and the main works of the state of the art in trajectory classification, with their conceptual comparison considering their main characteristics;

• Chapter 3 describes and evaluates the unsupervised approach, called MASTER-Pivots, which is based on reducing the movelet search space;

• Chapter 4 presents and evaluates our supervised approach for finding movelets, the SUPER-Pivots, which is based on pre-evaluating the subtrajectories in the trajectories of the same class;


2 BASIC CONCEPTS AND STATE OF THE ART

For a better understanding of the trajectory datatype and its classification, in this Chapter we describe the basic concepts and the main characteristics of the Movelets and MASTERMovelets algorithms in Section 2.1, and we detail the state-of-the-art works that perform trajectory classification in Section 2.2.

2.1 BASIC CONCEPTS

Trajectories and their application in classification tasks are the main concepts necessary for fully understanding our proposal. In this Section we describe these concepts, starting with the trajectory data in Section 2.1.1, followed by the trajectory classification problem and the approaches based on extracting movelets in Section 2.1.2.

2.1.1 Trajectory Data

A raw trajectory is a trace made by a moving object, which consists in a sequence of points with the spatial coordinates in a time stamp. In Definition 1 we present the description of this datatype:

Definition 1 Raw Trajectory: A raw trajectory consists of a sequence of n points $T = \langle p_1, p_2, \ldots, p_n \rangle$, in which $p = \{x, y, t\}$, where x, y is the position of the moving object in space and t is the timestamp at which the point was collected.

The collection of raw trajectory data has increased in the last decade, as the technologies that collect geographical information, such as Global Positioning System (GPS) devices, became cheaper. There are different moving objects that generate trajectories, such as hurricanes, animals, transportation modes in the streets, etc. (LEE et al., 2008). Figure 4 (A) presents an example of a raw trajectory, with a sequence of points with the spatio-temporal information.

Mello in (MELLO et al., 2019) and Petry in (PETRY et al., 2019a) were the first to propose the so-called multiple aspect trajectories, which are described in Definition 2, according to (PETRY et al., 2019a):

Definition 2 Multiple Aspect Trajectory: A multiple aspect trajectory is a sequence of n points $T = \langle p_1, p_2, \ldots, p_n \rangle$, in which $p = \{x, y, t, A\}$, where x, y is the position in space at timestamp t, and A is a set with r aspects $A = \{a_1, a_2, \ldots, a_r\}$.

Basically, a multiple aspect trajectory can have r aspects associated to a single point, apart from the spatio-temporal information. It means that a point can be associated to semantic data, such as the weather condition, the heart rate of the moving object, the humor of the user, etc.


These different aspects can also be presented as dimensions or attributes, and they are called trajectory dimensions from now on. Figure 4 (B) presents an example of a multiple aspect trajectory made by a user, where the trajectory points are associated not only to the spatio-temporal dimensions, but also to the weather condition, the name of the Point of Interest (POI), the humor and the textual information of the social media posts of the user.
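As an illustration of Definitions 1 and 2, the two trajectory types can be sketched as a small data structure. This is only a minimal sketch: the names `Point`, `aspects` and `dimensions`, as well as the aspect values, are hypothetical and are not taken from the original implementations.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Point:
    """A trajectory point p = {x, y, t, A}; A is empty for a raw trajectory."""
    x: float
    y: float
    t: float
    aspects: Dict[str, Any] = field(default_factory=dict)  # hypothetical field name


def dimensions(trajectory: List[Point]) -> List[str]:
    """List the trajectory dimensions: space, time, and every aspect found."""
    names = ["space", "time"]
    for point in trajectory:
        for name in point.aspects:
            if name not in names:
                names.append(name)
    return names


# A raw trajectory (Definition 1): only x, y and t.
raw = [Point(0.0, 0.0, 0.0), Point(1.0, 0.5, 60.0)]

# A multiple aspect trajectory (Definition 2): illustrative aspect values.
multi = [
    Point(0.0, 0.0, 0.0, {"weather": "rainy", "poi": "Home"}),
    Point(1.0, 0.5, 60.0, {"weather": "cloudy", "poi": "University"}),
]

print(dimensions(raw))
print(dimensions(multi))
```

Under this representation, a raw trajectory is simply a multiple aspect trajectory whose aspect set is empty.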

The literature in the field of raw trajectories is dense, and this datatype has many applications, but the multiple aspect trajectories are still a very novel concept, and some research challenges related to this datatype are presented in (FERRERO; ALVARES; BOGORNY, 2016). Mining the multiple aspect trajectories is not a trivial task, since different types of aspects can be associated to a given point. Some effort has been made in the fields of similarity measures (LEHMANN; ALVARES; BOGORNY, 2019), (PETRY et al., 2019b), (FURTADO; PILLA; BOGORNY, 2018), user privacy preservation (PORTELA; VICENZI; BOGORNY, ), and ecology (TOOR et al., 2016) (BUCHIN; DODGE; SPECKMANN, 2012), but our main focus is the trajectory classification for both raw and multiple aspect trajectory data types.

2.1.2 Trajectory Classification

Trajectory classification is the task of identifying the moving object that performed a given trajectory (LEE et al., 2008). In the literature there are many works performing trajectory classification for different purposes: finding a hurricane level, determining an animal species, predicting a transportation mode, user identification, and so on.

Before understanding the trajectory classification, it is necessary to address the classification itself. In data mining, classification is an important and widely explored topic which aims to distinguish the classes in a dataset (NIKAM, 2015). A dataset is composed of several elements, where each element is a set of features, or attributes, and each element has a class label. The objective of the classification algorithms is to train a model capable of assigning the correct label to unlabeled elements in a dataset, with the least classification error. The classification models are trained using the labeled elements of a dataset, which means that the classification techniques need two datasets: a training and a testing set.

The training set is composed of the elements in a dataset whose classes are known by the classification algorithm, which uses the training set for inducing a classification model. The testing set, instead, is composed of elements whose classes are unknown by the classification algorithms, and they are used for validating whether a classification model is good or not for classifying unlabeled elements. A dataset can be divided into training and testing sets in two main manners:

• Hold-out approach: it divides the datasets in proportion of elements, e.g. 70% of the elements in the dataset is used for training the model, respecting the class balance, while 30% is used for testing it;
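A hold-out split that respects the class balance can be sketched as follows. This is a minimal illustration, assuming that the class labels are given as a list and that the split is drawn at random per class; the function name `hold_out_split`, the `seed` parameter and the rounding rule are assumptions of this sketch.

```python
import random
from collections import defaultdict


def hold_out_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), split per class to keep the balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumed rounding rule: nearest integer number of training elements.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Illustrative dataset: 10 elements of class A and 20 of class B.
labels = ["A"] * 10 + ["B"] * 20
train, test = hold_out_split(labels)
print(len(train), len(test))
```

With 70% for training, each class contributes 70% of its own elements to the training set, so the class proportions are the same in both sets.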


discriminate the class. The trajectory features strongly depend on the trajectory type. The average speed and direction change, for example, are numerical features that can be extracted from the spatio-temporal dimensions of a raw trajectory. The multiple aspect trajectories have more textual information from where the features can be extracted.

A common consensus when classifying any of the trajectory types is that the extracted features will be pertinent to the whole trajectory or to a subtrajectory. The subtrajectories consist of the trajectory parts, and are described in Definition 3:

Definition 3 Subtrajectory: Given a trajectory $T = \langle p_1, p_2, \ldots, p_n \rangle$ of size n, a subtrajectory $s_T = \langle p_a, p_b, \ldots, p_m \rangle$ of T is a contiguous subsequence where $a \geq 1$ and $m \leq n$.

Given the subtrajectory meaning, the last concepts necessary for understanding trajectory classification are the global and local features. A global feature is a feature or pattern whose meaning is related to the entire trajectory, as presented in Definition 4:

Definition 4 Global Feature: Given a trajectory T, the global features are information extracted from the entire trajectory T.

Global features can be, for example, the average speed of a trajectory during its whole movement, among others. On the other hand, the local features are the features or patterns that are extracted from subtrajectories, and their meaning is related only to a trajectory part, as described in Definition 5:

Definition 5 Local Feature: Given a trajectory T, the local features are patterns extracted from subtrajectories of T.
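The difference between Definitions 4 and 5 can be illustrated with the average speed. The sketch below assumes points given as `(x, y, t)` tuples with strictly increasing timestamps, and computes the average speed as the mean of the speeds between consecutive points; both the representation and this particular way of averaging are assumptions of the example.

```python
import math


def speeds(trajectory):
    """Speed between every pair of consecutive (x, y, t) points."""
    result = []
    for (x1, y1, t1), (x2, y2, t2) in zip(trajectory, trajectory[1:]):
        result.append(math.hypot(x2 - x1, y2 - y1) / (t2 - t1))
    return result


def average_speed(trajectory):
    """Mean of the consecutive-point speeds (needs at least two points)."""
    values = speeds(trajectory)
    return sum(values) / len(values)


# Illustrative trajectory: slow in the first half, fast in the second half.
trajectory = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (6, 0, 3), (10, 0, 4)]

global_feature = average_speed(trajectory)      # entire trajectory (Definition 4)
local_feature = average_speed(trajectory[2:5])  # subtrajectory p3..p5 (Definition 5)
print(global_feature, local_feature)
```

The global value summarizes the whole movement, while the local value describes only the fast part of the trajectory.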

The works of (FERRERO et al., 2018) and (FERRERO et al., 2020) proposed to use relevant subtrajectories as the features for trajectory classification, where the relevant subtrajectories are called movelets, and their relevance is measured by their capacity for discriminating the classes. For that, the subtrajectories extracted from a trajectory T are first called movelet candidates, then they are evaluated and only the best ones are called movelets. In Definition 6 we define the movelet candidates:

Definition 6 Movelet Candidate: A movelet candidate is a tuple c = (T, start, end, D, quality), where T is the trajectory from which c originates, start and end are the positions in T where c begins and finishes, respectively, and D = (d, T′, class) is a vector that contains the distance d of the movelet candidate c to every trajectory T′ in the dataset and its class. The quality is the quality score given to c.

Each movelet candidate has the information of the trajectory T and the consecutive points from where it was extracted, a vector D with the distances between the movelet candidate to every trajectory of the dataset, and a quality value. To calculate the distance between a


inside a cell belong to the same class. If the trajectories inside a cell are mostly from the same class, then the cell is selected as a feature, and is no longer split in smaller sizes. Otherwise, the process continues for this cell until it reaches the lowest possible size, where the size is defined by a threshold. Then it splits the subtrajectories inside this cell by direction change, and these subtrajectories are then grouped by class. Finally, all grid cells and subtrajectory clusters are then used as trajectory features to feed a Support Vector Machine (SVM) (CORTES; VAPNIK, 1995) classifier. The values of the features are yes or no, i.e., if the trajectory crossed or not the cell or the cluster. This method is generic and can be used to classify different types of moving objects.

2.2.1.2 Methods that support spatio-temporal features

Methods that consider both space and time dimensions extract features from the spatio-temporal points using mathematical formulas that use these dimensions, such as speed, acceleration, direction change, etc.; or statistics such as standard deviation, average speed, and so on. They basically differ from each other in the number and type of local (Definition 5) and global (Definition 4) features they extract. As presented in Figure 13, these works can be divided into three main categories: (i) extract local features from subtrajectories or trajectory points, (ii) extract global features from the entire trajectory; and (iii) extract both local and global features from both subtrajectories and the entire trajectory. These works are detailed in the following Sections:

Local Features

Three main works extract local features for trajectory classification, and they are limited to specific classification problems: (BOLBOL et al., 2012) and (DABIRI; HEASLIP, 2018) focus on transportation mode classification, and (SOLEYMANI et al., 2014) on classifying fish. Bolbol (BOLBOL et al., 2012) segments the trajectories based on a pre-defined number of subtrajectories. After the segmentation process, a sliding window of fixed size is used, and for the subtrajectories inside the window it extracts features such as average acceleration and average speed. Soleymani (SOLEYMANI et al., 2014) classifies medicated fish by segmenting the trajectories using spatial-based grids and time windows. The first segmentation consists of dividing the space in grids, and calculating the time duration of the subtrajectories inside each grid. The second segmentation consists of segmenting the trajectories using time windows of fixed size, and calculating features such as the standard deviation of speed and maximum turning angle from each subtrajectory. Both techniques train an SVM classification model using the features extracted from the subtrajectories.

Dabiri in (DABIRI; HEASLIP, 2018) uses Convolutional Neural Networks (CNNs) for classifying transportation mode. It first extracts features from every pair of sequential trajectory points (e.g. speed, acceleration, direction change and stop rate). Then it represents the trajectories as a vector of four dimensions, one for each feature, similar to a time series of four dimensions. This vector is used as the entry of a CNN, which is tested with multiple CNN architectures. As the CNNs need a fixed input size, an m threshold value is chosen, and trajectories with size greater than m are split in subtrajectories of size m, while the shorter trajectories receive a padding with zeros to reach the m size.
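The fixed-size input preparation described above can be sketched as follows. This is only an illustration and not the code of (DABIRI; HEASLIP, 2018): the function name `to_fixed_size` is hypothetical, each row is assumed to hold the four point-pair features, and the sketch assumes that the last piece of a split trajectory is also padded with zeros when it is shorter than m.

```python
def to_fixed_size(features, m, width=4):
    """Split a sequence of feature rows into pieces of exactly m rows.

    Longer sequences are cut every m rows; any piece shorter than m
    (including a whole short trajectory) is padded with zero rows.
    """
    zero_row = (0.0,) * width
    pieces = [list(features[i:i + m]) for i in range(0, len(features), m)]
    return [piece + [zero_row] * (m - len(piece)) for piece in pieces]


# Illustrative trajectory with five rows of four features each.
rows = [(float(i),) * 4 for i in range(1, 6)]

split = to_fixed_size(rows, m=2)       # longer than m: three pieces
padded = to_fixed_size(rows[:1], m=3)  # shorter than m: one padded piece
print(len(split), len(padded))
```

Every returned piece has exactly m rows, so all pieces can be stacked into a single input tensor of fixed shape.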

Global Features

Three main works classify trajectories by using global features: Zheng in (ZHENG et al., 2008) uses Decision Trees (DT) for transportation modes classification. It computes the speed and acceleration between two consecutive points of each trajectory, and then extracts global features as length, the maximum speed and acceleration, the average, expectation and variance of the speed, the heading change rate, the stop rate and the velocity change rate.

Sharma (SHARMA et al., 2010) initially calculates the pairwise consecutive point features of speed, acceleration, turning angle, displacement, direction, distance and time between the points. Trajectories are then represented as the sequence of these features, and they are used as input to a Nearest Neighbour Trajectory Classification (NNTC) to predict the classes of road vehicles based on the label of the nearest trajectory, which is the same principle of a common 1-NN approach.

The ANALYTiC (JÚNIOR; RENSO; MATWIN, 2017) is an Active Learning approach that computes the speed, the direction variation and the traveled distance between the consecutive points, and it calculates the global features of minimum, maximum and average values from the other features. The Active Learning uses the extracted features to gradually train a binary classifier, which can predict if an example is of a certain class or not. The examples with a high level of uncertainty are reinforced by re-training the binary classifier with new and similar labeled examples. As the training process is only capable of determining whether the trajectory is of a given label or not, the approach is only suitable for datasets with a few classes.

Local and Global Features

The last group of works for raw trajectory classification considers both local and global features. Dodge in (DODGE; WEIBEL; FOROOTAN, 2009), Santos in (SANTOS; ALVARES, 2011) and Patel (PATEL et al., 2012) classify general types of trajectories, while Xiao in (XIAO et al., 2017) and (JIANG et al., 2017) classify transportation modes. All of them train different types of classification models, such as SVMs, Neural Networks (NN), Random Forests (RF), DTs, KNN and Bayes models.

Dodge in (DODGE; WEIBEL; FOROOTAN, 2009) calculates the features of speed, acceleration and direction change between every two consecutive trajectory points, and the trajectories are then represented as sequences of each feature, similar to a unidimensional time series. Local features are extracted from subtrajectories with the same characteristics (e.g. same speed, same acceleration, etc.) and global features are statistics of the entire trajectory such as the minimum speed, maximum speed, average speed, minimum acceleration, etc. The feature selection method called Principal Component Analysis (PCA) is used for selecting the best features.

Xiao in (XIAO et al., 2017) and Etemad in (ETEMAD; JÚNIOR; MATWIN, 2018) also extract global and local trajectory features from all pairs of trajectory points, and the main differences are that Xiao (XIAO et al., 2017) adds some new statistics, such as the percentiles, interquartile range, skewness, coefficient of variation and kurtosis for transportation mode classification. Etemad (ETEMAD; JÚNIOR; MATWIN, 2018) calculates new rate values as global features, and percentile values as local features.

Patel in (PATEL et al., 2012) extended the work of Lee (LEE et al., 2008) to consider the temporal dimension by adding a time interval to the grid cells, considering the local features as the grid cells themselves, and also calculates global features, as the total time duration of the trajectory and the traveled distance. Jiang in (JIANG et al., 2017) calculates the speed between the pairs of trajectory points and extracts the global features of average speed and standard deviation of speed in the trajectory. The local features are the discretization of the speed along the trajectory points, by converting the continuous values in intervals, and finally, an embedded vector representation is learned from both local and global features and a Recurrent Neural Network (RNN) is used for classifying the trajectories.

2.2.2 Multiple Aspect Trajectory Classification

Methods developed for multiple aspect trajectories deal with the dimensions presented in Definition 2: the spatio-temporal and semantic dimensions (non-numerical). As can be observed in Figure 13, these works can be split in two types: (i) works that consider only semantics, detailed in Section 2.2.2.1 and (ii) works that consider all three dimensions of space, time and semantics, detailed in Section 2.2.2.2.

2.2.2.1 Methods based on Semantic Dimensions

Three techniques classify trajectories by considering only the semantic dimension: the work of Lee (LEE et al., 2011), that uses the semantics of the roads to segment trajectories for classification of vehicles, and the works of Gao (GAO et al., 2017) and Zhou (ZHOU et al., 2018) that are limited to the POI identifier to classify the person who is the owner of the trajectory.

Lee in (LEE et al., 2011) uses the semantics of the streets to classify GPS trajectories of road vehicles. It represents the trajectories as the sequence of roads taken by the vehicles, and roads that are taken by at least one trajectory are selected as the first feature. When a certain sequence of roads is taken by a high number of trajectories, the F-Score measure is used to identify which of these sequences are discriminant between the vehicle classes, and these are selected as features. Information gain is then used to evaluate which of the roads and road sequences are kept as features for classifying the vehicles using an SVM classifier.

The Bi-TULER method proposed by Gao in (GAO et al., 2017) converts check-in identifiers to a continuous embedding representation, which consists of transforming the trajectory POI identifier into a numeric vector representation. After the transformation, the embedded trajectory is used to train an RNN, which is responsible for detecting the sequence of POI identifiers that better discriminate the class. Similarly, Zhou in (ZHOU et al., 2018) proposed the TULVAE method that also uses the embedding representation for training an RNN, but differently from Bi-TULER, it uses an extra neural network structure, the Variational Autoencoder (VAE), which is designed for handling large volumes of data.

2.2.2.2 Methods based on Space, Time, and Semantic Dimensions

A few works for trajectory classification can deal with all three trajectory dimensions of space, time and semantics. Tragopoulou in (TRAGOPOULOU; VARLAMIS; EIRINAKI, 2014) and Varlamis in (VARLAMIS, 2015) were the first works to consider all three trajectory dimensions for transportation mode classification. Tragopoulou in (TRAGOPOULOU; VARLAMIS; EIRINAKI, 2014) proposed a smart-phone application which records the user trajectory while enriching the trajectory points with the current speed and the semantic information of the day of the week, the time zone, if it is a working day and if it is near a metro station. All features extracted from the spatio-temporal and semantic dimensions are passed as input to tree-based classifiers. Varlamis in (VARLAMIS, 2015) extended the work of (TRAGOPOULOU; VARLAMIS; EIRINAKI, 2014) by adding the semantic information of whether the point is near a touristic place, on a bus line or on a train rail. It uses an Evolutionary Algorithm to generate new examples for training the classifiers, with the purpose of dealing with datasets with a small number of labeled samples. The technique evaluation is made with several classification methods, such as the SVM, KNN, DT, ML and RF.

A recent method called Movelets, which is based on the time series concept of shapelets (YE; KEOGH, 2011), has outperformed most state of the art works for trajectory classification (FERRERO et al., 2018). It is a general method for classifying any type of trajectory. It extracts all possible subtrajectories in the dataset, turns each into a movelet candidate (Definition 6), and compares the distance of each subtrajectory to all trajectories in the dataset. As it extracts all possible subtrajectories from each trajectory, this process is computationally expensive, but it supports any data dimension (space, time, and semantic), because it uses a different distance function for each dimension. The movelet candidates are all possible sizes of subtrajectories, starting from two consecutive trajectory points until the size of the entire trajectory. Each movelet candidate is qualified by checking its relevance with Equation 2.5, by encapsulating the distances in the dimensions in a single vector of distances, and choosing a splitpoint by using the left side pure concept. The best movelet candidates without point overlapping in the trajectory from where they were extracted are called movelets, which are then used as input in traditional classifiers.

Ferrero in (FERRERO et al., 2020) presented the MASTERMovelets, which extends the method Movelets by not encapsulating the distances of all dimensions into a single movelet, but keeping all dimensions either separately or combined, depending on its quality. While in the Movelets method a subtrajectory generated from two trajectory points with three dimensions generates only one movelet candidate, the MASTERMovelets may generate seven movelet candidates, because it generates $2^l - 1$ candidates, where l is the number of trajectory dimensions. Also, the quality function adopted by the MASTERMovelets is the one given in Equation 2.6, as it has a multidimensional vector of distances.

The main issue of the Movelets and MASTERMovelets techniques is the excessive time consumption when analyzing the movelet candidates in order to generate movelets, which is not feasible for large datasets. The complexity of the Movelets in the worst case is $O(m^2 \cdot n^3)$, where m is the number of trajectories in the dataset, and n is the size of the largest trajectory in the dataset. For the MASTERMovelets, the complexity is $O(m^2 \cdot n^3 \log n \cdot 2^l)$, where l is the number of trajectory dimensions. Also, the method Movelets generates, for each trajectory, a total of $\sum_{i=1}^{n} i$ candidates, where n is the size of the trajectory, while the total number of candidates generated by the MASTERMovelets for one trajectory is $\left(\sum_{i=1}^{n} i\right) \cdot (2^l - 1)$, where $(2^l - 1)$ represents all the possible combinations without overlapping of a given set of l dimensions. The main objective of this work is to propose alternative solutions for the Movelets and MASTERMovelets candidate generation.
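To make the candidate counts above concrete, the following sketch enumerates the subtrajectory positions and the dimension combinations for a small example and checks them against the closed forms. The function names and the dimension labels are hypothetical and serve only this illustration.

```python
from itertools import combinations


def subtrajectory_spans(n):
    """All (start, end) positions of contiguous subtrajectories, sizes 1..n."""
    return [(s, e) for s in range(1, n + 1) for e in range(s, n + 1)]


def dimension_combinations(dims):
    """All non-empty combinations of the given dimensions."""
    return [c for r in range(1, len(dims) + 1) for c in combinations(dims, r)]


def movelets_candidates(n):
    """Sum of i for i = 1..n, i.e. n(n + 1)/2 candidates per trajectory."""
    return n * (n + 1) // 2


def master_movelets_candidates(n, l):
    """Candidates per trajectory when every dimension combination is kept."""
    return movelets_candidates(n) * (2 ** l - 1)


dims = ["space", "time", "semantic"]  # illustrative labels for l = 3
print(len(subtrajectory_spans(10)), len(dimension_combinations(dims)))
print(movelets_candidates(10), master_movelets_candidates(10, 3))
```

For a trajectory with 10 points and three dimensions, Movelets evaluates 55 candidates while MASTERMovelets evaluates 55 · 7 = 385, which shows how quickly the search space grows with the number of dimensions.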

2.2.3 Summary of Related Works

In Table 1 we present a summary of the related works on trajectory classification considering: (i) the datasets used by each method; (ii) the evaluated classifiers; (iii) the classification purpose, if it is general or for a specific classification problem; and (iv) to which state of the art method the technique is compared. We notice that most of the works use different datasets for evaluating their methods, and some of them are private datasets, making their experiments difficult to reproduce and compare because the data are not publicly available. Second, some of the works are developed for generic purposes, i.e., any classification problem, while others are developed for specific problems such as transportation mode, vehicles or fish classification. The third observation is that the majority of works do not compare their results to existing methods, which makes it difficult to estimate their robustness w.r.t. competitors.


| Technique | Datasets | Evaluated Classifier | Classification Purpose | Compares to |
|---|---|---|---|---|
| (LEE et al., 2008) | Animals, Vessels, Hurricanes and Synthetic Dataset | SVM | General | None |
| (ZHENG et al., 2008) | Geolife | DT | Transportation Mode | None |
| (DODGE; WEIBEL; FOROOTAN, 2009) | Open Street Map and Eye-Track | SVM | General | None |
| (SHARMA et al., 2010) | Milan Metropolitan | KNN | Road Vehicles | None |
| (LEE et al., 2011) | Taxis from San Francisco and Synthetic Dataset | SVM | Road Vehicles | None |
| (SANTOS; ALVARES, 2011) | Animals, Vessels and Hurricanes | SVM, NN, Bayes | General | (LEE et al., 2008) |
| (PATEL et al., 2012) | Geolife, Animals, Hurricanes and School Buses | SVM, DT, Bayes | General | None |
| (BOLBOL et al., 2012) | Private Dataset | SVM | Transportation Mode | None |
| (SOLEYMANI et al., 2014) | Private Dataset | SVM | Medicated Fish | None |
| (TRAGOPOULOU; VARLAMIS; EIRINAKI, 2014) | Private Dataset | RF, DT | Transportation Mode | None |
| (VARLAMIS, 2015) | Private Dataset | RF, DT, KNN, SVM, NN | Transportation Mode | None |
| (XIAO et al., 2017) | Geolife | KNN, DT, SVM, RF | Transportation Mode | (ZHENG et al., 2008), (DODGE; WEIBEL; FOROOTAN, 2009) |
| (JIANG et al., 2017) | Geolife | NN | Transportation Mode | (ZHENG et al., 2008) |
| (JÚNIOR; RENSO; MATWIN, 2017) | Animals, Vessels and GeoLife | DT, Bayes, KNN, RF, Logistic Regression | General | None |
| (GAO et al., 2017) | Brightkite, Gowalla | SVM, NN, LDA | Users | None |
| (DABIRI; HEASLIP, 2018) | Geolife | NN | Transportation Mode | (ZHENG et al., 2008), (ENDO et al., 2016), (WANG et al., 2017) |
| (ETEMAD; JÚNIOR; MATWIN, 2018) | Geolife | DT, RF, NN, Bayes, Quadratic Discriminant Analysis | Transportation Mode | (ZHENG et al., 2008), (ENDO et al., 2016), (XIAO et al., 2017), (JIANG et al., 2017), (DABIRI; HEASLIP, 2018) |
| (FERRERO et al., 2018) | Animals, Athens Vehicles, Hurricanes and Geolife | Bayes, DT, SVM | General | (LEE et al., 2008), (DODGE; WEIBEL; FOROOTAN, 2009), (ZHENG et al., 2008), (XIAO et al., 2017) |
| (ZHOU et al., 2018) | Foursquare, Gowalla, Brightkite | LDA, DT, RF, NN | Users | (GAO et al., 2017) |
| (FERRERO et al., 2020) | Brightkite, Gowalla, Foursquare | NN, RF | General | (GAO et al., 2017), (ZHOU et al., 2018) |
| (PETRY et al., 2019a) | Brightkite, Gowalla, Foursquare | NN | General | (GAO et al., 2017), (ZHOU et al., 2018), (FERRERO et al., 2018) |


3 UNSUPERVISED APPROACH: MASTER-PIVOTS

Our first approach aims to tackle the Movelets and MASTERMovelets problem of generating all possible movelet candidates. It is an alternative for finding the movelets, named MASTER-Pivots, which consists of limiting the sizes of the movelet candidates and the places from where they are extracted. We evaluated the MASTER-Pivots technique by comparing its behavior with the MASTERMovelets, the Movelets, and the main methods for trajectory classification, using the main datasets used by the state of the art. In this Chapter we describe the proposed method in Section 3.1, the experimental evaluation in Section 3.2, and finally, the discussion is presented in Section 3.3.

3.1 MASTER-PIVOTS

Movelets and MASTERMovelets generate the movelet candidates of all sizes, and this process starts by first constructing those of size one. This is done to optimize the search for the best alignment, which requires computing the distances between the movelet candidates and the trajectories. By first generating all movelet candidates of size one, which are the trajectory points themselves, we have the distance between the points of a trajectory to every point of the other trajectories. To calculate the distances between the movelet candidates of bigger sizes, both Movelets and MASTERMovelets re-use the values calculated from those of size one.
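The re-use of the size-one distances can be sketched as follows. This is a simplified illustration for a single spatial dimension with points given as `(x, y)` tuples; the function names are hypothetical, and aggregating the point distances of an alignment by their mean is an assumption of this sketch, not necessarily the aggregation used by Movelets and MASTERMovelets.

```python
import math


def point_distance_matrix(a, b):
    """Distances between every point of trajectory a and every point of b."""
    return [[math.dist(p, q) for q in b] for p in a]


def best_alignment_distance(matrix, start, size):
    """Smallest distance of the subtrajectory a[start:start + size] to b.

    Only looks up values of the precomputed matrix; no point distance is
    recomputed. Returns infinity when b is shorter than the subtrajectory.
    """
    n_cols = len(matrix[0])
    best = math.inf
    for offset in range(n_cols - size + 1):
        total = sum(matrix[start + k][offset + k] for k in range(size))
        best = min(best, total / size)  # assumed aggregation: mean
    return best


a = [(0, 0), (1, 0), (2, 0)]
b = [(0, 1), (1, 1), (2, 1), (3, 1)]
matrix = point_distance_matrix(a, b)  # computed once (candidates of size one)

print(best_alignment_distance(matrix, start=0, size=1))
print(best_alignment_distance(matrix, start=0, size=2))
```

The matrix is filled once per pair of trajectories, and candidates of any size are then evaluated by summing along its diagonals.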

Although they can re-use the distance values, both methods are still very time consuming, since they evaluate every possible subtrajectory. Optimization is a common concept in computer science which aims to modify a given method in order to improve some of its characteristics, such as the time it spends or the memory it requires (BÄCK; SCHWEFEL, 1993). We explore the Movelets and MASTERMovelets optimization by using pivots, a notion in the optimization scope that describes the greedy solutions that choose specific elements, the pivots, to solve an optimization problem (BÄCK; SCHWEFEL, 1993).

In this section we propose the method MASTER-Pivots, whose general idea is to avoid the generation of all movelet candidates of different sizes until reaching the size of the trajectory. This strategy is unsupervised because it does not use information about the trajectory classes for optimizing the movelet candidates extraction, but it uses the best movelets generated from the subtrajectories of size one only. In this thesis, the term unsupervised is used to qualify the technique that does not evaluate the subtrajectory inside the class before evaluating it in the other classes. It consists of two main steps: finding the Pivot points, which are the best movelets of size one, and from these Pivots generating only the movelet candidates of size not larger than the Pivot and its directly reachable neighbour points. We explain the Pivot points in Section 3.1.1, while the details of the Pivot neighbourhood are given in Section 3.1.2.


3.1.1 Pivot Points

The MASTER-Pivots uses the movelet candidates of size one that exist in both Movelets and MASTERMovelets methods. We follow the same step of generating all the movelet candidates of size one, and then we select the best to be the Pivots, described in Definition 9:

Definition 9 Pivot: Given a trajectory $T = \langle p_1, p_2, p_3, \ldots, p_n \rangle$, the set of Pivots P are the points $p \in T$ with the best quality to discriminate the class of T, according to the quality function given in Equation 2.5 or Equation 2.6.

The first step of the MASTER-Pivots technique is to generate all movelet candidates of size one from a given trajectory of the trajectory training set. Secondly, the quality of all movelet candidates of size one is computed with the function quality, and they are then ranked in descending order. After the ranking it is possible to know which are the most discriminant, and consequently, those that can be selected as Pivots. We use a percentage threshold ∆p to select only the best ranked movelets as Pivots.

Algorithm 1 Finding Pivots

Input: trajectory T, trajectory training set T, pivot percentage ∆p
Output: Set of pivots

 1: pivots ← ∅
 2: pivot_candidates ← ∅
 3: pivot_candidates ← GenerateCandidates(T, 1)
 4: for candidate in pivot_candidates do
 5:     candidate.D[] ← GetBestAlignments(candidate, T)
 6:     splitpoint[] ← GetSplitpoint(candidate.D[])
 7:     candidate.quality ← Quality(candidate.D[], splitpoint[])
 8: end for
 9: SortByQuality(pivot_candidates)
10: pivots ← Sublist(pivot_candidates, ∆p)

Algorithm 1 shows the process of extracting the Pivots. Given as input a trajectory, the trajectory training set and a percentage of pivots to be extracted, it starts by initializing the pivots and pivot_candidates variables (lines 1 and 2). The function GenerateCandidates extracts all movelet candidates of size one from trajectory T (line 3), and, for each candidate (line 4), it calculates the vector D[] of distances from the candidate to all the trajectories in the dataset T in each trajectory dimension (line 5). A splitpoint value in D[] is found (line 6), and the quality of a candidate is assessed by the function Quality (line 7), which was explained in Equation 2.6. If this algorithm is used to also optimize the Movelets approach, then the quality of the movelet candidate is evaluated with Equation 2.5. The candidates of size one are ranked in descending order (line 9), and only the best candidates of the ranking are selected, by using the percentage ∆p given by the user (line 10).
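The flow of Algorithm 1 can be sketched in Python as below. This is only a simplified illustration under several assumptions: points are `(x, y)` tuples in a single spatial dimension, the best alignment of a point is its distance to the closest point of each trajectory, and `placeholder_quality` is a stand-in score that does not implement Equation 2.5 or Equation 2.6 (the splitpoint step of lines 6 and 7 is therefore omitted). The function names and the rounding of the percentage ∆p are also hypothetical.

```python
import math


def best_alignment(point, trajectory):
    """Distance from a size-one candidate to its closest point in a trajectory."""
    return min(math.dist(point, q) for q in trajectory)


def placeholder_quality(distances, labels, own_label):
    """Stand-in score (NOT Equation 2.5/2.6): other-class mean minus own-class mean."""
    same = [d for d, c in zip(distances, labels) if c == own_label]
    other = [d for d, c in zip(distances, labels) if c != own_label]
    if not same or not other:
        return 0.0
    return sum(other) / len(other) - sum(same) / len(same)


def find_pivots(trajectory, own_label, training_set, labels,
                pivot_percentage, quality_fn=placeholder_quality):
    """Return the 1-based positions of the best-ranked points of the trajectory."""
    candidates = []
    for position, point in enumerate(trajectory, start=1):          # lines 3-4
        distances = [best_alignment(point, t) for t in training_set]  # line 5
        candidates.append((quality_fn(distances, labels, own_label), position))
    candidates.sort(key=lambda c: c[0], reverse=True)               # line 9
    # Assumed rounding: keep at least one pivot, rounding the percentage up.
    keep = max(1, math.ceil(len(candidates) * pivot_percentage / 100))
    return sorted(position for _, position in candidates[:keep])   # line 10


# Illustrative training set with two classes.
a1 = [(0, 0), (1, 0), (5, 5)]
a2 = [(0, 0), (1, 0), (2, 0)]
b1 = [(5, 5), (6, 5), (7, 5)]
training_set, labels = [a1, a2, b1], ["A", "A", "B"]

pivots = find_pivots(a2, "A", training_set, labels, pivot_percentage=50)
print(pivots)
```

Any quality function with the same signature can be passed as `quality_fn`, so the ranking and selection steps stay unchanged when the score is replaced by the one actually used by the method.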


Algorithm 2 Pivot Neighbourhood Construction
Input: trajectory T, PivotSet
Output: Set of Pivot Neighbourhoods
1: pivot_Neighbourhoods ← ∅
2: p1 ← T.first_point
3: pm ← T.last_point
4: for each pivot in PivotSet do
5:     if pivot.visited() == false then
6:         start, end ← 0
7:         if pivot != p1 then
8:             start ← pivot.position() − 1
9:         else
10:            start ← 1
11:        end if
12:        end ← NeighborhoodEnd(pm, pivot, PivotSet)
13:        pivot_Neighbourhoods ← (T, start, end)
14:    end if
15: end for
16:
17: procedure NEIGHBORHOODEND(pm, pivot, PivotSet)
18:     pivot.visited() ← true
19:     if pivot == pm then
20:         return pivot.position
21:     else if pivot.position + 1 not in PivotSet then
22:         return pivot.position + 1
23:     else if pivot + 1 ∈ PivotSet then
24:         return NeighborhoodEnd(pm, pivot + 1, PivotSet)
25:     end if
26: end procedure

To construct the Pivot Neighbourhood, we need to find the points directly reachable from the Pivots, as described in Algorithm 2. The algorithm input is a trajectory T and the Pivots extracted from T. The set of neighbourhoods is initialized as empty, and the first point (p1) and the last point (pm) of the trajectory are identified to guarantee the lower and upper bounds (lines 1 to 3). For each Pivot not yet visited, the algorithm constructs a neighbourhood (lines 4 and 5), which is composed of a start and an end point position, both initially set to zero, or null (line 6). If the current Pivot does not correspond to the first point of the trajectory, the neighbourhood starts one position before the Pivot (lines 7 and 8); otherwise, it starts exactly at the Pivot position (line 10). The function NeighborhoodEnd() defines where the neighbourhood ends (line 12).

The procedure NeighborhoodEnd() (lines 17-26) first marks the Pivot as visited (line 18), to avoid reusing the same Pivot in different neighbourhoods. There are three possible conditions for finding the neighbourhood ending. First, if the Pivot corresponds to the last point of the trajectory (line 19), the neighbourhood ends at the Pivot's own position. Second, if the point in the position after the Pivot is not another Pivot (line 21), the neighbourhood ends in the position right after the Pivot. Third, if the position after the Pivot also corresponds to a Pivot (line 23), the function NeighborhoodEnd() is called recursively on that next Pivot. Finally, the algorithm adds to pivot_Neighbourhoods the reference to the trajectory together with the start and end positions (line 13).
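The neighbourhood construction described above can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not the original implementation: the function name `build_neighbourhoods` is hypothetical, positions are 0-based rather than 1-based, pivots are visited in ascending position order (an assumption, since the pseudocode does not fix the iteration order), and the recursion of NeighborhoodEnd() is written as an equivalent loop.

```python
from typing import Iterable, List, Tuple


def build_neighbourhoods(
    trajectory_length: int, pivot_positions: Iterable[int]
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) position pairs, one per neighbourhood."""
    pivots = set(pivot_positions)
    if any(pos < 0 or pos >= trajectory_length for pos in pivots):
        raise ValueError("pivot position outside the trajectory")

    last = trajectory_length - 1
    visited = set()
    neighbourhoods = []

    for pos in sorted(pivots):  # assumption: ascending position order
        if pos in visited:
            continue
        # Start one point before the pivot, unless it is the first point.
        start = pos - 1 if pos != 0 else 0

        # Iterative form of NeighborhoodEnd(): walk through consecutive pivots.
        end = pos
        while True:
            visited.add(end)
            if end == last:            # pivot is the last trajectory point
                break
            if end + 1 not in pivots:  # next point is not a pivot
                end += 1
                break
            end += 1                   # next point is a pivot: keep walking
        neighbourhoods.append((start, end))

    return neighbourhoods


if __name__ == "__main__":
    # Pivots at positions 4 and 5 are consecutive and share one neighbourhood.
    print(build_neighbourhoods(10, [4, 0, 9, 5]))
```

With this ordering, runs of consecutive Pivots collapse into a single neighbourhood, which matches the purpose of the visited flag in the pseudocode.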

The movelet candidates are extracted only from the points covered by the Pivot Neighbourhood, including the Pivot. In this way, we generate fewer movelet candidates than the original Movelets technique, as the candidates are not extracted from all the trajectory points, but from a limited subset of the most discriminant points. Besides reducing the number of movelet candidates of size one, we also limit both the number and the size of the movelet candidates.
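To make the reduction concrete, the following Python sketch enumerates the contiguous spans that can serve as movelet candidates when extraction is restricted to a set of windows. It assumes that an unrestricted extraction considers every contiguous subtrajectory of the trajectory; the function name `candidate_spans` and the example window positions are hypothetical.

```python
from typing import Iterable, Set, Tuple


def candidate_spans(windows: Iterable[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Return every inclusive (start, end) span lying inside one window.

    A set is used so that spans shared by overlapping windows count once.
    """
    spans = set()
    for window_start, window_end in windows:
        for start in range(window_start, window_end + 1):
            for end in range(start, window_end + 1):
                spans.add((start, end))
    return spans


if __name__ == "__main__":
    # Unrestricted extraction over a 10-point trajectory: one window (0, 9).
    all_spans = candidate_spans([(0, 9)])
    # Restricted extraction over three illustrative neighbourhood windows.
    restricted = candidate_spans([(0, 1), (3, 6), (8, 9)])
    longest = max(end - start + 1 for start, end in restricted)
    print(len(all_spans), len(restricted), longest)
```

For a window of w points the number of spans is w(w + 1)/2, so short windows bound both how many candidates are generated and how long they can be.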

The Movelets technique encapsulates the dimension distances of a trajectory point in one single distance value, which means that each point of a trajectory generates one movelet candidate of size one. MASTERMovelets, however, tries to identify the best combination of dimensions for each trajectory point, so a single point generates $2^{l} - 1$ movelet candidates, where l is the number of trajectory dimensions. Thus, we have two Pivot approaches: the Movelet-Pivot, which follows the same characteristics of the Movelets, and the MASTER-Pivots, which follows the configuration of the MASTERMovelets.
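The count of $2^{l} - 1$ candidates per point corresponds to the number of non-empty subsets of the l dimensions. A short Python sketch of this enumeration is given below; the function name `dimension_combinations` and the dimension names are illustrative assumptions, not taken from the datasets.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def dimension_combinations(dimensions: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


if __name__ == "__main__":
    # Hypothetical dimensions of a multiple aspect trajectory point.
    dims = ["space", "time", "poi"]
    combos = dimension_combinations(dims)
    # With l = 3 dimensions there are 2**3 - 1 = 7 combinations.
    print(len(combos), 2 ** len(dims) - 1)
```

The exponential growth in l is why limiting the number of points from which candidates are generated has a larger effect on MASTERMovelets than on Movelets.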

3.2 EXPERIMENTAL EVALUATION

We selected and implemented nine methods for trajectory classification from the literature, including the Pivots, based on the following characteristics: (i) the level of detail given in the paper, in order to allow reproducibility, (ii) their accuracy, (iii) whether they were evaluated over publicly available datasets (we do not evaluate methods that considered only private datasets), and (iv) whether they deal with raw trajectories or semantic trajectories. In summary, we compare the following works: Dodge by (DODGE; WEIBEL; FOROOTAN, 2009), Zheng by (ZHENG et al., 2008), Xiao by (XIAO et al., 2017), Bi-TULER by (GAO et al., 2017), Movelets by (FERRERO et al., 2018), TULVAE by (ZHOU et al., 2018), Etemad by (ETEMAD; JÚNIOR; MATWIN, 2018), MASTERMovelets by (FERRERO et al., 2020), and the MASTER-Pivots. In the following, we describe the datasets in Section 3.2.1, the input of the evaluated techniques in Section 3.2.2, the experimental setup in Section 3.2.3, the classification results in Section 3.2.4 and, finally, the Pivot evaluation in Section 3.2.5.

3.2.1 Benchmark

For the experimental evaluation, we used six publicly available and commonly used datasets, of which three contain raw trajectories and three contain multiple aspect trajectories. Table 2 shows the characteristics of each dataset: the number of trajectories, the sizes of the shortest and the longest trajectory, the attributes of the dataset and its classes.

GeoLife: dataset from the GeoLife project (ZHENG et al., 2008) with daily routine trajectories of 182 users, collected from 2007 to 2012, and with the information of