
Using mobility data to estimate bus arrival time in a smart city


Universidade de Aveiro, Departamento de Eletrónica, Telecomunicações e Informática, 2019

Ana Filipa Ferreira Tavares

Utilização de dados de mobilidade para prever o tempo de chegada de autocarros em cidades inteligentes

Using mobility data to estimate bus arrival time in a smart city


Dissertation presented to the Universidade de Aveiro in fulfilment of the requirements for the degree of Master in Computer Engineering (Mestre em Engenharia Informática), carried out under the scientific supervision of Doutor Ilídio Castro Oliveira, Assistant Professor at the Departamento de Eletrónica, Telecomunicações e Informática of the Universidade de Aveiro, and of Doutor José Maria Fernandes, Assistant Professor at the Departamento de Eletrónica, Telecomunicações e Informática of the Universidade de Aveiro.


the jury

president: Professor Doutor Sérgio Guilherme Aleixo de Matos, Assistant Professor (Professor Auxiliar em Regime Laboral) at the Universidade de Aveiro

examiners committee: Professora Doutora Ana Cristina Costa Aguiar, Assistant Professor at the Faculdade de Engenharia da Universidade do Porto

Professor Doutor Ilídio Fernando de Castro Oliveira, Assistant Professor at the Universidade de Aveiro


acknowledgements

First of all, I would like to thank my parents for giving me the opportunity to pursue higher education and for all the support and strength they gave me along this path.

Secondly, I thank my sister for her companionship during the hours of study and for her unconditional support.

Next, I would like to thank my supervisor, Professor Ilídio Oliveira, for the guidance, availability and knowledge shared throughout this work; my co-supervisor, Professor José Maria, for the suggestions and vision he shared; and Professor Susana Brás for the suggestions, clarifications and knowledge shared in the area of machine learning.

Without the data from Porto's vehicular network this work could not have been developed, so I leave here my thanks to Professor Susana Sargento for providing access to it.

To Leandro Ricardo and Jorge Pereira, for making their time available to answer questions about the previous work.

Finally, I would like to thank the colleagues who accompanied me throughout the degree for their companionship and friendship, especially in this final stage.


Palavras Chave (Keywords) Machine Learning, Estimation of arrival times, Vehicular Network, Smart Cities

Resumo (Abstract) Smart cities use pervasive information and communication technologies, namely sensing and data analysis, to provide new decision support tools and services. One of the main use cases is smart mobility, which addresses the use of computational tools to improve transportation systems and private mobility. In this context, reliable information systems on bus arrival times provide useful services to end users.

The city of Porto is often presented as a smart city and has deployed a vehicular network that incorporates more than 600 vehicles (buses, taxis and garbage trucks), generating data on the GPS location of the (mobile) nodes. The bus location records offer new possibilities for understanding the city's mobility patterns.

The goal of this dissertation is to develop a system capable of estimating bus arrival times, using machine learning techniques on the data available from the existing vehicular network. The developed system has three main modules: (1) line detection, responsible for inferring the possible lines that a bus may be operating; (2) a machine learning model capable of predicting the travel time of a bus between two stops; and (3) a service that links the current context of the buses' locations with the historical prediction model and returns predictions for a given destination stop. The prediction results obtained are in line with those reported in the literature. As a proof of concept, a mobile application for passengers was also developed, demonstrating the practical applicability of the system.


Keywords Machine Learning, Estimated Time of Arrival, Vehicular Ad-Hoc Network, Smart Cities

Abstract Connected cities use pervasive information and communication technologies, especially sensing and data analysis, to offer new decision support tools and services. One of the key use cases of connected cities is smart mobility, which addresses the use of computational tools to enhance transportation systems and private mobility. In this context, reliable information systems concerning bus arrival times provide useful services for end-users.

Porto is often presented as a smart city, which has deployed a Vehicular Ad-Hoc Network with more than 600 vehicles (buses, taxis and garbage trucks) generating data regarding the GPS location of the (moving) nodes. Bus location traces offer new possibilities to understand the city's mobility patterns. The goal of this work is to develop a system for estimating bus arrival times, using Machine Learning techniques on the data available from the existing vehicular network.

The developed system has three main modules: (1) line detection, responsible for inferring possible lines on which a bus may be operating; (2) a machine learning model capable of predicting travel times between two bus stops; and (3) a service linking the current context of buses’ locations with the historical prediction model, which returns predictions for a given destination stop. The prediction results obtained are in line with those reported in the literature. A proof-of-concept mobile application for citizens was also developed, demonstrating the real-life applicability of the system.


Contents

List of Figures
List of Tables
Acronyms

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions
1.4 Dissertation structure

2 State of the Art
2.1 Mobility in Smart Cities
2.2 Estimating the time of arrival of buses in connected cities
2.2.1 Overview
2.2.2 Literature Review
    Model Taxonomy
    Models based on the Historical Data
    Regression Models
    Time Series
    Kalman Filter
    Artificial Neural Network (ANN)
    Support Vector Machine (SVM)
    Hybrid models
    Models incorporating multi-routes
2.2.3 Conclusion
2.3.1 Vehicular network concepts
2.3.2 Porto’s VANET setup
2.4 Sociedade de Transportes Colectivos do Porto (STCP) Bus Network
2.5 Related work for bus passengers support
2.5.1 Google Maps
2.5.2 Moovit
2.5.3 SMSBUS
2.5.4 Porto.Bus

3 Previous work related to this dissertation
3.1 Overview
3.2 Architecture
3.3 Matching Unit
3.4 Prediction module
3.5 Limitations of existing implementation

4 System Proposal
4.1 Selected use cases
4.2 System Architecture
4.2.1 Data Sources
4.2.2 STCP Repository
4.2.3 Bus Line Matching Module
4.2.4 Machine Learning Model
4.2.5 Service Module
4.2.6 Bus Passenger Application
4.3 Interactions between modules

5 Machine Learning Implementation
5.1 Overview
5.2 Problem Setting
5.3 Model Description
5.4 Algorithms
5.5 Evaluation Metrics
5.6 Model Selection and Cross Validation
5.7 Data Collection
5.8 Data Preparation
5.9 Results

6.1 System Deployment
6.2 Bus Network Repository
6.3 Bus Network Application Programming Interface (API)
6.4 Bus Line Matching Module
6.4.1 Available Data Sources
6.4.2 Visualisation of Node Position Data
6.4.3 Common Segments of Bus Lines
6.4.4 Bus Line Matching Algorithm
6.4.5 Bus Line Completions Database
    Overview
    Database Schema Description
6.4.6 Solution Agent
6.4.7 Candidate Line Database
6.4.8 Candidate Line Update Agent
6.5 Machine Learning Services
6.6 Prediction API
6.7 Estimation API
6.8 Bus Passenger Application

7 Results and Validation
7.1 Matching GPS traces with bus lines
7.2 Prediction Module Results
7.2.1 Test dataset validation
7.2.2 Bus line validation
7.2.3 End to end validation
7.3 System Acceptance

8 Conclusions and Future work
8.1 Conclusion
8.2 Future Work


List of Figures

2.1 Smart cities applications [3]
2.2 Concept map based on authors’ keywords in scientific literature databases in the mobility area between 2000 and 2017 [5]
2.3 Categorisation of models used to estimate times of arrival of buses (adapted from [10])
2.4 Example of a Vehicular Ad-Hoc Network (VANET) architecture [40]
2.5 Services provided by Veniam in Porto [42]
2.6 Distribution of STCP bus lines in Porto
2.7 Distribution of STCP bus stops in Porto
3.1 Architecture components and interactions [2]
3.2 Illustration of the matching algorithm steps [2]
3.3 Finding solutions algorithm flowchart [2]
4.1 System architecture
5.1 Multi-route prediction illustration [31]
5.2 Example of time series data splitting [49]
5.3 Box plots of the two numerical features (line_time and weighted_time) and the output variable (running_time) from 1 March 2019 to 4 April 2019
5.4 Scatter matrix of the two numerical features (line_time and weighted_time) and the output variable (running_time) from 1 March 2019 to 4 April 2019
5.5 Scatter matrix of the two numerical features (line_time and weighted_time) and the output variable (running_time) from 11 March 2019 with 79 924 samples
5.6 Average segment travel time from 1 March 2019 to 28 March 2019 comprising 1 730 463 samples
5.7 Average segment travel time from 1 March 2018 to 28 March 2018 comprising 3 135 596 samples
6.1 System deployment
6.3 Location log of a bus on 1 March 2019 on top of Porto city map
6.4 Location log of a bus on 1 March 2019 on top of STCP line geometries
6.5 Location log of a bus on 1 March 2019 on top of two bus line geometries (200 BOLHÃO-CAST. QUEIJO and 200 CAST. QUEIJO-BOLHÃO)
6.6 Geometries of bus lines 600 ALIADOS (depicted by a red line) and 4M AV.ALIADOS (depicted by a white, segmented, tube-like shape) and their intersection. Both lines end at bus stop AAL2
6.7 Geometries of bus lines 701 BOLHÃO (depicted by the segmented topology) and 703 CORDOARIA (depicted by a red line) and their intersection
6.8 Geometries of bus lines 501 ALIADOS (depicted by a red line) and 208 ALIADOS (depicted by a white, segmented, tube-like shape) and their intersection. Both lines end at bus stop AL2
6.9 Bus Line Completions Database Diagram
6.10 Dataset Database Diagram
6.11 Bus passenger application map views
6.12 Estimated times of arrival resulting from the selection of bus line 201 VISO in the map view
6.13 Bus passenger application search views
6.14 Estimated times of arrival resulting from the selection of bus line 501 ALIADOS in the search view
6.15 Excerpt of the timetable regarding working days of bus line 200 CAST. QUEIJO


List of Tables

4.1 Use case description
5.1 Values considered for feature time of day based on [48]
5.2 Dataset division per week
5.3 Dataset statistics
5.4 Time series splits concerning five weeks starting on 1 March 2019
5.5 Model setup with different combinations of features
5.6 Average results per model
5.7 Time series splits concerning the first two weeks of March 2018
5.8 Results of the best model (Gradient Boosting using all features) with the dataset of March 2018
6.1 Description of the attributes used from the table node_data
6.2 Statistics concerning the table node_data in 2019
6.3 The ten pairs of lines with the biggest common segment
6.4 Results of the variation of the maximum missed bus stops
7.1 Results of the bus line matching algorithm when the maximum number of missed stops and the minimum completeness are changed
7.2 GradientBoosting regression metrics per week using all features
7.3 Bus line based regression metrics using the Gradient Boosting model
7.4 End to End Evaluation Metrics


Acronyms

ANN Artificial Neural Network

APC Automatic Passenger Counting

API Application Programming Interface

APTS Advanced Public Transportation System

AVI Automatic Vehicle Identification

AVL Automatic Vehicle Location

CPU Central Processing Unit

ETA Estimated Time of Arrival

GPS Global Positioning System

ITS Intelligent Transportation System

HTTP Hypertext Transfer Protocol

JSON JavaScript Object Notation

OBU On Board Unit

RFID Radio Frequency Identification

REST Representational State Transfer

RSU Road Side Unit

STCP Sociedade de Transportes Colectivos do Porto

SVM Support Vector Machine

CHAPTER 1

Introduction

1.1 Motivation

Connected cities use pervasive information and communication technologies, especially sensing and data analysis, to offer new decision support tools and services. One of the key use cases of connected cities is smart mobility, which addresses the use of computational tools to enhance transportation systems and private mobility. In this context, reliable information systems concerning bus arrival times provide useful services for end-users and may attract passengers to public transportation, allowing more efficient trip planning.

Traffic congestion has been increasing due to city expansion, population growth and vehicle ownership, leading to several negative impacts such as delays, pollution, higher fuel use, and decreased accessibility and mobility. Different techniques have been suggested to alleviate congestion, such as traffic management, constructing more roads and adding more lanes [1]. However, infrastructure expansion alone cannot keep pace with the growth in vehicles, and hence there is a need to explore better solutions.

Improving and expanding public transportation is one of the many promising improvement techniques. To this end, the use of Intelligent Transportation Systems (ITS) has been gaining interest in recent years due to technologies such as Automatic Vehicle Location (AVL), Automatic Vehicle Identification (AVI) and Automatic Passenger Counting (APC). These technologies allow the collection of real-time information that can be used to improve the quality of service (QoS) of public transportation. Moreover, attracting travellers to public transportation through these technologies is one of the applications of Advanced Public Transportation Systems (APTS), a major functional area of ITS.

One of the modes of transportation is the bus service, which has the advantage of covering more extensive areas of a city when compared to the subway. However, bus services are affected by traffic conditions, which lead to delays and consequent frustration for commuters.

In this context, the accurate estimation of bus arrival times can contribute to making public transportation more attractive, since travel time information is relevant to bus passengers [1].

The opportunity for this work is also related to the availability of comprehensive bus location traces, which were made available by the Network Architectures and Protocols (NAP) group at the Institute of Telecommunications (Aveiro). The datasets have been collected in the context of several R&D projects.

The present work is strongly related to a previous master's dissertation project [2]. Both the present and the previous work aim at processing bus location traces to estimate arrival times. However, the underlying methods are substantially different. Chapter 3 offers a detailed characterisation of the preexisting starting point.

1.2 Objectives

Given the motivation for this work, the main objectives set for this dissertation are:

• Change the execution of the bus line matching algorithm implemented in previous work to a streaming paradigm, instead of batch processing.

• Develop a machine learning model to make estimations of times of arrival of buses.

• Develop an integrated system, with a proof-of-concept mobile application for bus passengers, to evaluate real-world applicability.

1.3 Contributions

The contributions of this dissertation are the following:

• Implementation of a bus line matching algorithm that processes live mobility data coming from the nodes of a vehicular network

• Availability of the state of the execution of such algorithm to other entities

• Creation of a dataset related to travel times at segment level which opens up the possibility of comparisons with the times of the latest bus trips.

• Availability of a service that, given a context, makes estimations of times of arrival of buses using a deployed machine learning model.

An article with the title "Estimation of buses arrival time combining historic and live mobility data", describing the methods used in this dissertation, was submitted to and accepted at INForum 2019.

1.4 Dissertation structure

This document comprises eight chapters, organised as follows:

• Chapter 1 - describes the context and motivation of this dissertation, as well as its objectives, contributions and structure.

• Chapter 2 - presents an analysis and general description of the various areas related to this dissertation: intelligent transportation systems, research on machine learning techniques used to estimate the times of arrival of buses, technologies implemented in the city of Porto to support research in the area of mobility, and existing mobile applications that provide arrival time information to users.

• Chapter 3 - presents previous work carried out in the same scope as this dissertation.

• Chapter 4 - presents the system proposal, encompassing the requirements and the architecture.

• Chapter 5 - describes the methodology used to create the machine learning model.

• Chapter 6 - describes in detail how the several modules were implemented and how the system is integrated.

• Chapter 7 - presents the results.

CHAPTER 2

State of the Art

The scope of this dissertation encompasses several areas, from processing vehicular Global Positioning System (GPS) logs and estimating the travel time of buses using machine learning methodologies to deploying a mobile application as a service to bus passengers. Thus, it is necessary to gain some insight into the related work in each area, as well as to explain the supporting concepts and terminology.

2.1 Mobility in Smart Cities

Smart Cities, also designated as Connected Cities, use pervasive information and communication technologies, especially sensing and data analysis, to offer new decision support tools and services or to improve the efficiency of cities. The data collected, after analysis and mining, often translates into quality information allowing valuable insights, thus adding another layer of knowledge about the city. Figure 2.1 depicts areas of application of smart cities.

Equipping cities with these technologies to collect data in several areas increases the extracted knowledge, since it covers multiple areas and the interactions between them. Enhancing or creating city services consequently leads to the improvement of the quality of life of citizens.

One of the key use cases of connected cities is Smart Mobility, which addresses the use of computational tools to enhance transportation systems, private mobility, and the safety of travellers. Taking advantage of this pervasive computing context opens the possibility to understand how urban particles (citizens, buildings, public vehicles) interact and move through the city, providing means to identify traffic, parking, public transportation and infrastructure problems. New solutions, with the aid of better insights, arise to tackle these urban mobility problems.

In SenseMyCity [4] citizens play an important role in collecting data and consequently in identifying problems in the city, since they keep their smartphones around nearly 24/7.

Figure 2.1: Smart cities applications [3]

Another significant point is that these smart devices, seen as everyday objects, have a variety of embedded sensors, such as GPS for location, magnetometer, accelerometer and gyroscope, which allow the acquisition of urban mobility data.

Smart mobility is also trying to shift paradigms from a "single modal" transport system, where a majority of travellers owns a private car, to a flexible multi-modal system where travellers can use several modes of transportation best suited for their current needs [5]–[8]. All of these principles should be aligned with sustainable practices allowing the reduction of air and noise pollution and the use of more eco-friendly vehicles.

However, if the alternative modes of transportation offered to owners of private cars are not reliable and do not ease the commuters' riding experience, that paradigm shift is not possible, and the problems remain unchanged. One of the possible ways to move through cities is public transportation; making these systems more reliable by providing accurate real-time information allows commuters to make informed decisions between public transportation and personal vehicles.

A more detailed view regarding the main trends, concepts, and the evolution of research in the urban smart mobility area can be found in [5]. The authors of this study analysed several publications in Web of Science and Scopus databases and extracted insights through bibliometric analysis (figure 2.2).

Different classes of mobility-related research areas can be found in the literature.

Automatic Vehicle Identification (AVI) systems are systems able to identify a moving vehicle when it passes a particular point. The most common applications include law enforcement (through the use of automatic license plate recognition or video), access control, and electronic toll collection using Radio Frequency Identification (RFID). AVI systems in the public transportation system can be used to monitor fleets, for instance, through the recording of arrival times at bus stops providing a powerful tool to carry out performance analysis to detect potential problems in the current bus carrier infrastructure.


Figure 2.2: Concept map based on authors’ keywords in scientific literature databases in the mobility area between 2000 and 2017 [5]

Automatic Vehicle Location (AVL) systems determine the location of moving vehicles, and most of these systems use GPS to obtain the vehicle’s location. The collected vehicle positions can be used to track and monitor fleets and, in the particular use case of public transportation, AVL shares the advantages of AVI systems.

Automatic Passenger Counting (APC) systems count the number of passengers on board buses by recording the boarding and alighting of passengers. These systems can be used by public transportation companies to identify points of higher demand and problems in serving commuters.

2.2 Estimating the time of arrival of buses in connected cities

2.2.1 Overview

Nowadays, most public transportation systems publish their timetables on the web or on boards at bus stops and, with the widespread use of smartphones among commuters, more apps are used for consulting bus schedules. However, this information is often static and cannot account for delays during the journey. Using the technologies of Advanced Public Transportation Systems, transit agencies can acquire real-time bus information to reduce passenger journey time and improve the service level by helping passengers to schedule their departure time, make smart choices for their travel and increase their confidence.

For bus network managers, a dashboard can help to gain insights into the network, such as detecting patterns in bus routes over a reference window, detecting high demand at bus stops and the occurrence of delays, supporting decision-making, reviewing routes in problematic areas, monitoring the execution of journeys and evaluating the overall efficiency of the service operation. Commuters, in turn, can track nearby buses, check the predicted time of arrival and plan their trip with near real-time information.

2.2.2 Literature Review

Model Taxonomy

The literature offers systematic reviews of methodologies for estimating the time of arrival of buses presenting a similar categorisation of the techniques used (Figure 2.3).

A study reviewed papers from 2004 to 2015 that used machine learning techniques such as regression models, Kalman filter, Artificial Neural Network (ANN), SVM, hybrid models, and Radial Basis Functions [9]. The authors stated that Artificial Neural Network, regression models, Kalman filters and historical data models are used to predict the bus arrival time under several factors and that ANN had a better overall performance when compared to the other techniques.

Another study surveyed several publications that used historical data, statistical methods, Kalman filters and ANN [10]. The authors argued that data concerning bus arrival times are complex and robust results are not achieved by using only one method, leading to the increasing use of combined algorithms to improve performance.

Figure 2.3: Categorisation of models used to estimate times of arrival of buses (adapted from [10])

Models based on the Historical Data

These models predict the time of arrival of the bus based on data from previous bus trips in the same reference window as the trip being estimated. They assume that daily and weekly patterns in traffic conditions exist, so a historical average in a given period is a good prediction of future conditions in the same period [9], [11], [12]. The performance is dictated by the similarity between the historical traffic patterns and the future observations. Since traffic is inherently an environment with unpredictable behaviour, historical-based models are more reliable in steady environments or in areas that have shown minimal traffic congestion.

The methodologies are based on average metrics, such as average travel time and average speed. In average travel time models, this metric can be used alone or in combination with other inputs to estimate the time of arrival. In most studies, these models were developed as a baseline for comparison with other techniques, and in almost all of them they were outperformed by the proposed main algorithms [1], [9], [12]–[15]. Studies that use average speed to estimate the time of arrival compute this metric over segments of the bus route, usually from GPS data, since the distance between two points is easily calculated.
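As an illustration of the historical average idea, the following minimal Python sketch groups past segment travel times by a reference window and returns their mean as the prediction. The choice of window (segment, weekday, hour), the tuple layout and the function names are assumptions made for this example; they are not taken from any of the cited studies.

```python
from collections import defaultdict
from statistics import mean


def build_historical_averages(trips):
    """Average past travel times per (segment, weekday, hour) window.

    `trips` is an iterable of (segment_id, weekday, hour, travel_time_s)
    tuples; this record layout is hypothetical.
    """
    buckets = defaultdict(list)
    for segment_id, weekday, hour, travel_time_s in trips:
        buckets[(segment_id, weekday, hour)].append(travel_time_s)
    return {key: mean(values) for key, values in buckets.items()}


def predict_travel_time(averages, segment_id, weekday, hour):
    """Return the historical mean for the window, or None if it was never seen."""
    return averages.get((segment_id, weekday, hour))


# Toy history: two Monday 08:00 trips and one Monday 14:00 trip on segment "A-B".
history = [
    ("A-B", 0, 8, 120.0),
    ("A-B", 0, 8, 140.0),
    ("A-B", 0, 14, 90.0),
]
averages = build_historical_averages(history)
morning_prediction = predict_travel_time(averages, "A-B", 0, 8)
```

A baseline of this kind has no notion of current traffic, which is consistent with the observation that it is usually outperformed by the models it is compared against.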

Sun et al. [16] proposed an algorithm combining the current speed of the bus, derived from real-time GPS data, with historical average speeds of individual route links. The results indicated that the maximal absolute estimation error was less than 5%. However, the model performed worse during rush hours than during non-rush hours, meaning that variation in traffic affects accuracy.

The performance of the algorithm was also compared with a similar one proposed by Weigang et al. [17] regarding prediction accuracy and achieved better results.
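One simple way in which a current speed and historical link speeds could be combined is sketched below. This is not necessarily the formulation used in [16]: the rule (current speed for the remainder of the current link, historical averages for the downstream links), the minimum speed floor and all names are assumptions made for illustration.

```python
def eta_seconds(remaining_in_link_m, current_speed_mps, downstream_links,
                min_speed_mps=1.0):
    """Estimate the time to reach a stop, in seconds.

    The remainder of the current link is covered at the current GPS speed;
    each downstream link, given as (length_m, historical_avg_speed_mps),
    is covered at its historical average speed. Speeds are floored at
    `min_speed_mps` (a hypothetical safeguard) so that a stationary bus
    does not cause a division by zero.
    """
    eta = remaining_in_link_m / max(current_speed_mps, min_speed_mps)
    for length_m, historical_speed_mps in downstream_links:
        eta += length_m / max(historical_speed_mps, min_speed_mps)
    return eta


# 100 m left at 5 m/s, then links of 300 m at 6 m/s and 200 m at 4 m/s.
example_eta = eta_seconds(100.0, 5.0, [(300.0, 6.0), (200.0, 4.0)])
```

Because the downstream links rely on historical speeds only, an estimate of this kind would be expected to degrade when traffic deviates from its usual pattern, for instance during rush hours.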

Regression Models

Regression models try to mathematically formulate the relationship between the bus arrival time and the factors affecting it, such as dwell time, distance, number of stops, boarding and alighting of passengers, and weather conditions.
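A multiple linear regression of this kind can be sketched in plain Python by solving the normal equations. The two predictors used here (distance in kilometres and number of stops) and the synthetic data are assumptions chosen for illustration; they are not results from the studies cited in this section.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, feature_row):
    """Evaluate intercept + sum(coefficient * feature)."""
    return coefficients[0] + sum(
        c * x for c, x in zip(coefficients[1:], feature_row))


# Synthetic trips generated from time = 30 + 120 * distance_km + 20 * stops.
samples = [(1.0, 2), (2.0, 3), (1.5, 5), (3.0, 4), (0.5, 1)]
times = [190.0, 330.0, 310.0, 470.0, 110.0]
coefficients = fit_linear_regression(samples, times)
```

With noise-free synthetic data the fitted coefficients recover the generating values; real arrival time data would leave a residual that depends on the factors not captured by the chosen predictors.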

Patnaik et al. [18] developed a set of regression models that estimate arrival times concerning buses travelling between two points along a route with data collected by an Automatic Passenger Counting (APC) system. The selected independent variables were distance, the number of stops, dwell times, boarding and alighting of passengers, and weather descriptors. The authors stated that results were promising, although the available data were limited.

Jeong and Rilett [14] compared a historical data-based model, a regression model and an ANN model, while Ramakrishna et al. [15] evaluated a multiple linear regression model and an ANN model to estimate the bus arrival time using GPS-based data. Both studies concluded that other models, namely ANNs, outperformed regression models.

Recent studies used regression models only for comparison with proposed methods, and the latter outperformed the former [12], [13], [19]–[22].

The field of transportation can be an unpredictable environment subject to many factors, and in the regression models approach, it is difficult to identify and incorporate all the variables that affect bus arrival time.

Moreover, the applicability of these models is limited because those variables in transportation systems are usually highly correlated (multicollinearity), which poses a problem because the independent variables should be mutually independent for the model to achieve good results [23].
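The multicollinearity issue can be made concrete with a small Python sketch that measures how strongly two predictors move together. The predictors (segment distance and number of stops) and their values are made up for this example; the variance inflation factor formula used is the standard one for the two-predictor case.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def variance_inflation_factor(r):
    """VIF for the two-predictor case, where R^2 equals r squared."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: longer trips tend to include more stops.
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
number_of_stops = [3, 5, 8, 10, 13]

r = pearson_correlation(distance_km, number_of_stops)
vif = variance_inflation_factor(r)
```

A correlation close to 1 yields a very large variance inflation factor, meaning that the regression cannot reliably separate the individual effect of each predictor.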

Time Series

A time series model can be used simultaneously to find the underlying patterns in the data and use them to forecast/predict values in the future periods. Time series data can be decomposed to extract meaningful information and this decomposition process comprises Trend, Seasonality, Cyclic, and Residual components.

Since these models assume that patterns exist in the data, they try to describe them by mathematical functions and use these functions to make a forecast.

Chien et al. alerted to the fact that variation in the historical data patterns or dissimilarity in real-time data and historical data can lead to inaccurate results [23].

D’Angelo used a non-linear time series model to predict travel time, comparing two models: the first one used only speed data as a variable, while the second model used speed, occupancy and volume data. The results showed that the single variable model was better than the multivariable prediction model [9]. However, the authors referred to some limitations in the collected data.

In 2012, Kumar and Vanajakshi [24] used multiplicative decomposition and exponential smoothing methods to estimate bus travel time considering a constant smoothing parameter. The authors concluded that there would be a strong weekly pattern followed by a trip-wise pattern and in the presence of limited data, the smoothing method outperformed the decomposition method [13].

Kumar et al. [13] implemented a hybrid method by integrating a time series analysis with the Kalman Filter technique to estimate bus travel time. The authors used the exponential smoothing method stating that it requires only one parameter, i.e., smoothing constant, for each interval, which is easy to calculate and update in real-time.
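The appeal of exponential smoothing as a method with a single parameter can be seen in the following minimal Python sketch of simple exponential smoothing. The initialisation with the first observation and the example values are assumptions made here; the sketch is not the specific formulation used in [13] or [24].

```python
def exponential_smoothing_forecast(observations, alpha):
    """One-step-ahead forecast using simple exponential smoothing.

    The smoothed level follows s_t = alpha * x_t + (1 - alpha) * s_(t-1),
    initialised with the first observation. `alpha` is the smoothing
    constant, with 0 < alpha <= 1.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = observations[0]
    for value in observations[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


# Hypothetical travel times (seconds) of the last three trips on a segment.
recent_travel_times = [100.0, 120.0, 110.0]
forecast = exponential_smoothing_forecast(recent_travel_times, alpha=0.5)
```

Updating the forecast when a new trip is observed requires only the previous level and the new value, which is what makes the method easy to maintain in real time.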

Zhong et al. [25] compared a time series model with an ANN, SVM, and a hybrid model. The results showed that the time series model performed worst.

Kalman Filter

Kalman filtering, also known as linear quadratic estimation (LQE), is a recursive procedure comprising two steps that estimate the state of a dynamic system from a series of noisy measurements. In the prediction step, an estimate of the current state variable (an a priori estimate) and the corresponding uncertainty are calculated. When a new measurement is observed, the update step takes place, and the estimate is updated (a posteriori estimate) as well as the uncertainty.

Some of the advantages of the Kalman filtering technique are its simplicity and its suitability for real-time processing since it only uses the present input measurements and the previously calculated state and its uncertainty matrix and no additional past information is required.
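The two steps can be sketched for the simplest case of a single state variable. The class below is an illustrative assumption, not the formulation used in any of the cited works: it assumes a random-walk state model (the state is expected to stay the same between measurements), and the noise variances and sample values are hypothetical.

```python
class ScalarKalmanFilter:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""

    def __init__(self, estimate: float, variance: float,
                 process_var: float, measure_var: float) -> None:
        self.estimate = estimate        # current state estimate
        self.variance = variance        # uncertainty of the estimate
        self.process_var = process_var  # assumed process noise variance
        self.measure_var = measure_var  # assumed measurement noise variance

    def predict(self) -> None:
        # A priori estimate: the state is carried over, the uncertainty grows.
        self.variance += self.process_var

    def update(self, measurement: float) -> None:
        # A posteriori estimate: blend prediction and measurement via the gain.
        gain = self.variance / (self.variance + self.measure_var)
        self.estimate += gain * (measurement - self.estimate)
        self.variance *= (1.0 - gain)


# Hypothetical travel time estimate (seconds) corrected by one new measurement.
kf = ScalarKalmanFilter(estimate=100.0, variance=4.0,
                        process_var=0.0, measure_var=4.0)
kf.predict()
kf.update(110.0)
```

Note that the filter keeps only the last estimate and its variance, which matches the remark that no additional past information is required.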

This method is commonly used in several areas such as time series analysis, signal processing, robotics, and economics.

Shalaby and Farhan developed two Kalman filtering algorithms to predict running times and dwell times separately using AVL and APC data. They stated that Kalman filters had better performance than the historical models, regression models, and time lag recurrent neural network models regarding the accuracy, demonstrating the dynamic ability to update itself based on new data [26].

Vanajakshi et al. implemented an algorithm based on Kalman filtering under various traffic conditions using a space discretisation approach. The results showed that the Kalman filter outperformed the average method on seven out of ten days [27].

In 2014, Kumar et al. [28] evaluated the performance of an ANN and a Kalman filter using data collected for one week from the field. The authors concluded that the ANN-based method performed slightly better when using an extensive database, but the Kalman filter method performed better when such a database is not available.

Artificial Neural Network (ANN)

An Artificial Neural Network is a machine learning model inspired by the neural structure present in the human brain to simulate its intelligent data processing through the interaction between neurons. Chien et al. [23] developed link-based and stop-based ANN models to estimate bus arrival time, and an adaptive algorithm using simulated data to improve the performance of the ANN-based models. The link-based model outperformed the stop-based one, and the use of the adaptive algorithm improved prediction accuracy.

As has already been stated (section 2.2.2) Jeong and Rilett and Ramakrishna et al. have also developed ANN models to predict bus arrival times [14] [15].

Fan and Gurmu [1] compared a historical average model (as a baseline) with two other models, ANN and Kalman filtering, that were developed to predict bus travel time dynamically, using only global positioning system (GPS) data. A short, a medium and a long/whole section were used to test the models. All three models performed well for short sections, but Kalman filtering and ANN outperformed the historical average for both medium and long sections. However, the Kalman filter failed to give reasonable estimates during peak hours, especially the evening peak, since it is not able to handle abrupt variations.

On the other hand, ANN gave stable predictions during peak and off-peak hours. Also, ANN made the most robust predictions and had the lowest prediction error [1].

In 2014, Kumar et al. [28] evaluated the performance of ANN (data-driven technique) and Kalman filtering (limited data requirement) using data collected for one week from the field. The authors concluded that the ANN-based method performed slightly better when using an extensive database, but the Kalman filter method has advantages when such a database is not available.

Support Vector Machine (SVM)

Similar to ANNs, SVMs are also a supervised learning algorithm. In 2006, Yu et al. [29] evaluated the suitability of SVM in bus travel time estimation, pointing out that SVMs are not subject to the overfitting problem like ANNs. The authors built separate models according to the time-of-day and weather conditions.

The developed models were tested using off-line data of a transit route and had better performance when compared with the 3-layer ANN. The results also showed that the performance of the SVM with three input variables was the best, proving that the incorporation of the latest travel times predicts more accurately. Moreover, the SVM that used only historical data had worse results.

Vanajakshi and Rilett [30] compared different forecasting methods for travel time prediction (in general, not specifically for buses), including the historical method, time series analysis, ANN, and SVM. The comparison showed that the performances of the SVM and ANN models were comparable to each other, and that these two methods outperformed the others [31].

Wu et al. [32] also made contributions to the research of travel time prediction using SVM models [31]. The authors compared the performance of SVM and other baseline predictors (historical average and current time prediction).

SVM models proved to be a good substitute for ANNs, having similar or better prediction performance than the ANN models. In general, SVMs have been adopted by researchers since they are also able to infer non-linear relationships and outperform other models concerning prediction accuracy.

Hybrid models

The research presented in the previous sections focuses on the development of one primary method and the comparison of several individual methods. In their review [9], the authors concluded that the number of hybrid methods, i.e., methods that combine more than one technique to predict travel time, has increased. This rise may be due to the complex nature of the problem, which requires both historical data and the most recent information, and to the need to bring together the advantages of multiple methods.

The most used combination of techniques found comprises a machine learning model such as ANN or SVM to handle historical data and make a baseline prediction and a Kalman filter to use the most recent information and adjust that prediction [21], [25], [33]–[37].

Models incorporating multi-routes

The research under this scope does not form a category in the model taxonomy since it does not present a new method to predict bus arrival times. However, the research presented earlier only considers the information of the same bus route and, recently, authors have been addressing the importance of integrating bus travel times of different bus routes on the same road segments to improve the prediction accuracy. In 2011, Yu et al. [22] were the first to state that, when multiple routes reach the same destination and arrival times at that destination are predicted, passengers have several choices and can choose the fastest bus to arrive at the desired stop, improving their experience. Moreover, they pointed out that the travel times of the preceding buses that have passed the targeted stop can be used to reflect the traffic conditions, since the most up-to-the-minute data can provide the most reliable information for the prediction. Their work showed that using multi-route information improved accuracy and that the SVM model performed better. Yu et al. [22] presented several contributions, but the approach was only applied to a single bus stop, and only peak hours were studied.

In 2017, Yin et al. [38] also contributed by proposing an SVM and an ANN to predict bus arrival time at stops served by multiple routes.

Hua et al. [19] followed the work of Yu et al. using the detection point at the previous stop instead of it being between the target stop and the previous stop. The authors claimed that this approach was better due to its fixed detection point, making it possible to predict the bus arrival time at the next continuous multi-stop at one time. The results reported that SVM showed a slim advantage over ANN, and all the proposed models performed well at both heavy and moderate mixed traffic segments.

Bai et al. [31] also following the work of [22] claimed that no dynamic model using multi-route information was developed and that assessment of whether it could improve the prediction accuracy was necessary. To this purpose, they used SVM/ANN as a baseline prediction and adjusted the prediction result by the Kalman filtering based dynamic algorithm using the most recent bus trips on multiple routes. The results indicated that the proposed dynamic models could improve the prediction accuracy.

2.2.3 Conclusion

The main goal of Advanced Public Transportation Systems is to attract passengers to public transportation and to reduce congestion on urban roads. One of the ways to achieve that goal is to provide reliable information about bus arrival to passengers. However, the reliability of the information provided to passengers depends on the prediction technique and the data used.

This section aimed to present a comprehensive view of methodologies to predict bus arrival times, including papers from 2002 to 2018.

It was concluded that several models had been developed over the years using different APTS technologies to tackle a problem affected by several factors. The inability to explicitly formulate the relationship between the input data and the prediction, as well as the increase in the volume of data generated by connected cities, led to the generalised use of machine learning models, namely ANN and SVM, on historical data. However, due to the dynamic behaviour of traffic conditions, there is an increasing trend to implement hybrid algorithms to improve the prediction accuracy, generally using Kalman filtering to adjust the prediction from machine learning models.

It was also concluded that using arrival times from other bus routes in common road segments improves the prediction.

Further detail is provided as a synoptic table describing the reviewed works and can be found in https://bit.ly/2ChjHxC.

2.3 Porto’s vehicular network setup

2.3.1 Vehicular network concepts

A Vehicular Ad-Hoc Network (VANET) follows the same concept of a mobile ad-hoc network but applying it in the domain of vehicles. In this paradigm, the nodes of the network can be in a moving or stationary state and are interconnected in an ad-hoc way having the ability to rearrange themselves, thus creating a dynamic wireless network through technologies such as Wi-Fi, WAVE or cellular [39].

VANETs are composed of two types of nodes as depicted in figure 2.4:

• On Board Unit (OBU) - units equipped with CPU, memory, and storage deployed in moving vehicles that can communicate with other OBUs and fixed infrastructures alongside the road.

• Road Side Unit (RSU) - static nodes strategically located close to the road to improve communication and provide access to the Internet.

In the vehicle domain, the mobile nodes of a VANET can be of several kinds, such as private cars, buses, taxis, and garbage collection trucks, while infrastructure nodes are often the property of the city hall or private companies [40].

Figure 2.4: Example of a VANET architecture [40]

These nodes can establish several communications:

• Vehicle to Vehicle (V2V) - a vehicle connects to the other nearby vehicles to exchange and spread data

• Vehicle to Infrastructure (V2I) - communication established between OBUs and RSUs, for instance, to access external networks

This infrastructure can also be used to improve mobility in the cities since it allows the transmission of information (for instance, GPS location) regarding the vehicles where the OBUs are deployed.

2.3.2 Porto’s VANET setup

The work carried out in this dissertation uses vehicular data from the second-largest city in Portugal: Porto.

A partnership between IT1, UA2, UP3, VENIAM4, Porto Digital5 and STCP6 allowed the deployment of the world's largest network of more than 600 connected vehicles, including taxis, garbage collection trucks and buses from a public bus fleet operating in the city, while providing free internet access to citizens located both inside and outside the vehicles [41]. An illustration of some of the services provided by Veniam in Porto is depicted in Figure 2.5.

Figure 2.5: Services provided by Veniam in Porto [42]

Porto embodies the smart city initiatives and the VANET enables the acquisition of massive amounts of data concerning "the urban flows of people, vehicles and goods and citizen patterns", which allows the development of a new set of services on top of the infrastructure, the identification of problems and consequently the creation of solutions to make operations more efficient [43].

1 Instituto de Telecomunicações https://www.it.pt/

2 Universidade de Aveiro https://http://www.ua.pt/

3 Universidade do Porto https://www.up.pt/

4 A vehicular networks specialised company https://veniam.com/

5 Association promoting ICT projects within the context of the city of Porto and its metropolitan area. https://portodigital.pt/

To illustrate the amount of data created in this network, here are some results as of May 2019 [41]:

• More than 780 000 unique Wi-Fi users

• More than 17 million internet sessions

• More than 43 million connected km

2.4 STCP Bus Network

The On-Board Units of Porto’s vehicular network are deployed on a public bus fleet operating in the city: STCP. To become familiar with the naming conventions and understand how the bus network is geographically distributed in the city, it was necessary to consult and extract information from the STCP website and create visualisations to gain some insights.

The STCP website, alongside Porto’s vehicular data, constitutes the other data source used in the development of this dissertation.

It contains information about infrastructure and operations, namely:

• Bus stops - each one has a code, name, address and belongs to a predefined zone.

• Bus lines/routes - each one has attributes such as code, name, direction. The combination of the code and direction attributes uniquely identifies a bus route.

• Stops per line - which bus stops belong to a given line.

• Lines passing at a stop - which lines operate in a given bus stop.

• Full path of lines - geospatial information about the bus route.

• Stops Coordinates - geospatial information about the location of bus stops.
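A minimal sketch of how this information could be held in memory is shown below. The structure, the function name and all stop codes are hypothetical and only illustrate the point that a route is identified by the pair (line code, direction), and that the "lines passing at a stop" relation can be derived from the "stops per line" relation.

```python
from typing import Dict, List, Tuple

# A route is identified by the pair (line code, direction).
RouteKey = Tuple[str, int]

# Hypothetical "stops per line" relation: ordered stop codes for each route.
stops_per_line: Dict[RouteKey, List[str]] = {
    ("200", 0): ["S1", "S2", "S3"],
    ("200", 1): ["S3", "S2", "S1"],
    ("903", 0): ["S3", "S4"],
}


def lines_at_stop(stop_code: str) -> List[RouteKey]:
    """Derive the 'lines passing at a stop' relation from 'stops per line'."""
    return sorted(key for key, stops in stops_per_line.items()
                  if stop_code in stops)
```

For example, `lines_at_stop("S4")` returns only the route `("903", 0)`, while a stop shared by both directions of a line is reported once per direction.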

STCP operates 142 routes covering Porto city but also several commuting routes to nearby municipalities.

Further analysis of the bus lines revealed a clear distinction between daytime and night routes. There is a set of routes that only operate during the night period, i.e., between 00h30 and 05h30, and whose paths are very similar to those of daytime lines, to fulfil the needs of bus passengers from different parts of the city. To identify night routes, STCP places the character M as the suffix of the line code and represents these routes in black. Daytime routes are grouped and colour-coded, and usually there is some relation that defines the grouping and the assigned colour. For instance, the seven daytime routes that connect Porto and Vila Nova de Gaia are identified with a code that starts with 90 and with the colour orange. Another daytime line that only operates in Vila Nova de Gaia is coded as ZF but also uses the colour orange.

The website also informs about service modifications, which is valuable information to debug potential problems in the developed work and to identify periods of higher unpredictability.

Below are some statistics concerning the bus carrier infrastructure:

• Number of bus routes: 142

• Number of stops: 2467

• Median length of a bus route: 37

• Minimum length of a bus route: 12

• Maximum length of a bus route: 63

The information also allowed the creation of visualisations concerning the shape and density of bus lines in the city, as well as the distribution of bus stops.

Figure 2.6 shows the bus line distribution of STCP in Porto, which covers the city in all directions. It is possible to notice that the line density is higher in Porto’s downtown than in the main peripheral areas: Matosinhos (NW), Maia (N), Ermesinde (NNE), Valongo (NE), Gondomar (E) and Gaia (S).

Figure 2.7 shows the stop distribution (on top of the bus line geometries) within Porto city. The red dots represent bus stops and, maintaining the same geographical references, the conclusions inferred from Figure 2.6 are more noticeable in this visualisation.

STCP offers special services during some events occurring in the city, and it has two depots (Francos and Via Norte) where bus drivers park the buses to switch shifts.

Figure 2.7: Distribution of STCP bus stops in Porto

2.5 Related work for bus passengers support

Although the purpose of this work is not to deploy a production-ready service, the underlying methods should support such future development. In this context, there are some applications which provide related functionality, aiming to support bus passengers in planning their trips. The detailed implementation of each application is generally not available, so this section focuses on the perceived service for the end-user.

2.5.1 Google Maps

Google Maps offers several functionalities to commuters concerning route planning, as it incorporates multiple modes of transportation. Regarding bus transit agencies, it provides information about bus stops, lines that pass through that stop, timetables and more recently live information about bus delays in places where Google does not have real-time information [44]. It is unclear whether Google has STCP real-time information.

A recent summarised post on the AI blog hints that a neural network is used with sequences of units (stop unit, road unit), where each unit has an independent prediction and the final output combines all unit values. It is also stated that the model does not learn to combine unit outputs, nor is the state passed through the sequence [45].

Information about car traffic data, the bus route and patterns related to the period of the day and day of the week is also combined. The model output predicts travel time, which can be used to adjust the times of arrival.

Sub-figure 2.8a presents a list of STCP bus lines in Porto that pass through the bus stop Trindade, and their estimated times of arrival for different trips. Sub-figure 2.8b presents a detailed view of estimated times of arrival at bus stops from a selected bus journey.

(a) Google Maps STCP bus stop view

(b) Google Maps STCP bus trip detailed view

2.5.2 Moovit

Moovit’s mobile application is available in more than 2700 cities across 90 countries and provides users with functionalities to plan routes using different modes of transportation, check transit agencies' routes and timetables, check nearby stops and stations according to their location and mark them as favourites. The application also allows the subscription of notifications regarding service alterations and planned routes.

The STCP information is available in the application and service alterations are reported as on the STCP website, but it is unclear how estimations are calculated. The described analytic functionalities hint at data-driven methodologies.

Moovit is community-based since it integrates data provided by a group of users concerning their city's public transportation system, such as bus stop locations and timetables, with official data provided by transit agencies. Millions of users around the world also contribute by generating data every day [46].

In 2017, Moovit launched a set of solutions in the area of Mobility as a Service to help public fleet agencies to improve mobility in their cities [46]. A public transit index is also available to consult insights concerning specific cities 7. Porto city metrics are aggregated with Braga and Vila Real, but some conclusions are only related to Porto, such as the most popular bus lines and bus stops. According to the index, the bus lines 204 Hosp. São João - Foz, 200 Bolhão - Castelo do Queijo and 903 Boavista - Laborim are the most popular, whereas the most popular bus stops are Boavista - Casa Da Música (BCM1), Praça Filipa De Lencastre (PRFL) and Trindade 8.

(a) STCP bus trip detailed view

(b) STCP bus stop view

(a) Moovit suggested itineraries from Porto-São Bento to STCP bus stop PAL2

(b) Moovit itinerary detailed view

7 Moovit insights index https://moovitapp.com/insights/en/Moovit_Insights_Public_Transit_

8 Moovit insight concerning Braga, Porto and Vila Real https://moovitapp.com/insights/en/Moovit_Insights_Public_Transit_Index-1904

2.5.3 SMSBUS

SMSBUS9 is a paid mobile phone messaging service provided by STCP to deliver estimations of times of arrival and waiting times. It has two use cases: one returns information on four buses of the same bus line if the user sends the bus stop code and line; the other returns information on four buses of different bus lines if the user only sends the bus stop code.

SMSBUS predictions are built using mathematical calculations, using information about the buses, which are equipped with a GPS tracking system that detects their location at intervals of 30 seconds. STCP claims that the estimations of times of arrival deviate on average about 2 minutes for 90% of the requests. However, it is stated that the predictions can be affected by abnormal situations regarding traffic congestion or road accidents.

2.5.4 Porto.Bus

Porto.Bus is a mobile application that provides information regarding the bus network in Porto city. Users are able to consult the bus line list (subfigure 2.11a), view the specific route of each line (subfigure 2.11b) and check the times of arrival of buses at a bus stop (subfigure 2.12a). To the best of our knowledge, the methods to estimate/find the times of arrival are not published. The application also allows the user to mark stops as favourites and view the recent search history.

9 SMSBUS service website https://www.stcp.pt/smsBusMicroSite/index.html

(a) List of STCP bus lines

(b) Bus route of line 200 Bolhão - Castelo do Queijo

(a) Times of arrival at bus stop PRFL

(b) Search page with bus stops marked as favourites and recent search history

CHAPTER 3

Previous work related to this dissertation

This chapter presents the previous work carried out in the same scope of this dissertation using the data from the VANET deployed in Porto [2].

3.1 Overview

With the deployment of the vehicular network in Porto raising new opportunities for creating a new set of applications, several works have been developed in many areas. One, in particular, addressed how, in the mobility area, the spatial and temporal data generated by the vehicular network can be used to improve city transportation services, which are a major concern of municipal institutions and transportation companies, especially where traffic congestion is an everyday problem.

The work started by analysing the data generated by the VANET and identifying key stakeholders, as well as a new set of tools from which they would benefit. This led to the identification of bus passengers and bus fleet managers as stakeholders, with the former benefiting from more accurate bus arrival times and the latter benefiting from reports about the usual behaviour of bus lines, which generate insights and support decision making. To develop those services starting with only spatial data, namely GPS positions and the corresponding capture times, an algorithm was first implemented to match the GPS traces of buses to specific bus lines operating in the STCP network, since there is no information on which bus line a bus is operating. The execution of the algorithm created a considerable amount of data concerning previously completed bus line journeys and arrival times at the identified bus stops. Generating this historical data allowed the computation of estimations of bus arrival times for a given bus line and bus stop, and the development of a machine learning model which is trained upon a user request for a prediction of the time of arrival at a specific bus stop [2].

To prove real-life applicability of the services, the author developed two proof-of-concept applications. One was a line performance dashboard intended for bus fleet managers while the other was a mobile application for bus passengers.

3.2 Architecture

The author implemented a system with the architecture depicted in figure 3.1. The illustration was obtained from the dissertation document, and changes have occurred since the time of implementation (2017) until now (2019).

Figure 3.1: Architecture components and interactions [2]

The architecture encompasses the following modules, from the bottom layer to the top:

• vanetV3 and the STCP website constitute the raw data sources. The vanetV3 was a MySQL database that stored the spatial logs from the nodes. Later it was changed to a Timescale database, a PostgreSQL extension to deal with time-series data. The STCP website contains information about the transportation company infrastructure.

• The Extraction Scripts collect and process the data available on the website, transforming it into a model that maps the infrastructure (bus lines, bus stops, lines operating a stop, stops belonging to a line).

• The Matching Unit implements an algorithm that matches the bus GPS traces with the lines of the bus carrier infrastructure. The result is a set of identified bus stops of a given line with the corresponding capture timestamp.

• The Matches Database stores the Matching Unit results.

• The Data Mart gathers performance metrics from bus delays.

• The Synchronisation Script updates the Data Mart. It synchronises the Matches Database with the Data Mart.

• The Bus Network Information Database stores the information about the bus network resulting from the execution of the extraction scripts.

• The Matches API returns information about the matched bus journeys stored in the Matches Database.

• The Estimation API queries the Data Mart for estimations of times of arrival. Both applications use it.

• The Prediction API exports a service for making bus arrival predictions on-demand using a machine learning model.

• The Bus Network Information API delivers information about the bus carrier infrastructure for both web and mobile applications.

• The Line Performance Dashboard displays the performance of bus lines and it is intended for the bus fleet manager. It supports decision making because it provides insights about the lines, identifying possible problems in the current bus infrastructure.

• The Bus Passenger Application is a mobile application for the bus commuters to consult estimated times of arrival for a given line and bus stop.

From the developed components depicted in the architecture, the implementation of the matching unit and the prediction module are detailed in the next sections.

3.3 Matching Unit

The matching unit implements the algorithm that processes the raw data from the position log database to find bus lines completed by the nodes, given the STCP infrastructure context. This algorithm is executed in a service that runs every night to match the GPS traces of all nodes from the previous day.

Since more than 600 nodes were producing (at the time) at least 1.2 million records in the database every day, a multiprocessing implementation is used to divide the computation into tasks performed by several workers. It follows a master-slave architecture where one entity, the master, controls the overall process and the others are slaves that each compute a subset of the problem, denoted as a job. Each job processes the data of one node on that day. The Dispatcher entity creates the jobs, and each Worker fetches one job at a time and executes the matching algorithm on that data. To fully exploit parallelism, the author recommended that the number of workers equal the number of Central Processing Units (CPUs) of the target machine. A complete and detailed description of the implementation is found in [2].
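The division of one day of records into per-node jobs handled by a pool of workers can be sketched as follows. This is not the implementation from [2]: the record layout, the function names `make_jobs`, `match_node` and `run_day`, and the placeholder worker body are assumptions, and Python's standard `multiprocessing.Pool` stands in for the Dispatcher and Worker entities.

```python
from collections import defaultdict
from multiprocessing import Pool, cpu_count
from typing import Dict, List, Optional, Tuple

# Assumed record layout: (node_id, timestamp, latitude, longitude).
Record = Tuple[str, float, float, float]
Job = Tuple[str, List[Record]]


def make_jobs(records: List[Record]) -> List[Job]:
    """Dispatcher step: group one day's records into one job per node."""
    by_node: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        by_node[rec[0]].append(rec)
    # Sort each node's records by timestamp so traces follow travel order.
    return [(node, sorted(recs, key=lambda r: r[1]))
            for node, recs in sorted(by_node.items())]


def match_node(job: Job) -> Tuple[str, int]:
    """Worker step (placeholder): the matching algorithm would run here."""
    node_id, recs = job
    return node_id, len(recs)


def run_day(records: List[Record], workers: Optional[int] = None):
    """Process all jobs with one worker per CPU unless stated otherwise."""
    jobs = make_jobs(records)
    with Pool(processes=workers or cpu_count()) as pool:
        return pool.map(match_node, jobs)
```

When used as a script, `run_day` should be called under an `if __name__ == "__main__":` guard, as the `multiprocessing` module requires on platforms that spawn new interpreter processes.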

Regarding the algorithm, the solution found is an intuitive one because it iterates over the GPS points to detect nearby bus stops, comparing them against the context of the STCP network to find possible lines that the node might be operating. The suitability of one of the candidate lines as a solution is determined by meeting specific criteria.

To achieve the previously described steps, it was necessary to answer the questions below:

• How to detect nearby stops

• What criterion defines a nearby stop

• How to determine candidate lines

• How to find a solution line within a set of possible ones by detecting consecutive bus stops

The author stated that detecting nearby stops requires doing spatial searches within a radius. Then, with the information about the bus carrier infrastructure and detected bus stops, it is possible to infer whether some detected bus stop is the first stop of any bus line and consider that line a candidate one. For each candidate line, the iteration over the next pairs of locations establishes whether that candidate line is a solution according to some criteria: completion rate, detected stops, detection order, line completion.

To perform proximity searches, the author used a spatial data structure, the R-tree, capable of indexing geographical coordinates since, in this context, the coordinates of bus stops are available. Although the R-tree is capable of finding the K-nearest neighbours of a given point, it does not allow radius queries, i.e., finding the K-nearest neighbours within a certain radius, so these two steps are executed separately [2]. For this, the distance between two GPS points is calculated using Vincenty's formulae [47].
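The two separate steps (nearest neighbours first, radius filter second) can be sketched without external dependencies. The sketch deliberately swaps in simpler techniques than those described above: a brute-force sort replaces the R-tree, and the haversine formula on a spherical Earth replaces Vincenty's ellipsoidal formulae, so the distances are only approximate. The function names, stop codes and coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_stops(position: Tuple[float, float],
                 stops: Dict[str, Tuple[float, float]],
                 radius_m: float, k: int = 5) -> List[str]:
    """Step 1: take the k nearest stops. Step 2: keep those within radius."""
    lat, lon = position
    ranked = sorted((haversine_m(lat, lon, s_lat, s_lon), code)
                    for code, (s_lat, s_lon) in stops.items())
    return [code for dist, code in ranked[:k] if dist <= radius_m]


# Hypothetical stops: B is roughly 56 m north of A, C roughly 1.1 km north.
demo_stops = {"A": (41.1500, -8.6100),
              "B": (41.1505, -8.6100),
              "C": (41.1600, -8.6100)}
found = nearby_stops((41.1500, -8.6100), demo_stops, radius_m=114.0)
```

With a 114 m radius, only the two closest hypothetical stops are returned; the third is ranked among the nearest neighbours but discarded by the radius filter.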

Concerning the choice of the detection radius, the author pointed out the importance of not using distance measures in a Cartesian plane, since the Global Positioning System considers the Earth as an ellipsoid. Choosing a detection radius, once the distance is calculated, is not an easy task because it depends on factors such as GPS jitter and the node speed between two consecutive location captures, which could lead to missed bus stops, and potentially missed candidate lines, if the radius is too small.

This would likely lead to the use of a larger radius to secure the identification of bus stops. However, this requires more processing since more bus stops are identified and, consequently, more candidate lines are considered, which is also not ideal.

To deal with this issue, the author used a radius equal to half of the maximum distance between two points when the bus is going at the maximum allowed speed in the city of Porto which is 50 km/h. However, due to possible speeding resulting from driver behaviour, the value considered was 55 km/h (15.2 m/s).

At the time of the development, the time between two node observations was 15 seconds, so with a speed of 15.2 meters per second, the maximum travelled distance is 228 meters, resulting in a radius equal to half that distance – 114 meters. Around January of 2018, the capture periodicity changed to 60 seconds, so the author adjusted the radius to 125 meters.
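The arithmetic behind the 114-metre radius can be reproduced with a short sketch. The function name is hypothetical, and the speed value is the 15.2 m/s used in the text (55 km/h is approximately 15.28 m/s).

```python
def detection_radius_m(speed_m_s: float, capture_period_s: float) -> float:
    """Half of the maximum distance travelled between two captures."""
    max_distance_m = speed_m_s * capture_period_s
    return max_distance_m / 2.0


speed = 15.2          # m/s, value used in the text for 55 km/h
capture_period = 15.0  # seconds between two node observations
radius = detection_radius_m(speed, capture_period)  # about 114 m
```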

Once the methodology for discovering nearby bus stops was addressed, the bus line matching algorithm was developed using two main steps computed sequentially: detect bus line starts and find bus line completions (solutions).

First, all records are iterated, and for each extracted position, the nearest bus stops within the predefined radius are found. If any of the detected bus stops is the first stop of a bus line, then that bus line is considered a candidate line from that record forward and stored in a list. This list is used in the final step, where each of the detected candidate lines is tested to assess whether it is a solution. This assessment involves iterating over the positions log while maintaining a context of all the detected bus stops of the candidate line and the corresponding capture timestamps.
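The first step, detecting bus line starts, can be sketched as below. The function name and data layout are assumptions for illustration: the nearby stops of each record are taken as already computed, and each line is represented only by the code of its first stop.

```python
from typing import Dict, List, Tuple

RouteKey = Tuple[str, int]  # (line code, direction)


def detect_candidate_lines(
    nearby_per_record: List[List[str]],
    first_stop_of_line: Dict[RouteKey, str],
) -> List[Tuple[int, RouteKey]]:
    """Return (record index, line) for every line whose first stop is nearby.

    The record index marks where the candidate line starts, so the later
    evaluation step can iterate over the positions from that record forward.
    """
    candidates = []
    for idx, nearby in enumerate(nearby_per_record):
        for line, first_stop in first_stop_of_line.items():
            if first_stop in nearby:
                candidates.append((idx, line))
    return candidates


# Hypothetical input: the second record is near stop "S", where two lines start.
nearby = [[], ["S"], ["X"]]
first_stops = {("200", 0): "S", ("201", 0): "S", ("300", 0): "Z"}
candidate_lines = detect_candidate_lines(nearby, first_stops)
```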

Figure 3.2: Illustration of the matching algorithm steps [2]

Figure 3.2 illustrates the execution of the previously described steps. First, the blue dots represent the GPS positions of the node (step 1). Secondly, as the iteration proceeds, nearby stops are detected, and the bus lines that start at any of the detected stops constitute candidate lines. In the illustration, two candidate lines were detected starting at stop S (step 2). The third panel shows the path of one of the candidate lines (outlined in red), with the bus stops belonging to that line drawn as numbered red markers.


Lastly, step 4 shows an ongoing evaluation of the depicted candidate line as a possible solution. For each position, the nearest stops are detected to find one that is part of the line. The illustration depicts the detection area in green around the position.

An implicit requirement in this detection is the correct order relative to the route of the line. Stops that are part of the line but do not follow the order established by the current context are not considered. The larger dashed blue dot near stop 5 is the position of the current record being iterated.
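The ordering requirement can be sketched as a lookup that only accepts stops located after the last matched stop in the line's sequence. The function name and the representation of the context as a single index are assumptions made for illustration; the sketch also assumes that stops may be skipped, which is consistent with a completeness below 100%.

```python
def next_matched_stop(line_stops, last_matched, detected):
    """Return the index in line_stops of the first detected stop that comes
    after last_matched, or None if no detected stop respects the route order.

    line_stops: ordered list of stop ids of the candidate line.
    last_matched: index of the last stop matched so far (-1 if none yet).
    detected: set of stop ids found near the current position.
    """
    for i in range(last_matched + 1, len(line_stops)):
        if line_stops[i] in detected:
            return i
    return None
```

For example, with stop 4 already matched, detecting stop 2 again is ignored, while detecting stop 5 advances the context.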

As mentioned in section 2.4, the bus carrier has two depots, which translates into movement unrelated to the operation of any line. More specifically, the following situations can increase the number of lines matched with matches that in reality occurred in out-of-service periods (false positives) [2]:

• When the bus is stopped at one of the depots for a long period (between shifts)

• When the bus crosses the city to start a shift

• When the bus crosses the city to return to one of the depots

• When the bus crosses the city to reach the first stop of a line (already in the shift)

The author considers a candidate line to be a solution if it meets the following two requirements:

• The last stop of the candidate line is detected

• The completeness, i.e. the number of detected bus stops divided by the total number of bus stops of that line, expressed as a percentage, is above a certain threshold.

The completeness measures the percentage of detected bus stops. Ideally, this value would be 100%. However, due to GPS jitter, the periodicity of captured observations (15 seconds), gaps between observations and service modifications, the completeness is expected to be lower.

To define a threshold for this metric, the author opted to compute the median number of bus stops per line. At the time, this value was 36, and a completeness threshold of 80% was set, which is equivalent to missing at most 7 of the 36 bus stops.
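The relation between the 80% threshold and the 7 missed stops can be checked with a small sketch. The function names are illustrative, integer arithmetic is used to avoid floating-point rounding, and treating the threshold as inclusive is an assumption, since the text only says "above".

```python
def min_detected_stops(total_stops, threshold_percent=80):
    """Smallest number of detected stops that reaches the completeness threshold."""
    # Integer ceiling of threshold_percent * total_stops / 100.
    return (threshold_percent * total_stops + 99) // 100


def meets_completeness(detected, total_stops, threshold_percent=80):
    """True if detected / total_stops reaches the threshold (inclusive: an assumption)."""
    return detected * 100 >= threshold_percent * total_stops


needed = min_detected_stops(36)      # 80% of 36 is 28.8, so 29 stops are needed
print(needed, 36 - needed)           # stops needed, stops that may be missed
```

With the median line of 36 stops, 29 detected stops satisfy the threshold and 28 do not, which matches the "7 of 36" figure.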

After the criteria to detect a solution were established, it is necessary to address another question: how to discard a candidate line. The natural answer is to do so when a solution is found: the algorithm then advances to the records that follow the one holding the last stop of the solution, discarding all previous candidate lines. However, as the author argued, if the candidate line under analysis is not a solution, the simulation uses more resources, since the line is only discarded after iterating over all the remaining records. To address this issue, the algorithm defines a timeout for finding the next stop of the candidate line. This timeout is a counter that is updated as records are iterated, not by comparing the records' capture timestamps. Every time the algorithm processes a record, the timeout decreases by one unit if no bus stop is detected and is reset otherwise. The timeout chosen was 15 minutes, which, with a capture periodicity of 15 seconds, is equivalent to processing 15 × 4 = 60 records.
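The evaluation of a single candidate line, including the record-based timeout, can be sketched as below. This is an illustration under several assumptions rather than the author's code: all names are hypothetical, "no bus stop is detected" is read as "no in-order stop of the candidate line is detected", the completeness comparison is inclusive, and a candidate that reaches its last stop with insufficient completeness is discarded.

```python
def timeout_in_records(timeout_minutes, capture_period_s):
    """Convert a timeout in minutes into a number of processed records."""
    return timeout_minutes * 60 // capture_period_s


def evaluate_candidate(records, start_index, line_stops, nearby_stops,
                       timeout_records=60, threshold_percent=80):
    """Return the index of the record that completes the line, or None if discarded."""
    last_matched = -1             # index in line_stops of the last matched stop
    matched = 0                   # number of stops of the line detected so far
    remaining = timeout_records   # counter decremented per record, not per second
    for index in range(start_index, len(records)):
        detected = nearby_stops(records[index])
        hit = next((i for i in range(last_matched + 1, len(line_stops))
                    if line_stops[i] in detected), None)
        if hit is None:
            remaining -= 1
            if remaining <= 0:
                return None       # timeout expired: discard the candidate
            continue
        remaining = timeout_records   # a stop was found: reset the timeout
        last_matched = hit
        matched += 1
        if hit == len(line_stops) - 1:
            # Last stop detected: accept only if completeness is sufficient.
            complete = matched * 100 >= threshold_percent * len(line_stops)
            return index if complete else None
    return None
```

With the values in the text, `timeout_in_records(15, 15)` gives the 60 records used as the default for `timeout_records`.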


Figure 3.3: Finding solutions algorithm flowchart [2]

3.4 Prediction module

The author applied machine learning to predict bus arrival times, expressed as a float representing a UNIX epoch timestamp, and evaluated three algorithms: Random Forest,
