
Recommended System for Optimizing Battery Energy Management with Floating Car Data


Academic year: 2021


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Recommended System for Optimizing Battery Energy Management with Floating Car Data

Leonel Rocha Araújo

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: João Pedro Carvalho Leal Mendes Moreira, PhD

Supervisor: Luís Alexandre Moreira Matias, PhD


Abstract

Nowadays, heavy-duty vehicles that transport temperature-sensitive goods typically use a diesel engine entirely dedicated to powering a refrigeration unit. This diesel engine has several disadvantages for transportation companies. For instance, refrigeration units consume a lot of fuel, which in turn leads to further CO2 emissions. Additionally, these engines have high noise emissions and require a high level of maintenance. In order to avoid this fuel dependence, an energy management system (EMS) capable of producing energy during the operation of the vehicle is being developed. This recovery is possible due to the regenerative braking (RB) functionality, which converts kinetic energy into electrical energy during deceleration and braking. The recovered energy is then stored in a set of batteries that supplies the refrigeration system when needed, allowing it to run in electrical mode. An opportunity to use the regenerative braking functionality intelligently emerges as analyzable information is collected from the vehicle’s operation and the EMS. By introducing an intelligence layer into the energy management system, the decision to apply the RB functionality could be based on the trip’s energetic potential. This decision will optimize battery usage and reduce the load and wear on the EMS components.

In order to calculate the energetic potential of a certain route, an estimate of the road ahead is needed. This document presents context information and different approaches towards this end. In the modeling approach recommended and implemented, a route is divided into several spatial segments and each segment is categorized into one of three predefined classes. A classification model is used to predict traffic using historical vehicle data as input. By using this modeling approach based on travel times, information on traffic flow and intersection queues is incorporated, and by predicting the most likely sequence of states, an estimate of the road ahead is made.

Using the information of the modeled path, when the RB system detects a situation where the RB functionality can be applied, a decision is made by weighing the energetic potential of the path ahead against the energy need. When the algorithm sees fit, higher power may be applied to the generator, which results in a larger quantity of energy recovered. Since this may cause stress to the EMS, this functionality needs a robust intelligence layer.

In the end, using simulations with values and braking moments from real trips and the classification model produced, the insertion of controlled power peaks was able to increase energy production by approximately 7.35% without imposing any stress on the EMS.


Acknowledgements

The research and writing process of a dissertation is long and arduous and it certainly is not done singlehandedly.

I wish to express my sincere gratitude to AddVolt and all its collaborators for the domain expertise that greatly assisted in shaping the research, especially to Bruno Azevedo and José Pedro Araújo, who closely followed the entire process with important input and suggestions.

I would certainly be remiss not to mention and sincerely thank my supervisors, professors João Mendes Moreira, PhD and Luís Alexandre Matias, PhD. Without their technical expertise, advice and encouragement, this research could not possibly have achieved these results. Their feedback and insight were essential throughout the entirety of the research and greatly improved this document.

I take this opportunity to express gratitude to my family, who have been unceasingly and faithfully supporting and encouraging me throughout all these years, with special regards to my parents and my sisters.

Finally, I would like to thank the department faculty members and my closest friends for the help and support that shaped my academic journey.


“The secret of getting ahead is getting started.”


Contents

1 Introduction
1.1 Context
1.2 Motivation
1.3 Goals
1.4 Contents

2 A Review on Time Series Data Mining
2.1 Data Mining
2.2 Time Series Data Mining
2.3 Tasks in Time Series Data Mining
2.3.1 Unsupervised Learning
2.3.2 Supervised Learning
2.4 Applications
2.5 Estimating Traffic Characteristics with Time Series
2.5.1 Optimal Speed Profiles
2.5.2 Travel Time State Estimation

3 Solution Proposal
3.1 Selecting the Test Route
3.2 Modeling Traffic Characteristics
3.3 Optimizing the Energy Recovery Process
3.4 Dataset Analysis
3.4.1 Data Sources
3.4.2 Features

4 On the Selection of the Most Common Route
4.1 Detecting Trips in Data
4.1.1 Pre-processing
4.1.2 Adopted Heuristic
4.2 Map Discretization
4.2.1 Spatial Clustering
4.2.2 Trip Discretization
4.2.3 Trajectory Augmenting
4.3 Sequential Pattern Mining
4.3.1 PrefixSpan
4.3.2 Application

5 Towards Constructing a Traffic Prediction Model
5.1 Labeling Tuples
5.1.1 Data Grouping
5.1.2 Traffic States and Classification Rules
5.1.3 Selecting Relevant Features using Domain Knowledge
5.2 Classifier Selection
5.2.1 Naïve Bayes
5.2.2 Logistic Model Tree
5.2.3 k-Nearest-Neighbor
5.2.4 Random Forest
5.2.5 C5.0 Decision Tree with AdaBoost
5.3 Summary and Results

6 Integrating Traffic Predictions into the Energy Recovery Process
6.1 Calculating the Energy Recovered during a Trip
6.1.1 Energy Calculation
6.1.2 Generator Motor Functionalities
6.2 State-Based Optimization of the Energy Recovery Process
6.2.1 Energy Recovery Approaches
6.2.2 Rules of Application of Each Approach
6.3 Simulations Setup and Results

7 Conclusions
7.1 Main Contribution
7.2 Future Work
7.2.1 Route Segments Division
7.2.2 Cost-Sensitive Learning
7.2.3 Real Time Contribution to Traffic State Estimation


List of Figures

2.1 Different number of clusters resulting in different outputs.
2.2 Example of linear regression modeling.
2.3 Combining segments to find the optimal speed profile.
2.4 Comparing and clustering two consecutive links.
3.1 Development steps covered by the proposed solution.
4.1 Phases and flux of the route selection process.
4.2 Trips detected from the dataset used.
4.3 Discretized trips with different grid sizes.
4.4 Cascade sequence augmenting examples.
4.5 Prefix span results with different grid sizes.
4.6 Most common route detected.
5.1 Trips detected on the most common route detected on a grid.
5.2 Data flux in experimental setup.
5.3 AUC of several k-Nearest Neighbor classifiers with different k.
5.4 AUC of the classifier obtained with C5.0.
5.5 AUC comparison of all the classifiers generated.
5.6 Prediction example at two different hours of the day on a Monday.
6.1 Propagation of the wheel speed onto the EMS.
6.2 Simulation process layout and flux.
6.3 Comparison of energy produced in simulations.


List of Tables

3.1 Columns of the table in the database with the information retrieved.
5.1 Class distribution of initial, rule based classifier.
5.2 Confusion Matrix of Naïve Bayes Classifier.
5.3 Confusion Matrix of Logistic Model Tree Classifier.
5.4 Confusion Matrix of k-Nearest-Neighbor Classifier with k = 30.
5.5 Confusion Matrix of Random Forest Classifier with 1000 trees generated.
5.6 Confusion Matrix of the classifier obtained with C5.0.


Abbreviations

AdaBoost: Adaptive Boosting
AUC: Area Under Curve
CHMM: Coupled Hidden Markov Model
EMS: Energy Management System
EV: Electric Vehicle
GPS: Global Positioning System
PrefixSpan: Prefix-Projected Sequential Pattern Mining
RB: Regenerative Braking
ROC: Receiver Operating Characteristic
TSDM: Time Series Data Mining
TTSE: Travel Time State Estimation
V2G: Vehicle to Grid


Chapter 1

Introduction

This chapter places the problem in context, taking into account the business need. Approaches to verticalizing the output are mentioned and explained. Finally, the main goals of this research are briefly described.

1.1 Context

Nowadays, temperature-sensitive goods are mainly transported by heavy-duty vehicles equipped with refrigeration units. Apart from the usual traction engine, these vehicles need an additional diesel engine entirely dedicated to supplying the refrigeration unit. This additional engine has several disadvantages for transportation companies. For instance, these units consume a lot of fuel, which in turn leads to further CO2 emissions. Additionally, these diesel engines have high noise emissions (approximately 90 dB) and require a high level of maintenance.

To tackle these disadvantages, an energy management system (EMS) capable of recovering electrical energy during the vehicle’s operation is being developed. This recovery is possible due to a regenerative braking functionality, which converts kinetic energy into electrical energy during a slowdown or brake pedal actuation. This energy would otherwise be wasted and dissipated as heat. The recovered energy is then stored in a set of batteries that supplies the refrigeration system when needed, allowing it to run in electrical mode. By running in this mode, the refrigeration unit no longer depends on the dedicated diesel engine, which removes its noise emissions and fuel consumption. This not only lowers financial costs but also makes the operation more environmentally friendly. Besides removing all these disadvantages and providing a greener solution, the EMS ends up transforming a vehicle into an energy production unit. If production exceeds demand, the unused energy can be fed to the grid or to another vehicle.

Additionally, the EMS has a module that is able to retrieve data on the vehicle operation and the energy management module. This data is recorded to an SD card on board the vehicle and sent to a remote database. Global Positioning System (GPS) measurements are made every thirty seconds and information on the vehicle operation is recorded every third second.

With an ever-growing quantity of data retrieved, an opportunity emerges to use the Regenerative Braking (RB) functionality of the EMS intelligently and to manage these energy production units. Analysis of the data retrieved can supply information not only on vehicle operation, but also on road and traffic development.

1.2 Motivation

During the vehicle’s operation, the available recoverable energy is conditioned by several operation variables. In order to better understand the EMS’s operation, analysis of the data is required. The information retrieved from this analysis can help identify refinements to the EMS. For instance, if a lower-power and lower-capacity battery fits the needs of a vehicle and its usual routes, the EMS can be spared a heavier power load. A heavy power load on the EMS could shorten the lifespan of some components and represent a further loss in the case of a road accident. Furthermore, this information extraction could be a differentiating factor for the EMS, since it provides more objective results, supporting evidence of its advantages and a more individual treatment when engaging with a customer.

Currently, the EMS applies its RB functionality every time the vehicle slows down. Since it is an electronic system, this continuous usage will eventually shorten the lifespan of its components. Beyond the increase in lifespan that a lower power load on the system can grant, rational usage could extend this lifespan further. As such, the implementation of an intelligence layer could result in a more valuable and more durable product.

A rational usage would consist of applying the RB functionality only in specific settings. Extended braking is one worthwhile situation: given the nature of the RB technology, a vehicle is able to recover more energy in extended slowdowns than in threshold braking. Another example is a braking moment when the vehicle is traveling at high speed: since more kinetic energy is available from the vehicle’s movement, more energy can be recovered. A less worthwhile situation is any braking moment during a traffic jam, where vehicles are constantly starting and stopping and brake very little at very low speed.
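The effect of speed on recoverable energy can be illustrated with the kinetic energy formula E = 1/2 m v^2. The sketch below is only an upper bound under stated assumptions: the 20-tonne mass and the speeds are hypothetical values chosen for illustration, and drag, rolling resistance and conversion losses are ignored.

```python
def kinetic_energy_joules(mass_kg: float, speed_kmh: float) -> float:
    """Kinetic energy E = 1/2 * m * v^2, with speed converted from km/h to m/s."""
    speed_ms = speed_kmh / 3.6
    return 0.5 * mass_kg * speed_ms ** 2


def braking_energy_kwh(mass_kg: float, start_kmh: float, end_kmh: float) -> float:
    """Upper bound on the energy released when slowing from start_kmh to end_kmh.

    Losses are ignored, so the energy a real regenerative braking
    system could store would be lower than this value.
    """
    delta_j = (kinetic_energy_joules(mass_kg, start_kmh)
               - kinetic_energy_joules(mass_kg, end_kmh))
    return delta_j / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical 20-tonne vehicle: the same 20 km/h drop at two different speeds.
high_speed = braking_energy_kwh(20_000, 90, 70)
low_speed = braking_energy_kwh(20_000, 30, 10)
print(f"90 -> 70 km/h: {high_speed:.3f} kWh, 30 -> 10 km/h: {low_speed:.3f} kWh")
```

Because the energy grows with the square of the speed, the same speed drop releases several times more energy at highway speed than in slow traffic, which is consistent with the ranking of braking situations described above.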

This rational usage can be achieved by processing, whenever a braking moment is detected, a weighted decision on the application of the regenerative braking. This decision should take into account how the road develops (number and harshness of curves and intersections ahead), the time of day (since traffic behavior varies with time, especially during peak hours) and the vehicle’s end of operation (whether the current energy level is enough to keep the transported goods refrigerated over the planned route).

Furthermore, when dealing with transportation companies, where routing is already highly efficient, information on how much energy a battery will have at a certain point of a trip is highly valued. This is particularly interesting and relevant if an opportunity arises to channel a vehicle’s excess energy to other applications. Among these applications, it is worth mentioning the EMS’s ability to feed other vehicles’ refrigeration units and warehouses, and to sell energy to the grid. Moreover, this information can be of great value in the growing electric mobility market. On one hand, the diffusion of electric vehicles (EV) is predicted to dramatically increase electric energy demand in the future, exceeding the current distribution capacity [Gallo et al., 2015]. Information on the energy demand and energy potential of vehicles with batteries can provide support in dealing with this high demand, enhancing Vehicle to Grid (V2G) and charging operations. Succinctly, the basic concept of V2G is the possibility for an EV to provide energy to the grid while parked. Automobile producers have shown considerable interest as V2G policies and drivers emerge. The economic potential of these transactions has been attracting more research and convincing major automobile producers to embrace EVs and V2G infrastructures and their integration in the concept of Smart Cities [Points, 2015, Lambert, 2015, Lambert, 2016].

1.3 Goals

This dissertation project addresses the knowledge retrieval process that can eventually be associated with a modular intelligence layer to be implemented in the EMS. A framework that encompasses all the steps needed is developed and presented. These steps include route detection, traffic modeling and the application of the information extracted using a set of rules.

Strictly speaking, this project focuses on using the historical information of the vehicle’s operation to maximize the recovered energy. The framework developed follows a route-per-route approach, which means that the evaluation is made for a single route at a time, but is extensible to any set of routes.

To use braking moments intelligently for energy production, properly understanding how this type of vehicle behaves on the road is a major concern. This behavior can be identified by analyzing variables such as braking rate, acceleration rate, speed, location and other environment variables, such as time of day.

This analysis leads to a set of descriptive variables of the route. During a trip, given a current state and the current information available on the mentioned parameters, a prediction is made of how the road develops and what the vehicle should expect. This prediction is based on historical trips over the same route. From this prediction, we can extract information on the short-term energetic potential of the route.

The information on the path’s traffic conditions, allied to knowledge of the EMS’s functionalities, can lead to the identification of the best braking points, which in turn leads to a proper decision on the application of the RB functionality. This rational usage may include not using the system at all (useful in situations where the predicted demand is already lower than the available energy) or forcing a heavier power load (which results in higher energy production in exchange for system heating). Deciding whether the system can apply this heavier load is the framework’s responsibility and is a problem that needs to take into account generator motor specifications.

These conditions can be fed to the EMS on board the vehicle to assist in decision making; however, this implementation is out of scope for this dissertation. Nevertheless, simulations with data from real trips are run in order to understand the gain that this intelligence layer is able to provide.


1.4 Contents

This document presents the entirety of the development process carried out towards the creation of a framework able to introduce an intelligence layer to the EMS.

Chapter 2 starts by introducing the concepts of data mining and bridging them to time-series related problems. Several examples of the application of time-series data mining techniques are given in different study domains. More importance is given to transportation related problems.

Chapter 3 provides a brief description of the steps taken and the general concept of the whole framework. Additionally, the dataset used is presented.

In chapter 4, a selection of the most common route is made. The steps of this selection, which include spatial clustering and sequence mining methods, are described in great detail. The route selected is the one used in modeling and testing.

Chapter 5 presents the steps taken towards modeling the traffic on the route found by the end of chapter 4. This modeling process is entirely based on historical data. The domain problem is transformed into a classification problem and, under the same training setup, several different classification methods are used to model the traffic on the selected route. The best-fitting model is then selected.

In chapter 6, the main focus is how the information retrieved in chapter 5 is used to feed the intelligence layer and grant better energy gains to the EMS. This new knowledge is also put into practice in simulations.

Finally, in chapter 7 some final remarks are made, focusing mainly on a review of the advantages of the developed framework and the energy production gains. In addition to this, future work and possible improvements are suggested.


Chapter 2

A Review on Time Series Data Mining

This chapter describes the bibliographic review that was carried out. Initially, the concept of data mining is introduced. Then, the type of data that will be worked on is examined, including its connection to the concept of data mining and the advantages and disadvantages that its characteristics entail. Following this comparison, some common tasks that are performed over this type of data are introduced. Finally, two approaches to similar problems are described and, for each of them, the different input assumptions are examined in more depth.

2.1 Data Mining

Data collection is very common across a wide variety of fields. With the advancements in retrieval and storage technologies, the volume of data in databases keeps growing. The traditional method of retrieving knowledge from data relies on manual analysis and human interpretation. Now that the size of databases has grown from gigabytes to terabytes, such analysis is usually impractical and exceeds human ability without the proper tools [Fayyad et al., 1996].

Data mining consists of a set of tools that assist in the analysis of these rapidly growing volumes of data and has been attracting a significant amount of research and attention. Succinctly, data mining can be described as the nontrivial process of identifying novel, useful and understandable patterns in data [Han et al., 2011].

2.2 Time Series Data Mining

In almost every scientific field, measurements are performed over time. As a result, the data carries more information on the nature and behavior of the observed object. These observations lead to a collection of organized data called a time series. Formally, a time series is a collection of values obtained from sequential measurements over a specific period of time; simply put, a collection of observations made chronologically. Similarly to traditional data mining, the purpose of time series data mining (TSDM) is to extract all meaningful knowledge from the shape of data.

Due to their highly correlated data points, time series are particularly useful in regression, trend, seasonality and forecasting analyses [Shumway and Stoffer, 2013]. This high correlation among time series data points is a consequence of their sequential order. When data is received as a sequence, implicit information on feature variations is provided in addition to the explicit values.
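One common way to quantify the correlation between consecutive data points is the sample autocorrelation at a given lag. The following minimal sketch is not taken from the cited work; the function name and the speed values are hypothetical and serve only as an illustration.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag (between -1 and 1)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Hypothetical speed readings (km/h): neighboring values are similar,
# so the lag-1 autocorrelation is clearly positive.
speeds = [50, 52, 55, 57, 60, 62, 61, 58, 54, 51]
print(round(autocorrelation(speeds, lag=1), 3))
```

A value close to 1 indicates that each observation is a good predictor of the next one, which is the implicit information that sequential ordering provides.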

The nature of time series data also includes more intrinsic characteristics, such as large data size, high dimensionality and continuous update. This leads to a large amount of numerical and continuous information. Furthermore, time series data should always be considered as a whole rather than as individual numerical fields. Therefore, as opposed to an exact match based model, similarity search is typically carried out in an approximate manner [Fu, 2011]. Consequently, the similarity measure is of fundamental importance for various time series related tasks. Nevertheless, devising an appropriate similarity function is no trivial task and is seen as one of the major challenges when dealing with time series data. As the dimensionality increases, this process grows in complexity and difficulty [Esling and Agon, 2012].

In addition to this, the high dimensionality also brings about another challenge: data representation. Finding the best way to represent the fundamental shape characteristics of a time series is not an easy task. Exploratory analysis and the display of results are harder with time series data because it is difficult to correlate all the data points of a time series in a way that makes the shape or embedded knowledge of the data perceptible. Usually, to achieve this, an approach based on reducing dimensionality while trying to retain the essential characteristics is taken [Esling and Agon, 2012].
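As an illustration of dimensionality reduction that keeps the overall shape, the sketch below uses Piecewise Aggregate Approximation (PAA), which replaces each of a fixed number of segments with its mean. PAA is chosen here only as a simple, well-known example; it is not stated to be the representation used in this work, and the function name is hypothetical.

```python
def piecewise_aggregate(series, n_segments):
    """Reduce a series to n_segments values by averaging consecutive chunks."""
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for k in range(n_segments):
        # Integer boundaries spread the points as evenly as possible.
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        chunk = series[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


# Eight points are summarized by four segment means.
print(piecewise_aggregate([1, 1, 3, 3, 5, 5, 7, 7], 4))
```

The reduced series is shorter and cheaper to compare, at the cost of losing detail inside each segment.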

Lastly, it is worth noting that the main sources of this type of data are experiencing a large deployment growth and, in some cases, towards a more personal use. Sensors, telemetry devices and website monitoring are examples of data sources able to collect large amounts of data, in the order of gigabytes per day or even per minute [Han et al., 2011].

2.3 Tasks in Time Series Data Mining

In different problems, different data mining tasks are needed and sometimes a combination of successive operations is the best way to extract relevant and useful information from data. This approach is extended to time series problems. However, given the nature of time series data, there are some differences and additional precautions to take into account. Below, a distinction between supervised and unsupervised learning techniques is made and relevant tasks are described.


2.3.1 Unsupervised Learning

The main objective of these tasks is to explore or describe a set of data, which is why these learning tasks are often labeled as descriptive. In the algorithms used in this type of task, the output attribute is ignored, hence the lack of supervision in the learning process. The most common unsupervised learning methods are clustering, association rules and summarization [Gama et al., 2012]. For the current project, the most relevant task is clustering.

2.3.1.1 Clustering

Clustering is the process of finding natural groups in a dataset. In practice, it aims to find the most homogeneous clusters that are as distinct as possible from other clusters: the grouping should maximize inter-cluster variance while minimizing intra-cluster variance. The main objective of this task is to identify groups intrinsically present in the data. A point to consider is that, even outside the scope of TSDM, the main difficulty in clustering tasks lies in defining the right number of clusters. This decision may determine the interpretability and usefulness of the results. An example of two possible outputs with the same data using a different number of clusters is presented in figure 2.1. This example shows that clustering is a problem that highly depends on the way parameters are initialized and on the level of detail targeted [Esling and Agon, 2012].

Figure 2.1: Different outputs of the same clustering algorithm when given a different input number of clusters. In (a) the dataset was divided into 3 clusters, whereas in (b) the dataset was divided into 8 clusters [Esling and Agon, 2012].
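The trade-off between intra-cluster and inter-cluster variance can be made concrete by computing the within-cluster and between-cluster sums of squares for a given labeling. The sketch below works on one-dimensional points for brevity; the function name and the sample values are hypothetical.

```python
def cluster_scatter(points, labels):
    """Return (within, between) sums of squared distances for 1-D points."""
    overall_mean = sum(points) / len(points)
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    within = 0.0
    between = 0.0
    for members in groups.values():
        centroid = sum(members) / len(members)
        # Spread of the members around their own centroid.
        within += sum((p - centroid) ** 2 for p in members)
        # Distance of the centroid from the overall mean, weighted by size.
        between += len(members) * (centroid - overall_mean) ** 2
    return within, between


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = cluster_scatter(points, [0, 0, 0, 1, 1, 1])  # natural grouping
poor = cluster_scatter(points, [0, 1, 0, 1, 0, 1])  # arbitrary grouping
print(good, poor)
```

For a fixed dataset the two quantities add up to the same total, so a grouping that lowers the within-cluster scatter necessarily raises the between-cluster scatter. Comparing these values for different numbers of clusters is one simple way to judge a candidate partition.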

The time series clustering task can be divided into two types:

• Whole Series Clustering: the clustering method is applied to each complete time series in a set. This way, entire time series can be grouped into clusters.

• Subsequence Clustering: clusters are created from subsequences of a single or multiple time series, slicing the data to maximize intra-cluster similarity.
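For subsequence clustering, the subsequences are often obtained with a sliding window over the series before any clustering method is applied. The sketch below shows only this extraction step; the function name, window width and step are illustrative assumptions.

```python
def sliding_windows(series, width, step=1):
    """Return all subsequences of the given width, moving forward by `step`."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [
        series[start:start + width]
        for start in range(0, len(series) - width + 1, step)
    ]


# Each window can then be treated as one object to be clustered.
print(sliding_windows([1, 2, 3, 4, 5], width=3))
```

In whole series clustering this step is skipped and each complete series is treated as a single object.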

[Han et al., 2011] proposes a classification of traditional clustering methods into five categories: partitioning, hierarchical, density based, grid based, and model based. When restricting research to time series data, three of these five categories (partitioning, hierarchical, and model based) have been utilized. A brief description of these methods is presented below.


Partitioning clustering consists in constructing k partitions of a set of n unlabeled data tuples. Each partition represents a cluster containing at least one object and k ≤ n. Regarding hierarchical clustering, there are generally two types of methods: agglomerative and divisive. In agglomerative methods, each object is placed in its own cluster and then clusters are merged into larger clusters, until certain termination conditions, such as desired number of clusters, are reached. In divisive methods, the procedure is the exact opposite: from a single huge cluster, objects are removed to form other clusters until certain termination conditions are met. As for model-based methods, the idea is to assume a model for each of the clusters and attempt to fit the data in the assumed model [Liao, 2005].
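The agglomerative procedure described above can be sketched as follows for one-dimensional values. Single linkage (the distance between the two closest members) is an assumption made here for simplicity, since the text does not fix a linkage criterion, and the desired number of clusters is used as the termination condition.

```python
def agglomerative(values, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    if not 0 < n_clusters <= len(values):
        raise ValueError("n_clusters must be between 1 and len(values)")
    # Start with every object in its own cluster.
    clusters = [[v] for v in values]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3))
```

A divisive method would run in the opposite direction, starting from one cluster that contains every value and splitting it until the termination condition is met.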

However, [Liao, 2005] also states that, all in all, these methods either modify the existing algorithms or convert time series data so that existing algorithms are applicable. In the former, where algorithms are altered, the major modification lies in replacing the distance measurement with a more time series friendly one. Since this approach keeps the original time series data, it is often referred to as the data-based approach. In the latter, where data is altered, time series data is converted either into a feature vector of lower dimension (feature-based approach) or into a number of model parameters (model-based approach). After one of these conversions, conventional clustering algorithms can be directly used.
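As a minimal illustration of converting a series into a lower-dimensional feature vector, the sketch below summarizes a series by its mean, standard deviation and linear trend. The choice of these three features is an assumption made for the example and is not prescribed by [Liao, 2005].

```python
import math


def summary_features(series):
    """Describe a series by [mean, population standard deviation, trend slope]."""
    n = len(series)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    # Least-squares slope of the values against their index 0..n-1.
    t_mean = (n - 1) / 2
    slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(series)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return [mean, std, slope]


print(summary_features([0, 1, 2, 3, 4]))
```

Series of different lengths are mapped to vectors of the same size, so a conventional clustering algorithm with Euclidean distance can be applied to the resulting vectors.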

2.3.2 Supervised Learning

In prediction tasks, the main aim is to find a function, model or hypothesis from training data that can be used to predict a label or a value of new examples. The methods used in this type of tasks follow a supervised learning paradigm. This term originates from the simulation of an external watcher that knows the output of each example. With this knowledge, this external watcher can evaluate the predictions made. The most common methods in supervised learning are classification and regression [Gama et al., 2012].

2.3.2.1 Classification

Classification consists of assigning a symbolic label to each series of a set. The classes are known in advance (as opposed to a clustering task) and the algorithm is trained on an example dataset. The main objective is to learn which features distinguish the classes from each other and to automatically determine the class of each new series.

With this in mind, three main steps of a classification task are identified. Firstly, with a labeled training set as input, the algorithm will try to learn what are the characteristic features of the classes the dataset encompasses. Then, an unlabeled dataset is entered into the system. The elaborated algorithm will try to automatically determine for each new data point, what class it belongs to. Finally, with the new data points added, the system can optionally adapt the class boundaries for better future predictions [Esling and Agon, 2012].


A common problem in training a classifier is overtraining or overfitting. This happens when the training data leads to an overly specific model that performs poorly on new data entering the system, which leads to misleading predictions [Esling and Agon, 2012].

Up until this point, this definition of classification also applies to traditional data mining. However, given the nature of time series datasets, special treatment must be considered. As with the differences between time series clustering and traditional methods, the major challenge lies in the high dimensionality of data. Nevertheless, several research works were found that integrate known approaches into time series classification problems, for example: [Geurts, 2001] introduces a two-step classification with pattern and local property discovery followed by a combination of this information to build classification rules; [Nanopoulos et al., 2001] performs time series classification with a multi-layer perceptron neural network; [Povinelli et al., 2004] evaluates a Bayesian maximum likelihood classifier on three different real-world datasets; [Xi et al., 2006] and [Srisai and Ratanamahatana, 2009] explore one-nearest-neighbor with dynamic time warping distance to generate extremely fast and accurate classifiers. The main aims of these works were improving computing speed and reducing dimensionality. Also along the lines of dimensionality reduction, [Zhan et al., 2007] proposes the extraction of patterns from time series to be used as input to classical machine learning algorithms.
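To make the one-nearest-neighbor approach with dynamic time warping (DTW) more concrete, the sketch below implements the textbook quadratic-time DTW recurrence and a one-nearest-neighbor rule on top of it. It does not include the speed-ups proposed in the cited works, and the training series and labels are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[len(a)][len(b)]


def classify_1nn(train, query):
    """Return the label of the training series closest to `query` under DTW."""
    return min(train, key=lambda item: dtw_distance(item[0], query))[1]


# Hypothetical labeled series; the query is a stretched version of the first.
train = [([0, 0, 5, 5, 0], "peak"), ([0, 1, 2, 3, 4], "ramp")]
print(classify_1nn(train, [0, 0, 0, 5, 5, 0]))
```

Because DTW allows points to be stretched or compressed along the time axis, series with the same shape but different lengths or timing can still be matched closely.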

2.3.2.2 Regression

As opposed to classification, which deals with categorical labels, regression is a predictive task that models numeric prediction with continuous-valued functions. Regression can be used to predict missing or unavailable quantitative data values rather than class labels [Han et al., 2011]. An example using a simple regression model is shown in figure 2.2. In this example, linear regression is used to find a line that fits two variables, so that one attribute can be used to predict the other. As seen in the example, training data on an undisclosed species of trees allows us to model the relation between tree height (in meters, x axis) and tree age (in years, y axis), making it easy and intuitive to predict the value of one of them if the other is given.

Tree Height vs Tree Age
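The simple linear regression described above can be sketched with the closed-form least squares solution. The measurements below are made-up values chosen for illustration and are not the data behind the figure.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: tree height in meters, tree age in years.
heights = [2.0, 4.0, 6.0, 8.0]
ages = [5.0, 9.0, 13.0, 17.0]

slope, intercept = fit_line(heights, ages)
# Predict the age of a tree that is 5 meters tall.
predicted_age = slope * 5.0 + intercept
```

Once the line is fitted, predicting one attribute from the other is a single evaluation of the fitted function.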


There are several other regression methods more suitable to more complex problems with increased dimensionality, such as multiple linear regression, non-linear regression, regression trees, among others.

Multiple linear regression is an extension of linear regression that involves more than one predictor variable. This allows a response variable to be modeled by a function of multiple predictor variables. For instance, in order to predict the price of a car, predictor variables may include horsepower, wheelbase and age, among others. For this situation to be solved with multiple linear regression, a linear relation between these predictor variables and the price is needed, opening up the possibility of a prediction [Han et al., 2011]. Non-linear regression is aimed at situations where this linear relation does not hold. In that case, the data is modeled by a function that depends non-linearly on its parameters. In some cases, a linearization is possible, transforming this type of problem into a simple linear regression [Han et al., 2011].

Regression trees are decision trees where the target variable takes a continuous value. A decision tree is a tree structure, similar to a flow chart, where each node denotes a test on an attribute value, each branch represents an outcome of that test and the tree leaves represent the target value. The model is obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. Each of these simple prediction models is composed of an attribute value test and its result [Loh, 2011]. When the target value of a decision tree is within a finite set of values, it is called a classification tree, and is categorized as a classification method inside the supervised learning scope.

A very popular practical use of regression is the prediction of future business direction based on past performance. The output of a regression analyzer is usually a model that can depict future trends. Companies rely heavily on regression analysis to define future business strategies, for instance to predict customer demand for a new product [Gupta and Kohli, 2015]. Another notable work on time series regression, applied to a trend detection problem, is the multidimensional regression method for the analysis of multidimensional time series data proposed by [Chen et al., 2002]. In this paper, time series are mapped as cubes and an approach based on multi-dimensional linear regression is taken in order to find unusual changes of trend.

2.4 Applications

On par with traditional data mining, TSDM techniques have been a hot topic among researchers, which has led to a respectable level of maturity. Furthermore, this maturity level has enabled the usage of time series analysis over a wide range of real-life problems in various fields. Time series data are of interest because of their pervasiveness in various areas, ranging from science, engineering, business, finance and economics to health care and government.

Biomedicine In the biomedical field, time series analysis has been used for several tasks related to the improvement of medical services. To cite a few, [Keogh et al., 2006] uses an anomaly detection algorithm to find the most unusual subsequences in a time series. This can be particularly useful in the detection of abnormal entries in various types of medical time series datasets. For instance, in electrocardiograms, the algorithm was able to locate premature beats, which can support medical diagnosis. In another medicine-related work, [Altiparmak et al., 2006] applies a clustering algorithm to industry-sponsored clinical datasets to identify a set of analytes (substances in blood) that are most strongly correlated. From this, it was possible to effectively model the state of normal health from analytes.

Commerce The great majority of research in this domain is on sales forecasting models, which usually include customer profiling and sales analysis. In [Sun and Luan, 2008], a commerce enterprise resorted to prediction methods to forecast the amount of sales and improve inventory management. A remarkable precision of the in-house forecasting model was empirically verified. [Hülsmann et al., 2011] presents various sales forecast methodologies and models for the automobile market. Decision trees proved to be the best suited models, mainly due to their intrinsic explicability. The developed models are based on time series analysis and classic data mining algorithms, and the data covers the German and American markets. The effects of economic data on the number of newly registered automobiles to be predicted are also studied.

Finance Another group of real world examples with a large representativeness is the economics and financial sector. The availability of high frequency financial data has driven the growing interest in statistics and econometrics research. One example of an application is the forecasting of share prices. This information can be used for decision making on new investments. A great deal of research has been done towards this end, on several different stock markets and using several different methods. To cite a few, [Enke and Thawornwong, 2005, Wang and Wang, 2012, Wang and Wang, 2015] are examples of successful neural network implementations that are able to keep pace with stock market dynamics. Neural networks have also found a lot of success in dealing with other financial fields, such as stock profits, exchange rates and risk analysis [Wang and Wang, 2015]. Outlier detection methods are also very popular in the finance sector and have several applications, which can range from credit card fraud to money laundering detection. In these problems, the main challenge is to identify transactions or other account activities that signal either of these situations. This area of research has been receiving increasingly more attention, to cite a few: [Chan et al., 1999, Žliobaitė et al., 2016, Nithya and Lavanya, 2015].

Transportation TSDM has been increasingly used towards the development of intelligent systems able to predict and evaluate traffic jams and accidents. To cite a few, concerning traffic forecasting, [Tchrakian et al., 2012] introduces a real-time prediction model that uses spectral characteristics of traffic flow time series. With this model and with previous predictions, in the same paper, the authors present another weighted prediction algorithm, combining the characteristics of time-series based predictions and spectral analysis. For a similar purpose, [Qi and Ishak, 2013] uses a stochastic approach to predict traffic speed during peak periods. Another application in the transportation sector is the work presented in [Moreira-Matias et al., 2013], which consists in predicting the taxi-passenger demand at taxi stands using historical data as a learning base, enabling a smarter taxi-driver mobility.

Other problems, approached in section 2.5, are also related to traffic modeling and are researched in more depth, since they have a direct connection to the main focus of this document and to the nature of the input data.

2.5 Estimating Traffic Characteristics with Time Series

In order to be able to make a decision on the usage of the RB system, information on the road ahead of the vehicle’s current position is needed. In this section, two approaches found in the bibliography are presented towards modeling a given route.

2.5.1 Optimal Speed Profiles

A possible approach to determine the optimal usage model for the EMS is determining a time-dependent optimal speed profile. By analyzing what would be the vehicle’s optimal speed at a certain point in time, this approach incorporates information on road characteristics. Nevertheless, the resulting profile would only be applicable to routes under the same conditions. For instance, if the route passes through areas in which the traffic conditions vary during rush hour, the speed profile would only be applicable to situations similar to the one in which it was conceived.

With this in mind and focusing on electric vehicles, various research work has been performed towards eco-driving assistance systems. The purpose of these systems is to provide eco-driving information to the drivers, instructing them towards an energy-efficient driving style. This is usually carried out by providing the driver with the optimal speed in a certain link of the current route. These systems often resort to optimal speed profiling algorithms, and their increasing popularity has motivated more research in this area. A usage example of speed profiling is presented in [Lin et al., 2014], where an optimal control problem is formulated and solved based on dynamic programming. Due to unexpected traffic situations and traffic density, the optimal speed trajectory is not tracked perfectly. The driver is informed of the current speed and the optimal speed on a screen.

An important characteristic that is not considered in many research works in this area is the impact of the presence of intersection queues. For instance, [Wu et al., 2011] uses data on the acceleration, speed and environment factors as input and presents the optimal acceleration that represents the least fuel waste in a given trip. However, as this does not take into account intersection queues, it can only be applied to free-flow situations.

With this additional constraint in mind, [Wu et al., 2015] proposes an analytical model that determines an optimal speed profile to minimize the consumed electricity in a given route. This model divides the whole control process into a sequence of control stages, and each control stage is formulated as a nonlinear optimization problem which involves spatial and temporal constraints induced by the presence of vehicle queues. However, it is assumed that information on the intersection queues is known in advance using vehicle-to-vehicle and vehicle-to-infrastructure communication. [Mandava et al., 2009] and [Tielert et al., 2010] are examples of other similar works that use data coming from traffic-light-to-vehicle communication to calculate optimal speed and acceleration rates.

Regarding railway routes, energy consumption optimization is another field where optimal speed profiles thrive. [Grabocka et al., 2014] proposes a heuristic to decompose speed profiles into segments of trips, which, combined, compose a realistic driving behavior in railways. The optimal policy was simply derived from the concatenation of the most energy-efficient subsequences, as roughly shown in figure 2.3. Then, an optimal combination of those segmented and yet perturbed series is searched using Simulated Annealing. As opposed to the previously mentioned approaches on eco-driving, no queue information is used as input.

Figure 2.3: Combining segments to find the optimal speed profile [Grabocka et al., 2014]

On par with [Grabocka et al., 2014], [Albrecht et al., 2010, Lin et al., 2014] are examples of works that report significant results from the application of optimal speed profiling. This is mainly due to the rigid and immutable routes characteristic of train trips, which enable more precise profiles that are more easily tested.

2.5.2 Travel Time State Estimation

Another viable approach to gather the information needed for our problem is using travel time state estimation (TTSE). In this approach, the time a vehicle takes to go through a given distance is used to assign a state to this route. By analyzing travel times, this approach intrinsically takes into account intersection queues and correlates them with the time of the day. Additionally, this approach can be divided into a priori and sequential TTSE.

An example of an a priori solution is presented in [Moreira-Matias et al., 2016], where a model based on time-evolving origin-destination matrices is developed. The information in these matrices is generated and updated by mining continuous flows of high-speed origin-destination entries. With the framework presented, relevant context-aware information is extracted, following, as closely as possible, the stochastic dynamics of the human mobility behavior.

Among sequential travel time estimation, several works were identified.

In [Skabardonis and Geroliminis, 2005] and in [Skabardonis and Geroliminis, 2008], a model based on kinematic wave theory to model the spatial and temporal queuing at the traffic signals is applied to estimate traffic arrivals at an intersection. Long queues and spillovers are taken into account. In these two works, travel time is calculated by decomposing it into free flow travel time and delay. With this assumption, the impact of yellow time in signalized links is omitted. [Liu and Ma, 2009] proposes a model that estimates travel time by tracing a virtual probe and determining its next maneuver based on estimated traffic states retrieved from equipment installed in each intersection. This installation is a major drawback for the scalability of this approach. In a similar link travel time estimation problem on buses, a tendency-based model is proposed in [Chen et al., 2013]. This approach intends to apply corrections to an existing historical data-based model, since, in this model, a strong bias between trips was detected, which indicates a movement tendency between trips. As such, the bias used is based on the arrival at a station - stop - departure from station sequence pattern in bus routes.

[Herring et al., 2010] proposes a probabilistic modeling framework for estimating arterial traffic conditions from sparse probe data collected from taxis in San Francisco, CA. This probabilistic approach consists in modeling the evolution of traffic states as road segments in a Coupled Hidden Markov Model (CHMM). An expectation maximization algorithm is used to find the parameters of the CHMM, the state transition matrix and conditional travel time probability distributions. With this traffic flow model, one can estimate short-term evolution of travel times. [Ramezani and Geroliminis, 2012] is a similar work using a Markov chain towards the same end: measuring traffic progression and link correlation in an arterial route. After a division of the route into segments, calculations on the travel time of each segment are made. Following this, a comparison between consecutive segments is made, resulting in a set of points correlating the time the vehicle took to pass through each of the consecutive segments that are being compared.

The visualization in figure 2.4 enables an intuitive clustering into groups 1 - 4, as separated by the red dashed lines. This clustering is based on a gridding approach. With this information, initial and transition probabilities can also be inferred. The whole process of calculating the travel time of two consecutive segments, correlating the two segments via travel time, clustering into different states and generating the transition matrix is performed iteratively for all the consecutive segments that divide the selected route.


Figure 2.4: Comparing the transition from segment 1 (x axis) to segment 2 (y axis) and posterior clustering in four groups [Ramezani and Geroliminis, 2012]
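As a rough illustration of the gridding and transition matrix steps, the sketch below assigns each pair of consecutive-segment travel times to one of four grid groups and estimates transition probabilities by counting. It is a simplified reading of the idea and not the procedure of [Ramezani and Geroliminis, 2012]: the split thresholds, the numbering of the groups and the function names are assumptions made for the example.

```python
def grid_state(t1, t2, split1, split2):
    """Map travel times on two consecutive segments to a grid group (1-4).

    `split1` and `split2` are hypothetical thresholds (the dashed lines)
    separating short from long travel times on each segment.
    """
    slow1 = t1 > split1
    slow2 = t2 > split2
    return 1 + int(slow1) + 2 * int(slow2)


def transition_matrix(state_sequences, n_states=4):
    """Estimate P[i][j], the probability of moving from state i+1 to j+1.

    Each element of `state_sequences` is the list of states one trip went
    through along the consecutive segment pairs of the route.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in state_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur - 1][nxt - 1] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that were never left keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Initial probabilities could be estimated in the same way, by counting the first state of each trip.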


Chapter 3

Solution Proposal

As previously mentioned, the main goal of this project is to provide a framework that can be followed to optimize the energy retrieved during the RB functionality usage. To this end, out of the available dataset, a specific route for tests needs to be identified. This route’s traffic conditions will then be modeled as a classification problem. This classifier will provide traffic information to a decision heuristic that will be developed based on domain knowledge. This heuristic’s aim is to force the RB system into a greedier, higher load state when the traffic predictions indicate the presence of energy-rich braking moments, hence increasing the energy produced.

An overview of this process is illustrated in figure 3.1. In this figure, the three main phases of the process are presented, as well as their respective inputs and outputs. It is worth mentioning that the two disjoint datasets that the database provides are described in more detail in section 3.4.

Figure 3.1: Development steps covered by the proposed solution with inputs and outputs of each phase. After selecting a route, a classifier able to model the traffic conditions on this route is elaborated. The predictions made by this classifier are used to predict braking moments and maximize the amount of energy recovered. These gains are evaluated with simulations using a disjoint dataset.


The aims and specifications of each of these three steps are thoroughly explained in the following sections and, in the end, the dataset used is presented. This includes a complete description of the data sources used and the features available.

3.1 Selecting the Test Route

The first step towards completing the main challenge of this project is to select a road segment on which to apply and test the developed framework. As such, initially, a process towards selecting the most common routes from the dataset is carried out. The main aim of this step is to identify the most relevant route from the vehicles’ operation. A route is considered to be relevant and interesting to research if it presents variable traffic flow and includes zones where braking is necessary, such as slopes, intersections or roundabouts.

Initially, using time series data as input, trips are detected. A trip represents an instance of the vehicle operation, from the traction engine start until the vehicle stops. In practice, this is only an organization of the time series data points in groups, where each group represents a trip, in other words a path the vehicle took. This allows us to quickly identify zones where data is more abundant. Nevertheless, this representation is not sufficient to make a decision. Firstly, because of the level of subjectivity associated with geographical data that has not undergone a map-matching algorithm and, secondly, because it does not include information on the traveling direction.

A grid clustering approach is then taken to discretize not only the trips detected but also the whole map. For practical and processing reasons, the grid will only include a portion of the geographic zone that is richer data-wise. By mapping each data point to a cell of the grid, a trip can be simplified to a sequence of cells that delimit its path. This representation can be further simplified by indexing the cells of the grid. With this indexing a trip can be represented as a sequence of cell identification numbers.
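The discretization described above can be sketched as follows. The grid origin, the cell size in degrees and the grid dimensions are hypothetical parameters, and collapsing consecutive repeated cells is an assumption of this sketch rather than a documented step.

```python
def cell_id(lat, lon, min_lat, min_lon, cell_deg, n_rows, n_cols):
    """Return the index of the grid cell containing (lat, lon), or None
    when the point falls outside the portion of the map covered by the grid."""
    row = int((lat - min_lat) // cell_deg)
    col = int((lon - min_lon) // cell_deg)
    if 0 <= row < n_rows and 0 <= col < n_cols:
        return row * n_cols + col
    return None


def trip_to_cells(points, min_lat, min_lon, cell_deg, n_rows, n_cols):
    """Convert a trip, given as a list of (lat, lon) points, into a sequence
    of cell identification numbers."""
    cells = []
    for lat, lon in points:
        cid = cell_id(lat, lon, min_lat, min_lon, cell_deg, n_rows, n_cols)
        # Skip points outside the grid and consecutive points in the same cell.
        if cid is not None and (not cells or cells[-1] != cid):
            cells.append(cid)
    return cells
```

With row-major indexing, a single integer identifies each cell, so a trip reduces to a short list of integers.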

Entering the Sequence Pattern Mining domain, a Prefix Span algorithm is applied to the sequences of identification numbers (representing the trips that were detected) to identify the most common subsequences. These results will represent the most common segments of roads that were traveled in the dataset. Since each sequence includes the cells that were traveled consecutively, information on traveling direction is now included. The final route decision will be based on the final results of the Prefix Span algorithm.
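For reference, a minimal sketch of the Prefix Span idea for sequences of single items is given below. It is not the implementation used in this work. Note that, in its general form, the algorithm also counts patterns whose items are separated by gaps in the original sequence, so any restriction to strictly consecutive cells would need an additional constraint that is not shown here.

```python
def prefixspan(sequences, min_support):
    """Return {pattern: support} for every sequential pattern that occurs
    in at least `min_support` of the input sequences."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs: the
        # suffix of each sequence that follows the current prefix.
        support = {}
        for seq_idx, start in projected:
            for item in set(sequences[seq_idx][start:]):
                support[item] = support.get(item, 0) + 1
        for item, count in sorted(support.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project each suffix past the first occurrence of `item`.
            new_projected = []
            for seq_idx, start in projected:
                seq = sequences[seq_idx]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projected.append((seq_idx, pos + 1))
                        break
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results
```

Applied to trips expressed as cell identification numbers, the longest patterns with high support are candidates for the most common route.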

3.2 Modeling Traffic Characteristics

Towards increasing the amount of information the EMS has available regarding the road ahead and its traffic conditions, a prediction model is developed. This road modeling problem is converted to a typical multi-class classification problem.

Each row of the created dataset that will be submitted to the classifier will represent a spatial-temporal unit of a trip on the route selected. In other words, the data on the selected route will be divided not only temporally (by time of the day and day of the week), but also spatially (by using the previously computed grid clustering results).

This way, the classes will reflect the traffic behavior in a given cell, at a given time of day of a given day of the week. The states used are Freeflow, Synchronized Flow, and Congested. An additional state Not Available is used to label situations where not enough information is available to make a classification. Different classification models and approaches are used towards reaching a robust classifier that fits the business needs the best.
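The division into spatial-temporal units can be sketched as below. The hourly granularity, the minimum number of samples and the `label_fn` callable that would assign one of the three traffic states are all assumptions of this example. The actual labeling rule is not defined at this point of the text.

```python
from collections import defaultdict
from datetime import datetime

NOT_AVAILABLE = "Not Available"


def unit_key(cell, timestamp):
    """Spatial-temporal unit: (grid cell, hour of day, day of week).
    Monday is 0, following datetime.weekday()."""
    return (cell, timestamp.hour, timestamp.weekday())


def group_by_unit(observations):
    """Group (cell, timestamp, speed) observations by spatial-temporal unit."""
    groups = defaultdict(list)
    for cell, timestamp, speed in observations:
        groups[unit_key(cell, timestamp)].append(speed)
    return dict(groups)


def label_units(groups, min_samples, label_fn):
    """Label each unit with `label_fn`, or Not Available when the unit has
    fewer than `min_samples` observations (hypothetical threshold)."""
    return {
        key: label_fn(speeds) if len(speeds) >= min_samples else NOT_AVAILABLE
        for key, speeds in groups.items()
    }
```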

The model reached can then be used to predict the traffic at each point of the route under study, given a set of predictors.

3.3 Optimizing the Energy Recovery Process

Concerning the final decision on the application of the RB system at a given moment, a decision heuristic is elaborated. Using data gathered from the previous steps as input, a final decision is made on which mode the regenerative braking system should use, saving the system from overuse while maximizing the available recoverable energy.

There are two modes of operation: the normal mode, which is the one currently used by the working system, and the greedy mode, which forces a higher braking power to produce more energy. Nevertheless, this greedy mode cannot be used constantly, since it can cause overheating when used for too long, triggering security mechanisms, or put too much strain on some components, wearing them down. The ability to use this approach is introduced by the intelligence layer that the EMS is missing and that is developed in this work.

The results of the traffic modeling will represent the RB system’s knowledge of how the road develops and will prove useful in the management of the greedier energy recovery approach. A heuristic based on domain knowledge of the behavior of the RB system is elaborated. It is in charge of detecting situations where the recovery process can be greedier and of enforcing the security constraints. When in a greedy state, the torque of the power generation motor used when braking is increased. This directly leads to a larger amount of energy produced.

The main challenge is to be able to increase the amount of energy produced by inducing the system into a greedy state when the road favors the production, without damaging or compromising any system component. Since the EMS, as of now, is not inserting any greedy states, in order to estimate the gain given by the heuristic, a comparison with a random insertion of greedy states is made.
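A possible shape for such a heuristic and for the random baseline is sketched below. It is only an illustration under stated assumptions: the set of traffic states that trigger the greedy mode, the maximum number of consecutive greedy cells and the cooldown length are hypothetical parameters standing in for the domain knowledge and security constraints mentioned above.

```python
import random


def choose_modes(predicted_states, greedy_states, max_run, cooldown):
    """Choose 'greedy' or 'normal' for each cell of the route.

    After `max_run` consecutive greedy cells, the system is forced into
    normal mode for `cooldown` cells (both values are hypothetical limits).
    """
    modes = []
    run = 0   # consecutive greedy cells so far
    rest = 0  # cells still forced to normal mode
    for state in predicted_states:
        if rest > 0:
            modes.append("normal")
            rest -= 1
        elif state in greedy_states:
            modes.append("greedy")
            run += 1
            if run == max_run:
                run, rest = 0, cooldown
        else:
            modes.append("normal")
            run = 0
    return modes


def random_modes(n_cells, n_greedy, seed=None):
    """Baseline: place `n_greedy` greedy states at random cells."""
    rng = random.Random(seed)
    greedy_cells = set(rng.sample(range(n_cells), n_greedy))
    return ["greedy" if i in greedy_cells else "normal" for i in range(n_cells)]
```

Giving the random baseline the same number of greedy states as the heuristic would keep the comparison between the two focused on where the greedy states are placed.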

3.4 Dataset Analysis

3.4.1 Data Sources

As previously mentioned, the EMS that is installed in the vehicles has a module that is able to retrieve data on the operation of the vehicle. This information is gathered using Controller Area Network (CAN) Bus. CAN Bus is a message-based communication network that connects components inside a vehicle, allowing micro controllers and devices to communicate with each other [Boterenbrood, 2000]. The module connected to this network has two versions that retrieve a different set of features with different periodicity.

In the first version, data is retrieved every 30 seconds and includes only geographical coordinates and very few operation metrics. The data used from this module was retrieved from different vehicles during March of 2016. This set was named “Dataset A”. As for the second version, data is retrieved every third second and includes more operation metrics. The data used from this module was retrieved from a single vehicle during May of 2016. This set was named “Dataset B”. Further information on what each dataset includes is presented in section 3.4.2.

The module responsible for retrieving this data via CAN Bus is also responsible for the communication with a remote server.

3.4.2 Features

The information retrieved from the vehicles is communicated via HyperText Transfer Protocol (HTTP) to a remote Web Server. The data is then stored in a remote PostgreSQL database. All data is stored in a single relational table with the columns specified in table 3.1.

Even if coming from different versions of the communications module, the data received is stored in the same table. In the first version, since less data is retrieved, some fields are left empty. The fields filled are: id, time, latitude, longitude, speedCAN, contract, totalDistance, totalFuel, fuelRate, fuelThrottle. When a message is received from the second version, all fields are filled.

The variables time and contract define each data point. As expected from time series data, the time variable orders the data points, which are sampled at equally spaced intervals. The contract variable differentiates each vehicle used, since different vehicles can have measurements with the same timestamp.

After some brief exploratory analysis, it was found that the error present in some variables renders them useless. These variables are altitude and weight. Additionally, variables referring to the current trip (tripDistance and tripFuel) are too unreliable since they depend on the usage of the partial odometer, whose usage varies with the driver. Information on the distance traveled and quantity of fuel used can still be retrieved from the overall measures (totalDistance and totalFuel).

During the decision making process, special attention should be paid to the variables that reflect braking behavior. Firstly, the percentage of pedal pressed is given by brake. Secondly, retarder reports the usage of the device of the same name. A retarder is a device used to augment the braking system, generally used only on heavy vehicles. These devices either slow vehicles or maintain a steady speed while traveling down a hill. Their effectiveness decreases as the vehicle speed decreases; as such, they are not capable of bringing vehicles to a standstill.


| Name | Type | Summary |
|---|---|---|
| id | integer | A unique identifier of the row. |
| time | timestamp w/ time zone | Date and time of the measurements retrieved. |
| longitude | real | Longitude coordinate of the vehicle’s position. |
| latitude | real | Latitude coordinate of the vehicle’s position. |
| altitude | real | Altitude coordinate of the vehicle’s position. |
| speed | real | Speed of the vehicle retrieved from the installed GPS. |
| speedCAN | real | Speed of the vehicle retrieved from CAN Bus. This value is more reliable than the GPS value. |
| rpm | real | Rotations per minute of the vehicle’s traction engine. |
| brake | real | Percentage of the braking pedal pressed (0-1). |
| acceleration | real | Percentage of the acceleration pedal pressed (0-1). |
| retarder | real | Percentage of usage of the retarder (0-128). Retarder is a device used to augment braking systems and serves to slow vehicles or maintain a steady speed. It is enabled by the driver when he sees fit. |
| contract | integer | Vehicle unique identifier. |
| totalDistance | integer | Total odometer value. Equal to the total distance traveled by the vehicle, in kilometers. |
| tripDistance | integer | Partial odometer value. |
| totalFuel | integer | Total fuel consumed during the vehicle’s lifetime, in litres. |
| tripFuel | integer | Fuel consumed since the partial odometer was reset, in litres. |
| weight | real | Total weight exerted on the vehicle wheels, in kilograms. |
| fuelRate | real | Traction engine fuel consumption rate, in litres per hour. |
| fuelThrottle | real | Injection pump percentage. Amount of fuel that is being fed to the traction engine (0-100). |
| MGUTemperature | real | Temperature of the communication module. |
| createdAt | timestamp w/ time zone | Date and time of the creation of the row in the database. |
| updatedAt | timestamp w/ time zone | Date and time of the last update of the row in the database. |


As such, the data available can be divided into two disjoint datasets. The first dataset includes data retrieved by the first version of the communication module. It includes information on several vehicles, but has a lower frequency. This dataset will be used to select and model a route. The second dataset includes more detailed data on a single vehicle. This data reflects braking moments of a real trip over the route selected. This grants a way to simulate a trip and estimate the gains of the application of the decision model that is developed in the final step of the framework.


Chapter 4

On the Selection of the Most Common Route

In the current chapter, the process that led to the selection of the route used is described in detail. This process can be divided into three main stages: trip detection, spatial discretization and finding the most common segments of the road. This process is briefly illustrated in figure 4.1.

Figure 4.1: Phases and flux of the route selection process in more detail.

Initially, the dataset is cut and reorganized into trips. A trip represents an instance of the vehicle operation, from the traction engine start until the vehicle stops. These trips are then discretized using a grid clustering approach. By the end of this step, a trip can be represented as a sequence of numbers that represent each cell the trip passes through. A trajectory augmenting heuristic is applied, increasing the sequences’ faithfulness to the map. At the end, the most common route is selected using a sequential pattern mining algorithm and separated from the rest of the dataset.


4.1 Detecting Trips in Data

In an initial phase, it is important to segment the dataset into trips. This will allow a more concise and intuitive analysis on the operation of the vehicle geographically, since a vehicle, over the course of a day, may travel on different paths and directions.

4.1.1 Pre-processing

In order to achieve the best routes, some pre-processing must be performed. This is especially important since the data gathering method used is automated and loosely controlled. This characteristic leads to the sporadic generation of invalid or missing values. Additionally, communication issues may arise and loss of data can occur, creating inconsistencies or insufficient data for some inferences.

Furthermore, the high dimensionality of the dataset indicates that some variables will not represent any additional knowledge for our purpose. Since our current aim is to find the most traveled route segments, the majority of the vehicle’s operation variables could be discarded.

The features used in the route selection step are: time, longitude, latitude, rpm, totalDistance. Reference on these variables can be found in table 3.1. These variables were chosen after some domain study on what determines if a vehicle is operating and where it is operating. Entries that missed a value for any of these features were removed.

From a total of 1303097 entries, 1302986 fit these restrictions and were used in the trip detection process.

4.1.2 Adopted Heuristic

To detect trips out of the raw data, a need to identify what defines a trip arises. After pre-processing, one can assume that data contains all the needed information for the following steps. This information includes timestamp, GPS position, rotations per minute of the traction engine and total distance traveled in the vehicle lifetime.

After some domain study and verification through data analysis, it was found that the beginning of a trip is best represented in the data as a spike in the rotations per minute of the traction engine. Since the rpm value is 0 when the vehicle is turned off, an increase in this value marks the beginning of a trip.

As for the end of the trip, the same variable can be used: when it decreases below operable values, the trip is considered to have ended. The heavy duty vehicles that are being studied are not able to operate below 500 rpm. Nonetheless, this is not enough. Given the nature of the data and possible losses due to communication issues, an additional restriction needs to be made regarding the amount of time that passed between two consecutive entries: if more than 10 minutes passed, then the current trip is ended and a new one is started (as long as the rpm value is above operable values). Even though there is a chance that these 10 minutes reflect only communication failures and not vehicle stops, a trip with a 10 minute gap without any data cannot be analyzed in the future steps, especially road modeling.


Following these data splits, additional restrictions are applied to the trips that were found. Firstly, a minimum number of series is stipulated: detected trips with fewer than 20 entries are assumed not to be analyzable and are therefore discarded. Secondly, trips that cover less than 4 kilometers are discarded. For this restriction, the first and last totalDistance values of a trip are compared. This constraint is especially useful to eliminate parking movements and stops for loading and unloading goods that cannot be detected by the rpm restriction. These actions are very common during the operation of this type of vehicle and can take a long time, hence generating a large number of series that carry little to no useful information.
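A minimal sketch of these two restrictions is given below. The helper name `is_valid_trip` is hypothetical, and totalDistance is assumed to be expressed in kilometers.

```python
MIN_ENTRIES = 20        # from the text: trips with fewer entries are discarded
MIN_DISTANCE_KM = 4.0   # from the text: trips covering less are discarded


def is_valid_trip(total_distances_km):
    """Return True when a trip passes both restrictions (hypothetical helper).

    `total_distances_km` holds the totalDistance reading of every series in
    the trip, in chronological order. Kilometers are an assumed unit.
    """
    if len(total_distances_km) < MIN_ENTRIES:
        return False
    # Distance covered is the difference between the last and first readings.
    travelled = total_distances_km[-1] - total_distances_km[0]
    return travelled >= MIN_DISTANCE_KM
```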

Technically, the distance in kilometers between two GPS coordinates could be calculated using the haversine formula, and this value could replace the totalDistance variable. However, totalDistance can be retrieved directly from the vehicle's odometer via CAN bus, and its precision is enough to distinguish operation on the road from stops and parking movements. The value of totalDistance is therefore used, avoiding additional calculations that could slow down the processing.
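For reference, the haversine alternative mentioned above could look like the sketch below. It is not used in this work; the function name and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Summing this distance over consecutive GPS points of a trip would approximate the value obtained from the difference between odometer readings, at the cost of one evaluation per pair of points.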

Figure 4.2: Trips detected from the dataset used. On the left, a zoom of the north and center areas of Portugal is displayed. On the right, a zoom over Porto, the area with more data available, is shown.

In figure 4.2, the trips found using this set of rules are shown on a map. Each line's opacity is reduced so that zones with overlapping trips end up with a darker tone of blue. This map has a total of 3378 trips.


4.2 Map Discretization

Following trip detection, a discretization of the map is carried out. The main aim of this step is to represent the trip data in a more objective way, enabling grouping of similar trips. To achieve this grouping, a spatial clustering technique is used.

4.2.1 Spatial Clustering

With the growth of data mining, many useful tools for the geographical community started to appear. Spatial clustering is one of these tools, targeting the analysis of geographical data. Similarly to any clustering task, it consists in grouping a set of objects into clusters, maximizing intra-cluster similarity and minimizing inter-cluster similarity, the only difference being the special concern with spatial data. Extracting information from large spatial datasets can prove to be a very difficult task. As such, it is common to pre-process the spatial components of the dataset and then apply traditional data mining techniques [Shekhar et al., 2000].

A grid-based method is the spatial clustering method that best fits our needs: it divides the space into a finite number of cells, forming a grid structure on which all clustering operations are performed. Succinctly, it consists in the discretization of the space under study and the subsequent mapping of data points to cells for clustering purposes.

In order to avoid a heavier and unnecessary computational load, and to integrate the work in a more familiar environment, the area under study was limited to the Porto area shown in figure 4.2.

For the problem at hand, a GeoJSON [Butler et al., 2008] file containing the specification of a grid is generated. Every cell in the grid is square, and different values for the width of the cells (the grid step) were tested. Cells are generated horizontally within a longitude range and then vertically within a latitude range. These ranges are hard coded and were defined taking into account the visual representation in figure 4.2. This creation order defines each cell's unique id, as shown in the matrix of expression 4.1.

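\[
A_{m,n} =
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
=
\begin{pmatrix}
1 & 2 & \cdots & n \\
n+1 & n+2 & \cdots & 2n \\
\vdots & \vdots & \ddots & \vdots \\
(m-1)n+1 & (m-1)n+2 & \cdots & m \cdot n
\end{pmatrix}
\tag{4.1}
\]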
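Expression 4.1 amounts to a row-major numbering of the cells. A minimal sketch of this mapping and its inverse is shown below; the function names are hypothetical, and rows and columns are 1-based as in the matrix.

```python
def cell_id(row, col, n_cols):
    """Id of the cell at 1-based (row, col) in a grid with n_cols columns."""
    return (row - 1) * n_cols + col


def cell_position(identifier, n_cols):
    """Inverse mapping: 1-based (row, col) of a given cell id."""
    row, col = divmod(identifier - 1, n_cols)
    return row + 1, col + 1
```

For a grid with n columns, the first cell of the second row receives id n + 1, in agreement with the matrix.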

During the creation of the grid, formulas 4.2 and 4.3 are used to find the four pairs of latitude and longitude coordinates that delimit each cell. These formulas receive an initial point (latitude φ1 and longitude λ1, in radians), a distance d (equal to the grid step) and a bearing as input, and return a point. The resulting point has latitude φ2 and longitude λ2, in radians. The bearing is measured in radians clockwise from north, which means that in the grid creation the values used are π/2 (when iterating in longitude, towards east) and π (when iterating in latitude, towards south).

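\[
\varphi_2 = \arcsin\left(\sin\varphi_1 \cdot \cos\delta + \cos\varphi_1 \cdot \sin\delta \cdot \cos\theta\right) \tag{4.2}
\]

\[
\lambda_2 = \lambda_1 + \operatorname{atan2}\left(\sin\theta \cdot \sin\delta \cdot \cos\varphi_1,\ \cos\delta - \sin\varphi_1 \cdot \sin\varphi_2\right) \tag{4.3}
\]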

where

• φ is latitude;

• λ is longitude;

• θ is the bearing;

• δ is the angular distance d/R, where d is the distance traveled and R is the earth's radius.
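Formulas 4.2 and 4.3 can be sketched in a few lines. The function name `destination_point` and the mean Earth radius of 6371000 m are assumptions made for illustration; the two-argument arctangent is implemented with `atan2`.

```python
from math import asin, atan2, cos, pi, sin

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters, an assumed constant


def destination_point(phi1, lambda1, distance_m, bearing):
    """Apply formulas 4.2 and 4.3; all angles are in radians."""
    delta = distance_m / EARTH_RADIUS_M  # angular distance d / R
    phi2 = asin(sin(phi1) * cos(delta) + cos(phi1) * sin(delta) * cos(bearing))
    lambda2 = lambda1 + atan2(
        sin(bearing) * sin(delta) * cos(phi1),
        cos(delta) - sin(phi1) * sin(phi2),
    )
    return phi2, lambda2
```

With a bearing of π the latitude decreases by d/R while the longitude is unchanged, and with a bearing of π/2 the longitude increases. Calling the function once per side therefore yields the corners of a cell from its starting corner and the grid step.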

4.2.2 Trip Discretization

Following trip detection from our dataset and the discretization of the spatial domain, the next phase aims to associate the two outputs of these steps.

The association consists in a simple matching of each trip's points to a cell already defined in the GeoJSON created in section 4.2.1. Since each cell of the grid is mapped to an identification number (as defined in expression 4.1), a trip can be represented as a sequence of cell numbers.

During the same trip, especially when using a large grid step, many consecutive data points may fall in the same cell. Such repetitions are collapsed, so that in the resulting sequence a cell id is never immediately followed by the same cell id.

Different cell sizes were used to explore the different possible outputs. Maps with the discretized trips are presented in figure 4.3. The colors on each map follow a gradient with four levels: green, yellow, orange and red. The number of occurrences of a cell defines its color, with red representing the highest number of occurrences. The intervals that determine a cell's color are calculated dynamically, taking into account the total number of trips, so that these visualizations are automatically updated upon the insertion of new data. By coloring the maps this way, the direction of the vehicle is not taken into account, since a vehicle on a road can travel in up to two different directions. Nevertheless, a good idea of the main areas of operation is given.
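The matching and the removal of consecutive repetitions can be sketched as follows. This is a simplified illustration under stated assumptions: the grid is described by its north-west corner and by steps in degrees instead of the GeoJSON polygons, ids grow eastwards and then southwards, all points are assumed to lie inside the grid, and the names `point_to_cell` and `discretize_trip` are hypothetical.

```python
from itertools import groupby


def point_to_cell(lat, lon, north, west, lat_step, lon_step, n_cols):
    """Cell id of a point in a grid whose first cell is at the north-west corner.

    Steps are given in degrees, a simplification of the metric grid step.
    """
    row = int((north - lat) // lat_step)   # rows grow towards the south
    col = int((lon - west) // lon_step)    # columns grow towards the east
    return row * n_cols + col + 1


def discretize_trip(points, **grid):
    """Turn (lat, lon) points into a cell-id sequence without consecutive repeats."""
    cell_ids = (point_to_cell(lat, lon, **grid) for lat, lon in points)
    # groupby merges runs of equal neighbours, keeping one id per run.
    return [cell for cell, _ in groupby(cell_ids)]
```

A cell id may still appear more than once in a sequence when the vehicle returns to a cell after leaving it; only adjacent duplicates are collapsed.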

4.2.3 Trajectory Augmenting

Even though the discretization method produces a valid sequence of cell ids, this sequence may be incomplete. If a vehicle traveled fast enough to cross an entire cell between two consecutive data points, that cell's id will not be included in the trip sequence. The smaller the grid step, the more often this situation occurs. To prevent these low-sampling-rate situations from affecting the following route selection steps, pseudo cell ids are added to the sequence.


(a) Cell width = 1000 meters

(b) Cell width = 800 meters

(c) Cell width = 500 meters

Figure 4.3: Vehicle trips mapped to different spatial discretizations with different cell sizes. The colors follow a gradient from green to red according to the number of occurrences of a specific cell in all trips.
