Predicting Passenger Connectivity in an Airline’s Hub Airport. Aerospace Engineering


Predicting Passenger Connectivity in an Airline’s Hub Airport

Marta Isabel Vaz Guimarães Pinto de Melo

Thesis to obtain the Master of Science Degree in

Aerospace Engineering

Supervisors: Prof. Cláudia Alexandra Magalhães Soares Prof. Rodrigo Martins de Matos Ventura

Examination Committee

Chairperson: Prof. Paulo Jorge Coelho Ramalho Oliveira Supervisor: Prof. Cláudia Alexandra Magalhães Soares

Member of the Committee: Prof. Qiwei Han

January 2021


Acknowledgments

Firstly, I would like to express my gratitude to my Supervisors, Professor Cláudia Soares and Professor Rodrigo Ventura, for all the support, guidance and motivation throughout the entire work. Without their extensive knowledge and dedication to this project, this thesis would not have been possible.

I would also like to thank TAP Air Portugal, for providing the data, and Eng. Duarte Afonso, for sharing all the domain knowledge.

I would like to extend my gratitude to all my friends and family, especially my mom, who has always been there sharing my stress and happiness over the last five years.

Finally, my deepest thanks to André Menezes for the never-ending support and motivation during this work, and for being the greatest contribution for me to keep improving as a person.


Resumo

Designing a flight schedule that maximizes an airline’s revenue is an extremely complex problem. Better planning is directly related to the airline’s profits, since demand is influenced by the competitors’ offer, delays, missed connections and passenger loyalty. This work focuses on the problem of scheduling connecting flights. This type of planning assumes that the layover time at the airport is adjusted to what is strictly necessary to make the connection between flights and that the passenger is able to complete the required circuit between disembarking and boarding. The presented solution uses data science tools to predict passenger connectivity at an airline’s hub airport, without requiring the connection time of each passenger. The proposed models are decision support models that can be used at different moments in the process of planning connecting flights, given the information available in each context. First, an exploratory data analysis is performed. Next, a machine learning model, consisting of gradient boosting machines and imbalanced data strategies, is used to make predictions. Finally, model explainability techniques provide a post hoc explanation of the results obtained. The proposed models outperform the baseline inspired by the criterion currently used by the airline, with an area under the precision-recall curve greater than 89%. From an operational point of view, applying these predictive models to take preventive actions readily represents a cost reduction.

Keywords:

connecting flights, flight planning, data-driven operations, decision support models, machine learning


Abstract

Designing a flight schedule that maximizes an airline’s revenue is an extraordinarily complex problem. Better schedule planning is directly related to an airline’s profitability, since an airline’s demand is influenced by the schedules of competing airlines, delays, missed connections and passenger goodwill. This work focuses on the problem of schedule planning for connecting flights. This type of schedule planning is based on the assumption that the time at the airport is adjusted to what is strictly necessary to make the connection between flights and that the passenger is able to complete the necessary circuit between landing and boarding.

The presented solution uses a data-driven approach based on data science tools for predicting passenger connectivity in an airline’s hub airport, without requiring the connection time of each passenger. The proposed models are decision support models that can be employed at different times in the process of planning connecting flights, given the information available in each context. Firstly, an exploratory data analysis is performed. Then, a machine learning model consisting of gradient boosting machines and imbalanced data strategies is used to make predictions. Lastly, model explainability techniques give a post hoc explanation of the obtained results. The proposed models outperform the baseline solutions inspired by the current criterion used by the airline, with an area under the precision-recall curve greater than 89%. From an operational perspective, applying these predictive models to take preventive actions readily represents a cost reduction.

Keywords:

connecting flights, airline schedule planning, data-driven operations, decision support models, machine learning.


List of Tables

2.1 Classification performance metrics based on the confusion matrix.

3.1 Arrival flights’ airlines.

3.2 Missing values of the arrival and departure dates.

3.3 ‘On/Off-Blocks Time’ and ‘Best Time Hub Control’.

4.1 Features considered for each problem.

5.1 Classifier’s parameters.

5.2 Time to fit a GMM with diagonal covariance and a given number of components.

5.3 Tuning the number of majority chunks for the Strategic model.

5.4 Strategic DSM results.

5.5 Tuning the number of majority chunks for the Pre-Tactical model.

5.6 Pre-Tactical DSM results.

5.7 Tuning the number of majority chunks for the Tactical model.

5.8 Tactical DSM results.

5.9 Tuning the number of majority chunks for the Post-Operations model.

5.10 Post-Operations DSM results.

5.11 Number of reactive and preventive actions.

5.12 Ratio between reactive and preventive costs for each model.


List of Figures

2.1 Decision tree.

2.2 Bagging.

2.3 Boosting.

2.4 Covariance types for GMM.

2.6 Confusion matrix for a binary classification problem.

2.7 Examples of ROC and PR curves.

3.2 Feature ‘Class from’ analysis.

3.3 Feature ‘Class to’ analysis.

3.4 Connecting passengers throughout months.

3.5 Analysis of the feature ‘SEF’ throughout months.

3.6 Binary features of the Pax dataset.

3.7 Missing data in the arrivals and departures timestamps.

3.8 Lisbon airport diagram.

3.9 Number of buses used in the departures.

3.10 Data cleaning process.

3.11 Data integration process.

3.12 Connecting journey.

3.13 Missing values.

4.1 Decision support models.

4.2 Structure of the proposed DSMs.

4.3 Schematic diagram of the oversampling procedure.

5.1 Baseline performance of the Strategic model.

5.2 GMM selection using AIC and BIC scores for the Strategic model.

5.3 Hyperparameters of the Strategic classification model.

5.4 ROC and PR curves of the Strategic model.

5.5 SHAP summary plot of the Strategic model.

5.6 GMM selection using AIC and BIC scores for the Pre-Tactical model.

5.7 Hyperparameters of the Pre-Tactical classification model.

5.8 ROC and PR curves of the Pre-Tactical model.

5.9 SHAP summary plot of the Pre-Tactical model.

5.10 Baseline performance of the Tactical model.

5.11 GMM selection using AIC and BIC scores for the Tactical model.

5.12 Hyperparameters of the Tactical classification model.

5.13 ROC and PR curves of the Tactical model.

5.14 SHAP summary plot of the Tactical model.

5.15 Baseline performance.

5.16 GMM selection using AIC and BIC scores.

5.17 Hyperparameters of the Post-Operations classification model.

5.18 ROC and PR curves of the Post-Operations model.

5.19 SHAP summary plot of the Post-Operations model.

6.1 Graph of the airport structure.

6.2 Undirected graph.


Nomenclature

Greek symbols

$\mu$ Mean of a Gaussian distribution.

$\sigma$ Standard deviation of a Gaussian distribution.

Roman symbols

$c$ Number of majority class chunks.

$k$ Number of folds in $k$-fold cross-validation.

$\mathcal{N}$ Gaussian distribution.

$N$ Number of samples.

$p(X = x)$ Probability of random variable $X$ taking value $x$.

$\mathbf{x}$ Feature vector.

$x$ Feature value.

$\hat{y}$ Estimated target value.

$y$ Target value.


Glossary

ACARS aircraft communications addressing and reporting system.

AIC Akaike information criterion.

APT airport.

AUC area under the curve.

BIC Bayesian information criterion.

CART classification and regression trees.

CSV comma-separated values.

DSM decision support model.

EDA exploratory data analysis.

EM expectation-maximization.

FN false negatives.

FP false positives.

FPR false positive rate.

GBM gradient boosting machine.

GMM Gaussian mixture model.

IATA International Air Transport Association.

ICAO International Civil Aviation Organization.

MAR missing at random.

MCAR missing completely at random.

MCT minimum connecting time.


ML machine learning.

MNAR missing not at random.

NN Non-Schengen to Non-Schengen.

NS Non-Schengen to Schengen.

OCC operations control center.

PR precision-recall.

ROC receiver operating characteristic.

SEF Serviço de Estrangeiros e Fronteiras.

SHAP Shapley additive explanations.

SMOTE synthetic minority oversampling technique.

SN Schengen to Non-Schengen.

SS Schengen to Schengen.

SSIM Standard Schedules Information Manual.

TN true negatives.

TP true positives.

TPR true positive rate.


Chapter 1

Introduction

1.1 Motivation

One of the most challenging problems for airlines is schedule planning. The airline schedule planning problem largely defines the market share that an airline captures, and hence it is a key factor for the airline’s profitability. Nowadays, an airline has to manage hundreds of flights per day; multiple aircraft; hundreds of airports; constraints regarding air traffic control, airport slots, gates, crew and passengers; while dealing with complex issues such as pricing and competing airlines. Therefore, designing a flight schedule that maximizes an airline’s revenue is clearly an extraordinarily complex problem [1].

TAP Air Portugal’s operating model is based on the hub and spoke philosophy, with its operation concentrated at Lisbon airport (the hub), where it allows passenger flows to connect between flights that land and take off at Humberto Delgado Airport. This is one of the most important European gateways to Brazil and Africa and the biggest European airport serving South America among the Star Alliance hubs. In 2019, Lisbon airport handled 31.2 million commercial passengers. From these, 17.1 million were TAP passengers [2].

The schedule planning of connecting flights is based on the assumption that the time at the airport is adjusted to what is strictly necessary to make the connection between flights and that the passenger is able to complete the necessary circuit from landing to boarding. Recently, TAP Air Portugal faced some difficulties in guaranteeing the connectivity of these passengers, with a measurable percentage unable to reach the boarding gate in time to board their flight. Better planning of connecting flights is directly related to the airline’s profitability, since an airline’s demand is influenced by:

(1) Schedule of competing airlines. In terms of connecting flights, other hub airports, such as Madrid airport, represent direct competition. Thus, it is necessary to ensure that the airline offers the best time and price ratio, as passengers typically prefer to purchase flights that allow them to make connections in less time and/or at a more affordable price. To achieve the shortest possible time interval between origin and destination points, it is necessary to estimate the minimum connecting time (MCT), i.e., the minimum time that ensures a successful transfer.

(2) Delays and missed connections. When a passenger misses a (connecting) flight, either because of a flight delay or because the flight is canceled, the airline loses profitability. In certain circumstances, passengers are entitled to monetary compensation or even an overnight stay. In addition, a flight delay may propagate delays to other processes.

(3) Passenger goodwill. A passenger who is satisfied with the services is more likely to buy flights from the same airline again and recommend it to others.

To plan operational schedules, airlines engage in a complex decision-making process. Firstly, due to the mutual dependencies between the airlines’ flight schedules, the problem becomes highly nonlinear. Secondly, the number of possible solutions increases exponentially with the number of variables considered.

To help decision makers to optimize the flight schedule planning and to establish the MCT, decision intelligence can be employed. Decision intelligence uses data science to take advantage of the large amounts of available data to build decision support models (DSMs).

1.2 Objectives and Contributions

This thesis aims to develop DSMs that use historical information from TAP to help decision makers to optimize the connecting flights schedule planning.

This work proposes a formulation for the problem of predicting passenger connectivity in an airline’s hub airport, which has seldom been studied in the past. The task is framed as a classification problem where a machine learning (ML) model is used to predict whether a passenger will miss the connection or not. Furthermore, the proposed solution consists of DSMs that can be employed at different stages of the planning process, given the information available in each context. With these models it is possible to identify some key factors that impact passengers’ connection times.

1.3 Related Work

Previous work has extensively studied optimization approaches for fleet assignment, aircraft maintenance routing and crew scheduling [1, 3].

The problem of defining a flight schedule that aims to promote a robust operating system has been studied by some researchers. Lan et al. [4] presented two new approaches to minimize passenger disruptions and achieve robust airline schedule plans. Wu et al. [5] developed a rapid solving method for large airline disruption problems caused by airport closure. Burke et al. [6] investigated simultaneous flight retiming and aircraft rerouting, subject to a fixed fleet assignment. A number of papers have addressed flight scheduling and fleet assignment simultaneously, including [7–9]. Aloulou et al. [10] addressed the challenging issue of improving schedule robustness without increasing the planned costs. Airlines use robust scheduling to mitigate the impact of unforeseeable disruptions on profits. Atkinson et al. [11] examined how effectively practices such as flexibility to swap aircraft, flexibility to reassign gates, and scheduled aircraft downtime accomplish this goal. Dunbar et al. [12] introduced a new algorithm to accurately calculate and minimize the cost of propagated delay, in a framework that integrates aircraft routing and crew pairing.

Airline frequency competition is partially responsible for the growing demand for airport resources. Vaze and Barnhart [13] proposed a model for airline frequency competition under slot constraints. Yan and Chen [14] introduced coordinated scheduling models for allied airlines. Jiang and Barnhart [15] developed a dynamic scheduling approach that aims to match capacity to demand given the many operational constraints that restrict possible assignments.

Flight schedules are highly sensitive to delays, which occur very frequently. Mueller and Chatterji [16] analysed departure and arrival data for ten major airports in the United States and characterized the delay distributions for traffic forecasting algorithms. Xu et al. [17] studied the airport-level causal relationship between delays, delay propagation, and delay causes using Bayesian networks. Wu and Law [18] explored the use of Bayesian networks to account for multiple connecting sources (aircraft, cabin crew, and pilots) and passenger connections. Guleria et al. [19] developed a multi-agent based method to predict the reactionary delays of flights, given the magnitude of primary delay that the flights experience at the beginning of the itinerary. Zhong et al. [20] proposed an application of Non-Negative Tensor Factorization for airport flight delay pattern recognition. To understand the mechanism of delay propagation from the perspective of multiple airports, Xiao et al. [21] proposed a low-dimensional approximation of conditional mutual information for transfer entropy. Efthymiou et al. [22] performed an empirical analysis that focuses on the impact of delays on customers’ experience and satisfaction. Some researchers use machine learning approaches to address the problem of predicting flight delays. Khanmohammadi et al. [23] introduced a new multilevel input layer artificial neural network and Belcastro et al. [24] implemented a parallel version of the Random Forest data-classification algorithm for predicting flight delays.

Besides flight delays, another important component in flight scheduling is the flight block time and its reliability. Tian et al. [25] proposed a methodology for evaluating the flight block time under different delay time windows.

Typically, passenger-level data is not publicly available, making it difficult to explore passenger-centric problems. Bratu and Barnhart [26] developed a passenger delay calculator to compute passenger delays and to establish relationships between passenger delays and cancellation rates, flight leg delay distributions, load factors, and flight schedule design. Later, the same authors proposed [27] airline recovery decision models that select flight leg departure times and cancellations that, like conventional models, minimize operating costs, but are extended to include the resulting delay and disruption costs experienced by passengers. Rosenow et al. [28] performed an evaluation of strategies to reduce the cost impacts of flight delays on total network costs. The inclusion of individual transfer passengers in the delay cost balance of an airline involves the uncertainty of the number of these passengers. Barnhart et al. [29] developed a multinomial logit model for estimating historical passenger travel and extended a previously developed greedy reaccommodation heuristic for estimating the resulting passenger delays. Guo et al. [30] proposed the first approach to study passengers’ transfer journeys using flight and passenger datasets, and to provide decision support in real-time. It uses customized machine learning algorithms to predict passengers’ connection times. This approach differs from the one proposed in this work, since it has access to information that allows a regression model to be built instead of a classification one.

To the best of the author’s knowledge, there is no previous work that develops DSMs for connecting flights schedule planning without requiring the connection time of each passenger.

1.4 Thesis Outline

Chapter 2 provides an introduction to the concepts used in this work, within the topics of supervised learning and classification algorithms. Chapter 3 presents the datasets used in this work and describes the data analysis and preparation processes. The proposed models are described in Chapter 4. All the results are shown and discussed in Chapter 5. Finally, Chapter 6 presents the main conclusions of this thesis as well as some ideas for future work.


Chapter 2

Background

In this chapter, some theoretical concepts used throughout this thesis are introduced. Section 2.1 provides some background on supervised learning and classification algorithms. In Section 2.2, the imbalanced data problem is described. Section 2.3 covers the different types of encoding techniques and, lastly, Section 2.4 describes a method to explain individual predictions.

2.1 Supervised Learning

In ML, supervised learning refers to models and algorithms that map from the input features to the corresponding output. The algorithm is given many samples where the prespecified target variable is known, so that the model may learn from this data. After training, the model receives new inputs and determines the corresponding target based on the previous training data. Thus, a supervised learning model aims to generate a prediction of the correct target given unseen input data.

Supervised learning is further separated into two subcategories, classification and regression problems, based on the nature of the output variable. Classification predicts unordered discrete class labels, while regression consists of predicting continuous output variables. Classification algorithms will be the focus of this work.

2.1.1 General Approach to Classification

The learning process is divided in two phases: training and testing.

During training a classification model is fit to a previously established set of data classes. This classification algorithm builds the classifier by analysing a set of data made up of samples from the dataset under analysis and their corresponding class labels. The individual samples making up the training set can also be referred to as examples, instances or data points.

In the second step, during testing, the model is used to predict class labels for an untouched set of data. If the training set were used to measure the classifier’s performance in this phase, the estimate would be too optimistic. This happens because the classifier may overfit the data, i.e., during training the model learns anomalies and details of the training data that do not represent the overall data, which negatively impacts its performance on new data. To obtain a realistic performance estimate and detect overfitting, a test set is used. This set of data is independent of the training samples, meaning that it is not used to build the classifier [31].
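The two phases can be illustrated with a minimal sketch. The helper names (`train_test_split`, `accuracy`, `memorizer`) and the toy data are hypothetical and not part of this work; the "memorizing" classifier is an extreme, deliberately overfitted model used only to show why the training set gives an optimistic estimate.

```python
import random

def train_test_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction of it as an independent test set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [samples[i] for i in train_idx], [labels[i] for i in train_idx]
    test = [samples[i] for i in test_idx], [labels[i] for i in test_idx]
    return train, test

def accuracy(predict, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in zip(samples, labels)) / len(labels)

# Toy data (assumption): one feature, label 1 when the feature is >= 10.
samples = [[float(i)] for i in range(20)]
labels = [int(i >= 10) for i in range(20)]
(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels)

# An extreme overfitter: it memorizes the training samples and answers 0 otherwise.
memory = {tuple(x): y for x, y in zip(train_x, train_y)}

def memorizer(x):
    return memory.get(tuple(x), 0)

train_accuracy = accuracy(memorizer, train_x, train_y)  # perfect by construction
test_accuracy = accuracy(memorizer, test_x, test_y)     # estimate on unseen data
```

Because the memorizer has seen every training sample, its training accuracy says nothing about how it behaves on new data; only the held-out estimate does.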

2.1.2 Classification Algorithms

There are many different supervised learning methods used for classification, such as neural networks, nearest neighbor classifiers, support vector machines and decision trees. Among these techniques, decision tree based algorithms are among the most commonly used methods in classification problems due to the easy interpretability achieved by the hierarchical placement of decisions. In general, the learning and classification steps of decision trees are fast and the classifiers have good performance.

Decision Tree

As the name suggests, this technique uses a tree-like model of decisions to predict the target value. Just like a conventional tree, a decision tree is composed of internal decision nodes, branches and terminal nodes (or leaf nodes). Each decision node denotes a test on an attribute with discrete results, so each branch represents an outcome of the test. Depending on the outcome value, one of the branches is taken. This process is done by traveling along the tree from the root and recursively partitioning the data until reaching a leaf node, which contains the output value. Thus, for a given data point, the target value is found by going through the tree and making decisions based on the feature values [32].

Figure 2.1: Decision tree (root node, decision nodes, branches/sub-trees and leaf nodes).

In a classification tree (a decision tree used for classification), an impurity measure is used to split the nodes on the most informative features. A split is said to be pure if, after the split, all the instances following a specific branch belong to the same class. In this case, there is no need to split any further and a leaf node is added with the class.

There are several algorithms for building decision trees, each with its own measure of split impurity, such as ID3 [33], its extension C4.5 [34], and the classification and regression trees (CART) algorithm [35]. The CART algorithm will be the focus, since it provides a foundation for important algorithms like boosted decision trees.

The decision trees produced by CART are strictly binary, i.e., contain exactly two branches for each decision node. This algorithm uses the Gini impurity as splitting condition, which can be understood as a criterion to minimize the probability of misclassification and is given by

$$I = 1 - \sum_{i=1}^{m} p(i)^2, \tag{2.1}$$

where $p(i)$ is the proportion of the samples that belong to class $i$, and $m$ is the number of classes [36].

The attribute with the lowest Gini impurity value is chosen for the root node. This process is then repeated for each resulting node until all leaves are pure or all features have been used.
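As an illustration of Equation 2.1, the sketch below computes the Gini impurity of a set of labels and searches for the threshold on a single numeric feature that yields the lowest impurity. Scoring a split by the size-weighted average of the impurities of its two branches is an assumption made here for the example, and the function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity I = 1 - sum_i p(i)^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left_labels, right_labels):
    """Impurity of a binary split: size-weighted average over the two branches."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) \
        + (len(right_labels) / n) * gini_impurity(right_labels)

def best_threshold(values, labels):
    """Return (threshold, impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped so that the right branch is never empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best
```

For example, `gini_impurity(["a", "a", "b", "b"])` evaluates to 0.5, the maximum for two classes, while a pure node such as `["a", "a"]` has impurity 0.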

Ensemble learning is a ML technique that combines several classifiers (e.g., decision trees) into a meta-classifier in order to improve the generalization performance and robustness over each individual classifier alone.

Bagging

Bootstrap aggregation [37], also called bagging, is an ensemble algorithm which fits different instances of the base classifier, each on a random subset of the original training set, and then combines their individual predictions using a majority voting process. Majority voting means that the selected class label is the one predicted by the majority of the classifier’s instances.

Figure 2.2: Bagging (the individual predictions are combined by majority voting).

By using different sets of training data and introducing randomization into the algorithm, the variance of the combined model is reduced. Therefore, the performance on the test data improves and the overfitting of the base estimators is mitigated.
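A minimal sketch of bagging follows, assuming a single numeric feature and a one-threshold classifier (a "stump") as the base estimator. These choices, the function names and the toy data are illustrative assumptions, not the configuration used in this work.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D threshold classifier that predicts the majority label on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def fit_bagging(xs, ys, n_estimators=25, seed=0):
    """Fit each base classifier on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_majority(models, x):
    """Combine the individual predictions by majority voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (assumption): class 1 for feature values above 2.0.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = fit_bagging(xs, ys)
```

Each stump sees a slightly different training set, so their thresholds differ; the vote averages out these differences.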

Random Forest

Random Forest is a bagging-based algorithm that creates a collection of decision trees, i.e., a forest. However, in addition to sampling random subsets of the original training set, only a random subset of features is considered to build each decision tree. Sampling over features ensures that the different trees do not fit the exact same information and, therefore, reduces the correlation between the different predictions. Thus, random forest combines the bagging technique and the concept of feature subspace selection to improve the model’s performance.

Boosting

As mentioned before, the bagging algorithm can be an effective method to reduce the variance of a model. However, bagging is not well suited to base models that are too simple to capture the trend in the data, meaning that it is ineffective in reducing the model bias.

Boosting, originally proposed by Schapire [38], is an ensemble technique which combines weak models that are no longer fitted independently from each other. The main idea behind boosting is to sequentially fit models by minimizing the errors from the previous models, that is, to let the simple base classifiers learn from misclassified training samples and, consequently, improve the performance of the combined estimator.

Figure 2.3: Boosting.

Gradient Boosting

Gradient boosting, also known as gradient boosting machine (GBM) [39], is a boosting algorithm in which the main purpose is to minimize a loss function by adding sequential weak learners using the gradient descent technique. Gradient descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.

Gradient boosting is a stage-wise additive model that creates learners during the learning process. At each particular iteration, a weak learner is fitted and its predictions are compared with the correct expected outcome. The difference between these values represents the error of the model, and can be used to compute the gradient of the loss function. Then, the gradient value is used to understand the direction in which the model parameters need to be changed in order to minimize the error in the next training iteration.

In gradient boosting, the sample distribution is not modified, because the weak learners train on the pseudo-residuals (the remaining errors between the actual output and the predictions of the ensemble so far). Therefore, the algorithm does not optimize the model parameters directly but the boosted model predictions instead. Moreover, the gradients are added to the training process by fitting the remaining weak models to these values.
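The stage-wise procedure can be sketched as follows. For brevity, the example assumes a squared-error loss on a single numeric feature (a regression setting, unlike the classification task of this work), for which the pseudo-residuals are simply the differences between targets and current predictions. The stump learner, the learning rate and all names are illustrative assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares stump on one feature: one threshold, two constant outputs."""
    best = None
    # The largest value is skipped so that the right side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) \
            + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the pseudo-residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial constant prediction
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For L(y, F) = (y - F)^2 / 2 the negative gradient w.r.t. F is y - F.
        pseudo_residuals = [y - f for y, f in zip(ys, predictions)]
        stump = fit_regression_stump(xs, pseudo_residuals)
        learners.append(stump)
        predictions = [f + learning_rate * stump(x) for x, f in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(learner(x) for learner in learners)

    return predict

def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data (assumption): a noisy step function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
```

Each round moves the ensemble prediction a small step in the direction that reduces the loss, so the training error shrinks as more weak learners are added.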

XGBoost

XGBoost [40] stands for extreme gradient boosting and implements machine learning algorithms under the gradient boosting framework. It is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.

Since XGBoost derives from GBM, there are many similarities between the algorithms and their tuning parameters. Both of them are ensemble methods based on the CART principle using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through system optimization and algorithmic enhancements, characterized by regularization and sparsity awareness.

In the system optimization, it is worth highlighting the parallelization, tree pruning and hardware optimization. XGBoost approaches the process of sequential tree building using a parallelized and distributed implementation. This improves performance by making the learning process faster, which enables quicker model exploration. Regarding tree pruning, XGBoost uses a depth-first approach, which is achieved by defining a new parameter, the maximum tree depth for base learners, instead of the stopping criterion for tree splitting. This approach improves computational performance significantly. For hardware optimization, the algorithm introduces two key concepts: cache awareness, achieved by allocating internal buffers in each thread (where the gradient statistics can be stored), and out-of-core computing, which optimizes the available disk space and maximizes its usage when handling big datasets that do not fit into memory.

XGBoost provides regularization parameters that help to reduce model complexity and to prevent overfitting. The first, gamma, is used to set the minimum reduction in loss required to make a further split on a leaf node of the tree. When this parameter is specified, the algorithm will grow the tree to the maximum depth defined but then prune the tree to find and remove splits that do not meet the specified gamma. Thus, this parameter controls the complexity of a given tree. The other parameters, alpha and lambda, represent L1 and L2 regularization, respectively. L1 regularization adds a penalty which can yield sparse solutions. On the other hand, L2 regularization constrains the coefficient norm and keeps all the variables small.
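The roles of gamma and lambda can be made concrete with the split gain from the XGBoost paper [40], where $G$ and $H$ denote the sums of first- and second-order gradient statistics of the samples on each side of a candidate split. The sketch below is a plain re-implementation of that formula for illustration, not a call into the library; the L1 term (alpha) is omitted for simplicity, and the function names are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf output w* = -G / (H + lambda); larger lambda shrinks the weight."""
    return -grad_sum / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, minus the complexity cost gamma of the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

A split is only worth keeping when its gain is positive: `split_gain(-4, 4, 4, 4, reg_lambda=0, gamma=0)` gives 4.0, whereas the same split with `gamma=5` gives -1.0 and would be pruned.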

XGBoost includes a sparsity-aware split finding algorithm that naturally handles different types of sparsity patterns in the data (e.g., features with missing values) more efficiently.

2.2 Imbalanced Data

A very common challenge faced when trying to perform classification is the class imbalance problem. A dataset is said to be imbalanced if one class (minority class) is relatively rare as compared to the other class (majority class). As a consequence, the classifier can be extremely biased towards the majority class.


In order to overcome the class imbalance problem, several approaches have been proposed. These solutions can be divided into two levels: algorithm level (also referred to as internal) and data level (or external) [41, 42].

At the algorithm level, solutions are based on modifying the classifier learning procedure itself. The aim is to adapt existing classifier learning algorithms to bias the learning toward the minority class [43, 44]. Doing so requires an in-depth understanding of both the corresponding classifier and the application domain.

At the data level, the aim is to rebalance the class distribution by resampling the data space. Sampling methods have become a widely adopted approach for dealing with imbalanced data and improving classification performance. They consist of changing the training set in such a way as to create a more balanced class distribution. This is achieved by removing samples from the majority class (undersampling) and/or adding more examples from the minority class (oversampling). In the first case, the major drawback is that potentially useful information will be discarded, which can damage the induction process. Undersampling techniques will not be used in this work, so the focus remains on the oversampling ones.

2.2.1 Oversampling

Random oversampling

This technique balances the class distribution by randomly replicating minority class instances. Random oversampling may increase overfitting, since it generates exact copies of the original data points [45].

Synthetic minority oversampling technique (SMOTE)

SMOTE [46] is a widely used sampling algorithm which forms new synthetic minority class samples to oversample this class. For every minority example, new samples are generated by interpolating between the example and its k (typically set to 5) nearest neighbors of the same class.
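The interpolation step can be sketched as follows. This is a minimal illustration using only the standard library, not the implementation used in this work; the function name `smote_sample`, the toy points and the choice k=3 are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority example and one of its neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical two-dimensional minority class
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 2.0)]
rng = random.Random(0)
synthetic = [smote_sample(minority_points, k=3, rng=rng) for _ in range(10)]
```

Every synthetic point lies on a line segment between two existing minority points. The majority class plays no role in the procedure.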

As a drawback, the problem of overfitting in the SMOTE algorithm is largely attributed to the way in which it creates synthetic samples. Specifically, the algorithm generates new samples without consideration of neighboring examples, thereby generalizing the minority area without regard to the majority class. Thus, this strategy leads to an overlap of the classes and the generation of noisy points.

Due to these limitations, another data level strategy which generates more representative samples and uses intrinsic properties of the data was considered.

Gaussian mixture model (GMM)

The GMM is a probabilistic model, commonly used for clustering and density estimation, that assumes that all the data points are generated from a mixture of Gaussian distributions. This assumption is not a limitation, because a GMM with enough components can approximate any continuous probability density arbitrarily well [47].

Generating samples As shown in [48, Sec. 9.2], the formulation of Gaussian mixtures can be written in terms of discrete latent variables, i.e., variables that are inferred from observed data and whose values each characterize a given cluster. Usually, these latent variables are denoted as z. For each cluster, a Gaussian distribution is defined. From the distribution’s parameters, it is possible to compute the probability of a random variable x given the value of the latent variable: p(x|z).

The probability distribution of the latent variables can be inferred from the dataset given the proportion of data points belonging to each cluster. Thus, it is possible to compute p(z).

In order to generate random samples distributed according to the GMM, the ancestral sampling technique is used. Firstly, one samples from the marginal distribution p(z), which generates a value for the latent variable, denoted ẑ, and then samples from the conditional distribution p(x|ẑ) to finally generate a value for x. This process is repeated as many times as the number of samples needed.
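The two sampling steps can be sketched for a one-dimensional mixture using only the standard library. The mixture weights, means and standard deviations below are hypothetical values chosen for illustration, and `ancestral_sample` is a name introduced here, not one from the thesis code.

```python
import random


def ancestral_sample(weights, means, stds, n_samples, rng=None):
    """Draw (z_hat, x) pairs from a 1-D Gaussian mixture by ancestral sampling."""
    rng = rng or random.Random()
    samples = []
    for _ in range(n_samples):
        # Step 1: sample the latent cluster index from p(z)
        z_hat = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Step 2: sample x from the conditional Gaussian p(x | z_hat)
        x = rng.gauss(means[z_hat], stds[z_hat])
        samples.append((z_hat, x))
    return samples


# Hypothetical mixture with two well-separated components
weights = [0.7, 0.3]
means = [-5.0, 5.0]
stds = [0.5, 0.5]
draws = ancestral_sample(weights, means, stds, n_samples=2000, rng=random.Random(42))
share_first = sum(1 for z, _ in draws if z == 0) / len(draws)
```

With enough draws, the share of samples coming from each component approaches the corresponding mixture weight, here roughly 0.7 for the first component.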

Parameters A Gaussian distribution is fully specified by its mean vector and covariance matrix. In two dimensions, the parameters are defined as follows:

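$$\mathcal{N} \sim \left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} \right) \tag{2.2}$$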

The mean vector controls the Gaussian’s location in space, where µ₁ and µ₂ define the position along each dimension. On the other hand, the covariance matrix determines the direction and length of the axes that define its density contours. The terms along the diagonal, σ²₁ and σ²₂, represent the spread along each dimension, while the remaining terms, σ₁₂ and σ₂₁, define the correlation structure of the distribution. These last two terms have the same value, meaning that the covariance matrix is symmetric.

Figure 2.4: Demonstration of covariance types for GMM. Given a dataset with three different classes, the goal is to cluster the classes using the GMM and compare the results obtained using different types of covariance matrices. This example was generated using the code presented in [49].

As illustrated in Figure 2.4, the covariance matrices of the GMM can be of different types: i) spherical: each component has its own single variance, which means that the contours have a spherical shape; ii) diagonal: each component has its own diagonal covariance matrix, meaning that its axes are oriented along the coordinate axes and the contours have an ellipsoidal shape; iii) tied: all components share the same general covariance matrix or, in other words, the components have the same shape; iv) full: each component has its own general covariance matrix, meaning that the components may independently adopt any position and shape.


Number of components Figure 2.5 illustrates the importance of choosing an appropriate number of mixture components to generate meaningful samples and avoid overfitting. In the context of the “S” shaped dataset used in the example, Figure 2.5(a) demonstrates that using a GMM with three components underfits the data and, consequently, the generated samples, Figure 2.5(d), poorly represent the original data points. On the other hand, Figure 2.5(c) is a good example of overfitting. The model learns the detail in the data points and the generated samples, Figure 2.5(f), reflect this effect. Thus, it is necessary to carefully establish the number of components of the mixture so as not to compromise the performance of the model.

(a) 3 components. (b) 10 components. (c) 30 components.

(d) Generated samples with 3 components.

(e) Generated samples with 10 components.

(f) Generated samples with 30 components.

Figure 2.5: Examples of GMM and the influence of the number of components to fit the data and generate new samples. The GMMs used in each one of three scenarios have full covariance matrices.

Model selection criteria Probabilistic model selection, also referred to as information criteria, provides an analytical technique for scoring and choosing among candidate models. There are several model selection schemes in the statistics literature, such as the Akaike information criterion (AIC) [50] and the Bayesian information criterion (BIC) [51]. These measures attempt to score: i) the model performance, that is, how well a model has performed on the training set; ii) the model complexity, evaluated using the number of parameters in the model.

The AIC and BIC are defined as:

$$\mathrm{AIC} = -2 \log L + 2K, \tag{2.3}$$

$$\mathrm{BIC} = -2 \log L + K \log N, \tag{2.4}$$

where L is the value of the likelihood, K is the number of estimated parameters and N is the number of samples. In both AIC and BIC, the lower the value, the better the model.

When comparing both techniques, the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset and select more complex models. On the other hand, BIC penalizes the model more heavily for its complexity, which means that more complex models have worse scores and simpler ones are selected.
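To make the difference in penalties concrete, the sketch below scores three candidate mixtures using the standard definitions AIC = −2 log L + 2K and BIC = −2 log L + K log N. The log-likelihood values and the sample size are hypothetical, chosen only to show a case where the two criteria disagree. The parameter counts assume a two-dimensional GMM with full covariance matrices, which has 6M − 1 free parameters for M components.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


# Hypothetical candidates: (number of components, log-likelihood, number of parameters)
candidates = [(3, -1200.0, 17), (10, -1100.0, 59), (30, -1050.0, 179)]
n_samples = 500

best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_samples))[0]
```

For these made-up numbers, AIC selects the 10-component model while BIC selects the 3-component one, because the BIC penalty grows with log N and is larger than the AIC penalty whenever N exceeds about 7 samples.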

2.2.2 Evaluation Metrics

The evaluation criterion is a key factor in assessing the classification performance of the model. A common method for determining the performance of a classifier is through the use of the confusion matrix (Figure 2.6).

| | Predicted Negative | Predicted Positive |
|---|---|---|
| True Negative | True Negatives (TN) | False Positives (FP) |
| True Positive | False Negatives (FN) | True Positives (TP) |

Figure 2.6: Confusion matrix for a binary classification problem.

In a confusion matrix, true negatives (TN) is the number of negative instances correctly classified as negative, false negatives (FN) is the number of positive instances incorrectly classified as negative, false positives (FP) is the number of negative instances incorrectly classified as positive and true positives (TP) is the number of positive instances correctly classified as positive.

The most common evaluation metrics derive from the confusion matrix. Typically, the most often used metric is accuracy:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.5}$$

In the framework of imbalanced datasets, the evaluation of the classifiers’ performance must take into account the class distribution. For this reason, accuracy may produce a biased illusion on imbalanced data and, consequently, it is not a good metric for measuring the performance of classifiers in these cases. Alternatively, other metrics can be computed from the confusion matrix. Some of these metrics are summarized in Table 2.1.

Table 2.1: Classification performance metrics based on the confusion matrix.

| Metric | Formula |
|---|---|
| precision | $\frac{TP}{TP+FP}$ |
| recall, sensitivity, true positive rate (TPR) | $\frac{TP}{TP+FN}$ |
| false positive rate (FPR) | $\frac{FP}{FP+TN}$ |
| specificity | $\frac{TN}{FP+TN} = 1 - FPR$ |
| geometric mean (G-mean) | $\sqrt{\text{sensitivity} \cdot \text{specificity}}$ |
| F1 score | $2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ |


From Table 2.1, it is possible to understand that precision quantifies the fraction of positive class predictions that actually belong to the positive class, while recall measures how often a positive class instance in the dataset was predicted as a positive class instance by the classifier. In other words, and in the context of this problem, precision measures what proportion of predicted missed connections was actually correct and recall measures what proportion of real missed connections was correctly identified. The G-mean measures the balance of the classification performance over the negative and positive classes. A low G-mean value is an indicator of poor classifier performance. The F1 score is the harmonic mean of precision and recall.
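The metrics of Table 2.1 can be computed directly from the four confusion matrix counts. The sketch below uses hypothetical counts for an imbalanced problem (50 positives among 1000 samples) to show how accuracy can look high while recall, G-mean and F1 score reveal poor performance on the minority class.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Metrics derived from the confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 1000 connections, of which 50 were missed (positive class)
metrics = classification_metrics(tp=10, fp=5, tn=945, fn=40)
```

Here the accuracy is 0.955 even though only 10 of the 50 positive instances are identified (recall of 0.2).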

Instead of simply predicting a sample as positive or negative, there are classifiers, called scoring classifiers, which give a numeric score for an instance to be classified in the positive or negative class. Therefore, instances with a higher score are more likely to be classified as positive. The classifications are made by applying a threshold to the score. The choice of this threshold determines the trade-off between false positives and false negatives [52].

A commonly used measure for evaluating the predictive performance of scoring classifiers is the receiver operating characteristic (ROC) curve. A ROC curve is a graphical evaluation metric which does not depend on a specific threshold.

In this graph, the x-axis represents the FPR and the y-axis the TPR. To build the plot, the instances are ordered according to the decreasing score value of being positive and then the threshold is varied from the highest score (most restrictive) to the lowest one (least restrictive). For each threshold value, there is one possible point in the ROC space, based on the obtained values of FPR and TPR for that threshold. These points can be interpolated to approximate the curve [53].

In ROC space, a good performance should be as close to the upper left corner as possible (see Figure 2.7(a)). This point, FPR=0 and TPR=1, corresponds to the perfect classification. The lower left corner (i.e., the origin of the graph) corresponds to a classifier always predicting the negative class, whereas the upper right corner corresponds to one always predicting the positive class. The diagonal which connects these two points indicates the random performance. Points below this diagonal indicate a performance worse than random, hence all the points in the ROC curve should lie above this line.

The area under the curve (AUC) is a measure used to evaluate the overall performance of scoring classifiers. AUC_ROC can be interpreted as the probability that the model will rank a randomly chosen positive sample higher than a randomly chosen negative one. It ranges in value from 0 to 1.

The AUC_ROC of the random performance is 0.5, so it is expected that for any useful classifier this value is higher than 0.5.
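The probabilistic interpretation of AUC_ROC suggests a direct, if inefficient, way of computing it: compare every positive sample with every negative one and count how often the positive is ranked higher, counting ties as one half. The scores and labels below are hypothetical.

```python
def auc_roc_by_ranking(scores, labels):
    """AUC_ROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
auc = auc_roc_by_ranking(scores, labels)
```

In this example 12 of the 16 positive-negative pairs are ordered correctly, giving an AUC_ROC of 0.75. The pairwise loop is quadratic in the number of samples, so it is only meant as an illustration of the definition.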

Although widely used to evaluate classifiers in the presence of imbalanced data, some researchers argue that the ROC curve may be deceptive with respect to conclusions about the reliability of classification performance [54]. Precision-recall (PR) curves, on the other hand, give a more informative picture of the classification performance in this setting.

The PR curves (Figure 2.7(b)) can be obtained in a similar way to what is done for the ROC curves, but using recall on the x-axis and precision on the y-axis. In PR space, good classifiers should be as close as possible to the upper right corner, since this is the point which represents the best precision and recall trade-off. The random performance is defined as the ratio of the number of positive class samples over all the samples.

Figure 2.7: Examples of ROC and PR curves. (a) ROC curve: true positive rate against false positive rate, both from 0 to 1, with the random performance line. (b) PR curve: precision against recall, both from 0 to 1, with the random performance line.

As in the ROC space, it is also possible to compute the area under the PR curve, AUC_PR. However, in PR space this value does not have a probabilistic interpretation as in ROC space. The AUC_PR value of the random classifier is expected to be close to the ratio of positive samples in the test set.

2.3 Encoders

Many statistical and machine learning algorithms require all variables to be numeric. This means that if the dataset contains categorical data, one must encode it to numeric values before fitting and evaluating a model. This process is called categorical encoding.

There are many different approaches for handling categorical variables. Two of the most widely used techniques are ordinal encoding and one-hot encoding.

In ordinal encoding, each category is assigned an integer value from 0 to N−1 (N is the number of categories for the feature). This results in a single column of integers per feature. For example, if a dataset feature was “colour” and the values were “blue”, “green” and “red”, the encoding values would be 0, 1 and 2, respectively. In this scenario, the colour names do not have a rank order, but once the encoding is performed, a learning algorithm would assume an ordering between colours, such as red being larger than green, and green larger than blue. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. This type of encoding is really only appropriate for features with a known order between the categories.

To overcome this problem, one-hot encoding is commonly used. The idea behind this approach is to create multidimensional features based on the number of unique values in the categorical feature.

Given the same “colour” feature from the previous example, binary values can be used to indicate the particular colour of a sample, i.e., a red sample can be encoded as red=1, green=0, blue=0. Thus, each sample in the dataset is replaced with a vector and, in this example, one column becomes three.

This technique becomes very difficult to handle for high cardinality categorical variables, since it generates too many features.
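Both encodings can be sketched in a few lines for the “colour” example. The helper names are hypothetical, and the categories are ordered alphabetically so that blue, green and red map to 0, 1 and 2 as in the text.

```python
def ordinal_encode(values):
    """Map each category to an integer from 0 to N-1 (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Replace each value with a binary vector, one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


colours = ["red", "blue", "green", "red"]
ordinal, mapping = ordinal_encode(colours)
one_hot, columns = one_hot_encode(colours)
```

The ordinal version keeps a single column, while the one-hot version produces one column per distinct category. For a feature with hundreds of distinct values this means hundreds of new columns.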

An alternative and more informative encoding method is target encoding, which transforms the categorical variables into quasi-continuous numerical data. The general idea of this technique is to replace each categorical value with a blend of the posterior probability of the target given a particular categorical value, p(Y|X = x_i), and the prior probability of the target over all the training data, p(Y). The blending is controlled by a regularization parameter that depends on the sample size:

$$S_i = p(Y \mid X = x_i)\,\lambda(N_i) + p(Y)\,\bigl(1 - \lambda(N_i)\bigr) \tag{2.6}$$

where N_i is the size of the sample {X | X = x_i} and λ is a function of the sample size. The larger the sample size, the more the estimate is weighted towards the posterior probability of the target given X = x_i. On the other hand, the smaller the sample size, the more the estimate is weighted towards the prior probability of the target [55].
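A minimal sketch of Equation (2.6) for a binary target is given below. The text does not fix the form of λ; the sigmoid λ(n) = 1 / (1 + exp(−(n − k) / f)) used here is an assumption, as are the values of k and f, the category codes and the data.

```python
import math
from collections import defaultdict


def blend_weight(n, k=20.0, f=10.0):
    """Assumed sigmoid lambda(n): near 0 for rare categories, near 1 for frequent ones."""
    return 1.0 / (1.0 + math.exp(-(n - k) / f))


def target_encode(categories, targets, k=20.0, f=10.0):
    """Blend the per-category target mean with the global target mean."""
    prior = sum(targets) / len(targets)  # p(Y)
    counts = defaultdict(int)
    positives = defaultdict(int)
    for c, y in zip(categories, targets):
        counts[c] += 1
        positives[c] += y
    encoding = {}
    for c, n in counts.items():
        posterior = positives[c] / n  # p(Y | X = x_i)
        lam = blend_weight(n, k, f)
        encoding[c] = posterior * lam + prior * (1.0 - lam)
    return encoding, prior


# Hypothetical data: one frequent category and one rare category
categories = ["OPO"] * 200 + ["XYZ"] * 2
targets = [1] * 20 + [0] * 180 + [1, 1]
encoding, prior = target_encode(categories, targets)
```

The frequent category is encoded essentially as its own target rate (0.1), while the rare category, observed only twice, is pulled strongly towards the prior instead of being encoded as 1.0. In practice the encoding would be fitted on training data only, to avoid leaking the target into the features.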

2.4 Shapley Additive Explanations

Limited by the interpretability of the ML model, in some cases it is not possible to fully understand the complex and nonlinear relationships between the features, as well as the impact of their variations on the model’s output. To address this problem, the Shapley additive explanations (SHAP) framework is used.

SHAP, proposed in [56], is a method used to explain the prediction of a certain instance by evaluating the contribution and importance of each feature to the results. This is achieved by computing the Shapley values, a concept from cooperative game theory introduced by Shapley [57]. Cooperative game theory assumes that groups of players are the key elements of decision-making, leading to a need for cooperative behavior. The Shapley value solution distributes, in each cooperative game, the total gains among the players.

In the ML context, the game can be formulated as the prediction of each instance. Thus, the total gains are considered to be the prediction value for a certain instance and the game players are the model features of that instance. The collaborative game can be seen as all of the model features cooperating with each other to form a prediction value and, consequently, the Shapley value of a feature measures how much it contributes, either positively or negatively, to each prediction.
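The idea can be illustrated by computing exact Shapley values for a tiny game: each feature's value is its marginal contribution to the prediction, averaged over all orderings in which the features can join the coalition. The toy prediction function and feature names below are hypothetical, and this brute-force enumeration is only feasible for a handful of features; the SHAP framework relies on more efficient estimation.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_prediction(features_present):
    """Hypothetical model: baseline 0.10, two additive effects and one interaction."""
    out = 0.10
    if "short_layover" in features_present:
        out += 0.30
    if "passport_control" in features_present:
        out += 0.10
    if {"short_layover", "passport_control"} <= features_present:
        out += 0.20
    return out


features = ["short_layover", "passport_control", "checked_bags"]
phi = shapley_values(features, toy_prediction)
```

The interaction term is shared equally between the two features involved, the irrelevant feature receives a value of zero, and the values sum to the difference between the full prediction and the baseline.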


Chapter 3

Data Analysis and Preparation

In this chapter, the different datasets used are prepared and analysed. Firstly, Section 3.1 introduces the data used in this work. Secondly, the data analysis is presented in Section 3.2 and, lastly, Section 3.3 describes the data preprocessing steps.

3.1 Data Preparation

Historical data for all of 2019 and the first two months of 2020, together with information about the passport control bottleneck, was provided by TAP for the development of this thesis. The data consists of three different datasets: i) Serviço de Estrangeiros e Fronteiras (SEF), ii) Hub and iii) Pax.

Hub dataset includes all the information about the departure and arrival flights such as the scheduled arrival/departure time, gate number, etc. The Pax (short for passenger) dataset contains all the passenger’s travel information, e.g., their travel class and connection status. Lastly, SEF (border control service) dataset associates each pair of arrival/departure airports with a binary variable that indicates if the passenger has to pass through the passport control or not.

All these datasets are stored in independent files in comma-separated values (CSV) format.

Both the Hub and Pax datasets represent big volumes of data (millions of samples), therefore it is important to analyse them separately before combining all of them into a final dataset. The SEF dataset itself does not provide much information, so a detailed analysis is not necessary. However, when associated with the Pax dataset, SEF data may generate very interesting results. For this reason, before proceeding with further analysis, a new column was added to the Pax data to include this information.

As previously mentioned, the Pax dataset contains the passenger’s travel information, so each row represents a passenger’s connection and includes the International Air Transport Association (IATA) code of the arrival and departure airports. With these codes one must go through the SEF dataset until a match of codes is found and simply store the binary value into the new Pax’s column. This process is repeated for the entire Pax dataset.
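The matching step can be sketched as follows. Instead of the row-by-row search described above, this version indexes the SEF dataset once in a dictionary keyed by the pair of airport codes, which gives the same result with a single lookup per passenger row. The column names (`from`, `to`, `sef`, `From`, `To`, `SEF`) and the example rows are assumptions made for illustration, not the actual schema or values.

```python
def add_sef_column(pax_rows, sef_rows):
    """Attach the SEF flag to each passenger row by (arrival, departure) airport pair."""
    # Index the SEF table once so that each passenger row is a single lookup
    sef_lookup = {(row["from"], row["to"]): row["sef"] for row in sef_rows}
    for row in pax_rows:
        # None marks airport pairs that are absent from the SEF table
        row["SEF"] = sef_lookup.get((row["From"], row["To"]))
    return pax_rows


# Hypothetical rows; the flag values are made up for the example
sef_rows = [
    {"from": "OPO", "to": "GRU", "sef": 1},
    {"from": "OPO", "to": "MAD", "sef": 0},
]
pax_rows = [
    {"From": "OPO", "To": "GRU"},
    {"From": "OPO", "To": "MAD"},
    {"From": "FNC", "To": "GRU"},
]
result = add_sef_column(pax_rows, sef_rows)
```

With millions of passenger rows, avoiding a scan of the SEF table for every row keeps the preparation step fast.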


3.2 Exploratory Data Analysis

Exploratory data analysis (EDA) refers to a set of techniques originally developed by Tukey [58]. The EDA is about analysing datasets to detect and summarize their main characteristics, patterns and trends, detecting data noise, checking assumptions and selecting data models [59]. This concept is strongly associated with the use of visual methods and graphical representations of data [60].

Due to the number of variables in both datasets and the large number of different exploratory analyses that can be done, this section only presents the information most relevant to the overall understanding of the data and the decisions made throughout this work.

3.2.1 Pax Dataset

The Pax dataset consists of 5034919 samples (rows) and 17 features (columns).

Arrival Flights

The feature ‘From’, which indicates the International Air Transport Association (IATA) airport code of the arrival flights, has high cardinality: 128 distinct values. Within these values, the most common airport code is “OPO” (Porto, Portugal) with 369426 samples, i.e., a frequency of 7.3%.

In addition to the airport code, there is another fundamental feature regarding the arrival flights: the flight number. The flight number is the designation of a flight. It is a code consisting of a two-character airline designator and a 1 to 4 digit number. The airline designator is the two-character code assigned by IATA. Designators are used to identify an airline for all commercial purposes. Two-character designators consist of two alpha characters, one alpha with one numeric, or one numeric with one alpha [61]. Strictly speaking, the term “flight number” refers only to the numeric part of a flight code, but even within the airline and airport industry it is colloquially used to mean “flight designator”, which is the official term defined in the Standard Schedules Information Manual (SSIM) published by IATA.

The flight designator feature for the arrival flights is called ‘TP from’ and has high cardinality: 642 distinct values.

As shown in Table 3.1, there are flight designators from airlines other than TAP. The samples whose arrival flight belongs to another airline are included in the dataset since they also correspond to connecting passengers (with a TAP departure flight).

Table 3.1: Arrival flights’ airlines.

| | # Observations |
|---|---|
| Airlines | 59 |
| TAP | 4915963 |
| TAP (%) | 97.637 |
| non-TAP | 118956 |
| non-TAP (%) | 2.363 |

Regarding the non-TAP airlines, Figure 3.1 presents the absolute and relative number of observations in the dataset for each airline.

Figure 3.1: Airlines with more than 1000 observations in the Pax dataset. Bar chart titled “Number of observations of non-TAP airlines”; x-axis: IATA airline designator; y-axis: number of observations (absolute value).

As shown in Figure 3.2, there are 16 different types of classes. However, some of them correspond to variations of the economy and business classes. For instance, the two most common classes are the same, but one of them contains a space character, causing the samples to be associated with different classes.

This feature, ‘Class from’, has 527800 (10.5%) missing values which is considered to be a significant number.

Figure 3.2: Feature ‘Class from’ analysis. Bar chart titled “Number of observations per class”; axes: class and number of observations (absolute value).

Departure Flights

The feature ‘To’, which is equivalent to the feature ‘From’ but with destination information, has 110 distinct values. The most common value is “OPO” (Porto, Portugal) with a frequency of 8.6%, which represents 434284 samples.

Contrary to what happens with the arrival flights, in the departures only TAP flights are recorded in the dataset. There are 344 distinct flight designators and the feature containing these values is the feature ‘TP to’.

Figure 3.3: Feature ‘Class to’ analysis. Bar chart titled “Number of observations per class”; axes: class and number of observations (absolute value).

Similarly to the arrival flights, there is a feature with information regarding the class in which the passengers travel, ‘Class to’ (Figure 3.3). The overall distribution of the different classes is the same as in ‘Class from’, however the number of missing values is zero.

Arrival/Departure Date

There are 427 different dates for the arrival flights and 425 for the departures. The difference is due to the fact that there are samples where the departure flight is on January 1st, 2019 and the respective arrival date is December 31st, 2018. The other extra date in the arrivals is ‘31-12-9999’, which corresponds to missing values.
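When parsing these features, the placeholder date can be converted into an explicit missing value so that it does not distort counts or date ranges. The sketch below assumes the dates are stored as day-month-year strings, as suggested by the ‘31-12-9999’ placeholder; the function name and the sample values are hypothetical.

```python
from datetime import date, datetime

# Placeholder used in the arrival dates to mark missing values
MISSING_SENTINEL = "31-12-9999"


def parse_flight_date(text):
    """Parse a DD-MM-YYYY string, returning None for the missing-value placeholder."""
    if text == MISSING_SENTINEL:
        return None
    return datetime.strptime(text, "%d-%m-%Y").date()


# Hypothetical raw arrival dates
raw_arrivals = ["31-12-2018", "01-01-2019", "31-12-9999"]
parsed = [parse_flight_date(t) for t in raw_arrivals]
n_missing = sum(1 for d in parsed if d is None)
```

Counting the `None` entries per month would then reproduce the kind of missing-value summary discussed in this section.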

Figure 3.4 shows the number of connecting passengers throughout the months for arrival and departure flights. Since each row of the dataset contains the arrival and departure flight dates, and in most cases they occur on the same day, it would be expected that these features had a similar trend. However, as shown in Figure 3.4, this does not happen. In fact, this graph makes it possible to understand the behaviour of the arrivals’ missing values (i.e., values represented by ‘31-12-9999’) throughout the months.

As can be seen in the graph, especially between April and October, there is a considerable number of missing values. Table 3.2 summarizes the number of missing data in both features.

Table 3.2: Missing values of the arrival and departure dates.

| | # Observations | Frequency |
|---|---|---|
| Arrivals | 468093 | 9.3% |
| Departures | 0 | 0 |

Figure 3.4: Connecting passengers throughout months. Line chart titled “Number of connecting passengers throughout the months”; x-axis: date; y-axis: number of passengers; series: arrivals and departures.

SEF

As already stated, when associated with the Pax dataset, SEF data allows for very interesting analyses such as the graph in Figure 3.5.

Figure 3.5: Analysis of the feature ‘SEF’ throughout months. Chart titled “Number of connecting passengers passing through SEF throughout the months”; x-axis: departure date; y-axis: number of passengers; one series per SEF value.

SEF equal to 1 means that the passenger has to pass through the passport control, and SEF equal to zero indicates otherwise. This graph can be useful to infer the density of connecting passengers at the airport and the influx of passengers in the SEF bottleneck throughout the months.

Binary Features

Figure 3.6 shows the remaining passenger’s information. The values are presented in percentage and, for the sake of space, NA stands for missing data.

‘Is Group’ indicates if the passenger is travelling within a group (represented by 1) or not (represented by 0).

The feature ‘Age’ has two possible values: adult or child. In Figure 3.6 these values are referred to as A or C, respectively.

Of the 8 features presented, ‘Sex’ is the only one with missing data. In this feature, M indicates male passengers and F represents female ones. As can be seen, these two classes are balanced, contrary to what happens in the other features.

‘Check Bags’ refers to the baggage of which the carrier takes sole custody and for which the carrier has issued a baggage check. It takes a value of 1 if the baggage was checked and 0 otherwise.

Figure 3.6: Binary features of the Pax dataset. Bar charts of the frequency (in percentage) of each value, with NA denoting missing data, for the features ‘Is Group’, ‘Age’, ‘Sex’, ‘Check Bags’, ‘Connection Status’, ‘Overnight’, ‘Pax Boarding’ and ‘SEF’.

The feature which indicates whether a passenger missed the connection or not is the ‘Connection Status’. The graph indicates that 19.4% of the connections were missed and 80.6% were successful.

Whenever a passenger misses the connecting flight and the airline has to pay for the stay, the ‘Overnight’ value is equal to 1. As can be seen in the graph, these cases are rare events. Note that an ‘Overnight’ value of 1 implies ‘Connection Status’ equal to 0, but the opposite does not apply.

The ‘Pax Boarding’ feature indicates if the passenger boarded (represented by 1) or not. It differs from the ‘Connection Status’ in the sense that if a passenger has missed the connection (‘Connection Status’ and ‘Pax Boarding’ both equal to 0) then there is another sample in the dataset associated with this passenger. In this second connection, assuming that the passenger does not miss the connecting flight, ‘Pax Boarding’ is 1 but the ‘Connection Status’ remains 0 because the original connection was missed.

3.2.2 Hub Dataset

The Hub dataset contains 108 features and 345035 samples. Of these samples, 174030 (50.4%) relate to arrivals and 171005 (49.6%) to departures.

Note that Terminal 2 is only for low cost departures and all arrivals go to Terminal 1, that is, Terminal 2 is not used in this analysis.
