Data Analysis and Preparation

3.2 Exploratory Data Analysis

3.2.1 Pax Dataset

The Pax dataset consists of 5034919 samples (rows) and 17 features (columns).

Arrival Flights

The feature ‘From’, which holds the airport code of the origin of each arrival flight, has high cardinality: 128 distinct values. The values are three-letter IATA codes (International Civil Aviation Organization (ICAO) codes have four letters). The most common airport code is “OPO” (Porto, Portugal) with 369426 samples, i.e., a frequency of 7.3%.
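As an illustration of how such figures can be obtained, the following minimal sketch computes the cardinality of a categorical feature together with its most common value and relative frequency. The function name and the toy values are hypothetical and only stand in for the real ‘From’ column.

```python
from collections import Counter

def cardinality_and_mode(values):
    """Return (number of distinct values, most common value, its relative frequency).

    Assumes `values` is a non-empty list.
    """
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return len(counts), top_value, top_count / len(values)

# Toy stand-in for the 'From' column (hypothetical values, not the real data).
origins = ["OPO", "FAO", "OPO", "MAD", "OPO", "FNC", "MAD", "GRU"]
n_distinct, top, share = cardinality_and_mode(origins)
print(n_distinct, top, f"{share:.1%}")
```

Applied to the full column, the same computation would be expected to return the 128 distinct values and the 7.3% share of “OPO” reported above.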

In addition to the airport code, there is another fundamental feature regarding the arrival flights: the flight number. Strictly speaking, the flight number is the numeric part of a flight code, which consists of a two-character airline designator followed by a 1 to 4 digit number. The airline designator is the two-character code assigned by IATA and is used to identify an airline for all commercial purposes. A two-character designator consists of two alpha characters, one alpha with one numeric, or one numeric with one alpha [61]. Even within the airline and airport industry, “flight number” is used colloquially for the whole code, whose official name is “flight designator”, as defined in the Standard Schedules Information Manual (SSIM) published by IATA.
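The structure of a flight designator described above can be sketched with a regular expression. This is only an illustration under the stated format (two-character airline designator plus a 1 to 4 digit number). The function name and the example codes are hypothetical, and any additional elements that real designators may carry are not handled.

```python
import re

# Two-character airline designator (two letters, letter + digit, or digit + letter)
# followed by a 1 to 4 digit flight number.
FLIGHT_DESIGNATOR = re.compile(r"([A-Z]{2}|[A-Z][0-9]|[0-9][A-Z])([0-9]{1,4})")

def split_flight_designator(code):
    """Return (airline designator, flight number), or None if the code does not match."""
    match = FLIGHT_DESIGNATOR.fullmatch(code.strip().upper())
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# Hypothetical example codes.
print(split_flight_designator("TP1234"))
print(split_flight_designator("U2765"))
print(split_flight_designator("TP12345"))  # too many digits, so no match
```

Splitting the designator in this way makes it possible to group samples by airline, which is the basis of the comparison between TAP and non-TAP arrival flights.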

The flight designator feature for the arrival flights is called ‘TP from’ and has high cardinality: 642 distinct values.

As shown in Table 3.1, there are flight designators from airlines other than TAP. Samples whose arrival flight belongs to another airline are included in the dataset because these passengers are also considered connecting passengers (their departure flight is operated by TAP).

Table 3.1: Arrival flights’ airlines.

                 # Observations
Airlines         59
TAP              4915963
TAP (%)          97.637
non-TAP          118956
non-TAP (%)      2.363
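The relative shares follow directly from the absolute counts. The short check below recomputes them from the counts reported for the Pax dataset; the variable names are illustrative only.

```python
total = 5_034_919    # samples in the Pax dataset
tap = 4_915_963      # samples whose arrival flight is a TAP flight
non_tap = 118_956    # samples whose arrival flight belongs to another airline

# The two groups partition the dataset, so the shares must add up to 100%.
assert tap + non_tap == total

tap_share = 100 * tap / total
non_tap_share = 100 * non_tap / total
print(f"TAP: {tap_share:.3f}%  non-TAP: {non_tap_share:.3f}%")
```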

Regarding the non-TAP airlines, Figure 3.1 presents the absolute and relative number of observations in the dataset for each airline.

Figure 3.1: Airlines with more than 1000 observations in the Pax dataset.

As shown in Figure 3.2, there are 16 different classes. However, some of them are variations of the economy and business classes. For instance, the two most common classes are the same, but one of them contains a space character, so the corresponding samples are assigned to different classes.

This feature, ‘Class from’, has 527800 (10.5%) missing values, which is considered a significant number.
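A minimal sketch of how such label variations could be merged is given below. It assumes that the only difference between duplicated classes is whitespace, as in the example above. The label values are hypothetical, and treating an empty string as missing is an additional assumption.

```python
from collections import Counter

def normalise_class(label):
    """Remove all whitespace from a class label; return None for missing labels."""
    if label is None:
        return None
    cleaned = "".join(label.split())
    return cleaned or None  # assumption: an empty label counts as missing

# Hypothetical raw labels: "Y " and " C" differ from "Y" and "C" only by a space.
raw = ["Y", "Y ", "C", None, " C", "Y", ""]
counts = Counter(normalise_class(value) for value in raw)
missing_share = counts[None] / len(raw)
print(counts, f"{missing_share:.1%}")
```

After normalisation, samples that previously fell into separate classes are counted together, while missing values remain a category of their own.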

Figure 3.2: Feature ‘Class from’ analysis.

Departure Flights

The feature ‘To’, which is equivalent to the feature ‘From’ but with destination information, has 110 distinct values. The most common value is “OPO” (Porto, Portugal) with a frequency of 8.6%, which represents 434284 samples.

Contrary to what happens with the arrival flights, only TAP flights are recorded for the departures. There are 344 distinct flight designators, stored in the feature ‘TP to’.

Figure 3.3: Feature ‘Class to’ analysis

Similarly to the arrival flights, there is a feature with information regarding the class in which the passengers travel, ‘Class to’ (Figure 3.3). The overall distribution of the classes is the same as in ‘Class from’; however, the number of missing values is zero.

Arrival/Departure Date

There are 427 different dates for the arrival flights and 425 for the departures. One of the two extra arrival dates comes from samples where the departure flight is on January 1st, 2019 and the respective arrival is on December 31st, 2018. The other is ‘31-12-9999’, which stands for missing values.
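The sentinel date has to be treated as missing before any date-based analysis. The sketch below shows one way to do so, assuming that dates are stored as ‘DD-MM-YYYY’ strings, which is inferred from the form of the sentinel and not confirmed for the other values. The function name is hypothetical.

```python
from datetime import date, datetime

MISSING_DATE = "31-12-9999"  # sentinel that marks a missing arrival date

def parse_date(text):
    """Parse a 'DD-MM-YYYY' string; return None for the missing-value sentinel."""
    if text == MISSING_DATE:
        return None
    return datetime.strptime(text, "%d-%m-%Y").date()

# Hypothetical arrival dates, including the sentinel.
arrivals = ["31-12-2018", "01-01-2019", "31-12-9999"]
parsed = [parse_date(a) for a in arrivals]
distinct_real_dates = {d for d in parsed if d is not None}
print(parsed, len(distinct_real_dates))
```

Counting distinct dates after this step excludes the sentinel, so only the real calendar dates are counted.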

Figure 3.4 shows the number of connecting passengers throughout the months for arrival and departure flights. Since each row of the dataset contains the dates of the arrival and departure flights, and in most cases they occur on the same day, these features would be expected to follow a similar trend. However, as shown in Figure 3.4, this is not the case. In fact, the graph reveals how the missing arrival values (i.e., those represented by ‘31-12-9999’) behave throughout the months.

As can be seen in the graph, there is a considerable number of missing values, especially between April and October. Table 3.2 summarizes the number of missing values in both features.

Table 3.2: Missing values of the arrival and departure dates.

              # Observations   Frequency
Arrivals      468093           9.3%
Departures    0                0%
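The monthly comparison between arrivals and departures can be sketched as follows. The rows are hypothetical and missing arrival dates are assumed to be already represented as None. Since the departure date is never missing, it can be used to locate in time the samples whose arrival date is missing.

```python
from collections import Counter
from datetime import date

def monthly_counts(dates):
    """Count samples per (year, month); missing dates are tallied under the key None."""
    return Counter((d.year, d.month) if d is not None else None for d in dates)

# Hypothetical rows: (arrival date, or None if missing; departure date).
rows = [
    (date(2019, 4, 2), date(2019, 4, 2)),
    (None, date(2019, 4, 15)),
    (None, date(2019, 5, 1)),
    (date(2019, 5, 20), date(2019, 5, 20)),
]

arrivals = monthly_counts(a for a, _ in rows)
departures = monthly_counts(d for _, d in rows)
# Missing arrivals placed in time through the departure date of the same row.
missing_by_month = monthly_counts(d for a, d in rows if a is None)
print(arrivals, departures, missing_by_month)
```

In this toy example the monthly gap between the departure and arrival counts equals the number of missing arrival dates in that month, which is the behaviour the arrival and departure curves are meant to expose.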

[Plot: Number of connecting passengers throughout the months. x-axis: Date. y-axis: Number of passengers. Series: Arrivals, Departures.]

Figure 3.4: Connecting passengers throughout months.

SEF

As already stated, when combined with the Pax dataset, the SEF data allows for informative analyses such as the graph in Figure 3.5.

[Plot: Number of connecting passengers passing through SEF throughout the months. x-axis: Departure date. y-axis: Number of passengers. Series: one per SEF value.]

Figure 3.5: Analysis of the feature ‘SEF’ throughout months.

SEF equal to 1 means that the passenger has to pass through passport control, and SEF equal to 0 indicates otherwise. This graph can be useful to infer the density of connecting passengers at the airport and the influx of passengers at the SEF bottleneck throughout the months.

Binary Features

Figure 3.6 shows the remaining passenger information. The values are presented as percentages and, to save space, NA stands for missing data.

‘Is Group’ indicates if the passenger is travelling within a group (represented by 1) or not (represented by 0).

The feature ‘Age’ has two possible values: adult or child. In Figure 3.6 these values are referred to as A or C, respectively.

Of the 8 features presented, ‘Sex’ is the only one with missing data. In this feature, M indicates male passengers and F female ones. As can be seen, these two classes are balanced, contrary to what happens in the other features.

‘Check Bags’ refers to the baggage of which the carrier takes sole custody and for which the carrier has issued a baggage check. It takes a value of 1 if the baggage was checked and 0 otherwise.

Figure 3.6: Binary features of the Pax dataset.

The feature that indicates whether a passenger missed the connection is ‘Connection Status’. The graph indicates that 19.4% of the connections were missed and 80.6% were successful.

Whenever a passenger misses the connecting flight and the airline has to pay for the stay, the ‘Overnight’ value is equal to 1. As can be seen in the graph, these cases are rare. Note that an ‘Overnight’ value of 1 implies ‘Connection Status’ equal to 0, but the converse does not hold.
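This implication can be used as a consistency check on the data. The sketch below is a minimal illustration with hypothetical records, where each record is a pair (‘Connection Status’, ‘Overnight’) and ‘Connection Status’ equal to 1 is taken to mean a successful connection.

```python
def overnight_is_consistent(connection_status, overnight):
    """Return False only when Overnight is 1 although the connection was successful."""
    return not (overnight == 1 and connection_status == 1)

# Hypothetical records: (Connection Status, Overnight).
records = [(1, 0), (0, 0), (0, 1), (1, 1)]
violations = [r for r in records if not overnight_is_consistent(*r)]
print(violations)
```

A missed connection without an overnight stay, i.e. the pair (0, 0), is valid, which reflects the fact that the implication holds in one direction only.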

The ‘Pax Boarding’ feature indicates whether the passenger boarded (represented by 1) or not (represented by 0). It differs from ‘Connection Status’ in that, if a passenger has missed the connection (‘Connection Status’ and ‘Pax Boarding’ both equal to 0), there is another sample in the dataset associated with this passenger. In this second sample, assuming that the passenger does not miss the new connecting flight, ‘Pax Boarding’ is 1 but ‘Connection Status’ remains 0 because the original connection was missed.
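The interplay between the two features can be summarised by mapping each combination of values to its interpretation. The sketch below is illustrative only. The function name and the wording of the labels are hypothetical, and the reading of the (1, 1) case as a regular successful connection is an assumption, since it is not stated explicitly above.

```python
def describe_sample(connection_status, pax_boarding):
    """Interpret a (Connection Status, Pax Boarding) pair."""
    if connection_status == 1 and pax_boarding == 1:
        # Assumption: a successful connection implies that the passenger boarded.
        return "connection made"
    if connection_status == 0 and pax_boarding == 0:
        return "connection missed, passenger did not board"
    if connection_status == 0 and pax_boarding == 1:
        return "boarded a later flight after a missed connection"
    return "combination not described in the text"

# A passenger who misses the original connection appears in two samples.
first_sample = describe_sample(0, 0)
second_sample = describe_sample(0, 1)
print(first_sample)
print(second_sample)
```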
