Data Analysis and Preparation
3.3 Data Preprocessing
3.3.3 Feature Engineering
Feature engineering is the process of creating and transforming features using the knowledge of the dataset acquired in the data analysis phase. Typically, it increases the predictive power of the learning algorithm by creating features that capture additional information that is not presented in an amenable form in the original feature set.
Figure 3.11: Schematic representation of the data integration process. [Diagram: passenger records (‘Passenger ID’, ‘Data Schd Arr’, ‘TP from’, ‘Data Schd Dep’, ‘TP to’) are matched with flight records (‘IATA Code’, ‘Flight Number’, ‘Flight Date’, ‘Movement Type’) to form a single record with Pax Info, Arr Info and Dep Info.]
Some of the most relevant new features included/transformed were:
Scheduled Connection Time. The scheduled connection time is the time interval between the scheduled on-block time (arrival ‘Schedule Date Time’), i.e. the time at which an aircraft is scheduled to arrive at its parking position, and the scheduled off-block time (departure ‘Schedule Date Time’), i.e. the time at which an aircraft is scheduled to depart from its parking position. This is the connection time used by the airlines for scheduling purposes.
Actual Connection Time. The actual connection time is the time interval between the actual ‘On-Blocks Time’ of the arrival flight and the actual ‘Off-Blocks Time’ of the departure flight, i.e., the actual arrival and departure times, both measured by the OCC.
Perceived Connection Time. This variable is the scheduled off-block time (departure ‘Schedule Date Time’) minus the actual ‘On-Blocks Time’ (measured by the OCC). It represents the time passengers think they have to make the connection. Thus, it may be an indicator of the stress level of the connecting passenger, which influences the speed at which a person moves through the airport.
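The three connection times defined above can be illustrated with a minimal pandas sketch. The column names (`arr_sched`, `arr_on_blocks`, `dep_sched`, `dep_off_blocks`), the sample timestamps and the choice of minutes as the unit are assumptions made for illustration only; they are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical column names and sample values, for illustration only.
flights = pd.DataFrame({
    "arr_sched": pd.to_datetime(["2019-09-01 10:00", "2019-09-01 12:00"]),
    "arr_on_blocks": pd.to_datetime(["2019-09-01 10:20", "2019-09-01 11:50"]),
    "dep_sched": pd.to_datetime(["2019-09-01 11:30", "2019-09-01 13:00"]),
    "dep_off_blocks": pd.to_datetime(["2019-09-01 11:45", "2019-09-01 13:05"]),
})


def to_minutes(delta: pd.Series) -> pd.Series:
    """Convert a timedelta Series to minutes (the unit is an assumption)."""
    return delta.dt.total_seconds() / 60


# Scheduled: scheduled off-block minus scheduled on-block.
flights["sched_conn_min"] = to_minutes(flights["dep_sched"] - flights["arr_sched"])
# Actual: actual off-blocks minus actual on-blocks.
flights["actual_conn_min"] = to_minutes(
    flights["dep_off_blocks"] - flights["arr_on_blocks"]
)
# Perceived: scheduled off-block minus actual on-blocks.
flights["perceived_conn_min"] = to_minutes(
    flights["dep_sched"] - flights["arr_on_blocks"]
)
```

In the first sample row the arrival is 20 minutes late, so the perceived connection time (70 minutes) is shorter than the scheduled one (90 minutes).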
Arrival/Departure Delay. These features are defined as the ‘On/Off-Blocks Time’ minus the arrival/departure ‘Schedule Date Time’. Negative values indicate that the arrival/departure happened ahead of schedule.
Boarding Delta. This feature is the difference between the timestamps ‘Boarding End’ and ‘Boarding Start’ present in the original set of features. It represents the boarding duration of the departure flight.
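As a hedged sketch, the delay and boarding features could be derived as follows. The sketch uses the sign convention in which negative delays mean that the flight operated ahead of schedule (actual time minus scheduled time). All column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical column names and sample values, for illustration only.
flights = pd.DataFrame({
    "arr_sched": pd.to_datetime(["2019-09-01 10:00", "2019-09-01 12:00"]),
    "arr_on_blocks": pd.to_datetime(["2019-09-01 10:20", "2019-09-01 11:50"]),
    "dep_sched": pd.to_datetime(["2019-09-01 11:30", "2019-09-01 13:00"]),
    "dep_off_blocks": pd.to_datetime(["2019-09-01 11:45", "2019-09-01 13:05"]),
    "boarding_start": pd.to_datetime(["2019-09-01 11:00", "2019-09-01 12:30"]),
    "boarding_end": pd.to_datetime(["2019-09-01 11:25", "2019-09-01 12:50"]),
})


def to_minutes(delta: pd.Series) -> pd.Series:
    """Convert a timedelta Series to minutes (the unit is an assumption)."""
    return delta.dt.total_seconds() / 60


# Delay = actual minus scheduled, so negative values mean "ahead of schedule".
flights["arr_delay_min"] = to_minutes(flights["arr_on_blocks"] - flights["arr_sched"])
flights["dep_delay_min"] = to_minutes(flights["dep_off_blocks"] - flights["dep_sched"])
# Boarding delta = 'Boarding End' minus 'Boarding Start'.
flights["boarding_delta_min"] = to_minutes(
    flights["boarding_end"] - flights["boarding_start"]
)
```

The second sample arrival reaches its parking position 10 minutes early and therefore receives an arrival delay of -10 minutes.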
Label. This feature represents the connection type based on the ‘Connection Status’ and the boarding values, ‘Pax Boarding’. It was created by applying the hierarchical index procedure already described and illustrated in Figure 3.11. For example, if there is only one line associated with a pair ‘Passenger ID’ and ‘TP from’, i.e. there is only one record for the connection in question, and both the connection and boarding status indicate that the passenger was able to make the transfer successfully, this connection is considered to be of type A. This feature is essential for selecting the samples to be analysed or discarded. It is possible to distinguish the following labels:
• A: the passenger does not miss the connection;
• B/b: the passenger misses the connection and is relocated on another flight. The connection where the passenger successfully embarked is classified as ‘B’ and the other as ‘b’;
• O/o: same as type B/b, but in these cases the airline has to pay for the passenger’s overnight stay. The capital letter indicates the successful connection;
• N: due to lack of records, it is considered that the passenger did not appear;
• X: cases that appear to contain errors, since the information is not coherent with other cases. For example, in some B/b and O/o cases the passenger is relocated to more than one flight; the departures with a timestamp later than the successful departure are not considered useful information and receive this label. This label also includes cases where one of the rows is clearly repeated, but with some missing values.
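The type A rule given as an example above (a single record per ‘Passenger ID’ and ‘TP from’ pair, with both statuses indicating success) can be sketched as follows. The status values `"OK"` and `"Y"` are hypothetical placeholders, since the actual encoding of ‘Connection Status’ and ‘Pax Boarding’ is not given here, and the rules for the remaining labels are deliberately left out.

```python
import pandas as pd

# Sample records; the status encodings ("OK", "MISSED", "Y", "N") are assumed.
records = pd.DataFrame({
    "Passenger ID": [1, 2, 2, 3],
    "TP from": ["TP 0573", "TP 0573", "TP 0573", "TP 0101"],
    "Connection Status": ["OK", "MISSED", "OK", "OK"],
    "Pax Boarding": ["Y", "N", "Y", "N"],
})

# Number of records sharing the same ('Passenger ID', 'TP from') pair.
n_records = records.groupby(["Passenger ID", "TP from"])["TP from"].transform("size")

# Both statuses must indicate a successful transfer (assumed encoding).
successful = (records["Connection Status"] == "OK") & (records["Pax Boarding"] == "Y")

# Only type A is assigned in this sketch; other labels are left unset.
records["Label"] = None
records.loc[(n_records == 1) & successful, "Label"] = "A"
```

Only the first passenger is labelled A here: the second has two records for the same arriving flight, and the third has a single record but did not board.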
Rerouted. Indicates whether the passenger was rerouted and, if so, the order of the reallocation. For example, if a passenger misses a connection and is allocated to another flight but for some reason misses that flight too and is consequently rerouted again to another connecting flight, then the ‘Rerouted’ values are 0, 1 and 2, respectively. With this feature it is possible to select, for example, all the original connections, i.e., ‘Rerouted=0’.
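One possible way to obtain such a counter is a cumulative count within each passenger and arriving flight. The sketch below assumes that the allocations can be ordered by scheduled departure time; this ordering, the column name `dep_sched` and the sample flight numbers other than TP 0573 and TP 0087 are assumptions made for illustration.

```python
import pandas as pd

# Sample records, deliberately not in chronological order.
records = pd.DataFrame({
    "Passenger ID": [1, 1, 1, 2],
    "TP from": ["TP 0573", "TP 0573", "TP 0573", "TP 0573"],
    "TP to": ["TP 0089", "TP 0087", "TP 0091", "TP 0087"],
    "dep_sched": pd.to_datetime([
        "2019-09-01 15:00",
        "2019-09-01 11:30",
        "2019-09-01 19:00",
        "2019-09-01 11:30",
    ]),
})

# Order each passenger's departures chronologically (assumed ordering rule).
records = records.sort_values(["Passenger ID", "TP from", "dep_sched"])

# 0 = original connection, 1 = first reallocation, 2 = second reallocation, ...
records["Rerouted"] = records.groupby(["Passenger ID", "TP from"]).cumcount()

# Select only the original connections.
originals = records[records["Rerouted"] == 0]
```

Filtering on `Rerouted == 0` then keeps one row per passenger and arriving flight, namely the originally booked connection.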
Traffic Network. Indicates whether the passenger is travelling from and to airports within the Schengen area. There are thus four types of connections (traffic networks): Schengen to Schengen (SS), Schengen to Non-Schengen (SN), Non-Schengen to Schengen (NS) and Non-Schengen to Non-Schengen (NN). This feature may affect the time needed to go from the arrival point to the departure point. For example, passengers in an SS connection are expected to traverse the airport in less time than NS connecting passengers, since the former need to pass through neither the X-ray control nor the passport control, while the latter have to pass through both of these airport bottlenecks (see Figure 3.12). To create this feature, it was necessary to compile a list of all airports in the Schengen area, each identified by its IATA code. Then, for each data point, a match was sought in the Schengen airports list to classify the arrival and departure flights as Schengen or Non-Schengen and, consequently, to classify the traffic network.
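The lookup described above can be sketched as a set membership test on IATA airport codes. The set below is a small illustrative subset rather than the full list compiled for this work, and the column names `origin` and `destination` are hypothetical.

```python
import pandas as pd

# Illustrative subset of Schengen airports (IATA codes), not the full list.
SCHENGEN_AIRPORTS = {"LIS", "OPO", "MAD", "CDG", "FRA"}


def traffic_network(origin: str, destination: str) -> str:
    """Return 'SS', 'SN', 'NS' or 'NN' for an origin/destination airport pair."""
    first = "S" if origin in SCHENGEN_AIRPORTS else "N"
    second = "S" if destination in SCHENGEN_AIRPORTS else "N"
    return first + second


# Origin of the arrival flight and destination of the departure flight.
connections = pd.DataFrame({
    "origin": ["MAD", "CDG", "GRU", "JFK"],
    "destination": ["FRA", "LHR", "OPO", "GRU"],
})

connections["Traffic Network"] = [
    traffic_network(origin, destination)
    for origin, destination in zip(connections["origin"], connections["destination"])
]
```

A set is used for the airport list so that each membership check runs in constant time, which matters when the match is repeated for every data point.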
Figure 3.12: Schematic diagram of the connecting journey. [Diagram labels: Jet Bridge; Way to Non-Schengen Departure; Way to Schengen Departure.]
Class From/To. As shown in Figures 3.2 and 3.3, there are 16 different classes, some of which are variations of the same class. These variations were combined into a single class, resulting in a total of 5 different classes: Business, Economic, R1, Groups and Allots. This feature can be useful since it may characterize a given passenger profile.