Learning to Rank with Random Forest: A Case Study in Hostel Reservations

Academic year: 2021


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Learning to Rank with Random Forest:

a case study in hostel reservations

Carolina Macedo Moreira

DISSERTATION

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Carlos Manuel Milheiro de Oliveira Pinto Soares

Company Supervisor: Isabel Portugal


Learning to Rank with Random Forest: a case study in hostel reservations

Carolina Macedo Moreira

Mestrado Integrado em Engenharia Informática e Computação


Abstract

Learning to rank is the application of supervised machine learning in the construction of ranking models for information retrieval systems. Hostelworld Services Portugal uses this method to improve the ranking of properties on their listing page, in order to improve their profits, as well as to boost their customers' satisfaction.

In the online hospitality industry, user search is one of the key factors. Search filters allow users to easily find the properties they are interested in. Therefore, improving these filters is of the utmost importance.

We developed a random-forest-based approach, with several variations: one focused exclusively on clicks, another on bookings, and another combining the two. We compared this approach to a simple baseline using the static ranking and to the current method, achieving positive results. Among the different variants that were tested, there are no significant differences.


Acknowledgements

First and foremost, I want to acknowledge my advisor, Professor Carlos Soares, as well as my company advisor, Isabel Portugal. They both had to deal with my lack of knowledge, mental breakdowns, lack of motivation and overall doom and gloom. Without either of them, I’m quite sure this thesis wouldn’t even exist. Isabel fought tooth and nail for me to be able to have the internship with Hostelworld, and my professor took on this thesis against his better judgment, even though he already had quite the workload, and for that I am eternally grateful to them both.

Furthermore, I’d like to thank everyone in the Porto office of Hostelworld, for always making me feel welcome and part of the team. In the same vein, I’d like to acknowledge the team at the small restaurant I always got my lunch from, Berry, for providing me with nutritious meals that allowed me to develop this project.

And as a parting note, I’d like to thank everyone who, in one way or another, helped me throughout this course, which allowed me to even make it this far in the first place. Without all your help during these years, none of this would have been possible. Thank you.


“Everything will be okay.”


Contents

1 Introduction
1.1 Problem & Motivation
1.2 Goals & Contributions
1.3 Dissertation Structure

2 Background & Related Work
2.1 Learning to Rank
2.1.1 Problem Formalization
2.1.2 Types of Approaches: Pointwise, Pairwise & Listwise
2.1.3 Evaluation Methodology and Metrics
2.1.4 Performance Estimation Methodologies
2.2 Pointwise Approaches
2.2.1 Definition of Classification
2.2.2 Random Forest
2.3 Related Work
2.3.1 Type of Problem
2.3.2 Algorithm Used
2.3.3 Approach
2.3.4 Data Format

3 Case Study
3.1 Current Approach
3.2 Project Methodology
3.3 Data
3.4 Features
3.4.1 Feature Correlation

4 Results
4.1 Approach
4.2 Performance Estimation
4.3 Research Questions
4.3.1 Clicks
4.3.2 Bookings
4.3.3 Bookings & Clicks

5 Conclusions
5.1 Future Work

6 Appendix


List of Figures

1.1 Homepage of hostelworld.com
3.1 The static page for Amsterdam. From [2]
3.2 The dynamic page for Amsterdam. From [2]
3.3 CRISP-DM, reproduced from [23]
3.4 Number of queries, property clicks, and bookings over a period of two weeks of the six highest booked cities in Italy
3.5 Number of queries, property clicks, and bookings over a period of two weeks from six medially booked cities in Italy
3.6 Number of queries, property clicks, and bookings over a period of two weeks from eight meagerly booked cities in Italy
3.7 Feature correlation chart
4.1 Comparison of click predictions with current baseline
4.2 Comparison of booking predictions with current baseline
4.3 Comparison of click predictions with current baseline


List of Tables


Chapter 1

Introduction

Hostelworld is an online travel agency, located at hostelworld.com. Their focus is on hostels, which can be described as "lower-priced, sociable accommodation where guests can rent a bed, usually a bunk bed, in a dormitory and share a bathroom, lounge and sometimes a kitchen." [6] The website allows customers to easily browse many hostels in one single platform, and directly book their accommodation there, without having to communicate with the hostels individually.

Upon reaching the homepage of hostelworld.com, a user is prompted to select a city, a region, or a country by typing its name into the search bar at the top of the screen, and selecting accompanying dates for the proposed stay at a hostel, as well as the number of guests. There is also the choice to directly click on the "Unmissable Cities" section of the homepage, which will direct a user to the listing page of those cities (see Figure 1.1).

On the listing page for a city, a user can find a list of hostels relevant to their search parameters. These parameters, besides the information given at the beginning of the search, are filters the user can apply, such as price range, room type, among others. Chapter 3 will go in depth on these filters and their impact on the user search.

1.1 Problem & Motivation

The problem lies with improving the listing page’s ranking function. Currently, Hostelworld combines a simple approach of using the previous week’s sales, as well as the work of human analysts for their most prominent cities, to decide the ranking the properties should have in the following week. To improve it is to make it easier for a user to find their ideal property at the top of the listing, depending on the filters applied. If customers can find their ideal property at the top of the listing, there’s a higher chance that they’ll book that property, and it might make them more likely to use the platform again in the future. For Hostelworld, this would translate into higher profits for the company, as well as happier customers with greater satisfaction levels.

There are two target values considered in this project: bookings and clicks. Although bookings are the more important one, since they correlate directly with Hostelworld’s profit, clicks are also important, since they reflect a more attainable level of engagement with the platform that wouldn’t be measurable with bookings alone, which require a lot more commitment on the part of the user.

Figure 1.1: Homepage of hostelworld.com

1.2 Goals & Contributions

The main objective of this project was to test a Random Forest approach with Hostelworld’s listing page ranking. In order to do this, four research questions were proposed.

1. Hypothesis 1: the proposed approach is better than the baselines to predict click rankings;

2. Hypothesis 2: the proposed approach is better than the baseline to predict bookings;

3. Hypothesis 3: the ability of the proposed approach to predict the bookings improves when the ties are broken with the predicted ranking of clicks;

4. Hypothesis 4: in case of ties in the rank predicted by the random forest model, the current approach is better for breaking them, since the method is more sophisticated.

1.3 Dissertation Structure

Chapter 2 presents the Background & Related Work. Chapter 3 describes the Hostelworld case study used in this project, while Chapter 4 discusses the results obtained. Finally, Chapter 5 presents the conclusions.


Chapter 2

Background & Related Work

Learning to Rank is a field that draws on both Machine Learning and Information Retrieval [20]. It could be described as using machine learning algorithms to improve ranking functions in information retrieval.

In this chapter, we’ll explore the history of learning to rank, formalize the problems we’ll face, and take an overview of the field of learning to rank through an evaluation matrix.

2.1 Learning to Rank

Norbert Fuhr [13] first proposed the idea of "machine-learned ranking" in 1992, where he described different problems that the area would face in its future, such as parameter estimation, query formulation and the representation of documents and queries.

He also considered that the scope of the area would have to change. At the time, Information Retrieval was an area mostly reserved to its use in libraries or academic institutes, by trained professionals. Nowadays, as Fuhr predicted, it has evolved to be used by regular people, with the widespread use of the Internet and search engines, and it has become an integral part of our daily lives and careers.

In 2002, AltaVista started to develop one of the first commercial search engine ranking systems, using a gradient boosting-trained ranking function. This technology was later acquired by Overture, and subsequently by Yahoo [4].

2.1.1 Problem Formalization

Reproduced from [10], the formalization of a ranking problem can be defined as follows:

Definition: An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where

(1) D is a set composed of logical views (or representations) for the documents in the collection.

(2) Q is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.

(3) F is a framework for modelling document representations, queries, and the relationships among them.

(4) R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such ranking defines an ordering among the documents with regard to the query qi.

2.1.2 Types of Approaches: Pointwise, Pairwise & Listwise

There are three main approaches to learning to rank problems: Pointwise, Pairwise and Listwise approaches [20].

The Pointwise approach is based on applying already existing machine learning algorithms to ranking problems. This is done by assuming that the prediction goal is the exact degree of relevance of each document. As such, models that consider each document individually are trained. Under this umbrella fall several categories of algorithms: regression algorithms, classification algorithms, and ordinal regression algorithms. The difference between the three lies in the target variable. In regression algorithms, this variable is assigned real-valued relevance scores. In classification algorithms, this variable can be inserted into various categories, and in ordinal regression algorithms, the target variable belongs to ordered categories. [20]

mpoint : X → Y, i.e., Q × D → Y    (2.1)

r(qi, dj) = ranking(mpoint(qi, dj); mpoint(qi, d1), ..., mpoint(qi, dn))    (2.2)

The Pairwise approach, also known as preference learning, compares two possible documents for a given query, giving one a relative score of importance against the other. Here, the learning to rank problem is interpreted as a classification problem on these pairs. The goal is to classify all the document pairs correctly, which would indicate a perfect ranking for the given documents. This form of classification is different from the one in the Pointwise approach, since instead of focusing on a single document, it uses pairs of documents. Although a classification approach is still used, the required evaluation is not on the quality of each individual pairwise comparison, but on the ranking as a whole. [20]

mpair : Q × D × D → {−1, 0, 1}    (2.3)

The Listwise approach takes the entire aggregate of documents associated with a certain query, and then outputs either their individual scores of relevance, or the ranking for all the documents. Compared with the Pointwise and Pairwise approaches, the advantage of the Listwise approach is that it is naturally engineered to reflect a ranking model, therefore being appropriate for such a problem. There is no need to take into consideration the pitfalls that exist in other approaches (such as the possible similarity between results of the Pointwise approach, or the depth of individual pairings of the Pairwise approach), since the approach itself is already akin to a ranking. [20]


mlist : Q × D^n → partition(1, ..., n)    (2.4)

2.1.3 Evaluation Methodology and Metrics

In this project, two evaluation metrics were used: NDCG and MAP. Both are based on rel_a, the real relevance value of the document in position a:

rel_a = r(qi, dj) : r_Θ(qi, dj) = a    (2.5)

NDCG, standing for Normalized Discounted Cumulative Gain, is the most popular of the two. Cumulative Gain is a way to measure how good a ranking is. It is based on the fact that the higher a document is ranked, the more relevant it should be; therefore, documents which are relevant but which are low in the ranking should be penalized:

CG_p = Σ_{a=1}^{p} rel_a

The Discounted part comes from the fact that relevant documents which are lower on the ranking scale are penalized in a logarithmic manner:

DCG_p = Σ_{a=1}^{p} (2^{rel_a} − 1) / log2(a + 1)

The Normalized part comes from taking the DCG and dividing it by the IDCG, that is, the Ideal Discounted Cumulative Gain: what the DCG would be if the ranking were perfect. Being a logarithmic measure, it frequently produces results that might seem inflated, but this comes down to the fact that, due to its nature, any improvements are reflected with very high percentile margins. [7]

nDCG_p = DCG_p / IDCG_p    (2.6)

IDCG_p = Σ_{a=1}^{|REL_p|} (2^{rel_a} − 1) / log2(a + 1)    (2.7)

The other metric is MAP (Mean Average Precision). Precision is the fraction of relevant instances among the retrieved instances. MAP is a simpler metric, which takes the mean of the average precision score of each query, where Q is the number of queries:

MAP = (Σ_{q=1}^{Q} AveP(q)) / Q    (2.8)
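These two metrics can be sketched in a few lines of Python (our own illustrative implementation, not the one used in the experiments; NDCG assumes graded relevance values, and AveP assumes binary relevance labels):

```python
import math

def dcg(relevances):
    """DCG: gains of (2^rel - 1), discounted by log2(position + 1) for 1-based positions."""
    return sum((2 ** rel - 1) / math.log2(a + 2) for a, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal DCG (the same documents in perfect order)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def average_precision(relevances):
    """AveP for binary labels: precision at each relevant position, averaged."""
    hits, total = 0, 0.0
    for a, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / a
    return total / hits if hits else 0.0

def mean_average_precision(per_query_relevances):
    """MAP: the mean of AveP over all queries."""
    return sum(map(average_precision, per_query_relevances)) / len(per_query_relevances)
```

For example, a ranking already in perfect order, such as [3, 2, 1], yields an NDCG of 1.0, while any imperfect ordering of the same documents yields less.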

2.1.4 Performance Estimation Methodologies

Performance estimation methodologies are used to predict how well a model will perform in circumstances beyond the experiments actually run; they allow the prediction of performance in real-world scenarios. In the related work (section 2.3), the most common approaches were various types of cross-validation. However, this approach wasn’t well suited to our project, so we chose a sliding window methodology. Sliding window works by choosing, in our case, one week of data, using it to train the model, and then testing it on the following week, repeating this process as the weeks go by. The first week is used for training, and that model is used to test the second week; then the second week is used for training, and that trained model is used to test the third week. This progresses until no more weeks are available in the data, or enough data has been gathered.
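The sliding window scheme above can be sketched as follows (a minimal illustration with names of our own choosing; `weeks` stands for any chronologically ordered sequence of weekly datasets):

```python
def sliding_window(weeks):
    """Yield (train, test) pairs: train on week i, test on week i + 1."""
    for i in range(len(weeks) - 1):
        yield weeks[i], weeks[i + 1]

# e.g., 13 weeks of data produce 12 train/test rounds:
# week 1 trains a model tested on week 2, week 2 trains one tested on week 3, ...
```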

2.2 Pointwise Approaches

Since the main algorithm used in this project was a Pointwise algorithm, a more in-depth explanation of Pointwise approaches is required.

2.2.1 Definition of Classification

According to [11], classification can be defined as follows:

"Classification is the most common task in machine learning. A classifier is a mapping ĉ : X → C, where C = {C1, C2, ..., Ck} is a finite and usually small set of class labels. We will sometimes also use Ci to indicate the set of examples of that class. We use the ’hat’ to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance. Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X)."

2.2.2 Random Forest

Random forests are what’s known as an ensemble method, that is, a method that combines multiple models to produce a better result. In this case, the models in question are decision trees. A decision tree is akin to a flowchart in which objects are separated according to certain characteristics, passing through different nodes until reaching a probable outcome. It exists in both classification and regression formats; for this project, the classification variant was used. [14]
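To make the ensemble idea concrete, here is a toy from-scratch sketch (for illustration only; the experiments in Chapter 4 use scikit-learn's implementation): each "tree" is a depth-1 stump fitted on a bootstrap sample with a randomly chosen feature and threshold, and the forest predicts by majority vote.

```python
import random

def train_stump(Xb, yb, feature, threshold):
    """A depth-1 decision tree: predicts the majority label on each side of the split."""
    def majority(labels, default):
        return max(set(labels), key=labels.count) if labels else default
    overall = majority(yb, 0)
    left = majority([lab for x, lab in zip(Xb, yb) if x[feature] <= threshold], overall)
    right = majority([lab for x, lab in zip(Xb, yb) if x[feature] > threshold], overall)
    return lambda x: left if x[feature] <= threshold else right

def random_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample; prediction is a majority vote."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample (with replacement)
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        f = rng.randrange(len(X[0]))               # random feature
        vals = [x[f] for x in Xb]
        t = rng.uniform(min(vals), max(vals))      # random split threshold
        trees.append(train_stump(Xb, yb, f, t))
    def predict(x):
        votes = [tree(x) for tree in trees]
        return max(set(votes), key=votes.count)
    return predict
```

Because each tree sees a different bootstrap sample and random split, individual trees are weak but diverse, and the vote averages out their errors; this is the property the thesis relies on when using scikit-learn's random forest.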

2.3 Related Work

In order to explore the work that had already been done in the area, we analysed 17 papers and identified several dimensions to characterise them: the Type of Problem, its Approach, the Algorithm Used, the Data Format, and the Evaluation Methodology and Metrics. A more detailed characterisation of each paper is presented in the annex (section 6.1).

2.3.1 Type of Problem

There are 3 possible values for this category: Classification, Multiple Classification and Regression. Classification can basically be described as choosing between two possible values: an object either is a value or isn’t. Multiple Classification is when there are more values to choose from: an object can fall into any one of multiple categories. Regression problems are when there is a line of possible outcomes, and an object can fall into any point of it. 58.8% are regression problems, 29.4% are classification ones, and the remaining 11.8% are multiple classification problems.


2.3.2 Algorithm Used

Most papers experiment with new algorithms, so results are very varied. However, algorithms can be grouped in various categories. SVM algorithms and boosted trees are the most used, while simpler algorithms trail behind.

2.3.3 Approach

There are a lot more listwise and pairwise approaches than pointwise ones. This shows that the pointwise approach has received comparatively little investigation, which made it interesting to develop a project focusing on a pointwise algorithm. A possible reason for this is that pointwise algorithms are often very similar to ones used in machine learning tasks not related to learning to rank, and so there might be a bias against using them on learning to rank tasks.

2.3.4 Data Format

The data format was evenly divided between three categories: Relevance Label, Relevance Score, and Pairwise Comparisons.

Relevance Label considers whether a document is relevant to the given query. A Relevance Score signifies its score of relevance, which can then be used in comparison with other documents. Pairwise Comparisons are pairs of documents, indicative of which document is the more relevant between them.


Chapter 3

Case Study

Hostelworld is a company that describes itself as having the purpose to "inspire passionate travellers to see the world, meet new people and come back with extraordinary stories to tell" [1]. Its mission is "to create a platform that connects our hostel partners and hostel guests with the most customised tech solutions, and offers the best choice of hostels and travel experiences around the world" [1], and its ambition is "to reaffirm our position as the world’s leading online booking platform for hostel travellers" [1].

Per their own words, "Hostelworld Group is the leading hostel-focussed online booking platform, sparking social experiences for young and independent travellers." They have over 16,500 hostel properties globally, as well as 20,000 other forms of accommodation. [1]

3.1 Current Approach

There is a static ranking generated every week for each city. It ranks properties based on the previous week’s bookings, adjusted by human analysts. This is the rank that appears when a user selects a city from the homepage.

For the first baseline, this ranking was used to assess whether the proposed approach provides useful results, merely cross-referenced with the queries in question. For this baseline, six cities in Italy were used over a period of two weeks. For a more in-depth explanation of why these six cities were chosen, see section 3.3.

The results of this first baseline, with both the MAP and the NDCG metrics, for each of the target variables, are as follows:

          MAP     NDCG
Clicks    0.1694  0.4090
Bookings  0.0308  0.3493

The dynamic ranking is the ranking currently in use by the website. It takes the static ranking and applies the user filters, as well as the availability on the dates indicated by the user.

Figure 3.1: The static page for Amsterdam. From [2]

Figure 3.2: The dynamic page for Amsterdam. From [2]

This current Hostelworld approach was tested to provide a second baseline against which to compare the results achieved through the Random Forest approach, as well as to compare with the first, static, baseline.

The results of the second baseline, with both the MAP and NDCG metrics, for each of the target variables, are as follows:

          MAP     NDCG
Clicks    0.3909  0.5064
Bookings  0.0991  0.2773

3.2 Project Methodology

This project uses the CRISP-DM method. CRISP-DM stands for "Cross-Industry Standard Process for Data Mining" and it breaks the Data Mining process into six major phases. [23] Phases aren’t necessarily done one after the other, and there can be overlap and backtracking (see Figure 3.3).

3.3 Data

The data used consists of search records from six cities of different sizes in Italy, during a period of three months. Italy was chosen because it is a very popular country, with approximately 5% of the website’s bookings. Although this may seem like a small percentage, considering there are 179 countries available on Hostelworld’s website [1], it is still a statistically important figure, reflecting its popularity with travellers. There were, at the time of the sample, which has a length of two weeks, 180 available cities in Italy. From this sample, six cities were selected (two small, two medium and two large ones). Cities with over two hundred bookings during the two weeks were considered big cities. From these, Rome was picked due to being the most popular city in Italy, with 1,169,790 queries made during this period, as well as 2,326 bookings. The second big city chosen was Milan, with 576,570 queries and 1,290 bookings. We also considered Venice, with 547,635 queries and 1,279 bookings, which is very similar to Milan. Since there was an effort to select similarly queried and booked cities, Venice was proposed as the second big city, replacing Rome, but Rome was kept: being the capital, and ahead of the other cities by a significant margin, it had a statistical importance of its own.

Cities with more than fifty but less than one hundred bookings were considered medium cities. There were also six cities in this sample. In terms of queries, they were all approximately evenly spaced, so the two highest booked ones were picked, Catania, with 86 bookings, and Genoa, with 94.

Cities with less than twenty but more than ten bookings were considered small cities. There were eight cities in this sample. Of these, Cagliari was an outlier in terms of queries, with 7,181 queries, as were Matera and Como, with 258 and 245 queries respectively. From the remaining cities, Perugia and Taormina were chosen, since they have similar values across query numbers, property clicks, and bookings.

Figure 3.3: CRISP-DM, reproduced from [23]

Therefore, the six cities that were used for this project were Rome, Milan, Catania, Genoa, Perugia and Taormina.

The period of three months was chosen because of seasonality. In the travel business, seasonality is one of the core factors that defines sales. Over a period of three months, it is possible to obtain an overview of that season, and draw conclusions that wouldn’t be possible with either a longer or a shorter period of time. For this reason, the months of September, October and November were chosen, in order to provide a sense of the seasonality of the autumnal season of 2018.

3.4 Features

In the development of this project, 51 features were used in training the random forest algorithms.

Number of nights: depending on the number of nights a customer chooses to spend, different properties might be more appropriate. There might be some properties that are better suited to shorter stays in comparison with longer ones.

Number of guests: guests who are traveling solo, in pairs, or in groups might have very different needs and demands when it comes to the property they will choose. A guest traveling solo might prefer to stay at a property that’s well known for its good dorm rooms, while a guest traveling with a significant other might prefer a private room, and a big family might prefer something akin to a suite.

Figure 3.4: Number of queries, property clicks, and bookings over a period of two weeks of the six highest booked cities in Italy

Figure 3.5: Number of queries, property clicks, and bookings over a period of two weeks from six medially booked cities in Italy

Figure 3.6: Number of queries, property clicks, and bookings over a period of two weeks from eight meagerly booked cities in Italy


Featured property: when one arrives at the listing page for a certain city, some properties will appear on the top of the screen, before all others. These are featured properties, the owners of which have paid to have their property take on a more prominent spot in the listings. This is a binary feature of whether a property is featured or not.

Districts: a filter that allows users to only display properties in the selected district. This is a binary feature of whether this filter was used or not.

Facilities: a filter that allows users to only display properties with certain features, for instance, properties that have air conditioning or free breakfast. This is a binary feature of whether this filter was used or not, without individual criteria separation.

Payment type: a filter that allows users to only display properties that allow a certain payment type, for instance, showing only properties that allow using a credit card to pay. This is a binary feature of whether this filter was used or not.

Price range: a filter that allows users to only display properties that are within a certain price range. This is a binary feature of whether this filter was used or not.

Property type: a filter that allows users to only display properties of a certain type, for instance, hostels or hotels. This is a binary feature of whether this filter was used or not.

Rating range: a filter that allows users to only display properties that are within a certain rating range. This is a binary feature of whether this filter was used or not.

Room type: a filter that allows users to only display properties that have available rooms of a certain type, for instance, female dorm rooms or three-person private rooms. This is a binary feature of whether this filter was used or not.

Distance: a sort function that will forcefully sort the properties on the listing page by their distance to the center of the city. This is a binary feature of whether this sort function was used or not.

Name: a sort function that will forcefully sort the properties on the listing page by their alphabetical order. This is a binary feature of whether this sort function was used or not.

Price: a sort function that will forcefully sort the properties on the listing page by their price. This is a binary feature of whether this sort function was used or not.

Rating: a sort function that will forcefully sort the properties on the listing page by their rating. This is a binary feature of whether this sort function was used or not.

Weekdays: one-hot encoding of the day of the week of the arrival date. A guest arriving on a Monday might have different priorities in properties compared with a guest arriving on a Friday, especially if this is taken into consideration along with the number of nights feature.

Website language: one-hot encoding of the website language chosen by the user, from the selection of Brazilian Portuguese, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish and Turkish. If a property is frequently visited or booked by website users of a certain language, it might be good to show them those properties first.

Dynamic fab: binary feature that presents whether the user is looking at the static page, with-out applying any filter, or at the dynamic page, which is already a selection of all the properties.


Property chain: binary feature that showcases whether a property is part of a chain. Properties that are part of a chain have certain advantages and disadvantages that might be worth considering.

Number of images: the number of images that a property listing has. A property that has at least a certain number of images might be considered more interesting and reliable by the customers.

Number of reviews: the number of reviews a property has.

Review average: the average review score of the property.

Type of property: one-hot encoding detailing the type of property, whether a hostel, a hotel, a guesthouse, an apartment or a campsite.

Clicked: binary feature detailing whether a property was clicked for the query in question. One of the target variables.

Booked: binary feature detailing whether a property was booked for the query in question. One of the target variables.
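Several of the features above (weekdays, website language, type of property) are one-hot encoded. A minimal sketch of how such an encoding can be produced (the field names and rows here are hypothetical, not Hostelworld's actual schema):

```python
def one_hot(rows, field, categories):
    """Expand a categorical field into one binary column per category."""
    return [[1 if row[field] == cat else 0 for cat in categories] for row in rows]

# two hypothetical query rows
rows = [
    {"weekday": "mon", "language": "english"},
    {"weekday": "fri", "language": "italian"},
]
weekdays = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
encoded = one_hot(rows, "weekday", weekdays)
# encoded == [[1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0]]
```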

3.4.1 Feature Correlation

Upon computing a correlation chart, using the Pearson standard correlation coefficient [5], as can be seen in Figure 3.7, there aren’t really any features that heavily correlate with any others.

There is a 29% correlation between clicked properties and whether they’re featured, and a 26% correlation between booked properties and whether they’re featured. There is also a 30% correlation between the number of images and whether a property is featured, and a 49% correlation between a property being clicked and it being booked. Overall, there aren’t any heavy correlations among our features.
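For reference, the Pearson coefficient underlying the chart can be computed directly (a self-contained sketch; the example arrays below are hypothetical clicked/booked columns, not the thesis data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical binary clicked/booked columns for four query-property pairs
r = pearson([1, 0, 1, 1], [1, 0, 0, 1])
```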


Chapter 4

Results

4.1 Approach

In order to apply a random forest algorithm to a learning to rank problem, some changes had to be made to the data before applying the algorithm. We used the out-of-the-box Random Forest Classifier from the scikit-learn library for Python [3]. The number of trees for each ensemble was set to 100, and the maximum depth of each tree was set to two.

The main difficulty in using a classification algorithm for ranking is that each prediction is a class from a finite set of unordered alternatives. In our case study, the prediction is whether or not a particular property would have been clicked, or booked, given a particular query; in other words, the predictions cannot be directly used to generate a ranking of the properties for each query. However, for each prediction made with a random forest there is an associated probability of belonging to each of the classes. Given that we are interested in the positive class (i.e. the user has clicked on the property or booked it), we build a ranking by ordering the predictions by decreasing probability of belonging to the positive class. Since we address the problems of whether the user has clicked or booked separately, in practice we compute two rankings for each query.
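The ordering step can be sketched as follows (function and variable names are ours; `model` is assumed to be any classifier exposing scikit-learn's `predict_proba` interface, with column 1 holding the positive class):

```python
def rank_properties(model, feature_rows, property_ids):
    """Rank a query's candidate properties by decreasing probability of the
    positive class (clicked/booked), as given by predict_proba."""
    probs = model.predict_proba(feature_rows)      # one (P(neg), P(pos)) pair per row
    scored = sorted(zip(property_ids, (p[1] for p in probs)),
                    key=lambda pair: pair[1], reverse=True)
    return [pid for pid, _ in scored]
```

In the thesis setting, `model` would be a `RandomForestClassifier(n_estimators=100, max_depth=2)` fitted on the query-property feature rows, and `rank_properties` would be called once per query and per target (clicks, bookings).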

4.2 Performance Estimation

We evaluated each predicted ranking with both the NDCG and MAP metrics, according to the sliding window methodology. We repeated this process twelve times, covering all of the weeks in the three months of available data. We also analysed the run times of the approaches; the values obtained were around one minute, which indicates that execution time is not an issue.
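As a reference for how these scores are computed, a simple implementation of binary-relevance NDCG and of average precision (the per-query building block of MAP) could look like this. The formulas are the standard ones from the literature, not code taken from the project:

```python
import numpy as np

def dcg(relevances):
    # DCG = sum(rel_i / log2(i + 1)), with positions starting at 1
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(positions + 1)))

def ndcg(relevances):
    # Normalise by the DCG of the ideal (sorted) ranking
    ideal = dcg(np.sort(relevances)[::-1])
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def average_precision(relevances):
    # Mean of precision@k taken at the positions of relevant items
    relevances = np.asarray(relevances)
    hits = np.flatnonzero(relevances)
    if len(hits) == 0:
        return 0.0
    precisions = [(i + 1) / (pos + 1) for i, pos in enumerate(hits)]
    return float(np.mean(precisions))

# Binary relevance of the properties, in predicted rank order:
predicted = [1, 0, 1, 0, 0]
print(round(ndcg(predicted), 4), round(average_precision(predicted), 4))
# prints: 0.9197 0.8333
```

MAP is then the mean of `average_precision` over all queries in the evaluation window, and the reported NDCG is likewise averaged over queries.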

4.3 Research Questions

As explained earlier, the main objective of this project was to test a Random Forest approach on Hostelworld's listing page ranking. In order to do this, four research questions were proposed.

Figure 4.1: Comparison of click predictions with the current baseline (MAP, left; NDCG, right; over twelve weeks)

1. Hypothesis 1: the proposed approach is better than the baselines at predicting click rankings;

2. Hypothesis 2: the proposed approach is better than the baseline at predicting bookings;

3. Hypothesis 3: the ability of the proposed approach to predict bookings improves when the ties are broken with the predicted ranking of clicks;

4. Hypothesis 4: in case of ties in the ranking predicted by the random forest model, the current approach is better for breaking them, since it is more sophisticated.

Each hypothesis was tested by comparing our approach with the current baseline; within our approach, ties between properties with the same predicted probability were broken either by the default ordering or by the current approach.
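A small sketch of the two tie-breaking variants, using illustrative property records (the field names are hypothetical, not the actual schema):

```python
# Rank properties by predicted probability; ties are broken either by
# the default listing order or by the current approach's static score.
properties = [
    {"id": "A", "proba": 0.80, "default_pos": 1, "static_score": 0.4},
    {"id": "B", "proba": 0.80, "default_pos": 2, "static_score": 0.9},
    {"id": "C", "proba": 0.95, "default_pos": 3, "static_score": 0.1},
]

# Variant 1: break ties with the default ordering (lower position first)
by_default = sorted(properties, key=lambda p: (-p["proba"], p["default_pos"]))

# Variant 2: break ties with the current approach's score (higher first)
by_current = sorted(properties, key=lambda p: (-p["proba"], -p["static_score"]))

print([p["id"] for p in by_default])  # ['C', 'A', 'B']
print([p["id"] for p in by_current])  # ['C', 'B', 'A']
```

A and B tie on probability, so the two variants place them differently; C, with the highest probability, always ranks first.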

4.3.1 Clicks

The first hypothesis ignores the bookings entirely and focuses on the clicks, trying to optimize how many property pages would be clicked.

On the left chart in Figure 4.1, showcasing the MAP evaluation, the blue line, which almost completely overlaps the red line, represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line, at the bottom, represents the results obtained with the current approach. The right chart in Figure 4.1 shows the corresponding NDCG evaluation. Overall, our method obtained an average MAP score of 0.5687 with the default tie-breaking method and 0.5685 with the baseline tie-breaker, compared with a baseline score of 0.3688, an overall improvement of 54%. For the NDCG, our method obtained scores of 0.9838 and 0.9839 with the default and baseline tie-breakers, respectively, while the baseline obtained a score of 0.3392, an overall improvement of around 272%.

These results show that the RF approach is much better than the approaches that are currently in use. The improvements are so large that there is no need for statistical analysis of the differences.


Figure 4.2: Comparison of booking predictions with the current baseline (MAP, left; NDCG, right; over twelve weeks)

4.3.2 Bookings

The second hypothesis consisted in focusing entirely on the bookings, trying to optimize how many properties would be booked.

In the left chart in Figure 4.2, showcasing the MAP evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line, at the bottom, represents the results obtained from the current approach.

In the right chart in Figure 4.2, showcasing the NDCG evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line represents the results obtained from the current approach.

Overall, there was an average MAP score of 0.2186 for the default approach and of 0.2163 for the baseline tie-breaker approach, compared with a baseline score of 0.1091, an overall improvement of 102% for the default approach and 99% for the tie-breaker baseline approach. For the NDCG, there was a default score of 0.6045, and a baseline tie-breaker score of 0.6326, compared with a baseline score of 0.1489, an overall improvement of 345% for the default approach and 367% for the tie-breaker baseline one.

4.3.3 Bookings & Clicks

For our third hypothesis, we used the booking predictions as the base order and the scores obtained from the click-prediction model as a tie-breaking strategy to obtain the final ranking. The predicted ranking was evaluated against both targets, clicks and bookings.
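This combination can be sketched as a lexicographic sort on the two predicted probabilities (the values below are illustrative; in the study they come from two separately trained random forest models):

```python
import numpy as np

# Predicted probabilities for the properties of one query.
booking_proba = np.array([0.30, 0.30, 0.70, 0.30])
click_proba   = np.array([0.10, 0.90, 0.50, 0.40])

# Primary key: booking probability (descending);
# ties broken by click probability (descending).
# np.lexsort sorts by the LAST key first, so the secondary
# key is passed before the primary one.
ranking = np.lexsort((-click_proba, -booking_proba))
print(ranking.tolist())  # [2, 1, 3, 0]
```

Property 2 leads on booking probability; properties 1, 3, and 0 tie on bookings and are ordered by their click probabilities.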

4.3.3.1 Clicks

Figure 4.3: Comparison of click predictions with the current baseline (MAP, left; NDCG, right; over twelve weeks)

In the left chart in Figure 4.3, showcasing the MAP evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line represents the results obtained from the current approach.

In the right chart in Figure 4.3, showcasing the NDCG evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line represents the results obtained from the current approach.

Overall, there was an average MAP score of 0.5585 for the default approach and of 0.5533 for the baseline tie-breaker approach, compared with a baseline score of 0.3688, an overall improvement of 52% for the default approach and 50% for the tie-breaker baseline approach. For the NDCG, there was a default score of 0.9757, and a baseline tie-breaker score of 0.9781, compared with a baseline score of 0.3392, an overall improvement of 269% for the default approach and 266% for the tie-breaker baseline one.

Figure 4.4: Comparison of booking predictions with the current baseline (MAP, left; NDCG, right; over twelve weeks)


4.3.3.2 Bookings

In the left chart in Figure 4.4, showcasing the MAP evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line, at the bottom, represents the results obtained from the current approach.

In the right chart in Figure 4.4, showcasing the NDCG evaluation, the blue line represents the random forest approach using the default ordering as a tie-breaker. The red line represents the random forest approach using the current approach as a tie-breaker. The brown line represents the results obtained from the current approach.

Overall, there was an average MAP score of 0.2185 for the default approach and of 0.2176 for the baseline tie-breaker approach, compared with a baseline score of 0.1091, an overall improvement of 102% for the default approach and 101% for the tie-breaker baseline approach. For the NDCG, there was a default score of 0.6190, and a baseline tie-breaker score of 0.6217, compared with a baseline score of 0.1489, an overall improvement of 357% for the default approach and 354% for the tie-breaker baseline one.


Chapter 5

Conclusions

Hostelworld (http://hostelworld.com) is an online travel agency focused on hostels, which can be described as low-cost, shared accommodation. Users find hostels by querying the site (selecting a city, region, or country, the desired dates for the stay, the number of guests, etc.) and receive a ranking of properties, in decreasing order of expected interest. The main goal of this project was to test a Random Forest approach for ranking properties in response to user queries. Two target values were considered: bookings and clicks. Although bookings are the more important one, since they correlate directly with profit, clicks are also important, since they reflect a more attainable level of engagement with the platform that would not be measurable with bookings alone.

In order to do this, four research questions were addressed:

1. Hypothesis 1: the proposed approach is better than the baselines at predicting click rankings;

2. Hypothesis 2: the proposed approach is better than the baseline at predicting bookings;

3. Hypothesis 3: the ability of the proposed approach to predict bookings improves when the ties are broken with the predicted ranking of clicks;

4. Hypothesis 4: in case of ties in the ranking predicted by the random forest model, the current approach is better for breaking them, since it is more sophisticated.

Overall, all alternatives tested worked very well. No matter which one is chosen, or under which tie-breaker conditions, they showed great improvements over the current approach.

For the clicks, the best hypothesis was the first one, which makes sense, since the third hypothesis was not trained with clicks in mind. In the first hypothesis, both the default and the baseline tie-breaking approaches performed equally well.

For the bookings, hypotheses two and three work similarly well, with only very minute differences that do not single out one hypothesis or tie-breaking method.


5.1 Future Work

Different approaches should be tried. Two other types of algorithms were considered for this project; due to time constraints, it was not possible to implement them, but doing so would be the logical next step. The first is recommender systems. These algorithms would suggest hostels to a user based either on hostels similar to those they have previously booked or clicked on, or on users similar to them (with similar age ranges, nationalities, etc.).

The second type is dedicated learning to rank algorithms. These are more complex machine learning algorithms, but they have historically shown better results [20].


References

[1] Annual report 2018. URL: http://www.hostelworldgroup.com/~/media/Files/H/Hostelworld-v2/reports-and-presentations/interactive-annual-report-2018.pdf.

[2] Hostels worldwide - online hostel bookings, ratings and reviews. URL: https://www.hostelworld.com/.

[3] scikit-learn: machine learning in Python - scikit-learn 0.16.1 documentation. URL: https://scikit-learn.org/.

[4] US7197497B2 - method and apparatus for machine learning a document relevance function. URL: https://patents.google.com/patent/US7197497.

[5] VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352):240–242, 1895. doi:10.1098/rspl.1895.0041.

[6] Hostel, Dec 2018. URL:https://en.wikipedia.org/wiki/Hostel.

[7] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. Proceedings of the 22nd international conference on Machine learning - ICML 05, 2005. doi:10.1145/1102351.1102363.

[8] Chris J. C. Burges, Krysta M. Svore, Qiang Wu, and Jianfeng Gao. Ranking, boosting, and model adaptation, October 2008. URL: https://www.microsoft.com/en-us/research/publication/ranking-boosting-and-model-adaptation/.

[9] Xiao Chang, Qinghua Zheng, and Peng Lin. Cost-sensitive supported vector learning to rank imbalanced data set. Emerging Intelligent Computing Technology and Applications. With Aspects of Artificial Intelligence Lecture Notes in Computer Science, page 305–314, 2009. doi:10.1007/978-3-642-04020-7_33.

[10] Berthier Ribeiro-Neto and Ricardo Baeza-Yates. Modern information retrieval. Pearson Higher Education, 2010.

[11] Peter A. Flach. Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, 2017.

[12] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algo-rithm for combining preferences. J. Mach. Learn. Res., 4:933–969, December 2003. URL: http://dl.acm.org/citation.cfm?id=945365.964285.

[13] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255, Jan 1992. doi:10.1093/comjnl/35.3.243.

(48)

REFERENCES

[14] Tin Kam Ho. Random decision forests. Proceedings of 3rd International Conference on Document Analysis and Recognition, 1995. doi:10.1109/icdar.1995.598994.

[15] Osman Ali Sadek Ibrahim and Dario Landa-Silva. Es-rank. Proceedings of the Symposium on Applied Computing - SAC 17, 2017. doi:10.1145/3019612.3019696.

[16] Rong Jin, Hamed Valizadegan, and Hang Li. Ranking refinement and its application to information retrieval. Proceeding of the 17th international conference on World Wide Web - WWW 08, 2008. doi:10.1145/1367497.1367552.

[17] Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 02, 2002. doi:10.1145/775047.775067.

[18] Jen-Wei Kuo, Pu-Jen Cheng, and Hsin-Min Wang. Learning to rank from bayesian decision inference. Proceeding of the 18th ACM conference on Information and knowledge management - CIKM 09, 2009. doi:10.1145/1645953.1646058.

[19] Ping Li, Chris J.C. Burges, and Qiang Wu. Learning to rank using classification and gradient boosting, January 2008. Advances in Neural Information Processing Systems 20. URL: https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-classification-and-gradient-boosting/.

[20] Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2007. doi:10.1561/1500000016.

[21] Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jorma Boberg, and Tapio Salakoski. Learning to rank with pairwise regularized least-squares. SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 01 2007.

[22] L. Rigutini, T. Papini, M. Maggini, and F. Scarselli. Sortnet: Learning to rank by a neural preference function. IEEE Transactions on Neural Networks, 22(9):1368–1380, 2011. doi:10.1109/tnn.2011.2160875.

[23] C Shearer. The crisp-dm model: the new blueprint for data mining. J Data Warehouse, 5:13–22, 01 2000.

[24] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. Frank. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR 07, 2007. doi:10.1145/1277741.1277808.

[25] Hamed Valizadegan, Rong Jin, Ruofei Zhang, and Jianchang Mao. Learning to rank by optimizing ndcg measure. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1883–1891. Curran Associates, Inc., 2009. URL: http://papers.nips.cc/paper/3758-learning-to-rank-by-optimizing-ndcg-measure.pdf.

[26] Maksims N. Volkovs and Richard S. Zemel. Boltzrank. Proceedings of the 26th Annual International Conference on Machine Learning - ICML 09, 2009. doi:10.1145/1553374.1553513.

[27] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR 07, 2007. doi:10.1145/1277741.1277790.

[28] Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. A regression framework for learning ranking functions using relative relevance judgments. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR 07, 2007. doi:10.1145/1277741.1277792.

[29] Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, and Gordon Sun. A general boosting method and its application to learning ranking functions for web search. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1697–1704. Curran Associates, Inc., 2008. URL: http://papers.nips.cc/paper/3305-a-general-boosting-method-and-its-application-to-learning-ranking-functions-for-web-search.pdf.


Chapter 6


Appendix

Table 6.1: Matrix of Evaluation

Title | Type of Problem | Algorithm | Approach | Data Format | Evaluation Method | Metrics
Optimizing Search Engines using Clickthrough Data [17] | Classification | Ranking SVM | Pairwise | Pairwise Comparisons | Leave-one-out cross-validation | Kendall's Tau
Learning to Rank Using Gradient Descent [7] | Learning | RankNet Gradient Descent | Pairwise | Pairwise Comparisons | Repeated random sub-sampling validation | NDCG
McRank: Learning to Rank Using Multiple Classification and Gradient Boosting [19] | Multiple Classification | Gradient Boosted Tree | Pointwise | Relevance Label | k-fold cross-validation | NDCG
ES-Rank: Evolution Strategy Learning to Rank Approach [15] | Learning | ES-Rank | Listwise | Relevance Label | k-fold cross-validation | MAP/NDCG
FRank: A Ranking Method with Fidelity Loss [24] | Learning | FRank | Pairwise | Pairwise Comparisons | k-fold cross-validation | P@n/MAP
Cost-Sensitive Supported Vector Learning to Rank Imbalanced Data Set [9] | Multiple Classification | Ranking SVM | Pairwise | Pairwise Comparisons | cross-validation | NDCG
Ranking, Boosting, and Model Adaptation [8] | Learning | LambdaSMART | Pairwise | Relevance Label | cross-validation | NDCG
An Efficient Boosting Algorithm for Combining Preferences [12] | Classification | RankBoost | Pairwise | Pairwise Comparisons | k-fold cross-validation | AP/PROT
A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments [28] | Learning | Gradient Boosted Tree | Pairwise | Pairwise Comparisons | k-fold cross-validation | P@n
A General Boosting Method and its Application to Learning Ranking Functions for Web Search [29] | Learning | QBRank | Pairwise | Relevance Label | cross-validation | P@n/DCG
Learning to Rank with Pairwise Regularized Least-Squares [21] | Learning | RankRLS | Pairwise | Relevance Score | k-fold cross-validation | NDCG/P@n
A Support Vector Method for Optimizing Average Precision [27] | Classification | Ranking SVM | Listwise | Pairwise Comparisons | Repeated random sub-sampling validation | MAP
Ranking Refinement and Its Application to Information Retrieval [16] | Classification | MRR | Pairwise | Relevance Score | Repeated random sub-sampling validation | NDCG
SortNet: Learning To Rank by a Neural Preference Function [22] | Classification | SortNet | Pairwise | Pairwise Comparisons | Repeated random sub-sampling validation | NDCG/P@n
BoltzRank: Learning to Maximize Expected Ranking Gain [26] | Learning | BoltzRank | Listwise | Relevance Score | cross-validation | MAP/NDCG
Learning to Rank from Bayesian Decision Inference [18] | Learning | BayesRank | Listwise | Relevance Score | k-fold cross-validation | MAP/NDCG
Learning to Rank by Optimizing NDCG Measure [25] | Learning | NDCGBoost | Listwise | Relevance Label | k-fold cross-validation | NDCG
