Evaluation of spatial data’s impact in mid-term room rent price through application of spatial econometrics and machine learning. Case study: Lisbon

(1)

EVALUATION OF SPATIAL DATA’S IMPACT IN MID-TERM

ROOM RENT PRICE THROUGH APPLICATION OF SPATIAL

ECONOMETRICS AND MACHINE LEARNING

Case Study: Lisbon

(2)

EVALUATION OF SPATIAL DATA’S IMPACT IN MID-TERM

ROOM RENT PRICE THROUGH APPLICATION OF SPATIAL

ECONOMETRICS AND MACHINE LEARNING

Case Study: Lisbon

Dissertation supervised by:

Professor Doutor Roberto André Pereira Henriques, PhD Instituto Superior de Estatística e Gestão de Informação,

Universidade NOVA de Lisboa Lisbon, Portugal

Joel Dinis Baptista Ferreira da Silva, PhD Instituto Superior de Estatística e Gestão de Informação,

Universidade NOVA de Lisboa Lisbon, Portugal

Professor Doutor Carlos Granell-Canut, PhD Institute of New Imaging Technologies,

Universitat Jaume I (UJI) Castellón de la Plana, Spain

(3)

DECLARATION OF ORIGINALITY

I declare that the work described in this document is my own and not from

someone else. All the assistance I have received from other people is duly

acknowledged and all the sources (published or not published) are

referenced.

This work has not been previously evaluated or submitted to NOVA

Information Management School or elsewhere.

Lisbon, 28.02.2020

Mihail Petkov

(4)

A

CKNOWLEDGMENTS

An idea about making sense of economic phenomena through the lenses of geographical information have always been a research topic of interest for me. Though research and the results’ aftermath are oftentimes unpredictable, the people that guided me and supported me throughout this process were fortunately not.

I want to first and foremost thank my main-supervisor Professor Dr. Roberto Henriques that was open to hearing my ideas and helping me construct them in something practical, to Professor Dr. Joel Silva who was always there to suggest improvements and guide me through new directions and doors when all I was seeing were dead ends, for Professor Dr. Carlos Granell for believing in me, and last, but not least, to Professor Dr. Marco Painho for keeping me level-headed to push and progress until I have reached the finish line.

Additionally, my greatest gratitude goes to all colleagues from the Geospatial Technologies Master as a whole, with special shout-outs to my friend and flat-mate Itza, to Damien, Carlos, Braundt, Vicente and Pablo. Last but not least, I would like to thank all my closest family and friends in Macedonia. I am extremely grateful for all the support, patience and encouragement provided during the journey.

(5)

EVALUATION OF SPATIAL DATA’S IMPACT IN MID-TERM

ROOM RENT PRICE THROUGH APPLICATION OF SPATIAL

ECONOMETRICS AND MACHINE LEARNING

Case Study: Lisbon

ABSTRACT

Household preferences is a topic whose relevance can be found to dominate the applied economics, but whereas urban economies view cities as production centers, this thesis aims to give importance to the role of consumption. Provision to PoIs might give explanation to what individuals value as an important asset for improvement of their quality of life in a chosen city. As such, understanding short-term rentals and real estate prices have induced various research to seek proof of impacting factors, but analysis of mid-term rent has faced the challenge of being an overlooked category. This thesis consists of an integrated three-steps approach to analyze spatial data’s impact over the mid-term room rent, choosing Lisbon as its case study. The proposed methodology constitutes use of traditional spatial econometric models and SVR, encompassing a large set of proxies for amenities that might be recognized to hold a possible impact over rent prices. The analytical frameworks’ first step is to create a suitable HPM model that captures the data well, so significant variables can be detected and analyzed as a discrete dataset. The second step applies subsets of the dataset in the creation of SVR models, in hopes of identifying the SVs influencing price variances. Finally, SOM clusters are chosen to address whether more natural order of data division exists. Results confirm the impact of proximity to various categories of amenities, but the enrichment of models with the proposed proxies of spatial data failed to corroborate attainment of model with a higher accuracy.

(Nüst et al., 2018) provides a self-assessment of the reproducibility of research, and according to the criteria given, this dissertation is evaluated as: 0, 2, 1, 2, 2 (input data, preprocessing, methods, computational environment, results).

(6)

KEYWORDS

Predictive Modeling

Amenities

Support Vector Regression Spatial Econometrics Hedonic Price Modelling Points of Interest

(7)

ACRONYMS

AIC – Akaike Information Retrieval ANN – Artificial Neural Network

API – Application Programming Interface

BGRI - Base Geográfica de Referenciação de Informação BMU – Best-Matching Unit

CCDR – Comissão de Coordenação e Desenvolvimento Regional CML – Câmara Municipal de Lisboa

CPI – Consumer Price Index DoF – Degrees of Freedom DTM – Digital Terrain Model

GIS – Geographic Information Systems HPM – Hedonic Price Modelling MAE – Mean Absolute Error MD – Minimum Distance MLP – Multi-Layer Perceptron

MMH – Maximum Margin Hyperplane OSM – OpenStreetMap

POI – Point of Interest

RMSE – Root Mean Square Error SD – Standard Deviation

SDEM – Spatial Durbin Error Model SLX – Spatially Lagged X

SOM – Self-Organizing Map

SQL – Standardized Query Language SVM – Support Vector Machine

(8)

INDEX OF THE TEXT

ACKNOWLEDGMENTS ... IV ABSTRACT ... V KEYWORDS ... VI ACRONYMS ... VII INDEX OF TABLES ... X INDEX OF FIGURES ... XI 1. INTRODUCTION ... 1

1.1 THE MARKET AND ITS FLUCTUATIONS ... 1

1.2 FACTORS DRIVING A MARKET –PREDICTIVE MODELLING ... 2

1.3 RESEARCH OBJECTIVES ... 3

1.4 DISSERTATION STRUCTURE ... 3

2. THEORETICAL FRAMEWORK AND RELATED WORKS ... 4

2.1 THEORETICAL BACKGROUND ... 4

2.1.1 Artificial Intelligence and Machine Learning ... 4

2.1.2 Spatial Econometrics ... 4

2.1.3 Support Vector Regression ... 6

2.1.4 Self-Organizing Map and GeoSOM ... 8

2.2 RELATED WORK ... 9

2.2.1 Research through Application of Spatial Econometrics ... 9

2.2.2 Research through Applications of SVR ...11

2.2.3 Research through Usage of Other Algorithms ...11

3. DATA AND METHODOLOGY ...12

3.1 GEOGRAPHICAL CONTEXT ...12

3.2 PROPOSED ARCHITECTURE ...13

3.3 HARDWARE AND SOFTWARE ...13

3.3.1 PostgreSQL ...13

3.3.2 R ...14

3.3.3 Python ...14

3.4 DATA DESCRIPTION AND RESOURCES ...15

3.4.1 Room Rental Sources ...15

3.4.2 Zomato API Restaurant Data Collection ...15

3.4.3 Google Places API Data Collection ...16

3.4.4 OpenStreetMap Data ...16

3.4.5 Census Data ...18

3.4.6 Ancillary Data ...18

3.4.7 Other Variables Importance – Access Limitations...19

3.5 METHODOLOGY ...19

3.5.1 Variables Conversion ...20

3.5.2 Spatial Data Enrichment ...21

3.5.3 Data Preprocessing ...24

(9)

4. RESULTS AND DISCUSSION ...28

4.1 SPATIAL DEPENDENCE ...28

4.2 SPATIAL LAGGED X ...29

4.3 SPATIAL DURBIN ERROR MODEL ...33

4.4 SUPPORT VECTOR REGRESSION ...38

4.5 NATURAL DATA CLUSTERS USING SOM AND GEOSOM ...43

4.6 OVERVIEW OF GIVEN ANALYSIS ...46

5. CONCLUSIONS ...47

6. BIBLIOGRAPHIC REFERENCES ...49

(10)

INDEX

OF

TABLES

TABLE 3.1–VARIABLES TYPES AND NAMES ...15

TABLE 3.2-CENSUS DATA –SECTIONS DIVISIONS ...18

TABLE 3.3-ANCILLARY DATA –SOURCES AND VARIABLES ...18

TABLE 3.4–DESCRIPTIVE STATISTICS OF DEPENDENT VARIABLE ...24

TABLE 3.5–COUNTS OF SIGNIFICANT HOT-SPOT ANALYSIS CLUSTERS ...25

TABLE 3.6–HOT-SPOTS AND COLD-SPOTS OF PARISHES ...26

TABLE 3.7–COUNT OF AGGREGATED DWELLINGS IN BLOCKS ...27

TABLE 4.1- LOCAL MORAN’S I ...28

TABLE 4.2–MODELS IMPROVEMENT ...28

TABLE 4.3–SIGNIFICANCE PERCENTAGE ...29

TABLE 4.4–SIGNIFICANT VARIABLES –MODEL USING ALL VARIABLES ...29

TABLE 4.5-COUNT OF SIGNIFICANT VARIABLES IN SLX ...30

TABLE 4.6–MODEL SUBSETS STATISTICS ...31

TABLE 4.7–TOP 20SIGNIFICANT VARIABLES –SEMMODEL WITH ALL VARIABLES ...33

TABLE 4.8–PERFORMANCE EVALUATION IN SDEM ...34

TABLE 4.9–MAE AND RSQUARED –PERFORMANCE EVALUATION ...34

TABLE 4.10–STATISTICS OF MODEL PREDICTIONS ...36

TABLE 4.11–HYPERPARAMETER OPTIMIZATION ...39

TABLE 4.12–BEST PERFORMING MODELS ...39

TABLE 4.13–MAE ON BEST RBFKERNEL MODELS ...39

TABLE 4.14–VARIABLES CONTRIBUTIONS;MODEL M3 AND M4 ...40

(11)

INDEX

OF

FIGURES

FIGURE 2.1-TYPES OF WEIGHTED NEIGHBOURHOODS ... 6

FIGURE 2.2-TUBE WITH RADIUS Ε (SCHÖLKOPF,2002) ... 7

FIGURE 2.3-SELF-ORGANIZING MAP - I/O SPACE (HENRIQUES,BAÇÃO,&LOBO,2009) ... 9

FIGURE 3.1-STUDY AREA DIVISIONS:PARISHES VERSUS LEVEL 4CENSUS BLOCKS ...12

FIGURE 3.2–PROPOSED ARCHITECTURE ...13

FIGURE 3.3-TESSELLATION OF STUDY AREA FOR MAXIMIZATION OF POIS COVERAGE ...16

FIGURE 3.4-METHODOLOGY ...19

FIGURE 3.5–TOP 20WORD FREQUENCY IN ROOM DESCRIPTIONS ...20

FIGURE 3.6-COUNTS OF FEATURES PER CATEGORY A ...21

FIGURE 3.7-COUNTS OF FEATURES PER CATEGORY B ...21

FIGURE 3.8-ORIGIN (ROOM),DESTINATION (POI) AND EDGE NETWORK DBSCHEMA ...22

FIGURE 3.9-ROAD NETWORK ACCESSIBILITY FROM A SAMPLE ROOM ...23

FIGURE 3.10-EXAMPLE:MINIMUM DISTANCE TO A MALL (IN METERS) ...24

FIGURE 3.11-ANSELIN LOCAL MORAN’I ...25

FIGURE 4.1-SIGNIFICANT VARIABLES FROM MODEL USING ALL VARIABLES ...31

FIGURE 4.2-COUNT OF BEDROOMS IN APARTMENT WHEREIN THE ROOM RESIDES ...32

FIGURE 4.3-SIDE BY SIDE –ORIGINAL PRICE VERSUS SDEMPREDICTION ...35

FIGURE 4.4–RESIDUALS IN SDEM ...35

FIGURE 4.5-RESIDUALS DISTRIBUTION ...36

FIGURE 4.6-PRICE VS.RESIDUALS SCATTERPLOT ...36

FIGURE 4.7-BOXPLOTS OF 4HIGHLY SIGNIFICANT PROXIES -SDEM ...37

FIGURE 4.8–ELECTRONIC STORES PER PARISH ...38

FIGURE 4.9–SVMREGRESSION –RESIDUAL ERRORS ABOVE SDTHRESHOLD...41

FIGURE 4.10–MODEL COMPARISON IN RESIDUAL ERROR ...41

FIGURE 4.11–PRICE ERRORS THROUGH BRAKES ...42

FIGURE 4.12–RESIDUAL ERROR COMPARISON ...42

FIGURE 4.13–SOMCLUSTERS –MODEL M4 ...43

FIGURE 4.14–GEOSOM–M3BINARY VARIABLES ...44

FIGURE 4.15–GEOSOM–MODEL M3 ...44

(12)

1. I

NTRODUCTION

1.1 The Market and its Fluctuations

Apartment and room rent price changes are frequent and with an ever-increasing inflation of a monetary currency, terms such as “price hike” describe the phenomena of abruptly increasing prices that are often experienced in many places, be it countries’ capitals, metropolitan cities or countries. As the financial capacity of the citizen oftentimes lags behind market supply’s offer, rent control seems like a logical step. This is quite evident from places like San Francisco, among others, that have strict rent control to avoid exceeding the maximum allowed annual rent increase allowed. Hence, rents in these cities are limited from above by ceilings imposed by a government or a local administration. (Larsson, Rong, & Thomson, 2014).

However, rent regulation does not bring only advantages to the market, as unexpected side effects are oftentimes experienced. Although San Francisco has thorough rent regulation, it still is one of the most unaffordable cities in The United States. Evidence suggests that with a demand shock and a relatively low annual rent increase cap (60% of regional CPI rate), there is an increasing number of holders choosing to withdraw housing units from the rental market, even at cost of leaving them vacant (Asquith, 2019). Looking at The European market, Germany has also placed modest rental brakes on rent increase. Contrary to the policymakers plans, the law has at its best, no impact in the short run, and at its worst, it accelerates the price rent increase much closer to the imposed rental brakes (Kholodilin, Mense, & Michelsen, 2017).

The aforementioned markets were never so old, comprehensive and rigid as laws imposed in Portugal that were in force until recently. These laws had allowed for lifelong contracts with only minimal, inflation-linked price increase to be handed down through multiple generations. This law had been discontinued in 2012, when the government had presented a new law amending the New Urban Lease Act Law 6/2006 to ensure equally the rights and obligations of landlords and tenants. This legislation had set the five-year period of transition from the old lease contracts to a new regime of free rent, and introduced measures to broaden the conditions under which renegotiation of open-ended residential leases can take place, measures to phase out rent control mechanisms, and limits the possibility of transmitting the contract to first degree relatives (Branco & Alves, 2015). Moreover, an introduction of a golden visa scheme was introduced in 2012 that have encouraged investors to also buy Portuguese property, and with this, the real estate market had also endured a 27 percent rise within a 4-year period in the timeframe of 2014-2018 (Colectivo Marxista Lisboa, 2019). Last, but not least, the increase of touristification creates changes to dwellings, resulting in undergoing processes of renovation in both historical and non-historical areas of the city (President & Marques, 2018). Bearing this in mind, the regeneration of the inner city then favors the movement of affluent, middle-class residents into these areas, triggering gentrification (Gravari-Barbas & Guinand, 2017). Citizens have not taken these trends lightly, as the process of gentrification in neighborhoods like Alfama and

(13)

have made Lisbon’s old town rent market exceed the financial capacities of the average Lisbon’s worker (C.A. Brebbia & J.J. Sendra, 2017). As a response to these changes there has been a wave of anti-tourism marches that has slowly spread across multiple cities in Europe including Lisbon and Barcelona (Hughes, 2018).

1.2 Factors Driving a Market – Predictive Modelling

The analysis of rent is focused mainly on more tangible variables that pertain to a household. These can be very few, and include an apartment’s square meter area, street, and the year the apartment and building has been constructed, or otherwise, include as many variables as are publicly available to be assessed with. Various research works bring forward price prediction models that use data from Airbnb (Kakar et al., 2016), Twitter (Bing, Chan, & Ou, 2014) and other local websites like Sofang (Li, Ye, Lee, Gong, & Qin, 2017). However, literature does often not consider important spatial modules like proximity of buildings with important cultural heritage, distance of a point of interest (public transport stops, closest university, green areas library or co-working space), or location being rated as undesirable, oftentimes as a consequence of spatial factors.

It is of high importance to create a clear distinction between a real estate market, apartment rent market and the market of renting a room. Research made through applying correlations and Granger causality tests on the available data of Lisbon had given results that the housing prices are a good indicator that has causal implications over apartment’s rent. Nevertheless, when re-validated using National Data, neither of the variables has been proven significant to refer to it as solid proof of being the cause of the other (Pereira, 2017). Another long-term temporal study of the Singaporean real estate market using co-integration analysis had also failed to establish a relationship between the two on nationally estimated indexes (Baltagi & Li, 2015). Therefore, a study covering the real estate market does not necessarily offer proof to be important in the price of the rent of the same apartment in question. Subsequently, room rent is a third different term that denotes an entity whose specifications of the outside of the room might become of secondary interest if they are to be shared with strangers, or landlords (and their family). Hence, research that has offered great findings of the real estate market, rent of an apartment, or even short-term rents does not directly compare to outcomes of this case study which is evaluating the impact of the mid-term rent prices which correspond to a monthly duration of service they are intended for.

Today, a lot of data is available to assess and use as a contributing factor in building a model, but oftentimes the standardization of this information lacks uniformity and proper integration. Moreover, a study that captured data from 4 years ago might have had different findings as the market endures a constant shift, both by internalized (economy, laws, wage, inflation) and external factors (the area wherein it’s located might increase in desirability as new spatial proxies appear in the vicinities).

The prediction of price represents fitting a continuous range of numeric variables through other explanatory data. Hence, it is a regression problem and it requires a linear

(14)

or non-linear model using exogenous variables at disposal to yield an output able to explain the target variable of interest.

Many intelligent algorithms have been developed in recent years that make use of Machine Learning (ML) to solve regression problems. Some of them have a black-box design that restrains the user from fully understanding the model built as is the case with Multi-Layer-Perceptron (MLP). Others, however produce and outline information importance, as is the case of Support Vector Regression. Other, more traditional approaches include Hedonic Price Modeling (HPM), although various slight modifications of the algorithm exist to account for the spatial dependence of the data. However, mid-term rental as a research topic has been a highly under-represented case, and even then, spatial variables, usually play a secondary part in the analysis. Subsequently, the idea behind the thesis was to incorporate methodologically constructed spatial data proxies in predictive models and analyze the possible impact they might have through tuning their configuration mechanisms and analyzing the outputs.

1.3 Research Objectives

The thesis aims to evaluate the influence and significance of spatial variables and corroborate proof of whether these data provide aid in the construction of reliable prediction models of mid-term room rental prices in the specified study area of Lisbon. The methodological approach applied is to be based on spatial econometrics and machine learning.

To achieve the aim given, specific objectives were defined, as follows:

• Analyze available points of interests within the study area and divide them into generalized categories from which knowledge can be extracted

• Review and apply suitable spatial econometric models of HPM to evaluate the significance of the variables provided

•_{Design and implementation of SVR for creation of a model that best fits the} data. Evaluation and review of the accuracy metrics with, or without the new proposed variables

• Application of the Self-Organizing Map (SOM) algorithm through the use of GeoSOM software to check for correspondence of more natural clusters of the listings

1.4 Dissertation Structure

The dissertation is organized in six chapters. Chapter 1 gives introductory remarks, aims and objectives. Chapter 2 expands on the background and related works. The 3rd Chapter describes the contextual geographical location of the study area chosen, the dataset used, as well as the methodology undertaken to give spatial context to variables created from spatial features. Then, Chapter 4 comprises results and discussions thereafter. Finally, Chapter 5 summarizes the research work by portraying conclusions.

(15)

2. T

HEORETICAL

F

RAMEWORK AND

R

ELATED

W

ORKS

This chapter makes a brief introduction of the theoretical concept and defines the frequently used terminologies in the dissertation

2.1 Theoretical background

2.1.1 Artificial Intelligence and Machine Learning

Artificial Intelligence mimics the intelligence of a human brain, and is applied to, and

subsequently, demonstrated by machines. The field studies intelligent agents. Hence, it utilizes the detected information through computational techniques to understand patterns and to learn how to solve given problems. Artificial Intelligence can be relevant to any intellectual task, and most prominent ones include medical diagnosis, self-driving cars, and image and speech recognition.

Machine Learning is a subdomain of AI wherein machines learn to perform a task

without explicitly being given instruction, but by identifying existing patterns and gaining knowledge about continuous decision-making. Various algorithms have come to be developed throughout the years, including SVR, MLP and RF. Today, machine learning is having widespread application as it brings a large value to the ever-growing heaps of data being generated in a technology driven world. Its high computational abilities has allowed it to protrude in various sectors including finance, banking, insurance and internet fraud detection amongst others.

2.1.2 Spatial Econometrics

Spatial Econometrics is the field that interweaves econometrics and spatial analysis.

Instead of understanding data observations as independent observations, new models are put into place that also account for the rule of geography wherein closer things are more similar (Tobler, 2004), and give meaning to neighbourhood effects and spatial spillovers.

Hedonic Price Theory suggests that a property can be viewed as an aggregation of many

individual components or attributes. Looking from the consumer perspective, people are primarily purchasing goods that are embodying many attributes that maximize their underlying utility functions (Rosen, 1974). Hence, this technique had grown to be present in giving value to various commodities that include, but are not limited to residential amenities (Bourassa, Cantoni, & Hoesli, 2007), benefits of environmental improvements (Harrison & Rubinfeld, 1978; Kong, Yin, & Nakagoshi, 2007), and wildlife recreation resources (Messonier & Luzar, 1990).

The Hedonic Price Model in its simpler nature describes regression wherein the target

attribute is a price of an amenity. For example, one can expect that attributes giving value to the price of a house should include land size, age of house, types of rooms inside and their count.

(16)

Ordinary Least Squares (OLS) is a simple HPM regression that takes the form of

equation (1):

𝒚𝒚𝒊𝒊= 𝛂𝛂 + 𝜷𝜷𝟏𝟏𝒙𝒙𝟏𝟏+ 𝜷𝜷𝟐𝟐𝒙𝒙𝟐𝟐+ 𝜺𝜺𝒊𝒊 (𝟏𝟏)

Where yi would be the ith observation of the dependent variable, x is a vector of all

explanatory variables, and α is an intercept that holds the value of y when all

explanatory variables are equal to 0. 𝛽𝛽is the vector of the corresponding coefficients of the estimates predictors and 𝜺𝜺is the random error.

When dealing with given data, the OLS model has certain assumptions, whose violation would lead to an inappropriate model. One of the assumptions specifies that finding spatial autocorrelation in the residuals of the predictions shows that there is a spatial pattern in the data under analysis and more appropriate models exist that can account for it.

Spatial Lagged X (SLX) portrays a spatial regression model that extends to include

explanatory variables observed on neighboring cross-sectional units terms. In order to improve the model, a spatial exogenous autoregressive lag is introduced (Elhorst & Halleck Vega, 2017). The equation (2) would then take the form of:

𝒚𝒚𝒊𝒊𝒊𝒊 = 𝐖𝐖𝒙𝒙𝒊𝒊𝒊𝒊+ 𝐗𝐗𝒊𝒊𝒊𝒊𝜷𝜷 + 𝜺𝜺𝒊𝒊𝒊𝒊 (𝟐𝟐)

Then, 𝐗𝐗𝒊𝒊𝒊𝒊 represents the autoregressive coefficient that produces the global spatial spillover effect. If this comes to be equal to zero, the model simplifies to a conventional linear regression model (Lesage, 2008). Nevertheless, this approach has its drawbacks because assumptions are made that all the spatial dependence among residuals is caused by variables one can measure.

Spatial Durbin Model is an extension of SAR that includes the average of the variables

from the neighboring houses, but it extends to allow for global diffusion of shocks to the model disturbances. This means that the spillover can be felt over multiple regions, but the decay parameter λ controls that higher-order neighbors (indirect ones) to receive less impact than the direct ones (Lesage, 2014). Additionally the Error addition within this model creates the SDEM model that accounts for spatial dependence among the error terms and redistributes this error among the sample. The equation in vector form takes the equation (3) (LeSage & Pace, 2009):

𝒚𝒚𝒊𝒊𝒊𝒊= 𝐃𝐃𝒊𝒊𝒊𝒊ծ + 𝐖𝐖𝒚𝒚𝒊𝒊𝒊𝒊𝝆𝝆 + 𝐗𝐗𝒊𝒊𝒊𝒊𝜷𝜷 + 𝑾𝑾𝑾𝑾𝑾𝑾𝑾𝑾 (𝟑𝟑)

𝑾𝑾 = 𝝀𝝀𝑾𝑾𝑾𝑾 + 𝝐𝝐

Dummy Variables are binary variables that equal ‘1’ if the target variable is described

to either, contain something (e.g. terrace with seaside view), or it belongs somewhere (e.g. an apartment belongs to the parish of Alfama). ‘0’ then describes the opposite case. Each of the variables can have a positive or a negative relationship with the price. In an ideal case, minimum distance to an amenity makes a price rise. Nevertheless, the market can behave unpredictably or interact with other variables to yield a negative relationship. For example, a rent of room in an apartment where the ratio of number of

(17)

bedrooms divided by number of bathrooms is high might yield a negative effect. Additionally, having a minimum distance of clubs below 20 meters might dissuade potential tenants away from a property.

Direct effect: Change of a particular entity of a listing (e.g. renovation of apartment

from 1979 to 2019) changes the price target variable.

Indirect effect: Measures the impact on the price of a target variable from changing an

exogenous variable in another data point.

Weight Matrix - A matrix W is a matrix of order n x n where n is the number of regions.

Non-zero elements in row i and column j in the matrix then represent the element ni is

a neighbor of element nj. Rook and Queen each refer to two common ways to calculate

statistics for focal cells, and these are known as Moore’s and Neumann’s neighborhoods. These spatial weights are calculated, such that each element in a matrix represents a spatial weight between two neighbors. The spatial weights Wij are non-zero

when i and j are neighbors, or zero otherwise. The difference between the two types of neighborhood is that in Rook contiguity, common vertices do not make two polygons - neighbors. The use of queen neighborhood was the chosen approach in order to reflect real-life phenomena where if two neighborhoods meet at a common corner, they would still cause influence between one-another.

Figure 2.1 - Types of Weighted Neighbourhoods

2.1.3 Support Vector Regression

SVM originated from an algorithm implementation from 1995 by Cortes and Vapnick (Cortes & Vapnik, 1995). The algorithm had been first introduced in order to solve pattern recognition problems. SVM is described as an algorithm that makes use of a hyperplane constructed in a higher dimensional input space that separates the data in this high dimensional space optimally (Suykens & Vandewalle, 1999). Nevertheless, what began as a classification problem, rose to application of solving function estimation problems. These findings proceeded to additional inclusion of optional parameters like epsilon’s loss function and introduction of different types of kernel functions in which data can be more easily divided. This way, there could be different solutions produced with different complexities and different thresholds of errors in order to build the most cost-efficient model that could be deemed a good enough fit for the purpose it had been built for.

(18)

There are infinite possible ways to construct a hyperplane, and in classification, the best hyperplane can be determined as the one that holds the largest distance between hyperplanes and the support vectors. Subsequently, detection of SVs and Lagrange Multipliers demonstrate the records important for building the model (Witten, Pal, & Fourth, 2017).

Epsilon within SVR

With given training data {(x1, y1) …. (xe, ye)} where X ⊂ X × R, where X denotes the

space of the input patterns, SVR is looking for a function f(x) that would have at most ε deviation from the actual targets for all training data, whilst maintaining the largest flatness (Smola & Schölkopf, 2004). A linear fit of a regression function is presented in Figure 2.2:

Figure 2.2 - Tube with radius ε (Schölkopf, 2002)

If all data points fit within the tube presented, the fitted model then has an error 0. In this way, a model can be built with an infinitely high value ε. This would lead to no penalty given to any data point and an error of 0, but this would be a meaningless accuracy measure.

In a linear case the SVR function takes the form of Equation (4)

𝑋𝑋 = 𝑏𝑏 + ∑i α a(i) ⋅ a (𝟒𝟒)

The dot notion can be substituted by various kernel functions if a non-linear problem is presented. Most common kernels used are:

• Linear kernel Xi – Equation (4)

𝐾𝐾(𝑥𝑥𝑖𝑖, 𝑥𝑥) = 𝑥𝑥𝑇𝑇𝑥𝑥𝑖𝑖 (𝟓𝟓) •_{Polynomial kernel – Equation (5)}

(19)

• Radial Basis Function kernel – Equation (6)

𝐾𝐾(𝑥𝑥𝑖𝑖, 𝑥𝑥) = exp �−𝛾𝛾�|𝑥𝑥𝑖𝑖− 𝑥𝑥|�2� , 𝑤𝑤ℎ𝑒𝑒𝑒𝑒𝑒𝑒 𝛾𝛾 =_2𝜎𝜎1₂ (𝟕𝟕)

Only the support vectors are important for the model – the deletion of all other rows bears a coefficient 0 and hence, does not change the outcome of the predictive model. Another variable whose value plays importance in building the model is the regularization parameter C that denotes the limit of the absolute value of the coefficients αi, which gives the limit of how much it is needed to avoid misclassifying a training

example. The limit cases then become the following:

I. the larger C becomes, the closer the function can fit the data in the hyperplane and smaller-margin hyperplane will be needed

II. an ε of 0 creates a least-absolute-error regression with constraint C III. A large value of ε just outputs the flattest tube that encloses all data

Gamma (𝛾𝛾) is a specific parameter for radial basis kernel that defines how far an influence of a single training example reaches. A large gamma would correspond to more support vectors having influence on the hyperplane.

2.1.4 Self-Organizing Map and GeoSOM

The Self-Organizing Maps are a clustering algorithm introduced in the 1980s (Kohonen,

1982). The main idea is to map high-dimensional data, into dimensions from which the human eye can understand and extract patterns. The units by themselves can be loosely or closely connected to each other, and the deciding clusters are decided by the user. SOM algorithm can have different distance metrics, but most often makes use of the

Euclidean distance as a measure of closeness between two arbitrary units. Once a SOM

has been trained with a given pattern, all the units move towards the best matching unit (BMU). At the end, patterns that are similar in the input space should also carry this behaviour in the output space.

GeoSOM is an adaptation of SOM to account for the specificity of spatial data. The

search of the BMU takes part in two cycles. Instead of searching for a BMU throughout the whole dataset, it tries to find a neighbouring one that is limited in the search radius with the parameter k. The output space in SOM usually takes the form of a 2-dimensional space (Kohonen, 2001) as the easiest one to visualize.

(20)

Figure 2.3 - Self-Organizing Map - I/O space (Henriques, Bação, & Lobo, 2009)

There are a number of important parameters that need to be tuned when training a dataset with SOM. These include learning rate, radius, a number of iterations, and they need to be chosen accordingly to minimize a model’s topological error.

2.2 Related Work

Throughout this subsection, references and links to studies regarding price prediction are mentioned to gain a general overview of what research has been applied and the correspondent findings follow. For a clearer structure, the related works are divided into three sections, one denoting Spatial Econometrics’ related studies, a second for SVR related studies, and a third for giving briefing of other algorithms’ usage.

2.2.1 Research through Application of Spatial Econometrics

Research shows indisputable evidence of the impact of the cultural heritage on the real estate prices in the city. A case study of the city of Lisbon finds proof that a protected zone can produce a negative impact on the house value but it also produced findings that when accounting for the heterogeneity of the areas under question, the effects had seemingly disappeared (Franco, Macdonald, Franco, & Macdonald, 2016).

Another study has revealed that overall, historic amenities generally give rise to dwelling prices in order of 4.2%, but once the radius broadened, a high concentration of these historical amenities started to yield a slightly negative effect of 0.1% (Franco & Macdonald, 2016).

When analyzed, these papers bring about conclusions that living in a proximity of an amenity with some degree of cultural importance gives exposure and identity of an apartment being in the center of a local cluster (a desirable residential area). Living in close proximity to many of these objects on the other hand had shown a correlation to

(21)

it being a highly touristic area, which in turn is valued as a very desirable dwelling for one’s day-to-day life.

Other research has also found tax exemptions to drive positive spillovers to nearby properties (van Duijn and Rouwendal 2012, Ahlfeldt, Holman, and Wendland 2012, Coulson and Lahr 2005).

What’s more, using a HPM on real estate prices have found Lisbon’s ‘Metro Station Proxy’ variable coefficient to give between 3.49% and 5.18% rise in prices with rated positive accessibility to a single metro line, or between 4.62% and 6.17% for accessibility to two metro lines and even larger for the rail accessibility (Martínez & Viegas, 2009).

Additionally research have also used additive hedonic regression models to account for the heterogeneity in the rent market (Brunauer, Lang, Wechselberger, & Bienert, 2010) for real estate prices in Vienna. This had led to conclusions about discoveries of submarkets which have as a significant factor a ‘Belonging’ of listings in the specific city’s districts.

Other sources (Moro, Mayor, & Lyons, 2013) include a study over the city of Dublin as a case study where heritage sites within the city were categorized as factors using a limited number of categories. This study had given a small but significant positive score to certain types of categories (historic buildings and memorials) as a spillover effect to apartments in the vicinity, although other categories (archeological sites) had produced a negative one. The study had been made through producing dummy variables as flags of whether a certain type of building was in a radius of a chosen amenity. Variables pertaining to metros and bus stop as means of public transport had also been used and it had been noted that living within 100 meters of a metro station resulted in a positive correlation with a price increase of an apartment (Franco et al., 2016).

(Poort, 2015) have moreover argued that the presence of cultural heritage attracts highly educated households using their case study of the Netherlands. This factor had also caused a spillover to industries in the country investing and preferring to reside where the elite households of the country reside (Marlet and Woerkens, 2005).

Last, but not least, other study findings include cultural heritage sites itself carrying a multiplier effect to other amenities including restaurants and shops (van Duijn, 2013). This gives an idea of linear dependencies between exogenous variables could clutter or bias an analysis and make for an overtly complex model.

These studies, for the most part, deconstruct many spatial variables segregated and analyzed separately (as objective of understanding single variable’s importance) rather than fitting one model using many. The majority of them use Euclidean distance to calculate proximity, wherein a PoI can reside anywhere within a proposed radius threshold of an amenity in question.

(22)

2.2.2 Research through Applications of SVR

Finding using the SVR include (Chen, 2010) that discuss that stratification of Shanghai’s market into more homogeneous subsets provides considerable benefits as contrast to taking the aggregate of the whole Shanghai’s rent market. The approach used a hedonic appraisal model as a preprocessing step of choosing the right variables before building the SVR. Another paper has built a real estate forecasting model based on particle swarm optimization (PSO) before applying the SVR (Wang, Wen, Zhang, & Wang, 2014). Optimization through HPM seems to be a recurring theme when using SVR and SVM. In this way, the significant variables need to be detected first before building the model so the time complexity also lowers substantially, whilst also making sure the model’s accuracy does not suffer from the multicolinearity problem.

2.2.3 Research through Usage of Other Algorithms

A recent study of real-estate price evaluation was done through the usage of both NN and HPM as a way of comparing the two algorithms. Nevertheless, this research was concentrated on producing minimum MAE accuracy model for predictions and does not give further info about the singular influence of unique variables in building the model (Safronov, 2017). This is because of the “black box” nature of ANN in which multiple neurons are exchanging information and are updating their weights, but to the outer world not a lot of substantial information is given through which the observer can get many conclusions. Another study was done using ANN on the island of Cabo Verde using 1092 data points with a yearly temporal scale between 2009 and 2014. This study also came about as a comparison between using the algorithm of ANN and Random Forest and draws conclusions of MLP producing higher errors than the latter algorithm (Ester & Martins, 2016). The most important variables there were found with the model fit, and among them, the location of the apartment and the square meter of the apartment seemed to cause both highest significant and positive influence, with the closeness of public institutions and an existing balcony in the apartment computed as ones with least importance. ML studies often have a lot of research concentrated solely on finding the best model with the least error (Cristina and Teixeira 2009, McCluskey et al. 2013, Limsombunchai et al. 2004) with using NN and MLP’s as ML standards that manage to build models that capture the data’s behavior the best, producing the smallest error. One can also note usage and comparison of Random Forest in real estate predictions in Ljubljana (Kilibarda, 2018), where findings suggest in the study area under evaluation, it performed significantly better than other ML algorithms. These are just a few small references of the vast usage of many different algorithms within the ML field that could be used to assess this type of data, although it must be noted that every algorithm varies in the output information and hence, the type of analysis it allows for.

(23)

3. D

ATA AND

M

ETHODOLOGY

This chapter explains in detail the undertaken steps to produce this dissertation. First, a geographical context is given in Section 3.1 wherein general information about the encompassing study area is introduced. In Section 3.2 a brief overview of the proposed architecture. The chapter continues with Section 3.3 thereupon giving a synopsis of the software and algorithms that are imperative for producing new spatially consistent data. Section 3.4 then describes the data sources and variables choices. The final, Section 3.5 portrays how the different undertaken parts of the methodology come together to encapsulate the analysis.

3.1 Geographical Context

Lisbon is the capital and largest city of Portugal, with a bounding box describing a latitude range between -9.2379 and -9.0863 and a longitude range of 38.6800 and 38.7967. The administrative area of the city covers approximately 100.05km2_{, and the}

city’s last census had recorded 505,526 citizens. A clear distinction is made between Lisbon’s administrative area and the urban area that extends far beyond these limits. The positive elevation goes from 0 at its minimum up to 227 meters of altitude.

The city has a circular physical configuration covering approximately 12 kilometers both east to west and north to south. Official divisions divide it into 24 parishes, or in 3623 subsections at the lowest mapping level. Maps of the parishes’ polygons and the level 4 subsections are shown in Figure 3.1 below.

(24)

3.2 Proposed Architecture

The purpose of the methodology is to try to incorporate many different sources, according to the availability of the data and the categories of data points they offer. The policy of transparency of the EU gives space to use open data freely which helps in the process of gathering an abundance of data that could be inter-checked for reliability between the different sources and gives the choice between using sources with the biggest quantity, biggest quality or integrations. Nevertheless, this also brings about weight and complexity for computations needed for replicating the analysis. For easier processes, all data is imported to a PostgreSQL database, from where it could be easily exported and read in both QGIS and RStudio. This integration of environments provides a robust bridge for easier data conversion, manipulation and additional parallelization of computational processes. The proposed architecture is found in Figure 3.2.

The Figure shows little detail of the R analysis, although the packages used are to be mentioned in the following section.

Figure 3.2 – Proposed Architecture

3.3 Hardware and Software

Concerning hardware, the research used an Intel Core i7-7700HQ with 16 GB of RAM capacity, under a Windows 10 operating system.

Regarding software, the majority of the undertaken analysis was done within PostgreSQL and R. Additionally ancillary software was used for visualization including QGIS, ArcGIS Pro, as well as minor checks and flat files of the dataset for versioning history as excel files.

3.3.1 PostgreSQL

This is a free and open-source relational database management system that extends the SQL language where users can safely store and scale complicated data workloads (The PostgreSQL Global Development Group., 2014). When dealing with spatial-data, the

(25)

database capabilities can be accompanied by additional add-ons. pgAdmin4 version is used, alongside the following extensions:

• PostGIS: Allows for geographic location queries to be run in SQL (PostgreSQL

Global Development Group, 2020).

• pgRouting: Enables a road network analysis to be performed; It is used to

calculate the shortest path to places through pgr_drivingDistance function (OSGeo Foundation, 2018)

3.3.2 R

R is a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities (R Core Team, 2019). The language is able to be executed on many operating systems free of charge. The RStudio, subsequently, represents an interface for the execution of R code. Packages that make the analysis possible are listed as follows: • _{sf: Package that allows for reading, writing, manipulating and visualizing}

spatial data and geometries, to and from R (Edzer Pebesma, 2019)

• RPostgreSQL: Database interface and ‘PostgreSQL’ driver for R (Conway et

al., 2019). With this package direct connection to the PostgreSQL database is made, and tables and queries of the database can be directly imported and saved as variables within the RStudio environment

• dplyr: A tool used for easier manipulation of data frames (Wickham, 2019)

• magrittr: Operator that forwards a results of an expression to the next function

that comes after (Stefan Milton Bache and Hadley Wickham, 2014)

• caret: A comprehensive framework for machine learning models (Kuhn, 2020)

• e1071: A package used for intuitive application and analysis of the SVR

algorithm, one among many probabilistic clustering and regression models (Meyer, 2019)

• spatialreg: A collection of estimation functions for spatial cross-sectional

models (Roger Bivand 2019). Library used for applying HPM on the data • spdep: A collection of functions to create spatial weights matrix objects from

polygon ‘contiguities’ by distance and tessellations (Roger Bivand 2019). This package is used for applying functions in the HPM spatial regression analysis.

3.3.3 Python

Python is an interpreted, high-level, general-purpose programming language. The language can be used for any computational purposes, although for the means of the dissertation, it was used for finding specific-purpose API Wrappers. The API Wrapper then provides an interface between an API and a programming language, so all provided API methods for data collection can be made through the desired programming language’s function calls and wrappings.

(26)

3.4 Data Description and Resources

In the following section, all data used for this study is to be briefly described.

3.4.1 Room Rental Sources

Listed below are the two data sources used. The data collected was used only for the means of this research, and no personal data was collected that could expose an individual’s identity.

Uniplaces is an innovative service, and a global brand primarily used for mid-term

accommodation, which thus puts the average target of accommodation between 6 months and 1 year of contract binding, targeted at students. Nevertheless, the service is not necessarily discriminative towards anybody that can potentially apply for booking accommodation at one of the listed rooms and apartments (Uniplaces, 2019). There are room listing for many cities and information taken for the case study contains the variables shown in Table 3.1

Variable Type Variables

Numerical Number of bedrooms, Number of Bathrooms, Price of Room, Room Square Meters, Apartment Square Meters, Minimum Stay

Binary Overnight guests, Full-time employees (Allowed), Pets, Smoking, Domestic Students, Type of Bed, Preferred Gender, Private Bathroom, Landlord Gender, Outdoor Area, Elevator

Text Attribute Description of Room/Аpartment, Review

Categorical Bills Inclusion, Cleaning Frequency

Table 3.1 – Variables Types and Names

Most of the binary variables can have an N/A value depicting missing information (the exceptions being Type of Bed, Type of Booking and Private Bathroom). Moreover, Room Square Meters and Apartment Square Meters are not mandatory variables to be input by the host.

BQUARTO (BQUARTO, 2010) is the second web source that offers similar services

as Uniplaces, but its only target market is the country of Portugal. The variables at disposal portray a similar pool, however there are no reviews, and the items listed as available within the room are far scarcer to be mentioned. The two datasets are merged and the variables unified to get a total of 2149 unique rooms.

3.4.2 Zomato API Restaurant Data Collection

Zomato is a food delivery start-up founded by Deepinder Goyal in 2018. It contains

information, menus, reviews and delivery options from many restaurants throughout the world (Zomato, 2019). Zomato’s API allows for easy retrieval of restaurants within a city, and it offers multitude of variables including restaurant types, ratings and price range.

This website includes an API that allows 1000 free requests a day, and the city of Lisbon bears the id of “82”. Conversely, once a call has been made, the server returns only 20 most important PoIs from which all related information can be extracted. To be able to

(27)

extract all PoIs, a common approach would be to search within the radius of a limited distance. To start, an arbitrary radius of 1km is chosen as a starting point, with results sorted by distance. If the requests yield less than 20 restaurants, then this grid size covers all existing restaurants within the proposed radius of that sample point. If not, the area is further divided within a smaller level grid to collect new data. This behaviour can be repeated, as diving deeper into smaller tessellating square grids covers smaller areas until all records are extracted.

A library that provides ready wrapper APIs for Zomato is “pyzomato”, and only a key from Zomato which has not exceeded the daily allowance of requests is needed (MIT, 2016). The data collected from Zomato counts 8629 unique restaurants.

3.4.3 Google Places API Data Collection

Google Places API has a similar way of functioning, where a single API request would

target points of interest near a coordinate, ordered by importance. There is a “next page” redirect, which returns the following page with 20 new results. Still, the pages have a maximum of 3 redirects. This behaviour needs a denser tessellated grids, as it also contains a denser count of amenities within a chosen radius since PoIs can be obtained from any commercial business of interest, as well as buildings as objects (Google-Cloud, 2019). Google Places provides a division of 97 types of PoIs, and the outputs contain exact coordinates of the records, and are thus suitable for further analysis.

Figure 3.3 - Tessellation of Study Area for Maximization of PoIs coverage

3.4.4 OpenStreetMap Data

OpenStreetMap (OSM) is a project that aims for a free geographic database of the

(28)

gathering of information from people that want to contribute towards this collection. The statistics have recently risen to count 5.7 million users registered on its platform (OpenStreetMap, 2020). A common middle ground of users that are producing content stands at 20%. This, however, describes a common statistic for applications that rely on users bringing in its value (Haklay & Weber, 2008).

OSM PoIs are downloaded through a German geofabrik server that stores OSM data

about various points of interests divided by categories (Geofabrik, 2019). Partial usage of categories from the PoIs is available, and the choices are explained further in Section 3.5.2.

OSM Road Network – The creation of routes from an origin to a destination is

commonly referred to as the shortest path problem. Each node can have many edges, through which one or multiple other nodes can be reached. This describes the graph theory which lies behind the construction, visualizations and computations within road networks of a city. The OSM data for the proposed network is created through an import tool called osm2pgrouting. The tool allows for the creation of a network where two tables are present that carry the weight of nodes and the vertices through which cars, pedestrians and/or bikes can subsequently traverse from an origin to a destination at a certain cost (given in meters). The configuration of the network lies in a configuration file (mapconfig.xml). Highways are removed, as the proposed places of interest should be reachable to commuting pedestrians, rather than vehicles.

OSM Data Quality Concerns - Although Google and Zomato keep a constant flow of

updates of their database to remain a competitive force in their respective field on the web, OSM’s data quality portrays a unique issue with crowd-sourced contributions wherein no quality control procedure to guarantee the safeness of its use is present (Haklay & Weber, 2008).

One common quality of data is its completeness, but the specificity of spatial data makes this an unfeasible task since things can change vastly within temporal scale events and spatial point features, could turn to ruins through a natural disaster. What’s more, the data completeness of OSM varies from country to country, and undertaken studies of OSM coverage can lead to different findings according to the features under analysis (e.g. road network, point of interest, administrative boundaries) (Girres & Touya, 2010; Hochmair, Juhász, & Cvetojevic, 2018; Neis & Zipf, 2012). Nevertheless, the studies also point out that the major cities coverage is vastly better than the rural areas. The latest findings suggest that today, upwards of 83% completeness of the real-life data is present within a 95% confidence interval. (Barrington-Leigh & Millard-Ball, 2017).

Another type of accuracy of interest is positional accuracy. As a user-generated data platform, OSM relies on GPS devices and skilled contributors to carry this weight, which in turn might produce data with accuracy errors in areas wherein high buildings reside or where a device’s signal is prone to flutter.

Because of these reasons, when it comes to spatial features, if a PoIs categories’ quantity was either comparable or higher from other sources (Google Places, CML), the OSM PoIs records from the category in question were dismissed.

(29)

3.4.5 Census Data

The Census Data from 2011 is a realization from the National Statistics Institute from Portugal, on the day of 21 of March 2011. The study counts 10 562 178 individuals and gives information of aggregate statistics for both individuals and buildings within the proposed census block level of analysis (National Office for Statistics, 2011). The levels of the analysis bears the 4th_{level, as shown in Table 3.2}

Level Scale Count

1 Municipality 1

2 Parish 53

3 Section 1054

4 Subsection 3623

Table 3.2 - Census Data – Sections Divisions

Level 4 was chosen for the purposes of this study, as it represents the smallest aggregation of census block statistics. It can be noted that the parishes count differs from the one shown in Figure 3.1, as new parish divisions of Lisbon were presented in 2012, a year after the last Portuguese census at the time of this writing.

3.4.6 Ancillary Data

Ancillary data from Open Data sources further includes these selected variables, as shown in Table 3.3

Data Source Variable Type

CML

Parks and Gardens Polygons

Public Transport (Bus, Train, Metro, Lifts)

Points Parks and Gardens

Hotels

Health Centres, Pharmacies Sport Facilities

Education (1º, 2 º, 3 º, Vocational) Fitness Centres

Religious and Cultural Heritage Art Galleries Statues National Monuments Cemeteries Theatres Museums Commercial Centres CCDR PM2.5 Particles

Open Data Lisbon DTM - Altitude Raster

Table 3.3 - Ancillary Data – Sources and Variables

All of the collected data from Zomato, Google Places, OSM PoIs, Road Network and its configuration file, as well as Open Data shapefiles with an accompanying timestamp

(30)

3.4.7 Other Variables Importance – Access Limitations

Some studies show a very clear negative coefficient index for apartment prices wherein crime rates were higher (Ceccato & Wilhelmsson, 2016). Others present findings that noise negatively influencing a buyers’ decision to purchase a property. There is also some evidence of urban traffic noise having adverse economic consequences (ERF, 2004), and this comes from noise pollution proxy which is described as Noise Depreciation Sensitivity Index. In this way, hedonic estimates with continuous noise data were analyzed on the city of Zurich and its flight regime to estimate the effect of aircraft noise on rental rates. Following the study, it is mentioned that under the Swiss protection law, landlords are able to receive compensation for the decrease in price as a consequence of noise. The paper follows to present findings that the apartments that experienced an increase of more than three decibels of noise within 3 years, showed a decrease of 0.5% within its rental price for each decibel. However, semi-parametric analysis revealed that the relationship between aircraft noise and rental rates did not satisfy functional forms throughout the entire noise distribution, and consequently, the noise was seen as redundant unless the noise levels were at least medium (Boes & Nüesch, 2011). Unfortunately, the data mentioned in this section might have a possible significance, but was not available at the desired levels of the case study.

3.5 Methodology

This section describes data cleaning and integration when multiple sources contain spatial features providing the same category of PoIs. Preprocessing steps needed to be undertaken for the creation of a uniform dataset. The Methodology can be seen in Figure 3.4

(31)

3.5.1 Variables Conversion

Variable conversion is the process wherein the variables are converted into meaningful data to be fed to the algorithms. For this, all non-numerical variables need to undergo conversion. According to the types shown in Table 3.3:

• All numerical variables take the same form in the unified dataset

• Each mandatory binary variable takes the form of 1 if the room contains the amenity or 0 the lack thereof

• All non-mandatory binary variables that contain a large amount of N/A’s mean this variable contains a third value of N/A which cannot be quantified as a 0, since it signifies a lack of information, not a lack of existence. In these cases, instead of putting arbitrary third value 2 (or any number that bears no significance), creation of three dummy variables (e.g. outdoor_areaF, outdoor_areaT, outdoor_areaNA) is used. This is one of many ways to impute missing data values, although there is proof that this type of imputation could lead to bias of coefficients in regression analysis (Allison, 2002). Because of this, when variables are noted to have many N/A values (>30%) the variable is deleted. This is where both the size of the room and size of the apartment are removed since above 84% of the data entries are missing

• Categorical variables follow the creation of N amount of dummy variables equal to the count of their unique categories

•_{Text mining is performed on the short description text, using unigram n-gram} bagging. This, in turn produces a specific word binary variable (e.g. a variable named “cosy” would contain 1 in each data record that contains this word in the text description, or 0 otherwise. Connecting words (in Portuguese and English), or description that would be true for all records are not taken into consideration (e.g. “Lisbon”, “and”, “de”)

Figure 3.5 – Top 20 Word Frequency in Room Descriptions

(32)

dummy variables were, in turn, not included. The most frequent words are shown in Figure 3.5 above.

3.5.2 Spatial Data Enrichment

Data from many of the sources overlaps. Therefore, decisions for unification of these data into one source were undertaken.

The data from Zomato holds the largest quantity of restaurants, and contains information pertaining to reviews and the costliness of the restaurants. Hence, it was decided to use it as the only source for food-related PoIs. Other PoI categories from a single source (CML) were merged depending on whether the differentiation between them failed to appear as a response in similar studies of this type (e.g. private and public schools of first, second degree and pre-cycle are merged). Medical practitioners were subsequently merged with independent doctors. Nevertheless, hospitals and medical centres are left as a separate category.

If two used sources are found to hold a substantial amount of different unique data,

ArcGIS Pro Analysis – “Generate Near Features” function is used for detection of all

features between the two layers that are within 2 meters one other. These were presumed to represent the same data point and were deleted from the second source before merging.

The categories of amenities chosen are listed in Figure 3.6 and Figure 3.7 below.

Figure 3.6 - Counts of Features per Category A

(33)

Fifty-five new variables thus exist; 54 of them are shown in the Figures above, and the last one denotes elevation extracted from a DTM (Lisboa - Digital Terrain Model, 2010).

The spatial data enrichment then makes use of the function pgr_drivingDistance() for detection of reachable amenities from a record’s listing.

pgr_drivingDistance('SELECT id, source_osm, target_osm, cost_s, FROM ways', starting_vertices, distance, directed := false)

To be able to execute this command, a listing needs its node origin point, and a destination point. In the case of the rooms, computations generate the closest node to each room as well as the closest nodes to each of the destinations. Figure 3.8 below shows sample tables with generated closest nodes’ field using the ways table.

Figure 3.8 - Origin (Room), Destination (PoI) and Edge Network DB Schema

For less costly computation, an additional intermediary table catchment_area is created, which describes all potentially reachable nodes from all origin nodes of the rooms (given a desired meters upper threshold).

WITH nodes AS (

SELECT array_agg(node_id) AS nodes from apt_room.node_id)

SELECT from_v as start_node, node as end_node, agg_cost as cost from nodes, pgr_drivingdistance(

'SELECT gid as id, source as source, target as target, cost_s as cost, directed FROM public.ways'::text, nodes, 2000, false)

This code, in turn, produced a table that holds all the reachable nodes within 2km. A visualization of the reachable area from a sample point is shown in Figure 3.9 below.

(34)

Figure 3.9 - Road Network Accessibility from a Sample Room

The proxies for each category of PoI then constitute of:

• Minimum distance (in meters) from the room to the closest amenity of a predefined type

• Count per category of PoIs within a distance threshold. Five distinct thresholds are proposed as follows: 2km, 1.5km, 1km, 0.5km, and 0.25km.

Two variables that were specific in their segregation process were the ‘Parks’, many of which have multiple closest nodes as being 0 meters away from its entrance, and the ‘Restaurants’ that have a very high quantity and density throughout the whole study area. Therefore, a separate table for parks was constructed that holds all nodes less than 10 meters away from the park of interest, as possible destination nodes from an origin point. Unique proxies are constructed for the restaurants according to their expensiveness (1-4 stars), and additionally, according to their average review rate (starting from 0 as no-reviews, 1 as bad, up to 5 as excellent rate).

Figure 3.10 below, presents an example with an interpolated surface of the minimum distance needed to walk from a sample room to the first registered statue in the dataset.

(35)

Figure 3.10 - Example: Minimum Distance to a Mall (in meters)

3.5.3 Data Preprocessing

The goal of this section is to provide an overview of the descriptive data statistics, as well as some preprocessing decisions of the integrated dataset and the reasoning behind them.

Outlier detection – A recommended approach when dealing with datasets is to remove

outliers. These can represent erroneous data and in this specific case, false entries can be part of overpriced offers if a room’s vacancy is far in the future, or additional negotiations might be present between customer-client, which cannot be accounted for. The statistics for the endogenous variable are shown in Table 3.4 below:

Statistic Value Mean 473 Standard Error 3.72 Median 420 Mode 400 Standard Deviation 173 Range 900 Minimum 100 Maximum 1000 Count 2149

Table 3.4 – Descriptive Statistics of Dependent Variable

The interquartile range finds all values from 870 above as outliers and no outliers below the minimum value, which amounts to a total of 105 outliers outside of the 1.5

(36)

and scenario depicts a real-life Lisbon rent market at a specific temporal instance in September 2019. Hence, it is decided against the removal of these outliers.

The Cluster and Outlier Analysis was run in order to identify clusters and spatial

outliers. False Discovery Rate Correction (FDR) parameter is checked, to reduce the critical p-values to account for multiple testing and spatial dependence. It is noticeable that there are many clusters adjoining high-priced rooms with low-priced rooms nearby, as well as low-priced rooms with many high-priced rooms nearby. The high-low price clusters tend to exhibit a pattern from central to the northern borders of the study area, whereas the low pricings with many high-priced rooms tend to have a generalized localization near the coast.

Cluster Type Count

High-High 185 High-Low 64 Low-High 86 Low-Low 232 Non-Significant 542 Total 1109

Table 3.5 – Counts of Significant Hot-Spot Analysis Clusters

Figure 3.11 - Anselin Local Moran’ I

When it comes to numbers of rooms available per Parish, most of the available rooms are found in parishes where the significant clusters dominate. As such, Avenidas Novas, Areeiro, Alvalade, Santa Maria Maior and Misericórdia hold the largest quantities. Additionally, a high count of bedrooms predominates the central parishes of Lisbon, as

(37)

the bedrooms tend to decrease rapidly towards the north borders and decrease marginally towards the southwest and southeast borders.

Following this approach, a few binary variables were put under inquiry and results of a Hot Spot Analysis are subsequently shown in Table 3.6.

Variable Parish – Hotspot Parish – Coldspot

Male Landlord Lumiar, Santa Clara, Benfica, Alvalade, Beato, Penha de França, Belém

Santo António, Arroios

Female Landlord Carnide, Alvalade, Lumiar, Santa

Clara, Parque das Nações São Domingos de Benfica, Areeiro

No Landlord Avenidas Novas, Campolide,

Arroios, Santo António Alvalade, Lumiar, Benfica

Has Outdoor Area Ajuda, Marvila, Olivais, Estrela,

Campo de Ourique Avenidas Novas

No Outdoor Area Alcántara, Santa Maria Maior, Marvila, Misericórdia, São Vicente, Areeiro

Lumiar, Alvalade, Olivais, São

Domingos de Benfica

Outdoor Area N/A Campolide, Carnide, Lumiar,

Alvalade, Avenidas Novas Olivais, Marvila, Belém

Table 3.6 – Hot-Spots and Cold-Spots of Parishes

These hotspots can indirectly invoke knowledge about the pattern – and it can be seen that most rooms that are rented in an apartment with a host appear in all parishes but the central ones. This signifies the area in question is likely to be comprised of student-dominated apartments and it includes the parishes of Alvalade, Areeiro, Avenidas Novas and Santo Antônio. ‘Male Landlord’ hot-spots are found both in the Parishes in the north and on the river coast, but ‘Female Landlords’ do not appear as hotspots on any coastal parish. An existing ‘Outdoor Area’ hot-spots appear frequently in Ajuda and Alcántara, but hot-spots of apartment without it appear frequently in all parishes alongside the coast (in which Ajuda and Alcántara also belong). Rooms belonging in apartment with ‘Non-specified Outdoor Area’ also show a few hotspots in the most central parishes.

Identification of Correlated Predictors and Linear Dependencies - The dataset

contains many variables, and a pairwise check is done to remove variables that might bias the regression. All predictors that are correlated with above 85% are thus removed.

Data Normalization – The dataset largely contains data in very different metrics

(meters, quantities and binary variables) – the min-max range is applied so there exists a common scale that would not lead to any distorting differences in the range of values.

Data Aggregation – As the spatial application of hedonic price modelling through the

‘spatialreg’ R package required polygons neighbours for calculating the lagged

influence of the exogenous variables, different dataset is created for applying HPM. A spatial join is performed using the layer of the Census Blocks at the 4th_{level with the}

layer of rooms and its features. For each polygon, the averages statistics of the listings belonging in the mapping unit is calculated. The resulting shapefile then contains 987 polygons that denote only census blocks for which there are listings present.