Predspot: predicting crime hotspots with machine learning

(1)

Systems and Computing - PPgSC/UFRN

Predspot: Predicting Crime Hotspots

with Machine Learning

Adelson Araújo jr

Natal-RN September 2019

(2)

Predspot: Predicting Crime Hotspots

with Machine Learning

Master dissertation presented to the Pro-gram of Postgraduate Studies in Systems and Computing (PPgSC) of the Federal University of Rio Grande do Norte as a requirement to the M.Sc. degree.

Supervisor

Prof. Dr. Nélio Alessandro Azevedo Cacho

Federal University of Rio Grande do Norte – UFRN

Natal-RN

(3)

Araújo jr, Adelson.

Predspot: predicting crime hotspots with machine learning / Adelson Dias de Araújo Júnior. - 2019.

80f.: il.

Dissertação (Mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Sistemas e Computação. Natal, 2019.

Orientador: Nélio Alessandro Azevedo Cacho. Coorientador: Leonardo Bezerra.

1. Computação Dissertação. 2. Policiamento preditivo -Dissertação. 3. Manchas criminais - -Dissertação. 4. Aprendizado de máquina - Dissertação. I. Cacho, Nélio Alessandro Azevedo. II. Bezerra, Leonardo. III. Título.

RN/UF/CCET CDU 004

Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET

(4)

(5)

I thank God.

I thank the support of my extraordinary and close family, Flávia Bezerril, Adelson Araújo, Fernando Bezerril, Graça Bezerril, Paula Sette, Fernando Bezerril Neto, Matheus Coutinho, Fernanda Bezerril, Marina Bezerril, Maria Fernanda Bezerril, Mariana Bezerril and Alexandre Serafim.

I thank my closest academic educators, Nélio Cacho, Leonardo Bezerra, Renzo Tor-recuso and many others. Also those that financially contributed with my research, the SmartMetropolis project and Google Latin America.

I thank my closest friends Giovani Tasso, Pedro Araújo, José Lucas Ribeiro, Mickael Figueredo, João Marcos do Valle, Beatriz Vieira, Júlio Freire, Ottony Chamberlaine and Lúcio Soares.

(6)

there is a sort of foreknowledge of the coming series of events.

(7)

with Machine Learning

Author: Adelson Dias de Araújo Júnior Supervisor: Prof. Dr. Nélio Alessandro Azevedo Cacho

Abstract

Smart cities are increasingly adopting data infrastructure and analysis to improve the decision-making process for public safety issues. Although traditional hotspot policing methods have shown benefits in reducing crime, previous studies suggest that the adop-tion of predictive techniques can produce more accurate estimates for future crime con-centration. In previous work, we proposed a framework to generate near-future hotspots using spatiotemporal features. In this work, we redesign the framework to support (i) the widely used crime mapping method kernel density estimation (KDE); (ii) geographic fea-ture extraction with data from OpenStreetMap; (iii) feafea-ture selection, and; (iv) gradient boosting regression. Furthermore, we have provided an open-source implementation of the framework to support efficient hotspot prediction for police departments that can-not afford proprietary solutions. To evaluate the framework, we consider data from two cities, namely Natal (Brazil) and Boston (US), comprising twelve crime scenarios. We take as baseline the common police prediction methodology also employed in Natal. Results indicate that our predictive approach estimates hotspots 1.6-3.1 times better than the baseline, depending on the crime mapping method and machine learning algorithm used. From a feature importance analysis, we found that features from trend and seasonality were the most essential components to achieve better predictions.

(8)

com Aprendizado de Máquina

Autor: Adelson Dias de Araújo Júnior Orientador: Prof. Dr. Nélio Alessandro Azevedo Cacho

Resumo

As cidades inteligentes estão adotando cada vez mais infraestrutura e análise de da-dos para melhorar o processo de tomada de decisões em questões de segurança pública. Embora os métodos tradicionais de policiamento de hotspot tenham se mostrado eficazes na redução do crime, estudos anteriores sugerem que a adoção de técnicas preditivas pode produzir estimativas mais precisas para a concentração espacial de crimes de um futuro próximo. Em nossas pesquisas anteriores, propusemos uma metodologia para gerar hotspots do futuro usando variáveis espaço-temporais. Neste trabalho, redesenhamos a es-trutura do framework para suportar (i) o método de mapeamento de crimes amplamente utilizado - estimativa de densidade de kernel (KDE); (ii) extração de características ge-ográficas com dados do OpenStreetMap; (iii) seleção de atributos e; (iv) regressão com o algoritmo Gradient Boosting. Além disso, fornecemos uma implementação de código aberto da estrutura para suportar a predição eficiente de hotspots. Para avaliar nossa abordagem, consideramos dados de duas cidades, Natal (Brasil) e Boston (EUA), com-preendendo doze divisões de tipo de crime. Tomamos como método base de comparação uma metodologia comumente utilizada e também empregada em Natal. Os resultados indicam que nossa abordagem preditiva estima hotspots em média 1,6 a 3,1 vezes mel-hor que a abordagem tradicional, dependendo do método de mapeamento do crime e do algoritmo de aprendizado de máquina usado. A partir de uma análise de importância de atributos, descobrimos que tendência e sazonalidade eram os componentes mais essenciais para obter melhores previsões.

Palavras-chave: policiamento preditivo, previsão de hotspot, aprendizado de máquina, previsão de crimes.

(9)

1 An illustration of KGrid. The events are clustered by using the K-Means algorithm. Each cluster has external edges, which form a convex polygon. These polygons are the topological separation of the city into subregions or cells of a grid. By aggregating the count values in each cell, one can

map hotspots. . . p. 22

2 An illustration of Kernel Density Estimation and its parameters. Each bold point (grid cell) represents an arbitrary place in which a kernel function applies a density estimation around a bandwidth. For a set of events/points, this procedure returns an array of KDE values indexed by

the cells identifier. . . p. 23

3 KDE results for different bandwidth values. The clear difference in res-olution is observed between 0.1 and 0.01 miles (bottom). We observe

underfitting in the top left and underfitting in the bottom right situations. p. 24

4 Seasonal and trend components of a time series. In blue, the original series show a varying behaviour which can be further explained by a trend (in black) and a seasonality (in orange). The trend follows the moving average of the series and the seasonality represents the cyclical

aspect of the original series in monthly oscillation. . . p. 25

5 Results of the survey with 54 police sergeants about the rating of different landmarks and demographic aspects for determining hotspots. At the top, the rating for property (CVP), in the middle for lethal or violent

(CVLI) and for drug-related crimes (TRED) at the bottom. . . p. 27

6 Geographic feature layers from Natal generated using KDE of residential streets (left) and schools (right). Note that residential streets are denser and concentrated in the north of the city, but still widespread in other places. Schools are more concentrated in the center of the city, following

(10)

data sources include a crime database for model training and connection to the OpenStreetMap API to load PoI data. Also, the city’s shapefile is

important for filtering data entered within its borders. . . p. 35

8 Between data loading and model building, the feature ingest process is responsible for assembling the independent variables and the variable to be predicted. This starts with the spatiotemporal aggregations of crimes, through crime mapping method and time series manipulation, and PoI data spatial aggregation. The grid is a supporting element in this step and will be used as the places where criminal incidence will be predicted. Time series Cij extracted from grid cells indicate a set of values from a

grid cell i in period j, and PoI features Gk

i represent a the density of a k

PoI category located in a grid cell i. . . p. 37

9 The extraction of the independent variables (features) of the time series in Predspot is conducted through a trend and seasonality decomposition (through STL) and series differentiation. Each decomposition will gener-ate a new series, and the variables will be composed of k lags from each

of these series. . . p. 38

10 An artificial example of temporal features for two places and three time

intervals. . . p. 38

11 In the machine learning modeling step, the best qualified features are selected and feed into the algorithms. By adjusting various algorithms, we can evaluate the models to use the one that has the best predictive

performance. . . p. 39

12 In the prediction pipeline, one must tailor the model selection steps to return predictions using previously trained models. This means no longer loading the entire dataset, just the crimes from the last period. Then, extracting temporal features, filtering previously selected features, and

requesting the trained model for predictions. . . p. 41

13 The web service comprises managing the prediction pipeline to attend online requests. This requires the use of a file system, which we call Volume, for caching the prediction results and an ETL process controller

(11)

geneous crime scenarios compared to Boston. . . p. 51

15 Monthly sampled time series of the twelve crime scenarios. . . p. 52

16 Time series decomposition for Residential-Burglary (in Boston) daytime and nighttime scenarios. The second row is composed of trend series, which are a smoothed version of the original. Third row are the seasonal patterns, clearly distinct from day and night. Fourth row represent the

differentiated component. . . p. 53

17 Spatial representations of the two crime mapping methods for Residential-Burglary (in Boston) daytime and nighttime crime scenarios. In the first

row, KGrid aggregation, and in the second row KDE. . . p. 53

18 PoI data from Natal (left) and Boston (right) extracted from

Open-StreetMap. . . p. 54

19 A representation of the geographic features taken from three PoI cate-gories, hospitals (first row), residential streets (second row) and touristic

places (third row) of Natal (left) and Boston (right). . . p. 55

20 Cross-validation MSE results for KGrid-based models for Natal (in red)

and Boston (in blue) crime scenarios. . . p. 59

21 Cross-validation MSE results for KDE-based models for Natal (in red)

and Boston (in blue) crime scenarios. . . p. 60

22 P RRM SE results of five trials evaluating trained models for each crime

scenario. In the left side, the two crime mapping methods are compared and in the right, machine learning algorithms. Note that KDE outper-forms KGrid in all crime scenarios, but more sharply in crime scenarios that have fewer data points, such as CVLI. On the other side, one can

note that GB models have higher percentiles, but with much more variance. p. 61

23 Feature importance of KGrid-RF models. . . p. 66

24 Feature importance of KGrid-GB models. . . p. 67

25 Feature importance of KDE-RF models. . . p. 68

(12)

1 KDE parameters and machine learning hyperparameters considered in

the grid search tuning. . . p. 56

2 Selected KDE parameters for the crime mapping methods applied in each

crime scenario. . . p. 57

3 Selected KDE parameters for geographic feature extraction applied for

PoI data aggregation. . . p. 58

4 Average and standard deviation of P RRM SEfor each predictive approach,

considering five trials of the twelve crime scenarios. . . p. 62

5 A pairwise statistical comparison of the four predictive approaches,

con-sidering the results from post-hoc analysis. . . p. 63

6 Wall time spent on model selection phase for each dataset. . . p. 63

7 Selected hyperparameters of the machine learning models for the Natal

crime scenarios. . . p. 79

8 Selected hyperparameters of the machine learning models for the Boston

(13)

1 Introduction p. 13

1.1 Problem Statement . . . p. 14

1.2 Objectives . . . p. 15

1.3 Contributions . . . p. 16

2 Background p. 18

2.1 Predictive Hotspot Policing . . . p. 18

2.2 Spatiotemporal Modelling . . . p. 21

2.2.1 Crime Mapping Methods . . . p. 21

KGrid . . . p. 21

Kernel Density Estimation (KDE) . . . p. 22

2.2.2 Time Series Decompositions . . . p. 23

2.2.3 Geographic Features . . . p. 26

2.3 Machine Learning . . . p. 28

2.3.1 Prediction Task . . . p. 30

2.3.2 Feature Selection . . . p. 31

3 The Predspot Framework p. 33

3.1 Model Selection . . . p. 34

3.1.1 Dataset Preparation . . . p. 34

3.1.2 Feature Ingest . . . p. 36

(14)

3.2.1 Prediction Pipeline . . . p. 41 3.2.2 Web Service . . . p. 42 3.3 Implementation . . . p. 43 3.3.1 Predspot python-package . . . p. 43 3.3.2 Predspot service . . . p. 44 4 Evaluation p. 46 4.1 Datasets . . . p. 47

4.2 Experiment Methods and Metrics . . . p. 47

4.3 Results . . . p. 50

4.3.1 Exploratory Data Analysis . . . p. 50

4.3.2 Parameter Selection . . . p. 56

4.3.3 Model Performance . . . p. 58

4.3.4 Feature Importance Analysis . . . p. 64

5 Concluding Remarks p. 70

5.1 Future work . . . p. 72

References p. 74

(15)

1 Introduction

Cities are the habitat of the increasing majority of human people, and the management of their resources has become a complex task. With the growing impact of Information and Communication Technologies (ICT), coupled with the fact that information has become as valuable as energy (BATTY, 2013), city governance is undergoing a technological paradigm shift to the so-called smart cities. This demand, justified by the current belief that ICT can be a great facilitator of city management (ANGELIDOU, 2014; LEE et al., 2008), expresses the need for the constitution of more sustainable city models. We agree with Caragliu, Bo and Nijkamp (2011) who define a smart city as one where investments in social and human capital, as well as in traditional and modern infrastructure (ICT), lead to sustainable economic growth, high quality of life and the management of city’s resources through participatory governance.

However, improvements in quality of life are unlikely to be effective if they disregard crime incidence levels. According to a recent report of BBC (2018), among the 50 most dangerous cities in the world (accounting by homicides per capita), only eight are not in Latin America. Brazilian cities stand out in this ranking, with 17 cities. The city that hosts this study, Natal, Brazil, holds the fourth position with slightly more than 100 homicides per 100,000 inhabitants, which is a rate fifteen times higher than the global average rate measured by the United Nations Office on Crime and Drugs (UNODC, 2013). Certainly, these regions claim for changes in the way police resources are allocated.

Using Geographical Information Systems (GIS) to support the patrol vehicle dispatch has become routine in many law enforcement agencies in the world, in a so-called hotspot policing manner (SHERMAN; GARTIN; BUERGER, 1989). Based on the empirical evidence that many crimes concentrate in few places (WEISBURD, 2015), an increasing body of research has explored the effects of sending patrol units to spots of high crime incidence. Braga, Papachristos and Hureau (2014) have found strong evidence that it does help reduce crimes and Weisburd et al. (2006) suggested other benefits aside crime reduction. Traditional hotspot policing literature focused on the historical aggregation of crime data

(16)

until the first prediction efforts appeared in America in the late 1990s (GORR; HARRIES, 2003).

Nowadays, smart cities are gradually adopting predictive data analysis to enhance decision-making in public safety. Related works have reported many examples of such quantitative techniques to support patrol planning as "predictive policing" applications (PERRY, 2013; MOSES; CHAN, 2018). Machine learning is one of the techniques that has gained momentum in such context due to the accuracy of its estimation and the flexibility to explore patterns from a range of data, such as geographic (LIN; YEN; YU, 2018), de-mographic (BOGOMOLOV et al., 2014) and social media (GERBER, 2014). However, crime mapping methods (ECK et al., 2005) and spatiotemporal modeling may be crucial to make efficient predictive models (ECK et al., 2005).

1.1 Problem Statement

Previous studies have shown that statistical models can make estimates that exceed in terms of accuracy the human capacity to predict crime incidence over space and time (MOHLER et al., 2015). Nevertheless, Moses and Chan (2018) suggest a considerable body of research lacks accountability and transparency, and that these models should not be implemented without an adequate description of the processing steps involved. Indeed, we have found no research that at the same time (i) evaluates its efficiency against a traditional hotspot policing approach implemented by the police and (ii) provides a clear breakdown of the processing steps involved to implement such a predictive system.

A considerable body of research (LIN; YEN; YU, 2018; MALIK et al., 2014) developed steps, guidelines and frameworks to implement a crime hotspot prediction model, with a variety of standards. In a previous work (ARAUJO et al., 2018), we designed a framework to model machine learning with time series autoregressive features based on a crime mapping method named KGrid (BORGES et al., 2017;ZIEHR, 2017). Our ambitious purpose was to create a standard of the processing steps involved in modelling machine learning models to predict hotspots, but this first effort lacked in some points. First, despite having tested several machine learning algorithms, there was still room for improvement in the results found, perhaps because we considered a separate model for each place. Second, we did not consider the kernel density estimation (KDE), extensively recommended by related works (ECK et al., 2005; CHAINEY, 2013; HART; ZANDBERGEN, 2014), in our experiments.

(17)

(2018) the lack of open-source tools to leverage efficient crime hotspot prediction. Large police departments may not feel this much as they have the budget to purchase commercial solutions that meet their needs. However, small police departments, which often have more worrying demands for violence, may not be able to provide more efficient tools. If they want to build a prediction system, it can cost even more than buying one and they can take much time to build. We argue that an open-source programming interface can ease the implementation of web service to be deployed in a low budget police department.

1.2 Objectives

The purpose of this work is to improve our previously proposed prediction framework through alternative crime mapping and feature engineering approaches, and provide an open-source implementation that police analysts can use to deploy more effective predictive policing.

Our first specific objective is to improve the efficacy of our previously proposed frame-work. To do so, we consider alternative crime mapping and prediction algorithms, respec-tively kernel density estimation (KDE) and gradient boosting regression (GB). We eval-uate our expanded framework on two datasets, from Natal (Brazil) and Boston (US). We compare our results with the traditional approach used by criminal analysts to generate hotspots. Natal’s police department uses a data aggregation in the time window of the pre-vious month to build their patrol plans, and we noted that this is a common baseline prac-tice in related works (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007). By improving our framework, we also intend to describe better the process of producing a crime hotspot predictive pipeline, from the model training to its online deployment.

We also investigate challenges concerning the tools available and the methodology involved. The procedures of translating criminal events into attributes that spatially de-scribe the concentration of crime and adjusting a predictive algorithm may involve many different tools. Often, proprietary solutions as PredPol and HunchLab are robust, but they mainly serve cities with higher purchasing power. Also, their tools are focused on being easy to use and sometimes they can be rigid to configure particular procedures, such as crime mapping and prediction algorithms. The democratization of such technologies emerges as a real necessity since many of the poorer and more dangerous cities cannot afford such solutions. Therefore, our second objective is to design an open-source software and details its procedures to estimate future hotspots.

(18)

1.3 Contributions

In this study, we present a set of relevant contributions to predictive policing litera-ture. First, we present qualitative improvements towards a transparent process necessary to model machine learning for future hotspot crime estimation. We connect a broad set of techniques and organize them into purpose-related processing steps for adjusting effi-cient models and deploy them with a web service. As a second contribution, we provide an open-source python-package implementation of our Predspot framework, to ensure re-producibility of our approach and code reuse. The structure of the Predspot framework enables the use of different crime mapping approaches and machine learning algorithms. Thus, we argue that future work may use our standard framework to improve model performance even more.

A third contribution is that we show empirical evidence that the predictive we modeled estimate better than the traditional baseline implemented in Natal police. Depending on the crime mapping method and machine learning algorithm chosen, predictive approaches can average 1.6 to 3.1 times better than baseline. Fourth, we find that KDE is a robust technique to predict crime incidence when fewer samples are available. In the smallest samples we analyzed (lethal crimes in Natal), the predictive approaches with KDE were much better than baseline compared to the cases we had more samples (property crimes in Natal). Conversely, KGrid approaches have estimated better when more samples are available.

Finally, a fifth contribution is related to the importance of analyzing the spatiotem-poral features used. We modelled features based on trend and seasonality components of time series and also geographic features from OpenStreetMap data. In our feature impor-tance analysis, we find that trend and seasonality lags were the features that contributed most in the adjusted models. From our knowledge, no previous work has investigated such temporal patterns alongside machine learning algorithms for crime hotspot modelling.

(19)

This study is divided as follows. In Chapter 2, we present the necessary theoretical framework to follow the proposals made in this work, from the hotspot policing discus-sions and critique, through the spatiotemporal modeling methods to finally the machine learning background necessary for this work. In Chapter 3, we introduce our framework presenting the adaptations we made to the previous version, dividing the purpose-related steps involved and present the open-source implementation we made. In Chapter 4, we show the datasets used in our experiments, explaining the evaluation process conducted and present the results. In Chapter 5, we conclude our study with a resumption of our objectives, methods and experimental results, and provide further recommendations for future work.

(20)

2 Background

A broad and complex spectrum of concepts underlies a predictive policing solution. In this chapter, we review some theoretical aspects, starting from traditional hotspot policing literature and the application of predictive algorithms to estimate crime hotspots. Specifically, Weisburd’s law of crime concentration in places is used as a premise for the implementation of spatially focused policing strategies, and we explore related discussions in criminology and hotspot prediction to implement predictive policing.

Proper modeling of crime variables can impact prediction model success and efforts to translate crime events into independent variables (features) are necessary. This spa-tiotemporal modeling process, also called feature engineering, starts with a crime mapping method and can be assisted temporally with time series decomposition. In crime hotspot prediction literature, it is also common to use ancillary variables to help describe crime spatial patterns, such as demographic (BOGOMOLOV et al., 2014), geographic (LIN; YEN; YU, 2018) and from social media data (GERBER, 2014). Particularly for geographic vari-ables, we present some strategies to use data from points-of-interest (PoI) in the city as an alternative to help in the predictions.

Last, an increasing body of research (VOMFELL; HÄRDLE; LESSMANN, 2018; PERRY, 2013;BOGOMOLOV et al., 2014) suggest that machine learning algorithms fit very well with predictive policing. Still, the many facets of machine learning, such as the algorithms and prediction tasks, make it necessary to discuss its application more thoroughly in the context of spatiotemporal crime analysis.

2.1 Predictive Hotspot Policing

Criminologists point to different strategies for reducing crime, disorder and fear ( WEIS-BURD; ECK, 2004). Among the methodologies addressed, there is strong evidence of the effectiveness in patrolling micro-regions of crime concentration (WEISBURD; ECK, 2004;

(21)

PERRY, 2013;BRAGA, 2001). Perhaps this can be justified by the fact that policing strate-gies have considered Weisburd’s law of crime concentration in places (WEISBURD, 2015), which states that "for a defined measure of crime at a specific microgeographic unit, the concentration of crime will fall within a narrow bandwidth of percentages for a defined cumulative proportion of crime". For example, experiments suggest that criminal occur-rences are concentrated in around 10% of places in the cities (ANDRESEN; LINNING, 2012; ANDRESEN; WEISBURD, 2018), with variations for each crime type and study region. A 2006 US national survey (KOCHEL, 2011) reported that 90% of large policing departments have considered this crime concentration pattern to draw the so-called hotspot policing operations (SHERMAN; GARTIN; BUERGER, 1989). Researchers are continually reviewing experiments on hotspot policing efficiency (BRAGA; PAPACHRISTOS; HUREAU, 2014), sug-gesting that 20 out of 25 of the experiments have observed benefits on reducing crimes, social disorder and an overall improvement in the perception of community safety.

Hotspot policing has received substantial interest and criticism, such as the claim that crime displacement is a straight consequence of the former (REPPETTO, 1976). Con-versely, according to Weisburd et al. (2006), the inevitable crime displacement idea has been questioned because displacement is rarely total, and most of the times irrelevant. Moreover, the author argued that focused patrolling leads to the diffusion of "other bene-fits not related to crime" and also to the criminal’s discouragement. Another criticism on hotspot policing is that most hotspots are related to poverty and race issues, hence increas-ing inequality and even creatincreas-ing an environment of lowered police legitimacy (KOCHEL, 2011). In addition, Rosenbaum (2006) suggested that most police activity in hotspots is enforcement-oriented and that aggressive strategies can increase negative contact with citizens, mostly where perceptions of crime tend to be worse (GAU; BRUNSON, 2010). Al-though reporting some short-term adverse effects, and pointing guidelines on minimizing the latter, Kochel and Weisburd (2017) presented experimental results showing that there is no long-term harm to communities’ public opinion when supported by continuous polic-ing. However, the impossibility of storing every crime in databases is still a problem to be solved, and victims of such data gathering limitation will often continue to be ignored by law enforcement (MOSES; CHAN, 2018).

In contrast to the criticism discussed above, police scholars have argued that hotspot policing is a model for police innovation (WEISBURD; BRAGA, 2006). Further, in the pursuit of innovation in hotspot policing, predictive algorithms have been used to support preciser estimators (MOHLER et al., 2015). According to Gorr and Harries (2003), the role of crime forecasting had stopped being considered infeasible at the beginning of the 2000s, after

(22)

a major success of crime mapping systems, when the US National Institute of Justice (NIJ) awarded five grants for studies to extended accuracy of short-term forecasts. The aim was to estimate precisely spatial crime concentration, as the first step to practical intervention. After a decade, the term predictive policing was coined and become a trend, reflecting the role of "the application of analytical techniques to identify promising targets for police intervention and prevent or solve crimes", according to Perry (2013). Stated in another perspective, Ratcliffe (2015) suggests that predictive policing involves "the use of historical data to create a spatiotemporal forecast of areas of criminality or crime hot spots that will be the basis for police resource allocation decisions with the expectation that having officers at the proposed places and time will deter or detect criminal activity". Still, predictive policing may require practical policing planning, in contrast to the role of spatial crime forecast per se, as discussed by Gorr and Harries (2003).

Indeed, few studies have assessed the effect of predictive hotspots against traditional

crime mapping with GIS to reduce crime incidence (HUNT; SAUNDERS; HOLLYWOOD,

2014), and Moses and Chan (2018) suggested that there are two ways of evaluating a predictive policing solution. The first is by reporting the "drops in particular categories of crime in particular jurisdictions employing its software" and the second by measuring the "predictive accuracy of particular tools". One may argue that the former is more prone to the predictive policing definition and the second to spatial crime forecasting analysis. Some studies have evaluated both, e.g. Hunt, Saunders and Hollywood (2014) have shown a null effect on applying predictive hotspot policing, suggesting an important role for patrol program implementation failures and low statistical power of the tests conducted. On the other hand, an experiment in the Los Angeles Police Department (MOHLER et al., 2015) showed promising results on both incidence decrease in particular categories of crimes and predictive performance. In their study, models predicted 1.4 to 2.2 times better than trained crime analysts (accuracy evaluation), leading to 7.4% crime reduction, compared with 3.5% of the treatment effect (impact evaluation). Their treatment approach, or baseline, was to let criminal analysts manually indicate a risk value for the delimited regions to their knowledge.

In consultation with the Natal Police Department, we note that they use a strategy for estimating hotspots using historical data. They apply a spatial aggregation using data from the previous month to plan the next patrol. Such a methodology assumes that the immediate past may indicate a good measure of the future, and has already been used in related studies (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007).

(23)

2.2 Spatiotemporal Modelling

To deliver accurate predictions, a hotspot prediction analysis must be based on solid spatiotemporal modeling. Previous studies (ECK et al., 2005;CHAINEY; TOMPSON; UHLIG, 2008) have reported different methods of aggregating crime spatially, using crime mapping methods. Here, we review two of such methods, namely KGrid and KDE. We then discuss strategies for representing temporal patterns using time series decomposition methods and how the results of such transformation can be useful to map crimes in a rich set of spatiotemporal variables. To complement them, ancillary geographic variables can be derived from OpenStreetMap data of the city, and we discuss an alternative strategy to do so.

2.2.1 Crime Mapping Methods

The literature of criminal spatial analysis commonly refers to the procedures of di-viding the city into subregions and aggregating criminal events as crime mapping. This aggregation creates the relationship between a subregion and a crime incidence level. In a prediction task, this is the first step to generate the feature set, composed by crime incidence levels to each place in each time interval of available data.

Describing several techniques, Eck et al. (2005) discusses their qualitative pros and cons. Some of these methods divide the city from the distribution of criminal events (e.g. forming spatial ellipses) and others from geometries regularly spaced (e.g. rectangles or points). The forms of aggregation of crimes may be divided into (1) counting crimes within the area bounded by each subregion, and (2) calculating the weighted sum of the crimes by their distances to the centre of each subregion. In this work, we compare two crime mapping methods, KGrid and KDE, when translating crime events into features to fit machine learning models.

KGrid The techniques of dividing crime into spatial ellipses assume that crime groups spatially into geographic units over a time window. One of the ways to build these groups of places is by using clustering. Recently, Borges et al. (2017) proposed a division of the city based on the construction of convex polygons drawn from the spatial grouping of criminal events. This technique, called KGrid, consists in applying the K-Means algorithm in the crimes location attribute to define a grid. In Figure 1, we present an illustration of crime mapping in KGrid. The spatial aggregation of crimes is made by counting events

(24)

Event Grid cell

Figure 1: An illustration of KGrid. The events are clustered by using the K-Means algo-rithm. Each cluster has external edges, which form a convex polygon. These polygons are the topological separation of the city into subregions or cells of a grid. By aggregating the count values in each cell, one can map hotspots.

within the polygons, also called grid cells. The author also mentioned that this method has the advantage of considering the topology of criminal incidence to project the regions and that the resolution can be adjusted to meet different analytic demands. The parameter K controls this resolution and can cause effects on the performance of the algorithms, as verified by Araujo et al. (2018).

Kernel Density Estimation (KDE) Among crime mapping methods, there is

evi-dence that KDE is the most appropriate technique for hotspot mapping (CHAINEY; TOMP-SON; UHLIG, 2008). The practical reasons for that are the visual effect that simulates meaningful heatmaps, and the inherent spatial correlation considered on data aggrega-tion (HART; ZANDBERGEN, 2014). Illustrated by Figure 2, it consists of creating a grid of points regularly spaced, and applying a kernel function to return a density estimation for each point. Equation 2.1 formally defines it for bidimensional analysis, where h is the bandwidth parameter of the kernel function K applied, and dx,y(i) is the distance between

all the incident i and the centre of the grid point described by its coordinates x, y. To analyze it temporally, one can (i) add the third dimension to the formula, or (ii) generate a density estimation for each time window, forming a time series.

f(x, y_{| h) =} 1 nh2 n X i=1 K(dx,y(i) h ) (2.1)

(25)

Kernel function

(Gaussian)

Bandwidth Grid cell

Event

Figure 2: An illustration of Kernel Density Estimation and its parameters. Each bold point (grid cell) represents an arbitrary place in which a kernel function applies a density estimation around a bandwidth. For a set of events/points, this procedure returns an array of KDE values indexed by the cells identifier.

As discussed, KDE presents user-defined parameters to be configured. Among them are the kernel function, its bandwidth and the grid resolution. Previous studies have in-vestigated the effect of these parameters on crime hotspot mapping precision (CHAINEY, 2013;HART; ZANDBERGEN, 2014), suggesting that kernel and bandwidth are the most rel-evant factors to be analyzed. Figure 3 illustrates that selecting an appropriate bandwidth has severe implications for crime incidence representation when using KDE. We can see that when bandwidth is equal to 1 mile, it results in an underfitting situation and the bandwidth of 0.01 mile gives an overfitting distribution.

A simple way to select bandwidth is to use Silverman’s rule of thumb (SILVERMAN, 2018). However, it is only applicable to estimations using the Gaussian kernel. By ana-lyzing both bandwidth and kernel, Hart and Zandbergen (2014) showed that Gaussian kernels are not the optimal choices for mapping crime occurrences, suggesting linear ker-nels instead. To retrieve a multiple parameter combinations that maximizes likelihood estimation, Mohler et al. (2011) suggests running a grid search with cross-validation.

2.2.2 Time Series Decompositions

Temporal patterns have been explored by the crime prediction literature since re-searchers started to forecast crime one period ahead, and Gorr and Harries (2003) date it from the 1990s. Roughly speaking, when humans want to predict something, they look for

(26)

B = 1 mile B = 0.5 mile

B = 0.1 mile B = 0.01 mile

Figure 3: KDE results for different bandwidth values. The clear difference in resolution is observed between 0.1 and 0.01 miles (bottom). We observe underfitting in the top left and underfitting in the bottom right situations.

different items in the past and estimate what they think will happen next. The forecasting methods we discuss here were based on a similar intuition, using past observations (lags) of time series to estimate a value one or more steps ahead. In this sense, autoregressive (AR) models were extensively used, and Brown and Oxford (2001) suggested them as suitable methods for baseline comparison. In a more comprehensive formulation, autoregressive integrated moving average (ARIMA) models were proposed to deal with non-stationary series (BOX et al., 2015). Besides extracting p AR lags, ARIMA works with q moving av-erage components (smoothed versions of the original series) and d series differentiations (see Eq. 2.2) that are much more prone to be stationary. After decomposing the series into these components, Box et al. (2015) suggest to apply parameter estimation using non-linear methods.

Yd(t) = Y (t)− Y (t − 1) (2.2)

Seasonal-trend decomposition by loess (STL) (CLEVELAND et al., 1990) was also used in forecasting models for crime prediction (BORGES et al., 2017;MALIK et al., 2014). By rep-resenting series in an additive configuration of trend, seasonality and residuals (Equation 2.3), this decomposition method can reveal other temporal patterns. We depict seasonal

(27)

and trend components of an illustrative time series in Figure 4.

Y(v) = T (v) + S(v) + R(v) (2.3)

Jun

2014 Jul Aug Sep Oct Nov Dec

Time

100 120 140 160 180

Crimes

Original series

Seasonality

Trend

Figure 4: Seasonal and trend components of a time series. In blue, the original series show a varying behaviour which can be further explained by a trend (in black) and a seasonality (in orange). The trend follows the moving average of the series and the seasonality represents the cyclical aspect of the original series in monthly oscillation.

Further, temporal modelling that uses previous observations requires an adequate selection of the number of lags to be considered. In the case of ARIMA components, we did not find an optimal methodology behind the selection of its lags, only practical guidelines considering autocorrelation functions, but the seminal study of Box et al. (2015) suggested to prioritize parsimony (less complex models) to avoid overfitting. It is reasonable to think that lag selection should depend on the time series sample frequency. STL theorists recommend the number of lags of the seasonal component to be related with time series sample frequency, e.g. to take 12 lags in monthly series (CLEVELAND et al., 1990). The assumption underlying it is that the 13th month may be very correlated to the 1st, and the features introduced will not present seasonal contributions proportionally with the complexity of adding one more feature. Still, with fewer features, one may keep sufficient information, but we do not know what specific lags contribute more to each time series. A particular crime type would benefit from the first three lags and the other benefit most from the last. To solve this, we will discuss later in this chapter feature selection techniques

(28)

based on machine learning.

2.2.3 Geographic Features

After applying crime mapping and extracting spatiotemporal features from crime points, other secondary factors can help to explain the urban place in which the subre-gions were derived. Related studies have proposed joining many exogenous information to help in crime prediction tasks, such as social media traffic (GERBER, 2014) demogra-phy aspects (BOGOMOLOV et al., 2014) and geographic location of PoI data (LIN; YEN; YU, 2018). We argue that these strategies are difficult to reproduce in an arbitrary crime prediction application since the data availability can be a problem. Among the three types of information mentioned, we find that the latter can be acquired using the volun-teered geographic information systems such as OpenStreetMap. Thus, our secondary set of features, namely geographic features, will be explored under the PoI data available on OpenStreetMap to ensure more reproducibility potential across other cities.

To identify relevant PoI categories that help describe hotspots, we conducted an opin-ion survey in the Natal’s police department (SESED/RN) with police cops that work in patrols. A total of 54 interviewed cops were asked to assign a value for the relative impor-tance (an integer between 1 and 5) of different PoI categories that may spatially explain crime incidence of three different types: property (CVP in the nomenclature of Natal’s Police Department), violent or lethal (CVLI) and drug-related crimes (TRED). We in-structed them to assign 5 to the items that most contribute to (attracting or repulsing) crime incidence and 1 to items that do not influence in their opinion. The results are shown in Figure 5.

(29)

0 1 2 3 4 Street lighting level

Gangs location Public squares and sports playground Neighborhood population Schools Residential streets Neighborhood's per capita income Primary streets Touristic places Bars and restaurants Night clubs Hospitals or clinics Police departments Places of worship Apartment concentration Gated community

CVP

0 1 2 3 4 Gangs location Street lighting level Public squares and sports playground Neighborhood population Neighborhood's per capita income Bars and restaurants Touristic places Primary streets Schools Residential streets Night clubs Hospitals or clinics Police departments Places of worship Apartment concentration Gated community

CVLI

0 1 2 3 4 5

Importance

Gangs location Schools Public squares and sports playground Street lighting level Touristic places Neighborhood population Bars and restaurants Neighborhood's per capita income Night clubs Residential streets Primary streets Police departments Hospitals or clinics Gated community Apartment concentration Places of worship

TRED

Figure 5: Results of the survey with 54 police sergeants about the rating of different landmarks and demographic aspects for determining hotspots. At the top, the rating for property (CVP), in the middle for lethal or violent (CVLI) and for drug-related crimes (TRED) at the bottom.

(30)

They believe "Gangs location", "Street lighting level" and "Public squares" are cru-cial aspects for the three crime categories. Particularly for TRED crimes, "Schools" and "Touristic places" arise as essential features for them. Demographic elements, such as the population of the neighborhood and its per capita income, are other aspects in which they regard relevance when describing dangerous places. Another interesting fact is that residential streets are highlighted for CVP crimes, as found in a related work (DAVIES; JOHNSON, 2015). From our perspective, the results of the questionnaire do not necessarily reflect the aspects that determine which places are dangerous or not but give us a direc-tion to select PoI categories in the broad number of OpenStreetMap features. Also, we must say that the purpose of the survey conducted was not to compare the cops’ opinion with algorithm results, but to consult cops’ opinion regarding geographic risk factors and then model features considering data availability.

Related works have considered geographic features to model crime incidence using PoI data aggregation. Caplan and Kennedy (2011) suggested that the density of some facilities in city blocks, considering a bandwidth, can represent their spatial concentration. They also indicated that in violent crimes, the distance between the closest facilities may be another correlated spatial pattern. Wang, Brown and Gerber (2012) have used both mentioned methods, counting PoI within city blocks and taking the distance from the city block center to the closest PoI, generating spatial information regarding each layer of PoI. Differently, Lin, Yen and Yu (2018) used the counting strategy but also considering weighting neighbor city blocks, to increase the spatial autocorrelation in their aggregation. We argue that these approaches can be adapted to consider in a single variable both density and distance decay by using KDE (example in Figure 6), as we will discuss later. Also, we will present in the next chapter our approach to select the subset of PoI to be included in the predictions of each crime type, considering the particular correlations that each facility can correspondingly present.

2.3 Machine Learning

In the previous section, we explored the methods behind translating crime events and PoI locations into independent variables (features) for describing crime incidence in space and time. We mentioned that the crime mapping method is the spatial aggregation approach and using a temporal sample frequency, one can generate time series for each grid cell. Further, to extract a more diverse set of temporal patterns (such as trend and seasonality), we described methods of time series decompositions, which we argue

(31)

Figure 6: Geographic feature layers from Natal generated using KDE of residential streets (left) and schools (right). Note that residential streets are denser and concentrated in the north of the city, but still widespread in other places. Schools are more concentrated in the center of the city, following to the south, but with some concentration in the north.

being useful also as feature extraction methods. To complement features with external information, we suggested using OpenStreetMap data and extract geographic features based on PoI density, instead of the current practice of related studies. Nonetheless, the prediction methodology was not discussed yet.

To forecast crime incidence levels periods ahead using spatiotemporal features, super-vised machine learning methods have been aroused as efficient tools in many recent related studies (LIN; YEN; YU, 2018;VOMFELL; HÄRDLE; LESSMANN, 2018; ZIEHR, 2017;ARAUJO et al., 2017; BORGES et al., 2017). In such class of heuristic algorithms, there is a set of them specifically designed to learn relationships between a group of inputs or features X and outputs or target variable y. This set of algorithms are called supervised because they iteratively adjust internal weights to minimize the error between the predictions and the actual value. This process is called training and involves adjusting internal parameters using features extracted from the dataset. After training a model using an algorithm, one can use it to predict values for new data.

In crime prediction studies, researchers have used many supervised machine learning algorithms, such as Support Vector Machines (YU et al., 2011), Random Forest (VOMFELL; HÄRDLE; LESSMANN, 2018), Multilayer Perceptron (ARAUJO et al., 2018), including based on Deep Neural Networks (LIN; YEN; YU, 2018), and several others. To the best of our

(32)

knowledge, there is not a consensus on the best algorithm for crime prediction tasks. In this work, we do not intend to search for the best algorithm among those mentioned, but we aim to evaluate how different the performances can be with different algorithms in different crime mapping approaches. In our experiments, we will choose to consider Ran-dom Forest and Gradient Boosting for two reasons. First, because they were empirically suggested as efficient algorithms in crime prediction studies, respectively by Borges et al. (2017), Vomfell, Härdle and Lessmann (2018). Second, because both have similarities among each other, such as being ensemble algorithms based on Decision Tree, i.e. they are constituted by a finite set of Decision Trees combined to provide a better prediction. The assumption behind ensemble algorithms is that a group of weak learners forms a stronger one (BREIMAN, 2001).

Still, each of these ensemble algorithms has its proper way to combine learners, namely "bagging" in Random Forest and "boosting" in Gradient Boosting. Bagging is when a model randomly chooses subsets of data, with replacement, to give training samples for Decision Trees, fits them and then retrieve the average of their predictions. In addition to this process, the so-called bootstrap aggregation, the Random Forest algorithm trains each of its trees with different features, randomly selected for each one. On the other hand, boosting is when a model incrementally adds a new learner, updating weights (using gradient descent in the case of Gradient Boosting) of the samples in which there were more mispredictions (VOMFELL; HÄRDLE; LESSMANN, 2018)/.

Besides the algorithm choice, there are other concerns when applying machine learning for modelling hotspot predictions. First, supervised machine learning is often distinguished between classification and regression tasks, and in crime analysis, this can change the output considerably, as we will discuss. Second, we present feature selection methods also based on machine learning to filter lag components of each temporal decomposition, as well as filtering PoI data layers, that are most important for a more accurate estimation in each type of crime.

2.3.1 Prediction Task

A prediction task is defined accordingly with the target variable, which can be a class (hotspot or coldspot) or ordinal values (crime incidence levels). Previous work im-plemented both, classifiers (BOGOMOLOV et al., 2014; ARAUJO et al., 2018; LIN; YEN; YU, 2018) and regressors (MALIK et al., 2014; BORGES et al., 2017; ARAUJO et al., 2017) to es-timate dangerous places in the future. From our perspective, the latter strategy is more

(33)

appropriate, since classifying a place as a hotspot or not, or even as "low", "medium" and "high" dangerousness, must follow an aggregation on an ordinal hierarchy derived from crime incidence values. Thus, aggregating this value into classes would hide inherent variance present in the data.

Another concern related to such aggregation is that depending on the particular quan-titative definition of a hotspot (e.g. more incidences than the average of four last observa-tions), class unbalance may be a problem (ARAUJO et al., 2018). One can define a hotspot threshold in which few samples are delimited, and it may generate too much of a class. Also, much of the samples variance is lost in this discretization. On the other hand, mod-elling regressors in a highly variant samples requires further inspection on outlier filtering to prevent the model from biased predictions. For instance, if samples are concentrated in lower values, the model will prefer to predict less values to get better overall perfor-mance. We argue that crime mapping method parameter selection, described in Section 2.2 is crucial to obtain parsimony target variables and consequently, more efficient mod-els. Thus, the choice of the prediction task involves a trade-off analysis, between biased samples with regards to the choice of a threshold and variant samples when considering raw crime incidence levels.

2.3.2 Feature Selection

As we discussed, the selection of temporal lags and PoI layers leads to a leaner repre-sentation of spatiotemporal and geographic patterns. For instance, the trend component extracted from time series of crimes would have more correlated information with the firsts lags, and the seasonal component with the lasts. Also, hospitals density may be relevant to predict burglary crimes, but not to violent crimes. When a model uses many variables to predict a target, the algorithm fitting process becomes more complex, since the model will have much more parameters than inputs, which will result in lack of model stability and overfitting (VERLEYSEN; FRANÇOIS, 2005). Sometimes adding variables can even disturb algorithm performance, because the algorithm will try to fit variables that can even be noise to the predicted variable. Therefore, applying feature selection is an essential step to ensure all the models will use all variables available to predict.

The feature selection task can be performed by machine learning algorithms in three different ways, using wrapper, embedded or filter methods. Wrapper methods combine a search strategy with a predictor to select the best subset of features, training a machine learning algorithm for a subset of features randomly take, producing a set of models.

(34)

The model with the best performance represents the best set of features to be selected. Embedded methods differ from the wrapper ones because they analyze the model structure instead of the performance. They consider the weights assigned by the predictor for each feature as a measure of importance, excluding the least important ones. Finally, filter methods consider feature importance by using a measure for correlation (e.g. χ2) with the target variable, and also calculate feature-to-feature correlation to avoid redundancy.

In this work, we followed the suggestion of Kniberg and Nokto (2018), that systemati-cally evaluated several feature selection algorithms, providing useful guidelines. Although not having a single algorithm performing choice for both runtime and predictive per-formance, they suggested that a Decision Tree modelled as an embedded method had reasonable results in the two aspects. The idea of such embedded is to calculate feature importance applying for each feature random permutations and measuring the average performance drop when fitting the Decision Tree. The features with the lower loss are assumed as not important because it is not influencing the predictions as the others do.

(35)

3 The Predspot Framework

Given that data-driven predictive analysis varies according to the developers’ exper-tise, one can find different methodologies and frameworks to predict crime hotspots. For instance, Malik et al. (2014) divided the stages of its processing into (1) geospatial divi-sion in subregions, (2) generation of time series, (3) prediction and (4) visualization of results. Similarly, Lin, Yen and Yu (2018) proposed to (1) create a grid, (2) intersect the grid in the city map, (3 to 6) extract data and grid features, (7) train a machine learning algorithm and (8) assess the latter. The similarities across methodologies motivated us to pursue a more generalized approach, namely Predspot, which we discuss in this chapter.

In previous work, we proposed a framework detailing the steps of spatiotemporal modelling and machine learning for crime hotspot prediction (ARAUJO et al., 2018). Our purpose was to improve the tasks’ transparency and the parameter selection involved. In this chapter, we introduce a redesign of this framework to include a more generic ap-proach for applying efficient crime mapping, and a more detailed feature ingest procedure. We agree with Domingos (2012) that suggests the success of a machine learning solution is on feature engineering endeavors. The framework is divided into two phases, namely model selection and prediction service, analogously to the training and prediction steps of machine learning algorithms. Each phase has its steps to achieve the final goal. This divi-sion has the purpose of differentiating the model adjustment and its usage in operational policing software.

Furthermore, in this chapter, we explain how our methodology was implemented. We detail our python-package software to support model selection operations and a web service interface to illustrate how the prediction service can be managed. These elements shall guide the software routines involving deploying the Predspot framework in a police department.

(36)

3.1 Model Selection

As we discuss in Section 2.3, supervised machine learning algorithms need a training step to adjust internal parameters and to predict based on the patterns found in the train-ing data. In this section, we overview how to prepare the dataset, apply a crime mapptrain-ing method, extract features from time series and fit a model considering hyperparameter tuning. This workflow is comprised of three steps, namely "dataset preparation", "feature ingest" and "machine learning modelling". These three steps compose the so-called model selection phase, which purpose is to train, evaluate and save an efficient model that shall be used operationally. We describe each step providing explanations regarding the neces-sary inputs and parameters to be configured, as well as the corresponding workflow and how to evaluate the model selected. It is worthy mentioning that the evaluation of the model selection phase is limited to assess the predictions of the model adjusted. Thus, it is measured in terms of error or accuracy ratios and not directly concerned with the practical impact of policing operations.

3.1.1 Dataset Preparation

From a systemic point of view, the inputs of the model selection phase are a vectorized file of city map (e.g. shapefile), a sufficiently large database of crime records provided by the police department, and auxiliary data sources. To apply the procedures, the crime records must have been registered at least with latitude, longitude, timestamp and crime type. Also, in the Predspot framework, we propose using OpenStreetMap as the auxiliary data source. It provides data of Points-of-Interest (PoI) of many cities in an open-source manner, thus increasing the reproducibility potential of our approach. The data can be extracted through the OverPass API.1

The first processing step of model selection is "dataset preparation", illustrated in Figure 7. It concerns (1) loading the data from Crimes DB and external sources, (2) applying spatial filters using the City Shape and (3) separating crime data into Crime Scenarios, according to a crime types division. The spatial filtering process consists of applying a spatial join operation taking georeferenced data of crimes and PoI that are within the city boundaries. It is also important to drop duplicate records, and look if there are "default" location values which events without proper registration are improperly assigned to. Since crime data can be mostly acquired through human interactions, spatial

(37)

bias can arise (KOCHEL; WEISBURD, 2017). We argue that previously exploring the dataset to clean invalid records is crucial to avoid harming the model with invalid data.

SHAPE FILE CRIMES DB PoI DATA CITY SHAPE FILTERS CRIMES CRIMESCRIME SCENARIO OpenStreetMap API

Figure 7: Model selection begins with loading and preparing datasets. Required data sources include a crime database for model training and connection to the OpenStreetMap API to load PoI data. Also, the city’s shapefile is important for filtering data entered within its borders.

In addition, to manage the separation of crime scenarios, it is important to follow the division made by the local authorities presented in the data. For example, if the police department is concerned with burglary crimes and the data has many types of burglaries, the aggregation of all types of burglary must be made carefully and aligned with the police department’s opinion. Otherwise, we suggest using the default division of the data rigorously. Supported by empirical evidence (ANDRESEN; LINNING, 2012), we argue that it is not appropriate to aggregate crime types into a broader category. The sum of spatial contributions of different sources of crime, such as residential burglary and drug-related offenses, can generate areas in the middle of these two types of events where no crime happens at all. Also, the Natal’s police department suggested dividing the data into daytime and nighttime scenarios for each crime type, since different patterns can arise. In the following steps of Predspot, we use each of these crime scenarios (crime type and day period) as separate datasets.

Regarding PoI data extraction, one can easily download data from OpenStreetMap querying from the Overpass API or using the web-based tool Overpass Turbo.2 _The

data is categorized into map features: streets, traffic signs, and intersections belonging to the "highway" category; hospitals, schools, restaurants and other facilities belonging to "amenity"; other interesting categories are "leisure", "tourism" and "nature".3 _Still,

choosing PoI categories may not be a trivial task, and we argue that it can be made by analyzing geographic risk factors with the help of policing experts, as presented in Section

2_{https://overpass-turbo.eu/}

(38)

2.2.3 or by analyzing related studies. For example, Davies and Johnson (2015) argues that the street network has shown relevance in his crime prediction study. The selection of the most relevant PoI layers is discussed later in this chapter as a feature selection problem.

3.1.2 Feature Ingest

Before the "feature ingest" step begins, it is necessary to choose the spatial and temporal units of predictions. First, the choice of the spatial unit of analysis can be made according to police department patrolling policies, or according to current practices of crime mapping. If the police department wants predictions by neighborhood, the grid is made up of neighborhoods in the city. If there is no such restriction and spatial resolution is a priority, artificial grids from a crime mapping method may be a suitable alternative. For example, KDE works with a grid of points, and KGrid with a set of convex hull polygons. Artificial grids have the advantage of configuring spatial resolution using a parameter. For example with K in KGrid or the space between grid points in KDE. As a drawback of high-resolution grids, it generates sparser time series more challenging to predict (MALIK et al., 2014; BORGES et al., 2017). Therefore, one cannot merely increase the grid resolution without further inspecting the results. On the other hand, the second choice is to select the temporal sample frequency. Similarly to grid cells, too small time intervals of aggregation may also result in sparse time series. Even that police may prefer hourly sampled predictions, we agree with Malik et al. (2014) that suggested weekly, or monthly aggregates are more appropriate, depending on the sample size.

With the spatial and temporal units defined, the "feature ingest" step’s workflow con-sists of the following procedure (illustrated in Figure 8). The first data item necessary is the set of crime events of a given crime scenario (described as tuples of latitude, longitude and timestamp). As it has spatiotemporal attributes, we start the ingest by aggregat-ing crimes spatially, accordaggregat-ingly with the Grid disposition derived from the Mappaggregat-ing Method, and then temporally by sampling time series for each place and time interval, completing the so-called Spatiotemporal Aggreg.. This result in Time Series Cij,

indexed by the grid cell i and the time interval j.

Then, we use these time series to start the Temporal Feature Extraction, illus-trated in Figure 9. In Section 2.2.2 we described two time series decomposition methods, namely ARIMA and STL. Although each method has its way of setting parameters to make predictions, we take advantage of the derived components as the basis for the feature

(39)

DATASET CRIME SCENARIO DATASET CRIME SCENARIO FEATURE SET X y CRIME SCENARIO TIME SERIES Cij CITY SHAPE SPATIO-TEMPORAL AGGREG. TEMPORAL FEATURE EXTRACTION PoI DATA GRID SPATIAL AGGREG. PoI DENSITY Ghi MAPPING METHOD JOIN

Figure 8: Between data loading and model building, the feature ingest process is responsi-ble for assembling the independent variaresponsi-bles and the variaresponsi-ble to be predicted. This starts with the spatiotemporal aggregations of crimes, through crime mapping method and time series manipulation, and PoI data spatial aggregation. The grid is a supporting element in this step and will be used as the places where criminal incidence will be predicted. Time series Cij extracted from grid cells indicate a set of values from a grid cell i in period j,

and PoI features Gk

i represent a the density of a k PoI category located in a grid cell i.

extraction process to feed supervised machine learning algorithms. Then, we consider as learning target the time series of original values one period ahead. Assuming that each region i of the Grid has particular trend, seasonality and differentiation aspects, we apply STL and Diff. operations for C1j, C2j, ..., Cnj (being n the size of the Grid) separately.

Finally, we take k lags accordingly with the temporal sample frequency to represent past observations of each component as Tk

ij, Sijk, Dijk. The objective is to represent the

corre-sponding crime incidence level of time j (target) with independent variables (features) expressed in past observations of trend, seasonality and differentiation in times j _{− 1,} j_{− 2, ..., j − k.}

Further, we propose to complement such temporal representation of crime incidence, using PoI data to help to describe each place i of the Grid geographically in terms of the facilities nearby. To do so, it is necessary to apply a Spatial Aggreg. operation. In Section 2.2.3, we discussed that related studies have considered counting PoI items within grid cells or the distance between the closes PoI (WANG; BROWN; GERBER, 2012; LIN; YEN; YU, 2018). We argued that KDE might include these both aspects by weighting items with a kernel that considers distance decay. Therefore, we propose using KDE as the Spatial Aggreg. method. The objective of such aggregation is to produce PoI density Ghi values for each region i and each PoI category h extracted.

(40)

STL TREND T_ij SEAS. S_ij DIFF. _DIFF. D_ij Tk_ij Sk_ij Dk_ij k LAGS TEMPORAL FEATURE EXTRACTION TIME SERIES Cij

Figure 9: The extraction of the independent variables (features) of the time series in Predspot is conducted through a trend and seasonality decomposition (through STL) and series differentiation. Each decomposition will generate a new series, and the variables will be composed of k lags from each of these series.

Ultimately, the "feature ingest" step ends by joining T , S, D and G to produce the feature matrix X, and the corresponding crime incidence levels target y that can be used to fit a machine learning model. In Figure 10, we illustrate an artificial example of how the temporal features can be disposed. Besides, we also consider timestamp attributes, such as the corresponding year and month, and the geographic features to compose the final Feature Set.

53 48 55 32 31 34 14 44 43 46 28 27 29 14 43 46 45 27 29 30 14 46 45 47 29 30 26 14 5 7 -4 1 3 1 14 7 -4 -3 3 1 -2 14 -4 -3 2 1 -2 -1 14 3 6 -2 2 4 -1 14 6 -2 -4 4 -1 5 14 T1 T2 T3 S1 S2 S3 D1 D2 1 1 1 2 2 2 3 0 1 2 0 1 2 0 -2 -4 1 -1 5 3 14 D3 13 13 13 2 2 2 14 G1 22 22 22 5 5 5 14 G2 6 6 6 16 16 16 14 G3 2 2 2 11 11 1 14 G4 15 15 15 9 9 9 14 G5 Place, Time y

Figure 10: An artificial example of temporal features for two places and three time inter-vals.

(41)

3.1.3 Machine Learning Modelling

The input required to start the machine learning modelling step is the feature set. As we discussed in Section 2.3.3, feature dimensionality can influence the performance of the predictive algorithm, introducing complexity in adjusting the model weights and even adding noise to the data. Thus, to filter the features of each crime scenario, feature selection is necessary before adjusting the models. We propose it to be done as part of the "machine learning modeling" step because we use machine learning-based strategies for feature selection, as suggested by Kniberg and Nokto (2018).

With noisy features properly filtered, different supervised learning algorithms can be adjusted to predict crime hotspots, following the flow illustrated in Figure 11. There is not a clear consensus on the best machine learning algorithm for predicting crime hotspots. As in the previous version of our framework, we do not propose the use of a single learning algorithm, but experimenting with several of them to select the one with the best score for each crime scenario. In this sense, our framework takes a model-agnostic approach to use the one that better fits, according to appropriate assessment. Considering that a substantial volume of machine learning algorithms have recently emerged, we argue that the search for the best algorithm can displace the focus of the modelling approach proposed in this study.4

SUPERVISED LEARNING ALGORITHM TUNING SUPERVISED LEARNING ALGORITHM TUNING SUPERVISED LEARNING ALGORITHM TUNING FEATURE SELECTION ALGORITHM EVALUATION FEATURE SET X y GOLDEN_MODEL

Figure 11: In the machine learning modeling step, the best qualified features are selected and feed into the algorithms. By adjusting various algorithms, we can evaluate the models to use the one that has the best predictive performance.

Still, we consider as relevant using a Tuning strategy to optimize machine learning hyperparameters. Previous studies have shown the improvement of algorithm performance when applying it (BERGSTRA; BENGIO, 2012). Olson et al. (2017) suggested hyperparam-eter tuning led up to 50% improvements in CV score. In the Predspot, as we are training 4_{Automated machine learning (}_{FEURER et al.}_{, 2015) has shown to be a promising model-agnostic} approach which we will explore in future work.