Fundação Getulio Vargas
Escola de Matemática Aplicada
Brenda Quesada Prallon
Functional Classification of Bitcoin Wallets
Rio de Janeiro 2020
Dissertation submitted to the Escola de Matemática Aplicada in partial fulfillment of the requirements for the degree of Master in Mathematical Modeling.
Concentration Area: Data Science
Advisor: Yuri Fahham Saporito
International Cataloging-in-Publication (CIP) Data. Catalog record prepared by the FGV Library System (Sistema de Bibliotecas/FGV)
Prallon, Brenda Quesada
Functional classification of bitcoin wallets / Brenda Quesada Prallon. – 2020.
112 leaves.
Dissertation (master's) - Fundação Getulio Vargas, Escola de Matemática Aplicada.
Advisor: Yuri Fahham Saporito. Includes bibliography.
1. Bitcoin – Mathematical models. 2. Electronic funds transfer. 3. Data modeling. 4. Investments – Analysis. I. Saporito, Yuri Fahham. II. Fundação Getulio Vargas. Escola de Matemática Aplicada. III. Title. CDD – 006.31
BRENDA QUESADA PRALLON
"FUNCTIONAL CLASSIFICATION OF BITCOIN WALLETS"
Dissertation presented to the Master's Program in Mathematical Modeling of the Escola de Matemática Aplicada for the degree of Master in Mathematical Modeling.
Defense date: 16/04/2020
Signatures of the examination committee members
Chair of the Examination Committee: Prof. Yuri Fahham Saporito
Yuri Fahham Saporito (Advisor), Rodrigo dos Santos Targino (Member), Arthur Bragança (Member)
In compliance with Decree no. 46.970 of 13/03/20 of the Executive Branch of the State of Rio de Janeiro, DOE no. 047-A of 13/03/20, Art. 4, and MEC Ordinance no. 343 of 17/03/20, DOU no. 53 of 18/03/20, which provide for the temporary suspension of in-person academic activities and the use of technological resources (in accordance with current legislation) in view of COVID-19, thesis and dissertation defenses will exceptionally be held remotely, including committee members and the student.
Cesar Leopoldo Camacho Manco, Director
Antonio de Araujo Freitas Junior, Pro-Rector for Teaching, Research and Graduate Studies, FGV
Normative Instruction no. 01/19 of 09/07/19 - FGV Pro-Rectorate
In case of non-in-person participation of member(s) of the Examination Committee, the Chair of the Examination Committee will sign the document as legal representative, as delegated by this Normative Instruction.
Acknowledgements
Firstly, I would like to thank my family for all the hardship they have endured in order to support all of my choices that led me here. I owe them everything.
Secondly, Gabriel, whose unconditional love helped me overcome my doubts and fears, and makes me want to be my best self. The fact that he is great at debating my ideas is just another perk.
I would like to express my utmost gratitude to my advisor, Yuri Saporito, who not only is a brilliant and dedicated teacher, but also a great person. His mentorship forever changed my perspective on research, and made this journey especially fun.
To the friends made through this quest, especially Bernardo, Igor, Marcelo, Pedro and João Miguel, it has been a pleasure learning with and from you, and I hope to continue in the future. “There are some things you can’t share without ending up liking each other...” and completing a master's program is one of them. To my lifelong friends, thank you all so very much for being so understanding of my absences, and for being the best cheering team anyone could ask for.
I would like to thank the committee members, Rodrigo Targino and Arthur Bragança, for the valuable comments that have significantly improved this work. I further thank the teachers and staff of EMAp for all the lessons and attention.
I also thank Professors Wenceslao Manteiga and Manuel Febrero, who so kindly welcomed me and advised me during my visit at USC. Their guidance was essential from the very beginning and allowed me to get in touch with topics at the border of Functional Data Analysis.
Finally, this project would not have been possible without the financial support of FGV-EESP and Ripple. I would like to thank them deeply, and also CAPES for the continuous funding.
Abstract
This work proposes a classification model for predicting the main activity of bitcoin wallets based on their balances. Since the balances are a function of time, we apply functional data analysis methods; more specifically, the features of the proposed models are the functional principal components. The estimation of functional principal components is explained in detail. Classifying bitcoin wallets is a relevant problem for two main reasons: to understand how the bitcoin market works, and to identify accounts used for illicit activities. Although other bitcoin classifiers have been proposed, they focus primarily on network analysis rather than curve behavior. Results show improvement when combining functional features with scalar features, and similar accuracy for the models using those features separately, which points to the functional model being a good alternative when domain-specific knowledge is not available.
Contents
1 Introduction
1.1 Bitcoin
2 Literature Review
2.1 Classification with Functional Data
2.1.1 General Overview
2.1.2 Classifiers from the Multivariate Framework with Functional Covariates
2.2 Classification of Bitcoin Data
3 Functional Data Analysis
3.1 Smoothing Functional Data
3.1.1 Smoothing Functional Data with a Roughness Penalty
3.1.2 B-splines
3.1.3 Functional Principal Component Analysis
3.1.4 Estimating the functional principal components
3.1.5 Poisson Process
4 Data
4.1 Labelling Wallets
4.2 Wallets Balances
4.3 Wallets Treatment
4.3.1 A Sampling Issue
4.3.2 Estimating curves
4.4 Additional Curves
4.4.1 Derivatives
4.4.2 Poisson Rate Curves
5 Classification Models
5.1 Vector Model
5.2 Functional Data Classification
6 Results
6.1 Min. 10 observations
6.1.1 Random Forest
6.1.2 Other Algorithms
6.2 Min. 20 observations
6.2.1 Random Forest
6.2.2 Other Algorithms
6.3 FPCA plots
7 Conclusion
8 References
A Appendix
A.1 B-Splines
A.2 The GCV Method
A.3 Plots of wallets of other categories
A.3.1 Original curves
A.3.3 Smooth curves
A.3.4 Smooth Derivatives
A.3.5 Poisson Rates
1 Introduction
This dissertation tackles the question of building a model to classify the main activity of a bitcoin wallet. Bitcoin was the first decentralized cryptocurrency to be created, and it is also the most popular. Anonymity is a central characteristic of the Bitcoin protocol, and one of the reasons for its success. This implies, nevertheless, that the market composition of this cryptocurrency is not obvious: the purposes for which bitcoins are spent are obscure. It is known, however, that illicit services account for a relevant portion of that market [1], which is unsurprising given that anonymity is attractive to law-breakers [2].
Thus, identifying the main activity of a bitcoin wallet is a relevant issue, since it both aids law-enforcement and sheds some light on the bitcoin market organization.
We would like to propose a solution to this issue considering only information on the wallets’ balances. More precisely, we use the account movements, which are the credits and debits made throughout a period of time. This is essentially a task of classification by analysing the behaviour of curves. Although multiple works share the same ultimate goal of classifying bitcoin wallets by activity, the classifiers in the literature focus on network analysis and on arbitrary features that come from field-specific knowledge, a process also known as ‘feature engineering’. Our approach differs because we want to use the fact that the account movements are a function of time, or, more specifically, of the wallet’s life span [3]. That is, we aim to find patterns in the shapes of the movements’ curves, and to classify the wallets based on them. This is done by employing Functional Data Analysis (FDA) tools.
The arbitrary features normally contain implicit information on some aspects of curve behaviour; for example, the number of credits is a measure of frequency, and the total amount of credit is a measure of level. FDA, however, summarizes this information in a more automatic way, meaning that the methods do not require as much domain-specific knowledge. This is particularly useful when the process of feature engineering is complicated.
There are many specific methods for classifying functional data. In this dissertation, the focus is on adapting classifiers from the multivariate framework. This is accomplished by projecting the curves of account movements onto a functional basis, and then using the coefficients as functional features. The proposed basis is the eigenbasis of the Karhunen-Loève expansion, the functional principal components basis. Besides being widely explored in the literature, this basis captures the largest share of the variance of the observed curves, analogously to classical multivariate principal components analysis; this means that, typically, few components are necessary to explain most of the differences between the functions.
[1] 46% of bitcoin transactions, as estimated by Foley et al. (2019).
[2] The Silk Road darknet market, closed by the FBI in 2013 and used mainly for commercializing illegal drugs, moved approximately fifteen million dollars annually in transactions (Christin, 2013).
[3] Here understood as the time elapsed between the first credit, at t = 0, up to a pre-defined limit (chosen empirically), t = T.
Regarding the activities for which bitcoins are used, six categories are proposed here, based on the work of Tomé (2017): exchange, darknet, gambling, mixer services, payment systems, and mining. Mining is the activity through which blocks are validated (more in Section 1.1), and mixer services are a way of providing more privacy to the user by mixing different addresses through transactions and making it harder to trace them to real-world entities.
Although the application shown here is for bitcoin data, the methods described are general and may be useful for other types of functional data, such as regular bank accounts. Even if anonymity is not a major issue in this context, analysing curve behaviour might be helpful in detecting scams and money laundering; besides illegal activities, FDA can aid in predicting default risk and identifying profiles for bank customers.
The text is structured as follows: Section 1.1 presents a brief introduction to Bitcoin; Section 2 discusses some of the literature on both classification with functional data and classification of bitcoin wallets; Section 3 details the FDA methodology involved, such as how to represent functions in finite dimension and how to estimate functional principal components. Section 4 describes all the procedures of gathering and treating the data, as well as characteristics of the data. Only plots of exchange wallets are shown, to ensure the fluidity of the text. Section 5 details the models tested, their features and algorithms. Results are presented and discussed in Section 6, and Section 7 draws conclusions and proposes future work. Finally, the Appendix has some additional information on functional methods, as well as plots for the other categories.
1.1 Bitcoin
Bitcoin was created in 2008 with the publication of the article Bitcoin: A Peer-to-Peer Electronic Cash System, written by one or more individuals under the pseudonym Satoshi Nakamoto (whose identity is still unknown). The Bitcoin software was released in January 2009 as open-source code. The most important characteristics of Bitcoin - or any other cryptocurrency, for that matter - are anonymity and decentralization, meaning that there is no identification requirement and no intermediary is needed to perform transactions (unlike other digital payment networks, such as PayPal).
The blockchain, in very general terms, works as a ledger organized in blocks, public and accessible to everyone, where all intended transactions are registered. To prevent false transactions from being added to the blocks, digital signatures come into play: all the users involved in a transaction must validate it by providing one. It works as follows: each user has a secret (or private) key that is combined with the transaction identification in order to generate a signature that will appear on the block. There is a function to verify this signature, which combines it with the transaction identification and the public key and checks whether the signature was indeed created by the secret key associated with the public key. This system also prevents the signatures from being forged, because they depend on the transaction itself, yielding a different signature for every operation.
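The sign-and-verify mechanism described above can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses textbook RSA with two small, publicly known primes as a stand-in, whereas Bitcoin itself relies on elliptic-curve signatures (ECDSA); the function names `digest`, `sign` and `verify` and the transaction strings are hypothetical.

```python
import hashlib

# Toy textbook-RSA key pair (illustrative only, NOT secure): two Mersenne primes.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent (public key)
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (secret key), Python 3.8+


def digest(transaction: str) -> int:
    """Hash the transaction identification to an integer smaller than N."""
    raw = hashlib.sha256(transaction.encode()).digest()
    return int.from_bytes(raw, "big") % N


def sign(transaction: str, private_exponent: int) -> int:
    """Combine the secret key with the transaction to produce a signature."""
    return pow(digest(transaction), private_exponent, N)


def verify(transaction: str, signature: int, public_exponent: int) -> bool:
    """Check, using only public information, that the signature matches."""
    return pow(signature, public_exponent, N) == digest(transaction)


tx = "alice pays bob 0.5 BTC"
sig = sign(tx, D)
print(verify(tx, sig, E))                        # signature matches the transaction
print(verify("alice pays bob 50 BTC", sig, E))   # reusing it elsewhere fails
```

Because the signature is computed from the transaction's hash, copying it onto a different transaction makes verification fail, which is the forgery protection the paragraph refers to.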
Furthermore, there is still the issue of double spending. To check if a user is spending more than they have in their account, it is necessary to know the full history of transactions up to that point.
The blockchain is not controlled by one entity; instead, all users keep a copy of it. Intended transactions are broadcast to the network, so that every user records them in their own copy. It is still necessary, however, to ensure that all copies match. The original Bitcoin paper proposes that the version with the most computational work put into it is the one that should be trusted.
The way to check the computational work put into a block is called proof-of-work. For the network to accept a block, it must carry a proof-of-work: a number that, when fed to a cryptographic hash function together with the block's contents, yields an output that meets a target. Because these functions cannot be inverted (it is computationally infeasible to recover the input of the function given the output), the only way to meet the target is through trial and error. The target is adjusted periodically, so that a new block is added to the chain approximately every ten minutes.
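The trial-and-error search can be sketched in a few lines. The example below is a simplified illustration, assuming the common formulation in which the hash, read as an integer, must fall below a target threshold; it uses a single SHA-256 over a made-up string, not Bitcoin's actual block header format, and `block_hash`, `mine` and `DIFFICULTY_BITS` are hypothetical names.

```python
import hashlib


def block_hash(block_data: str, nonce: int) -> int:
    """Hash the block contents together with a candidate nonce."""
    payload = f"{block_data}|{nonce}".encode()
    return int.from_bytes(hashlib.sha256(payload).digest(), "big")


def mine(block_data: str, difficulty_bits: int) -> int:
    """Try nonces 0, 1, 2, ... until the hash falls below the target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while block_hash(block_data, nonce) >= target:
        nonce += 1
    return nonce


DIFFICULTY_BITS = 12  # roughly 2**12 = 4096 attempts on average
block_data = "previous-hash|tx1;tx2;tx3"
nonce = mine(block_data, DIFFICULTY_BITS)
print(nonce, hex(block_hash(block_data, nonce)))
```

Finding the nonce takes thousands of hash evaluations, while checking it takes one. Raising `DIFFICULTY_BITS` by one roughly doubles the expected work, which is how adjusting the target can keep the block interval stable.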
For the blocks to remain in order, each block also carries the hash of the previous block. Anyone can create a block by registering the broadcast transactions and finding a proof of work; this is why the system is decentralized. The reward for that process, called mining, is paid in bitcoins that are newly created rather than exchanged for regular currency.
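To see why carrying the previous block's hash keeps the blocks in order, consider the following minimal sketch. The block layout (a dictionary with a list of transaction strings) and the names `make_block` and `chain_is_valid` are assumptions made for illustration; the proof-of-work step is omitted.

```python
import hashlib


def make_block(transactions: list, previous_hash: str) -> dict:
    """Build a block whose hash covers its transactions and its predecessor."""
    body = previous_hash + "|" + ";".join(transactions)
    return {
        "transactions": transactions,
        "previous_hash": previous_hash,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
    }


def chain_is_valid(chain: list) -> bool:
    """Recompute every hash and check each link to the previous block."""
    for i, block in enumerate(chain):
        body = block["previous_hash"] + "|" + ";".join(block["transactions"])
        if block["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        if i > 0 and block["previous_hash"] != chain[i - 1]["hash"]:
            return False
    return True


genesis = make_block(["coinbase -> miner_a"], "0" * 64)
second = make_block(["miner_a -> bob: 1"], genesis["hash"])
third = make_block(["bob -> carol: 0.3"], second["hash"])
chain = [genesis, second, third]
print(chain_is_valid(chain))
```

Altering a transaction in an earlier block changes that block's hash, so every later block would have to be rebuilt for the chain to validate again.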
The consensus rule is to trust the longest chain (the one with the most computational work). This rule makes it almost impossible to create fraudulent blocks; if a miner inserts invalid transactions in a block and finds a proof of work, they will have to keep finding the proof of work for every following block in order for the network to trust the incorrect chain. Since finding a proof of work is essentially a lottery where more computation capacity buys you more tickets, this is highly unlikely, unless the miner controls more than 50% of the world’s computational power devoted to bitcoin mining.
Figure 1: Bitcoin market price (USD)
It is important to stress that once someone exchanges regular currency for bitcoins, the protocol works independently: bitcoins are not anchored in any regular currency. Furthermore, the amount of bitcoin increases because of the mining rewards. The reward is, however, halved every 210,000 blocks. This does not mean that miners’ incentives will ultimately be zero, because users can still include a transaction fee as a way to encourage miners to quickly append their transaction to the next block. After all, each block is limited in size, which in practice caps it at roughly 2400 transactions.
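The effect of the halving schedule on the total supply can be checked with a short calculation. This sketch assumes the protocol's initial reward of 50 bitcoins and a halving interval of 210,000 blocks, with amounts counted in satoshis (1 bitcoin = 100,000,000 satoshis); the function names are hypothetical.

```python
HALVING_INTERVAL = 210_000            # blocks between reward halvings
INITIAL_REWARD = 50 * 100_000_000     # 50 bitcoins, in satoshis


def block_reward(height: int) -> int:
    """Mining reward (in satoshis) for the block at the given height."""
    halvings = height // HALVING_INTERVAL
    return INITIAL_REWARD >> halvings  # integer halving, eventually reaches 0


def total_supply() -> int:
    """Sum the rewards of all blocks until the reward rounds down to zero."""
    total, halvings = 0, 0
    while (INITIAL_REWARD >> halvings) > 0:
        total += HALVING_INTERVAL * (INITIAL_REWARD >> halvings)
        halvings += 1
    return total


print(block_reward(0) / 1e8, block_reward(210_000) / 1e8)
print(total_supply() / 1e8)  # just under 21 million bitcoins
```

The rewards form a geometric series, so the supply converges to slightly less than 21 million bitcoins, after which transaction fees are the only incentive left for miners.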
Even though Bitcoin users are not identified, their anonymity is not complete. Since the transactions are public, it is possible to link multiple wallets’ addresses. To prevent that, mixer services, in which the origin and destination of bitcoins are dissociated, have become popular. Other cryptocurrencies have been developed attempting to tackle that issue, such as Monero and Zcash.
For an individual with no mining aspirations, exchanges are usually the best way to buy bitcoins. They work analogously to regular exchanges or brokers.
Finally, the high price volatility needs to be pointed out (see Figure 1). This factor is responsible for much of the criticism that refers to Bitcoin as a bubble phenomenon.
The wallets considered in this work were active at some time between 2011 and the beginning of 2017. Although this is not the time frame of higher volatility, the price does vary, roughly, between 1 and 1000 USD.
2 Literature Review
2.1 Classification with Functional Data
2.1.1 General Overview
The literature on classification with functional data is quite extensive. Many methods from the multivariate framework were modified to fit the functional context, while others were applied on functional covariates such as the coefficients from a basis expansion. The latter is also called “the filtering approach” in Baíllo et al. (2011), and the main difficulty associated with it is the choice of an appropriate basis. The filtering approach is in fact the focus of our work and will be discussed in Section 2.1.2.
Restricting the analysis to the supervised case, Cuevas et al. (2007) propose the classification of functional data via projection-based depth notions, and use the k-NN classifier as a reference method; the k-NN algorithm can be modified to incorporate functional data through the metric employed to measure the distance between the elements - typically, the $L^2$ metric is chosen, but virtually any other functional metric could be used. López-Pintado and Romo (2007) present inference tools for functional data, again using depth notions, and study their asymptotic behavior. Also on this trend, Sguera et al. (2014) introduce a new functional depth - the kernelized functional spatial depth - for studying functional samples that require analysis at a local level. They test their method by classifying data in which the differences between groups are not excessively marked or contain outliers, and obtain consistently better results than the default k-NN method and other functional depths.
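As an illustration of how the k-NN classifier mentioned above carries over to curves, the sketch below replaces the Euclidean distance with a trapezoidal approximation of the $L^2$ distance. It assumes all curves are sampled on a common, equally spaced grid; the data are synthetic and the names `l2_distance` and `knn_predict` are hypothetical.

```python
import math


def l2_distance(f, g, dt):
    """Trapezoidal approximation of the L2 distance between two sampled curves."""
    sq = [(a - b) ** 2 for a, b in zip(f, g)]
    integral = dt * (sum(sq) - 0.5 * (sq[0] + sq[-1]))
    return math.sqrt(integral)


def knn_predict(train_curves, train_labels, new_curve, dt, k=3):
    """Majority vote among the k training curves closest in L2 distance."""
    dists = sorted(
        (l2_distance(curve, new_curve, dt), label)
        for curve, label in zip(train_curves, train_labels)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


# Synthetic curves on a common grid over [0, 1] (made-up data).
grid = [i / 50 for i in range(51)]
dt = grid[1] - grid[0]
train_curves, train_labels = [], []
for shift in (0.0, 0.1, 0.2):
    train_curves.append([math.sin(2 * math.pi * t) + shift for t in grid])
    train_labels.append("sine")
    train_curves.append([t + shift for t in grid])
    train_labels.append("linear")

new_curve = [math.sin(2 * math.pi * t) + 0.05 for t in grid]
print(knn_predict(train_curves, train_labels, new_curve, dt))
```

Swapping `l2_distance` for another functional metric or semi-metric (for example, a distance between derivatives) changes the notion of neighbourhood without touching the rest of the classifier.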
James and Hastie (2001) introduce the functional linear discriminant analysis (LDA) method, an extension of the multivariate LDA, particularly useful for irregularly sampled data. On the matter of curves measured through different intervals, Delaigle and Hall (2013) suggest a non-parametric approach to extending curves outside of their originally observed intervals; Rice and Wu (2001) propose a mixed effects model, where each individual curve is represented as the sum of a population mean function, a random function, and white noise, with the functions being estimated non-parametrically with splines.
Besides linear discriminant analysis, other models successfully adapted to the functional setting are generalized additive models (Febrero-Bande and González-Manteiga, 2013) and support vector machines (Rossi and Villa, 2006). Ferraty and Vieu (2006) detail non-parametric models for classification, where the Bayes Rule is used to determine the group of a new observed curve, using kernel estimators and different functional metrics or semi-metrics to estimate the posterior probabilities.
2.1.2 Classifiers from the Multivariate Framework with Functional Covariates
In this dissertation, traditional classification algorithms, such as the random forest, gradient boosting, support vector machine and logistic regression, are used with functional covariates. In general, any multivariate classification methodology can be applied to functional data. One way to do so is to represent the infinite-dimensional function in finite dimension through a basis expansion (see more details in Section 3). The coefficients from that expansion are used as features in the algorithm. The basic assumption behind this method is that the basis accounts for all the information in the function. This approach is also useful when the choice of algorithm matters for the problem at hand, since certain algorithms are known to perform better than others in specific contexts.
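The idea of feeding basis coefficients to a multivariate classifier can be sketched as follows. This is a simplified illustration, not the pipeline used in this work: it assumes an orthonormal Fourier basis on [0, 1] instead of functional principal components, computes the coefficients as numerical inner products, and uses a nearest-centroid rule as a stand-in for any multivariate classifier. All names and data are hypothetical.

```python
import math


def fourier_basis(t, n_basis):
    """First n_basis functions of the orthonormal Fourier basis on [0, 1]."""
    values = [1.0]
    k = 1
    while len(values) < n_basis:
        values.append(math.sqrt(2) * math.sin(2 * math.pi * k * t))
        if len(values) < n_basis:
            values.append(math.sqrt(2) * math.cos(2 * math.pi * k * t))
        k += 1
    return values


def coefficients(curve, grid, n_basis):
    """Approximate c_k = <x, phi_k> with the trapezoidal rule."""
    dt = grid[1] - grid[0]
    coefs = []
    for k in range(n_basis):
        prod = [x * fourier_basis(t, n_basis)[k] for x, t in zip(curve, grid)]
        coefs.append(dt * (sum(prod) - 0.5 * (prod[0] + prod[-1])))
    return coefs


def nearest_centroid(features, labels, new_feature):
    """Stand-in multivariate classifier acting on the coefficient vectors."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(labels)):
        rows = [f for f, l in zip(features, labels) if l == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(centroid, new_feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


grid = [i / 100 for i in range(101)]
curves, labels = [], []
for amp in (0.8, 1.0, 1.2):
    curves.append([amp * math.sin(2 * math.pi * t) for t in grid])
    labels.append("A")
    curves.append([amp * math.cos(2 * math.pi * t) for t in grid])
    labels.append("B")

features = [coefficients(c, grid, 3) for c in curves]
new_features = coefficients([1.1 * math.sin(2 * math.pi * t) for t in grid], grid, 3)
predicted = nearest_centroid(features, labels, new_features)
print(predicted, [round(c, 3) for c in new_features])
```

Each curve of 101 sampled values is reduced to three coefficients, and the classifier only ever sees those three numbers; any of the algorithms listed above could take the place of `nearest_centroid`.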
Although there are specific methods for reducing dimensionality with the purpose of functional classification, such as the “Functional Adaptive Classification (FAC)” proposed by Tian and James (2013), the most common way to reduce dimensionality is through the truncated Karhunen–Loève expansion (see Equation (17)), which yields an empirical basis. The resulting coefficients are the functional principal components (FPCs). Hall, Poskitt and Presnell (2001) argue that the FPCs capture the greatest part of the curves, in an $L^2$ sense, since they maximize variance. They apply this transformation to radar signal curves, and estimate the density of the coefficients to compute the posterior probabilities in order to classify the signals into eight different groups. Their model performs significantly better than the multivariate counterpart of canonical variates analysis, independently of the method chosen to estimate the FPCs density (nonparametrically, with kernels, or assuming a Gaussian distribution).
In the following paragraphs, we will pay attention to the supervised case; however, FPCs are also useful for unsupervised classification problems: Illian et al. (2009) aim to analyse microbial communities based on fingerprinting techniques, and find that FPCA combined with hierarchical cluster analysis does provide a better approach to understanding the microbial communities.
In Müller et al. (2005) it is shown that, when using the principal components basis (or any orthogonal basis) to represent the functions and truncating the expansion, the functional logistic model becomes equivalent to a multiple logistic regression with FPCs (or other coefficients). Escabias and Valderrama (2004) compare two different approaches for a functional logistic model. The first one consists of smoothing the curves on an arbitrary basis, and then performing regular PCA on the matrix defined by the multiplication of the coefficients matrix by the inner product matrix. The second one consists of estimating the functional principal components by the method described in Section 3.1.4. The resulting principal components are inputs for the multiple logistic regression. In Escabias and Valderrama (2005), the first method is applied with B-splines to predict risk of drought across different areas in Canada.
A functional logistic model is also used to classify gene expression data in Leng and Müller (2006). They use the method presented by Rice and Wu (2001) for comparison, and find that their own method achieves lower error rates with fewer basis functions. On the same subject, Song et al. (2008) develop a model to classify gene expression that consists of applying multivariate classification algorithms, such as k-NN, LDA, QDA (quadratic discriminant analysis) and SVM, to FPCs estimated from the data. They first smooth the data with a pre-defined basis and then compute the principal components. The pre-defined basis is chosen as a cubic B-spline basis, since it is flexible and computationally efficient. The numbers of B-spline components and principal components are chosen through leave-one-out cross-validation; results indicate that QDA, followed by SVM, achieves the best performance.
Following the same idea, Li et al. (2013) attempt to classify hyperspectral images by treating their pixels as curves instead of high-dimensional discrete vectors. This way, it is possible to take more advantage of the high spectral resolution characteristic of hyperspectral images. As in Song et al. (2008), the curves were first smoothed using cubic B-splines, with the difference that, instead of choosing the number of basis functions, they use a smoothing parameter for penalizing the second derivative; functional principal components were computed in the same fashion, and the number of FPCs was chosen by five-fold cross-validation. We use the same procedures in our work. The method was compared to other methods, including standard SVM and SVM applied to regular principal components, on three popular hyperspectral data sets. Results show that their approach performs consistently better. Functional principal components as features for SVMs were also employed in Lee (2005).
Furthering the topic of functional principal components, Hall and Hosseini-Nasab (2006) derive some asymptotic bounds for the truncated estimators of eigenfunctions and eigenvalues. Yao, Müller and Wang (2005) propose a nonparametric method to perform FPCA with sparse longitudinal data. Finally, since FPCA is very sensitive to outliers, some tools to identify and remove those based on notions of functional data depth have been discussed in López-Pintado and Romo (2007) and Cuevas, Febrero and Fraiman (2007).
2.2 Classification of Bitcoin Data
Classification of bitcoin data usually focuses on identifying illicit activities. Also, most of the work uses some type of aggregation of wallet addresses, linking them to one common entity. Furthermore, the literature tends to utilize features from network relations, instead of the functional behaviour of account movements.
Meiklejohn et al. (2013) propose a method to cluster bitcoin addresses to user-level; that is, multiple addresses are assigned to each user (also called entity). They use a small number of transactions labeled through their own empirical interactions with various entities, and then identify major institutions and their interactions. They contribute to the better understanding of the bitcoin economy organization by showing how users behave through transactions.
Foley et al. (2019) study the bitcoin market of illicit activities. They find that illegal users of bitcoin tend to transact more, in transactions involving fewer addresses, and they tend to hold smaller amounts of the cryptocurrency. To make this analysis, they first cluster wallet addresses to user-level using the approach from Meiklejohn et al. (2013). Then, two models are developed: the first one uses information on the address network - that is, which addresses communicated with each other - to cluster legal/illegal activities. The second method is detection-controlled estimation, which deals with the non-randomness of the samples, due to the nature of detection problems. The covariates used account for transaction frequency, USD value, and wallet lifespan, but there are also other variables with information on the network, such as how many users are involved in the transactions, and information on external shocks. Although the paper is focused on explaining the bitcoin illegal market, the proposed methods could be extended to predict illegal activities, with the adaptation of some covariates.
Graph neighborhood features were proposed by Jourdan et al. (2018) to classify users with five different labels, taken from Wallet Explorer: exchange, gambling, mining, service and darknet (the categories are arbitrarily constructed). They build a total of 315 features, but obtain high accuracy using only 15. They show that graph features improve the accuracy significantly over features considering only the entities' addresses. Logistic regression and gradient boosting were employed, and results show 92% accuracy with an F1 score of 91% for the GBM, dominating state-of-the-art results from the literature. Darknet is perfectly classified, and other classes, apart from mining, also score above 94% in terms of accuracy and 88% in terms of F-measure. Hu et al. also propose graph features to identify money laundering activities occurring across the Bitcoin network. Their binary model classifies transactions, and achieves up to 92% in accuracy and 95% in F-measure.
3 Functional Data Analysis
All observed data are discrete. What makes discrete data functional, simply put, is the assumption that there is a function x giving rise to the observed data. This underlying function should usually be smooth (otherwise, there is not much to gain by treating the data as functional instead of multivariate). Examples of functional data include height as a function of age, temperature varying throughout the year, or financial data such as traded volume throughout the day.
There are a few key aspects of functional data that highlight the inadequacy of traditional multivariate methods in handling it. Firstly, for finely discretized curves, the well-known “curse of dimensionality” arises; by projecting the curves on a truncated basis, Functional Data Analysis (FDA) handles this problem naturally. Secondly, because of the smooth structure of the curves, it is very likely that the variables are strongly correlated; this becomes an issue for the multivariate linear model, because of ill-conditioned matrices. Lastly, treating the data as functional allows for much more flexibility on sampling points; in principle, there is no need for the curves to be sampled at the same points, or even to have the same number of sampling points. This representation is not direct with vectors, since there is no well-defined rule as to where to place missing data.
In this work, the account movements are assumed to be a function of a wallet’s lifetime, therefore requiring the FDA methodology. This Section provides a brief theoretical background to the topic, focusing on the methods employed in practice. The next subsections are based on chapters 3 to 8 of Ramsay and Silverman (2005); for a more detailed and friendly introduction, see chapters 1 and 2 of the book.
3.1 Smoothing Functional Data
As stated above, all observed data are discrete, even if they arise from a functional process. Usually, the observations also carry some degree of noise. Thus, the first step in FDA is finding a suitable representation for the data, which will depend on the functional space being used. The functional space, in turn, should be chosen according to the problem in question. The most used functional spaces are the metric, the Banach, and the Hilbert spaces. Hilbert spaces, although more restrictive, are often chosen since they have inner products and hence admit functional bases with a notion of orthogonality. The most common way of representing a function is through a linear combination of basis functions:
$$x(t) = \sum_{k=1}^{\infty} c_k \phi_k(t), \tag{1}$$
$$x(t) \approx \sum_{k=1}^{K} c_k \phi_k(t). \tag{2}$$
Equation (2) is the finite-dimensional representation, through a vector of coefficients, of the function in Equation (1), which is possibly infinite-dimensional; that is, the function is projected onto the Euclidean space $\mathbb{R}^K$. In this expansion, both the basis and the parameter $K$ should be chosen according to the characteristics of the data. Common basis choices are, for example, the Fourier basis, for periodic data, and B-splines for polynomial data. The coefficients $c_k$, $k = 1, \dots, K$, are the finite number of parameters at hand; hence, this is considered a parametric representation, and they can be estimated by least squares. Let, for $j = 1, \dots, n$,
• $y_j$ be the discrete observations;
• $t_j$ be the discrete time points where $y_j$ is observed;
• $x_j = x(t_j)$;
• $\tau$ be a closed interval of $\mathbb{R}$ where the argument $t_j$ takes values;
• $y_j = x_j + \epsilon_j$, $\epsilon_j \sim$ iid $N(0, \sigma^2)$.
Then, we want to compute
$$\operatorname*{argmin}_{c} \sum_{j=1}^{n} \Big[ y_j - \sum_{k=1}^{K} c_k \phi_k(t_j) \Big]^2, \tag{3}$$
or, in matrix terms:
$$\operatorname*{argmin}_{c} \; (y - \Phi c)^T (y - \Phi c), \tag{4}$$
where $\Phi$ is an $n \times K$ matrix containing the values $\phi_k(t_j)$, that is, each column is a basis component evaluated at times $t_1, \dots, t_n$. This procedure, which results in a smoother curve, is known as smoothing.
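For reference, the least-squares problem above has a closed-form solution. The short derivation below assumes that $\Phi$ has full column rank, so that $\Phi^T\Phi$ is invertible; setting the gradient of the objective with respect to $c$ to zero gives the usual normal equations:

$$
\begin{aligned}
\nabla_c \, (y - \Phi c)^T (y - \Phi c) &= -2\Phi^T y + 2\Phi^T\Phi c = 0 \\
\Longrightarrow \quad \hat{c} &= (\Phi^T\Phi)^{-1}\Phi^T y, \qquad \hat{y} = \Phi\hat{c}.
\end{aligned}
$$

The smoothness of the fitted curve is controlled here only through the number of basis functions $K$: a small $K$ yields a smoother curve, while $K = n$ can reproduce the data exactly.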
3.1.1 Smoothing Functional Data with a Roughness Penalty
In addition, we may want a roughness penalty to make the fitting even smoother. This roughness penalty is naturally defined by the norm of the function’s m-th derivative:
$$\mathrm{PEN}_m(x) = \int_\tau [D^m x(s)]^2 \, ds, \tag{5}$$
where $D^m$ is the $m$-th derivative operator.
The second derivative, which is a measure for curvature, is usually chosen. Although this measure can be generalized to any linear differential operator, we will stick to the above definition for the purpose of this work.
The minimization problem then becomes the penalized residual sum of squares (PENSSE):
$$\mathrm{PENSSE}_{m,\lambda}(x \mid y) = (y - x(t))^T (y - x(t)) + \lambda \times \mathrm{PEN}_m(x), \tag{6}$$
where $\lambda$ is called the smoothing parameter. In this minimization problem, there is an explicit trade-off between smoothness and data fit. It is interesting to notice what happens in the two limits when we penalize the second derivative: as $\lambda \to 0$, the curve becomes more and more variable, and $x$ approaches an interpolant to the data, with $x(t_j) = y_j$ for all $j$. As $\lambda \to \infty$, the criterion puts more emphasis on the smoothness of $x$; to minimize it, $\mathrm{PEN}_2(x)$ must go to zero, which means that the function's second derivative must be zero, therefore implying that $x$ is affine. In this case, PENSSE reduces to $\|y - x(t)\|^2$ with $x$ affine, and we are back to linear regression.
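By representing $x(t)$ as a basis expansion, we get:
$$\underset{c}{\operatorname{argmin}} \; (y - \Phi c)^T (y - \Phi c) + \lambda \times \mathrm{PEN}_m(x). \tag{7}$$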
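The penalization can also be written in terms of a basis expansion:
$$\begin{aligned}
\mathrm{PEN}_m(x) &= \int_\tau [D^m x(s)]^2 \, ds = \int_\tau [D^m (c^T \phi(s))]^2 \, ds \\
&= \int_\tau c^T D^m\phi(s) \, D^m\phi^T(s) \, c \, ds \\
&= c^T \Big[ \int_\tau D^m\phi(s) \, D^m\phi^T(s) \, ds \Big] c \\
&= c^T R c, \qquad R_{K \times K} = \int_\tau D^m\phi(s) \, D^m\phi^T(s) \, ds.
\end{aligned} \tag{8}$$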
At last, substituting (8) into (7), we get an optimization problem that can be solved analytically:
$$\underset{c}{\operatorname{argmin}} \; (y - \Phi c)^T (y - \Phi c) + \lambda \times c^T R c. \tag{9}$$
By taking the derivative with respect to $c$, we obtain the first order condition
$$-2\Phi^T y + 2\Phi^T \Phi c + 2\lambda R c = 0, \tag{10}$$
which gives the expression for the estimated coefficient $\hat{c}$:
$$\hat{c} = (\Phi^T \Phi + \lambda R)^{-1} \Phi^T y. \tag{11}$$
The estimated data, $\hat{y}$, is given by
$$\hat{y} = \Phi \hat{c} = \Phi (\Phi^T \Phi + \lambda R)^{-1} \Phi^T y = S y, \tag{12}$$
where $S = \Phi (\Phi^T \Phi + \lambda R)^{-1} \Phi^T$ is the smoothing matrix.
It is interesting to see the role of the roughness penalty $\lambda R$: besides avoiding overfitting the data, it also serves a computational purpose. Matrix $\Phi$ may have highly correlated columns if the sampling points are too close, in which case $\Phi^T \Phi$ becomes ill-conditioned, which can make its inversion unreliable. Furthermore, if we have more basis components than sampling points, then $\Phi^T \Phi$ is certainly not invertible. The roughness penalty provides a solution to those issues by adding a well-behaved matrix. However, if $\lambda R$ has values that are too high, it can instead overwhelm $\Phi^T \Phi$; $R$ itself may not have full rank, and the computation may then result in an error message or inaccurate results.
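The closed-form solution (11) and the smoothing matrix in (12) can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses an orthonormal Fourier basis on $[0, 1]$ (for which $R$ with $m = 2$ is diagonal with entries $(2\pi k)^4$) rather than the B-splines used later in this work, and the data, the helper names and the two values of $\lambda$ are made up for the example.

```python
import numpy as np

def fourier_basis(t, n_pairs):
    """Constant plus n_pairs sine/cosine pairs on [0, 1]; returns (n, 2 * n_pairs + 1)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for k in range(1, n_pairs + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def fourier_penalty(n_pairs):
    """Penalty matrix R of Equation (8) for m = 2 and the basis above (diagonal)."""
    diag = [0.0]
    for k in range(1, n_pairs + 1):
        diag += [(2 * np.pi * k) ** 4] * 2
    return np.diag(diag)

def penalized_fit(Phi, y, R, lam):
    """Return c_hat from Equation (11) and the smoothing matrix S from Equation (12)."""
    A = Phi.T @ Phi + lam * R
    c_hat = np.linalg.solve(A, Phi.T @ y)
    S = Phi @ np.linalg.solve(A, Phi.T)
    return c_hat, S

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100, endpoint=False)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=t.size)

Phi = fourier_basis(t, n_pairs=10)
R = fourier_penalty(n_pairs=10)
c_small, S_small = penalized_fit(Phi, y, R, lam=1e-8)  # almost unpenalized fit
c_large, S_large = penalized_fit(Phi, y, R, lam=1e2)   # heavily penalized fit

rough_small = c_small @ R @ c_small  # PEN_2 of the fitted curve, small lambda
rough_large = c_large @ R @ c_large  # PEN_2 of the fitted curve, large lambda
```

A larger $\lambda$ yields a smaller roughness $c^T R c$ and a smaller trace of $S$. Note that with this periodic basis the limit $\lambda \to \infty$ is a constant function, since constants are the only periodic functions with zero second derivative.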
3.1.2 B-splines
The B-spline basis is one of the most common choices to represent non-periodic functional data. B-splines are piecewise polynomials defined by their order and knots (points in the domain where the polynomial pieces meet). More details regarding the theoretical description and computation of B-splines can be found in Section A.1 of the Appendix. One of the main challenges of fitting curves using the B-spline system is choosing where to place the knots. In general, more knots are placed over regions where the function shows high curvature. However, there is a very interesting theorem, to be applied later, worth mentioning:
Theorem 3.1 (de Boor, 2002). If $x$ is a function with second derivative satisfying $\int [D^2 x(s)]^2 \, ds < \infty$ and the sampling points $t_j$, $j = 1, \dots, n$, are distinct, the curve $x$ that minimizes PENSSE (6) is a cubic spline with knots at the data points $t_j$.
The above theorem deals with the issue of where to place the knots, and adapts naturally to uneven sampling points, being smoother over regions with fewer data. Notice that it also implies that the smoothing parameter $\lambda$ has to be different from 0: because we use every sampling point as a knot and order 4 B-splines, the number of basis components is two more than the number of sampling points (see A.1), and with $\lambda = 0$ we would encounter the earlier discussed difficulty of inverting singular matrices with more basis components than sampling points.
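The count of basis components can be made explicit with a short worked equation. It assumes the standard rule that the number of B-spline basis functions equals the order plus the number of interior knots, and that the first and last sampling points act as boundary knots; both are stated here as assumptions, since Section A.1 is not reproduced in this part of the text.

$$K = \underbrace{4}_{\text{order}} + \underbrace{(n - 2)}_{\text{interior knots}} = n + 2 > n \quad \Longrightarrow \quad \operatorname{rank}(\Phi^T \Phi) \le n < K,$$

so the $K \times K$ matrix $\Phi^T \Phi$ is singular, and (11) is only well defined when $\lambda R$ is added with $\lambda > 0$.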
3.1.3 Functional Principal Component Analysis
Principal Component Analysis is a widely used technique for exploring the modes of variation (that is, the directions that maximize variance) in multivariate data, and also for reducing dimensionality in high-dimensional cases. In its functional version, this tool requires a slightly different approach, although the concept remains the same: transforming a set of observations of possibly correlated variables into a set of values of orthogonal variables, the principal components. This subsection is mostly based on Chapter 8 of Ramsay & Silverman (2005) and on Cartea, Jaimungal & Penalva (2015). The following theorem is a generalization of the Spectral Theorem for functions. First, some definitions:
• $K$ is a continuous, symmetrical, non-negative definite kernel if $K$ is a continuous function, $K : \tau \times \tau \to \mathbb{R}$, satisfying $K(t, s) = K(s, t)$ and
$$\sum_{i=1}^{n} \sum_{j=1}^{n} K(t_i, t_j) c_i c_j \ge 0, \tag{13}$$
for all sequences of points $t_1, \dots, t_n \in \tau$ and $c_1, \dots, c_n \in \mathbb{R}$, with $n \in \mathbb{N}$.
• The linear Hilbert-Schmidt operator associated to $K$ is defined as
$$(Kf)(t) = \int_\tau K(s, t) f(s) \, ds. \tag{14}$$
Theorem 3.2 (Mercer's Theorem). Let $K$ be a continuous, symmetrical and non-negative definite kernel. Then, there is an orthonormal basis $\{\xi_i(t), i = 1, \dots\}$ of $L^2[\tau]$ consisting of the eigenfunctions of $K$, such that the eigenvalues $\rho_i$, $i = 1, \dots$, are non-negative. The eigenfunctions are continuous and $K$ can be written as:
$$K(s, t) = \sum_{j=1}^{\infty} \rho_j \xi_j(s) \xi_j(t), \tag{15}$$
where the convergence is absolute and uniform in $L^2[\tau]$.
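From Mercer's Theorem, it follows that
$$\begin{aligned}
(K\xi_i)(t) &= \int_\tau K(s, t) \xi_i(s) \, ds = \int_\tau \Big( \sum_{j=1}^{\infty} \rho_j \xi_j(s) \xi_j(t) \Big) \xi_i(s) \, ds \\
&= \int_\tau \Big( \sum_{j \ne i} \rho_j \xi_j(s) \xi_i(s) \xi_j(t) + \sum_{j = i} \rho_j \xi_j^2(s) \xi_j(t) \Big) ds \\
&= \sum_{j \ne i} \rho_j \xi_j(t) \underbrace{\int_\tau \xi_j(s) \xi_i(s) \, ds}_{=0} + \sum_{j = i} \rho_j \xi_j(t) \underbrace{\int_\tau \xi_j^2(s) \, ds}_{=1} \\
&= \rho_i \xi_i(t).
\end{aligned} \tag{16}$$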
Equation (16) is the eigenequation of the covariance operator $K$. Also, notice that the eigenfunctions are only determined up to sign: in Equation (15), changing the sign of any $\xi_j$ does not change the value of the covariance function.
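A numerical illustration of Mercer's Theorem and of the eigenequation (16) is sketched below. It assumes, purely as an example, the kernel $K(s, t) = \min(s, t)$ on $[0, 1]$ (the covariance of Brownian motion, whose largest eigenvalue is known to be $4/\pi^2$), and replaces the integral operator by a midpoint quadrature on a grid of `n` points. Neither the kernel nor the grid size comes from this work.

```python
import numpy as np

n = 400
t = (np.arange(n) + 0.5) / n        # midpoints of n equal cells of [0, 1]
K = np.minimum.outer(t, t)          # assumed example kernel K(s, t) = min(s, t)

# With quadrature weight 1/n, the operator in (14) becomes the matrix K / n.
rho, vecs = np.linalg.eigh(K / n)
order = np.argsort(rho)[::-1]       # sort eigenvalues in decreasing order
rho = rho[order]
xi = vecs[:, order] * np.sqrt(n)    # rescale so that (1/n) * sum_t xi_j(t)^2 = 1

# Truncated version of the expansion (15) with the first J terms.
J = 50
K_J = (xi[:, :J] * rho[:J]) @ xi[:, :J].T
truncation_error = np.max(np.abs(K - K_J))
```

The eigenvalues are non-negative up to rounding, the discretized eigenfunctions are orthonormal with respect to the quadrature, and the truncation error of the sum is small even for moderate `J`.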
This next theorem ensures that we can represent functional data with a principal components basis, as long as the function satisfies some hypotheses.
Theorem 3.3 (Karhunen-Loève Theorem). For a square-integrable stochastic process $X(t)$, that is, $\int_\tau |X(t)|^2 \, dt < \infty$, let $\mu(t) = E[X(t)]$ and $K(s, t) = \mathrm{Cov}(X(s) - \mu(s), X(t) - \mu(t))$. This covariance function is a continuous, symmetrical, and non-negative definite kernel, so we can use Mercer's Theorem to obtain the eigenfunction representation:
$$X(t) = \mu(t) + \sum_{j=1}^{\infty} Z_j \xi_j(t), \tag{17}$$
where the convergence is uniform and in $L^2$, and
$$Z_j = \int_\tau (X(s) - \mu(s)) \xi_j(s) \, ds, \tag{18}$$
with
$$E[Z_j] = 0, \quad E[Z_j Z_i] = 0 \ (i \ne j) \quad \text{and} \quad \mathrm{Var}[Z_j] = \rho_j. \tag{19}$$
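Proof. The representations of $X(t)$ in Equations (17) and (18) follow directly from the fact that $\{\xi_i(t), i = 1, \dots\}$ is an orthonormal basis:
$$X(t) - \mu(t) = \sum_{k=1}^{\infty} c_k \xi_k(t), \qquad \langle X - \mu, \xi_k \rangle = \sum_{j=1}^{\infty} c_j \langle \xi_j, \xi_k \rangle = c_k.$$
Furthermore,
$$E[Z_j] = E\Big[ \int_\tau (X(s) - \mu(s)) \xi_j(s) \, ds \Big] = \int_\tau \underbrace{E[X(s) - \mu(s)]}_{=0} \xi_j(s) \, ds = 0,$$
$$\begin{aligned}
E[Z_j Z_i] &= E\Big[ \int_\tau \int_\tau (X(s) - \mu(s))(X(t) - \mu(t)) \, \xi_j(s) \xi_i(t) \, ds \, dt \Big] \\
&= \int_\tau \int_\tau E[(X(s) - \mu(s))(X(t) - \mu(t))] \, \xi_j(s) \xi_i(t) \, ds \, dt \\
&= \int_\tau \int_\tau K(s, t) \, \xi_j(s) \xi_i(t) \, ds \, dt \\
&= \int_\tau \xi_i(t) \int_\tau K(s, t) \, \xi_j(s) \, ds \, dt \\
&= \int_\tau \xi_i(t) (K\xi_j)(t) \, dt \quad \text{(by Equation 14)} \\
&= \int_\tau \xi_i(t) \rho_j \xi_j(t) \, dt \quad \text{(by Equation 16)} \\
&= \rho_j \int_\tau \xi_i(t) \xi_j(t) \, dt = \rho_j \delta_{ij},
\end{aligned}$$
where $\delta_{ij}$ is the Kronecker delta.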
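Finally, to prove convergence, let $X_K(t) = \mu(t) + \sum_{j=1}^{K} Z_j \xi_j(t)$:
$$\begin{aligned}
E|X(t) - X_K(t)|^2 &= E\Big[ \Big( \sum_{j=1}^{\infty} Z_j \xi_j(t) - \sum_{j=1}^{K} Z_j \xi_j(t) \Big)^2 \Big] = E\Big[ \Big( \sum_{j>K} Z_j \xi_j(t) \Big)^2 \Big] \\
&= E\Big[ 2 \sum_{\substack{i<j \\ i,j>K}} Z_i Z_j \xi_j(t) \xi_i(t) \Big] + E\Big[ \sum_{j>K} Z_j^2 \xi_j^2(t) \Big] \\
&= 2 \sum_{\substack{i<j \\ i,j>K}} \underbrace{E[Z_i Z_j]}_{=0} \xi_i(t) \xi_j(t) + \sum_{j>K} \underbrace{E[Z_j^2]}_{=\rho_j} \xi_j^2(t) \\
&= \sum_{j>K} \rho_j \xi_j^2(t) = \sum_{j=1}^{\infty} \rho_j \xi_j^2(t) - \sum_{j=1}^{K} \rho_j \xi_j^2(t) \\
&= K(t, t) - \sum_{j=1}^{K} \rho_j \xi_j^2(t),
\end{aligned}$$
which goes to zero when $K \to \infty$, by Mercer's Theorem (with $s = t$). Here $K$ denotes the truncation level when used as an index and the kernel when written as $K(t, t)$.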
From the above theorems, it is possible to conclude that the eigenfunctions $\xi_i(t)$ maximize the variance of the centered process $X(t) - \mu(t)$ projected on them, given that they are uncorrelated. Let $f$ be any $L^2[\tau]$ function with $\|f\| = 1$. Since $\{\xi_i\}$ is a basis, $f$ can be expressed as:
$$f(t) = \sum_{i=1}^{\infty} a_i \xi_i(t), \quad a_i \text{ the coefficients}. \tag{20}$$
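The projection of $X(t) - \mu(t)$ on $f$ is given by the inner product:
$$\int_\tau (X(t) - \mu(t)) f(t) \, dt = \int_\tau (X(t) - \mu(t)) \sum_{i=1}^{\infty} a_i \xi_i(t) \, dt, \tag{21}$$
and together with the Karhunen-Loève expansion in (17),
$$\begin{aligned}
\int_\tau (X(t) - \mu(t)) f(t) \, dt &= \int_\tau \sum_{j=1}^{\infty} Z_j \xi_j(t) \sum_{i=1}^{\infty} a_i \xi_i(t) \, dt \\
&= \int_\tau \Big( \sum_{i \ne j} Z_j a_i \xi_j(t) \xi_i(t) + \sum_{j=1}^{\infty} a_j Z_j \xi_j^2(t) \Big) dt \\
&= \sum_{i \ne j} Z_j a_i \underbrace{\int_\tau \xi_j(t) \xi_i(t) \, dt}_{=0} + \sum_{j=1}^{\infty} a_j Z_j \underbrace{\int_\tau \xi_j^2(t) \, dt}_{=1} \\
&= \sum_{j=1}^{\infty} a_j Z_j.
\end{aligned} \tag{22}$$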
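Using the fact that the $Z_j$ are uncorrelated and that $\mathrm{Var}[Z_j] = \rho_j$,
$$\mathrm{Var}\Big[ \int_\tau (X(t) - \mu(t)) f(t) \, dt \Big] = \mathrm{Var}\Big[ \sum_{j=1}^{\infty} a_j Z_j \Big] = \sum_{j=1}^{\infty} a_j^2 \mathrm{Var}[Z_j] = \sum_{j=1}^{\infty} a_j^2 \rho_j. \tag{23}$$
Therefore, maximizing variance translates to
$$\underset{a_j, \, j = 1, \dots}{\operatorname{argmax}} \sum_{j=1}^{\infty} a_j^2 \rho_j, \quad \text{subject to } \|f\| = 1 \implies \sum_{j=1}^{\infty} |a_j|^2 = 1. \tag{24}$$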
Without loss of generality, we can assume that the eigenvalues $\rho_j$ are in decreasing order. The solution to (24) is then to give maximum weight to the largest eigenvalue, that is, $a_1 = 1$, $a_{j \ne 1} = 0$. We conclude that the function $f$ depicting the largest mode of variation in the data is indeed $\xi_1$. To see how this works for the $L$-th largest mode of variation, we require $f$ to be orthogonal to the previous eigenfunctions $\xi_j$, $j < L$:
$$\langle \xi_j, f \rangle = 0 \implies \int_\tau \xi_j(t) \sum_{i=1}^{\infty} a_i \xi_i(t) \, dt = \sum_{i=1}^{\infty} a_i \int_\tau \xi_j(t) \xi_i(t) \, dt = \sum_{i=1}^{\infty} a_i \delta_{ij} = 0 \implies a_j = 0. \tag{25}$$
Since this is true for all $j < L$, using the above result on Equation (22) yields:
$$\int_\tau (X(t) - \mu(t)) f(t) \, dt = \sum_{j=L}^{\infty} a_j Z_j \implies \mathrm{Var}\Big[ \int_\tau (X(t) - \mu(t)) f(t) \, dt \Big] = \sum_{j=L}^{\infty} a_j^2 \rho_j. \tag{26}$$
Analogously to (24), the solution to the maximization problem is $a_L = 1$, $a_{j \ne L} = 0$, that is, the function representing the $L$-th largest mode of variation uncorrelated with the previous ones is $\xi_L$.
3.1.4 Estimating the functional principal components
Although one could, if the functions are measured over a fine grid, use the discretization approach to find the FPCs, it would require the computation of the sample variance-covariance matrix. We can avoid that by using the primary tool given by FDA: smoothing the data with pre-defined basis functions.
First of all, we must center the functions. Supposing there are $N$ different functions, each function $x_i$ has a truncated basis expansion, with $\phi$ being any pre-available basis, such as B-splines or Fourier, for example:
$$x_i(t) = \sum_{k=1}^{K} a_{ik} \phi_k(t).$$
The mean function can then be estimated simply by:
$$\hat{\mu}(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} a_{ik} \phi_k(t). \tag{27}$$
And now, for the sake of keeping the notation simple, let us consider $x_i(t)$ as actually the centered function $x_i(t) - \hat{\mu}(t)$. Each centered function $x_i$ itself has a truncated basis expansion:
$$x_i(t) = \sum_{k=1}^{K} c_{ik} \phi_k(t). \tag{28}$$
We can then evaluate the function on sampling points in order to estimate $x_i$ using matrix notation:
$$x = C\phi, \quad C = (c_{ik}) \text{ is an } N \times K \text{ matrix}. \tag{29}$$
Notice that this representation of the coefficients in matrix C assumes that the curves were sampled at the same points.
The sample variance-covariance function is then represented by:
$$\hat{K}(s, t) = \frac{\phi(s)^T C^T C \phi(t)}{N - 1}. \tag{30}$$
On a side note, some books divide the variance-covariance function by $N$ instead of $(N - 1)$. It makes virtually no difference for the analysis, but since we are estimating the mean function $\mu(t)$, the unbiased denominator was chosen.
Let us define matrix $W$ as the $K \times K$ symmetric matrix containing the inner products of the basis components $\phi$:
$$w_{ij} = \langle \phi_i, \phi_j \rangle. \tag{31}$$
Depending on the basis choice, $W$ might be readily available, or it might be necessary to integrate numerically.
Now, the eigenfunctions can also be represented as an expansion of the arbitrary basis $\phi$:
$$\xi(t) = \sum_{k=1}^{K} b_k \phi_k(t) = \phi(t)^T b, \quad \text{in matrix notation}. \tag{32}$$
This yields
$$\int_\tau \hat{K}(s, t) \xi(t) \, dt = \frac{1}{N - 1} \int_\tau \phi(s)^T C^T C \phi(t) \phi(t)^T b \, dt = \frac{1}{N - 1} \phi(s)^T C^T C W b. \tag{33}$$
Hence, the eigenequation (16) can be expressed as:
$$\frac{1}{N - 1} \phi(s)^T C^T C W b = \rho \, \phi(s)^T b, \tag{34}$$
and since it must hold for all $s$, this reduces to:
$$\frac{1}{N - 1} C^T C W b = \rho b. \tag{35}$$
However, the orthonormality required from the $\xi_i$ implies that $\|\xi_i\|^2 = 1$ for all $i \ge 1$. The translation of this restriction in terms of the basis expansion representation for $\xi$, $\phi(t)^T b$, is
$$\|\phi(t)^T b\|^2 = 1 \implies \int_\tau b^T \phi(t) \phi(t)^T b \, dt = b^T \Big[ \int_\tau \phi(t) \phi(t)^T \, dt \Big] b = b^T W b = 1. \tag{36}$$
Orthonormality also imposes that, for $i \ne j$, $\langle \xi_i, \xi_j \rangle = 0$. Again, in terms of basis expansions:
$$\langle \phi(t)^T b_i, \phi(t)^T b_j \rangle = 0 \implies \int_\tau b_i^T \phi(t) \phi(t)^T b_j \, dt = b_i^T \Big[ \int_\tau \phi(t) \phi(t)^T \, dt \Big] b_j = b_i^T W b_j = 0. \tag{37}$$
The restrictions shown in Equations (36) and (37) are not trivially satisfied, since solving the eigenproblem in Equation (35) gives orthonormality for $b$ instead. However, the restrictions can be accounted for by defining $u = W^{1/2} b$ and multiplying (35) by $W^{1/2}$ on the left side in order to build the equivalent eigenproblem:
$$\frac{1}{N - 1} W^{1/2} C^T C W^{1/2} u = \rho u. \tag{38}$$
Equation (38) can be easily solved for $u$ using any linear algebra package (such as function prcomp in R), which returns eigenvectors $u$ such that $u^T u = 1$ and $\langle u_i, u_j \rangle = 0$, and because $W$ is symmetrical, by the definition of $u$, it is easy to see that the conditions on $b$ are satisfied:
$$u^T u = 1 \implies b^T W^{1/2} W^{1/2} b = b^T W b = 1,$$
$$\langle u_i, u_j \rangle = 0 \implies u_i^T u_j = 0 \implies b_i^T W^{1/2} W^{1/2} b_j = b_i^T W b_j = 0.$$
Furthermore, coefficients of matrix C can be estimated by (11).
One interesting case that deserves mentioning is when W is the identity matrix (which occurs, for example, when the Fourier basis is chosen as the pre-defined basis). When this happens, the eigenanalysis reduces to performing standard multivariate PCA on the coefficient matrix C, and normalizing by N − 1.
At last, it should be highlighted that performing FPCA in its classical version requires curves sampled at the same points. Although there are a few techniques to align the functions when there might be discrepancies, they are not in the scope of this work.
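The estimation procedure of this subsection can be sketched in a few lines of NumPy. The example below is a hedged illustration, not the code used in this work: it assumes a monomial basis $1, t, t^2, t^3$ on $[0, 1]$ (so that $W$ has entries $1/(i + j + 1)$ with zero-based indices), simulated coefficients, and a helper `sym_sqrt` introduced only for the example. It solves the symmetric eigenproblem (38) and maps the solution back to $b = W^{-1/2} u$.

```python
import numpy as np

def sym_sqrt(W):
    """Symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(W)
    return (vecs * np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(2)
N, K = 200, 4

# Inner products of the assumed monomial basis: w_ij = integral of t^(i+j) on [0, 1].
W = np.array([[1.0 / (i + j + 1) for j in range(K)] for i in range(K)])

# Simulated basis coefficients; centering the coefficients centers the curves.
A = rng.normal(size=(N, K)) * np.array([3.0, 2.0, 1.0, 0.5])
C = A - A.mean(axis=0)

# Symmetric eigenproblem of Equation (38).
W_half = sym_sqrt(W)
M = W_half @ C.T @ C @ W_half / (N - 1)
rho, U = np.linalg.eigh(M)
order = np.argsort(rho)[::-1]
rho, U = rho[order], U[:, order]

B = np.linalg.solve(W_half, U)  # columns are b_j = W^{-1/2} u_j
scores = C @ W @ B              # inner products <x_i, xi_j>, the FPC scores
```

The columns of `B` satisfy the constraints (36) and (37), solve the eigenequation (35), and the sample variance of each column of `scores` equals the corresponding eigenvalue. When `W` is the identity, the same code reduces to standard PCA on `C`.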
3.1.5 Poisson Process
The homogeneous Poisson process $\{N_t\}_{t \ge 0}$ is a stochastic process counting how many events occur up to time $t$, under the following hypotheses:
• $N_0 = 0$.
• Increments $N_{t_2} - N_{t_1}, \dots, N_{t_n} - N_{t_{n-1}}$ are independent, for all $t_1 < t_2 < \dots < t_n$.
• $N_t \sim \mathrm{Poisson}(\mu t)$, that is, $P(N_t = k) = \frac{(\mu t)^k}{k!} e^{-\mu t}$ for all $k \in \{0, 1, \dots\}$, where $\mu$ is the Poisson rate, an intensity parameter.
Defining the arrival times as
$$T_n = \inf\{t : N_t = n\}, \quad T_0 = 0, \tag{39}$$
and the interarrival times as
$$X_n = T_n - T_{n-1}, \tag{40}$$
then $(X_n)_{n \in \mathbb{N}}$ are independent and identically distributed as $\exp(\mu)$:
$$p(x_n \mid \mu) = \mu \exp[-\mu x_n]. \tag{41}$$
The inhomogeneous Poisson process relaxes the hypothesis of rate $\mu$ being constant over time, that is, $\mu(t)$ is instead a function of time:
$$p(x_n \mid \mu) = \mu(t_n) \exp\Big[ -\int_{t_{n-1}}^{t_n} \mu(s) \, ds \Big]. \tag{42}$$
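Considering the sample of arrival times up to the $N$-th arrival, $t_1, \dots, t_N$, the likelihood function and log-likelihood function are given by
$$\begin{aligned}
L(t_1, \dots, t_N \mid \mu) &= \prod_{n=1}^{N} \mu(t_n) \exp\Big[ -\int_{t_{n-1}}^{t_n} \mu(s) \, ds \Big] \\
\implies \log[L(t_1, \dots, t_N \mid \mu)] &= \sum_{n=1}^{N} \log \mu(t_n) - \sum_{n=1}^{N} \int_{t_{n-1}}^{t_n} \mu(s) \, ds \\
&= \sum_{n=1}^{N} \log \mu(t_n) - \int_{t_0}^{t_N} \mu(s) \, ds.
\end{aligned} \tag{43}$$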
Following the methods presented in the previous section, we can write µ(t) as a basis expansion. However, this function has a new constraint: µ(t) ≥ 0. To account for that, we take the exponential:
$$\mu(t) = \exp[c^T \phi(t)]. \tag{44}$$
We may also penalize this fit with the penalty defined in Equation (5). Usually, since the simplified hypothesis is that $\mu$ is constant, the first derivative is penalized. By writing the penalty in matrix form, as in Equation (8), the following problem must be numerically solved:
$$\underset{c}{\operatorname{argmax}} \; \log[L(t_1, \dots, t_N \mid \mu)] - \lambda \, \mathrm{PEN}_m(\log[\mu]) = \underset{c}{\operatorname{argmax}} \sum_{n=1}^{N} c^T \phi(t_n) - \int_{t_0}^{t_N} \exp[c^T \phi(s)] \, ds - \lambda \, c^T R c. \tag{45}$$
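To make the objective in (45) concrete, the sketch below evaluates it for the simplest non-trivial case, assumed only for illustration: a basis $\phi(t) = (1, t)$, so that $\log \mu(t) = c_0 + c_1 t$, with the first derivative penalized ($m = 1$) and $t_0 = 0$. In that case the integral has a closed form and $c^T R c = c_1^2 \, t_N$. The arrival times and the function name `penalized_loglik` are hypothetical.

```python
import math

def penalized_loglik(c0, c1, arrivals, lam):
    """Objective of Equation (45) for log mu(t) = c0 + c1 * t on [0, t_N]."""
    t_end = arrivals[-1]
    # Sum of log mu(t_n) over the observed arrival times.
    log_term = sum(c0 + c1 * t for t in arrivals)
    # Closed-form integral of exp(c0 + c1 * s) from 0 to t_N.
    if abs(c1) < 1e-12:
        integral = math.exp(c0) * t_end
    else:
        integral = math.exp(c0) * math.expm1(c1 * t_end) / c1
    # Roughness penalty: the first derivative of log mu is the constant c1.
    penalty = lam * c1 ** 2 * t_end
    return log_term - integral - penalty

arrivals = [0.4, 1.1, 1.3, 2.9, 3.2, 4.8, 5.0]  # hypothetical arrival times

# With c1 = 0 the model is homogeneous and the maximizer is log(N / t_N).
c0_const = math.log(len(arrivals) / arrivals[-1])
best = penalized_loglik(c0_const, 0.0, arrivals, lam=1.0)
```

With $c_1 = 0$ the penalty vanishes and the objective is maximized at $c_0 = \log(N / t_N)$, which recovers the usual estimate of a constant rate. For a general basis the integral has no closed form, which is why the problem must be solved numerically.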
4 Data
The data used in this project include both the wallets' classes and their balances over time. This allows for the development of a supervised classification model in which the balance values are used to predict the group of a certain wallet, that is, a model which predicts the main activity of a given wallet based on its account movements. Thus, the data were collected in essentially two stages: classifying the wallets by their major purpose and fetching their balances over time. The observations were later treated in order to build smoother functions.
4.1 Labelling Wallets
The stage of labelling wallets was done in Tomé (2017). The author scraped the following websites to obtain the user identity: blockchain.info and walletexplorer.com. The user here, though, is not an actual individual, but a company/website/entity; therefore, scraping provided information linking wallet addresses with entities (for example, address 1PkJRQaKStcmCehCJUjxYiQejFDm3w4yV belongs to 999Dice.com). She then observed some of the entities' activities and gave them one of these labels: exchange/investments, darknet, games/gambling, mixer services, payment systems, and mining pool/cloud mining. This work focuses on linking addresses to these categories rather than to entities. The wallets gathered were active at some time between April 2011 and April 2017.
Hence, the original data-set had the descriptions shown in Table 1.
| Category | Addresses | Entities |
|---|---|---|
| Darknet Marketplace | 106,284 | 8 |
| Exchange | 282,685 | 64 |
| Gambling | 46,094 | 20 |
| Mining Pool/Cloud Mining | 56,899 | 3 |
| Mixer Service | 101,914 | 1 |
| Payment System | 275,763 | 10 |
| Total | 869,639 | 106 |

Table 1: Original Number of Addresses and Entities, by Category
4.2 Wallets Balances
The wallets' balances through time were also collected at blockchain.info, using their API services. Since the easiest way to scrape this information was through their charts, and due to time constraints, a decision was made to keep only the wallets with less than 500 account movements, as that was the maximum number of observations the charts provided. Apart from Table 1, the remaining information presented accounts for that restriction. The full scraping would be much more time consuming, and not worth attempting for our purposes.
4.3 Wallets Treatment
As stated in the introduction, in this work, we use classifiers from the multivariate framework with functional features. The functional features are finite representations of the functions, that is, the coefficients of smoothing the functions onto a basis. We choose the eigenbasis representation derived from the Karhunen-Loève expansion, since typically few components are needed to explain the majority of variation.
In this Section, all the treatment necessary to pose a well-defined classification problem and build the functional features is described. We choose a time frame to observe the wallets, creating a clear rule of how much time it is necessary to wait after the first account movement before classifying them. We also limit the sample to wallets with a minimum number of observations, in order to properly capture curve behaviour. Measures of level, variation and frequency are created using the methods presented in Section 3. Then, these measures are treated as curves and their functional principal components are computed.
4.3.1 A Sampling Issue
The classification of bitcoin wallets by their account movements depends on the underlying hypothesis that a wallet’s behaviour is invariant over time. That is, for wallets of the same group, if one is observed in a year and another on the next, they should still present the same characteristics of movements and amounts. We argue that this amount should be measured in dollars. Since the bitcoin price is very volatile, if the account movements were measured in bitcoins, they might impose a change in level throughout time. Again, for two wallets observed through different time frames, if the bitcoin price is higher in one period, they will probably be receiving smaller amounts of bitcoins for providing the same service.
This difference in level could add a timestamp bias: the model may classify better the categories whose wallets are observed on similar dates. If the samples of a certain category are concentrated around a period of time, say, they all begin in March 2015, the bitcoin price volatility would affect them equally. If instead the samples of another category are wallets whose start is somewhere in the range from 2011 to 2017, the amount of bitcoins will most likely vary greatly between them, even if they present the same behaviour. Greater variability due to differences in the start date of the samples of a category could make defining a pattern harder, thus harming the classification of that category.
Indeed, the wallets are not observed uniformly over time: Table 2 shows that the darknet and mining samples are much more concentrated than their counterparts. This is due to the nature of the sampling: for darknet wallets to be identified and labeled for training, a police operation was most likely necessary, as was the case with the Silk Road website, shut down by the FBI. This raises another matter, which was not addressed in this work: services that put a lot of effort into anonymity, such as the darknet, will most likely have few entities discovered. This is in fact the case: Table 1 shows the imbalance.
However, even with the values in dollars, there could still be an implicit timestamp in volatility: wallets in a time of high (or low) volatility could be grouped together. Furthermore, volatility introduces noise between one movement and another. To avoid these phenomena, instead of transforming each amount by the price of the day, the mean of the bitcoin price is taken over the period in which the wallets are observed and then used to level the units. Therefore, the wallets' balances are represented in dollars instead of bitcoins. More specifically, the transformed balances are the balances in bitcoins multiplied by a fixed amount, which is the mean of the bitcoin price over a pre-defined period of time.
| Category | Min. | 1st quartile | Median | 3rd quartile | Max. |
|---|---|---|---|---|---|
| Exchange | 2011-04-11 | 2014-11-26 | 2015-11-11 | 2016-06-20 | 2017-04-22 |
| Mixer Service | 2011-11-10 | 2014-12-29 | 2015-05-11 | 2015-11-08 | 2017-03-03 |
| Payment System | 2012-09-04 | 2015-04-27 | 2015-12-17 | 2016-04-13 | 2017-03-11 |
| Mining Pool/Cloud Mining | 2013-11-16 | 2015-11-06 | 2015-12-10 | 2016-01-06 | 2017-03-06 |
| Gambling | 2013-11-21 | 2014-07-18 | 2014-11-21 | 2015-05-13 | 2017-03-31 |
| Darknet Marketplace | 2013-12-16 | 2015-11-10 | 2015-12-13 | 2016-02-02 | 2016-07-31 |

Table 2: Distribution of Starting Dates by Category
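The fixed-price conversion described in this subsection amounts to multiplying every bitcoin balance by a single constant. A minimal sketch follows; the function name, the example prices and the choice of which days enter the mean are assumptions made for illustration, since the exact averaging period is only described as pre-defined.

```python
def to_fixed_dollars(balances_btc, prices_usd):
    """Convert bitcoin balances to dollars using one mean price for all of them."""
    mean_price = sum(prices_usd) / len(prices_usd)  # mean price over the chosen period
    return [b * mean_price for b in balances_btc]

# Hypothetical daily prices and balances of one wallet.
prices_usd = [200.0, 400.0]
balances_usd = to_fixed_dollars([1.0, 0.5, 0.5], prices_usd)
```

Because the multiplier is constant, a balance that stays fixed in bitcoins also stays fixed in dollars, which would not be the case if each observation were converted at the price of its own day.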
4.3.2 Estimating curves
The original data of the wallets’ balances were very stiff; a lot of credits were immediately followed by debts of the same amount, which accounted for a very erratic behaviour. Here, we show some examples for exchange wallets in Figure 2; other categories can be viewed in A.3.
Figure 2: Original account balances for Exchange wallets, in dollars
The solution to this problem was to split the balances in order to form two different curves: one consisting of credits and another of debts. They were then approximately integrated by calculating the accumulated sum (Figure 3).
Figure 3: Accumulated sum of credits and debts for Exchange wallets, in dollars
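The split into credit and debt curves can be sketched as follows. The snippet assumes that the input is the sequence of balances after each movement and that the balance before the first movement is zero; both the function name and the example values are hypothetical.

```python
from itertools import accumulate

def split_and_accumulate(balances):
    """Return the accumulated sums of credits and of debts implied by a balance series."""
    credits, debts = [], []
    previous = 0.0  # assumed balance before the first movement
    for balance in balances:
        change = balance - previous
        credits.append(max(change, 0.0))   # positive changes are credits
        debts.append(max(-change, 0.0))    # negative changes are debts
        previous = balance
    return list(accumulate(credits)), list(accumulate(debts))

# A credit of 5 immediately followed by a debt of the same amount, and so on.
cum_credits, cum_debts = split_and_accumulate([5.0, 0.0, 3.0, 1.0])
```

Both resulting curves are non-decreasing step functions, which removes the erratic up-and-down pattern of the raw balances.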
However, even with a more regular behaviour, the curves were still step functions and required smoothing. Furthermore, to perform functional principal component analysis in its classical version, it is necessary that these functions are somewhat regularly measured, as pointed out in Section 3.1.4.
It is important to realize that, regardless of the number of account movements, it is possible to obtain an arbitrarily large number of observations through time for these curves, since it is known that, if no other movement was made, the account value in bitcoins remains constant. Another remark is the fact that one cannot really know if a wallet's life has ended, because there is nothing to prevent its use after a long time of inactivity. With that in mind, it was necessary to choose a time window to consider for prediction. The value was 2000 hours, which was slightly less than the third quartile of the duration of the wallets (considering only the wallets with more than one and less than 500 account movements), 2384.59 hours; here, the duration means the time elapsed between the first and last movements. For wallets with a duration smaller than this time frame, the last integrated values were extended as constants until the end of the interval. This interval was then normalized, so that time ranges over [0, 1].
Figure 4: Histogram: wallets lengths in hours
Another necessity was to limit the sample to wallets with a minimal number of observations, since otherwise there are not enough points to treat the data as functional. This was tested using thresholds of 10 and 20 movements. We would expect better results using the 20 movements threshold: since the wallets are being classified by their temporal structure, an increase in the number of account movements would make the structure more identifiable. The distribution of the number of observations in each wallet, considering only the first 2000 hours, is depicted in Figure 5.
Figure 5: Histogram: wallets number of observations of the first 2000 hours, by category
This selection does discard the majority of samples, as can be seen comparing Tables 3 and 4 to Table 1.
| Category | Addresses |
|---|---|
| Exchange | 154,623 |
| Mixer Service | 5,583 |
| Payment System | 8,245 |
| Gambling | 2,930 |
| Mining Pool/Cloud Mining | 23,424 |
| Darknet Marketplace | 29,738 |
| Total obs. | 224,543 |

Table 3: Number of Addresses by Category - 10 obs.
| Category | Addresses |
|---|---|
| Exchange | 105,035 |
| Mixer Service | 1,862 |
| Payment System | 5,085 |
| Gambling | 1,464 |
| Mining Pool/Cloud Mining | 11,433 |
| Darknet Marketplace | 1,474 |
| Total obs. | 126,353 |

Table 4: Number of Addresses by Category - 20 obs.
The step functions were then evaluated at 501 equally spaced points, varying between 0 and 1 (since the time interval was normalized), in addition to the original time points, also scaled between 0 and 1. These new observations aid in structuring the curves to be smoothed; after all, as previously mentioned, the accumulated credits and debts are known at any arbitrary time of the wallet's life span. Because of the great variability in scale of the accumulated credits and debts across different wallets, the logarithm was also taken. This variability is evident in Figures 6, 8 and 10, where there are a lot of outliers with great magnitude. By taking the log, we can see the structure of the distributions and how they vary when we restrict the sample.
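The evaluation of a step function on the normalized grid can be sketched as below. This is a hedged illustration: it assumes that movement times are sorted and given in hours since the first movement, that the curve is right-continuous, and that movements after the window are discarded. The function name is hypothetical, and the logarithm step is left out because the treatment of zero accumulated values is not specified in the text.

```python
from bisect import bisect_right

def evaluate_step(times_h, cum_values, window_h=2000.0, n_grid=501):
    """Evaluate a right-continuous step curve on [0, 1] at grid and original points."""
    # Keep only movements inside the window and rescale their times to [0, 1].
    kept = [(t / window_h, v) for t, v in zip(times_h, cum_values) if t <= window_h]
    ts = [t for t, _ in kept]
    vs = [v for _, v in kept]
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    points = sorted(set(grid) | set(ts))  # grid plus the original time points
    values = []
    for p in points:
        idx = bisect_right(ts, p)  # number of movements at or before p
        # The last known value is carried forward until the end of the interval.
        values.append(vs[idx - 1] if idx > 0 else 0.0)
    return points, values

points, values = evaluate_step([0.0, 500.0, 1000.0], [10.0, 10.0, 25.0], n_grid=5)
```

For a wallet whose last movement happens before the end of the window, the final accumulated value is simply repeated up to time 1, which matches the constant extension described in this subsection.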
Figure 6: Boxplot: accumulated credits and debts, by category
Figure 7: Boxplot: log of accumulated credits and debts, by category
Figure 8: Boxplot: accumulated credits and debts, by category; considering only wallets with 10 or more observations
Figure 9: Boxplot: log of accumulated credits and debts by category; considering only wallets with 10 or more observations
Figure 10: Boxplot: accumulated credits and debts, by category; considering only wallets with 20 or more observations
Figure 11: Boxplot: log of accumulated credits and debts, by category; considering only wallets with 20 or more observations
At last, smoothing was performed by fitting a cubic smoothing spline with penalization on the second derivative. Following Theorem 3.1, the knots were placed at almost every data point. It was necessary to remove points that were too close together (less than 3 minutes apart), because otherwise the nature of the data (large steps) would make the estimated derivatives, particularly the second derivatives, too large; this, in turn, would make the values of matrix $R$ (equation (8)) too high in magnitude, at the risk of incurring the problem described at the end of Section 3.1.1. This does not, however, imply loss of information, because the frequency is accounted for in the estimated Poisson rate curves, as will be explained in Section 4.4.2. The smoothing parameter was chosen to be 0.001. Although we did experiment with the GCV criterion (see Section A.2 of the appendix) to choose $\lambda$, we found the resulting curves to be under-smoothed, even after attempting to further discount the degrees of freedom, as suggested by C. Gu (2002). We wanted a reduced level of noise in the data, and our explorations suggested $\lambda = 0.001$ to be a good choice. This method is subjective, but, as stated by Green & Silverman (1993), sometimes a subjective approach is the most useful one.
For the sake of simplicity, the curves mentioned from here on refer to the treated curves. Some examples for credit and debt curves can be seen in Figures 12 and 13.
Figure 13: Smoothed log accumulated sum of debts for Exchange wallets
4.4 Additional Curves
The derivatives and rate of arrivals are also modeled in order to consider the variation and frequency of the curves. The data treatment results in six estimated curves for each wallet; for both credits and debts, one curve represents the logarithm of the accumulated increments; another, their first derivatives; and at last, their frequencies.
4.4.1 Derivatives
In order to better account for the variations in the curves, their first derivatives were estimated. For a B-spline, the derivatives are calculated analytically after smoothing, but the details of such computations are not of particular interest to this work. Since smoothness is a necessary hypothesis for the existence of derivatives, we must use the already smoothed curves rather than the step curves as our data. To estimate the curve's derivative of order $m$, Ramsay & Silverman (2006) suggest that the smoothing process be done penalizing the derivative of order $m + 2$ (see equations (5) and (6)), to ensure that the derivative itself will be smooth. We have then used an order 5 B-spline system to re-smooth the curves applying $\mathrm{PEN}_3(x)$ as the penalization criterion (5 is the minimum order of a B-spline with an integrable third derivative). The derivatives were then evaluated at the same 501 equally spaced points. We chose $\lambda = 0.0001$ in the same fashion as the smoothing parameters for the curves themselves. The results are displayed in Figures 14 and 15.
Figure 14: Derivatives of the smoothed log of the accumulated sum of credits for Exchange wallets
Figure 15: Derivatives of the smoothed log of the accumulated sum of debts for Exchange wallets
4.4.2 Poisson Rate Curves
The curves can also be modelled by a Point Process, in which the credits and debts are viewed as arrivals. We simplify the hypothesis so that it becomes a Poisson Process. The assumption made for the functional model, however, is that the rate is not constant over time, and is a function of it instead. We use the method described in Section 3.1.5, with a smoothing parameter of 0.1 and penalization on the first derivative, for each credit and debt curve - recalling that the observations are considered for the first 2000 hours. Results can be seen in Figures 16 and 17. The resulting rate curves are less smooth than the other
types of curves, which is natural given that the first derivative is used as penalization.
5 Classification Models
The prediction of the wallets’ classes is a supervised classification problem, since the labels are known. Four different algorithms were used to compare models: Multinomial Logistic Regression, Gradient Boosting, Support Vector Machine and Random Forest. Details of each algorithm can be found in Friedman and Tibshirani (2001).
There are essentially two types of features to build models in this case: scalar and functional. The scalar features are chosen according to the problem at hand, and have no higher wisdom behind their engineering than the domain knowledge of the person responsible for developing them. They are present in the vector model described below. Functional features, on the other hand, are less arbitrary in the sense that they are representations of the function being analysed. This is a result of the basis regression minimization problem expressed in (7). The domain specific choices, in this case, are the basis system, and the number of basis/smoothing parameter.
5.1 Vector Model
The simplest model attempted is a vector model with scalar features built from credits and debts transactions. There is no functional treatment in this approach. The features were based in Tomé (2017):
• credit count/ debt count • credit sum/ debt sum
• credit minimum/ debt minimum • credit maximum/ debt maximum • credit median/ debt median
• credit first quantile/ debt first quantile • credit third quantile/ debt third quantile
• difference of third and first quantile values for credits/ difference of third and first quantile values for debts
The features regarding the difference of quantiles were not used with the multinomial logit model in order to avoid multicollinearity, since each difference is an exact linear combination of the corresponding third and first quantiles.
Additionally, constant Poisson rates (number of credits divided by interval time / number of debts divided by interval time) were added as a baseline against which to measure the gains of a functional Poisson rate curve.
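As an illustration of the vector model, the scalar features listed above can be computed for one transaction type of one wallet as in the minimal sketch below. The function name, the dictionary keys, the sample amounts and the choice of the "inclusive" quartile method are assumptions made for this example; the 2000-hour interval is taken from the observation window mentioned earlier and may differ from the interval time actually used for each wallet.

```python
import statistics


def scalar_features(amounts, interval_hours):
    """Summary features for one transaction type (credits or debts) of a wallet.

    amounts        : list of transaction values (needs at least two entries)
    interval_hours : length of the observation interval (assumed, in hours)
    """
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    return {
        "count": len(amounts),
        "sum": sum(amounts),
        "min": min(amounts),
        "max": max(amounts),
        "median": statistics.median(amounts),
        "q1": q1,
        "q3": q3,
        # Exact linear combination of q3 and q1 (dropped for the logit model).
        "iqr": q3 - q1,
        # Constant Poisson rate: number of arrivals per unit of time.
        "poisson_rate": len(amounts) / interval_hours,
    }


# Made-up credit amounts, for illustration only.
credits = [0.5, 1.2, 0.1, 3.0, 0.7]
features = scalar_features(credits, interval_hours=2000)
```

The same function would be applied to the debts of the wallet, and the two dictionaries concatenated into a single feature vector.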
5.2 Functional Data Classification
We choose to adapt the multivariate algorithms to the functional context for two reasons: first, we want to be able to compare and combine models using features accounting for different curves and the scalar approach; second, we want to evaluate the performance of random forests for bitcoin data using functional features, in comparison to the approach of Tomé (2017), which uses scalar features.
As previously stated, the functional features are representations of the functions at hand; more specifically, they are the coefficients of the basis expansion of each curve. For this work, we choose the principal component basis, as it usually allows for the representation of the majority of the function’s variance with few basis functions, which helps prevent overfitting. In this case, the coefficients are the functional principal components. We estimate them for the training set as described in Section 3.1.4, for each curve. This requires the data to be written on a previously defined basis: we again use B-splines to re-smooth the data, with penalization on the second derivative and knots at every data point. Different re-smoothing parameters are tested in terms of classification performance; more details are in Section 6.
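The idea that a few principal components capture most of the variance of a set of curves can be illustrated with the sketch below. It is not the basis-expansion estimator of Section 3.1.4: it assumes curves already evaluated on a common grid and applies an ordinary PCA through the singular value decomposition. The function name and the synthetic curves are assumptions made for this example.

```python
import numpy as np


def fpca_scores(curves, n_components):
    """Discretized PCA of curves sampled on a common grid.

    curves : (N, n) array with one curve per row
    Returns the mean curve, the leading components (one per row), the
    scores of each curve, and the fraction of variance each component explains.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    # Rows of vt are orthonormal directions on the grid,
    # i.e. discretized eigenfunctions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    components = vt[:n_components]
    scores = centered @ components.T
    return mean, components, scores, explained[:n_components]


rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
# Synthetic curves: two smooth modes of variation plus small noise.
a = rng.normal(size=(30, 1))
b = rng.normal(scale=0.3, size=(30, 1))
noise = rng.normal(scale=0.01, size=(30, 200))
curves = a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t) + noise

mean, components, scores, explained = fpca_scores(curves, n_components=2)
```

In this toy setting the two retained components account for almost all of the variance, so each curve is summarized by two scores instead of 200 grid values. Scores for a new curve would be obtained with the training mean and components, as `(new_curve - mean) @ components.T`.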
The validation data is also re-smoothed in the same fashion. However, to obtain the first principal components, we need to find the coefficients of each curve corresponding to the expansion on the eigenbasis estimated on the training set. Let Ξ be the n × K matrix containing the values ξ(t_j), that is, each column is an eigenfunction evaluated at the times t_j. By equation (32), we have
Ξ = ΦB, (46)
where B is the K × K matrix whose columns are the coefficients of each eigenfunction. If Z is the vector of principal components, we can estimate Z as in equation (11):
\[ \hat{Z} = (\Xi^{T}\Xi + \lambda R^{*})^{-1}\, \Xi^{T} y_{cen}, \tag{47} \]
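with $y_{cen}$ being the centered curve $y - \hat{\mu}$, where $\hat{\mu}$ is estimated as in equation (27) for every t_j using the curves in the training set, and $R^{*}$ being the penalization matrix of Ξ:
\[ R^{*} = \int_{\tau} D^{m}(\xi(s))\, D^{m}(\xi^{T}(s))\, ds = \int_{\tau} D^{m}(B^{T}\phi(s))\, D^{m}(\phi^{T}(s)B)\, ds = B^{T} \left[ \int_{\tau} D^{m}(\phi(s))\, D^{m}(\phi^{T}(s))\, ds \right] B = B^{T} R B. \tag{48} \]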
Therefore, the inputs for the algorithms are a combination of scalar features and the functional principal components of the curves, their derivatives, and their Poisson rates.
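The projection of a validation curve on the training eigenbasis can be sketched as follows. The sketch assumes that the estimate takes the usual penalized least-squares form, with the eigenfunction matrix Ξ = ΦB of equation (46) and the penalty matrix obtained from R and B as in equation (48). The function name, the argument names, the value of the smoothing parameter and the synthetic inputs are assumptions made for this example.

```python
import numpy as np


def project_on_eigenbasis(y, mu_hat, Phi, B, R, lam):
    """Penalized least-squares scores of one curve on the training eigenbasis.

    y      : (n,) curve values at the sampling times t_j
    mu_hat : (n,) training-set mean curve at the same times
    Phi    : (n, K) original basis functions evaluated at t_j
    B      : (K, K) columns hold the basis coefficients of each eigenfunction
    R      : (K, K) roughness penalty matrix of the original basis
    lam    : smoothing parameter (assumed to be chosen by the user)
    """
    Xi = Phi @ B              # eigenfunctions on the grid, equation (46)
    R_star = B.T @ R @ B      # penalty matrix in the eigenbasis, equation (48)
    y_cen = y - mu_hat        # centre with the training mean
    lhs = Xi.T @ Xi + lam * R_star
    # Solve the normal equations instead of forming an explicit inverse.
    return np.linalg.solve(lhs, Xi.T @ y_cen)


# Synthetic check: a curve built from known scores is recovered when lam = 0.
rng = np.random.default_rng(0)
n, K = 50, 4
Phi = rng.normal(size=(n, K))
B, _ = np.linalg.qr(rng.normal(size=(K, K)))   # arbitrary orthogonal matrix
R = np.eye(K)                                  # placeholder penalty matrix
mu_hat = rng.normal(size=n)
z_true = np.array([1.0, -2.0, 0.5, 0.0])
y = mu_hat + Phi @ B @ z_true

z_hat = project_on_eigenbasis(y, mu_hat, Phi, B, R, lam=0.0)
z_shrunk = project_on_eigenbasis(y, mu_hat, Phi, B, R, lam=10.0)
```

With `lam=0.0` the scores of the synthetic curve are recovered exactly (up to rounding), while a positive `lam` shrinks them towards smoother reconstructions. The resulting vector plays the role of the functional principal components passed to the classifiers.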