
UNIVERSIDADE DE LISBOA
FACULDADE DE CIÊNCIAS
DEPARTAMENTO DE INFORMÁTICA

SECURE NETWORK MONITORING USING

PROGRAMMABLE DATA PLANES

Fábio Miguel Canto Pereira

Mestrado em Engenharia Informática

Especialização em Arquitetura, Sistemas e Redes de Computadores

Dissertation supervised by:

Prof. Doutor Nuno Fuentecilla Maia Ferreira Neves

and co-supervised by: Prof. Doutor Fernando Manuel Valente Ramos


Acknowledgments

Throughout this year there were many people who helped me overcome the obstacles I faced. First of all, I would like to thank my family for giving me the opportunity to get where I am and for supporting me while I wrote this thesis and over all my life, in general. I would also like to thank my friends and colleagues with whom I shared great moments over the year. They made my work days more pleasant and our break times refreshing. They were always there to motivate me and to discuss ideas when I got stuck.

I am also grateful to Faculdade de Ciências, especially its Informatics Department and the LaSIGE research group, for providing me with all the conditions to perform my work.

A special acknowledgement to my advisors, Professor Nuno Fuentecilla Maia Ferreira Neves and Professor Fernando Manuel Valente Ramos, for giving me the opportunity to join this project, and for their guidance and availability over the year.

Funding

This work was partially supported by the European Commission through project FP7 SEGRID (607109) and project H2020 SUPERCLOUD (643964), and by national funds of Fundação para a Ciência e a Tecnologia (FCT) through project UID/CEC/00408/2013 (LaSIGE).


Resumo

Monitoring is a fundamental tool in the management of computer networks, as it offers a view of their behavior over time. Different monitoring techniques have been applied in practice, among which two stand out: those based on samples and those based on sketches. While sample-based techniques process only a subset of the total traffic (a sample), sketch-based techniques process all the traffic, seeking greater precision in their results. To be able to process all the traffic and still remain scalable, sketch-based algorithms compress the monitored information into data structures that behave similarly to hash tables. Despite the inevitable loss of information resulting from the collisions that typically occur when these data structures are used, sketch-based algorithms still produce fairly precise results, since all the traffic contributes to the computation of the monitored statistics.

The information provided by monitoring algorithms is essential for the correct operation of the network. However, if the monitoring algorithm can be corrupted, its results can no longer be trusted, rendering the monitoring useless. In the worst case, the system administrator does not detect that the monitoring algorithm has been compromised and ends up making inadequate decisions based on incorrect information. This problem demonstrates the usefulness of secure monitoring algorithms. However, we are not aware of any proposal addressing the security of monitoring algorithms. In fact, monitoring algorithms in general ignore security concerns in order to minimize their execution times and memory usage, which is justified by the high speeds at which packets have to be processed and transmitted in today's networks.

The objective of this thesis is the design, implementation and evaluation of a secure and scalable monitoring algorithm. The basis of our solution is Count-Min, a sketch-based algorithm that estimates the frequency of the items observed in a given data stream. In general terms, Count-Min uses a two-dimensional matrix, whose dimensions (number of rows and number of columns) are defined before the algorithm starts, to store the monitored data. In addition, it needs a different hash function for each row of the matrix, responsible for mapping the items processed by the algorithm to a column of the matrix. Each hash function is associated with one row of the matrix and every item is processed by all of them, leading to the increment of one counter in each row of the matrix.

To identify possible security vulnerabilities in the original version of Count-Min, we assume an adversary that may be located at any point of the network, but that has no access to the device on which the algorithm is installed. We verified that, for different adversary capabilities (only eavesdropping on the network, or deleting, modifying or generating packets), most of the vulnerabilities identified in the original Count-Min specification could be addressed by using cryptographic hash functions (instead of simple hash functions, such as those suggested by the Count-Min authors) and a mechanism that prevents the counters from exceeding their maximum capacity.

Sketch-based algorithms were designed to monitor a given metric during a finite period of time, after which their data structure starts to become too full and the number of collisions increases. For that reason, at the end of that period the data structure should be reset. However, in the context of computer network monitoring, the monitoring algorithm needs to carry out its function continuously, without pauses. Therefore, besides adding security to the original version of the algorithm, we developed a mechanism that allows sketch-based algorithms, such as Count-Min, to be used in the context of network monitoring. To this end, at the end of each monitoring period, defined by the system administrator, the data structure in use is reset at run time.

Current switches and routers, however, do not have the capability to run these advanced monitoring techniques (that is, sketches). Fortunately, programmable switches have emerged in recent years, some of them already in production, which finally creates the possibility of adding these functionalities to the data plane of a network. Accordingly, the monitoring algorithm we propose was implemented in P4, a recent language that allows reprogrammable forwarding devices to be programmed. Using P4 allowed us to program directly in the data plane, even giving us the possibility of changing values maintained by the monitoring algorithm without having to stop its execution.

We decided to use MD5 (Message-Digest Algorithm 5) to generate the cryptographic hash functions, since it has a lower time complexity than other cryptographic functions and is still considered secure when used together with a 128-bit key. This key is a random number, generated when the monitoring algorithm starts and stored in the memory of the programmable switch, where it can be accessed internally by the algorithm's own code or externally through an interface offered by the device. Since the security of the hash functions depends on this key, it is fundamental to prevent the adversary from discovering it. For that reason, and because sketch-based algorithms need to reset their data structure periodically, as already mentioned, we developed a solution that not only replaces the key in use with a new one, but also resets the algorithm's data structure, right after serializing it and copying it to a file. This copy is necessary because, whenever the monitoring algorithm is asked to estimate the frequency of a given item, all the data structures have to be consulted, including those stored in the file, which is done transparently by our algorithm.

During the implementation of our solution, we had to overcome some difficulties arising not only from the peculiarities of the P4 language but also from the interface between the P4 code and the software used to emulate a forwarding device. Among the main difficulties posed by P4, which result from the peculiarities of a switch, namely the need to process packets at high transmission rates, is the fact that it does not allow loops to be defined, which we needed in order to repeat the actions for each row of the matrix. We ended up solving this successfully in an unconventional way. The interface offered by the virtual forwarding device (software switch) also posed some difficulties, among them the fact that it only allows hash functions to return a result of at most 64 bits. Since an MD5 computation returns 128 bits, for its result to be usable we had to modify the forwarding device software so as to guarantee interoperability with the P4 program we developed.

The evaluation we performed focused on performance and functionality, comparing our secure solution with the original Count-Min (which we also implemented in P4) and with a baseline algorithm that only forwards traffic without doing any kind of monitoring. Regarding latency, we observed that monitoring with a Count-Min-based algorithm adds a delay of about 0.7 milliseconds per packet to the processing performed by the forwarding device (with a matrix of 20 rows). The additional delay introduced by our secure version was, on average, less than 0.2 milliseconds. We also evaluated the throughput the forwarding device can achieve when running our solution, and observed that it always remains very close to the throughput obtained with the original version of Count-Min. We further compared the error of the estimates given by the algorithm with the theoretical maximum error presented in the original algorithm's specification for a given probability. We observed no difference in the error between the original and the secure versions of Count-Min. We could therefore conclude that using a secure version of Count-Min does not introduce relevant penalties in the performance and functionality of the monitoring algorithm, despite the security guarantees offered.


Keywords: Monitoring, Security, Computer Networks, Sketches, Programmable Data Planes


Abstract

Monitoring is a fundamental activity in network management as it provides knowledge about the behavior of a network. Different monitoring methodologies have been employed in practice, with sample-based and sketch-based approaches standing out because of their manageable memory requirements. The accuracy provided by traditional sampling-based monitoring approaches, such as NetFlow, is increasingly being considered insufficient to meet the requirements of today's networks. By summarizing all traffic for specific statistics of interest, sketch-based alternatives have been shown to achieve higher levels of accuracy for the same cost. Existing switches, however, lack the necessary capability to perform the sort of processing required by this approach. The emergence of programmable switches and the processing they enable in the data plane has recently led sketch-based solutions to be made possible in switching hardware.

One limitation of existing solutions is that they lack security. At the scale of the datacenter networks that power cloud computing, this limitation becomes a serious concern. For instance, there is evidence of security incidents perpetrated by malicious insiders inside cloud infrastructures. By compromising the monitoring algorithm, such an attacker can render the monitoring process useless, leading to undesirable actions (such as routing sensitive traffic to disallowed locations).

The objective of this thesis is to propose a novel sketch-based monitoring algorithm that is secure. In particular, we propose the design and implementation of a secure and scalable version of the Count-Min algorithm [16, 17], which tracks the frequency of items through a data structure and a set of hash functions. As traditional switches do not have the capabilities to allow these advanced forms of monitoring, we leverage the recently proposed programmable switches. The algorithm was implemented in P4 [11], a programming language for programmable switches, which are now able to process packets just as fast as the fastest fixed-function switches [12]. Our evaluation demonstrates that our secure solution entails a negligible performance penalty when compared with the original Count-Min algorithm, despite the security properties provided.

Keywords: Monitoring, Security, Computer Networks, Sketches, Programmable Data Planes


Contents

List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Goals
1.3 Contribution
1.4 Planning
1.5 Structure of the document
2 Related Work
2.1 Sketch-Based Monitoring
2.1.1 Heavy Hitters
2.1.2 Frequency Moments
2.1.3 Detection of Traffic Changes
2.1.4 Counting the Number of Distinct Flows
2.1.5 Count Traffic
2.2 Security in Sketch-Based Monitoring
2.2.1 Adversary Capabilities
2.3 Programmable Switches
2.3.1 P4
2.3.2 P4 for Sketch-Based Monitoring
2.4 Summary
3 Design
3.1 Sketch-Based Monitoring
3.2 Initialization Sequence
3.3 Change Switch Key
3.4 Update Operation
3.5 Estimate Operation
3.6 Management of Old Sketches
4 Implementation
4.1 One-Dimensional Array
4.2 Repeat Actions
4.3 MD5 Hash Function
4.4 Define Flows
5 Evaluation
5.1 Experimental Setup
5.2 Performance of Traffic Forwarding
5.2.1 Latency
5.2.2 Throughput
5.3 Observed Errors in Estimations
5.3.1 Source IP Addresses
5.3.2 TCP Flows
5.4 Summary
6 Conclusion
Bibliography


List of Figures

1.1 Work Plan
2.1 Count-Min sketch data structure with width w = 4 and depth d = 3
2.2 Count sketch data structure with width w = 4 and depth d = 3
2.3 Inserted items in the Bloom Filter array with b = 16 and k = 3
2.4 Test if the item i is in the set, with b = 16 and k = 3
2.5 Test if the item j is in the set, with b = 16 and k = 3
2.6 PCSA update operation with number of bitmaps w = 9 and depth d = 4
2.7 P4 architecture (from [6])
3.1 Update operation in a sketch-based monitoring solution
3.2 Estimate operation in a sketch-based monitoring solution
3.3 Solution design
3.4 Simulated bidimensional array and counter linear array comparison
5.1 Network Topology
5.2 Latency between the two hosts
5.3 Switch Throughput
5.4 Errors in estimations using different memory sizes when monitoring by source IP address with our solution
5.5 Errors in estimations returned by the secure and the original versions of Count-Min when monitoring by source IP address. (a) using 1 KB of memory (b) using 4 KB of memory
5.6 Errors in estimations using different memory sizes when monitoring by flow with our solution
5.7 Errors in estimations returned by the secure and the original versions of Count-Min when monitoring by flow. (a) using 1 KB of memory (b) using 64 KB of memory


List of Tables

2.1 Attacks against sketch-based algorithms
2.2 Attacks against the Count-Min algorithm


Chapter 1

Introduction

Monitoring is the activity of supervising something in order to ensure it is operating as expected. Monitoring a computer network is the only way a network administrator has to know the state of the network, enabling quick responses to anomalies and the necessary configuration adjustments. Selecting the right metrics to monitor is an important decision to make, so that the most useful information can be retrieved without wasting resources. Some of the most used metrics include: network availability (amount of time, in a specific time interval, during which the network infrastructure is operational), utilization (ratio of the bandwidth used by the traffic being sent/received over the overall capacity), packet loss rate (ratio of packets lost with respect to packets transmitted) and network latency (time a packet takes to get from one designated point to another).

Different approaches can be used for monitoring. Ideally, for complete accuracy, the monitoring task should store all transmitted packets for subsequent analysis. In practice, however, this technique would lead to storage and processing scalability issues. Fortunately, exact results are usually not necessary, and a high-quality approximation is enough. This fact suggests the use of probabilistic algorithms, which use smaller amounts of memory and require less computation to achieve the desired goals.

Traditional Network Monitoring

To avoid the storage and processing of all packets, as would be required by naive monitoring, traffic data can be reduced by sampling, with only a subset of the traffic being captured. The frequency at which packets are collected is the sampling rate: the number of samples taken per unit of time.

A proprietary (Cisco) protocol, NetFlow [3], has used sampling since the introduction of the Cisco 12000 [7] and has been considered a reference. NetFlow is a protocol for monitoring network traffic flow data generated by switches that support it. A network flow can be defined as a unidirectional sequence of packets that share the following values: ingress interface, source and destination IP addresses, source and destination TCP ports, IP protocol, and IP type of service.
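For illustration only, a minimal Python sketch of such a flow key and of counting packets per flow (the field names are ours, not NetFlow's):

from collections import Counter

# Hypothetical field names; a NetFlow-style flow is identified by this tuple of header values.
def flow_key(pkt):
    return (pkt["ingress_if"], pkt["src_ip"], pkt["dst_ip"],
            pkt["src_port"], pkt["dst_port"], pkt["proto"], pkt["tos"])

packets = [
    {"ingress_if": 1, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "src_port": 12345, "dst_port": 80, "proto": 6, "tos": 0},
    {"ingress_if": 1, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "src_port": 12345, "dst_port": 80, "proto": 6, "tos": 0},
]

flows = Counter(flow_key(p) for p in packets)   # packets per flow
print(flows)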

A major problem of solutions based on sampling is the potential lack of accuracy achieved, as many packets are ignored. To be scalable, the sampling frequency of these solutions is kept at low levels, with sampling rates of 1:1000 (one packet in 1000) being common [8]. This reduces the accuracy to a level that precludes its use for many of the advanced monitoring capabilities required in today’s large scale networks that enable cloud computing.

Sketch-Based Algorithms

To maintain memory and processing at acceptable levels, sketch-based algorithms summarize the network data streams in the data plane (by employing hashing, counting, and filtering techniques). These solutions have been shown to offer an interesting trade-off between the accuracy achieved and the memory used, outperforming the alternatives for various monitoring tasks. Existing switches, however, lack the necessary capability to enable this approach.

Sketches are data structures that use sub-linear space, meaning that the memory used grows sub-linearly with the input data. Whenever the size of the memory used is smaller than the input, some accuracy loss is inevitable, leading to probabilistic results. Sketch-based algorithms, however, still provide high-quality approximations, which are very commonly as useful as the exact results.

Programmable Networks

In this thesis we focus our attention on sketch-based algorithms. An initial problem that we thus face concerns their practicality. Until recently, network switches and routers did not have the required capabilities for implementing sketches in practice. The emergence of programmable switches has given operators the opportunity to run complex processing in the data plane, radically changing the state of affairs. Recent proposals [37, 30] have shown the feasibility of sketch-based solutions in real hardware data planes.

In traditional networks, network devices have the control plane, used to populate the forwarding table, and the data plane, which consults the forwarding table to decide on which interfaces packets should be transmitted, coupled together inside the same piece of equipment. Being hardware appliances built to achieve the required performance, these networks tend to be static due to the little flexibility hardware provides for evolution.

On the other hand, software-defined networking (SDN) [28] decouples the data plane from the control plane, allowing flexible control of the network. The decoupling of the control plane makes logically centralized network control possible, which allows, among other things, the whole network to be observed from a single vantage point. The SDN concept has been recently extended to the data plane. Production-level programmable switches are now available (e.g., Barefoot Tofino), allowing programmability of the data plane itself: it is now possible to define precisely how packets should be processed in these switches using a high-level programming language (such as P4 [11]).

These software-defined networks can be monitored using traditional techniques, but also newer ones, such as those based on sketches. Some work has indeed already been done to adapt sketch-based algorithms to the SDN architecture. OpenSketch [39], for instance, is a software-defined measurement architecture, in which the data plane offers a library of predefined sketches that can be combined in the control plane to create the required measurement algorithm. Hashpipe [37] is a very recent solution in P4 that takes advantage of programmable data planes.

1.1 Motivation

Many efficient sketch-based algorithms have been proposed over the years to face the requirements of real-time monitoring. With growing network speeds, the proposed solutions have had to do their job ever faster. Since their focus has been on this requirement, these solutions tend to neglect security in favor of optimal execution time and memory usage.

Indeed, if the monitoring algorithm itself is not secure, its results may be corrupted. In the worst-case scenario, the network administrator does not notice the results are corrupted, and takes improper actions. For instance, an attacker may persuade the monitoring system to route sensitive traffic to a location he or she controls. Unfortunately, there is evidence that the problem is real. A recent report mentions malicious insiders as one of the top threats in cloud computing [31], as evidenced by the occurrence of instances of this problem in companies such as Google [2, 27]. The security limitation of current approaches is therefore already a serious concern.

There is already some initial work [33, 35] addressing the security of traditional SDN monitoring but, to the best of our knowledge, no attempt has hitherto been made to address the security of sketch-based algorithms. Our work starts filling that gap.

1.2 Goals

The objective of our work is to design, implement and evaluate a secure version of a sketch-based algorithm – Count-Min [16, 17] – that enables secure traffic monitoring, while still guaranteeing acceptable execution speeds and memory usage requirements.

The sketch should take advantage of the benefits SDN networks have to offer, namely the possibility to program the data planes. For this purpose, the sketch will be implemented in P4 [11], a language that allows the programming of switches. The solution should allow a network administrator to monitor his network securely. By employing a sketch-based approach, monitoring outputs, despite not being exact, will be approximations of higher quality than those provided by algorithms based on packet sampling. Compared to the previous sketch-based alternatives, the monitoring results will be trustworthy, increasing the network administrator's confidence that he is taking the proper management actions.

1.3 Contribution

The main contribution of this work is the design of a secure version of a sketch-based algorithm, the Count-Min, which should be able to perform, in a secure way, the monitoring task it was designed for. Our solution addresses several technical challenges, many of which arise from the constraints imposed by real switches. These include the use of cryptographic hash functions (not supported in existing switches), avoiding loops (not directly available as they would limit throughput), and techniques for secret key renewal. We prototyped our solution in P4 [11], a programming language for network switches.

In terms of performance, we measure the latency and throughput our solution achieves and compare it with two other algorithms, also implemented in P4: the original Count-Min algorithm and an algorithm that only forwards the traffic. We also calculate the errors in the estimations returned by our solution, while monitoring by source IP address and by flow. Our evaluation using the public-domain behavioral P4 switch model [1] demonstrates that securing the sketching algorithm introduces a negligible performance penalty. We also observe that flow-based monitoring requires the use of a larger data structure to achieve the same errors in estimations, when compared with monitoring based on source IP addresses only.

1.4 Planning

This section presents the proposed work plan. To help visualize it, a Gantt chart illustrating the schedule is shown below (Figure 1.1).


In the Survey task, a survey of sketch-based algorithms was done alongside the reading of other related work. The Security Considerations task included the investigation of potential security problems of sketches. In the Design Solution phase, a secure version of a sketch was designed. The sketch was then implemented in the Implement Solution task, and evaluated in the Evaluate Solution task. The Document Work task is related to the writing of this document.

1.5 Structure of the document

This document is organized as follows:

• Chapter 2 - Related Work: This chapter presents a survey of sketch-based algo-rithms, security considerations about them, and some context about programmable switches.

• Chapter 3 - Design: In this chapter the design of the solution will be presented, including the way the memory data structures are reconfigured over time.

• Chapter 4 - Implementation: In this chapter the most important P4 implementation details are presented.

• Chapter 5 - Evaluation: This chapter presents the evaluation performed in terms of performance and functionality of the proposed solution.

• Chapter 6 - Conclusion: A conclusion about the work done is presented in this chapter.


Chapter 2

Related Work

In this chapter, we start with a survey on sketch-based monitoring algorithms. For each algorithm, we describe the data structure it requires, the methods it provides, and the accuracy that it is able to guarantee. Afterwards, we investigate these algorithms, giving special attention to the Count-Min, with respect to security. For different adversary capabilities we tried to identify the possible actions a malicious user could take. Finally, in the last section, we describe the emergent programmable switches. We give special attention to the P4 language, which is used to program these switches, and which we use to implement our solution. Besides identifying the language's goals, structure and architecture, we also present recent proposals that make use of the P4 language for network monitoring.

2.1 Sketch-Based Monitoring

Network monitoring algorithms used today are mostly sample-based or sketch-based. These kinds of algorithms, which aim to be memory and CPU efficient, are probabilistic because they are not able to guarantee that exact results are always provided. Instead, they aim to guarantee that high-quality approximations are returned, which may be as useful as the exact results. Sample-based algorithms monitor only a subset of the traffic that arrives at the device. For that reason, many packets are not monitored, which leads to accuracy problems. On the other hand, sketch-based algorithms process every packet, performing a summarization (mainly by hashing and counting) for a specific statistic of interest. Importantly, the algorithms are designed with provable accuracy-memory trade-offs.

This section presents some of the most well-known sketch-based algorithms that can be used to monitor networks. The algorithms are categorized according to the problem they propose to solve. A monitoring model is assumed in which an external entity periodically collects the sketch's counters. Immediately afterwards, all counters are restarted and a new monitoring cycle begins.


Figure 2.1: Count-Min sketch data structure with width w = 4 and depth d = 3

2.1.1 Heavy Hitters

This section presents a set of sketches designed to identify flows that are larger (in number of packets or bytes) than a given fraction of all the traffic seen during a time interval. Identifying heavy hitters is important for several network applications, such as traffic engineering, anomaly detection and DDoS prevention.

Count-Min Sketch

The Count-Min sketch [16, 17] identifies the heavy hitters in a stream by solving the Count Tracking problem, where the goal is to find the frequency of each item in a stream with a large number of items.

Data Structure The data structure used is a two-dimensional array of counters with width w and depth d, both fixed at the time of creation. These values, w and d, are chosen based on the desired accuracy of the estimates. The counters are initialized with zero.

In addition, d hash functions must be chosen uniformly at random from a pairwise-independent family. At update time, each of these functions maps the item onto the range {1, 2, . . . , w}.

Methods The sketch provides two methods: update(i,c), which updates the frequency of item i by c, and estimate(i), which gives the estimated frequency of i.

Update(i,c): When a new item i arrives, for each of the d rows the corresponding hash function is applied to i in order to determine the position in that row of the target counter. Value c, which may be positive or negative, is then added to the target counter.


Estimate(i): In order to estimate the frequency of an item i, for each of the d rows the corresponding hash function is applied to i. This gives the position of the target counter in every row. After the d counters are found, the one with the smallest value is chosen. The value of that counter is then returned.
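To make the two operations concrete, a minimal Python sketch of Count-Min follows. It uses salted built-in hashing purely for illustration, not the pairwise-independent (or cryptographic) hash functions discussed elsewhere in this thesis.

import random

class CountMin:
    def __init__(self, width, depth):
        self.w, self.d = width, depth
        self.counters = [[0] * width for _ in range(depth)]
        # One salt per row stands in for the d independent hash functions.
        self.salts = [random.randrange(2**32) for _ in range(depth)]

    def _col(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, c=1):
        # Add c to one counter per row.
        for j in range(self.d):
            self.counters[j][self._col(j, item)] += c

    def estimate(self, item):
        # Return the smallest of the d counters the item maps to.
        return min(self.counters[j][self._col(j, item)] for j in range(self.d))

cm = CountMin(width=2000, depth=10)
for _ in range(5):
    cm.update("10.0.0.1")
print(cm.estimate("10.0.0.1"))  # at least 5; never an underestimate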

Accuracy Summarizing a stream normally results in the loss of some accuracy. To minimize this loss, the sketch dimensions should be as large as possible. This way, the probability of collision is lower and, as a consequence, the average accuracy of the estimates will be higher. Another factor contributing to the accuracy is the duration of the monitoring cycles. Whenever the counters are restarted, the accuracy of the sketch is perfect, starting to decrease after the occurrence of collisions.

If N is the sum of the values of all the counters in a row of a sketch of size w × d, the frequency of an item i returned by the algorithm is at most (2/w)·N more than its true frequency, with a probability of at least 1 − (1/2)^d.

Example: For a query to have an error of at most 0.001 of N with a probability of at least 0.999, the sketch dimensions should be the following:

• 2/w = 0.001 ⇔ w = 2000
• 1 − (1/2)^d = 0.999 ⇔ (1/2)^d = 0.001 ⇔ d = log(0.001)/log(0.5) ⇔ d ≈ 10

Note: If the resulting value is not an integer, it must be rounded up in order to preserve the guarantees.
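The same dimensioning can be computed programmatically; a small helper, assuming the 2/w and 1 − (1/2)^d bounds used above:

import math

def count_min_dimensions(eps, delta):
    # w from 2/w = eps, d from (1/2)^d = delta; round up to preserve the guarantees.
    w = math.ceil(2 / eps)
    d = math.ceil(math.log(delta) / math.log(0.5))
    return w, d

print(count_min_dimensions(0.001, 0.001))  # (2000, 10)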

Count Sketch

The Count Sketch [14, 15] can also be used to identify heavy hitters. While the Count-Min sketch can be used to count packets or bytes, the Count sketch can only be used to count packets, as there are only two possible update values: +1 and −1.

Data Structure The data structure is, like for the Count-Min sketch, a two-dimensional array with width w and depth d. The dimensions of the array are going to have an effect on the accuracy achieved in the estimates. Each of the d rows should be interpreted as a hash table with all its slots initialized to zero.

The Count sketch also needs d hash functions to map objects onto {1, ..., w} and another d hash functions to map those same objects to +1 or −1. Let i be the object to map. Then the hash functions are: h_1, ..., h_d : i → {1, ..., w} and s_1, ..., s_d : i → {+1, −1}.

Methods For the operation of the algorithm, two methods are provided: the update method, called whenever a new item arrives, and the estimate method, that returns the estimated frequency of the queried item.


Figure 2.2: Count sketch data structure with width w = 4 and depth d = 3

Update(i): For a given item i, the algorithm applies for each row the corresponding hash function h to i. This operation gives the position in that row where the target counter is located for i. After that, a second hash function, s, is applied to i, which will produce the value +1 or −1. Finally, the result of the function s is added to the value in the target counter.

Estimate(i): Let j be an iterator over the rows, h_j the function that identifies a position in row j, and s_j the function that returns +1 or −1. For each row j, the value of the counter in position h_j(i) is multiplied by s_j(i). The median of these d products is the estimated frequency returned.

The median should be used instead of the mean because of the mean's sensitivity to outliers. For example, if there is a counter with a value far from the others, which will probably happen due to collisions, the inaccuracy of the returned value would be higher if the mean was used instead of the median.

Accuracy Let F_2 be the sum of the squares of the frequencies of the items. For this version of the Count Sketch algorithm, the data structure used should have w = 1/ε² and d = log(1/δ), in order to have an error of at most ε√F_2 with a probability of at least 1 − δ.
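A minimal Python rendition of the Count sketch update and estimate, again with salted built-in hashing standing in for the hash families the analysis assumes:

import random
from statistics import median

class CountSketch:
    def __init__(self, width, depth):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.h_salts = [random.randrange(2**32) for _ in range(depth)]
        self.s_salts = [random.randrange(2**32) for _ in range(depth)]

    def _h(self, j, item):                      # position in row j
        return hash((self.h_salts[j], item)) % self.w

    def _s(self, j, item):                      # sign +1 / -1 for row j
        return 1 if hash((self.s_salts[j], item)) % 2 == 0 else -1

    def update(self, item):
        for j in range(self.d):
            self.table[j][self._h(j, item)] += self._s(j, item)

    def estimate(self, item):
        # Median of the signed counters the item maps to.
        return median(self.table[j][self._h(j, item)] * self._s(j, item)
                      for j in range(self.d))

cs = CountSketch(width=64, depth=5)
for _ in range(7):
    cs.update("flow-A")
print(cs.estimate("flow-A"))  # close to 7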

2.1.2 Frequency Moments

The problem of calculating the frequency moments was defined in [9] as described next. Consider a sequence of items S = (a_1, a_2, ..., a_m), where each a_i is a number between 1 and n, and m_i denotes the number of occurrences of i in S. For each k ≥ 0, the kth frequency moment is defined as:

F_k = Σ_{i=1}^{n} (m_i)^k    (2.1)

There are several frequency moments that have different applications by providing useful statistics about the sequence. For example, F_0 represents the number of distinct elements in a stream and F_1 is the number of elements of a stream. The sketch shown below is used to estimate F_2, which is Gini's index of homogeneity, an index that is required in the calculation of the surprise index [25] of a sequence. The surprise index is a measure of the degree of surprise associated with the occurrence of an event. The larger the index, the more surprising the occurrence of the event is. An event is surprising if its probability is small compared with the probabilities of occurrence of other events. Considering P_r the probability of the event that actually occurred, the surprise index is calculated with:

Surprise Index = F_2 / P_r    (2.2)

AMS Sketch

The AMS Sketch is useful to estimate, using a compact data structure, the value of F_2 of the frequency vector containing the data stream. The second frequency moment (F_2) of a vector v is defined as the square of its Euclidean norm (also called the L2 norm), which can be represented as ||v||₂².

The sketch was proposed originally in 1996 [9] but since then other authors have optimized it [15]. In this newer version of the sketch, the update time is reduced by O(1/ε²), maintaining the same guarantees and space requirements.

Data Structure The data structure used by this sketch is an array of width w = 1/ε² and depth d = log(1/δ). For each of the d rows, a hash function h maps the items to {1, 2, ..., w}. A second hash function, g, is needed to map those same items to {+1, −1}. Function g must be fourwise independent [38]. All entries of the array are initialized to zero.

Methods The algorithm uses an update method to update the data structure whenever a new item arrives and an estimate method that returns the second frequency moment, F_2, of the vector containing the data stream.

Update(i,c): For each row j between 1 and d, h_j(i) is computed to obtain the position of the target counter in that row. After that, the result of c × g_j(i) is added to the value in the target counter, positioned in row j and column h_j(i). At the end of each iteration j, the target counter value is: target counter = target counter + c × g_j(i).

Estimate(i): Let D be the data structure where D[j, k] represents the entry in row j and column k. For each row j, compute Σ_{k=1}^{w} D[j, k]². The median of these sums is the estimate returned for F_2.


Accuracy Recall that v is an imaginary vector containing the full data stream. The sketch guarantees that, given a certain sketch dimension, with a probability of at least 1 − δ the estimate returned by the algorithm is between (1 − ε)||v||₂² and (1 + ε)||v||₂² or, in a simplified version, between (1 − ε/2)||v||₂ and (1 + ε/2)||v||₂.
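An illustrative Python version of this (fast) AMS update and F2 estimate; salted built-in hashing replaces the fourwise-independent functions the guarantees require:

import random
from statistics import median

class AMSSketch:
    def __init__(self, width, depth):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.h_salts = [random.randrange(2**32) for _ in range(depth)]
        self.g_salts = [random.randrange(2**32) for _ in range(depth)]

    def _h(self, j, item):
        return hash((self.h_salts[j], item)) % self.w

    def _g(self, j, item):
        return 1 if hash((self.g_salts[j], item)) % 2 == 0 else -1

    def update(self, item, c=1):
        for j in range(self.d):
            self.table[j][self._h(j, item)] += c * self._g(j, item)

    def estimate_f2(self):
        # Median over rows of the sum of squared counters.
        return median(sum(v * v for v in row) for row in self.table)

ams = AMSSketch(width=256, depth=5)
ams.update("a", 3)
ams.update("b", 4)
print(ams.estimate_f2())  # close to 3**2 + 4**2 = 25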

2.1.3 Detection of Traffic Changes

The detection of traffic anomalies is crucial to identify failures and attacks in a network. However, to do this perfectly it is necessary to analyze each flow individually, which may be too expensive. For this reason, a solution that uses sketches is preferable due to its capability to summarize the traffic and still provide an accurate estimate when queried.

There are two main approaches to identify traffic anomalies: (1) looking for anomalies that match the behavior of some known anomaly and (2) comparing the traffic with a model of normal behavior constructed based on past traffic history. The first approach has the same problem as blacklists: new anomalies/attacks are not detected, because identifying them requires knowing their behavior in advance. The second approach does not require anything to be known a priori, allowing the detection of new anomalies. However, sometimes what is considered an anomaly by this solution may in fact be new normal behavior.

K-ary Sketch

The k-ary sketch [29] was developed to identify traffic anomalies in an efficient, accurate, and scalable way. The k-ary sketch follows the second approach described above. It looks for significant changes in the behavior of the traffic when compared to a model of behavior considered normal, constructed based on past traffic history. The algorithm has three modules:

• Sketch module: similar to other sketches, the data stream is summarized in a sketch S_o;

• Forecasting module: produces a forecast sketch S_f using some forecast model based on the observed sketches S_o;

• Detection module: a forecast error sketch S_e is calculated as S_e = S_o − S_f. Using the S_e sketch, the change detection module verifies whether the forecast error is above a defined threshold and, if so, a potential anomaly is identified.

Data Structure The k-ary sketch uses a two-dimensional array of counters of width w and depth d. Each row has an associated hash function that maps the items of the data stream onto {1, 2, ..., w}. These hash functions must be, like in the AMS Sketch, fourwise independent [38], in order to keep the sketch's accuracy guarantees.


Methods The algorithm provides four methods: update, to update the sketch; estimate, to reconstruct an approximation of the value associated with a specific key; estimateF2, used to estimate the second frequency moment of the vector containing the data stream; and combine, to compute the linear combination of multiple sketches.

Update(S, i, c): Let S[j, k] represent the counter located in row j and column k of sketch S. For each row j between 1 and d, add the value c to the counter at [j, h_j(i)].

Estimate(S, i): For each row j of a sketch S, the following is calculated, where sum(S) = Σ_{k=1}^{w} S[1, k] is the sum of all values in the sketch (computed only once):

(S[j, h_j(i)] − sum(S)/w) / (1 − 1/w)    (2.3)

The estimate returned for the given key i is the median among the results of all rows.

EstimateF2(S): For each row j of the sketch S, the following is calculated:

(w/(w − 1)) · Σ_{k=1}^{w} (S[j, k])² − (1/(w − 1)) · (sum(S))²    (2.4)

The estimated second frequency moment is the median among the results of all rows.

Combine(c_1, S_1, ..., c_ℓ, S_ℓ): Let c_1, ..., c_ℓ be scalars and S_1, ..., S_ℓ be sketches. The linearity of the sketch data structure allows the linear combination of multiple sketches:

Result Sketch = Σ_{k=1}^{ℓ} c_k · S_k    (2.5)
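Since the data structure is linear, the combine operation (and in particular the forecast error S_e = S_o − S_f used below) reduces to an entry-wise combination; a small Python sketch, assuming sketches stored as plain two-dimensional lists:

def combine(coeffs, sketches):
    # Entry-wise linear combination of sketches that share the same dimensions.
    d, w = len(sketches[0]), len(sketches[0][0])
    result = [[0] * w for _ in range(d)]
    for c, s in zip(coeffs, sketches):
        for j in range(d):
            for k in range(w):
                result[j][k] += c * s[j][k]
    return result

# Forecast error: S_e = S_o - S_f, i.e., coefficients (1, -1).
s_observed = [[4, 0, 1], [2, 3, 0]]
s_forecast = [[3, 1, 1], [2, 2, 0]]
print(combine([1, -1], [s_observed, s_forecast]))  # [[1, -1, 0], [0, 1, 0]]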

Forecasting Module There are several models that can be used for forecasting and change detection. There are relatively simple smoothing models that work with the weights assigned to each previous sketch, and models belonging to the family of AutoRegressive Integrated Moving Average (ARIMA) models [13], which identify the linear dependency of the future values on the past values.

Change Detection Module This is the module responsible for detecting the variations. Initially, the forecast error sketch S_e(t) is constructed, based on the observed sketch and the sketch that results from the application of the forecast module: S_e = S_o − S_f.

For any given key i, it is possible to reconstruct its forecast error in S_e(t) at any time using the Estimate(S_e(t), i) method. The detection alarm is raised if the estimated forecast error exceeds a threshold T_A, which is computed from a parameter T chosen by the application and the estimated second moment of frequency of the forecast error sketch S_e(t):

T_A = T · [EstimateF2(S_e(t))]^(1/2)    (2.6)

Consider the data stream as a series of (key, value) pairs. The algorithm can only indicate whether, for a given key, the pairs with that key have changed considerably. This process is irreversible, meaning it is required to know which keys to query in order to find the streams that changed more than the threshold.

One possible solution is to brute-force the keys. In this solution, all the keys of the stream in an interval t are recorded and then replayed after S_e(t) is constructed. The problem is that this approach is not scalable with a large set of keys. A solution that reverses the k-ary sketch is presented in [34]. By modifying the algorithm's update procedure with a set of techniques, it allows the keys of target flows to be efficiently inferred from sketches.

Accuracy Like in the other sketches, the dimensions of the sketch, w and d, are crucial to the accuracy that can be achieved. Consider v_i^est the estimate returned for item i by the estimate method and F_2^est the estimated frequency moment returned by the estimateF2 method of the algorithm.

Estimate method: For an item i, T ∈ (0, 1) and α ∈ [1, ∞), if |v_i| ≥ αT√F_2, then

Pr[ |v_i^est| ≤ T√F_2 ] ≤ [ 4 / ((w − 1)(α − 1)²T²) ]^(d/2)    (2.7)

EstimateF2 method: For any λ > 0,

Pr[ |F_2^est − F_2| > λF_2 ] ≤ [ 8 / ((w − 1)λ²) ]^(d/2)    (2.8)

2.1.4 Counting the Number of Distinct Flows

The problem of counting the number of distinct header patterns (flows) seen during a measurement interval is addressed by the algorithms presented in this section. An intrusion detection system (IDS) looking for port scans, for example, can count, for each active source address, the distinct flows defined by destination port and IP address. If a source IP has more than a defined number of distinct flows opened during the measurement interval, it is probably performing a port scan.


Direct Bitmap

A Direct Bitmap [20] is a sketch-based algorithm that addresses the problem of counting the number of distinct flows among packets received on a link during a time period. This task may be especially difficult if the right algorithm is not used, because nowadays network links work at very high speeds, allowing only a small number of memory accesses per packet.

Data Structure The data structure is an array of bits, also called a bitmap, of size b, with all its bits set to zero at the beginning. A hash function h is also required to map each flow to a bit of the bitmap. Considering N as the maximum number of flows and ε as the acceptable average relative error, the size of the bitmap b is given by ⌈N / ln(Nε² + 1)⌉.

Methods The algorithm has only two operations: update, called whenever an item comes in; and estimate, used at the end of the measurement interval to get the number of distinct items.

Update(i): Whenever an item (packet) i arrives, the hash function is applied to its header pattern (used to identify the flow). The hash function returns the position in the array where the bit associated with that flow is located. That bit is then set to 1, if it was not already set by a previous item belonging to the same flow (or to a flow that maps to the same bit).

Estimate(): Let z be the number of unset bits. The number of unique elements returned is given by the following equation, where n̂ is the estimated number of distinct elements (flows):

n̂ = b · ln(b/z)    (2.9)
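A minimal Python version of the direct bitmap, with the built-in hash standing in for h:

import math

class DirectBitmap:
    def __init__(self, b):
        self.b = b
        self.bits = [0] * b

    def update(self, flow_id):
        self.bits[hash(flow_id) % self.b] = 1

    def estimate(self):
        z = self.bits.count(0)                 # number of unset bits
        return self.b * math.log(self.b / z)   # equation (2.9)

bm = DirectBitmap(b=1024)
for i in range(300):                           # 300 distinct flows
    bm.update(("10.0.0.1", "10.0.1.%d" % i, 80))
print(round(bm.estimate()))                    # roughly 300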

Accuracy The algorithm's accuracy is not perfect, since the bitmap size is smaller than the number of existing flows. Because of that, collisions will occur with a random probability.

Let n be the real number of distinct elements and ρ the flow density, defined as the average number of flows that hash to the same bit. In order to achieve the best possible accuracy, the value for ρ should be the one that maximizes the accuracy. The standard deviation of the ratio n̂/n given by this algorithm is calculated with the following equation:

SD(n̂/n) ≈ √(e^ρ − ρ − 1) / (ρ√b)    (2.10)


Virtual Bitmap

The Virtual Bitmap [20] derives from the Direct Bitmap algorithm, described above. However, it uses less memory than the Direct Bitmap, by covering only a portion of the flow space. A Virtual Bitmap that covers the entire flow space is a Direct Bitmap.

Because of the limited memory space, this algorithm samples the flow space. This sampling factor must be chosen before the execution of the algorithm, based on the expected number of flows. For a given memory size, the larger the number of flows, the smaller the flow space covered. For this reason, the sampling factor must be chosen carefully because, if the number of flows is too large, the virtual bitmap will have the same problems as an underdimensioned Direct Bitmap.

Data Structure Like in the Direct Bitmap, the Virtual Bitmap also uses an array of bits of size b and a hash function h to map the flows to specific positions of the array.

Consider a threshold on the number of distinct flows that are allowed before the algorithm raises an alarm. In order to maximize the accuracy, at the threshold the value of the flow density ρ (number of flows / b) should be 1.593624. By minimizing the algorithm's average error with equation 2.11, the algorithm's authors concluded that this was the optimal value for ρ. For that reason, the sampling factor chosen should allow the value of ρ to be around 1.6 at the threshold. For the best results, the value of b should be 1.54413865/ε², where ε is the average relative error.

Methods Similar to the Direct Bitmap, this algorithm also provides two methods: update and estimate.

Update(i): An item i is hashed by h whenever it arrives. If the result of h is a position in the Virtual Bitmap, the bit in that position is set to 1. Otherwise, i is ignored and the Virtual Bitmap remains unchanged.

Estimate(): Let b be the size of the bitmap and s the flow space size. The number of distinct active flows n̂ is given by the equation:

n̂ = s · ln(b/z)    (2.11)

Accuracy

Like the other sketch-based algorithms, the Virtual Bitmap does not provide perfect accuracy. Consider n̂ the estimated number of unique elements, n the real number of unique elements and ρ the flow density. The standard deviation of the ratio n̂/n is given by the next equation:

SD(n̂/n) ≲ √(e^ρ − 1) / (ρ√b)    (2.12)
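A sketch of the virtual bitmap in Python, under the simplifying assumption that the bitmap covers the first b values of a hash (flow) space of size s:

import math

class VirtualBitmap:
    def __init__(self, b, s):
        self.b, self.s = b, s      # b bits cover b out of s possible hash values
        self.bits = [0] * b

    def update(self, flow_id):
        pos = hash(flow_id) % self.s
        if pos < self.b:           # only the sampled portion of the flow space is recorded
            self.bits[pos] = 1

    def estimate(self):
        z = self.bits.count(0)
        return self.s * math.log(self.b / z)   # equation (2.11)

vb = VirtualBitmap(b=1024, s=8192)
for i in range(20000):
    vb.update(("host-%d" % i, 443))
print(round(vb.estimate()))                    # roughly 20000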

2.1.5 Count Traffic

The algorithms presented in this section are useful if one wants to count the number of distinct source addresses that send traffic to a set of destinations. For example, this problem can be addressed with a combination of a Bloom filter, to keep the set of destinations, and a PCSA sketch, to maintain the count of distinct sources.

Bloom Filter

The Bloom Filter [10] is a sketch-based algorithm used to test whether an item i is contained by a set s. Its main contribution is to allow this task to run in a space-efficient way.

Data Structure The data structure needed for this algorithm is an array of bits of size b. In addition, it requires k different hash functions, so that each one maps each element of the set to a position in the array. Let n be the maximum number of items of the set and p the false positive probability of the test that determines whether an item is contained by the set. To minimize the probability of false positives, b should be set to −n·ln(p) / (ln 2)² and k to (b/n)·ln 2.

Methods The Bloom Filter provides an add method to insert an item into the set and a test method to determine whether an item is contained by the set.

Add(i): Whenever a new item i arrives, it is hashed by all the k hash functions. Each hash function returns a position in the array where the bit stored is then set to 1.

Figure 2.3 represents a Bloom Filter of size b = 16 and k = 3 to which three elements are inserted: a, b and c.


Test(i): To find out if an item i is in the set, the item is hashed by the k functions. If the bits stored in the returned positions are all set to 1, then the item is considered to be in the set (with a probability p of this being a false positive). Since the algorithm does not support the removal of items, which would generate false negatives, every bit set to 1 at update time keeps that value. So, if at least one of those bits is unset, the item is definitely not in the set. The removal of an item i_remove would require the bits mapped by the k functions to be unset. If at least one of the recently unset bits was shared with another item i_keep that remains in the set, the test operation for i_keep would generate a false negative.

Figure 2.4 shows how the test operation is able to determine whether item i is contained by the set. The test fails because one of the hash functions (h_2 in this case) maps i to a 0-bit, meaning that the item i has not been inserted in the set.

Figure 2.4: Test if the item i is in the set, with b = 16 and k = 3

Figure 2.5 illustrates a false positive. Item j is not in the set but it is mapped by all hash functions of the test to positions with 1-bits. For that reason, the test will consider that j is in the set, which is not true.

Figure 2.5: Test if the item j is in the set, with b = 16 and k = 3
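A minimal Python Bloom filter using the sizing formulas given above; salted built-in hashing stands in for the k independent hash functions:

import math

class BloomFilter:
    def __init__(self, n, p):
        # Dimension b and number of hash functions k for n items and false-positive rate p.
        self.b = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        self.k = max(1, round(self.b / n * math.log(2)))
        self.bits = [0] * self.b

    def _positions(self, item):
        return [hash((salt, item)) % self.b for salt in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def test(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n=1000, p=0.01)
bf.add("10.0.0.9")
print(bf.test("10.0.0.9"))   # True
print(bf.test("10.0.0.10"))  # False with high probability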

Accuracy There are two different situations to consider regarding the result of the algorithm's test method: (1) it considers that the item is not contained in the set or (2) it considers that the item is contained in the set. Since the algorithm does not allow false negatives, as explained above, the algorithm enjoys perfect accuracy in the first situation, meaning that the item is definitely not in the set. In the second situation, however, there is the possibility that the item has not been inserted in the set, which is a false positive. The probability of the test method returning a false positive is the probability of all the k bits stored in the positions where i hashes being set to 1. So, the probability p of a false positive is given by the equation below:

p = (1 − (1 − 1/b)^(kn))^k ≈ (1 − e^(−kn/b))^k    (2.13)

Probabilistic Counting with Stochastic Averaging

The Probabilistic Counting with Stochastic Averaging (PCSA) algorithm [23] provides the estimated number of distinct items in a collection of data. In the computer networking area, it is often used to count the number of distinct values of a header field (e.g., the source address of the packet).

Data Structure This algorithm uses w bitmaps with d positions each. The bitmaps can also be seen as a two-dimensional bitmap of width w and depth d. The value of w determines the accuracy that can be achieved. The value of d should be at least log₂(n/w) + 4, where n is the expected number of distinct elements. A hash function h is needed to uniformly distribute the items over the bitmaps of length d, by mapping the items onto the range {0, ..., 2^d − 1}. A function f that finds the position of the least significant 1-bit in the binary representation of a value is also needed. Let bit(y, k) be a function that returns the value of the kth bit in the binary representation of y.

f(y) = min{k ≥ 0 : bit(y, k) ≠ 0}  if y > 0
f(y) = d                            if y = 0        (2.14)

Methods The update(i) function is called when a new item i is detected, keeping the algorithm’s data structure updated. At the end of the measurement interval, the estimate() method is executed to estimate the number of distinct items that have been observed.

Update(i): Whenever an item i arrives, the algorithm starts by applying h to i. Let a be the remainder of h(i) divided by w and b the result of applying f to the result of the integer division of h(i) by w. The bit stored in bitmap[a, b] is then set to 1. The update operation is represented in Figure 2.6 and described in Algorithm 1, where mod represents the modulo operator and div represents the integer division operator.


Figure 2.6: PCSA update operation with number of bitmaps w = 9 and depth d = 4

Algorithm 1 PCSA Update

procedure UPDATE(i)
    a = h(i) mod w
    b = f(h(i) div w)
    bitmap[a, b] = 1
end procedure

Estimate(): Let S and R be variables initialized to 0 and it be the iterator of a cycle over the w bitmaps, starting with it = 0. For each iteration, R is reset to 0 and, while the bit in bitmap[it, R] has value 1 and R is smaller than d, R is incremented; after that, S is set to S + R. The estimated number of distinct items returned is the integer part of (w / 0.77351) × 2^(S/w). This operation is described in Algorithm 2.

Accuracy The standard error ε of the value returned by the estimate method of this algorithm decreases as the number of bitmaps used increases and is approximated by the next equation.

ε ≈ 0.78 / √w    (2.15)
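Algorithms 1 and 2 can be transcribed almost directly into Python; a small illustrative version, with the built-in hash standing in for h:

class PCSA:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.bitmap = [[0] * d for _ in range(w)]

    def _f(self, y):
        # Position of the least significant 1-bit, or d if y == 0.
        k = 0
        while y > 0:
            if y & 1:
                return k
            y >>= 1
            k += 1
        return self.d

    def update(self, item):
        h = hash(item) & 0xFFFFFFFF
        a = h % self.w
        b = self._f(h // self.w)
        if b < self.d:                 # guard against the (rare) f(0) = d case
            self.bitmap[a][b] = 1

    def estimate(self):
        S = 0
        for row in self.bitmap:
            R = 0
            while R < self.d and row[R] == 1:
                R += 1
            S += R
        return int(self.w / 0.77351 * 2 ** (S / self.w))

pcsa = PCSA(w=64, d=16)
for i in range(5000):
    pcsa.update("src-%d" % i)
print(pcsa.estimate())  # roughly 5000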

LogLog and Super-LogLog Sketches

The LogLog and the Super-LogLog sketches [18] are used to estimate the number of distinct items in a set, by employing only a small auxiliary memory space and a single pass over each item.

The Super-LogLog Sketch is an improved version of the basic LogLog Sketch. In the following paragraphs we describe both the basic LogLog algorithm and the techniques through which the improvements are achieved. When nothing is said, the Super-LogLog works the same way as the basic LogLog algorithm.


Algorithm 2 PCSA Estimate

procedure ESTIMATE
    S = 0
    it = 0
    while it < w do
        R = 0
        while bitmap[it, R] = 1 and R < d do
            R = R + 1
        end while
        S = S + R
        it = it + 1
    end while
    return trunc((w / 0.77351) × 2^(S/w))
end procedure


Data Structure The data structure needed is an array of m memory units taking only ⌈log₂(log₂(N_max))⌉ bits each, where N_max is the maximum number of distinct elements expected. All positions of the array are initialized to zero. The value of m determines the accuracy of the algorithm, as shown by equation 2.18.

Like in the PCSA sketch, a hash function h is needed to transform the input items into binary strings of size H. The value of H, corresponding to the length of the hashed items, must satisfy H ≥ log₂(m) + ⌈log₂(N_max/m) + 3⌉. A second function, f, is needed to find the rank of the first 1-bit, counting from the left, in a sequence of bits. Thus, f(1...) = 1, f(001...) = 3, f(0^k) = k + 1, etc.

In the Super-LogLog algorithm, it is possible to reduce the size of each of the m memory units to ⌈log₂⌈log₂(N_max/m) + 3⌉⌉ bits.

Methods The algorithm provides a method to add a new item to the data structure and another one to estimate the number of distinct elements added.

Update(i): Whenever an item i arrives this method is called and i is immediately hashed by h. Consider k = log2(m). The value of the first k bits of h(i) is an index j to a position in the array. The position j in the array contains a value, say M (j), that is then set to the maximum between its previous value and the output of function f applied to the binary representation of h(i) without its first k bits.

Estimate(): This method is used to get the output from the algorithm. Consider αm ∼ α∞−2π

2+log22

48m , where α∞= 0.39701. In a practical implementation with m ≥ 64,

(44)

Chapter 2. Related Work 22

basic LogLog algorithm, corresponding to the estimated number of distinct items added to the data structure, is given by the next equation:

E = αm· m · 2

1 m

P

jM(j) (2.16)

In the Super-LogLog algorithm, only a portion of the array is used to calculate the number of distinct items. This portion corresponds to the m0 = θ0m smallest values

stored in the array. The constant θ0 is a real number between 0 and 1, producing

near-optimal results when its value is 0.7. The value returned by the Super-LogLog is given by the equation below, whereP∗

M (j) indicates the sum of the values in the selected positions of the array.

E = αm0· m0· 2 1 m0 P∗ M(j) (2.17) Accuracy The standard error measures, in proportion to the real number of distinct items, the deviation that is expected in the estimated result. An approximation of this value, using the basic LogLog algorithm, is given by the next equation, where σ represents the standard error.

σ ≈ 1.30√

m (2.18)

Using the improvements of the Super-LogLog algorithm, the accuracy increases. The standard error σ is now given by:

σ ≈ 1.05

m (2.19)

As the quantity m1 P

jM(j) is closely approximated by a Gaussian, the estimate

re-turned by the algorithms are within σ, 2σ and 3σ of the exact number of distinct items with a probability of 65%, 95% and 99%, respectively.

HyperLogLog

The HyperLogLog [22] is an improvement over the LogLog and the Super-LogLog Sketches. The algorithm was developed to estimate the distinct number of elements of a set, while being more memory efficient than its predecessors.

Data Structure The data structure needed is the same as its previous versions: an array of m buckets with dlog2(log2(Nmax))e bits each. The m memory units are all initialized

with zero. There was a “raw” version of the algorithm that instead of initializing the m memory units with zero, initializes them with −∞. However, for small cardinalities, the algorithm was very inaccurate, since the value 0 was always assumed whenever one of the memory units was not modified.

(45)

Chapter 2. Related Work 23

The hash function needed, h, should map the input items into hashed values whose bits are assumed to be independent and each one to have 0.5 probability of occurring. The second function needed, f , should also be present so it is possible to find the leftmost 1-bit in a binary string (one plus the length of the initial run of 0’s).

Methods The algorithm provides the same two methods as the LogLog Sketch: Up-date(i) and the Estimate().

Update(i): Whenever an item i arrives this method is called and the h function im-mediately hashes i. Let k = log2(m) be the number of bits of the hashed value that

determines an index j to a position in the array. The value of j is obtained by adding 1 to the value of the first k bits of h(i). Considering M (j) the value contained by the array in its position j, the value of M (j) in this stage is set to the maximum between its previous value and f (w), where w is the binary representation of h(i) without its first k bits. Algorithm 3 HyperLogLog

1: procedure UPDATE(i)

2: x = h(i)

3: j = 1 + hx1x2...xki2

4: w = xk+1xk+2...

5: M [j] = max(M [j], f (w)) 6: end procedure

Estimate(): This method returns the estimated number of distinct elements of the data set. In the “raw” version of the HyperLogLog algorithm, the value calculated with the equation bellow, where αm ∼ 0.72134 as m → +∞, is returned directly.

E = αm· m2· m X j=1 2−M (j) !−1 (2.20) In addition to initializing the memory units with zero instead of −∞, some other improvements were made to the “raw” version of the algorithm. These improvements regard the algorithm’s estimate operation, being applied over the value of E, calculated as described above.

Consider E∗ the improved estimate. To calculate its value the following rules are applied, in this order:

1. If E ≤ 52m, let V be the number of memory units with value 0. If V 6= 0, E∗ = m log(mV), otherwise E∗ = E;

(46)

Chapter 2. Related Work 24

3. If E > 301 · 232, then E

= −232log(1 − 2E32).

After that, the value of E∗ is found and can be returned, as it represents a good esti-mation for the number of distinct elements in a set.

Accuracy The standard error to be expected from the estimated values of the Hyper-LogLog are numerically close to 1.03896√

m .

The estimates returned by the algorithm are also approximately Gaussian and, for that reason, these values are expected to be within σ , 2σ, and 3σ of the exact count of distinct elements with respectively 65%, 95%, and 99% probability.

Multistage Filters

Multistage Filters [19, 21] is the name of an algorithm used to identify large flows, defined for sending, individually, more bytes than a defined threshold.

Data Structure The data structures used are a “flow memory”, which is an array of flow IDs designed to contain the flows that sent more packets than a threshold T , and d arrays (the stages) of counters, each one with a different and independent hash function hj associated.

Methods The sketch provides two methods: the update method that is called to every packet that arrives and an estimate method, that returns the flows that probably sent more than a threshold of packets.

Update(i): Whenever a packet i arrives, the d hash functions compute the flow ID of i. These calculations can be made in parallel. The result of each hash function is a counter, that is then incremented by the size of i. After that, if all counters that are mapped by the functions are above the threshold T , the flow ID of i is finally inserted in the flow memory. This way, the effect of collisions is decreased, attenuating the probability of false positives, as only the flow IDs that maps to counters with values above T at all stages are inserted in the flow memory.

Estimate(): In this algorithm, this operation is very simple as the IDs of the flows that are estimated to be “large” is the content of the flow memory array.

Accuracy This algorithm guarantees that all flows that sent more bytes than the thresh-old are in the flow memory, as there are no false negatives, only false positives.

Consider the following notation: • b the number of counters in a stage;

(47)

Chapter 2. Related Work 25

• s the size of a flow (in bytes); • d the number of stages;

• C the number of bytes that can be sent during the entire measurement interval; • k = T ·b

C .

The probability of a flow of size s < T (1 −1k) be inserted in the flow memory is given by: p ≤ 1 k · T T − s d (2.21)

2.2

Security in Sketch-Based Monitoring

A major problem that sketch-based algorithms have to face is the limited memory and CPU available. Today’s network links operate at very high speeds, decreasing the time budget a switch has to spend with each packet. This constraint has led to monitoring approaches that completely neglect security in favor of ones that minimize the time and space requirements. In some controlled environments this lack of security might be ac-ceptable because threats are limited, but in general it is difficult to assume that no attacks will ever occur. In addition, monitoring activities are often used in the context of network defense applications, such as anomaly detection and intrusion prevention. Therefore, if the monitoring algorithms are insecure then their results may not be trustworthy, what makes their activities worthless or, in a worse case, counter-productive — since corrupted results could lead the network administrator to take inappropriate actions. Therefore, se-curing the monitoring function is crucial to ensure that the decisions are always adequate. Every sketch-based algorithm makes use of one or more hash functions. If these functions are not secure, the entire sketch is vulnerable. In the implementation guidelines for the Count-Min sketch [17], for example, the authors say that the hash functions do not need to be particularly strong (as the cryptographic ones are). Some sketch’s authors opt not to specify what kind of hash functions are needed and others, like the AMS Sketch’s, suggest polynomial ones based on the module operation. A malicious user can exploit this fact to benefit himself, to harm someone else or to simply corrupt the correct operation of the algorithm.

The severity of an attack to a sketch depends on the level of influence the adversary has on the monitored network. We assume the adversary may be anywhere inside the monitored network but that he is not in control of the device where the monitoring solution is deployed. The next subsection tries to describe some vulnerabilities of the sketches, according to adversaries with different capabilities.

(48)

Chapter 2. Related Work 26

2.2.1

Adversary Capabilities

In this subsection, some malicious actions a user can take are described. We assume an adversary that might be anywhere inside the network but that has not compromised the device where the monitoring solution is deployed. All details about the implemented algo-rithms are known to the adversary, and therefore he may be able to perform the following actions. Depending on his privileges in the network, the malicious user may be able to insert crafted packets into the link, to drop/modify other user’s packets or he may be only able to eavesdrop the link. Table 2.1 summarizes the identified attacks that can be made to the sketch-based algorithms we presented in section 2.1.

General Algorithm’s Dependent

Eavesdrop Only

– Predict the next algo-rithm’s actions

– Find potential victims for other attacks

– Not applicable

Delay Packets – Not applicable – Not applicable Drop Packets – Prevent the algorithm

to execute some action – Not applicable

Modify Packets

– Preimage and colli-sion attacks on the hash functions;

– Corrupt packet’s data

– Overflow the counters

– Add negative values to counters – Corrupt hash functions that map items to +1 or -1

– Choose values that the hash function of Virtual Bitmap will map to values out-side the flow space covered

Generate Traffic

– Preimage and colli-sion attacks on the hash functions

– Overcounting of frag-mented packets

– Overflow the counters

– Add negative values to counters – Corrupt hash functions that map items to +1 or -1

– Choose values that the hash function of Virtual Bitmap will map to values out-side the flow space covered

– Adjust the behavior considered normal by the K-Ary algorithm to fit the behav-ior of a future attack.

Table 2.1: Attacks against sketch-based algorithms

We also inspected the security vulnerabilities of one of the sketch-based algorithms in particular – the Count-Min algorithm. The results are presented in Table 2.2. This table differs from Table 2.1 by presenting the attacks naive implementations of the Count-Min algorithm would be vulnerable to. Just to give an example, adding negative values to the algorithms’ counters is an example of an exclusive attack to the Count-Min algorithm. The Count-Min’s specification allows this operation since the algorithm was not originally

(49)

Chapter 2. Related Work 27

designed to be used specifically in a network monitoring context.

Attacks Examples

Eavesdrop Only

– Predict the next algo-rithm’s actions

– Find potential victims for other attacks

By eavesdropping the network the adver-sary will be able to build up the same data structure as the monitoring entity. He can use that to find out which buckets are not close to the threshold, identifying potential victims for future attacks. Delay Packets – Not applicable

Drop Packets – Prevent the algorithm to execute some action

By dropping packets and using the same data structure as the legitimate monitor-ing entity, the adversary is able to pre-vent the monitoring algorithm to take some action just before it does.

Modify Packets

– Preimage and colli-sion attacks on the hash functions

– Corrupt packet’s data – Overflow the counters – Add negative values to counters

Preimage: the adversary can choose a a packet and modify a different packet so that they both hash to the same value. Both packets will increment the same counters.

Overflow: by overflowing the counters, the adversary prevents the monitoring entity from reading an high value from the counter.

Negative values: by making negative the field in the packet that will be used to increment the counters, the adversary is able to decremented the counters instead.

Generate Traffic

– Preimage and colli-sion attacks on the hash functions

– Overcounting of frag-mented packets

– Overflow the counters – Add negative values to counters

Fragmented packets: the adversary can intercept packets and re-transmit those packets in smaller pieces. This ac-tion may trick the monitoring entity into counting each small packet as an individ-ual packet.

Table 2.2: Attacks against the Count-Min algorithm

Eavesdrop only

If the adversary is placed right before the monitoring device, he can fill his own data structure, which will become the same as the built up by the legitimate monitoring task. Knowing the implementation details of the algorithm and possessing the exact same cap-tured information as the monitoring task allows the adversary to predict the actions that

Imagem

Figure 1.1: Work Plan
Figure 2.1: Count-Min sketch data structure with width w = 4 and depth d = 3
Figure 2.2: Count sketch data structure with width w = 4 and depth d = 3
Figure 2.3 represents a Bloom Filter of size b = 16 and k = 3 to which three elements are inserted: a, b and c.
+7

Referências

Documentos relacionados

the flow rate parameter, are written in the EEPROM (Electrically Erasable Programmable Read Only Memory) of the on-board controller. • The PTA agents, based on the

(2014) y la búsqueda constante de información online por parte de esta franja etaria que plantean Sanmiguel y Sádaba (2019). b) La globalización que aporta Internet (Flores,

Embora o rato ótico utilizado já possua toda a eletrónica necessária ao funcio- namento do sensor, nomeadamente com a utilização de um microcontrolador para o tratamento de dados

Hallucinações — Ambroise Parc, que alliava á mais consummada perícia do cirurgião a sagacidade d'urn escrupuloso observador, foi dos primeiros a at- tentai- nas hallucinações

Revista Científica Eletrônica de Medicina Veterinária é uma publicação semestral da Faculdade de Medicina veterinária e Zootecnia de Garça – FAMED/FAEF e Editora FAEF,

Para se obter a taxa de gás produzido pela célula, ter-se-ão que relacionar diversas variáveis, tais como tensão da fonte, número de placas dos eléctrodos, resistividade do

Solicitamos a sua colaboração no presente estudo que versa sobre a “Arbitragem no Desporto Escolar”, no âmbito do mestrado em ensino da educação física nos

Diante desse contexto, o presente trabalho apresenta a proposta de um modelo de reaproveitamento do óleo de cozinha, para a geração de biodiesel que