Web Graph compression

(1)

Web Graph compression

(2)

Outline

• Literature review:

– Connectivity Server;

– Link Database;

– WebGraph Framework;

– Extra literature.

• Compression results comparison;

• Final considerations.

(3)

Web Graph

A web graph relative to a set of URLs is a directed graph having those URLs as the set of nodes. An arc u→v is identified for each hyperlink from a URL u towards a URL v.

– URLs that appear neither as sources nor in more than T (4) pages are ignored;

– URLs are normalized by converting hostnames to lower case, canonicalizing port numbers (re-introducing them where needed), and adding a trailing slash to all URLs that lack one.

(4)

Main features of Web Graphs

Locality: usually most of the hyperlinks are local, i.e, they point to other URLs on the same host. The literature reports that on average 80% of the hyperlinks are local.

Consecutivity: links within the same page are likely to be consecutive with respect to the lexicographic order.

(5)

Main features of WebGraphs

Similarity: Pages on the same host tend to have many hyperlinks pointing to the same pages.

Consecutivity is the dual of distance-one similarity.

(6)

Literature

Connectivity Server (1998) – Digital Systems Research Center and Stanford University – K. Bharat, A. Broder, M. Henzinger, P. Kumar, S. Venkatasubramanian;

Link Database (2001) – Compaq Systems Research Center – K. Randall, R. Stata, R. Wickremesinghe, J. Wiener;

WebGraph Framework (2002) – Università degli Studi di Milano – P. Boldi, S. Vigna.

(7)

Connectivity Server

Tool for web graphs visualisation, analysis (connectivity, ranking pages) and URLs compression.

Used by Alta Vista;

Links represented by outgoing and incoming adjacency lists;

Composed of:

URL Database: URL, fingerprint, URL-id;

Host Database: group of URLs based on the hostname portion;

Link Database: URL, outlinks, inlinks.

(8)

Connectivity Server: URL compression

URLs are sorted lexicographically and stored as a delta encoded entry (70% reduction).
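The delta encoding of sorted URLs can be sketched as follows. This is a minimal illustration, assuming each entry stores the length of the prefix shared with the previous URL plus the remaining suffix; the function names and the sample URLs are hypothetical and not taken from the Connectivity Server itself.

```python
import os

def delta_encode_urls(urls):
    """Encode sorted URLs as (shared-prefix length, remaining suffix) pairs."""
    encoded, previous = [], ""
    for url in sorted(urls):
        shared = len(os.path.commonprefix([previous, url]))
        encoded.append((shared, url[shared:]))
        previous = url
    return encoded

def delta_decode_urls(encoded):
    """Rebuild the sorted URL list from the (shared, suffix) pairs."""
    urls, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        urls.append(previous)
    return urls

# Illustrative URLs (assumption, not from the slides).
urls = ["www.foo.com/", "www.foo.com/gandhi.html", "www.foobar.com/"]
encoded = delta_encode_urls(urls)
print(encoded)
```

Decoding entry k requires walking back to the last fully stored URL, which is why an index over the delta-encoded entries is needed for fast access.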

[Figure: delta encoding of the URLs, and indexing of the delta encoding]

(9)

Link1: first version of Link Database

No compression: simple representation of outgoing and incoming adjacency lists of links.

Avg. inlink size: 34 bits

Avg. outlink size: 24 bits

(10)

Link2: second version of Link Database

Single-list compression and starts compression

Avg. inlink size: 8.9 bits

Avg. outlink size: 11.03 bits

(11)

Delta Encoding of the Adjacency Lists

Each array element is 32 bits long.

(12)

Delta Encoding of the Adjacency Lists

-3 = 101-104 (first item)

42 = 174-132 (other items)
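A minimal sketch of this delta encoding, assuming the first item is stored as its difference from the source URL-id and every other item as its difference from the previous one. The function names and the URL-ids are illustrative, chosen so that the two values shown on the slide (-3 and 42) reappear.

```python
def delta_encode(source_id, outlinks):
    """First delta is relative to the source id, the rest to the previous target."""
    deltas, previous = [], source_id
    for target in sorted(outlinks):
        deltas.append(target - previous)
        previous = target
    return deltas

def delta_decode(source_id, deltas):
    """Invert delta_encode by accumulating the differences."""
    targets, previous = [], source_id
    for delta in deltas:
        previous += delta
        targets.append(previous)
    return targets

# Hypothetical adjacency list of URL-id 104.
deltas = delta_encode(104, [101, 132, 174])
print(deltas)
```

Only the first delta can be negative, which is why a signed code is needed for it and an unsigned one suffices for the others.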

(13)

Link3: third version of Link Database

Interlist compression with representative list

Avg. inlink size: 5.66 bits

Avg. outlink size: 5.61 bits

(14)

Nybble Code

• The low-order bit of each nybble indicates whether or not there are more nybbles in the string

• The least-significant data bit encodes the sign.

• The remaining bits provide an unsigned number

28 = 0111 1000

-28 = 1111 0010
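The two examples can be reproduced with a short sketch, assuming each nybble carries three data bits (most significant group first) followed by the continuation bit, and that 28 is shown in its unsigned form while -28 uses the signed form. The function names are illustrative.

```python
def nybble_unsigned(n):
    """Encode a non-negative integer as nybbles: 3 data bits + 1 'more' bit."""
    groups = []
    while True:
        groups.append(n & 0b111)   # take the lowest three data bits
        n >>= 3
        if n == 0:
            break
    groups.reverse()               # most significant group first
    nybbles = []
    for i, group in enumerate(groups):
        more = 1 if i < len(groups) - 1 else 0
        nybbles.append(format((group << 1) | more, "04b"))
    return " ".join(nybbles)

def nybble_signed(n):
    """Append the sign as the least-significant data bit, then encode."""
    sign = 1 if n < 0 else 0
    return nybble_unsigned((abs(n) << 1) | sign)

print(nybble_unsigned(28))
print(nybble_signed(-28))
```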

(15)

Starts array compression

• The URLs are divided into three partitions based on their degree;

• Elements of starts are indices to nybbles;

• The literature reports that 74% of the entries are in the low-degree partition.

(16)

Starts array compression

Z(x) = max(indegree(x), outdegree(x))

P = the number of pages in each block.

Entry range        Partition                 # bits
Z(x) > 254         high-degree partition     32
24 ≤ Z(x) ≤ 254    medium-degree partition   (32+P*16)/P
Z(x) < 24          low-degree partition      (32+P*8)/P
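As a worked example of the per-entry cost, the sketch below evaluates the bit counts for each partition. The block size P = 16 is an assumption made only for illustration; the slides do not state the value actually used, and the function name is hypothetical.

```python
def starts_bits_per_entry(z, pages_per_block):
    """Average bits per starts entry for a page with Z(x) = z."""
    if z > 254:                      # high-degree partition: plain 32-bit index
        return 32.0
    offset_bits = 16 if z >= 24 else 8   # medium vs. low-degree partition
    # One 32-bit base per block plus one small offset per page.
    return (32 + pages_per_block * offset_bits) / pages_per_block

P = 16  # assumed block size
low = starts_bits_per_entry(10, P)
medium = starts_bits_per_entry(100, P)
high = starts_bits_per_entry(300, P)
print(low, medium, high)
```

With this assumed P, a low-degree entry costs 10 bits instead of 32, which matters because most entries fall in that partition.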

(17)

Interlist Compression

ref : relative index of the representative adjacency list;

deletes: set of URL-ids to delete from the representative list;

adds: set of URL-ids to add to the representative list.

LimitSelect-K-L: chooses the best representative adjacency list from among the previous K (8) URL-ids' adjacency lists and only allows chains of fewer than L (4) hops.
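The ref/deletes/adds scheme can be sketched as below. This is a simplified illustration: the cost of a candidate is assumed to be the number of deletes plus adds, ref = 0 is assumed to mean "no representative", and the chain-length limit L is not modelled. All names are hypothetical.

```python
def encode_against(representative, target):
    """Return (deletes, adds) that turn the representative list into the target."""
    rep, tgt = set(representative), set(target)
    return sorted(rep - tgt), sorted(tgt - rep)

def decode_against(representative, deletes, adds):
    """Rebuild the target list from the representative plus deletes and adds."""
    return sorted((set(representative) - set(deletes)) | set(adds))

def choose_reference(lists, index, window=8):
    """Pick the best representative among the previous `window` lists."""
    target = lists[index]
    best, best_cost = (0, [], sorted(target)), len(target)  # ref 0: no representative
    for ref in range(1, min(window, index) + 1):
        deletes, adds = encode_against(lists[index - ref], target)
        cost = len(deletes) + len(adds)
        if cost < best_cost:
            best, best_cost = (ref, deletes, adds), cost
    return best

# Illustrative adjacency lists of three consecutive URL-ids.
lists = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 11]]
print(choose_reference(lists, 1))
print(choose_reference(lists, 2))
```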

(18)

ζ-codes (WebGraph Framework)

Interlist compression with representative list

Avg. inlink size: 3.08 bits

Avg. outlink size: 2.89 bits

(19)

Compressing Gaps

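Successor list $S(x) = \{s_1 - x,\ s_2 - s_1 - 1,\ \ldots,\ s_k - s_{k-1} - 1\}$

For negative entries (only the first entry can be negative), the value is mapped to a non-negative integer:

$\nu(x) = 2x$ if $x \geq 0$, and $\nu(x) = 2|x| - 1$ if $x < 0$

[Table: uncompressed adjacency list]

[Table: adjacency list with compressed gaps]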
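Gap compression can be sketched as follows. The mapping of the (possibly negative) first entry follows the worked values on the later interval slide, where non-negative differences are doubled and negative ones become 2|x| - 1. The node id and successor list are illustrative, and the names are hypothetical.

```python
def to_natural(x):
    """Map an integer to a non-negative one: 2x if x >= 0, else 2|x| - 1."""
    return 2 * x if x >= 0 else 2 * abs(x) - 1

def compress_gaps(x, successors):
    """Gaps of the sorted successor list of node x."""
    s = sorted(successors)
    if not s:
        return []
    gaps = [to_natural(s[0] - x)]                      # first entry: relative to x
    gaps += [b - a - 1 for a, b in zip(s, s[1:])]      # others: gap minus one
    return gaps

# Illustrative successor list of node 16.
gaps = compress_gaps(16, [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041])
print(gaps)
```

Most gaps are small (many are zero), which is what makes a variable-length code for small integers effective here.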

(20)

Using copy lists

• Each bit on the copy list informs whether the corresponding successor of y is also a successor of x;

• The reference list index ref. is chosen as the value between 0 and W (window size) that gives the best compression.

Uncompressed adjacency list

Adjacency list with copy lists.
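A minimal sketch of building a copy list for node x against a reference node y. The successor lists are illustrative, the names are hypothetical, and the search for the best reference within the window W is not shown.

```python
def copy_list_encode(ref_successors, successors):
    """Return (copy list bits, extra nodes) for x relative to reference y."""
    wanted = set(successors)
    bits = [1 if s in wanted else 0 for s in ref_successors]
    copied = set(ref_successors) & wanted
    extra = [s for s in successors if s not in copied]
    return bits, extra

def copy_list_decode(ref_successors, bits, extra):
    """Rebuild the successors of x from y's list, the copy list and the extras."""
    copied = [s for s, bit in zip(ref_successors, bits) if bit]
    return sorted(copied + extra)

# Illustrative successor lists of nodes 15 (reference) and 16.
succ_15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
succ_16 = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
bits, extra = copy_list_encode(succ_15, succ_16)
print(bits, extra)
```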

(21)

Using copy blocks

• The last block is omitted;

• The first copy block is 0 if the copy list starts with 0;

• The length is decremented by one for all blocks except the first one.

Adjacency list with copy blocks.
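The three rules above amount to a run-length encoding of the copy list. The sketch below assumes the first block always counts 1s (so it is 0 when the list starts with 0); the copy list used is illustrative and the function names are hypothetical.

```python
from itertools import groupby

def copy_blocks_encode(bits):
    """Run-length encode a copy list into copy blocks."""
    runs = [0] if bits and bits[0] == 0 else []      # empty first block of 1s
    runs += [len(list(group)) for _, group in groupby(bits)]
    runs = runs[:-1]                                 # the last block is omitted
    if not runs:
        return []
    return [runs[0]] + [r - 1 for r in runs[1:]]     # decrement all but the first

def copy_blocks_decode(blocks, length):
    """Expand copy blocks back into a copy list of the given length."""
    bits, bit = [], 1
    for i, block in enumerate(blocks):
        bits += [bit] * (block if i == 0 else block + 1)
        bit ^= 1
    return bits + [bit] * (length - len(bits))       # implicit last block

copy_list = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
blocks = copy_blocks_encode(copy_list)
print(blocks)
```

The omitted last block is recovered from the length of the reference list, which the decoder already knows.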

(22)

Compressing intervals

• Intervals: represented by their left extreme and length;

• Interval lengths: decremented by the threshold L_min;

• Residuals: compressed using differences.

Adjacency list with intervals.

(23)

Compressing intervals

Adjacency list with intervals.

50 = ?

0 = (15-15)*2

600 = (316-16)*2

5 = |13-15|*2-1

3018 = 3041-22-1
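Two of the values above (600 and 3018) can be reproduced with the sketch below. It assumes L_min = 2, that the first left extreme and the first residual are stored relative to the node id with the same sign mapping as for gaps, and that later left extremes are stored as the distance from the previous right extreme minus 2. These choices and all names are assumptions for illustration, not details given on the slides.

```python
def to_natural(x):
    """2x if x >= 0, else 2|x| - 1."""
    return 2 * x if x >= 0 else 2 * abs(x) - 1

def split_intervals(nodes, l_min=2):
    """Split a list into runs of consecutive ids (length >= l_min) and residuals."""
    nodes, intervals, residuals, i = sorted(nodes), [], [], 0
    while i < len(nodes):
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1
        if j - i + 1 >= l_min:
            intervals.append((nodes[i], j - i + 1))
        else:
            residuals.extend(nodes[i:j + 1])
        i = j + 1
    return intervals, residuals

def encode_intervals(x, nodes, l_min=2):
    """Return (left extremes, lengths, residual gaps) for node x."""
    intervals, residuals = split_intervals(nodes, l_min)
    lefts, lengths, prev_right = [], [], None
    for left, length in intervals:
        lefts.append(to_natural(left - x) if prev_right is None
                     else left - prev_right - 2)
        lengths.append(length - l_min)
        prev_right = left + length - 1
    gaps = []
    if residuals:
        gaps = [to_natural(residuals[0] - x)]
        gaps += [b - a - 1 for a, b in zip(residuals, residuals[1:])]
    return lefts, lengths, gaps

# Extra nodes of node 16 (illustrative).
lefts, lengths, gaps = encode_intervals(16, [22, 316, 317, 3041])
print(lefts, lengths, gaps)
```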

(24)

Compression comparison

Using different computers and compilers.

           Inlink size   Outlink size   Access time   # pages (million)   # links (million)   Database
Huff.      15.2          15.4           112           -                   320                 WebBase
Link1      34            24             13            61                  1000                Web Crawler Mercator
Link2      8.9           11.03          47            61                  1000                Web Crawler Mercator
Link3      5.66          5.61           248           61                  1000                Web Crawler Mercator
ζ-codes    3.25          2.18           206           18.5                300                 .uk domain
s-Node     5.07          5.63           298           -                   900                 WebBase

(25)

Extra Literature

Towards compressing Web graphs (2001), University of Massachusetts, Harvard University, M. Adler, M. Mitzenmacher – theoretical study of Web-graph compression.

Compressing the graph structure of the web (2001), Polytechnic University (NY), T. Suel, J. Yuan – considers only outlinks and stores 14 bits per link.

Representing web graphs (2002), Stanford University, S. Raghavan, H. Garcia-Molina – proposes a new representation for Web graphs that reduces access time.

(26)

Conclusions

The compression techniques are specialized for Web Graphs.

The average link size decreases as the graph grows.

The average link access time increases as the graph grows.

The ζ-codes seem to have the best trade-off between avg. bit size and access time.

(27)

Counting Triangles in Data Streams

Luciana Salete Buriol⁺, Gereon Frahling*, Stefano Leonardi⁺, Christian Sohler*, Alberto Marchetti Spaccamela⁺

*Heinz Nixdorf Institute & Computer Science Department, University of Paderborn, Germany

⁺University of Rome “La Sapienza”, Computer and Systems Department, Rome, Italy

(28)

Outline of the Talk

Introduction of the problem, motivation and objectives;

Data stream algorithm for computing triangles in a graph read as an arbitrary list of edges;

Data stream algorithm for computing triangles in a graph read as sequence of adjacency lists of nodes;

Computing k3,3 of a graph in a streaming fashion.

(29)

Data Stream Algorithms

In the data stream model, data arrives in a stream, one item at a time. The algorithms that process the stream have the task of handling and mining massive data streams by summarizing the streams in small space.

Main applications:

When the streams are not stored and must be processed on the fly as they are produced;

When the memory or time for storing or processing the stream is limited;

When an exact computation is too time consuming and just a good estimation of the underlying data is required.

(30)

Using Data Stream Algorithms for

Computing Topological and Statistical Properties of the Webgraph

Motivation:

Graph minors, such as triangles and bipartite cliques, are the cores of web communities. The number of such minors is an important webgraph measure.

Counting the number of triangles of a graph is a fundamental problem in network analysis.

For computing properties of the webgraph, we have to deal with massive data. For many computations, it is better to have an approximation using less time and memory than to compute exact values at the cost of excessive time or memory.

(31)

Using Data Stream Algorithms for

Computing Topological and Statistical Properties of the Webgraph

Objectives:

To compute approximately statistical and topological properties of webgraphs using limited space and time.

(32)

Webgraph as a Stream

The data stream consists of a sequence of the arcs ∈ E;

Each item of the stream is an arc of the graph

A traditional data stream application considers arcs in an arbitrary order

Depending on the application, we can consider some order in the stream. For example, if some property is maintained during a crawling process, the entire adjacency list of outgoing arcs of each node is extracted consecutively.

(33)

Samples and Sketches

We can divide data stream algorithms into two main groups: sampling algorithms and sketching algorithms.

Samples: select a subset of the items and check some specific property on them;

Define the kind of sample and the sample size

Sketches: inner products or aggregates of subsets of items, computed with different hash functions, that compactly describe the sets in each inner product

Define the kind of hash function and how many to use

(34)

Counting Triangles in Data Streams

• Given a graph G=(V,E), where V is the set of vertices and E the set of edges, consider all triples of nodes ∈ V;

• We can find four types of structures, considering a triple of nodes and the number of edges connecting them.

Let T₀, T₁, T₂ and T₃ denote the sets of triples that have 0, 1, 2 and 3 edges ∈ E, respectively.

(35)

Counting Triangles in Data Streams

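Previous best results by Bar-Yossef, Kumar and Sivakumar: Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs, 2002: space

$O\left(\frac{1}{\epsilon^3} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{T_1 + T_2}{T_3}\right)^3 \cdot \log n\right)$

• Our results: space

$O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{T_1 + T_2}{T_3}\right)\right)$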

(36)

Counting Triangles in Data Streams

We take an edge e=(a,b) ∈ E and a node v ∈ V \ {a,b}, and look for the missing edges.

• The following property holds for any graph:

$T_1 + 2T_2 + 3T_3 = |E| \cdot (|V| - 2)$

• Triples belonging to T₀ are not reached.

[Figure: an edge (a,b) and a third node v; the edges (a,v) and (b,v) are the ones to look for]
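The identity follows because each of the |E|(|V|-2) pairs (edge, third node) reaches exactly one triple, and a triple with k edges is reached k times. A brute-force check on a small graph illustrates it; the example graph and the function name are made up for this purpose.

```python
from itertools import combinations

def triple_counts(nodes, edges):
    """Return [T0, T1, T2, T3]: number of node triples with 0..3 edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        counts[k] += 1
    return counts

# Illustrative graph: one triangle (0,1,2) with a tail 2-3-4.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
t0, t1, t2, t3 = triple_counts(nodes, edges)
print(t0, t1, t2, t3)
print(t1 + 2 * t2 + 3 * t3, len(edges) * (len(nodes) - 2))
```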

(37)

A 3-pass streaming algorithm

We introduce a 3-pass streaming algorithm that samples an arbitrary edge (a,b) and a node v≠a,b, and outputs a variable β ∈ {0,1}:

1st pass: count the number of edges |E| in the stream.

2nd pass: sample an edge e=(a,b) chosen uniformly among all edges of the stream. Choose a node v uniformly from V\{a,b}.

3rd pass: test whether the edges (a,v) and (b,v) are present in the stream. If (a,v) ∈ E and (b,v) ∈ E then output β=1, else output β=0.
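The three passes can be sketched as below. This is an illustration only: the stream is modelled as a function that returns a fresh iterator over the edges on each call, the node set V is assumed to be known in advance, and all names are hypothetical.

```python
import random
from itertools import combinations

def three_pass_beta(stream, nodes, rng):
    """One instance of the 3-pass sampling; returns beta in {0, 1}."""
    # Pass 1: count the edges.
    num_edges = sum(1 for _ in stream())
    # Pass 2: pick the k-th edge for a uniformly random k, then a third node.
    k = rng.randrange(num_edges)
    for i, (a, b) in enumerate(stream()):
        if i == k:
            break
    v = rng.choice([u for u in nodes if u != a and u != b])
    # Pass 3: look for the two missing edges (a,v) and (b,v).
    needed = {frozenset((a, v)), frozenset((b, v))}
    seen = {frozenset(e) for e in stream() if frozenset(e) in needed}
    return 1 if seen == needed else 0

rng = random.Random(0)
k4_nodes = [0, 1, 2, 3]
k4_edges = list(combinations(k4_nodes, 2))     # complete graph: always a triangle
path_edges = [(0, 1), (1, 2), (2, 3)]          # triangle-free
print(three_pass_beta(lambda: iter(k4_edges), k4_nodes, rng))
print(three_pass_beta(lambda: iter(path_edges), k4_nodes, rng))
```

Apart from the counter, only the sampled edge, the node v and two flags are kept between passes.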

(38)

A 3-pass streaming algorithm

The streaming algorithm outputs a value β having expected value:

$E[\beta] = \frac{3T_3}{T_1 + 2T_2 + 3T_3}$

• Furthermore:

$T_3 = E[\beta] \cdot \frac{|E| \cdot (|V| - 2)}{3}$

(39)

A 3-pass streaming algorithm

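There is a streaming algorithm that outputs a value $\tilde{T}_3$ satisfying, with probability 1-δ:

$(1 - \epsilon) \cdot T_3 < \tilde{T}_3 < (1 + \epsilon) \cdot T_3$

We start r parallel instances of the 3-pass algorithm, and each one outputs a value β_i:

$\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \frac{|E| \cdot (|V| - 2)}{3}$

$r \geq \frac{2}{\epsilon^2} \cdot \frac{T_1 + 2T_2 + 3T_3}{T_3} \cdot \ln\frac{1}{\delta}$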

(40)

A 3-pass streaming algorithm

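We use $\frac{1}{r} \sum_{i=1}^{r} \beta_i$ as an estimator for $E[\beta] = \frac{3T_3}{T_1 + 2T_2 + 3T_3}$.

We estimate T₃ as:

$\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \frac{|E| \cdot (|V| - 2)}{3}$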
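To see that the estimator is scaled correctly, the sketch below replaces the r random samples by the whole sample space, that is, one β for every (edge, third node) pair, so that the average equals E[β] exactly. The example graph and the function names are illustrative.

```python
def all_betas(nodes, edges):
    """Beta for every (edge, third node) pair: the full sample space."""
    edge_set = {frozenset(e) for e in edges}
    betas = []
    for a, b in edges:
        for v in nodes:
            if v != a and v != b:
                closes = frozenset((a, v)) in edge_set and frozenset((b, v)) in edge_set
                betas.append(1 if closes else 0)
    return betas

def estimate_triangles(betas, num_edges, num_nodes):
    """Estimate T3 as mean(beta) * |E| * (|V| - 2) / 3."""
    mean_beta = sum(betas) / len(betas)
    return mean_beta * num_edges * (num_nodes - 2) / 3

# Illustrative graph with exactly one triangle (0,1,2).
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
betas = all_betas(nodes, edges)
estimate = estimate_triangles(betas, len(edges), len(nodes))
print(sum(betas), len(betas), estimate)
```

With r random instances instead of the full enumeration, the same function returns an approximation whose quality depends on r.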

(41)

A 3-pass streaming algorithm

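Proof by Chernoff Bounds

$\Pr\left[\frac{1}{r} \sum_{i=1}^{r} \beta_i \geq (1 + \epsilon) \cdot E[\beta]\right] \leq e^{-\epsilon^2 \cdot E[\beta] \cdot r / 3}$

$\Pr\left[\frac{1}{r} \sum_{i=1}^{r} \beta_i \leq (1 - \epsilon) \cdot E[\beta]\right] \leq e^{-\epsilon^2 \cdot E[\beta] \cdot r / 2}$

Setting

$r \geq \frac{2}{\epsilon^2} \cdot \frac{T_1 + 2T_2 + 3T_3}{T_3} \cdot \ln\frac{1}{\delta}$

both probabilities together are bounded by δ.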

(42)

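A 3-pass streaming algorithm

We suppose that the events within the brackets do not occur. In this case:

$\frac{1}{r} \sum_{i=1}^{r} \beta_i < (1 + \epsilon) \cdot E[\beta]$

$\Rightarrow \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \frac{|E| \cdot (|V| - 2)}{3} < (1 + \epsilon) \cdot E[\beta] \cdot \frac{|E| \cdot (|V| - 2)}{3}$

$\Rightarrow \tilde{T}_3 < (1 + \epsilon) \cdot T_3$

• The same argument is used to obtain:

$\tilde{T}_3 > (1 - \epsilon) \cdot T_3$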

(43)

One pass algorithm

A uniform choice of an edge in one pass can be achieved by choosing the first edge as the sample edge and replacing it by the i-th edge of the stream with probability 1/i.

When a sample is chosen, some of the other arcs of its triangle may already have passed in the stream and are missed. A triangle is therefore detected only if the sampled edge is the first of its three edges to appear in the stream, which happens with probability 1/3.

(44)

Sample one-pass

i ← 1;

for each edge e_s = (a_s, b_s) in the stream do:

  flip a coin. With probability 1/i do:

    a ← a_s; b ← b_s;

    v ← node uniformly chosen from V \ {a,b};

    x ← false; y ← false;

  end do

  if e_s = (a,v) then x ← true;

  if e_s = (b,v) then y ← true;

  i ← i+1;

end for

if x = true and y = true return β=1 else return β=0

[Figure: sampled edge (a,b) and node v]
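A runnable sketch of the pseudocode above, assuming undirected edges (so an edge matches (a,v) in either orientation) and a node set V known in advance. The function name and the test stream are illustrative.

```python
import random

def sample_one_pass(stream, nodes, rng):
    """One-pass sampling: returns beta in {0, 1}."""
    a = b = v = None
    x = y = False
    for i, (s, t) in enumerate(stream, start=1):
        # Replace the sample by the i-th edge with probability 1/i.
        if rng.random() < 1.0 / i:
            a, b = s, t
            v = rng.choice([u for u in nodes if u != a and u != b])
            x = y = False
        # Only edges arriving after the sample was taken can be seen.
        edge = {s, t}
        if edge == {a, v}:
            x = True
        if edge == {b, v}:
            y = True
    return 1 if x and y else 0

rng = random.Random(1)
triangle = [(0, 1), (1, 2), (0, 2)]
runs = 3000
hits = sum(sample_one_pass(triangle, [0, 1, 2], rng) for _ in range(runs))
print(hits / runs)   # close to 1/3: the sample must be the first triangle edge
```

On a single triangle the algorithm succeeds only when the first edge of the stream remains the sample, which is the 1/3 factor discussed earlier.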

(45)

Sample one-pass

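• The streaming algorithm outputs a value β having expected value:

$E[\beta] = \frac{1}{3} \cdot \frac{3T_3}{T_1 + 2T_2 + 3T_3} = \frac{T_3}{T_1 + 2T_2 + 3T_3}$

• The size of the sample:

$r \geq \frac{6}{\epsilon^2} \cdot \frac{T_1 + 2T_2 + 3T_3}{T_3} \cdot \ln\frac{1}{\delta}$

We estimate T₃ as:

$\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot |E| \cdot (|V| - 2)$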

(46)

Results for a sample set of size 100

(47)

Considering a structured stream

Which kind of structure can benefit the algorithm and still be a natural and good representation of the graph?

Let’s consider the case where the adjacency lists of nodes are stored in sequence in the stream

No order is required within each adjacency list

In this case each arc is seen twice in the stream

(48)

Results on Incidence Stream

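• Our results:

$O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{T_2}{T_3}\right)\right)$

• Previous best results from Bar-Yossef, Kumar and Sivakumar: Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs, 2002

$O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{T_2}{T_3}\right)^2 \cdot \log n\right)$

d log n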

(49)

Incidence streams

For each pair of endpoint nodes of arcs leaving node i we have a path of length 2 (called a V from now on); a V consists of two edges with a common endpoint node.

For each node i, where d_i is its degree, the number of V's having node i in common is:

$\binom{d_i}{2} = \frac{d_i \cdot (d_i - 1)}{2}$

[Figure: incidence list of node u; a path of length 2 (called V)]

(50)

Counting triangles in incidence streams

In this case our sample is a V, and we check if the third arc is later seen in the stream

• The following property holds for any graph:

$T_2 + 3T_3 = \sum_{i=1}^{|V|} \frac{d_i \cdot (d_i - 1)}{2}$
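Every V lies in a triple with two edges (counted once) or in a triangle (counted three times, once per centre node), which gives the identity. A brute-force check on a small illustrative graph follows; the function names are hypothetical.

```python
from itertools import combinations

def count_vs(nodes, edges):
    """Number of paths of length 2, computed from the degrees."""
    degree = {u: 0 for u in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(d * (d - 1) // 2 for d in degree.values())

def count_t2_t3(nodes, edges):
    """Brute-force count of triples with exactly 2 and exactly 3 edges."""
    edge_set = {frozenset(e) for e in edges}
    t2 = t3 = 0
    for triple in combinations(nodes, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        t2 += (k == 2)
        t3 += (k == 3)
    return t2, t3

# Illustrative graph: one triangle (0,1,2) with a tail 2-3-4.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
t2, t3 = count_t2_t3(nodes, edges)
print(count_vs(nodes, edges), t2 + 3 * t3)
```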

(51)

Incidence 3-pass algorithm

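The algorithm outputs β with expected value:

$E[\beta] = \frac{3T_3}{T_2 + 3T_3}$

• The sample size in this case is:

$r \geq \frac{2}{\epsilon^2} \cdot \frac{T_2 + 3T_3}{T_3} \cdot \ln\frac{1}{\delta}$

• The approximation obtained is:

$\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \sum_{i=1}^{|V|} \frac{d_i \cdot (d_i - 1)}{6}$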
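The sampling idea can be sketched as follows. The slides do not spell out the individual passes, so this version makes several assumptions: the graph is held in memory as adjacency sets rather than read as a stream, a V is drawn uniformly by picking its centre node with probability proportional to d_i(d_i-1)/2, and at least one node has degree 2 or more. All names are hypothetical.

```python
import random

def incidence_beta(adjacency, rng):
    """Sample a uniform V and test whether its third arc exists."""
    nodes = list(adjacency)
    weights = [len(adjacency[u]) * (len(adjacency[u]) - 1) // 2 for u in nodes]
    centre = rng.choices(nodes, weights=weights, k=1)[0]
    a, b = rng.sample(sorted(adjacency[centre]), 2)
    return 1 if b in adjacency[a] else 0

def estimate_triangles(adjacency, r, rng):
    """Estimate T3 as mean(beta) * sum_i d_i (d_i - 1) / 6."""
    total_vs = sum(len(n) * (len(n) - 1) // 2 for n in adjacency.values())
    mean_beta = sum(incidence_beta(adjacency, rng) for _ in range(r)) / r
    return mean_beta * total_vs / 3

rng = random.Random(0)
k4 = {u: {w for w in range(4) if w != u} for u in range(4)}   # 4 triangles
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}                 # no triangle
print(estimate_triangles(k4, 50, rng), estimate_triangles(star, 50, rng))
```

In the complete graph every V closes, so the estimate equals the true count of 4 regardless of r; in general the result is only an approximation.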

(52)

Dimension of some graphs extracted from different sources

Number of triangles of the graphs

(53)

Comparing with “Finding, Counting and Listing Triangles in Large Graphs”, Schank and Wagner, 2004

(54)

Counting k3,3 in Data Streams

Let k3,3 denote the number of k3,3 minors and k3,1 denote the number of k3,1 minors

We propose a method for estimating the number of k3,3 of a graph

We suppose that the outdegree of the graph is bounded by d

The edges are sorted by destination nodes

We do not assume any order by source nodes

(55)

Sample

Our sample in this case is a k3,1 and 2 nodes not belonging to the k3,1

[Figure: sample with nodes a, b, c, u, v and w]

(56)

Conclusions and Future Work

Sampling gave better results than sketches for computing the number of small minors in graphs

Future work:

To implement the algorithms;

To verify which minors are important to compute;

To propose data stream algorithms for estimating the number of such minors