Web Graph compression
Outline
• Literature review:
– Connectivity Server;
– Link Database;
– WebGraph Framework;
– Extra literature.
• Compression results comparison;
• Final considerations.
Web Graph
A web graph relative to a set of URLs is a directed graph having those URLs as the set of nodes. An arc u→v is identified for each hyperlink from a URL u towards a URL v.
– URLs that do not appear either as sources or in more than T (4) pages are ignored;
– The URLs are normalized by converting hostnames to lower case, canonicalizing port numbers (re-introducing them where needed), and adding a trailing slash to all URLs that do not have one.
Main features of Web Graphs
Locality: most hyperlinks are local, i.e., they point to other URLs on the same host. The literature reports that on average 80% of the hyperlinks are local.
Consecutivity: links within the same page are likely to be consecutive with respect to the lexicographic order.
Main features of WebGraphs
Similarity: Pages on the same host tend to have many hyperlinks pointing to the same pages.
Consecutivity is the dual of distance-one similarity.
Literature
Connectivity Server (1998) – Digital Systems Research Center and Stanford University – K. Bharat, A. Broder, M. Henzinger, P. Kumar, S. Venkatasubramanian;
Link Database (2001) – Compaq Systems Research Center – K. Randall, R. Stata, R. Wickremesinghe, J. Wiener;
WebGraph Framework (2002) – Università degli Studi di Milano – P. Boldi, S. Vigna.
Connectivity Server
Tool for web graph visualisation, analysis (connectivity, page ranking) and URL compression.
Used by Alta Vista;
Links are represented by outgoing and incoming adjacency lists;
Composed of:
URL Database: URL, fingerprint, URL-id;
Host Database: group of URLs based on the hostname portion;
Link Database: URL, outlinks, inlinks.
Connectivity Server: URL compression
URLs are sorted lexicographically and stored as a delta encoded entry (70% reduction).
[Figure: delta encoding of URLs; indexing the delta encoding]
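The delta encoding of sorted URLs can be sketched as follows. This is a minimal illustration, assuming each entry stores the length of the prefix shared with the previous URL plus the remaining suffix; the function names and the example URLs are hypothetical, and the indexing of the encoded list shown in the figure is not covered.

```python
import os


def delta_encode(sorted_urls):
    """Store each URL as (shared-prefix length, remaining suffix)."""
    entries, prev = [], ""
    for url in sorted_urls:
        shared = len(os.path.commonprefix([prev, url]))
        entries.append((shared, url[shared:]))
        prev = url
    return entries


def delta_decode(entries):
    """Rebuild the URLs by reusing the prefix of the previous URL."""
    urls, prev = [], ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        urls.append(prev)
    return urls


# Hypothetical, already normalized URLs in lexicographic order.
urls = sorted([
    "http://www.example.com/",
    "http://www.example.com/a.html",
    "http://www.example.com/b/",
])
encoded = delta_encode(urls)
```

Decoding an entry requires the previous URL, which is why some form of index over the encoded list is needed for random access.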
Link1: first version of Link Database
No compression: simple representation of outgoing and incoming adjacency lists of links.
Avg. inlink size: 34 bits
Avg. outlink size: 24 bits
Link2: second version of Link Database
Single list compression and starts compression
Avg. inlink size: 8.9 bits
Avg. outlink size: 11.03 bits
Delta Encoding of the Adjacency Lists
Each array element is 32 bits long.
-3 = 101-104 (first item)
42 = 174-132 (other items)
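The two worked values above suggest the following sketch: the first item is stored relative to the URL-id of the page itself, every other item relative to its predecessor. The function names are hypothetical, and the list 101, 132, 174 for page 104 is assembled from the numbers on the slide.

```python
def delta_encode_adjacency(source_id, successors):
    """First delta is relative to the source id, the rest to the previous item."""
    deltas, prev = [], source_id
    for s in sorted(successors):
        deltas.append(s - prev)
        prev = s
    return deltas


def delta_decode_adjacency(source_id, deltas):
    """Invert delta_encode_adjacency by accumulating the deltas."""
    successors, prev = [], source_id
    for d in deltas:
        prev += d
        successors.append(prev)
    return successors


deltas = delta_encode_adjacency(104, [101, 132, 174])
```

Only the first delta can be negative, so only that entry needs a sign.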
Link3: third version of Link Database
Interlist compression with representative list
Avg. inlink size: 5.66 bits
Avg. outlink size: 5.61 bits
Nybble Code
• The low-order bit of each nybble indicates whether or not there are more nybbles in the string
• The least-significant data bit encodes the sign.
• The remaining bits provide an unsigned number.
28 = 0111 1000
-28 = 1111 0010
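A small sketch of the nybble code that reproduces both examples. The bit layout (three data bits per nybble, most significant group first, sign stored in the lowest data bit of signed values) is inferred from the two examples above; the function name is hypothetical.

```python
def nybble_encode(value, signed=False):
    """Encode an integer as a string of 4-bit nybbles.

    Each nybble holds three data bits followed by a low-order
    continuation bit (1 = more nybbles follow).
    """
    if signed:
        # The least-significant data bit carries the sign.
        data = (abs(value) << 1) | (1 if value < 0 else 0)
    else:
        if value < 0:
            raise ValueError("unsigned encoding needs a non-negative value")
        data = value

    groups = []
    while True:
        groups.append(data & 0b111)   # take three data bits at a time
        data >>= 3
        if data == 0:
            break
    groups.reverse()                  # most significant group first

    nybbles = []
    for i, group in enumerate(groups):
        more = 1 if i < len(groups) - 1 else 0
        nybbles.append((group << 1) | more)
    return " ".join(format(n, "04b") for n in nybbles)
```

Small deltas, which dominate because of locality, fit in a single nybble.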
Starts array compression
• The URLs are divided into three partitions based on their degree;
• Elements of starts are indices to nybbles;
• The literature reports that 74% of the entries are in
the low-degree partition.
Starts array compression
Z(x) = max(indegree(x), outdegree(x))
P = the number of pages in each block.
Entry range         Partition                  # bits
Z(x) > 254          high-degree partition      32
24 ≤ Z(x) ≤ 254     medium-degree partition    (32+P*16)/P
Z(x) < 24           low-degree partition       (32+P*8)/P
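The partition rule and the per-entry cost can be checked with a short sketch. The function names are hypothetical, the medium range is read as 24 ≤ Z(x) ≤ 254, and P = 16 is only an illustrative block size, not a value given on the slide.

```python
def starts_partition(indegree, outdegree):
    """Pick the partition of a URL from Z(x) = max(indegree, outdegree)."""
    z = max(indegree, outdegree)
    if z > 254:
        return "high"
    if z >= 24:
        return "medium"
    return "low"


def bits_per_entry(partition, pages_per_block):
    """Average number of bits per starts entry for a block of P pages."""
    p = pages_per_block
    if partition == "high":
        return 32.0
    if partition == "medium":
        return (32 + p * 16) / p
    return (32 + p * 8) / p


# Illustrative block size only.
low_cost = bits_per_entry("low", 16)
medium_cost = bits_per_entry("medium", 16)
```

With most entries in the low-degree partition, the average cost per entry stays close to the low-degree figure.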
Interlist Compression
ref : relative index of the representative adjacency list;
deletes: set of URL-ids to delete from the representative list;
adds: set of URL-ids to add to the representative list.
LimitSelect-K-L: chooses the best representative adjacency list from among the previous K (8) URL-ids' adjacency lists and only allows chains of fewer than L (4) hops.
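The ref/deletes/adds representation can be sketched as below. This is a simplified illustration: the cost of a candidate is assumed to be the number of deletes plus adds, ref = 0 is used here to mean "no representative", and the chain limit L is not enforced. All names and the example lists are hypothetical.

```python
def encode_against(reference, target):
    """Express target as edits (deletes, adds) to a representative list."""
    ref_set, tgt_set = set(reference), set(target)
    return sorted(ref_set - tgt_set), sorted(tgt_set - ref_set)


def decode_against(reference, deletes, adds):
    """Rebuild the target list from the representative list and the edits."""
    return sorted((set(reference) - set(deletes)) | set(adds))


def choose_reference(adjacency, x, K=8):
    """Pick the cheapest representative among the previous K lists.

    Returns (ref, deletes, adds); ref is the relative index x - y,
    or 0 when storing the list outright is no more expensive.
    """
    best = (0, [], sorted(adjacency[x]))
    best_cost = len(adjacency[x])
    for ref in range(1, K + 1):
        y = x - ref
        if y < 0:
            break
        deletes, adds = encode_against(adjacency[y], adjacency[x])
        cost = len(deletes) + len(adds)
        if cost < best_cost:
            best, best_cost = (ref, deletes, adds), cost
    return best


adjacency = [[10, 11, 12, 13], [10, 11, 12, 14], [50, 60]]
```

Limiting chain length matters because decoding a list requires first decoding its representative, and so on along the chain.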
ζ-codes (WebGraph Framework)
Interlist compression with representative list
Avg. inlink size: 3.08 bits
Avg. outlink size: 2.89 bits
Compressing Gaps
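Successor list: $S(x) = \{s_1 - x,\ s_2 - s_1 - 1,\ \dots,\ s_k - s_{k-1} - 1\}$
For negative entries (only the first gap, $s_1 - x$, can be negative), the value is mapped to a natural number: $\nu(z) = 2z$ if $z \ge 0$, and $\nu(z) = 2|z| - 1$ if $z < 0$.
[Tables: uncompressed adjacency list; adjacency list with compressed gaps]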
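Gap compression can be sketched as follows. The mapping for the (possibly negative) first gap is taken as 2z for z ≥ 0 and 2|z|-1 for z < 0, which matches worked values such as 0 = (15-15)*2 and 600 = (316-16)*2 elsewhere in these slides. The function names and the successor list of node 15 are illustrative assumptions.

```python
def nu(z):
    """Map an integer to a natural number: 2z if z >= 0, 2|z| - 1 otherwise."""
    return 2 * z if z >= 0 else 2 * abs(z) - 1


def compress_gaps(x, successors):
    """Gaps of a sorted successor list of node x.

    The first entry is s_1 - x (mapped through nu because it may be
    negative); every later entry is s_i - s_(i-1) - 1.
    """
    s = sorted(successors)
    if not s:
        return []
    gaps = [nu(s[0] - x)]
    gaps += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps


# Illustrative successor list for node 15.
gaps = compress_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034])
```

Runs of consecutive successors turn into runs of zeros, which short integer codes store cheaply.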
Using copy lists
• Each bit on the copy list informs whether the corresponding successor of y is also a successor of x;
• The reference list index ref. is chosen as the value between 0 and W (window size) that gives the best compression.
[Tables: uncompressed adjacency list; adjacency list with copy lists]
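A copy list can be computed as in this sketch: one bit per successor of the reference node y, plus the extra successors of x that do not appear in the reference list. The function name and both successor lists are illustrative assumptions.

```python
def copy_list(reference, successors):
    """Return (bits, extra) for the successors of x relative to a reference list.

    bits[i] is 1 when the i-th successor of the reference node is also a
    successor of x; extra holds the successors of x missing from the reference.
    """
    succ = set(successors)
    ref_set = set(reference)
    bits = [1 if r in succ else 0 for r in reference]
    extra = [s for s in successors if s not in ref_set]
    return bits, extra


# Illustrative lists: node 16 encoded against node 15 (ref = 1).
reference = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
bits, extra = copy_list(reference, successors)
```

Choosing ref then amounts to running this for each candidate in the window and keeping the cheapest encoding.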
Using copy blocks
• The last block is omitted;
• The first copy block is 0 if the copy list starts with 0;
• The length is decremented by one for all blocks except the first one.
[Tables: adjacency list with copy lists; adjacency list with copy blocks]
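The three rules above can be sketched as a run-length encoding of the copy list. The function names and the example bit vector are illustrative assumptions; the decoder needs the length of the reference list to rebuild the omitted last block.

```python
def copy_blocks(bits):
    """Run-length encode a copy list into copy blocks.

    Blocks alternate between runs of 1s and runs of 0s, starting with 1s
    (so the first block is 0 when the list starts with 0). The last block
    is omitted and every block except the first is decremented by one.
    """
    runs, expected, i = [], 1, 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == expected:
            j += 1
        runs.append(j - i)
        i = j
        expected ^= 1
    runs = runs[:-1]                       # last block is implicit
    if not runs:
        return []
    return [runs[0]] + [n - 1 for n in runs[1:]]


def expand_blocks(blocks, length):
    """Rebuild the copy list of the given length from its copy blocks."""
    bits, value = [], 1
    for k, b in enumerate(blocks):
        bits += [value] * (b if k == 0 else b + 1)
        value ^= 1
    bits += [value] * (length - len(bits))  # the omitted last block
    return bits


example_bits = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
blocks = copy_blocks(example_bits)
```

The decrement is safe because every block after the first has length at least one.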
Compressing intervals
• Intervals: represented by their left extreme and length;
• Interval lengths: decremented by the threshold $L_{\min}$;
• Residuals: compressed using differences.
[Tables: adjacency list with copy lists; adjacency list with intervals]
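Splitting a successor list into intervals and residuals can be sketched as below. The threshold value 2 is an assumption chosen for illustration, left extremes are kept as absolute values here rather than as differences, and the function name and input lists are hypothetical.

```python
def split_intervals(successors, l_min=2):
    """Split a successor list into intervals and residuals.

    Runs of at least l_min consecutive integers become
    (left extreme, length - l_min); everything else is a residual.
    """
    s = sorted(successors)
    intervals, residuals = [], []
    i = 0
    while i < len(s):
        j = i
        while j + 1 < len(s) and s[j + 1] == s[j] + 1:
            j += 1
        length = j - i + 1
        if length >= l_min:
            intervals.append((s[i], length - l_min))
        else:
            residuals.extend(s[i:j + 1])
        i = j + 1
    return intervals, residuals


# Illustrative inputs.
intervals_a, residuals_a = split_intervals([22, 316, 317, 3041])
intervals_b, residuals_b = split_intervals(
    [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034])
```

The residuals are then gap-compressed in the same way as a plain successor list.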
Compressing intervals
0 = (15-15)*2
600 = (316-16)*2
5 = |13-15|*2-1
3018 = 3041-22-1
50 = ?
Compression comparison
Using different computers and compilers.
Columns: method, inlink size (bits), outlink size (bits), access time, # pages (million), # links (million), database (the Huff. and s-Node rows have one numeric value fewer than the other rows).
Huff.: 15.2, 15.4, 112, 320, WebBase
Link1: 34, 24, 13, 61, 1000, Web Crawler Mercator
Link2: 8.9, 11.03, 47, 61, 1000, Web Crawler Mercator
Link3: 5.66, 5.61, 248, 61, 1000, Web Crawler Mercator
ζ-codes: 3.25, 2.18, 206, 18.5, 300, .uk domain
s-Node: 5.07, 5.63, 298, 900, WebBase
Extra Literature
Toward compressing Web graphs (2001), University of Massachusetts, Harvard University, M. Adler, M. Mitzenmacher – theoretical study of Web-graph compression.
Compressing the graph structure of the web (2001), Polytechnic University (NY), T. Suel, J. Yuan – considers only outlinks and stores 14 bits per link.
Representing web graphs (2002), Stanford University, S. Raghavan, H. Garcia-Molina – proposes a new representation for Web graphs that reduces access time.
Conclusions
The compression techniques are specialized for Web graphs.
The average link size decreases as the graph grows.
The average link access time increases as the graph grows.
The ζ-codes seem to have the best trade-off between avg. bit size and access time.
Counting Triangles in Data Streams
Luciana Salete Buriol (+), Gereon Frahling (*), Stefano Leonardi (+), Christian Sohler (*), Alberto Marchetti Spaccamela (+)
(*) Heinz Nixdorf Institute & Computer Science Department, University of Paderborn, Germany
(+) University of Rome “La Sapienza”, Computer and Systems Department, Rome, Italy
Outline of the Talk
Introduction of the problem, motivation and objectives;
Data stream algorithm for computing triangles in a graph read as an arbitrary list of edges;
Data stream algorithm for computing triangles in a graph read as a sequence of adjacency lists of nodes;
Computing k3,3 of a graph in a streaming fashion.
Data Stream Algorithms
In the data stream model, data arrives in a stream, one item at a time. The algorithms that process the stream
have the task of handling and mining massive data streams by summarizing the streams in small space.
Main applications:
When the streams are not stored and must be processed on the fly as they are produced;
When the memory or time for storing or processing the stream is limited;
When an exact computation is too time consuming and
just a good estimation of the underlying data is required.
Using Data Stream Algorithms for
Computing Topological and Statistical Properties of the Webgraph
Motivation:
Graph minors such as triangles and bipartite cliques are the cores of web communities. The number of such minors is an important webgraph measure.
Counting the number of triangles of a graph is a fundamental problem in network analysis.
For computing properties of the webgraph, we have to deal with massive data. For many computations, an approximation that uses less time and memory is preferable to an exact value that consumes excessive time or memory.
Using Data Stream Algorithms for
Computing Topological and Statistical Properties of the Webgraph
Objectives:
To compute approximately statistical and topological properties of webgraphs using limited space and time.
Webgraph as a Stream
The data stream consists of a sequence of the arcs ∈ E;
Each item of the stream is an arc of the graph
A traditional data stream application considers arcs in an arbitrary order
Depending on the application, we can assume some order in the stream. For example, if some property is maintained during a crawling process, the entire adjacency list of outgoing arcs of each node is extracted consecutively.
Samples and Sketches
We can divide data stream algorithms into two main groups: sampling and sketching algorithms.
Samples: select a subset of items and check some specific property on them;
Define the kind of sample and the sample size
Sketches: inner products or aggregates of subsets of items, computed with different hash functions, that compactly describe the sets
Define the kind of hash function and how many to use
Counting Triangles in Data Streams
• Given a graph G=(V,E), where V is the set of vertices and E the set of edges, consider all triples of nodes ∈ V;
• We can find four types of structures considering a triple of nodes and the number of edges connecting them.
Let $T_0$, $T_1$, $T_2$ and $T_3$ represent the sets of triples that have 0, 1, 2 and 3 edges ∈ E, respectively.
Counting Triangles in Data Streams
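• Previous best results by Bar-Yossef, Kumar and Sivakumar (Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs, 2002): space
$$O\left(\frac{1}{\varepsilon^3}\cdot\log\frac{1}{\delta}\cdot\left(1+\frac{T_1+T_2}{T_3}\right)^3\cdot\log n\right)$$
• Our results: space
$$O\left(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta}\cdot\left(1+\frac{T_1+T_2}{T_3}\right)\right)$$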
Counting Triangles in Data Streams
We take an edge e=(a,b) ∈ E and a node v ∈ V \ {a,b}, and look for the missing edges.
• The following property holds for any graph:
$$T_1 + 2T_2 + 3T_3 = |E|\,(|V|-2)$$
• Triples belonging to $T_0$ are not reached.
[Figure: edge (a,b) and a third node v, with the possible edges (a,v) and (b,v) marked "?"]
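The counting identity can be checked by brute force on a small graph. This is only a sanity check, not a streaming algorithm; the function name and the example graph are hypothetical.

```python
from itertools import combinations


def triple_counts(nodes, edges):
    """Return [T0, T1, T2, T3]: number of node triples with 0..3 edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        counts[k] += 1
    return counts


# Hypothetical example: a triangle 1-2-3 with a tail 3-4-5.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
t0, t1, t2, t3 = triple_counts(nodes, edges)
```

Each (edge, third node) pair is counted once per edge of the triple it belongs to, which is where the weights 1, 2 and 3 come from.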
A 3-pass streaming algorithm
We introduce a 3-pass streaming algorithm that samples an edge (a,b) and a node v ≠ a,b and outputs a variable β ∈ {0,1}:
1st pass: count the number of edges |E| in the stream.
2nd pass: sample an edge e=(a,b) chosen uniformly among all edges of the stream. Choose a node v uniformly from V\{a,b}.
3rd pass: test whether edges (a,v) and (b,v) are present in the stream. If (a,v) ∈ E and (b,v) ∈ E then output β=1, else output β=0.
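The three passes can be simulated on an in-memory edge list as in this sketch. It is not a space-efficient implementation: the stream is a Python list that is scanned three times, and the node set is materialized to pick v. The function name and the test graphs are hypothetical.

```python
import random


def three_pass(stream, nodes, rng):
    """Simulate one instance of the 3-pass algorithm; returns beta in {0, 1}."""
    # 1st pass: count the edges.
    m = sum(1 for _ in stream)

    # 2nd pass: pick the k-th edge uniformly, then a third node v.
    k = rng.randrange(m)
    a, b = next(e for i, e in enumerate(stream) if i == k)
    v = rng.choice([u for u in nodes if u not in (a, b)])

    # 3rd pass: look for the two missing edges (undirected comparison).
    wanted = {frozenset((a, v)), frozenset((b, v))}
    seen = {frozenset(e) for e in stream if frozenset(e) in wanted}
    return 1 if seen == wanted else 0


triangle = [(1, 2), (2, 3), (1, 3)]
path = [(1, 2), (2, 3)]
rng = random.Random(0)
```

On the triangle every sample closes, so β is always 1; on the path no sample closes, so β is always 0.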
A 3-pass streaming algorithm
The streaming algorithm outputs a value β having expected value:
$$E[\beta] = \frac{3T_3}{T_1 + 2T_2 + 3T_3}$$
• Furthermore:
$$T_3 = E[\beta]\cdot\frac{|E|\,(|V|-2)}{3}$$
A 3-pass streaming algorithm
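There is a streaming algorithm that outputs a value $\tilde{T}_3$ satisfying
$$(1-\varepsilon)\,T_3 < \tilde{T}_3 < (1+\varepsilon)\,T_3$$
with probability 1-δ.
We start r parallel instances of the 3-pass algorithm, and each one outputs a value $\beta_i$:
$$\tilde{T}_3 = \left(\frac{1}{r}\sum_{i=1}^{r}\beta_i\right)\cdot\frac{|E|\,(|V|-2)}{3}, \qquad r \ge \frac{2}{\varepsilon^2}\cdot\frac{T_1+2T_2+3T_3}{T_3}\cdot\ln\frac{1}{\delta}$$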
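The sample size and the estimator can be evaluated numerically as in this sketch. It assumes the bound on this slide reads r ≥ (2/ε²)·(T1+2T2+3T3)/T3·ln(1/δ) and that the estimate is the mean of the β outputs times |E|(|V|-2)/3; the function names and the input values are hypothetical.

```python
import math


def sample_size(t1, t2, t3, eps, delta):
    """Number of parallel instances r for a (1 +/- eps) estimate."""
    return math.ceil(2 / eps ** 2 * (t1 + 2 * t2 + 3 * t3) / t3
                     * math.log(1 / delta))


def estimate_t3(betas, num_edges, num_nodes):
    """Scale the mean of the beta outputs to an estimate of T3."""
    return sum(betas) / len(betas) * num_edges * (num_nodes - 2) / 3


# Hypothetical inputs: T1 = 6, T2 = 3, T3 = 1 on 5 nodes and 5 edges.
r = sample_size(6, 3, 1, eps=0.5, delta=0.1)
estimate = estimate_t3([1, 0, 0, 0, 0], num_edges=5, num_nodes=5)
```

The sample size grows with the ratio (T1+2T2+3T3)/T3, so graphs with few triangles relative to their size need many more instances.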
A 3-pass streaming algorithm
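We use $\frac{1}{r}\sum_{i=1}^{r}\beta_i$ as an estimator for
$$E[\beta] = \frac{3T_3}{T_1+2T_2+3T_3}$$
We estimate $T_3$ as:
$$\tilde{T}_3 = \left(\frac{1}{r}\sum_{i=1}^{r}\beta_i\right)\cdot\frac{|E|\,(|V|-2)}{3}$$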
A 3-pass streaming algorithm
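Proof by Chernoff bounds:
$$\Pr\left[\frac{1}{r}\sum_{i=1}^{r}\beta_i \ge (1+\varepsilon)\,E[\beta]\right] < e^{-\varepsilon^2\cdot E[\beta]\cdot r/3}$$
$$\Pr\left[\frac{1}{r}\sum_{i=1}^{r}\beta_i \le (1-\varepsilon)\,E[\beta]\right] < e^{-\varepsilon^2\cdot E[\beta]\cdot r/2}$$
Setting
$$r \ge \frac{2}{\varepsilon^2}\cdot\frac{T_1+2T_2+3T_3}{T_3}\cdot\ln\frac{1}{\delta}$$
both probabilities together are bounded by δ.
A 3-pass streaming algorithm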
We suppose that the events within the brackets do not occur. In this case:
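$$\frac{1}{r}\sum_{i=1}^{r}\beta_i < (1+\varepsilon)\,E[\beta]$$
$$\Rightarrow\quad \left(\frac{1}{r}\sum_{i=1}^{r}\beta_i\right)\cdot\frac{|E|\,(|V|-2)}{3} < (1+\varepsilon)\,E[\beta]\cdot\frac{|E|\,(|V|-2)}{3}$$
$$\Rightarrow\quad \tilde{T}_3 < (1+\varepsilon)\,T_3$$
• The same argument gives $\tilde{T}_3 > (1-\varepsilon)\,T_3$.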
One pass algorithm
A uniform choice of an edge in one pass can be achieved by choosing the first edge as the sample edge and replacing it by the i-th edge of the stream with probability 1/i.
When the sample is chosen, some edges of its triple may already have passed and are missed. A triangle is detected only if the sampled edge is the first of its three edges to appear in the stream, which happens with probability 1/3.
Sample one-pass
i ← 1;
for each edge es=(as,bs) in the stream do:
  flip a coin. With probability 1/i do:
    a ← as; b ← bs;
    v ← node uniformly chosen from V \ {a,b};
    x ← false; y ← false;
  end do
  if es = (a,v) then x ← true;
  if es = (b,v) then y ← true;
  i ← i+1;
end for
if x=true and y=true return β=1 else return β=0
[Figure: sampled edge (a,b) and node v]
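The one-pass sampler can be written as a runnable sketch. It keeps the reservoir-style replacement with probability 1/i and resets the two flags whenever the sample changes. The node set is materialized here to pick v, which a real streaming implementation would avoid; the function name and test graphs are hypothetical.

```python
import random


def sample_one_pass(stream, nodes, rng):
    """One instance of the one-pass sampler; returns beta in {0, 1}."""
    a = b = v = None
    x = y = False
    for i, (s, t) in enumerate(stream, start=1):
        # Replace the current sample by the i-th edge with probability 1/i
        # (always true for i = 1, since rng.random() < 1.0).
        if rng.random() < 1.0 / i:
            a, b = s, t
            v = rng.choice([u for u in nodes if u not in (a, b)])
            x = y = False
        edge = frozenset((s, t))          # undirected comparison
        if edge == frozenset((a, v)):
            x = True
        if edge == frozenset((b, v)):
            y = True
    return 1 if x and y else 0


triangle = [(1, 2), (2, 3), (1, 3)]
path = [(1, 2), (2, 3)]
rng = random.Random(0)
hits = sum(sample_one_pass(triangle, [1, 2, 3], rng) for _ in range(3000))
```

On a single triangle the sampler succeeds only when the first edge of the stream survives as the sample, so the fraction of successes should be close to 1/3.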
Sample one-pass
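• The streaming algorithm outputs a value β having expected value:
$$E[\beta] = \frac{1}{3}\cdot\frac{3T_3}{T_1+2T_2+3T_3} = \frac{T_3}{T_1+2T_2+3T_3}$$
• The size of the sample:
$$r \ge \frac{6}{\varepsilon^2}\cdot\frac{T_1+2T_2+3T_3}{T_3}\cdot\ln\frac{1}{\delta}$$
We estimate $T_3$ as:
$$\tilde{T}_3 = \left(\frac{1}{r}\sum_{i=1}^{r}\beta_i\right)\cdot|E|\,(|V|-2)$$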
Results for a sample set of size 100
Considering a structured stream
Which kind of structure can benefit the algorithm and still be a natural and good representation of the graph?
Let’s consider the case where the adjacency lists of nodes are stored in sequence in the stream
No order is required within each adjacency list
In this case each arc is seen twice in the stream
Results on Incidence Stream
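• Our results:
$$O\left(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta}\cdot\left(1+\frac{T_2}{T_3}\right)\right)$$
• Previous best results from Bar-Yossef, Kumar and Sivakumar (Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs, 2002):
$$O\left(\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta}\cdot\left(1+\frac{T_2}{T_3}\right)^2\cdot\log n + d\log n\right)$$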
Incidence streams
For each pair of arcs leaving node i we have a path of length 2 (called a V from now on): a V consists of two edges with a common endpoint node.
For each node i with degree $d_i$, the number of V's having node i as the common endpoint is:
$$\binom{d_i}{2} = \frac{d_i\,(d_i-1)}{2}$$
[Figure: incidence list of node u; a path of length 2 (called V)]
Counting triangles in incidence streams
In this case our sample is a V, and we check if the third arc is later seen in the stream
• The following property holds for any graph:
$$T_2 + 3T_3 = \sum_{i=1}^{|V|}\frac{d_i\,(d_i-1)}{2}$$
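The identity between paths of length 2 and the triple counts can be verified by brute force on a small graph. The function names and the example graph are hypothetical, and the check is not a streaming computation.

```python
from itertools import combinations


def wedge_count(adj):
    """Number of paths of length 2: sum over nodes of d*(d-1)/2."""
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())


def t2_t3(adj):
    """Count node triples with exactly 2 edges (T2) and with 3 edges (T3)."""
    t2 = t3 = 0
    for triple in combinations(sorted(adj), 3):
        k = sum(v in adj[u] for u, v in combinations(triple, 2))
        if k == 2:
            t2 += 1
        elif k == 3:
            t3 += 1
    return t2, t3


# Hypothetical example: a triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
t2, t3 = t2_t3(adj)
```

A triple with two edges contains exactly one such path and a triangle contains three, which gives the weights on the left-hand side.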
Incidence 3-pass algorithm
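The algorithm outputs β with expected value:
$$E[\beta] = \frac{3T_3}{T_2+3T_3}$$
• The sample size in this case is:
$$r \ge \frac{2}{\varepsilon^2}\cdot\frac{T_2+3T_3}{T_3}\cdot\ln\frac{1}{\delta}$$
• The approximation obtained is:
$$\tilde{T}_3 = \left(\frac{1}{r}\sum_{i=1}^{r}\beta_i\right)\cdot\sum_{i=1}^{|V|}\frac{d_i\,(d_i-1)}{6}$$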
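A possible simulation of the incidence 3-pass algorithm is sketched below, assuming the three passes are: count the paths of length 2, sample one uniformly, then look for the closing edge. The stream is an in-memory list of (node, neighbour list) pairs and must contain at least one path of length 2; the function names and test graphs are hypothetical.

```python
import random
from itertools import combinations, islice


def incidence_three_pass(incidence_stream, rng):
    """One instance on an incidence stream; returns beta in {0, 1}."""
    # 1st pass: count the paths of length 2.
    total = sum(len(n) * (len(n) - 1) // 2 for _, n in incidence_stream)

    # 2nd pass: select the k-th path of length 2 uniformly.
    k = rng.randrange(total)
    for _, nbrs in incidence_stream:
        w = len(nbrs) * (len(nbrs) - 1) // 2
        if k < w:
            p, q = next(islice(combinations(nbrs, 2), k, None))
            break
        k -= w

    # 3rd pass: check whether the closing edge (p, q) is present.
    closed = any(node == p and q in nbrs for node, nbrs in incidence_stream)
    return 1 if closed else 0


def incidence_estimate(betas, degrees):
    """Scale the mean of the beta outputs by sum of d*(d-1)/6."""
    return sum(betas) / len(betas) * sum(d * (d - 1) for d in degrees) / 6


triangle = [(1, [2, 3]), (2, [1, 3]), (3, [1, 2])]
path = [(1, [2]), (2, [1, 3]), (3, [2])]
rng = random.Random(0)
```

Because each arc appears in the lists of both endpoints, the closing edge can be found by inspecting the list of either end node.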
Dimension of some graphs extracted from different sources
Number of triangles of the graphs
Comparing with “Finding, Counting and Listing Triangles in Large Graphs”, Schank and Wagner, 2004
Counting k3,3 in Data Streams
Let k3,3 denote the number of k3,3 minors and k3,1 denote the number of k3,1 minors
We propose a method for estimating the number of k3,3 of a graph
We suppose that the outdegree of the graph is bounded by d
The edges are sorted by destination nodes
We do not assume any order by source nodes
Sample
Our sample in this case is a k3,1 and 2 nodes not belonging to the k3,1
[Figure: sample with nodes labelled w, v, b, c, a, u]
Conclusions and Future Work
Sampling gave better results than sketches for computing the number of small minors in graphs
Future work:
Implement the algorithms;
Verify which minors are important to compute;
Propose data stream algorithms for estimating the number of such minors