Mineração de Padrões Freqüentes em Bases de Dados Distribuídas e Dinâmicas
(Efficient Data Mining for Frequent Itemsets in Evolving and Distributed Databases)

Thesis presented to the Graduate Program in Computer Science of the Universidade Federal de Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

Belo Horizonte
Adriano Alonso Veloso was born in Santos Dumont, Brazil, on March 4th, 1979. He started his studies at Universidade Federal de Minas Gerais, Belo Horizonte, in 1998 and graduated with a Bachelor degree in Computer Science in 2001. He began graduate studies in Computer Science in 2002. He pursued his research in data mining, and parallel and distributed systems, under the direction of Dr. Wagner Meira Jr. He has published in major data mining conferences, and he has won some national academic awards. He has also reviewed for several conferences. Currently, Adriano is working at Smart Price Technology Group, where he uses his data mining skills in several related projects.
Acknowledgements

The last three years in Belo Horizonte were unforgettable. I will always have good memories from my friends Adriano Carvalho, Bruno Diniz, Rodrigo Barra, Flavia Ribeiro, Alex Borges, Gustavo Menezes, Leonardo Rocha, and Eveline Veloso.

Thanks to Dr. Mohammed Zaki, Dr. Virgílio Almeida, and Marcio Bunte de Carvalho, for the collaborations on various papers. Thanks also to Amol Ghoting, Matthew Otey, Liang Chen, Ruoming Jin, and Chao Wang.

Special thanks to Dr. Srini Parthasarathy, who was like a co-advisor on this thesis, and for providing me summer support and valuable time with the data mining research group at the Ohio State University.

Very special thanks to my advisor, Dr. Wagner Meira Jr., for the friendship and for guiding me through this thesis.

The most special thanks go to my wonderful parents, for the amazing love and for the undying support.

Happy reading!
Data Mining is the process of analyzing large databases with the goal of discovering and identifying models and patterns. Currently, a broad variety of applications, such as electronic commerce, web mining, intrusion and fraud detection, etc., are being considered promising niches for the use of data mining techniques. However, such applications, which use distributed and dynamic databases, impose new challenges, among which are the capability of continuous and interactive mining, efficient adaptation of the models to constant changes in the data, proper use of distributed computational resources, and limited access to databases with privacy restrictions. Using traditional data mining techniques in such environments results in excessive communication, waste of computational resources, possible violations of user privacy, and generally fails to provide the interactive response times strictly necessary in a discovery process.

This thesis presents new algorithms for mining frequent patterns (itemsets) in large distributed and dynamic databases. Our new algorithms are able to significantly reduce the computational resources required, and at the same time they effectively support user interactions (modifying/constraining the search space), data modifications (removal/insertion of blocks of data), and guaranteed interactive response times through the use of selective updates (approximate results). Additional highlights of our algorithms include the need for only a small amount of communication (and synchronization), and support for operations over windows of transactions (finding the patterns that exist only during a specific time interval). These features allow our algorithms to mine dynamic and distributed warehouses, and even data streams coming from several sources. An extensive evaluation using dynamic and
Data Mining is the process of automatic analysis of large databases in the hope of discovering previously unsuspected and useful relationships and patterns. Currently, a broad variety of new applications, such as electronic commerce, web mining, network intrusion detection, etc., are emerging as promising domains for data mining. However, such applications with distributed and evolving databases impose new challenges and issues for data mining researchers. Among these issues are the capability of interactive and continuous mining, efficient adaptation of models to constant changes in the data, proper use of distributed resources, and limited access to possibly privacy-sensitive data. Trying to address these issues with traditional algorithms results in high communication overhead, excessive wastage of CPU and I/O resources, privacy violations, and often does not meet the stringent interactive response times essential to an interactive process of discovery.

This thesis presents new algorithms for mining frequent itemsets in large evolving and distributed databases. Our new algorithms provide significant I/O and computational savings when compared against other techniques, and at the same time are able to effectively handle user interactions (modifying/constraining the search space for frequent itemsets), online data updates (removing/inserting blocks of data), and interactive response times through selective updates (approximate/partial results). Some additional highlights of the proposed algorithms include low communication (and synchronization) overhead, and interactive support for windowed operations (computing the existing itemsets over a specific time interval). These features allow our algorithms to mine from evolving data stored in distributed warehouses, to high-speed data streams coming from distributed sources. Extensive experimental evaluation using evolving and distributed data demonstrates the
Contents

List of Figures

1 Introduction
  1.1 Thesis Contributions
    1.1.1 New Incremental and Interactive Techniques
    1.1.2 New Distributed Techniques
    1.1.3 Concrete Actual Applications
  1.2 Thesis Outline

2 Mining Frequent Itemsets
  2.1 Theoretical Foundations and Definitions
  2.2 Computational Complexity
  2.3 Challenges and Related Work
    2.3.1 Algorithms for Frequent Itemset Mining
  2.4 The ZigZag Algorithm
    2.4.1 Hybrid Search
    2.4.2 Data Structure for Support Counting
    2.4.3 Evaluation
  2.5 Conclusions

3 Mining Frequent Itemsets in Evolving Databases
  3.1 Problem Statement and Definitions
  3.2 i-ZigZag: Incremental and Interactive Algorithm
    3.2.1 i-ZigZag Features
  3.3 Evaluation
  3.4 Conclusions

4 Mining Frequent Itemsets in Distributed Databases
  4.1 Problem Statement and Definitions
  4.2 d&i-ZigZag: Distributed Algorithm
    4.2.1 d&i-ZigZag Features
  4.3 Evaluation
  4.4 Conclusions

5 Summary and Future Work
  5.1 Thesis Summary
  5.2 Future Work
List of Figures

2.1 Frequent Itemset Mining Example
2.2 Comparison: Execution Times in Different Situations
3.1 Incremental Frequent Itemset Mining Example
3.2 Invariant, Predictable and Unpredictable Itemsets
3.3 Total Execution Time
3.4 Cumulative Execution Time
3.5 Speedup for Different Incremental Configurations
3.6 Proportion of Candidate, Frequent, and Retained Itemsets
3.7 Performance Comparison: i-ZigZag vs. ULI
3.8 Changing $min_s$: Iterative and Hand-in-Hand Operations
3.9 Trade-off: Accuracy vs. Execution Time
3.10 Gain in Accuracy
3.11 Predictive Accuracy
4.1 Distributed Frequent Itemset Mining Example (Candidate Distribution)
4.2 Distributed Frequent Itemset Mining Example (Data Distribution)
4.3 Comparison: Data vs. Candidate Distribution
4.4 Total Execution Time
4.5 Number of transactions incorporated into …
4.6 Query Response Time using Equal Block Sizes
4.7 Query Response Time using Different Block Sizes
Chapter 1

Introduction

Data Mining is one of the central activities associated with understanding, navigating, and exploiting the world of digital data. It is the mechanized process of discovering interesting patterns and building useful models from large databases. A pattern is usually described as an expression in some language describing a subset of the data. A model is a (statistical) description of the entire database. A frequent itemset is a classical type of pattern, and a collection of frequent itemsets is a compact model of the database.

Formally, the application of data mining procedures is a particular step in the process of discovering useful knowledge from databases. In such a process, some activities are assigned to humans and others to machines. While machines are still far from approaching human abilities in the areas of synthesis of new knowledge and creative insight formation, automating data modeling procedures is a significant niche suitable for computers. Accurate modeling of data is the ultimate form of data understanding and compression. By modeling large databases, we are essentially bringing them down in size to a range that humans can analyze and understand. A collection of frequent itemsets, as a compact model of the data, becomes the ultimate portable data store and can serve to quickly navigate and manipulate large volumes of data. For this reason, building these models (or mining frequent itemsets) is a core data mining task, and several algorithms were already proposed in the literature [1, 3, 6, 12, 15, 30, 42, 45]. Typical applications include supermarket sales and banking, astronomy, particle physics, chemistry, medicine and biology, but, in fact, new applications are mentioned often in the daily press. Some of these new applications have introduced an important new dimension to the problem: distributed sources of evolving (dynamic) data. For example, every day retail chains record millions of transactions coming from different places, telecommunications companies connect thousands of calls from different locations, and popular web sites log millions of hits world wide. In all these applications, distributed databases are continuously being updated with a new block of data at regular time intervals. For large time intervals, we have the common scenario in many distributed data warehouses. For small time intervals, we have high-speed data streams.

Modeling these evolving and distributed databases brings unique opportunities, but it also imposes difficult issues. Some of these issues, like interactivity, proper use of the distributed resources, and limited access to privacy-sensitive data, are nothing less than paramount. Interactivity is often the key for effective data understanding and knowledge discovery. Making proper use of the distributed resources in order to guarantee quick response time is crucial, because a lengthy time delay in response to a user request may disturb the flow of human perception and formation of insight. Addressing these issues when data is distributed and evolving is a challenging task, because changes in the data invalidate the collection of frequent itemsets, and simply using traditional approaches to build the new collection can result in privacy violations and in an explosion in the computational resources required. There is an urgent need for non-trivial algorithms specifically designed for mining frequent itemsets in such evolving and distributed databases.
1.1 Thesis Contributions

In this thesis we propose and evaluate new algorithms for mining frequent itemsets in evolving and distributed databases. Our algorithms make use of an incremental technique that avoids replicating work done before, and at the same time employs new pruning approaches to enhance the mining process. Also, the incremental technique provides much more interactivity than other ones, through features like selective updates and windowed operations. Further, our algorithms make use of a distributed technique which minimizes I/O requirements and communication overhead, being able to mine large geographically distributed databases. Parts of this work have appeared in [13, 21, 22, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]. Our main contributions are summarized next.
1.1.1 New Incremental and Interactive Techniques

Contribution 1. [Reducing Data Scans and New Pruning Approaches] Our technique uses a descriptive summary (incremental information) of the frequent itemsets to generate only frequent subsets, avoiding the generation and testing of many unnecessary candidate itemsets that are usually examined by other techniques. Further, the incremental information is not only used to reduce data scans, but also to employ new pruning approaches, improving the search for frequent itemsets and yielding significant runtime performance improvements [31, 36].
Contribution 2. [Selective Updates] Contrasting with other incremental techniques [8, 9, 11, 26], which generally monitor changes in the evolving database to detect the best moment to update the entire collection of frequent itemsets, we chose instead to perform selective updates [33]; that is, the entire collection is completely updated only when necessary, while approximate collections can be generated at a fraction of the total execution time needed to generate an exact collection. This feature enables the algorithm to postpone a full update operation, increasing its efficiency significantly, especially when mining high-speed data streams.
Contribution 3. [Flexible Parameter Modification] Practical data mining is often a highly iterative and interactive process. The user typically changes the parameters and runs the mining algorithm many times before being satisfied with the final results. This iterative process is very time consuming. Traditional algorithms [8, 26] are unable to take advantage of this iterative process to speed up the current mining operation, wasting time and replicating work. Our algorithm has the flexibility to change the parameters across mining operations, and even across data updates (without work replication). This feature is very effective in facilitating iterative data understanding and knowledge discovery [31].

Contribution 4. [Short-Term Mining] We present a novel kind of interactivity, called short-term mining (or windowed transactional operations), that keeps the collection of frequent itemsets coherent with respect to the most recent data [31]. Short-term mining is desirable for applications such as web mining, where the user behavior may vary significantly across time and old access patterns may no longer be relevant.
1.1.2 New Distributed Techniques

Contribution 5. [Low Communication and Synchronization Overhead] Current parallel and distributed mining algorithms suffer from high communication and synchronization overhead. Some of them need to synchronize several times during the execution, incurring poor parallel improvements [2, 10, 14, 16]. Others assume a high-speed network and perform excessive communication operations [34, 39, 46]. We present a distributed technique which minimizes communication requirements and performs only one synchronization operation. These features make our distributed technique capable of mining over a wide area network and grid-space [22, 35, 37].

Contribution 6. [Privacy-Preserving Communication Mechanism] Privacy concerns may prevent data movement in a distributed environment: data may be distributed among several custodians or sites, none of which is allowed to transfer its data to another site [17]. We developed a communication mechanism [35] to ensure privacy in the communication between the sites involved in the mining operation.
1.1.3 Concrete Actual Applications

Contribution 7. [Real-life Applications] The effectiveness of our approaches is ap…

1.2 Thesis Outline

This thesis is divided into 5 chapters. The remainder of this thesis is organized as follows.
Chapter 2. [Mining Frequent Itemsets] We present the basic theoretical foundations for frequent itemset mining, as well as basic algorithms. We introduce a new algorithm that performs an efficient search for frequent itemsets. We evaluate and compare our algorithm against traditional ones.

Chapter 3. [Mining Frequent Itemsets in Evolving Databases] We introduce and evaluate new incremental algorithms and interactive tools for frequent itemset mining in evolving databases.

Chapter 4. [Mining Frequent Itemsets in Distributed Databases] We introduce and evaluate new distributed algorithms for frequent itemset mining when the evolving database is distributed.

Chapter 5. [Summary and Future Work] We summarize our main contributions and discuss directions for future work.
Chapter 2

Mining Frequent Itemsets

In this chapter we describe the basic theoretical foundations and definitions that are necessary to understand the frequent itemset mining problem. We also present its computational complexity, and the main challenges and research surrounding this problem. We finish this chapter presenting and evaluating a novel algorithm for frequent itemset mining.

2.1 Theoretical Foundations and Definitions

In this section we present the foundations and definitions that form the basis of the frequent itemset mining problem.
Definition 1. [Itemsets] For any set $X$, its size is the number of elements in $X$, denoted $|X|$. Let $I$ denote the set of $n$ natural numbers $\{1, 2, \ldots, n\}$ ($n$ is the dimensionality of $I$). Each $x \in I$ is called an item. A non-empty subset of $I$ is called an itemset. The power set of $I$, denoted by $P(I)$, is the set of all possible subsets of $I$. An itemset of size $k$, $X = \{x_1, x_2, \ldots, x_k\}$, is called a $k$-itemset (for convenience we drop set notation and denote $X$ as $x_1 x_2 \ldots x_k$). For $X, Y \in P(I)$ we say that $X$ contains $Y$ if $Y \subseteq X$. A set (of itemsets) $C \subseteq P(I)$ is a collection of itemsets.
Definition 2. [Transactions] A transaction $T_i$ is an itemset, where $i$ is a natural number called the transaction identifier, or simply tid. A transaction database $D = \{T_1, T_2, \ldots, T_m\}$ is a finite set of transactions, with size $|D| = m$. The absolute support of an itemset $X$ in $D$ is the number of transactions in $D$ that contain $X$, given as $\sigma(X, D) = |\{T_i \in D \mid X \subseteq T_i\}|$. The (relative) support of an itemset $X$ in $D$ is the fraction of transactions in $D$ that contain $X$, given as $s(X, D) = \sigma(X, D) / |D|$.
Definition 3. [Frequent and Maximal Frequent Itemsets] An itemset $X$ is frequent if $s(X, D) \geq min_s$, where $min_s$ is a user-specified minimum-support threshold. A frequent itemset $X \in F(min_s, D)$ is maximal if it has no frequent superset. A collection of maximal frequent itemsets is denoted as $MF(min_s, D)$. No itemset in $MF(min_s, D)$ contains another; that is, $X, Y \in MF(min_s, D)$ and $X \neq Y$ implies $X \not\subseteq Y$.
Lemma 1. Any subset of a frequent itemset is frequent: $X \in F(min_s, D)$ and $Y \subseteq X$ implies $Y \in F(min_s, D)$. Thus, by definition, a frequent itemset must be a subset of at least one maximal frequent itemset [3].

Lemma 2. $MF(min_s, D)$ is the smallest collection of itemsets from which $F(min_s, D)$ can be inferred [25].

Problem 1. [Mining Frequent Itemsets] Given $min_s$ and a transaction database $D$, the problem of mining frequent itemsets is to find $F(min_s, D)$.
2.2 Computational Complexity

Next we highlight the graph-theoretic view of the problem and provide some insight into the computational complexity of mining frequent itemsets.

Definition 4. [Bipartite Graphs] A graph $G = (U, V, E)$ has two distinct vertex sets $U$ and $V$, and an edge set $E = \{(u, v) \mid u \in U$ and $v \in V\}$. A complete bipartite subgraph $I \times T$ is called a bipartite clique, and is denoted as $K_{i,t}$, where $|I| = i$, $|T| = t$, and $I \subseteq U$, $T \subseteq V$.

The input database for the frequent itemset mining problem is essentially a very large bipartite graph, with $U$ as the set of items, $V$ as the set of transactions (or simply the set of tids), and each (item, tid) pair as an edge. The problem of enumerating all (maximal) frequent itemsets corresponds to the task of enumerating all (maximal) constrained bipartite cliques $K_{i,t}$, where $t \geq |D| \cdot min_s$.
Theorem 1. The problem of determining whether a bipartite graph $G = (U, V, E)$, with $|U| = |V| = n$, contains a bipartite clique $K_{k,k}$ is NP-Complete [40].

Determining the existence of a (maximal) bipartite clique (itemset), with restrictions on the size of $|I| = i$ (items) and $|T| = t$ (support), such that $i + t \geq k$ (with constant $k$), is in P, the class of problems that can be solved in polynomial time. On the other hand, the problem of whether there exists a maximal bipartite clique such that $i + t = k$ is NP-Complete [18], for which no polynomial time algorithm is known to exist. While the problems in the class NP ask whether a desired solution exists, the problems in the class #P ask how many solutions exist. Counting problems that correspond to NP-Complete problems are #P-Complete. The following theorem indicates the hardness of the counting version:

Theorem 2. Determining the number of (maximal) bipartite cliques in a bipartite graph is #P-Complete [18].

The complexity results shown above are quite pessimistic, and apply to general bipartite graphs. We should therefore focus on special cases where we can find polynomial time solutions. Fortunately, for frequent itemset mining in practice, the bipartite graph (database) is very sparse, and we can in fact obtain linear complexity in the graph (database) size [45].
2.3 Challenges and Related Work

We outline some of the current challenges and research regarding frequent itemset mining. This list is not exhaustive, and it is intended to give the reader a feel for the types of problems we wrestle with.

Challenge 1. [Huge Databases] Databases with millions of transactions and gigabyte size are quite commonplace. Methods for dealing with large data volumes include more effective algorithms [3, 30, 45], sampling [19, 44], approximation methods [33], and massively parallel processing [27, 34, 37, 46].

Challenge 2. [High Dimensional Databases] Not only is there often a very large number of transactions in the database, but there can also be a very large number of items, so that the dimensionality of the problem is high. A high dimensional database creates problems in terms of increasing the size of the search space in a combinatorially explosive way. Approaches to this problem include more efficient pruning techniques [6, 12, 42] and methods to reduce the actual dimensionality of the problem (i.e., condensed representations) [5, 7].
Example 1. Let us consider the example in Figure 2.1, where $I$ = {1, 2, 3, 4, 5} and the figure shows $D$ and $P(I)$. Suppose $min_s$ = 0.4 (40%). $F(0.4, D)$ is composed of the shaded and bold itemsets, while $MF(0.4, D)$ is composed of the bold itemsets. Note that $|MF(0.4, D)|$ is much smaller than $|F(0.4, D)|$.

A naive approach to find $F(0.4, D)$ is to first compute $s(X, D)$ for each $X \in P(I)$, and then return only those $X$ such that $s(X, D) \geq 0.4$. This approach is inappropriate because:

- If $|I|$ (the dimensionality) is high, then $|P(I)|$ is huge (i.e., $2^{|I|}$).
- If $|D|$ (the size) is large, then computing $s(X, D)$ for all $X \in P(I)$ is infeasible.

Thus, the space of itemsets is often huge, and the enumeration of itemsets involves different forms of search in this space. Practical computational constraints place severe limits on this search. The key observation is that if an itemset is infrequent, we do not need to generate any of its supersets, since they must also be infrequent. This simple pruning strategy was first used in [3], and it greatly reduces the number of candidate itemsets generated. Several different searches can be applied by using this pruning strategy.
Figure 2.1: Frequent Itemset Mining Example.
Challenge 3. [Evolving Data] Rapidly changing (non-stationary) data may make previously discovered frequent itemsets invalid. Possible solutions include incremental [8, 11, 26, 28, 31, 36, 38] and model quality extension techniques [29].

Challenge 4. [Distributed Data] The issues concerning modern applications are not only the size and dimensionality of the data, but also its distributed nature. Modern applications may have their databases logically and physically located at different (distant) places, and possibly they may be willing to share their data mining models, but not their data. In all these cases what is needed is a decentralized approach to data mining [22, 35, 37, 39].
2.3.1 Algorithms for Frequent Itemset Mining

In this section we will briefly describe the main algorithms for frequent itemset mining. Our goal is not to go too much into detail, but to show the basic principles and the differences between the algorithms. Although they share basic ideas, they fundamentally differ in certain aspects, such as the search employed and the way they count the support of the candidate frequent itemsets.

Algorithm 1. [Apriori] Apriori exploits the fact that all subsets of a frequent itemset must also be frequent; thus infrequent itemsets are quickly pruned out from consideration. Apriori performs a breadth-first search for $F(min_s, D)$, allowing an efficient candidate generation, since all frequent itemsets at a given level are known before starting to count candidates at the next level. The strategy for generating candidates works as follows: first, a set of candidates is set up; second, the algorithm scans all transactions and increments the counters of the candidates. That is, for each set of candidate itemsets one pass over all transactions is necessary. With a breadth-first search such a pass is needed for each level of the search space. In contrast, a depth-first search would make a separate pass necessary for each class that contains at least one candidate. The costs for each pass over the whole database would contrast with the relatively small number of candidates counted in the pass.

Although Apriori employs an efficient candidate generation, it is extremely I/O intensive, requiring as many passes over the database as the size of the longest candidate generated. This process of support counting may be acceptable for sparse databases with short-sized transactions. However, when the itemsets are long-sized, Apriori may require substantial computational effort. Thus, the success of Apriori critically relies on the fact that the lengths of the frequent itemsets in the database are typically short.
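The level-wise strategy just described can be sketched in a few lines of Python. This is an illustrative simplification (in-memory transactions, absolute support threshold, our own names), not the original Apriori implementation; note the single pass over $D$ per level and the subset-based candidate pruning:

```python
from itertools import combinations

def apriori(D, min_sigma):
    """Level-wise (breadth-first) search: one database pass per level.
    D is a list of transaction sets; min_sigma is the absolute support threshold."""
    # Level 1: count items in a single pass.
    counts = {}
    for T in D:
        for x in T:
            counts[frozenset([x])] = counts.get(frozenset([x]), 0) + 1
    level = {X for X, c in counts.items() if c >= min_sigma}
    frequent = {X: c for X, c in counts.items() if c >= min_sigma}
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {X | Y for X in level for Y in level if len(X | Y) == k}
        candidates = {C for C in candidates
                      if all(frozenset(S) in level for S in combinations(C, k - 1))}
        # One pass over D counts all level-k candidates at once.
        counts = {C: sum(1 for T in D if C <= T) for C in candidates}
        level = {C for C, c in counts.items() if c >= min_sigma}
        frequent.update((C, counts[C]) for C in level)
        k += 1
    return frequent

D = [{1, 2, 5}, {2, 3}, {1, 3}, {1, 2, 3}, {1, 2, 3, 5}]
F = apriori(D, min_sigma=2)
```

The `while` loop makes the I/O behavior explicit: the database is rescanned once per level, which is exactly why long frequent itemsets hurt Apriori.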
Algorithm 2. [ECLAT] ECLAT performs a depth-first search for $F(min_s, D)$, which is based on the concept of equivalence classes [45]. Two $k$-itemsets belong to the same equivalence class if they share a common ($k$-1)-length prefix. Each equivalence class may be processed independently; thus ECLAT divides the search space for $F(min_s, D)$ into smaller sub-spaces, where each one corresponds to an equivalence class and is processed independently in main memory, minimizing the number of secondary memory operations. The depth-first search starts from 1-itemsets and keeps generating candidates within the same equivalence class until $F(min_s, D)$ is found.
Algorithm 3. [FP-Growth] FP-Growth [15] is an interesting method that finds $F(min_s, D)$ using a frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compact information about the frequent itemsets. Better performance is achieved with three techniques:

1. The database is compressed into a condensed data structure, which avoids repeated database scans.
2. FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets.
3. A divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining frequent itemsets confined in conditional databases, which dramatically reduces the search space.
2.4 The ZigZag Algorithm

In this section we present ZigZag, a new algorithm for frequent itemset mining. ZigZag has two features that distinguish it from other algorithms:

1. It employs a hybrid search for frequent itemsets. This search first finds the maximal frequent itemsets and then it generates only frequent subsets. This choice greatly reduces the number of candidates generated.
2. It employs efficient data structures for support counting that greatly reduce the number of data scans.

Next we describe these two features and present an experimental evaluation comparing ZigZag against other algorithms.
2.4.1 Hybrid Search

Almost all algorithms for mining frequent itemsets use the same procedure: first a set of candidates is generated, next infrequent ones are pruned, and only the frequent ones are used to generate the next set of candidates. Clearly, an important issue in this task is to reduce the number of candidates generated. An interesting approach to reduce the number of candidates is to first mine $MF(min_s, D)$. Once $MF(min_s, D)$ is found, it is straightforward to obtain all frequent itemsets (and their support counts) in a single database scan, without generating infrequent (and unnecessary) candidates. The number of candidates generated to find $MF(min_s, D)$ is much smaller than the number of candidates generated to directly find $F(min_s, D)$.
Step 1. [Backtrack Search for $MF(min_s, D)$] An efficient search for $MF(min_s, D)$ must have the ability to quickly remove large branches of $P(I)$ from consideration. This property is associated with the number of candidates generated during the search: the smaller the number of candidates generated, the faster the search will be. Our algorithm employs a backtrack search to find $MF(min_s, D)$. Backtrack algorithms are useful for many combinatorial problems where the solution can be represented as a set $X = \{x_0, x_1, \ldots\}$, where each $x_j$ is chosen from a finite possible set, $S_j$. Initially $X$ is empty; it is extended one item at a time, as the search space (i.e., $P(I)$) is traversed. The length of $X$ is the same as the depth of the corresponding node in the search tree. Given a $k$-candidate itemset $X = \{x_0, x_1, \ldots, x_{k-1}\}$, the possible values for the next item $x_k$ come from a subset $R_k \subseteq S_k$ called the combine set. If $y \in S_k - R_k$, then nodes in the subtree with root node $X = \{x_0, x_1, \ldots, x_{k-1}, y\}$ will not be considered by the backtrack algorithm. Since such subtrees have been pruned away from the original search space, the determination of $R_k$ is also called pruning. Each iteration of the algorithm tries extending $X$ with every item $x$ in the combine set, provided the extended itemset is not a subset of any already known maximal frequent itemset. The next step is to extract the new possible set of extensions, $S_{k+1}$, which consists only of items in $R_k$ that follow $x$. The new combine set, $R_{k+1}$, consists of those items in the possible set that produce a frequent itemset when used to extend $X$. Any item not in the combine set refers to a pruned subtree. The backtrack search performs a depth-first traversal of the search space.
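The backtrack step above can be sketched as follows. This is a simplified illustration with our own names and a hypothetical tidset-based support oracle (ZigZag itself uses diffsets, described in Section 2.4.2), not the thesis's implementation:

```python
def backtrack(X, combine, MF, is_frequent):
    """Depth-first backtrack search for maximal frequent itemsets.
    `combine` lists items that frequently extend X; `is_frequent` is an
    abstract support oracle."""
    for i, x in enumerate(combine):
        Xl = X | {x}
        possible = combine[i + 1:]
        # Prune: if Xl plus every remaining extension is covered by a known
        # maximal itemset, this whole subtree yields nothing new.
        if any(Xl | set(possible) <= M for M in MF):
            return MF
        # New combine set: items whose extension of Xl is still frequent.
        new_combine = [y for y in possible if is_frequent(Xl | {y})]
        if not new_combine:
            if not any(Xl <= M for M in MF):
                MF.append(frozenset(Xl))  # Xl has no frequent extension: maximal
        else:
            backtrack(Xl, new_combine, MF, is_frequent)
    return MF

# Hypothetical vertical database: item -> tidset (not the Figure 2.1 data).
L = {1: {1, 3, 4, 5}, 2: {1, 2, 4, 5}, 3: {2, 3, 4, 5}, 5: {1, 5}}

def is_frequent(itemset, min_sigma=2):
    """Support oracle via plain tidset intersection."""
    return len(set.intersection(*(L[x] for x in itemset))) >= min_sigma

MF = backtrack(set(), [1, 2, 3, 5], [], is_frequent)
```

On this toy data the search reports only the two maximal itemsets {1,2,3} and {1,2,5}; every other frequent itemset is a subset of one of them and is pruned rather than enumerated.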
Step 2. [Top-Down Enumeration of $F(min_s, D)$] In the top-down enumeration step each itemset $X \in MF(min_s, D)$ of size $k$ is broken into $k$ subsets of size ($k$-1). This process iterates, generating smaller subsets and computing their support counts, until there are no more subsets to be checked, and $F(min_s, D)$ is found.
2.4.2 Data Structure for Support Counting

Structure 1. [TidSet] Let $L(X, D)$ be the tidset [45] of a $k$-itemset $X$ in $D$ (the set of tids in $D$ in which $X$ has occurred); thus, $|L(X, D)| = \sigma(X, D)$. According to [44], $L(X, D)$ can be obtained by intersecting the tidsets of at least two ($k$-1)-subsets of $X$. For example, in Figure 2.1, $L(123, D)$ = {1, 5, 9, 10} and $L(125, D)$ = {1, 4, 5, 8, 9, 10}. Consequently, $L(1235, D) = L(123, D) \cap L(125, D)$ = {1, 5, 9, 10}. Note that $|L(1235, D)| = \sigma(1235, D) = 4$, and thus $s(1235, D)$ = 0.4. Alternatively, $L(1235, D)$ can also be obtained by $L(1, D) \cap L(2, D) \cap L(3, D) \cap L(5, D)$ = {1, 5, 9, 10}.
Structure 2. [DiffSet] The diffset [41] is an efficient structure for support counting. The main idea is to avoid storing the entire tidset of each candidate. Instead, only the differences between the tidsets of the itemsets $X$ and $Y$ are stored in the structure. The diffset is very short compared to its tidset counterpart, and it is very effective in improving the running time of the counting operation. The diffset of a 2-itemset $Z = x \cup y$ is $H(Z, D) = L(x, D) - L(y, D)$, and $\sigma(Z, D) = \sigma(x, D) - |H(Z, D)|$. The diffset of a $k$-itemset ($k \geq 3$) $Z = X \cup Y$ is $H(Z, D) = H(Y, D) - H(X, D)$, and $\sigma(Z, D) = \sigma(X, D) - |H(Z, D)|$ (where $X$ and $Y$ are ($k$-1)-itemsets). For example, in Figure 2.1, $H(12, D) = L(1, D) - L(2, D)$ = {2, 4} and $\sigma(12, D) = \sigma(1, D) - |H(12, D)| = 10 - 2 = 8$. Similarly, $H(13, D)$ = {3, 6, 7, 8} and $\sigma(13, D) = 10 - 4 = 6$. Consequently, $H(123, D) = H(13, D) - H(12, D)$ = {3, 6, 7, 8} $-$ {2, 4} = {3, 6, 7, 8}, and $\sigma(123, D) = 8 - 4 = 4$, and thus $s(123, D)$ = 0.4.
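The diffset arithmetic above maps directly onto set operations. The sketch below replays it on small hypothetical tidsets (deliberately not the Figure 2.1 data) so the recurrence can be checked against a direct intersection:

```python
# Hypothetical tidsets for items 1, 2 and 3:
L = {1: {1, 3, 4, 5}, 2: {1, 2, 4, 5}, 3: {2, 3, 4, 5}}

# Diffset of a 2-itemset xy: tids where x occurs but y does not.
H12 = L[1] - L[2]                  # H(12) = {3}
sigma_12 = len(L[1]) - len(H12)    # sigma(12) = sigma(1) - |H(12)| = 4 - 1 = 3
H13 = L[1] - L[3]                  # H(13) = {1}
sigma_13 = len(L[1]) - len(H13)    # 4 - 1 = 3

# For k >= 3: H(XY) = H(Y) - H(X) and sigma(XY) = sigma(X) - |H(XY)|,
# where X and Y are the (k-1)-subsets sharing a prefix (here X = 12, Y = 13).
H123 = H13 - H12                   # {1}
sigma_123 = sigma_12 - len(H123)   # 3 - 1 = 2
```

The final count agrees with the direct 3-way intersection `len(L[1] & L[2] & L[3])`, but the stored sets (`H12`, `H13`, `H123`) are much smaller than the tidsets they replace, which is the whole point of the structure.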
Support Counting of 1-itemsets. ZigZag scans each transaction $T_i \in D$, and if $x \in T_i$, $L(x, D)$ is augmented with $i$. At the end of this process, each item $x \in D$ has its tidset $L(x, D)$. Note that $\sigma(x, D) = |L(x, D)|$. Clearly this approach assumes that all tidsets fit simultaneously in main memory. There are basically three alternative approaches for the cases where all tidsets do not fit in main memory at the same time: distributed
Support Counting of $k$-itemsets ($k \geq 2$) in Step 1. As explained earlier, in the first step ZigZag performs a search for $MF(min_s, D)$. In this step ZigZag uses the diffset structure for support counting. The diffsets are propagated from a node to its children, starting from the root. At the first level, ZigZag has access to the tidsets for each item in $D$. At the second level, the diffset of a 2-itemset $Z = x \cup y$ is $H(Z, D) = L(x, D) - L(y, D)$, and $\sigma(Z, D) = \sigma(x, D) - |H(Z, D)|$. At subsequent levels, ZigZag has available the diffsets for each element in the combine set. In this case, the diffset of a $k$-itemset ($k \geq 3$) $Z = X \cup Y$ is $H(Z, D) = H(Y, D) - H(X, D)$, and $\sigma(Z, D) = \sigma(X, D) - |H(Z, D)|$.
Support Counting of k-itemsets (k ≥ 2) in Step 2 It is not possible to use
the diffset propagation during the top-down enumeration, since ZigZag does not have the
differences between the tidsets of the subsets. A naive approach to compute σ(X, D) would
require the intersection of k different tidsets if the size of X is k. The k-way intersection
method employed by ZigZag is dramatically enhanced by the use of an effective cache
that stores intermediate results for future use. The cache is simply implemented through
a hash-table, whose key is composed by the items {x_1, x_2, ..., x_n} that compose a subset X.
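A minimal sketch of this cached k-way intersection, with a plain dict playing the role of the hash-table keyed by the items of the subset (names and toy tidsets are illustrative, not the thesis implementation):

```python
cache = {}

def tidset(items, L):
    """k-way intersection of the items' tidsets, reusing cached prefixes."""
    items = tuple(sorted(items))
    if items in cache:                 # hash-table keyed by the items of X
        return cache[items]
    if len(items) == 1:
        result = L[items[0]]
    else:
        # reuse the cached (k-1)-prefix instead of redoing k-1 intersections
        result = tidset(items[:-1], L) & L[items[-1]]
    cache[items] = result
    return result

L = {1: {1, 2, 3, 5}, 2: {1, 3, 5}, 3: {3, 5, 9}}
sigma = len(tidset([1, 2, 3], L))      # σ(123) via cached intersections
```

A later query for any itemset sharing the prefix (1, 2) now reuses the cached intermediate result instead of intersecting from scratch.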
Next we show the pseudocode for the ZigZag algorithm.

ZigZag(σ_min, D)
1.  Scan D and compute the frequent items (i.e., F_1)
2.  MF = BackTrack(∅, F_1, 0, σ_min)
3.  F = ∅
4.  for each X ∈ MF
5.      TopDown(X)
6.  return F

BackTrack(X_l, C_l, l, σ_min)
1.  for each x ∈ C_l
2.      X_{l+1} = X_l ∪ x
3.      P_{l+1} = {y : y ∈ C_l and y > x}
4.      if X_{l+1} ∪ P_{l+1} has a superset in MF
5.          return
6.      C_{l+1} = Extend(X_{l+1}, P_{l+1}, l, σ_min)
7.      if C_{l+1} == ∅
8.          if X_{l+1} has no superset in MF
9.              MF = MF ∪ X_{l+1}
10.     else BackTrack(X_{l+1}, C_{l+1}, l+1, σ_min)
11. return MF

Extend(X_{l+1}, P_{l+1}, l, σ_min)
1.  C = ∅
2.  for each y ∈ P_{l+1}
3.      y' = y
4.      if l == 0 then d(y', D) = L(X_{l+1}, D) - L(y, D)
5.      else d(y', D) = d(y, D) - d(X_{l+1}, D)
6.      if σ(y', D) ≥ σ_min
7.          C = C ∪ y'
8.  return C

TopDown(X)
1.  if X was not processed yet
2.      σ(X, D) = |L(x_1, D) ∩ L(x_2, D) ∩ ... ∩ L(x_k, D)|
3.      s(X, D) = σ(X, D) / |D|
4.      F = F ∪ {X}
5.      for each (k-1)-subset Y ⊂ X
6.          TopDown(Y)
2.4.3 Evaluation
In this section we compare ZigZag against traditional algorithms: Apriori¹, Eclat², and
FP-Growth³. All experiments were carried out on a dual Pentium III 1GHz with 1GB of
main memory running Red Hat Linux 7.1. The basic metric employed in the comparison is the
total execution time spent to perform the mining operation. We utilize three different databases:
WPortal, WCup, and VBook. These databases are described in the following, and they will
be used in all evaluations throughout this thesis⁴.

Database 1 [WPortal] WPortal represents the access patterns of the largest Brazilian
Web portal. The database was collected in approximately 4 months, and comprises 7,274,382
transactions over 3,183 unique items; each transaction has an average length of 3.2 items.
¹ Downloadable from http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
² Downloadable from http://cs.rpi.edu/~zaki/software
³ Thanks to Bruno Gusmão for implementing the FP-Growth algorithm
Database 2 [WCup] WCup was generated from a 2-month click-stream log of the 1998
World Cup web site, which is publicly available at ftp://researchsmp2.cc.vt.edu/pub/. We scanned
the WCup log and produced a transaction file, where each transaction is a session of access to
the site by a client. Each item in the transaction is a web request. Not all requests were turned
into items; to become an item, the request must have three properties:

1. The request method is GET.
2. The request status is OK.
3. The file type is HTML.

A session starts with a request that satisfies the above properties, and ends when there has
been no click from the client for 30 minutes. All requests in a session must come from the same
client. WCup comprises 6,027,378 transactions over 5,271 unique items.

Database 3 [VBook] VBook was generated from a 2-month web log from a large electronic
bookstore. In this case, each transaction contains the set of books examined, added to the
shopping cart and/or bought by a customer during a single visit to the bookstore. VBook
comprises 141,735 transactions over 8,836 unique items.
Figure 2.2 shows the execution times obtained by the algorithms (Apriori, Eclat,
FP-Growth and ZigZag), using different databases (WPortal, WCup and VBook) and different
minimum support values. All databases fit in the main memory of our computational
environment, and to ensure that the comparison between the algorithms is fair, all algorithms access the
data in main memory, rather than on disk. ZigZag performs better on the WPortal and VBook
databases. For the experiments using the WCup database, ZigZag is the best choice only for
very small minimum support values. The reason is that for greater values of minimum support
the proportion of maximal frequent itemsets is too high, and a direct search for frequent itemsets is better.
Further, in the best case, ZigZag offers one order of magnitude improvement when compared
against Apriori.
2.5 Conclusions

In this chapter we presented the basic foundations and challenges of the frequent itemset mining
problem. We also depicted the main algorithms to solve this problem, and we proposed a new one,
called ZigZag, which first searches for the maximal frequent itemsets and then uses them to guide
a top-down enumeration of all frequent itemsets. We evaluated our algorithm by comparing it against the others.
[Figure 2.2 comprises six panels plotting Time (secs) against Minimum Support for Apriori, FP-Growth, Eclat and ZigZag, over two minimum support ranges each for the WPortal, WCup and VBook databases.]
Figure 2.2: Comparison: Execution Times in Different Situations.
In the next two chapters we will properly modify the ZigZag algorithm for mining evolving and distributed databases.
Chapter 3
Mining Frequent Itemsets in
Evolving Databases
Traditional approaches for frequent itemset mining make the assumption that the database
is static, and a database update requires rediscovering all frequent itemsets by scanning the
entire database. Such approaches are memoryless and replicate work that has already been done,
wasting computational resources. Incremental algorithms re-use the data mining results that
were obtained previously, and combine this information with the fresh data to effectively update
the new collection of frequent itemsets, saving work.
3.1 Problem Statement and Definitions

Problem 2. [Mining Frequent Itemsets in Evolving Databases] Using D as a
starting point, a set of new transactions d+ is added and a set of old transactions d- is removed,
forming Δ (i.e., Δ = (D ∪ d+) - d-). Given σ_min, D, d+, and d-, the problem of mining frequent
itemsets in evolving databases is to find F(σ_min, Δ), given some information about F(σ_min, D).

Definition 5. [Emerged and Retained Itemsets] An itemset X is called an emerged
itemset if it is infrequent in D, but frequent in Δ (i.e., X ∉ F(σ_min, D), but X ∈ F(σ_min, Δ)).
Similarly, X is called a retained itemset if it is frequent in D, and remains frequent in Δ (i.e.,
X ∈ F(σ_min, D) and X ∈ F(σ_min, Δ)).
Definition 6. [Incremental Representation] When mining frequent itemsets in
evolving databases, caching of previous results is obviously useful. Still, having to go back to D
to find F(σ_min, Δ) would replicate work that was already done when mining D. When mining Δ, the
incremental representation Γ is composed by F(σ_min, D) (i.e.,
the frequent itemsets in D along with their support counts).
Definition 7. [Approximate Collection of Frequent Itemsets] We define
F′(σ_min, ε_max, Δ) (where ε_max is a maximum error threshold) as an approximate/partial
collection of frequent itemsets in Δ. Formally, for some itemsets X ∈ F′(σ_min, ε_max, Δ), σ(X, Δ)
has an approximate value, σ′(X, Δ).
3.2 iZigZag: An Incremental and Interactive Algorithm

The basic feature of an incremental mining algorithm is the incremental support counting.
Our algorithm, iZigZag, employs the basic search rationale explained in Section 2.3; the only
difference is the incremental support counting. The idea is that for a given candidate X, σ(X, Δ)
can be computed by σ(X, D) + σ(X, d+) - σ(X, d-). When X is a retained itemset, σ(X, D)
is already stored in Γ. In this case, the need for data scans is greatly reduced, and the support
counting process is enhanced.
Support Counting of 1-itemsets iZigZag scans each transaction T_i ∈ Δ, and if
x ∈ T_i there are three possibilities:

1. If i ∈ d- then both L(x, d-) and L(x, D) are augmented by i.
2. On the other hand, if i ∈ d+, only L(x, d+) is augmented by i.
3. Otherwise, only L(x, D) is augmented by i.

At the end of this process, each item x ∈ Δ has three tidsets, L(x, d-), L(x, d+), and L(x, D).
Note that σ(x, Δ) = |L(x, Δ)| = |L(x, D)| + |L(x, d+)| - |L(x, d-)|.
Support Counting of k-itemsets (k ≥ 2) in Step 1 In step 1, iZigZag uses
the diffset structure for support counting. At the first level of the search, iZigZag has access
to the tidsets for each item in Δ. At the second level, the diffset of a 2-itemset Z = x ∪ y is
d(Z, Δ) = (L(x, D) - L(y, D)) ∪ (L(x, d+) - L(y, d+)) ∪ (L(x, d-) - L(y, d-)), and σ(Z, Δ) =
σ(x, Δ) - |d(Z, Δ)|. At subsequent levels, iZigZag has available the diffsets for each element
in the combine set. In this case, the diffset of a k-itemset (k ≥ 3) Z = X ∪ Y is obtained by
computing d(Y) - d(X) separately over D, d+, and d-.
Support Counting of k-itemsets (k ≥ 2) in Step 2 In step 2, iZigZag employs
the k-way intersection of tidsets explained in the previous chapter. In the incremental case, given
a subset X = x_1 ∪ x_2 ∪ ... ∪ x_n, L(X, Δ) = ((L(x_1, D) ∩ L(x_2, D) ∩ ... ∩ L(x_n, D)) - (L(x_1, d-) ∩
L(x_2, d-) ∩ ... ∩ L(x_n, d-))) ∪ (L(x_1, d+) ∩ L(x_2, d+) ∩ ... ∩ L(x_n, d+)), and σ(X, Δ) = |L(X, Δ)|.

In the following we present the modified procedure for incremental support counting.
Extend(X_{l+1}, P_{l+1}, l, σ_min)
1.  C = ∅
2.  for each y ∈ P_{l+1}
3.      y' = y
4.      if l == 0 then
5.          d(y', d+) = L(X_{l+1}, d+) - L(y, d+)
6.          d(y', d-) = L(X_{l+1}, d-) - L(y, d-)
7.          if y' is an emerged itemset
8.              d(y', D) = L(X_{l+1}, D) - L(y, D)
9.      else
10.         d(y', d+) = d(y, d+) - d(X_{l+1}, d+)
11.         d(y', d-) = d(y, d-) - d(X_{l+1}, d-)
12.         if y' is an emerged itemset
13.             d(y', D) = d(y, D) - d(X_{l+1}, D)
14.     if y' is a retained itemset then
15.         σ(y', Δ) = σ(y') + σ(y', d+) - σ(y', d-)
16.     else
17.         σ(y', Δ) = σ(y', D) + σ(y', d+) - σ(y', d-)
18.     if σ(y', Δ) ≥ σ_min
19.         C = C ∪ y'
20. return C
Example 2. Let us consider the example in Figure 3.1, where I = {1, 2, 3, 4, 5}. Suppose
σ_min = 0.5. The figure shows D, d+, F(0.5, D) and F(0.5, Δ). When mining F(0.5, Δ), {1, 2, 3,
4, 5, 12, 13, 14, 15, 23, 24, 25, 34, 35, 45, 123, 124, 234, 235} are the retained itemsets and their
support counts in Δ are computed over Γ and d+. {125, 245} are the emerged itemsets and their
support counts in Δ are computed over D and d+. Itemsets {12, 34, 35, 45, 123, 124, 125, 234,
235, 245} are processed during step 1 (i.e., bottom-up search), while itemsets {13, 14, 15, 23, 24,
...} are processed during step 2 (i.e., top-down search).
Figure 3.1: Incremental Frequent Itemset Mining Example.
3.2.1 iZigZag Features

There are basically five features that distinguish iZigZag from other approaches: novel
incremental techniques, novel pruning techniques, the ability of varying σ_min across data updates,
the implementation of short-term mining, and the dynamic selection of itemsets to be updated.

Feature 1. [Novel Incremental Techniques] Almost all previous incremental
techniques [8, 9, 11, 26] use the negative border to optimize the computation of the new collection of
frequent itemsets. The negative border consists of all the itemsets that are infrequent candidates.
Maintaining the negative border can be very useful for reducing data scans, because it contains
the "closest" infrequent itemsets in the original database that can become frequent in the
updated database. However, the negative border can be huge, and its maintenance is very memory
consuming and not well adapted for very large databases [24].
Feature 2. [Novel Pruning Techniques using Γ] An effective algorithm must keep
the number of candidates it generates small: the smaller the number
of candidates generated, the faster the search would be. Two general principles for an efficient search
strategy are that:

1. It is more efficient to choose the next branch to explore to be the one whose combine set
has the fewest items. This usually minimizes the number of candidates generated.

2. If we are able to remove a node as early as possible from the backtrack search tree we
effectively prune many branches from consideration.
Reordering the elements in the current combine set to achieve these two goals is a very
effective means of cutting down the search space for MF(σ_min, Δ). The basic heuristic is to sort
the combine set in increasing order of support; it is likely to produce small combine sets in the
next level, since the items with lower support are less likely to produce frequent itemsets. By
using Γ, iZigZag can continuously sort the elements in the combine sets generated during the
search, since it has free access to σ(X, D) of possibly a large portion of the itemsets X that
can be generated in the search, potentially capturing as early as possible some changes on the
dependences between the candidate and its respective combine set. This feature allows iZigZag
to get an excellent ordering of elements to produce smaller branches. In fact, iZigZag uses Γ
to compute correlation measures between a candidate and items in the combine set, instead of
simply reordering on support. Correlation can be used to generate statistical dependences for
both the presence and absence of items (i.e., items of the combine set) in a candidate itemset,
and its value can be computed by σ(X ∪ x, D) / (σ(X, D) × σ(x, D)), where X is a candidate and x is an item in
the combine set. If one sorts the combine set in increasing order of correlation, smaller combine
sets are produced at the next level, leading to a high degree of pruning by reducing the number
of candidates generated, since the next choice of a branch to explore in the backtrack search is
likely to be a good approximation of the best choice while mining Δ. In summary, the main idea
behind the efficiency of the search for MF(σ_min, Δ) employed by iZigZag stems from the fact
that it eliminates branches that are subsumed by an already mined maximal frequent itemset,
and it is very likely that the maximal frequent itemsets that are generated earlier subsume a
large number of candidate itemsets that would be generated if the order in which these maximal
frequent itemsets were generated were different.
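The reordering heuristic can be sketched as follows, with a dict of cached support counts standing in for Γ (the dict contents and function name are illustrative assumptions):

```python
def reorder_combine_set(X, combine, sigma):
    """Sort combine-set items by σ(X∪x) / (σ(X)·σ(x)), ascending: items with
    low correlation are explored first and tend to yield smaller combine sets."""
    def correlation(x):
        return sigma[X | {x}] / (sigma[X] * sigma[frozenset({x})])
    return sorted(combine, key=correlation)

# cached support counts, as they would be available from Γ
sigma = {frozenset({"a"}): 10, frozenset({"b"}): 8, frozenset({"c"}): 6,
         frozenset({"a", "b"}): 8, frozenset({"a", "c"}): 2}
order = reorder_combine_set(frozenset({"a"}), ["b", "c"], sigma)
```

Here "c" is explored before "b" because σ(ac)/(σ(a)·σ(c)) ≈ 0.033 is smaller than σ(ab)/(σ(a)·σ(b)) = 0.1.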
Feature 3. [Flexible Parameter Modification] Parameter modification can mean
reducing or increasing σ_min across data updates. When σ_min is increased, the search space is
reduced. In this case, the new collection of frequent itemsets can be obtained by simply filtering
out the itemsets stored in Γ that do not satisfy the new minimum support threshold. If d+ = ∅
and d- = ∅, there is no need for data scans and the filtering is sufficient, because the new collection
is a subset of Γ. On the other hand, when σ_min is reduced, the search space is expanded. In this case, Γ is not sufficient to determine the
new collection of frequent itemsets, and data scans are required to find those additional frequent
itemsets. However, even when σ_min is reduced, the number of data scans can be dramatically
reduced by using Γ.
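When σ_min is raised with no data update, the filtering amounts to a one-line pass over the cached results; a sketch with Γ represented as a dict from itemset to support count (an illustrative representation, not the thesis data structure):

```python
def raise_min_support(gamma, new_min_count):
    """No data scan needed: the new collection is the subset of Γ that
    still satisfies the raised minimum support count."""
    return {X: s for X, s in gamma.items() if s >= new_min_count}

gamma = {frozenset({"a"}): 10, frozenset({"b"}): 3, frozenset({"a", "b"}): 3}
filtered = raise_min_support(gamma, 5)   # only {a} survives the raised threshold
```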
Feature 4. [Short-Term Data Mining] Short-term data mining considers just a recent
portion of the transactions for determining F(σ_min, Δ). That is, if d+ contains n transactions,
then the oldest n transactions from D are discarded. Short-term mining is easily implemented by
iZigZag, since it is exactly the case when d+ has n transactions, and d- contains the oldest n
transactions of D.
Feature 5. [Selective Updates] An important interactive feature is the ability to detect
isolated changes in evolving data. An algorithm with this feature may control the cost-to-accuracy
ratio, generating F′(σ_min, ε_max, Δ) at a fraction of the execution time required for generating
F(σ_min, Δ). iZigZag can detect which itemsets X can have σ′(X, Δ) well estimated. iZigZag
categorizes three types of itemsets:

Invariant: The support of X does not change significantly from D to Δ (i.e., σ(X, D) is similar
to σ(X, Δ)).

Predictable: It is possible to estimate σ′(X, Δ) within a tolerance. Several approximation tools
may be employed, from a traditional linear estimation to sophisticated time-series analysis
algorithms. σ(X, Δ) presents some kind of trend [4], that is, it increases or decreases in a
systematic way as the database is updated.

Unpredictable: It is not possible, given a set of approximation tools, to obtain a good estimate
of σ′(X, Δ)¹.
Figure 3.2 shows the correlation of two collections of itemsets. These itemsets are ranked
by support and their relative positions are compared. When the collection of itemsets is totally
accurate, all itemsets are in the correct position. From Figure 3.2 we can see a comparison of an
accurate collection and an out-dating one. As we can see, although there are significant changes
in the support of some itemsets, there are also a large number of itemsets which remain accurate,
in the correct position, and do not need to be updated (i.e., the invariant itemsets, forming the
diagonal), and also a large number of itemsets whose support value has evolved in a systematic
way, following some kind of trend (i.e., the predictable itemsets, forming waves). There exists a
major difference between invariant and predictable itemsets. If there is a large number of invariant
itemsets in D, then F(σ_min, D) will remain accurate as the database is updated. On the other
hand, if there is a large number of predictable itemsets, then F(σ_min, D) will lose accuracy, but
we can make good estimates of the support of these itemsets and generate a highly accurate
approximate collection of frequent itemsets.

¹ The desire, of course, is to have no unpredictable itemsets, and the search for tools that better estimate
the support of itemsets is probably endless, and is out of the scope of this thesis. Our belief is that it is not
worth employing sophisticated tools, since the cost of generating F′(σ_min, ε_max, Δ) may approach or
surpass the cost of generating F(σ_min, Δ).
[Figure 3.2 comprises three panels, one per snapshot (after 10K, 50K and 100K transactions), each plotting the Outdating Ranking against the Correct Ranking of itemsets ranked by support.]
Figure 3.2: Invariant, Predictable and Unpredictable Itemsets.
We divide the estimation approach into two phases. The first phase samples the tidsets
associated with the 1-itemsets whose union results in the itemset whose support we want to estimate.
The second phase analyzes the sampling in order to determine whether it is necessary to count
the actual support of the itemset. These phases are described in the following:

Phase 1. [Support Sampling] The starting point of the support sampling are the
tidsets associated with 1-itemsets, which are always up-to-date since these tidsets are simply
augmented by the novel and obsolete transactions. Given L(x, Δ)², we define the binset
S(x, Δ) = {b_x1, b_x2, ..., b_xn} as a set of n bins, where each bin b_xi has the support count
of x until the i-th (1 ≤ i ≤ n) partition of Δ. Note that S(x, Δ) can be easily built since
L(x, Δ) is always chronologically ordered by construction. Now, given Z = x ∪ y and their
respective binsets S(x, Δ) = {b_x1, b_x2, ..., b_xn} and S(y, Δ) = {b_y1, b_y2, ..., b_yn}, we build
S′(Z, Δ) = {b_Z1, b_Z2, ..., b_Zn}, which is a set of n bins, where each bin b_Zi has the smallest
value from b_xi and b_yi. Formally, the support sampling of Z is based on estimating the
upper bound on the merge of the binsets S(x, Δ) and S(y, Δ). The binsets for longer itemsets
are built iteratively, by using the same approach.
² Note that L(x, Δ) = (L(x, D) ∪ L(x, d+)) - L(x, d-).
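The binset construction of Phase 1 can be sketched as follows (the partition boundaries and tids are illustrative):

```python
def binset(tidset_sorted, boundaries):
    """b_xi = support count of x up to the i-th partition of Δ;
    boundaries[i] is the last tid of partition i (inclusive)."""
    return [sum(1 for t in tidset_sorted if t <= b) for b in boundaries]

def merge_binsets(s_x, s_y):
    """Binset of Z = x ∪ y: elementwise minimum, an upper bound on σ(Z) per
    partition (x and y cannot co-occur more often than either occurs alone)."""
    return [min(a, b) for a, b in zip(s_x, s_y)]

S_x = binset([1, 2, 5, 7], boundaries=[2, 5, 10])   # counts of x per partition
S_y = binset([2, 3, 4, 9], boundaries=[2, 5, 10])   # counts of y per partition
S_xy = merge_binsets(S_x, S_y)                      # upper bound for Z = x ∪ y
```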
Phase 2. [Support Estimation] Trend detection is a valuable tool to predict support
count values in the context of evolving databases. A widespread trend detection technique
is linear regression. The model used by the linear regression is expressed as the function
σ′(X, Δ) = σ(X, D) + b·x, where x is the block size, and b is the slope of the line that represents
the linear relationship between x and σ′(X, Δ). The input for the linear regression for an
itemset X is the set of points in its binset S′(X, Δ), generated in the support sampling
phase. The method of least squares determines the value of b that minimizes the sum of
the squares of the errors, and it is widely used for generating linear regression models.
To verify how good the model generated by the linear regression is, we must estimate the
goodness-of-fit, R²(X, Δ), which represents the proportion of variation in the dependent
variable that has been explained or accounted for by the regression line. This R²(X, Δ)
indicator ranges in value from 0 to 1 and reveals how closely σ′(X, Δ) correlates to σ(X, Δ).
Whenever an itemset is invariant or predictable (i.e., R²(X, Δ) ≥ ε_max), its support can be
simply predicted using the linear regression model, rather than computed with expensive
data scans, providing extraordinary savings in computational and I/O requirements [33].
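Phase 2 reduces to an ordinary least-squares fit plus a goodness-of-fit check; a self-contained sketch (the point set is illustrative, not from the thesis experiments):

```python
def fit_trend(points):
    """Least-squares line through the binset points (x = block index,
    y = support count); returns the slope b and the goodness-of-fit R^2."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx                                   # least-squares slope
    a = my - b * mx                                 # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for _, y in points)
    r2 = 1.0 - ss_res / ss_tot                      # proportion of variation explained
    return b, r2

b, r2 = fit_trend([(1, 12), (2, 14), (3, 16), (4, 18)])  # perfectly linear trend
```

For the perfectly linear points above, R² = 1, so the itemset would be classified as predictable and its support extrapolated instead of recounted.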
3.3 Evaluation

In this section we evaluate the cost-benefit trade-offs iZigZag features imply in terms of
performance and accuracy. Our experimental evaluation was carried out on a dual Pentium III 1
GHz with 1GB of main memory running Red Hat Linux 7.1. We assume that updates happen in
a periodic fashion. We employed four basic parameters given as input to iZigZag: σ_min, ε_max,
|d+| and |d-|. Thus, for each σ_min employed, we performed multiple executions of iZigZag,
where each execution employs a different combination of ε_max and different block sizes. Further,
we employed two basic metrics³ in our evaluation:

1. Execution Time: It is the total elapsed time to find an exact/approximate collection of
frequent itemsets. Timings are all based on wall clock time.

2. Accuracy: It is the correlation between F′(σ_min, ε_max, Δ) and F(σ_min, Δ). The ranking
criterion is the support: F′(σ_min, ε_max, Δ) and F(σ_min, Δ) are totally correlated if both
collections have the same length, and the same itemset appears in corresponding positions
in both collections.
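One plausible way to instantiate this accuracy metric is the fraction of rank positions at which the two support-ranked collections agree; the exact formula is not spelled out in the text, so this position-match fraction is an assumption for illustration:

```python
def accuracy(approx_ranked, exact_ranked):
    """Fraction of rank positions at which the support-ranked approximate and
    exact collections agree; 1.0 iff both have the same length and every
    itemset sits in its correct position (assumed metric, see lead-in)."""
    if len(approx_ranked) != len(exact_ranked):
        return 0.0
    hits = sum(1 for a, e in zip(approx_ranked, exact_ranked) if a == e)
    return hits / len(exact_ranked)

acc = accuracy(["12", "13", "23"], ["12", "23", "13"])  # only position 0 matches
```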
Experiment 1. [Data Updates] The first set of experiments conducted was to empirically
verify the execution time improvements provided by our incremental algorithm. We have two
main objectives: to understand how σ_min and the block size affect iZigZag's performance,
and to compare iZigZag against a state-of-the-art algorithm. Figure 3.3 shows the execution
time results obtained by applying different σ_min and incremental block sizes (300,000, 600,000
and 1,200,000 transactions) using the WCup database. This experiment mimics the reality in
some practical warehouses, where a large block of data is periodically added to the database
(i.e., a block of 300,000 transactions is collected in approximately 6 days). As expected, smaller
execution times are observed for smaller block sizes, since fewer transactions are processed in an
operation. Nevertheless, the high update frequency may result in higher overall costs. In fact,
the best update interval also depends on σ_min and will vary as the database grows, as we can
see in Figure 3.4. We can also observe that after a given number of transactions is added to the
database, the execution time of processing increments tends to stabilize (mostly for smaller block
sizes). This is due to the large proportion of retained itemsets, as can be seen in Figure 3.6,
which means that a large number of operations was performed just over d+, greatly reducing I/O
costs and explaining the constant execution time observed, since d+ has a fixed size. Figure 3.5
shows the speedup numbers of iZigZag. Note that the speedup is in relation to mining from
scratch. As expected, the speedup is inversely proportional to the block size, because the size
of the new data coming in is smaller. Also note that better speedups are achieved for greater
minimum supports. We observed that, for the databases used in this experiment, the proportion
of retained itemsets (itemsets that are computed by examining only d+ and Γ) is larger for greater
minimum support thresholds.
In the second experiment we compared the performance of iZigZag against ULI [26]⁴.
Figure 3.7 shows the results obtained using the WPortal database. As we can see, the
improvements range from 85% to 90% in general. Further, better runtime improvements are achieved by
larger incremental blocks of data. The improvements are due to the efficient search for F(σ_min, Δ)
employed by iZigZag, which avoids evaluating the entire negative border, and uses efficient
pruning techniques.
[Figure 3.3 comprises two panels plotting Execution Time (secs) against Transactions (x 1000) for block sizes 300K, 600K and 1200K, on WCup with ms = 0.001 and ms = 0.005.]
Figure 3.3: Total Execution Time.
[Figure 3.4 comprises two panels plotting Cumulative Time (secs) against Transactions (x 1000) for block sizes 300K, 600K and 1200K, on WCup with ms = 0.001 and ms = 0.005.]
Figure 3.4: Cumulative Execution Time.
[Figure 3.5 comprises two panels plotting Speedup against Block Size (%): WCup with minimum supports 0.005 and 0.001, and WPortal with minimum supports 0.00005 and 0.0001.]
[Figure 3.6 comprises six panels plotting the Cardinality of candidate, frequent, and retained itemsets against Transactions (x 1000), on WCup with ms = 0.001 and ms = 0.005, for |d+| = 300K, 600K and 1200K.]
Figure 3.6: Proportion of Candidate, Frequent, and Retained Itemsets.
[Figure 3.7 comprises three panels on WPortal with ms = 0.0001, for block sizes 300K, 600K and 1200K: Execution Time (secs) against Transactions (x 1000) for ULI and for ZigZag, and their Relative Execution Time.]
Figure 3.7: Performance Comparison: iZigZag vs. ULI.
Experiment 2. [Modifying the Search Space: Changing σ_min] In this set of
experiments we are interested in investigating (hand-in-hand) drill-down/drill-up interactions
based on parameter change operations, using the WCup database. We varied σ_min (from 0.020
to 0.001), but there are no data updates (i.e., d+ = ∅ and d- = ∅). We compare the iterative
mining operation against the base-line approach, where the entire database is mined from scratch.
In the first experiment, Γ is always composed by the results obtained from the last iteration (i.e., iZigZag uses F(0.020, Δ)
to find F(0.015, Δ), then F(0.015, Δ) to find F(0.010, Δ), then F(0.010, Δ) to find F(0.005, Δ),
and then F(0.005, Δ) to find F(0.001, Δ)). We call this approach hand-in-hand mining. The
graph on the left side shows the improvements obtained using this approach. The savings are
very significant in practice, but they depend on the differences between Γ and F(σ_min, Δ). For
instance, F(0.020, Δ) and F(0.015, Δ) are very similar, and in this iteration iZigZag is able
to save approximately 90% of the execution time, when compared to the base-line approach.
We observed that, for lower σ_min, the differences between Γ and F(σ_min, Δ) are more relevant.
However, using F(0.005, Δ) to find F(0.001, Δ), iZigZag is still able to save approximately
45% of the execution time.

In the second experiment, Γ has a fixed value (i.e., iZigZag uses F(0.020, Δ) to find all other
collections of frequent itemsets). As expected, this approach is not as effective as the
hand-in-hand mining approach, but it still provides improvements that range from 20% to 50%, as shown
by the graph on the right side.
[Figure 3.8 comprises two panels plotting Execution Time (secs) against Minimum Support on WCup: mining from scratch vs. iterative hand-in-hand mining (left), and mining from scratch vs. iterative mining from fixed collections (0.020, 0.015, 0.010, 0.005) (right).]
Figure 3.8: Changing σ_min: Iterative and Hand-in-Hand Operations.
Experiment 3. [Selective Updates] The third set of experiments is concerned with
evaluating the effectiveness of performing selective updates. Our evaluation is based on the gains
in accuracy and in the relative execution time. For each experiment we varied ε_max (0.70,
0.75, 0.80, 0.85, 0.90, 0.95, 1.00), and then we compared both F(σ_min, D) (the "old"
collection) and F′(σ_min, ε_max, Δ) (the approximate collection) to F(σ_min, Δ) (the actual collection).
The gain in accuracy is given by the difference between the accuracy of F(σ_min, D) and the
accuracy of F′(σ_min, ε_max, Δ), and the relative execution time is the elapsed time spent for finding
F′(σ_min, ε_max, Δ) divided by the elapsed time spent for finding F(σ_min, Δ). We employed 5
different block sizes (10,000, 20,000, 30,000, 40,000, and 50,000 transactions). This experiment