Mineração de Padrões Freqüentes em Bases de Dados Distribuídas e Dinâmicas
(Efficient Data Mining for Frequent Itemsets in Evolving and Distributed Databases)

Thesis presented to the Graduate Program in Computer Science of the Universidade Federal de Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

Belo Horizonte
Adriano Alonso Veloso was born in Santos Dumont, Brazil, on March 4th, 1979. He started his studies at Universidade Federal de Minas Gerais, Belo Horizonte, in 1998 and graduated with a Bachelor degree in Computer Science in 2001. He began graduate studies in Computer Science in 2002. He pursued his research in data mining, and parallel and distributed systems, under the direction of Dr. Wagner Meira Jr. He has published in major data mining conferences, and he has won some national academic awards. He has also reviewed for several conferences. Currently, Adriano is working at Smart Price Technology Group, where he uses his data mining skills in several related projects.
Acknowledgements

The last three years in Belo Horizonte were unforgettable. I will always have good memories from my friends Adriano Carvalho, Bruno Diniz, Rodrigo Barra, Flavia Ribeiro, Alex Borges, Gustavo Menezes, Leonardo Rocha, and Eveline Veloso.

Thanks to Dr. Mohammed Zaki, Dr. Virgílio Almeida, and Marcio Bunte de Carvalho, for the collaborations on various papers. Thanks also to Amol Ghoting, Matthew Otey, Liang Chen, Ruoming Jin, and Chao Wang.

Special thanks to Dr. Srini Parthasarathy, who was like a co-advisor on this thesis, and for providing me summer support and valuable time with the data mining research group at the Ohio State University.

Very special thanks to my advisor, Dr. Wagner Meira Jr., for the friendship and for guiding me through this thesis.

The most special thanks go to my wonderful parents, for the amazing love and for the undying support.

Happy reading!
Data Mining is the process of analyzing large databases with the goal of discovering and identifying models and patterns. Currently, a broad variety of applications, such as electronic commerce, web mining, intrusion and fraud detection, etc., are being considered promising niches for the use of data mining techniques. However, such applications, which use distributed and dynamic databases, impose new challenges, among which are the capability of continuous and interactive mining, efficient adaptation of the models to constant changes in the data, proper use of distributed computational resources, and limited access to databases with privacy restrictions. Using traditional data mining techniques in such environments results in excessive communication, waste of computational resources, possible violations of user privacy, and generally fails to provide the interactive response times strictly necessary in a discovery process.

This thesis presents new algorithms for mining frequent patterns (itemsets) in large distributed and dynamic databases. Our new algorithms are able to significantly reduce the computational resources required, and at the same time they effectively support user interactions (modifying/constraining the search space), data modifications (removal/insertion of blocks of data), and guaranteed interactive response times through the use of selective updates (approximate results). Additional highlights of our algorithms include the need for only a small amount of communication (and synchronization), and support for operations over windows of transactions (finding the patterns that exist only during a specific time interval). These features allow our algorithms to mine dynamic and distributed warehouses, and even data streams coming from several sources. An extensive evaluation using dynamic and
Data Mining is the process of automatic analysis of large databases in the hope of discovering previously unsuspected and useful relationships and patterns. Currently, a broad variety of new applications, such as electronic commerce, web mining, network intrusion detection, etc., are emerging as promising domains for data mining. However, such applications with distributed and evolving databases impose new challenges and issues for data mining researchers. Among these issues are the capability of interactive and continuous mining, efficient adaptation of models to constant changes in the data, proper use of distributed resources, and limited access to possibly privacy-sensitive data. Trying to address these issues with traditional algorithms results in high communication overhead, excessive wastage of CPU and I/O resources, privacy violations, and often does not meet the stringent interactive response times essential to an interactive process of discovery.

This thesis presents new algorithms for mining frequent itemsets in large evolving and distributed databases. Our new algorithms provide significant I/O and computational savings when compared against other techniques, and at the same time are able to effectively handle user interactions (modifying/constraining the search space for frequent itemsets), online data updates (removing/inserting blocks of data), and interactive response times through selective updates (approximate/partial results). Some additional highlights of the proposed algorithms include low communication (and synchronization) overhead, and interactive support for windowed operations (computing the existing itemsets over a specific time interval). These features allow our algorithms to mine from evolving data stored in distributed warehouses, to high-speed data streams coming from distributed sources. Extensive experimental evaluation using evolving and distributed data demonstrates the
Contents

List of Figures

1 Introduction
  1.1 Thesis Contributions
    1.1.1 New Incremental and Interactive Techniques
    1.1.2 New Distributed Techniques
    1.1.3 Concrete Actual Applications
  1.2 Thesis Outline

2 Mining Frequent Itemsets
  2.1 Theoretical Foundations and Definitions
  2.2 Computational Complexity
  2.3 Challenges and Related Work
    2.3.1 Algorithms for Frequent Itemset Mining
  2.4 The ZigZag Algorithm
    2.4.1 Hybrid Search
    2.4.2 Data Structure for Support Counting
    2.4.3 Evaluation
  2.5 Conclusions

3 Mining Frequent Itemsets in Evolving Databases
  3.1 Problem Statement and Definitions
  3.2 i-ZigZag: Incremental and Interactive Algorithm
    3.2.1 i-ZigZag Features
  3.3 Evaluation
  3.4 Conclusions

4 Mining Frequent Itemsets in Distributed Databases
  4.1 Problem Statement and Definitions
  4.2 d&i-ZigZag: Distributed Algorithm
    4.2.1 d&i-ZigZag Features
  4.3 Evaluation
  4.4 Conclusions

5 Summary and Future Work
  5.1 Thesis Summary
  5.2 Future Work
List of Figures

2.1 Frequent Itemset Mining Example
2.2 Comparison: Execution Times in Different Situations
3.1 Incremental Frequent Itemset Mining Example
3.2 Invariant, Predictable and Unpredictable Itemsets
3.3 Total Execution Time
3.4 Cumulative Execution Time
3.5 Speedup for Different Incremental Configurations
3.6 Proportion of Candidate, Frequent, and Retained Itemsets
3.7 Performance Comparison: i-ZigZag vs. ULI
3.8 Changing $min_s$: Iterative and Hand-in-Hand Operations
3.9 Trade-off: Accuracy vs. Execution Time
3.10 Gain in Accuracy
3.11 Predictive Accuracy
4.1 Distributed Frequent Itemset Mining Example (Candidate Distribution)
4.2 Distributed Frequent Itemset Mining Example (Data Distribution)
4.3 Comparison: Data vs. Candidate Distribution
4.4 Total Execution Time
4.5 Number of transactions incorporated into …
4.6 Query Response Time using Equal Block Sizes
4.7 Query Response Time using Different Block Sizes
Chapter 1

Introduction

Data Mining is one of the central activities associated with understanding, navigating, and exploiting the world of digital data. It is the mechanized process of discovering interesting patterns and building useful models from large databases. A pattern is usually described as an expression in some language describing a subset of the data. A model is a (statistical) description of the entire database. A frequent itemset is a classical type of pattern, and a collection of frequent itemsets is a compact model of the database.

Formally, the application of data mining procedures is a particular step in the process of discovering useful knowledge from databases. In such a process, some activities are assigned to humans and others to machines. While machines are still far from approaching human abilities in the areas of synthesis of new knowledge and creative insight formation, automating data modeling procedures is a significant niche suitable for computers. Accurate modeling of data is the ultimate form of data understanding and compression. By modeling large databases, we are essentially bringing them down in size to a range that humans can analyze and understand. A collection of frequent itemsets, as a compact model of the data, becomes the ultimate portable data store and can serve to quickly navigate and manipulate large volumes of data. For this reason, building these models (or mining frequent itemsets) is a core data mining task, and several algorithms were already proposed in the literature [1, 3, 6, 12, 15, 30, 42, 45]. Typical applications include supermarket sales and banking, astronomy, particle physics, chemistry, medicine and biology, but, in fact, new applications are mentioned often in the daily press. Some of these new applications have introduced an important new dimension to the problem: distributed sources of evolving (dynamic) data. For example, every day retail chains record millions of transactions coming from different places, telecommunications companies connect thousands of calls from different locations, and popular web sites log millions of hits world wide. In all these applications, distributed databases are continuously being updated with a new block of data at regular time intervals. For large time intervals, we have the common scenario in many distributed data warehouses. For small time intervals, we have high-speed data streams.

Modeling these evolving and distributed databases brings unique opportunities, but it also imposes difficult issues. Some of these issues, like interactivity, proper use of the distributed resources, and limited access to privacy-sensitive data, are nothing less than paramount. Interactivity is often the key for effective data understanding and knowledge discovery. Making proper use of the distributed resources in order to guarantee quick response time is crucial, because a lengthy time delay in response to a user request may disturb the flow of human perception and formation of insight. Addressing these issues when data is distributed and evolving is a challenging task, because changes in the data invalidate the collection of frequent itemsets, and simply using traditional approaches to build the new collection can result in privacy violations and in an explosion in the computational resources required. There is an urgent need for non-trivial algorithms specifically designed for mining frequent itemsets in such evolving and distributed databases.
1.1 Thesis Contributions

In this thesis we propose and evaluate new algorithms for mining frequent itemsets in evolving and distributed databases. Our algorithms make use of an incremental technique that avoids replicating work done before, and at the same time employs new pruning approaches to enhance the mining process. Also, the incremental technique provides much more interactivity than other ones, through features like selective updates and windowed operations. Further, our algorithms make use of a distributed technique which minimizes I/O requirements and communication overhead, being able to mine large geographically distributed databases. Parts of this work have appeared in [13, 21, 22, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]. Our main contributions are summarized next.
1.1.1 New Incremental and Interactive Techniques

Contribution 1. [Reducing Data Scans and New Pruning Approaches] Our technique uses a descriptive summary (incremental information) of the frequent itemsets to generate only frequent subsets, avoiding the generation and testing of many unnecessary candidate itemsets that are usually examined by other techniques. Further, the incremental information is not only used to reduce data scans, but also to employ new pruning approaches, improving the search for frequent itemsets and yielding significant runtime performance improvements [31, 36].
Contribution 2. [Selective Updates] Contrasting with other incremental techniques [8, 9, 11, 26], which generally monitor changes in the evolving database to detect the best moment to update the entire collection of frequent itemsets, we chose instead to perform selective updates [33]; that is, the entire collection is completely updated only when necessary, while approximate collections can be generated at a fraction of the total execution time needed to generate an exact collection. This feature enables the algorithm to postpone a full update operation, increasing its efficiency significantly, especially when mining high-speed data streams.
Contribution 3. [Flexible Parameter Modification] Practical data mining is often a highly iterative and interactive process. The user typically changes the parameters and runs the mining algorithm many times before being satisfied with the final results. This iterative process is very time consuming. Traditional algorithms [8, 26] are unable to take advantage of this iterative process to speed up the current mining operation, wasting time and replicating work. Our algorithm has the flexibility to change the parameters across mining operations, and even across data updates (without work replication). This feature is very effective in facilitating iterative data understanding and knowledge discovery [31].

Contribution 4. [Short-Term Mining] We present a novel kind of interactivity, called short-term mining (or windowed transactional operations), that keeps the collection of frequent itemsets coherent with respect to the most recent data [31]. Short-term mining is desirable for applications such as web mining, where the user behavior may vary significantly across time and old access patterns may no longer be relevant.
1.1.2 New Distributed Techniques

Contribution 5. [Low Communication and Synchronization Overhead] Current parallel and distributed mining algorithms suffer from high communication and synchronization overhead. Some of them need to synchronize several times during the execution, incurring poor parallel improvements [2, 10, 14, 16]. Others assume a high-speed network and perform excessive communication operations [34, 39, 46]. We present a distributed technique which minimizes communication requirements and performs only one synchronization operation. These features make our distributed technique capable of mining over a wide area network and grid-space [22, 35, 37].

Contribution 6. [Privacy-Preserving Communication Mechanism] Privacy concerns may prevent data movement in a distributed environment: data may be distributed among several custodians or sites, none of which is allowed to transfer its data to another site [17]. We developed a communication mechanism [35] to ensure privacy in the communication between the sites involved in the mining operation.
1.1.3 Concrete Actual Applications

Contribution 7. [Real-life Applications] The effectiveness of our approaches is ap…

1.2 Thesis Outline

This thesis is divided into 5 chapters. The remainder of this thesis is organized as follows.
Chapter 2. [Mining Frequent Itemsets] We present the basic theoretical foundations for frequent itemset mining, as well as basic algorithms. We introduce a new algorithm that performs an efficient search for frequent itemsets. We evaluate and compare our algorithm against traditional ones.

Chapter 3. [Mining Frequent Itemsets in Evolving Databases] We introduce and evaluate new incremental algorithms and interactive tools for frequent itemset mining in evolving databases.

Chapter 4. [Mining Frequent Itemsets in Distributed Databases] We introduce and evaluate new distributed algorithms for frequent itemset mining when the evolving database is distributed.

Chapter 5. [Summary and Future Work] We summarize our main contributions and discuss directions for future work.
Chapter 2

Mining Frequent Itemsets

In this chapter we describe the basic theoretical foundations and definitions that are necessary to understand the frequent itemset mining problem. We also present its computational complexity, and the main challenges and research surrounding this problem. We finish this chapter presenting and evaluating a novel algorithm for frequent itemset mining.

2.1 Theoretical Foundations and Definitions

In this section we present the foundations and definitions that form the basis of the frequent itemset mining problem.
Definition 1. [Itemsets] For any set $X$, its size is the number of elements in $X$, denoted $|X|$. Let $I$ denote the set of $n$ natural numbers $\{1, 2, \ldots, n\}$ ($n$ is the dimensionality of $I$). Each $x \in I$ is called an item. A non-empty subset of $I$ is called an itemset. The power set of $I$, denoted by $P(I)$, is the set of all possible subsets of $I$. An itemset of size $k$, $X = \{x_1, x_2, \ldots, x_k\}$, is called a $k$-itemset (for convenience we drop set notation and denote $X$ as $x_1 x_2 \ldots x_k$). For $X, Y \in P(I)$ we say that $X$ contains $Y$ if $Y \subseteq X$. A set (of itemsets) $C \subseteq P(I)$ is a collection of itemsets.
Definition 2. [Transactions] A transaction $T_i$ is an itemset, where $i$ is a natural number called the transaction identifier, or simply tid. A transaction database $D = \{T_1, T_2, \ldots, T_m\}$ is a finite set of transactions, with size $|D| = m$. The absolute support of an itemset $X$ in $D$ is the number of transactions in $D$ that contain $X$, given as $\sigma(X, D) = |\{T_i \in D \mid X \subseteq T_i\}|$. The (relative) support of an itemset $X$ in $D$ is the fraction of transactions in $D$ that contain $X$, given as $s(X, D) = \sigma(X, D) / |D|$.
Definition 3. [Frequent and Maximal Frequent Itemsets] An itemset $X$ is frequent if $s(X, D) \geq min_s$, where $min_s$ is a user-specified minimum-support threshold. A frequent itemset $X \in F(min_s, D)$ is maximal if it has no frequent superset. A collection of maximal frequent itemsets is denoted as $MF(min_s, D)$. No itemset in $MF(min_s, D)$ contains another; that is, $X, Y \in MF(min_s, D)$ and $X \neq Y$ implies $X \not\subseteq Y$.
Lemma 1. Any subset of a frequent itemset is frequent: $X \in F(min_s, D)$ and $Y \subseteq X$ implies $Y \in F(min_s, D)$. Thus, by definition, a frequent itemset must be a subset of at least one maximal frequent itemset [3].

Lemma 2. $MF(min_s, D)$ is the smallest collection of itemsets from which $F(min_s, D)$ can be inferred [25].

Problem 1. [Mining Frequent Itemsets] Given $min_s$ and a transaction database $D$, the problem of mining frequent itemsets is to find $F(min_s, D)$.
2.2 Computational Complexity

Next we highlight the graph-theoretic view of the problem and provide some insight into the computational complexity of mining frequent itemsets.

Definition 4. [Bipartite Graphs] A graph $G = (U, V, E)$ has two distinct vertex sets $U$ and $V$, and an edge set $E = \{(u, v) \mid u \in U$ and $v \in V\}$. A complete bipartite subgraph $I \times T$ is called a bipartite clique, and is denoted as $K_{i,t}$, where $|I| = i$, $|T| = t$, and $I \subseteq U$, $T \subseteq V$.

The input database for the frequent itemset mining problem is essentially a very large bipartite graph, with $U$ as the set of items, $V$ as the set of transactions (or simply the set of tids), and each (item, tid) pair as an edge. The problem of enumerating all (maximal) frequent itemsets corresponds to the task of enumerating all (maximal) constrained bipartite cliques $K_{i,t}$, where $t \geq |D| \cdot min_s$.
Theorem 1. The problem of determining whether a bipartite graph $G = (U, V, E)$, with $|U| = |V| = n$, contains a bipartite clique $K_{k,k}$ is NP-Complete [40].

Determining the existence of a (maximal) bipartite clique (itemset), with restrictions on the size of $|I| = i$ (items) and $|T| = t$ (support), such that $i + t \geq k$ (with constant $k$), is in P, the class of problems that can be solved in polynomial time. On the other hand, the problem of whether there exists a maximal bipartite clique such that $i + t = k$ is NP-Complete [18], for which no polynomial time algorithm is known to exist. While the problems in the class NP ask whether a desired solution exists, the problems in the class #P ask how many solutions exist. Counting problems that correspond to NP-Complete problems are #P-Complete. The following theorem indicates the hardness of the counting version:

Theorem 2. Determining the number of (maximal) bipartite cliques in a bipartite graph is #P-Complete [18].

The complexity results shown above are quite pessimistic, and apply to general bipartite graphs. We should therefore focus on special cases where we can find polynomial time solutions. Fortunately, for frequent itemset mining in practice, the bipartite graph (database) is very sparse, and we can in fact obtain linear complexity in the graph (database) size [45].
2.3 Challenges and Related Work

We outline some of the current challenges and research regarding frequent itemset mining. This list is not exhaustive, and it is intended to give the reader a feel for the types of problems we wrestle with.

Challenge 1. [Huge Databases] Databases with millions of transactions and gigabyte size are quite commonplace. Methods for dealing with large data volumes include more effective algorithms [3, 30, 45], sampling [19, 44], approximation methods [33], and massively parallel processing [27, 34, 37, 46].

Challenge 2. [High Dimensional Databases] Not only is there often a very large number of transactions in the database, but there can also be a very large number of items, so that the dimensionality of the problem is high. A high dimensional database creates problems in terms of increasing the size of the search space in a combinatorially explosive way. Approaches to this problem include more efficient pruning techniques [6, 12, 42] and methods to reduce the actual dimensionality of the problem (i.e., condensed representations) [5, 7].
Example 1. Let us consider the example in Figure 2.1, where $I$ = {1, 2, 3, 4, 5} and the figure shows $D$ and $P(I)$. Suppose $min_s$ = 0.4 (40%). $F(0.4, D)$ is composed of the shaded and bold itemsets, while $MF(0.4, D)$ is composed of the bold itemsets. Note that $|MF(0.4, D)|$ is much smaller than $|F(0.4, D)|$.

A naive approach to find $F(0.4, D)$ is to first compute $s(X, D)$ for each $X \in P(I)$, and then return only those $X$ such that $s(X, D) \geq 0.4$. This approach is inappropriate because:

- If $|I|$ (the dimensionality) is high, then $|P(I)|$ is huge (i.e., $2^{|I|}$).
- If $|D|$ (the size) is large, then computing $s(X, D)$ for all $X \in P(I)$ is infeasible.

Thus, the space of itemsets is often huge, and the enumeration of itemsets involves different forms of search in this space. Practical computational constraints place severe limits on this search. The key observation is that if an itemset is infrequent, we do not need to generate any of its supersets, since they must also be infrequent. This simple pruning strategy was first used in [3], and it greatly reduces the number of candidate itemsets generated. Several different searches can be applied by using this pruning strategy.
Figure 2.1: Frequent Itemset Mining Example.
Challenge 3. [Evolving Data] Rapidly changing (non-stationary) data may make previously discovered frequent itemsets invalid. Possible solutions include incremental [8, 11, 26, 28, 31, 36, 38] and model quality extension techniques [29].

Challenge 4. [Distributed Data] The issues concerning modern applications are not only the size and dimensionality of the data, but also its distributed nature. Modern applications may have their databases logically and physically located at different (distant) places, and possibly they may be willing to share their data mining models, but not their data. In all these cases what is needed is a decentralized approach to data mining [22, 35, 37, 39].
2.3.1 Algorithms for Frequent Itemset Mining

In this section we will briefly describe the main algorithms for frequent itemset mining. Our goal is not to go too much into detail, but to show the basic principles and the differences between the algorithms. Although they share basic ideas, they fundamentally differ in certain aspects, such as the search employed and the way they count the support of the candidate frequent itemsets.

Algorithm 1. [Apriori] Apriori exploits the fact that all subsets of a frequent itemset must also be frequent; thus infrequent itemsets are quickly pruned out from consideration. Apriori performs a breadth-first search for $F(min_s, D)$, allowing an efficient candidate generation, since all frequent itemsets at a given level are known before starting to count candidates at the next level. The strategy for generating candidates works as follows: first, a set of candidates is set up; second, the algorithm scans all transactions and increments the counters of the candidates. That is, for each set of candidate itemsets one pass over all transactions is necessary. With a breadth-first search such a pass is needed for each level of the search space. In contrast, a depth-first search would make a separate pass necessary for each class that contains at least one candidate. The costs for each pass over the whole database would contrast with the relatively small number of candidates counted in the pass.

Although Apriori employs an efficient candidate generation, it is extremely I/O intensive, requiring as many passes over the database as the size of the longest candidate generated. This process of support counting may be acceptable for sparse databases with short-sized transactions. However, when the itemsets are long-sized, Apriori may require substantial computational effort. Thus, the success of Apriori critically relies on the fact that the lengths of the frequent itemsets in the database are typically short.
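The level-wise strategy just described can be sketched in a few lines of Python. This is an illustrative simplification (in-memory transactions, absolute support threshold, our own names), not the original Apriori implementation; note the single pass over $D$ per level and the subset-based candidate pruning:

```python
from itertools import combinations

def apriori(D, min_sigma):
    """Level-wise (breadth-first) search: one database pass per level.
    D is a list of transaction sets; min_sigma is the absolute support threshold."""
    # Level 1: count items in a single pass.
    counts = {}
    for T in D:
        for x in T:
            counts[frozenset([x])] = counts.get(frozenset([x]), 0) + 1
    level = {X for X, c in counts.items() if c >= min_sigma}
    frequent = {X: c for X, c in counts.items() if c >= min_sigma}
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori property).
        candidates = {X | Y for X in level for Y in level if len(X | Y) == k}
        candidates = {C for C in candidates
                      if all(frozenset(S) in level for S in combinations(C, k - 1))}
        # One pass over D counts all level-k candidates at once.
        counts = {C: sum(1 for T in D if C <= T) for C in candidates}
        level = {C for C, c in counts.items() if c >= min_sigma}
        frequent.update((C, counts[C]) for C in level)
        k += 1
    return frequent

D = [{1, 2, 5}, {2, 3}, {1, 3}, {1, 2, 3}, {1, 2, 3, 5}]
F = apriori(D, min_sigma=2)
```

The `while` loop makes the I/O behavior explicit: the database is rescanned once per level, which is exactly why long frequent itemsets hurt Apriori.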
Algorithm 2. [ECLAT] ECLAT performs a depth-first search for $F(min_s, D)$, which is based on the concept of equivalence classes [45]. Two $k$-itemsets belong to the same equivalence class if they share a common ($k$-1)-length prefix. Each equivalence class may be processed independently; thus ECLAT divides the search space for $F(min_s, D)$ into smaller sub-spaces, where each one corresponds to an equivalence class and is processed independently in main memory, minimizing the number of secondary memory operations. The depth-first search starts from 1-itemsets and keeps generating candidates within the same equivalence class until $F(min_s, D)$ is found.
Algorithm 3. [FP-Growth] FP-Growth [15] is an interesting method that finds $F(min_s, D)$ using a frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compact information about the frequent itemsets. Better performance is achieved with three techniques:

1. The database is compressed into a condensed data structure, which avoids repeated database scans.
2. FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets.
3. A divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining frequent itemsets confined in conditional databases, which dramatically reduces the search space.
2.4 The ZigZag Algorithm

In this section we present ZigZag, a new algorithm for frequent itemset mining. ZigZag has two features that distinguish it from other algorithms:

1. It employs a hybrid search for frequent itemsets. This search first finds the maximal frequent itemsets and then it generates only frequent subsets. This choice greatly reduces the number of candidates generated.
2. It employs efficient data structures for support counting that greatly reduce the number of data scans.

Next we describe these two features and present an experimental evaluation comparing ZigZag against other algorithms.
2.4.1 Hybrid Search

Almost all algorithms for mining frequent itemsets use the same procedure: first a set of candidates is generated, next infrequent ones are pruned, and only the frequent ones are used to generate the next set of candidates. Clearly, an important issue in this task is to reduce the number of candidates generated. An interesting approach to reduce the number of candidates is to first mine $MF(min_s, D)$. Once $MF(min_s, D)$ is found, it is straightforward to obtain all frequent itemsets (and their support counts) in a single database scan, without generating infrequent (and unnecessary) candidates. The number of candidates generated to find $MF(min_s, D)$ is much smaller than the number of candidates generated to directly find $F(min_s, D)$.
Step 1. [Backtrack Search for $MF(min_s, D)$] An efficient search for $MF(min_s, D)$ must have the ability to quickly remove large branches of $P(I)$ from consideration. This property is associated with the number of candidates generated during the search: the smaller the number of candidates generated, the faster the search will be. Our algorithm employs a backtrack search to find $MF(min_s, D)$. Backtrack algorithms are useful for many combinatorial problems where the solution can be represented as a set $X = \{x_0, x_1, \ldots\}$, where each $x_j$ is chosen from a finite possible set, $S_j$. Initially $X$ is empty; it is extended one item at a time, as the search space (i.e., $P(I)$) is traversed. The length of $X$ is the same as the depth of the corresponding node in the search tree. Given a $k$-candidate itemset $X = \{x_0, x_1, \ldots, x_{k-1}\}$, the possible values for the next item $x_k$ come from a subset $R_k \subseteq S_k$ called the combine set. If $y \in S_k - R_k$, then nodes in the subtree with root node $X = \{x_0, x_1, \ldots, x_{k-1}, y\}$ will not be considered by the backtrack algorithm. Since such subtrees have been pruned away from the original search space, the determination of $R_k$ is also called pruning. Each iteration of the algorithm tries extending $X$ with every item $x$ in the combine set, provided the extended itemset is not a subset of any already known maximal frequent itemset. The next step is to extract the new possible set of extensions, $S_{k+1}$, which consists only of items in $R_k$ that follow $x$. The new combine set, $R_{k+1}$, consists of those items in the possible set that produce a frequent itemset when used to extend $X$. Any item not in the combine set refers to a pruned subtree. The backtrack search performs a depth-first traversal of the search space.
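The backtrack step above can be sketched as follows. This is a simplified illustration with our own names and a hypothetical tidset-based support oracle (ZigZag itself uses diffsets, described in Section 2.4.2), not the thesis's implementation:

```python
def backtrack(X, combine, MF, is_frequent):
    """Depth-first backtrack search for maximal frequent itemsets.
    `combine` lists items that frequently extend X; `is_frequent` is an
    abstract support oracle."""
    for i, x in enumerate(combine):
        Xl = X | {x}
        possible = combine[i + 1:]
        # Prune: if Xl plus every remaining extension is covered by a known
        # maximal itemset, this whole subtree yields nothing new.
        if any(Xl | set(possible) <= M for M in MF):
            return MF
        # New combine set: items whose extension of Xl is still frequent.
        new_combine = [y for y in possible if is_frequent(Xl | {y})]
        if not new_combine:
            if not any(Xl <= M for M in MF):
                MF.append(frozenset(Xl))  # Xl has no frequent extension: maximal
        else:
            backtrack(Xl, new_combine, MF, is_frequent)
    return MF

# Hypothetical vertical database: item -> tidset (not the Figure 2.1 data).
L = {1: {1, 3, 4, 5}, 2: {1, 2, 4, 5}, 3: {2, 3, 4, 5}, 5: {1, 5}}

def is_frequent(itemset, min_sigma=2):
    """Support oracle via plain tidset intersection."""
    return len(set.intersection(*(L[x] for x in itemset))) >= min_sigma

MF = backtrack(set(), [1, 2, 3, 5], [], is_frequent)
```

On this toy data the search reports only the two maximal itemsets {1,2,3} and {1,2,5}; every other frequent itemset is a subset of one of them and is pruned rather than enumerated.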
Step 2. [Top-Down Enumeration of $F(min_s, D)$] In the top-down enumeration step each itemset $X \in MF(min_s, D)$ of size $k$ is broken into $k$ subsets of size ($k$-1). This process iterates, generating smaller subsets and computing their support counts, until there are no more subsets to be checked, and $F(min_s, D)$ is found.
2.4.2 Data Structure for Support Counting

Structure 1. [TidSet] Let $L(X, D)$ be the tidset [45] of a $k$-itemset $X$ in $D$ (the set of tids in $D$ in which $X$ has occurred); thus, $|L(X, D)| = \sigma(X, D)$. According to [44], $L(X, D)$ can be obtained by intersecting the tidsets of at least two ($k$-1)-subsets of $X$. For example, in Figure 2.1, $L(123, D)$ = {1, 5, 9, 10} and $L(125, D)$ = {1, 4, 5, 8, 9, 10}. Consequently, $L(1235, D) = L(123, D) \cap L(125, D)$ = {1, 5, 9, 10}. Note that $|L(1235, D)| = \sigma(1235, D) = 4$, and thus $s(1235, D)$ = 0.4. Alternatively, $L(1235, D)$ can also be obtained by $L(1, D) \cap L(2, D) \cap L(3, D) \cap L(5, D)$ = {1, 5, 9, 10}.
Structure 2. [DiffSet] The diffset [41] is an efficient structure for support counting. The main idea is to avoid storing the entire tidset of each candidate. Instead, only the differences between the tidsets of the itemsets $X$ and $Y$ are stored in the structure. The diffset is very short compared to its tidset counterpart, and it is very effective in improving the running time of the counting operation. The diffset of a 2-itemset $Z = x \cup y$ is $H(Z, D) = L(x, D) - L(y, D)$, and $\sigma(Z, D) = \sigma(x, D) - |H(Z, D)|$. The diffset of a $k$-itemset ($k \geq 3$) $Z = X \cup Y$ is $H(Z, D) = H(Y, D) - H(X, D)$, and $\sigma(Z, D) = \sigma(X, D) - |H(Z, D)|$ (where $X$ and $Y$ are ($k$-1)-itemsets). For example, in Figure 2.1, $H(12, D) = L(1, D) - L(2, D)$ = {2, 4} and $\sigma(12, D) = \sigma(1, D) - |H(12, D)| = 10 - 2 = 8$. Similarly, $H(13, D)$ = {3, 6, 7, 8} and $\sigma(13, D) = 10 - 4 = 6$. Consequently, $H(123, D) = H(13, D) - H(12, D)$ = {3, 6, 7, 8} $-$ {2, 4} = {3, 6, 7, 8}, and $\sigma(123, D) = 8 - 4 = 4$, and thus $s(123, D)$ = 0.4.
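The diffset arithmetic above maps directly onto set operations. The sketch below replays it on small hypothetical tidsets (deliberately not the Figure 2.1 data) so the recurrence can be checked against a direct intersection:

```python
# Hypothetical tidsets for items 1, 2 and 3:
L = {1: {1, 3, 4, 5}, 2: {1, 2, 4, 5}, 3: {2, 3, 4, 5}}

# Diffset of a 2-itemset xy: tids where x occurs but y does not.
H12 = L[1] - L[2]                  # H(12) = {3}
sigma_12 = len(L[1]) - len(H12)    # sigma(12) = sigma(1) - |H(12)| = 4 - 1 = 3
H13 = L[1] - L[3]                  # H(13) = {1}
sigma_13 = len(L[1]) - len(H13)    # 4 - 1 = 3

# For k >= 3: H(XY) = H(Y) - H(X) and sigma(XY) = sigma(X) - |H(XY)|,
# where X and Y are the (k-1)-subsets sharing a prefix (here X = 12, Y = 13).
H123 = H13 - H12                   # {1}
sigma_123 = sigma_12 - len(H123)   # 3 - 1 = 2
```

The final count agrees with the direct 3-way intersection `len(L[1] & L[2] & L[3])`, but the stored sets (`H12`, `H13`, `H123`) are much smaller than the tidsets they replace, which is the whole point of the structure.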
Support Counting of 1-itemsets. ZigZag scans each transaction $T_i \in D$, and if $x \in T_i$, $L(x, D)$ is augmented with $i$. At the end of this process, each item $x \in D$ has its tidset $L(x, D)$. Note that $\sigma(x, D) = |L(x, D)|$. Clearly this approach assumes that all tidsets fit simultaneously in main memory. There are basically three alternative approaches for the cases where all tidsets do not fit in main memory at the same time: distributed
Support Counting of $k$-itemsets ($k \geq 2$) in Step 1. As explained earlier, in the first step ZigZag performs a search for $MF(min_s, D)$. In this step ZigZag uses the diffset structure for support counting. The diffsets are propagated from a node to its children, starting from the root. At the first level, ZigZag has access to the tidsets for each item in $D$. At the second level, the diffset of a 2-itemset $Z = x \cup y$ is $H(Z, D) = L(x, D) - L(y, D)$, and $\sigma(Z, D) = \sigma(x, D) - |H(Z, D)|$. At subsequent levels, ZigZag has available the diffsets for each element in the combine set. In this case, the diffset of a $k$-itemset ($k \geq 3$) $Z = X \cup Y$ is $H(Z, D) = H(Y, D) - H(X, D)$, and $\sigma(Z, D) = \sigma(X, D) - |H(Z, D)|$.
Support Counting of k-itemsets (k ≥ 2) in Step 2 It is not possible to use
the diffset propagation during the top-down enumeration, since ZigZag does not have the
differences between the tidsets of the subsets. A naive approach to compute σ(X, D) would
require the intersection of k different tidsets if the size of X is k. The k-way intersection
method employed by ZigZag is dramatically enhanced by the use of an effective cache
that stores intermediate results for future use. The cache is simply implemented through
a hash-table, whose key is composed by the items {x_1, x_2, ..., x_n} that compose a subset X.
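A minimal sketch of this cached k-way intersection, with a plain dict playing the role of the hash-table keyed by the items of the subset (names and toy tidsets are illustrative, not the thesis implementation):

```python
cache = {}

def tidset(items, L):
    """k-way intersection of the items' tidsets, reusing cached prefixes."""
    items = tuple(sorted(items))
    if items in cache:                 # hash-table keyed by the items of X
        return cache[items]
    if len(items) == 1:
        result = L[items[0]]
    else:
        # reuse the cached (k-1)-prefix instead of redoing k-1 intersections
        result = tidset(items[:-1], L) & L[items[-1]]
    cache[items] = result
    return result

L = {1: {1, 2, 3, 5}, 2: {1, 3, 5}, 3: {3, 5, 9}}
sigma = len(tidset([1, 2, 3], L))      # σ(123) via cached intersections
```

A later query for any itemset sharing the prefix (1, 2) now reuses the cached intermediate result instead of intersecting from scratch.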
Next we show the pseudocode for the ZigZag algorithm.

ZigZag(σ_min, D)
1.  Scan D and compute the frequent items (i.e., F_1)
2.  MF = BackTrack(∅, F_1, 0, σ_min)
3.  F = ∅
4.  for each X ∈ MF
5.      TopDown(X)
6.  return F

BackTrack(X_l, C_l, l, σ_min)
1.  for each x ∈ C_l
2.      X_{l+1} = X_l ∪ x
3.      P_{l+1} = {y : y ∈ C_l and y > x}
4.      if X_{l+1} ∪ P_{l+1} has a superset in MF
5.          return
6.      C_{l+1} = Extend(X_{l+1}, P_{l+1}, l, σ_min)
7.      if C_{l+1} == ∅
8.          if X_{l+1} has no superset in MF
9.              MF = MF ∪ X_{l+1}
10.     else BackTrack(X_{l+1}, C_{l+1}, l+1, σ_min)
11. return MF

Extend(X_{l+1}, P_{l+1}, l, σ_min)
1.  C = ∅
2.  for each y ∈ P_{l+1}
3.      y' = y
4.      if l == 0 then d(y', D) = L(X_{l+1}, D) - L(y, D)
5.      else d(y', D) = d(y, D) - d(X_{l+1}, D)
6.      if σ(y', D) ≥ σ_min
7.          C = C ∪ y'
8.  return C

TopDown(X)
1.  if X was not processed yet
2.      σ(X, D) = |L(x_1, D) ∩ L(x_2, D) ∩ ... ∩ L(x_k, D)|
3.      s(X, D) = σ(X, D) / |D|
4.      F = F ∪ {X}
5.      for each (k-1)-subset Y ⊂ X
6.          TopDown(Y)
2.4.3 Evaluation
In this section we compare ZigZag against traditional algorithms: Apriori¹, Eclat², and
FP-Growth³. All experiments were carried out on a dual Pentium III 1GHz with 1GB of
main memory running Red Hat Linux 7.1. The basic metric employed in the comparison is the
total execution time spent to perform the mining operation. We utilize three different databases:
WPortal, WCup, and VBook. These databases are described in the following, and they will
be used in all evaluations throughout this thesis⁴.

Database 1 [WPortal] WPortal represents the access patterns of the largest Brazilian
Web portal. The database was collected in approximately 4 months, and comprises 7,274,382
transactions over 3,183 unique items; each transaction has an average length of 3.2 items.
¹ Downloadable from http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
² Downloadable from http://cs.rpi.edu/~zaki/software
³ Thanks to Bruno Gusmão for implementing the FP-Growth algorithm
Database 2 [WCup] WCup was generated from a 2-month click-stream log of the 1998
World Cup web site, which is publicly available at ftp://researchsmp2.cc.vt.edu/pub/. We scanned
the WCup log and produced a transaction file, where each transaction is a session of access to
the site by a client. Each item in the transaction is a web request. Not all requests were turned
into items; to become an item, the request must have three properties:

1. The request method is GET.
2. The request status is OK.
3. The file type is HTML.

A session starts with a request that satisfies the above properties, and ends when there has
been no click from the client for 30 minutes. All requests in a session must come from the same
client. WCup comprises 6,027,378 transactions over 5,271 unique items.

Database 3 [VBook] VBook was generated from a 2-month web log from a large electronic
bookstore. In this case, each transaction contains the set of books examined, added to the
shopping cart and/or bought by a customer during a single visit to the bookstore. VBook
comprises 141,735 transactions over 8,836 unique items.
Figure 2.2 shows the execution times obtained by the algorithms (Apriori, Eclat,
FP-Growth and ZigZag), using different databases (WPortal, WCup and VBook) and different
minimum support values. All databases fit in the main memory of our computational
environment, and to ensure that the comparison between the algorithms is fair, all algorithms access the
data in main memory, rather than on disk. ZigZag performs better on the WPortal and VBook
databases. For the experiments using the WCup database, ZigZag is the best choice only for
very small minimum support values. The reason is that for greater values of minimum support
the proportion of maximal frequent itemsets is too high, and a direct search for frequent itemsets is better.
Further, in the best case, ZigZag offers one order of magnitude improvement when compared
against Apriori.
2.5 Conclusions

In this chapter we presented the basic foundations and challenges of the frequent itemset mining
problem. We also depicted the main algorithms to solve this problem, and we proposed a new one,
called ZigZag, which first searches for the maximal frequent itemsets and then uses them to guide
a top-down enumeration of all frequent itemsets. We evaluated our algorithm by comparing it against the others.
[Figure 2.2 comprises six panels plotting Time (secs) against Minimum Support for Apriori, FP-Growth, Eclat and ZigZag, over two minimum support ranges each for the WPortal, WCup and VBook databases.]
Figure 2.2: Comparison: Execution Times in Different Situations.
In the next two chapters we will properly modify the ZigZag algorithm for mining evolving and distributed databases.
Chapter 3
Mining Frequent Itemsets in
Evolving Databases
Traditional approaches for frequent itemset mining make the assumption that the database
is static, and a database update requires rediscovering all frequent itemsets by scanning the
entire database. Such approaches are memoryless and replicate work that has already been done,
wasting computational resources. Incremental algorithms re-use the data mining results that
were obtained previously, and combine this information with the fresh data to effectively update
the new collection of frequent itemsets, saving work.
3.1 Problem Statement and Definitions

Problem 2. [Mining Frequent Itemsets in Evolving Databases] Using D as a
starting point, a set of new transactions d+ is added and a set of old transactions d- is removed,
forming Δ (i.e., Δ = (D ∪ d+) - d-). Given σ_min, D, d+, and d-, the problem of mining frequent
itemsets in evolving databases is to find F(σ_min, Δ), given some information about F(σ_min, D).

Definition 5. [Emerged and Retained Itemsets] An itemset X is called an emerged
itemset if it is infrequent in D, but frequent in Δ (i.e., X ∉ F(σ_min, D), but X ∈ F(σ_min, Δ)).
Similarly, X is called a retained itemset if it is frequent in D, and remains frequent in Δ (i.e.,
X ∈ F(σ_min, D) and X ∈ F(σ_min, Δ)).
Definition 6. [Incremental Representation] When mining frequent itemsets in
evolving databases, caching of previous results is obviously useful. Still, having to go back to D
to find F(σ_min, Δ) would replicate work that was already done when mining D. When mining Δ, the
incremental representation Γ is composed by F(σ_min, D) (i.e.,
the frequent itemsets in D along with their support counts).
Definition 7. [Approximate Collection of Frequent Itemsets] We define
F′(σ_min, ε_max, Δ) (where ε_max is a maximum error threshold) as an approximate/partial
collection of frequent itemsets in Δ. Formally, for some itemsets X ∈ F′(σ_min, ε_max, Δ), σ(X, Δ)
has an approximate value, σ′(X, Δ).
3.2 iZigZag: An Incremental and Interactive Algorithm

The basic feature of an incremental mining algorithm is the incremental support counting.
Our algorithm, iZigZag, employs the basic search rationale explained in Section 2.3; the only
difference is the incremental support counting. The idea is that for a given candidate X, σ(X, Δ)
can be computed by σ(X, D) + σ(X, d+) - σ(X, d-). When X is a retained itemset, σ(X, D)
is already stored in Γ. In this case, the need for data scans is greatly reduced, and the support
counting process is enhanced.
Support Counting of 1-itemsets iZigZag scans each transaction T_i ∈ Δ, and if
x ∈ T_i there are three possibilities:

1. If i ∈ d- then both L(x, d-) and L(x, D) are augmented by i.
2. On the other hand, if i ∈ d+, only L(x, d+) is augmented by i.
3. Otherwise, only L(x, D) is augmented by i.

At the end of this process, each item x ∈ Δ has three tidsets, L(x, d-), L(x, d+), and L(x, D).
Note that σ(x, Δ) = |L(x, Δ)| = |L(x, D)| + |L(x, d+)| - |L(x, d-)|.
Support Counting of k-itemsets (k ≥ 2) in Step 1 In step 1, iZigZag uses
the diffset structure for support counting. At the first level of the search, iZigZag has access
to the tidsets for each item in Δ. At the second level, the diffset of a 2-itemset Z = x ∪ y is
d(Z, Δ) = (L(x, D) - L(y, D)) ∪ (L(x, d+) - L(y, d+)) ∪ (L(x, d-) - L(y, d-)), and σ(Z, Δ) =
σ(x, Δ) - |d(Z, Δ)|. At subsequent levels, iZigZag has available the diffsets for each element
in the combine set. In this case, the diffset of a k-itemset (k ≥ 3) Z = X ∪ Y is obtained by
computing d(Y) - d(X) separately over D, d+, and d-.
Support Counting of k-itemsets (k ≥ 2) in Step 2 In step 2, iZigZag employs
the k-way intersection of tidsets explained in the previous chapter. In the incremental case, given
a subset X = x_1 ∪ x_2 ∪ ... ∪ x_n, L(X, Δ) = ((L(x_1, D) ∩ L(x_2, D) ∩ ... ∩ L(x_n, D)) - (L(x_1, d-) ∩
L(x_2, d-) ∩ ... ∩ L(x_n, d-))) ∪ (L(x_1, d+) ∩ L(x_2, d+) ∩ ... ∩ L(x_n, d+)), and σ(X, Δ) = |L(X, Δ)|.

In the following we present the modified procedure for incremental support counting.
Extend(X_{l+1}, P_{l+1}, l, σ_min)
1.  C = ∅
2.  for each y ∈ P_{l+1}
3.      y' = y
4.      if l == 0 then
5.          d(y', d+) = L(X_{l+1}, d+) - L(y, d+)
6.          d(y', d-) = L(X_{l+1}, d-) - L(y, d-)
7.          if y' is an emerged itemset
8.              d(y', D) = L(X_{l+1}, D) - L(y, D)
9.      else
10.         d(y', d+) = d(y, d+) - d(X_{l+1}, d+)
11.         d(y', d-) = d(y, d-) - d(X_{l+1}, d-)
12.         if y' is an emerged itemset
13.             d(y', D) = d(y, D) - d(X_{l+1}, D)
14.     if y' is a retained itemset then
15.         σ(y', Δ) = σ(y') + σ(y', d+) - σ(y', d-)
16.     else
17.         σ(y', Δ) = σ(y', D) + σ(y', d+) - σ(y', d-)
18.     if σ(y', Δ) ≥ σ_min
19.         C = C ∪ y'
20. return C
Example 2. Let us consider the example in Figure 3.1, where I = {1, 2, 3, 4, 5}. Suppose
σ_min = 0.5. The figure shows D, d+, F(0.5, D) and F(0.5, Δ). When mining F(0.5, Δ), {1, 2, 3,
4, 5, 12, 13, 14, 15, 23, 24, 25, 34, 35, 45, 123, 124, 234, 235} are the retained itemsets and their
support counts in Δ are computed over Γ and d+. {125, 245} are the emerged itemsets and their
support counts in Δ are computed over D and d+. Itemsets {12, 34, 35, 45, 123, 124, 125, 234,
235, 245} are processed during step 1 (i.e., bottom-up search), while itemsets {13, 14, 15, 23, 24,
...} are processed during step 2 (i.e., top-down search).
Figure 3.1: Incremental Frequent Itemset Mining Example.
3.2.1 iZigZag Features

There are basically five features that distinguish iZigZag from other approaches: novel
incremental techniques, novel pruning techniques, the ability of varying σ_min across data updates,
the implementation of short-term mining, and the dynamic selection of itemsets to be updated.

Feature 1. [Novel Incremental Techniques] Almost all previous incremental
techniques [8, 9, 11, 26] use the negative border to optimize the computation of the new collection of
frequent itemsets. The negative border consists of all the itemsets that are infrequent candidates.
Maintaining the negative border can be very useful for reducing data scans, because it contains
the "closest" infrequent itemsets in the original database that can become frequent in the
updated database. However, the negative border can be huge, and its maintenance is very memory
consuming and not well adapted for very large databases [24].
Feature 2. [Novel Pruning Techniques using Γ] An effective algorithm must keep
the number of candidates it generates small: the smaller the number
of candidates generated, the faster the search would be. Two general principles for an efficient search
strategy are that:

1. It is more efficient to choose the next branch to explore to be the one whose combine set
has the fewest items. This usually minimizes the number of candidates generated.

2. If we are able to remove a node as early as possible from the backtrack search tree we
effectively prune many branches from consideration.
Reordering the elements in the current combine set to achieve these two goals is a very
effective means of cutting down the search space for MF(σ_min, Δ). The basic heuristic is to sort
the combine set in increasing order of support; it is likely to produce small combine sets in the
next level, since the items with lower support are less likely to produce frequent itemsets. By
using Γ, iZigZag can continuously sort the elements in the combine sets generated during the
search, since it has free access to σ(X, D) of possibly a large portion of the itemsets X that
can be generated in the search, potentially capturing as early as possible some changes on the
dependences between the candidate and its respective combine set. This feature allows iZigZag
to get an excellent ordering of elements to produce smaller branches. In fact, iZigZag uses Γ
to compute correlation measures between a candidate and items in the combine set, instead of
simply reordering on support. Correlation can be used to generate statistical dependences for
both the presence and absence of items (i.e., items of the combine set) in a candidate itemset,
and its value can be computed by σ(X ∪ x, D) / (σ(X, D) × σ(x, D)), where X is a candidate and x is an item in
the combine set. If one sorts the combine set in increasing order of correlation, smaller combine
sets are produced at the next level, leading to a high degree of pruning by reducing the number
of candidates generated, since the next choice of a branch to explore in the backtrack search is
likely to be a good approximation of the best choice while mining Δ. In summary, the main idea
behind the efficiency of the search for MF(σ_min, Δ) employed by iZigZag stems from the fact
that it eliminates branches that are subsumed by an already mined maximal frequent itemset,
and it is very likely that the maximal frequent itemsets that are generated earlier subsume a
large number of candidate itemsets that would be generated if the order in which these maximal
frequent itemsets were generated were different.
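The reordering heuristic can be sketched as follows, with a dict of cached support counts standing in for Γ (the dict contents and function name are illustrative assumptions):

```python
def reorder_combine_set(X, combine, sigma):
    """Sort combine-set items by σ(X∪x) / (σ(X)·σ(x)), ascending: items with
    low correlation are explored first and tend to yield smaller combine sets."""
    def correlation(x):
        return sigma[X | {x}] / (sigma[X] * sigma[frozenset({x})])
    return sorted(combine, key=correlation)

# cached support counts, as they would be available from Γ
sigma = {frozenset({"a"}): 10, frozenset({"b"}): 8, frozenset({"c"}): 6,
         frozenset({"a", "b"}): 8, frozenset({"a", "c"}): 2}
order = reorder_combine_set(frozenset({"a"}), ["b", "c"], sigma)
```

Here "c" is explored before "b" because σ(ac)/(σ(a)·σ(c)) ≈ 0.033 is smaller than σ(ab)/(σ(a)·σ(b)) = 0.1.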
Feature 3. [Flexible Parameter Modification] Parameter modification can mean
reducing or increasing σ_min across data updates. When σ_min is increased, the search space is
reduced. In this case, the new collection of frequent itemsets can be obtained by simply filtering
out the itemsets stored in Γ that do not satisfy the new minimum support threshold. If d+ = ∅
and d- = ∅, there is no need for data scans and the filtering is sufficient, because the new collection
is a subset of Γ. On the other hand, when σ_min is reduced, the search space is expanded. In this case, Γ is not sufficient to determine the
new collection of frequent itemsets, and data scans are required to find those additional frequent
itemsets. However, even when σ_min is reduced, the number of data scans can be dramatically
reduced by using Γ.
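When σ_min is raised with no data update, the filtering amounts to a one-line pass over the cached results; a sketch with Γ represented as a dict from itemset to support count (an illustrative representation, not the thesis data structure):

```python
def raise_min_support(gamma, new_min_count):
    """No data scan needed: the new collection is the subset of Γ that
    still satisfies the raised minimum support count."""
    return {X: s for X, s in gamma.items() if s >= new_min_count}

gamma = {frozenset({"a"}): 10, frozenset({"b"}): 3, frozenset({"a", "b"}): 3}
filtered = raise_min_support(gamma, 5)   # only {a} survives the raised threshold
```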
Feature 4. [Short-Term Data Mining] Short-term data mining considers just a recent
portion of the transactions for determining F(σ_min, Δ). That is, if d+ contains n transactions,
then the oldest n transactions from D are discarded. Short-term mining is easily implemented by
iZigZag, since it is exactly the case when d+ has n transactions, and d- contains the oldest n
transactions of D.
Feature 5. [Selective Updates] An important interactive feature is the ability to detect
isolated changes in evolving data. An algorithm with this feature may control the cost-to-accuracy
ratio, generating F′(σ_min, ε_max, Δ) at a fraction of the execution time required for generating
F(σ_min, Δ). iZigZag can detect which itemsets X can have σ′(X, Δ) well estimated. iZigZag
categorizes three types of itemsets:

Invariant: The support of X does not change significantly from D to Δ (i.e., σ(X, D) is similar
to σ(X, Δ)).

Predictable: It is possible to estimate σ′(X, Δ) within a tolerance. Several approximation tools
may be employed, from a traditional linear estimation to sophisticated time-series analysis
algorithms. σ(X, Δ) presents some kind of trend [4], that is, it increases or decreases in a
systematic way as the database is updated.

Unpredictable: It is not possible, given a set of approximation tools, to obtain a good estimate
of σ′(X, Δ)¹.
Figure 3.2 shows the correlation of two collections of itemsets. These itemsets are ranked
by support and their relative positions are compared. When the collection of itemsets is totally
accurate, all itemsets are in the correct position. From Figure 3.2 we can see a comparison of an
accurate collection and an out-dating one. As we can see, although there are significant changes
in the support of some itemsets, there are also a large number of itemsets which remain accurate,
in the correct position, and do not need to be updated (i.e., the invariant itemsets, forming the
diagonal), and also a large number of itemsets whose support value has evolved in a systematic
way, following some kind of trend (i.e., the predictable itemsets, forming waves). There exists a
major difference between invariant and predictable itemsets. If there is a large number of invariant
itemsets in D, then F(σ_min, D) will remain accurate as the database is updated. On the other
hand, if there is a large number of predictable itemsets, then F(σ_min, D) will lose accuracy, but
we can make good estimates of the support of these itemsets and generate a highly accurate
approximate collection of frequent itemsets.

¹ The desire, of course, is to have no unpredictable itemsets, and the search for tools that better estimate
the support of itemsets is probably endless, and is out of the scope of this thesis. Our belief is that it is not
worth employing sophisticated tools, since the cost of generating F′(σ_min, ε_max, Δ) may approach or
surpass the cost of generating F(σ_min, Δ).
[Figure 3.2 comprises three panels, one per snapshot (after 10K, 50K and 100K transactions), each plotting the Outdating Ranking against the Correct Ranking of itemsets ranked by support.]
Figure 3.2: Invariant, Predictable and Unpredictable Itemsets.
We divide the estimation approach into two phases. The first phase samples the tidsets
associated with the 1-itemsets whose union results in the itemset whose support we want to estimate.
The second phase analyzes the sampling in order to determine whether it is necessary to count
the actual support of the itemset. These phases are described in the following:

Phase 1. [Support Sampling] The starting point of the support sampling are the
tidsets associated with 1-itemsets, which are always up-to-date since these tidsets are simply
augmented by the novel and obsolete transactions. Given L(x, Δ)², we define the binset
S(x, Δ) = {b_x1, b_x2, ..., b_xn} as a set of n bins, where each bin b_xi has the support count
of x until the i-th (1 ≤ i ≤ n) partition of Δ. Note that S(x, Δ) can be easily built since
L(x, Δ) is always chronologically ordered by construction. Now, given Z = x ∪ y and their
respective binsets S(x, Δ) = {b_x1, b_x2, ..., b_xn} and S(y, Δ) = {b_y1, b_y2, ..., b_yn}, we build
S′(Z, Δ) = {b_Z1, b_Z2, ..., b_Zn}, which is a set of n bins, where each bin b_Zi has the smallest
value from b_xi and b_yi. Formally, the support sampling of Z is based on estimating the
upper bound on the merge of the binsets S(x, Δ) and S(y, Δ). The binsets for longer itemsets
are built iteratively, by using the same approach.
² Note that L(x, Δ) = (L(x, D) ∪ L(x, d+)) - L(x, d-).
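The binset construction of Phase 1 can be sketched as follows (the partition boundaries and tids are illustrative):

```python
def binset(tidset_sorted, boundaries):
    """b_xi = support count of x up to the i-th partition of Δ;
    boundaries[i] is the last tid of partition i (inclusive)."""
    return [sum(1 for t in tidset_sorted if t <= b) for b in boundaries]

def merge_binsets(s_x, s_y):
    """Binset of Z = x ∪ y: elementwise minimum, an upper bound on σ(Z) per
    partition (x and y cannot co-occur more often than either occurs alone)."""
    return [min(a, b) for a, b in zip(s_x, s_y)]

S_x = binset([1, 2, 5, 7], boundaries=[2, 5, 10])   # counts of x per partition
S_y = binset([2, 3, 4, 9], boundaries=[2, 5, 10])   # counts of y per partition
S_xy = merge_binsets(S_x, S_y)                      # upper bound for Z = x ∪ y
```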
Phase 2. [Support Estimation] Trend detection is a valuable tool to predict support
count values in the context of evolving databases. A widespread trend detection technique
is linear regression. The model used by the linear regression is expressed as the function
σ′(X, Δ) = σ(X, D) + b·x, where x is the block size, and b is the slope of the line that represents
the linear relationship between x and σ′(X, Δ). The input for the linear regression for an
itemset X is the set of points in its binset S′(X, Δ), generated in the support sampling
phase. The method of least squares determines the value of b that minimizes the sum of
the squares of the errors, and it is widely used for generating linear regression models.
To verify how good the model generated by the linear regression is, we must estimate the
goodness-of-fit, R²(X, Δ), which represents the proportion of variation in the dependent
variable that has been explained or accounted for by the regression line. This R²(X, Δ)
indicator ranges in value from 0 to 1 and reveals how closely σ′(X, Δ) correlates to σ(X, Δ).
Whenever an itemset is invariant or predictable (i.e., R²(X, Δ) ≥ ε_max), its support can be
simply predicted using the linear regression model, rather than computed with expensive
data scans, providing extraordinary savings in computational and I/O requirements [33].
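Phase 2 reduces to an ordinary least-squares fit plus a goodness-of-fit check; a self-contained sketch (the point set is illustrative, not from the thesis experiments):

```python
def fit_trend(points):
    """Least-squares line through the binset points (x = block index,
    y = support count); returns the slope b and the goodness-of-fit R^2."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx                                   # least-squares slope
    a = my - b * mx                                 # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for _, y in points)
    r2 = 1.0 - ss_res / ss_tot                      # proportion of variation explained
    return b, r2

b, r2 = fit_trend([(1, 12), (2, 14), (3, 16), (4, 18)])  # perfectly linear trend
```

For the perfectly linear points above, R² = 1, so the itemset would be classified as predictable and its support extrapolated instead of recounted.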
3.3 Evaluation

In this section we evaluate the cost-benefit trade-offs iZigZag features imply in terms of
performance and accuracy. Our experimental evaluation was carried out on a dual Pentium III 1
GHz with 1GB of main memory running Red Hat Linux 7.1. We assume that updates happen in
a periodic fashion. We employed four basic parameters given as input to iZigZag: σ_min, ε_max,
|d+| and |d-|. Thus, for each σ_min employed, we performed multiple executions of iZigZag,
where each execution employs a different combination of ε_max and different block sizes. Further,
we employed two basic metrics³ in our evaluation:

1. Execution Time: It is the total elapsed time to find an exact/approximate collection of
frequent itemsets. Timings are all based on wall clock time.

2. Accuracy: It is the correlation between F′(σ_min, ε_max, Δ) and F(σ_min, Δ). The ranking
criterion is the support: F′(σ_min, ε_max, Δ) and F(σ_min, Δ) are totally correlated if both
collections have the same length, and the same itemset appears in corresponding positions
in both collections.
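One plausible way to instantiate this accuracy metric is the fraction of rank positions at which the two support-ranked collections agree; the exact formula is not spelled out in the text, so this position-match fraction is an assumption for illustration:

```python
def accuracy(approx_ranked, exact_ranked):
    """Fraction of rank positions at which the support-ranked approximate and
    exact collections agree; 1.0 iff both have the same length and every
    itemset sits in its correct position (assumed metric, see lead-in)."""
    if len(approx_ranked) != len(exact_ranked):
        return 0.0
    hits = sum(1 for a, e in zip(approx_ranked, exact_ranked) if a == e)
    return hits / len(exact_ranked)

acc = accuracy(["12", "13", "23"], ["12", "23", "13"])  # only position 0 matches
```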
Experiment 1. [Data Updates] The first set of experiments conducted was to empirically
verify the execution time improvements provided by our incremental algorithm. We have two
main objectives: to understand how σ_min and the block size affect iZigZag's performance,
and to compare iZigZag against a state-of-the-art algorithm. Figure 3.3 shows the execution
time results obtained by applying different σ_min and incremental block sizes (300,000, 600,000
and 1,200,000 transactions) using the WCup database. This experiment mimics the reality in
some practical warehouses, where a large block of data is periodically added to the database
(i.e., a block of 300,000 transactions is collected in approximately 6 days). As expected, smaller
execution times are observed for smaller block sizes, since fewer transactions are processed in an
operation. Nevertheless, the high update frequency may result in higher overall costs. In fact,
the best update interval also depends on σ_min and will vary as the database grows, as we can
see in Figure 3.4. We can also observe that after a given number of transactions is added to the
database, the execution time of processing increments tends to stabilize (mostly for smaller block
sizes). This is due to the large proportion of retained itemsets, as can be seen in Figure 3.6,
which means that a large number of operations was performed just over d+, greatly reducing I/O
costs and explaining the constant execution time observed, since d+ has a fixed size. Figure 3.5
shows the speedup numbers of iZigZag. Note that the speedup is in relation to mining from
scratch. As expected, the speedup is inversely proportional to the block size, because the size
of the new data coming in is smaller. Also note that better speedups are achieved for greater
minimum supports. We observed that, for the databases used in this experiment, the proportion
of retained itemsets (itemsets that are computed by examining only d+ and Γ) is larger for greater
minimum support thresholds.
In the second experiment we compared the performance of iZigZag against ULI [26]⁴.
Figure 3.7 shows the results obtained using the WPortal database. As we can see, the
improvements range from 85% to 90% in general. Further, better runtime improvements are achieved by
larger incremental blocks of data. The improvements are due to the efficient search for F(σ_min, Δ)
employed by iZigZag, which avoids evaluating the entire negative border, and uses efficient
pruning techniques.
[Figure 3.3 comprises two panels plotting Execution Time (secs) against Transactions (x 1000) for block sizes 300K, 600K and 1200K, on WCup with ms = 0.001 and ms = 0.005.]
Figure 3.3: Total Execution Time.
[Figure 3.4 comprises two panels plotting Cumulative Time (secs) against Transactions (x 1000) for block sizes 300K, 600K and 1200K, on WCup with ms = 0.001 and ms = 0.005.]
Figure 3.4: Cumulative Execution Time.
[Figure 3.5 comprises two panels plotting Speedup against Block Size (%): WCup with minimum supports 0.005 and 0.001, and WPortal with minimum supports 0.00005 and 0.0001.]
[Figure 3.6 comprises six panels plotting the Cardinality of candidate, frequent, and retained itemsets against Transactions (x 1000), on WCup with ms = 0.001 and ms = 0.005, for |d+| = 300K, 600K and 1200K.]
Figure 3.6: Proportion of Candidate, Frequent, and Retained Itemsets.
[Figure 3.7 comprises three panels on WPortal with ms = 0.0001, for block sizes 300K, 600K and 1200K: Execution Time (secs) against Transactions (x 1000) for ULI and for ZigZag, and their Relative Execution Time.]
Figure 3.7: Performance Comparison: iZigZag vs. ULI.
Experiment 2. [Modifying the Search Space: Changing σ_min] In this set of
experiments we are interested in investigating (hand-in-hand) drill-down/drill-up interactions
based on parameter change operations, using the WCup database. We varied σ_min (from 0.020
to 0.001), but there are no data updates (i.e., d+ = ∅ and d- = ∅). We compare the iterative
mining operation against the base-line approach, where the entire database is mined from scratch.
In the first experiment, Γ is always composed by the results obtained from the last iteration (i.e., iZigZag uses F(0.020, Δ)
to find F(0.015, Δ), then F(0.015, Δ) to find F(0.010, Δ), then F(0.010, Δ) to find F(0.005, Δ),
and then F(0.005, Δ) to find F(0.001, Δ)). We call this approach hand-in-hand mining. The
graph on the left side shows the improvements obtained using this approach. The savings are
very significant in practice, but they depend on the differences between Γ and F(σ_min, Δ). For
instance, F(0.020, Δ) and F(0.015, Δ) are very similar, and in this iteration iZigZag is able
to save approximately 90% of the execution time, when compared to the base-line approach.
We observed that, for lower σ_min, the differences between Γ and F(σ_min, Δ) are more relevant.
However, using F(0.005, Δ) to find F(0.001, Δ), iZigZag is still able to save approximately
45% of the execution time.

In the second experiment, Γ has a fixed value (i.e., iZigZag uses F(0.020, Δ) to find all other
collections of frequent itemsets). As expected, this approach is not as effective as the
hand-in-hand mining approach, but it still provides improvements that range from 20% to 50%, as shown
by the graph on the right side.
[Figure 3.8 comprises two panels plotting Execution Time (secs) against Minimum Support on WCup: mining from scratch vs. iterative hand-in-hand mining (left), and mining from scratch vs. iterative mining from fixed collections (0.020, 0.015, 0.010, 0.005) (right).]
Figure 3.8: Changing σ_min: Iterative and Hand-in-Hand Operations.
Experiment 3. [Selective Updates] The third set of experiments is concerned with
evaluating the effectiveness of performing selective updates. Our evaluation is based on the gains
in accuracy and in the relative execution time. For each experiment we varied ε_max (0.70,
0.75, 0.80, 0.85, 0.90, 0.95, 1.00), and then we compared both F(σ_min, D) (the "old"
collection) and F′(σ_min, ε_max, Δ) (the approximate collection) to F(σ_min, Δ) (the actual collection).
The gain in accuracy is given by the difference between the accuracy of F(σ_min, D) and the
accuracy of F′(σ_min, ε_max, Δ), and the relative execution time is the elapsed time spent for finding
F′(σ_min, ε_max, Δ) divided by the elapsed time spent for finding F(σ_min, Δ). We employed 5
different block sizes (10,000, 20,000, 30,000, 40,000, and 50,000 transactions). This experiment