• Section 6.1 The Market-Basket Model

Academic year: 2022


(1)
(2)

• Section 6.1 The Market-Basket Model

• Rajaraman, A.; Leskovec, J.; Ullman, J.D. Mining of Massive Datasets. Cambridge University Press (2019). ISBN: 9781107015357

Available at http://www.mmds.org

• Itemset Mining

• Python Implementations described in:

Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Chapter 8, 2nd Edition, Cambridge University Press, March 2020. ISBN: 978-1108473989.

(3)

• Motivation – Supermarket shelf management – Market-basket model:

• Goal:

• Identify items that are bought together by sufficiently many customers

• Approach:

• Process the sales data collected with barcode scanners to find dependencies among items

• Example: a "classic" rule

• "If someone buys diaper and milk, then he/she is likely to buy beer"

(4)

• Model

• A large set of items

• A large set of baskets

• Each basket is a small subset of items

• Goal: discover association rules

• People who bought {x, y, z} tend to buy {v, w}

Input:

TID  Items
 1   Bread, Coke, Milk
 2   Beer, Bread
 3   Beer, Coke, Diaper, Milk
 4   Beer, Bread, Diaper, Milk
 5   Coke, Diaper, Milk

Output:

Rules Discovered:
{Milk} --> {Coke}

(5)

• Model

• Items = products

• Baskets = products bought together by customers

• Real market baskets

• Stores keep data about what customers buy together

• Market basket analysis

• Indicates how typical customers navigate stores

• Helps stores position items on the shelves

• Suggests combined promotions

TID  Items
 1   Bread, Coke, Milk
 2   Beer, Bread
 3   Beer, Coke, Diaper, Milk
 4   Beer, Bread, Diaper, Milk
 5   Coke, Diaper, Milk

(6)

• Model

• Baskets = patients

• Items = drugs & side-effects

• Application:

• Used to detect combinations of drugs that result in side-effects

• Requires extension: absence of an item needs to be observed as well as presence

(7)

• Simplest question:

• Find itemsets that appear together “frequently” in baskets

• Support for an itemset I:

• Number of baskets containing all items in I

• (Often expressed as a fraction of the total number of baskets)

• Frequent itemset

• Given a support threshold s, an itemset I is frequent iff I appears in at least s baskets

TID  Items
 1   Bread, Coke, Milk
 2   Beer, Bread
 3   Beer, Coke, Diaper, Milk
 4   Beer, Bread, Diaper, Milk
 5   Coke, Diaper, Milk

Support of {Beer, Bread} = 2
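The definition translates directly into a few lines of Python. This is only a sketch over the slide's toy TID table; `support` is a literal reading of the definition (number of baskets containing all items of I):

```python
# Baskets from the slide's TID table (toy data).
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

print(support({"Beer", "Bread"}, baskets))                 # 2, as on the slide
print(support({"Beer", "Bread"}, baskets) / len(baskets))  # as a fraction: 0.4
```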

(8)

• Items = {milk, coke, pepsi, beer, juice}

• Support threshold = 3 baskets

B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}

(9)

• Items = {milk, coke, pepsi, beer, juice}

• Support threshold = 3 baskets

B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}

• Frequent itemsets: {m}, {c}, {b}, {j} (but {p} has support <3)

(10)

• Items = {milk, coke, pepsi, beer, juice}

• Support threshold = 3 baskets

B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}

• Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
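The answer can be checked by brute-force enumeration. The sketch below tries every candidate itemset against the eight baskets, which is exactly what the A-Priori algorithm of Section 6.2 is designed to avoid at scale:

```python
from itertools import combinations

# Brute-force frequent itemsets over the slide's baskets, s = 3.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3
items = sorted(set().union(*baskets))

frequent = {}
for k in range(1, len(items) + 1):
    frequent_k = {}
    for combo in combinations(items, k):
        count = sum(1 for basket in baskets if set(combo) <= basket)
        if count >= s:
            frequent_k[frozenset(combo)] = count
    if not frequent_k:
        break  # monotonicity: no frequent k-itemsets => none larger
    frequent.update(frequent_k)

for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("{" + ", ".join(sorted(itemset)) + "}", count)
```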

(11)

• Association Rules:

If-then rules about the contents of baskets

• {i1, i2, …, ik} → j means:

"If a basket contains all of i1, i2, …, ik, then it is likely to contain j"

• In practice there are too many rules => find significant/interesting ones

• Confidence of an association rule

The probability of item j, given I = {i1, i2,…,ik}

conf(I → j) = support(I ∪ j) / support(I)
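As code, confidence is just a ratio of two support counts (a sketch over the running milk/coke/pepsi/beer/juice baskets, not library code):

```python
# conf(I -> j) = support(I ∪ {j}) / support(I)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j):
    return support(I | {j}) / support(I)

print(confidence({"m", "b"}, "c"))  # 2/4 = 0.5
```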

(12)

• Not all high-confidence rules are interesting

• Example:

The rule X → milk may have high confidence for many itemsets X, because milk is just purchased very often (independent of X) and the confidence will be high

• Interest of an association rule I → j:

• Difference between its confidence and the fraction of baskets that contain j

• Interesting rules are those with high positive or negative interest values (usually above 0.5)

Interest(I → j) = conf(I → j) − Pr[j]

(13)

B1 = {m, c, b}     B2 = {m, p, j}
B3 = {m, b}        B4 = {c, j}
B5 = {m, p, b}     B6 = {m, c, b, j}
B7 = {c, b, j}     B8 = {b, c}

• Association rule: {m, b} → c

Confidence = 2/4 = 0.5, since support({m, b, c}) = 2 and support({m, b}) = 4

Pr[c] = 5/8, since item c appears in 5 of the 8 baskets

Interest = |0.5 − 5/8| = 1/8

• Rule is not very interesting!
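The worked example can be reproduced in a few lines (a sketch over the same eight baskets):

```python
# Rule {m, b} -> c over the slide's 8 baskets.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    return sum(1 for basket in baskets if itemset <= basket)

I, j = {"m", "b"}, "c"
conf = support(I | {j}) / support(I)   # 2/4 = 0.5
pr_j = support({j}) / len(baskets)     # 5/8
interest = conf - pr_j                 # -1/8: close to zero, not interesting
print(conf, pr_j, interest)
```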

(14)

• Problem:

• Find all association rules with

support ≥ s and confidence ≥ c

• (Support of an association rule is the support of the set of items on the left side)

• Hard part:

• Finding the frequent itemsets

• If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be "frequent"

(15)

• Step 1:

• Find all frequent itemsets I

• (next presentations)

• Step 2: Rule generation

• For every subset A of I, generate a rule A → I - A

Since I is frequent, A is also frequent

Single pass to compute the rule confidence

• Output the rules above the confidence threshold
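Step 2 can be sketched as follows. The baskets and frequent itemsets are the ones from the worked example that follows; for each frequent itemset I and each proper non-empty subset A, the rule A → I − A is emitted when conf = support(I) / support(A) clears the threshold:

```python
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "c", "b", "n"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
frequent = [frozenset(x) for x in
            ({"b", "m"}, {"b", "c"}, {"c", "m"}, {"c", "j"}, {"m", "c", "b"})]
conf_threshold = 0.75

def support(itemset):
    return sum(1 for basket in baskets if itemset <= basket)

rules = []
for I in frequent:
    for r in range(1, len(I)):
        for A in map(frozenset, combinations(I, r)):
            conf = support(I) / support(A)  # A is frequent because I is
            if conf >= conf_threshold:
                rules.append((A, I - A, conf))

for A, B, conf in rules:
    print(sorted(A), "->", sorted(B), round(conf, 2))
```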

(16)

B1 = {m, c, b}        B2 = {m, p, j}
B3 = {m, c, b, n}     B4 = {c, j}
B5 = {m, p, b}        B6 = {m, c, b, j}
B7 = {c, b, j}        B8 = {b, c}

• Support threshold s = 3, confidence c = 0.75

1) Frequent itemsets: {b,m} {b,c} {c,m} {c,j} {m,c,b}

2) Generate rules:

(17)

B1 = {m, c, b}        B2 = {m, p, j}
B3 = {m, c, b, n}     B4 = {c, j}
B5 = {m, p, b}        B6 = {m, c, b, j}
B7 = {c, b, j}        B8 = {b, c}

• Support threshold s = 3, confidence c = 0.75

1) Frequent itemsets: {b,m} {b,c} {c,m} {c,j} {m,c,b}

2) Generate rules:

b → m: conf = 4/6
b → c: conf = 5/6
b,c → m: conf = 3/5
m → b: conf = 4/5
…
b,m → c: conf = 3/4
b → c,m: conf = 3/6

(18)

• To reduce the number of rules, filter the itemsets:

• Maximal frequent itemsets

No immediate superset is frequent

or

• Closed itemsets

No immediate superset has the same count (> 0)

Requires storing not only frequency information, but exact counts

(19)

Itemset  Support  Maximal (s=3)  Closed
A        4        No             No
B        5        No             Yes
C        3        No             No
AB       4        Yes            Yes
AC       2        No             No
BC       3        Yes            Yes
ABC      2        No             Yes

• C is frequent, but its superset BC is also frequent (so C is not maximal); BC also has the same count as C (so C is not closed).
• BC is frequent, and its only superset, ABC, is not frequent (so BC is maximal); ABC has a smaller count (so BC is closed).

Maximal frequent itemsets = no immediate superset is frequent
Closed itemsets = no immediate superset has the same count (> 0)
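Both properties are literal tests over exact counts. The sketch below classifies every itemset of the slide's table (items A, B, C); `counts` holds the supports shown there:

```python
# Supports from the slide's table.
counts = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}
s = 3

def immediate_supersets(I):
    return [J for J in counts if len(J) == len(I) + 1 and I < J]

results = {}
for I in counts:
    is_frequent = counts[I] >= s
    # Maximal: frequent, and no immediate superset is frequent.
    maximal = is_frequent and all(counts[J] < s for J in immediate_supersets(I))
    # Closed: no immediate superset has the same count.
    closed = all(counts[J] != counts[I] for J in immediate_supersets(I))
    results["".join(sorted(I))] = (maximal, closed)

for name in ("A", "B", "C", "AB", "AC", "BC", "ABC"):
    print(name, results[name])
```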

(20)

• 6.2 Market Baskets and the A-Priori Algorithm

• Problem: How to find frequent itemsets?

... and association rules with high support and confidence

• Topics

How data is stored and manipulated when searching for frequent itemsets

The “A-Priori” algorithm
