• Section 6.1 The Market-Basket Model
• Rajaraman, A.; Leskovec, J.; Ullman, J.D. Mining of Massive Datasets. Cambridge University Press (2019). ISBN: 9781107015357
Available at http://www.mmds.org
• Itemset Mining
• Python implementations described in:
Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Chapter 8, 2nd Edition, Cambridge University Press, March 2020. ISBN: 978-1108473989.
• Motivation – Supermarket shelf management – Market-basket model:
• Goal:
• Identify items that are bought together by sufficiently many customers
• Approach:
• Process the sales data collected with barcode scanners to find dependencies among items
• Example: a "classic" rule
• “If someone buys diapers and milk, they are likely to also buy beer”
• Model
• A large set of items
• A large set of baskets
• Each basket is a small subset of items
• Goal: discover association rules
• Input: baskets of items bought together
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
• Output: rules about people who bought {x, y, z}, e.g.
Rules discovered: {Milk} → {Coke}
• Model
• Items = products
• Baskets = products bought together by customers
• Real market baskets
• Stores keep data about what customers buy together
• Market basket analysis
• Indicates how typical customers navigate stores
• Helps stores position items in the shelves
• Suggests combined promotions
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
• Model
• Baskets = patients
• Items = drugs & side-effects
• Application:
• Used to detect combinations of drugs that result in side-effects
• Requires extension: absence of an item needs to be observed as well as presence
• Simplest question:
• Find itemsets that appear together “frequently” in baskets
• Support for an itemset I :
• Number of baskets containing all items in I
• (Often expressed as a fraction of the total number of baskets)
• Frequent itemset
• Given a support threshold s, an itemset I is frequent iff I appears in at least s baskets
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Support of {Beer, Bread} = 2
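The support count above can be sketched in a few lines of Python (an illustrative sketch, not code from the references; the `baskets` list mirrors the TID table):

```python
# Baskets from the TID table above, each as a set of items
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

print(support({"Beer", "Bread"}, baskets))  # → 2
```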
• Items = {milk, coke, pepsi, beer, juice}
• Support threshold s = 3 baskets
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}
• Frequent singletons: {m}, {c}, {b}, {j} (but {p} has support < 3)
• Frequent pairs: {m, b}, {b, c}, {c, j}
• All frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}, {c, j}
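With only five items and eight baskets, brute-force enumeration is feasible; a sketch (illustrative code, not from the references):

```python
from itertools import combinations

# The eight example baskets
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold
items = sorted(set().union(*baskets))

# Test every candidate itemset against every basket
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        cand = set(combo)
        if sum(cand <= b for b in baskets) >= s:
            frequent.append(frozenset(cand))
```

This exhaustive scan is exponential in the number of items, which is exactly why the A-Priori algorithm of Section 6.2 is needed for realistic data.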
• Association Rules:
• If-then rules about the contents of baskets
• {i1, i2,…,ik} → j means:
“ if a basket contains all of i1, i2,…,ik, then it is likely to contain j ”
• In practice there are far too many rules, so we want to find only the significant/interesting ones
• Confidence of an association rule
• The probability of item j, given I = {i1, i2,…,ik}
conf(I → j) = support(I ∪ {j}) / support(I)
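A direct computation of this definition (an illustrative sketch; the baskets repeat the running example):

```python
def support(itemset, baskets):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I ∪ {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

# The eight example baskets used throughout this section
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(confidence({"m", "b"}, "c", baskets))  # → 0.5
```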
• Not all high-confidence rules are interesting
• Example:
The rule X → milk may have high confidence for many itemsets X, because milk is just purchased very often (independent of X) and the confidence will be high
• Interest of an association rule I → j:
• Difference between its confidence and the fraction of baskets that contain j
• Interesting rules are those with a highly positive or highly negative interest value (e.g., |Interest| above 0.5)
Interest(I → j) = conf(I → j) − Pr[j]
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4= {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• Association rule: {m, b} →c
Confidence = 2/4 = 0.5, since support({m, b, c}) = 2 and support({m, b}) = 4
Pr[c] = 5/8, since item c appears in 5 of the 8 baskets
Interest = |0.5 − 5/8| = 1/8
• Rule is not very interesting!
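The numbers above can be reproduced directly (illustrative sketch):

```python
# The eight example baskets
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b)

conf = support({"m", "b", "c"}) / support({"m", "b"})  # 2/4 = 0.5
pr_c = support({"c"}) / len(baskets)                   # 5/8
interest = conf - pr_c                                 # -1/8: not very interesting
```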
• Problem:
• Find all association rules with
support ≥ s and confidence ≥ c
• (Support of an association rule is the support of the set of items on the left side)
• Hard part:
• Finding the frequent itemsets
• If {i1, i2, …, ik} → j has high support and confidence,
then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”
• Step 1:
• Find all frequent itemsets I
• (next presentations)
• Step 2: Rule generation
• For every subset A of I, generate a rule A → I - A
• Since I is frequent, A is also frequent
• Single pass to compute the rule confidence
• Output the rules above the confidence threshold
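Step 2 can be sketched as follows, assuming the frequent itemsets and their support counts have already been found in Step 1 (the names `frequent_counts` and `min_conf` are illustrative, not from the references):

```python
from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    """frequent_counts: dict mapping frozenset -> support count,
    containing every frequent itemset. Because every subset of a
    frequent itemset is frequent, frequent_counts[A] always exists."""
    rules = []
    for I, sup_I in frequent_counts.items():
        if len(I) < 2:
            continue
        # Every proper nonempty subset A of I yields a candidate rule
        for k in range(1, len(I)):
            for A in combinations(sorted(I), k):
                A = frozenset(A)
                conf = sup_I / frequent_counts[A]  # conf(A -> I - A)
                if conf >= min_conf:
                    rules.append((A, I - A, conf))
    return rules
```

A single pass over the subsets of each frequent itemset suffices, since all the needed support counts are already stored.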
B1 = {m, c, b}        B2 = {m, p, j}
B3 = {m, c, b, n}     B4 = {c, j}
B5 = {m, p, b}        B6 = {m, c, b, j}
B7 = {c, b, j}        B8 = {b, c}
• Support threshold s = 3, confidence threshold c = 0.75
1) Frequent itemsets: {b,m}, {b,c}, {c,m}, {c,j}, {m,c,b}
2) Generate rules:
b→m: conf = 4/6
b→c: conf = 5/6
b,c→m: conf = 3/5
m→b: conf = 4/5
…
b,m→c: conf = 3/4
b→c,m: conf = 3/6
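The listed confidences can be checked directly from the baskets (illustrative sketch; `Fraction` keeps the values exact):

```python
from fractions import Fraction

# Baskets from the example (note B3 contains the extra item n)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "c", "b", "n"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def sup(I):
    return sum(1 for b in baskets if I <= b)

def conf(A, B):
    """Confidence of the rule A -> B as an exact fraction."""
    return Fraction(sup(A | B), sup(A))

print(conf({"b"}, {"m"}))       # → 2/3 (= 4/6)
print(conf({"b", "m"}, {"c"}))  # → 3/4
```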
• To reduce the number of rules, filter the itemsets:
• Maximal frequent itemsets
• No immediate superset is frequent
or
• Closed itemsets
• No immediate superset has the same count (> 0)
• Requires storing not only frequency information, but exact counts
Itemset   Support   Maximal (s=3)   Closed
A         4         No              No
B         5         No              Yes
C         3         No              No
AB        4         Yes             Yes
AC        2         No              No
BC        3         Yes             Yes
ABC       2         No              Yes
• C is frequent, but its superset BC is also frequent, so C is not maximal; BC also has the same count (3), so C is not closed
• BC is frequent and its only superset, ABC, is not frequent, so BC is maximal; ABC has a smaller count, so BC is also closed
Maximal frequent itemset = no immediate superset is frequent
Closed itemset = no immediate superset has the same count (> 0)
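Both filters can be sketched against the support counts of the table above (illustrative code; storing counts in a dictionary is an assumption, not a prescription from the references):

```python
def classify(counts, s):
    """counts: dict mapping frozenset -> support (> 0) for all itemsets.
    Returns {itemset: (is_maximal, is_closed)} using the definitions above:
    maximal = frequent with no frequent immediate superset;
    closed  = no immediate superset with the same count."""
    items = set().union(*counts)
    result = {}
    for I, c in counts.items():
        supersets = [I | {x} for x in items - I]
        is_maximal = c >= s and all(counts.get(S, 0) < s for S in supersets)
        is_closed = all(counts.get(S, 0) != c for S in supersets)
        result[I] = (is_maximal, is_closed)
    return result

# Support counts from the table above
counts = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}
```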
• 6.2 Market Baskets and the A-Priori Algorithm
• Problem: How to find frequent itemsets?
• ... and association rules with high support and confidence
• Topics
• How data is stored and manipulated when searching for frequent itemsets
• The “A-Priori” algorithm