Multiple level association rules in a distribution company

Rita Isabel Mota Carrinho

Dissertation

Master’s in Data Analytics

Supervised by

Professor João Gama

Acknowledgements

Going through life is something no one can do alone.

Writing this paper took a lot out of me, but it also took a lot of help from other people. I would like to extend my deepest gratitude to my supervisor, Professor João Gama, who supported me and put up with my constant need for attention and guidance without ever giving me a cross word, always patient and understanding.

Second, I would like to thank my parents, who never let me give up and always believed in me even when I didn’t believe in myself.

To my father, I want to say thank you for your support and for cheering for me; to my mother, I want to say thank you for keeping me grounded, being patient with me and always pushing me to do better. Nothing in my life would ever be possible or half as meaningful without you both.

I would also like to thank my brother and sister for showing me so much love and cheering me up when I felt down and thought I couldn’t do it. I could never do wrong by them.

To my boyfriend Tiago, I just want to say I’m sorry for putting you second and taking the stress out on you, and thank you for hearing my rants without ever complaining and still sticking by my side through it all.

There are a lot of people I should thank and name in this section, but it would be too long. As it is, I just want to thank all my friends for their love and patience.

Finally, I would like to extend my thanks to all the people at Amorim Revestimentos S.A., the Marketing Department in particular, and a very special thank you to Sandra Medas for always being my rock through everything, sticking by my side and teaching me so much that I will never be able to repay her.

Abstract

Knowledge Discovery in Databases (KDD) has become a leading field of study. In this age of technology, data is created, stored and shared at an alarming rate, and the information that can be retrieved from it is so valuable that new methods, techniques and algorithms are constantly being created and tested to deal with this amount of data. Be it cleaning, structuring, storing or mining it for information, there is a great need for this set of tools and for people who can perform the task accurately and successfully.

Frequent pattern mining is one of the most prolific areas of KDD, and it is the theme of this dissertation.

A Portuguese distribution company in the flooring and other coverings business has provided data on its transactions to be mined for associations between its products.

This company makes a great number of items available to its customers, and most of them are similar to each other except for a few characteristics, so mining for these associations at the usual item level would be time consuming and most likely fruitless. The author therefore proposes a solution: to build a concept hierarchy and mine for associations between the different abstraction levels, making the frequent pattern mining task simpler and more likely to yield actionable conclusions.

Keywords: Frequent Pattern Mining, Association Rules, Hierarchical Association Rules, Market Basket Analysis, Concept Hierarchies

List of tables

Table 1: Example of encoding used for a specific item in the database.
Table 2: Excerpt from transactional database at the basic item level.
Table 3: Excerpt from transactional database at the basic item level after removing duplicate items per transaction.
Table 4: List of commands used to build and analyse derived association rules.
Table 5: Number of itemsets for each hierarchical level for different support values.
Table 6: Number of association rules derived for each hierarchical level for different support and confidence values.
Table 7: Top 15 closed frequent itemsets at the Category level of the hierarchy for a minimum support of 0,001.
Table 8: Top 15 closed frequent itemsets at the Item Group level of the hierarchy for a minimum support of 0,001.
Table 9: Top 15 closed frequent itemsets at the Basic Item level of the hierarchy for a minimum support of 0,001.
Table 10: Top 15 derived single level association rules at the Category level of the hierarchy for a minimum support of 0,001 and minimum confidence of 0,1.
Table 11: Top 15 derived single level association rules at the Item Group level of the hierarchy for a minimum support of 0,001 and minimum confidence of 0,1.
Table 12: Top 15 derived single level association rules at the Basic Item level of the hierarchy for a minimum support of 0,001 and minimum confidence of 0,1.
Table 13: Top 15 rules sorted by lift for a minimum support of 0,001 and minimum confidence of 0,1 that provide the best information.
Table 14: Number of frequent itemsets excluding hierarchical level for different minimum support values.
Table 15: Number of derived hierarchical rules for various minimum support and confidence values, excluding rules with 100% confidence.
Table 16: Top 15 association rules excluding hierarchical level for a minimum support of 0,01 and minimum confidence of 0,5.
Table 17: Top 10 association rules excluding hierarchical level for a minimum support of 0,005 and minimum confidence of 0,05, rules where the consequent is a hierarchical descendant of the antecedent notwithstanding.

List of Figures

Figure 1: FP-growth algorithm for mining frequent patterns.
Figure 2: Table with example of transactional database encoding from Han et al. 1999.
Figure 3: Scheme of the hierarchy architecture and encoding.
Figure 4: Command used to scan database into R environment.
Figure 5: Print screen of the command used in R to find frequent itemsets through the Apriori algorithm in the Category level of the hierarchy with support=0.1%.
Figure 6: Command used to see redundant rules in R.
Figure 7: Command used to produce a graphical representation of the 50 most common itemsets present in association rules.
Figure 8: Grouped matrix-based visualization of the 50 most common itemsets from which association rules could be derived.

1. Introduction

This is an era in which people are constantly subjected to a fast-paced stream of information. When this information is not stored, kept and treated, every second brings another lost chance to generate valuable knowledge that could be useful in a great number of fields of daily life.

Largely due to this, Knowledge Discovery in Databases (KDD) has become an interesting and fundamental process for progress in business, science, medicine, engineering and marketing, among a great number of fields that benefit from the information that can be extracted from databases which, to the untrained eye, seem useless.

Given the overwhelming need for techniques to uncover this information, the evolution of KDD during the last three decades has been fast and vast. The number of algorithms, techniques and mechanisms that have been developed is extensive, and research in this area is in perpetual movement.

Some people still confuse KDD with Data Mining and use the two terms interchangeably. However, KDD is a process that can be divided into different steps, and one of these steps, the most important one according to most scientists, is precisely Data Mining.

Data Mining refers to the step in the KDD process that consists of applying data analysis and discovery algorithms to find patterns in databases (Fayyad et al. 1996).

Specifically, Data Mining puts at one’s disposal precious information that allows one to make guided, informed decisions when solving a problem. In Marketing, the applications of these techniques are immense, allowing for the creation of targeted marketing campaigns and promotions, recommendation systems and other applications that eventually translate into an increase in sales and, consequently, revenue.

One of the biggest success cases of Data Mining is association rules and the knowledge they provide about the consumer habits and purchase patterns of an establishment’s customers.

Knowing which items are most frequently acquired together by customers makes it possible, given the close relationship with customers’ purchasing habits, to create conditions that lead to an increase in transactions involving certain items.

This study is based on this exact concept and intends to expand on it by searching for association rules across a broader spectrum.

A set of items usually has an inherent hierarchy associated with it. This means that an item is not just a specific product; it is an item with specific characteristics within a certain category of products: a strawberry flavored yogurt is first and foremost a yogurt and, at a broader level, a dairy product. Connections between the different levels of such a hierarchy can be found by applying association rules at different abstraction levels of the product, yielding rules that can be more useful because they are less redundant and less random. This process is called Hierarchical Mining for Association Rules, and it is the subject of study of this document.

To contextualize this concept, a literature review of articles published on this subject is presented, along with the analysis of a dataset constructed by the author from data provided by a Portuguese company.

For the experimental part of this paper, the software used for data treatment, management and analysis is R, version 3.3.1, coupled with the arules and arulesViz packages and the RStudio tool.

2. Motivation

The main reason behind this study is a great personal interest not just in the subject but also in its applications, and in the fact that one can find them in a company environment.

In the Data Analytics Master’s degree, a vast set of concepts, techniques and mechanisms for treating, mining and analyzing data were introduced, and a world of possibilities was opened in any area of expertise where the information that can be extracted is useful.

Given the acquired knowledge, there is now a great interest in finding a practical application for it. With this in mind, the subject chosen is one of the most prolific in Data Mining: association rules.

Association rules allow one to find associations and relationships between the products acquired by the customers of a certain establishment and, in that way, to better understand how these customers behave and what they need. Based on this information, the retailer or vendor can promote changes to marketing campaigns, offers, item restocking, store layout and product recommendations to the customer.

This understanding also allows us to find solutions to problems like the one addressed in this paper, which was carried out in a Portuguese distribution company where the number of available items is so large that it confuses even the distributor itself. It is necessary to better understand what makes customers buy different items simultaneously and what attracts the customer to a combination of products.

This paper allows the student to fulfil the wish to find a solution to a real situation in a corporate environment and to deepen the acquired academic knowledge.

3. Literature Review

3.1 Data Mining

Knowledge Discovery in Databases is commonly called Data Mining, and it is defined as a process composed of a set of techniques and algorithms meant to identify valid and useful patterns that can be easily understood and applied by the user. The term Data Mining is misused, though, when it refers to the whole Knowledge Discovery in Databases process (Coenen, 2004).

Data Mining is one of the steps of the Knowledge Discovery in Databases process. The other steps are data selection, data preprocessing, data transformation and, after data mining, the interpretation of the patterns found in order to create knowledge. Data Mining is thus just the step in which the available information is analyzed and visualized to find a solution to the problem at hand. Also, with the advancement of technology, computational power has grown exponentially, which makes it possible to treat datasets of considerable size without losing efficiency (processing time) or precision.

The techniques used in Data Mining can be divided into three main groups: classification, pattern discovery and recognition, and clustering. This paper fits into the second group.

Throughout the evolution of KDD (Knowledge Discovery in Databases), the detection of patterns has always had a very relevant role. The techniques used for it allow one to find, among others, associations between the products purchased by customers in a transactional database, trends, and frequent subgraphs.

The most used technique to find these patterns is association rules, a concept introduced by Agrawal et al. in 1993.

The search for association rules is a pattern recognition process that looks for connections between items that are acquired simultaneously by customers.

This is the subject addressed throughout this paper.

3.2 Single Level Association Rules Mining

The concept of association rules was first introduced by Agrawal et al. in 1993. It was defined as a technique to extract useful information on the purchase patterns and habits of the customers of a certain establishment from its transactional database. This was the first time that the concepts of minimum support and minimum confidence were formulated, along with techniques for generating frequent itemsets based on minimum support.

Frequent itemsets are defined as combinations of items belonging to the same transaction that occur in a proportion of transactions greater than or equal to the established minimum support.

For each itemset considered frequent, association rules are generated between any one of the items in the itemset and all the remaining elements of the same set.

While association rules are generated, the support value is calculated and stored both for the frequent itemset and for the antecedent of the rule under consideration. The ratio between these two values is the confidence; if it is greater than or equal to the stipulated minimum confidence, the rule is accepted as a good association rule.
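As an illustration of the two measures described above, the following Python sketch computes support and confidence over a handful of transactions. The transactions and item names are invented for the example and are not taken from the company's data, and the function names are hypothetical; the dissertation itself works in R with the arules package.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"floor", "varnish", "underlay"},
    {"floor", "varnish"},
    {"floor", "underlay"},
    {"varnish", "cleaner"},
    {"floor", "varnish", "cleaner"},
]


def occurrences(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Proportion of transactions that contain the itemset."""
    return occurrences(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return occurrences(both, transactions) / occurrences(antecedent, transactions)


rule_support = support({"floor", "varnish"}, transactions)
rule_confidence = confidence({"floor"}, {"varnish"}, transactions)
print(rule_support, rule_confidence)
```

Here 3 of the 5 transactions contain both items, so the support of the rule floor → varnish is 0.6, and 3 of the 4 transactions with a floor also contain a varnish, so its confidence is 0.75. The rule would be accepted with, for instance, a minimum support of 0.4 and a minimum confidence of 0.7.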

The problem to be solved is the generation of frequent itemsets that satisfy a stipulated minimum support. As a solution, Agrawal et al. propose an algorithm that makes several scans of the database and, in each scan, builds a group of itemsets that are grown by adding extensions, where an extension of an itemset is a set that has no items in common with it.

This group of itemsets is called the frontier set. Still, the fact that the algorithm needs to perform a great number of scans of the database to build these itemsets makes it very expensive, not only in time but also in terms of the memory required to store the itemsets built in each iteration. For this reason, Agrawal et al. also propose a way to conserve memory, which not only helps with the memory requirements but also allows for the pruning of redundant rules.

With these improvements in memory management and the pruning of redundant rules, the results of applying the model to a transactional database prove to be satisfactory for each scan that is performed. As the number of scans increases the results improve but, as expected, the execution time increases as well.

In Han et al. (1999), the minimum support of a rule is defined from the minimum supports of the items that are part of that rule.

Given a rule R, its minimum support is the lowest of the minimum supports of the items that make up the rule.

This conceptualization of minimum support allows a higher minimum support to be required for rules composed only of frequent items, and a lower minimum support for rules that contain less frequent items.
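The definition above can be sketched in a few lines of Python. The per-item minimum supports below are invented values and `rule_min_support` is a hypothetical helper name, so this is only an illustration of the idea, not the cited authors' implementation.

```python
# Hypothetical per-item minimum supports (illustrative values only).
item_min_support = {"floor": 0.05, "varnish": 0.02, "cleaner": 0.005}


def rule_min_support(rule_items, item_min_support):
    """Lowest minimum support among the items that make up the rule."""
    return min(item_min_support[item] for item in rule_items)


# A rule made only of frequent items keeps a comparatively high threshold.
common_rule = rule_min_support({"floor", "varnish"}, item_min_support)

# A rule that involves a rare item inherits that item's lower threshold.
rare_rule = rule_min_support({"floor", "cleaner"}, item_min_support)

print(common_rule, rare_rule)
```

With these values the rule on floor and varnish must reach a support of 0.02, while the rule that involves the rarely sold cleaner only needs 0.005.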

Support is a downward closed measure, which means that every subset of an itemset considered frequent according to the established minimum support is itself a frequent itemset. This property is extremely useful in the first step of generating association rules, which is generating frequent itemsets. The algorithm proposed by Han et al. in 1999 takes this property into account when looking for frequent itemsets, whereas most algorithms used until then did not consider it, which led to a considerable number of redundant rules.

This algorithm carries out the search for frequent itemsets level by level. In this respect it is similar to the best-known algorithm for finding association rules, the Apriori algorithm. In fact, if the same minimum support value is kept throughout the search, the algorithm reduces to Apriori. The difference between the two methods is that, to assure the downward closure property, at each level the new frequent itemsets are generated by building on the frequent itemsets found at the previous level.

In the case study presented by Svetina et al. in 2005, a transactional database from a company is used to find association rules and show some of their most lucrative applications.

Given the characteristics of the company in question, this article is the one that most resembles the present study. The company works in renovations, with many different segments including retail and distributors all over the globe.

Some of the proposed applications of association rules are learning and understanding consumers’ habits and making informed decisions about product pricing, store placement and profit margins. They also allow us to uncover which products, though not related to each other, may have a stronger sales performance and which items present a similar performance.

To conduct Market Basket Analysis within this company, the authors divide the process into two stages: the first is finding all transactions that contain a certain item to be analyzed, and the second is finding all relevant transactions along with information such as price and quantity acquired.

Finally, the authors offer association-rule-based solutions to a number of problems, such as the creation of marketing campaigns and carefully chosen offers and sales, for example ones involving two items that are frequently acquired together, where one item has a high profit margin and the other a low profit margin, and the price of the latter can be reduced if it is acquired with the former.

One of the most popular algorithms for finding association rules, the Apriori algorithm, was proposed by Agrawal et al. in 1994.

This algorithm first generates the frequent 1-itemsets by performing one scan of the database and applying the established minimum support. In the following scans, it finds larger frequent itemsets by combining the previously found ones with the remaining items, and the support value of each set found is also calculated and stored.

The search ends when all the frequent itemsets have been found; the algorithm then proceeds to derive association rules.

After all frequent itemsets have been found and all association rules have been derived, the rules are evaluated according to the predefined minimum confidence value and to lift.
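The level-by-level procedure described above can be sketched in Python as follows. This is a simplified illustration written for this text, not the original Apriori code nor the arules implementation; the transactions, thresholds and function names are assumptions made for the example. Lift is computed here as the confidence of the rule divided by the support of its consequent.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"floor", "varnish", "underlay"},
    {"floor", "varnish"},
    {"floor", "underlay"},
    {"varnish", "cleaner"},
    {"floor", "varnish", "cleaner"},
]


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset."""
    n = len(transactions)

    def frequent_only(candidates):
        # One scan of the database: count each candidate, keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent_only({frozenset([i]) for i in items})
    frequent = dict(level)
    size = 2
    while level:
        # Build candidates of the next size from the previous level and drop
        # any candidate with an infrequent subset (downward closure).
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == size
            and all(frozenset(s) in level for s in combinations(a | b, size - 1))
        }
        level = frequent_only(candidates)
        frequent.update(level)
        size += 1
    return frequent


def derive_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence, lift) for accepted rules."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    lift = conf / frequent[consequent]
                    rules.append((antecedent, consequent, conf, lift))
    return rules


frequent = apriori(transactions, min_support=0.4)
rules = derive_rules(frequent, min_confidence=0.7)
for antecedent, consequent, conf, lift in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 2), round(lift, 2))
```

On this toy data the search stops at itemsets of length 2, because no candidate of length 3 has all of its pairs frequent, which is exactly the pruning that the downward closure of support makes possible.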

Even though most association rule mining algorithms rely on generating candidate itemsets and testing them in order to derive rules, some algorithms do not.

In 2000, Han et al. developed the FP-growth algorithm. Apart from Apriori, it is one of the most popular algorithms for association rule mining.

The FP-growth algorithm starts by scanning the database once to find the frequent 1-itemsets (itemsets of length one) and compute their support. It then creates a tree-like structure, the FP-tree, in which only these frequent items have nodes, arranged so that the more frequent an item is, the more likely it is to share nodes with other items. After the FP-tree is built, the method shown in Figure 1 is used to find the frequent itemsets and derive association rules.

Overall, this algorithm shows a few advantages over candidate-generating algorithms.

It uses a compact structure to store the data and avoids multiple scans of the original database, which can be unproductive when dealing with large databases. The pattern growth method it applies is also much less costly in memory and execution time, since it does not generate candidates and then test them, as the Apriori algorithm does.
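To make the compact structure more concrete, the sketch below builds an FP-tree in Python for a few invented transactions. It covers only the construction of the tree; the pattern growth step that mines the tree, as well as the header table and node links of the full algorithm, are left out. Class and function names are hypothetical, and ties in frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"floor", "varnish", "underlay"},
    {"floor", "varnish"},
    {"floor", "underlay"},
    {"varnish", "cleaner"},
    {"floor", "varnish", "cleaner"},
]


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second scan: insert each transaction with its frequent items sorted
    # from most to least frequent, so that common prefixes share nodes.
    for t in transactions:
        ordered = sorted(
            (item for item in t if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


def count_nodes(node):
    """Number of item nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root, frequent_items = build_fp_tree(transactions, min_count=2)
print(count_nodes(root), root.children["floor"].count)
```

In this example the 12 item occurrences in the transactions are stored in 7 tree nodes, because the four transactions that contain a floor share the same first node.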

Though most single level association rule mining algorithms are support driven, others, like MinHash (Cohen et al., 2001), show that for single level association rule mining this is not necessarily the best approach. Support-driven algorithms assume that itemsets with extremely low support are not significant and cannot produce useful rules. This may be true for market basket analysis, but in other applications, such as fraud detection and internet security, these less frequent itemsets may carry very important knowledge.
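The text above does not describe how MinHash works internally. As a hedged illustration of the general min-hashing idea the name refers to, the Python sketch below estimates how similar the transaction sets of two items are (their Jaccard similarity) from short signatures, without applying any support threshold. It is not the specific algorithm of Cohen et al.; the data, the number of permutations, the seed and the function names are all assumptions made for the example.

```python
import random

# For each item, the indices of the transactions that contain it
# (hypothetical data, for illustration only).
columns = {
    "floor": {0, 1, 2, 4},
    "varnish": {0, 1, 3, 4},
    "underlay": {0, 2},
    "cleaner": {3, 4},
}
NUM_TRANSACTIONS = 5


def make_permutations(num_rows, num_hashes, seed=0):
    """Random permutations of the transaction indices, shared by all items."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_hashes):
        permutation = list(range(num_rows))
        rng.shuffle(permutation)
        permutations.append(permutation)
    return permutations


def signature(rows, permutations):
    """For each permutation, the smallest permuted index among the item's rows."""
    return [min(p[r] for r in rows) for p in permutations]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two items agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(rows_a, rows_b):
    """Exact similarity, for comparison with the estimate."""
    return len(rows_a & rows_b) / len(rows_a | rows_b)


permutations = make_permutations(NUM_TRANSACTIONS, num_hashes=200)
signatures = {item: signature(rows, permutations) for item, rows in columns.items()}

print(jaccard(columns["floor"], columns["varnish"]))
print(estimated_similarity(signatures["floor"], signatures["varnish"]))
```

The estimate approaches the exact value as more permutations are used. Because the signatures have a fixed length regardless of how many transactions there are, pairs of rare but strongly associated items can be compared cheaply, which is the kind of pattern a minimum support threshold would discard.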

Lai et al., in 2001, wrote an article comparing a support-driven algorithm to one of the confidence-driven algorithms developed by Cohen et al., the MinHash algorithm.

As the support-then-confidence algorithm to compare MinHash against, they chose the DLG algorithm.

DLG is a Boolean algorithm introduced by Yen et al. in 1997, and it works in two stages:

• First stage – It scans the database, creates the set of frequent itemsets of length 1, and creates a bit array for each of them. Each bit in the bit array corresponds to a transaction in the dataset; if the transaction contains the 1-itemset (itemset of length 1), the bit is set to 1.

• Second stage – The bit arrays of the 1-itemsets generated in the first stage are used to compute the support of the subsequent k-itemsets (itemsets of length k) by performing an inner join of the previously built bit arrays.
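The two stages can be sketched in Python using one integer per item as its bit array. This is only an illustration of the bit-array idea: the inner join is interpreted here as a bitwise AND, which is an assumption, the filtering of infrequent 1-itemsets is omitted for brevity, and the data and function names are invented.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"floor", "varnish", "underlay"},
    {"floor", "varnish"},
    {"floor", "underlay"},
    {"varnish", "cleaner"},
    {"floor", "varnish", "cleaner"},
]


def build_bit_arrays(transactions):
    """First stage: one bit array per item, bit i set if transaction i has it."""
    arrays = {}
    for position, transaction in enumerate(transactions):
        for item in transaction:
            arrays[item] = arrays.get(item, 0) | (1 << position)
    return arrays


def support_count(itemset, arrays):
    """Second stage: AND the bit arrays of the items and count the set bits."""
    items = iter(itemset)
    combined = arrays[next(items)]
    for item in items:
        combined &= arrays[item]
    return bin(combined).count("1")


bit_arrays = build_bit_arrays(transactions)
print(format(bit_arrays["floor"], "05b"), format(bit_arrays["varnish"], "05b"))
print(support_count(["floor", "varnish"], bit_arrays))
```

The support of any k-itemset is obtained from the arrays alone, so the database does not have to be scanned again after the first stage.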

As expected and theoretically hypothesized, the DLG algorithm performs better when support is sufficiently high. When support drops to significantly lower levels, below 0.1%, MinHash wins, based not only on execution time and the number of rules derived but also on a much lower percentage of false positives.

This shows that support-driven algorithms, though they provide very good results in most cases, are not the best choice for every association rule task. Namely, when itemsets are less frequent, or rare, a correlation-based algorithm like MinHash becomes a better choice, because it allows the user to find information that would otherwise be lost: support-then-confidence algorithms like Apriori would take a very long time to find all the itemsets at such a low minimum support and derive rules from them.

Though frequent pattern mining can extract very useful information with great applications, and the process of finding these patterns entices many scientists to research and develop methods to mine for them, a considerable amount of post-mining work is frequently required to reach meaningful conclusions. This is due to the great number of patterns that can be derived from databases, especially large ones, many of which are considered redundant.

Some examples of research developed in this area are Bastide et al. in 2000, Zhao et al. in 2009, Batbarai et al. in 2014 and Chen et al. in 2015.

3.3 Hierarchical Association Rules

Hierarchical association rules are a technique for exploring and searching for purchasing patterns at the various abstraction levels of the products.

Frequently, when trying to find consumers’ purchasing patterns at the lowest abstraction level, that is, the most specific description of the item (for example, within floor maintenance, individual cleaners or waxes), the support of each itemset is so low that the association rules that could be produced would become somewhat random and difficult to understand, and the minimum confidence threshold would also suffer for it.

To be able to extract information about the associations between certain items, algorithms for mining association rules evolved to mine not only at the specific item level but also at various abstraction levels, also called hierarchical levels.

Han et al. approached this theme in 1999 in their article "Mining Multiple-Level Association Rules in Large Databases". Instead of a simple item, each item in each transaction now has a set of characteristics that define the product and, instead of testing the relationship between items, the authors explore the relations between the different abstraction levels.

In this case study the authors opt for an approach that uses minimum support values that vary according to the abstraction level in the hierarchy. At higher levels of the concept hierarchy, the support of the items is also higher. Since specificity decreases as we move up the hierarchy, itemsets become more frequent: for example, waxes and varnishes are both finishing materials, and we are more likely to find transactions that include flooring items and finishing products than a transaction where the customer bought a cork visual floating floor and a polyurethane-based varnish for low foot traffic.

Various products are included in the same item group. As we descend towards the lower levels of the hierarchy, the item descriptions become more specific, which leads to lower support values for the itemsets.

To be able to mine for hierarchical association rules, the authors encoded the original dataset so that each item has a code associated with its position in the concept hierarchy. Each digit corresponds to a group of items at one abstraction level; for example, 121 is the item that belongs to group 1 at level 1, group 2 at level 2, and group 1 at level 3.

Figure 2 shows an example of an encoded transaction table used in the article.
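A short Python sketch can show how such codes make it easy to move between abstraction levels. The item codes below are invented, the function names are hypothetical, and padding the truncated code with `*` is a notational choice made for this example.

```python
def generalize_code(code, level):
    """Keep the first `level` digits of an item code and mask the rest."""
    return code[:level] + "*" * (len(code) - level)


def generalize_transaction(transaction, level):
    """Rewrite a transaction at the given abstraction level, without duplicates."""
    return {generalize_code(code, level) for code in transaction}


# Hypothetical encoded transaction with three-level item codes.
transaction = {"121", "122", "211"}

level_1 = generalize_transaction(transaction, 1)
level_2 = generalize_transaction(transaction, 2)
print(sorted(level_1), sorted(level_2))
```

Items 121 and 122 differ only at the third level, so at level 2 they collapse into the same group, 12*. This is why itemsets at higher levels of the hierarchy accumulate more support than itemsets at the basic item level.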

Based on the assumptions described, they propose four algorithms: ML_T2L1, ML_T1LA, ML_TML1 and ML_T2LA.

Based on the results obtained in the experiment, the algorithm with the best performance is ML_T2LA.

In 1995, Srikant and Agrawal revisited the theme of association rules and developed three algorithms to mine for them, which they named Basic, Cumulate and EstMerge.

The Basic algorithm solves the hierarchical association rule problem in a simple and direct way, by adding to each transaction the higher abstraction levels of each of its items. Although this method provides a practical solution to the problem, it makes too many scans of the database, which makes the algorithm too time consuming. For this reason, the authors propose an improvement.
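The central step of the Basic algorithm, extending each transaction with the higher abstraction levels of its items, can be sketched in Python as follows. The taxonomy, item names and function names are invented for the illustration and do not come from the cited paper or from the company's data.

```python
# Hypothetical taxonomy: each item or group points to its parent group.
parent = {
    "wax": "finishing",
    "varnish": "finishing",
    "finishing": "accessories",
    "cork floor": "flooring",
}


def ancestors(item, parent):
    """All higher abstraction levels of an item, from nearest to farthest."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result


def extend_transaction(transaction, parent):
    """Add the ancestors of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended


extended = extend_transaction({"wax", "cork floor"}, parent)
print(sorted(extended))
```

Once every transaction has been extended in this way, a single level algorithm can be run on the extended transactions, and rules that mix levels, such as flooring → wax, can be found like any other rule.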

The new variant first applies the Apriori algorithm to the generalized database, which significantly improves efficiency since, in the same way that the Apriori algorithm only considers frequent itemsets, it only considers itemsets whose higher abstraction levels also form frequent itemsets.

Srikant and Agrawal called this algorithm Cumulate. When applied to a dataset, its performance is better than that of the Basic algorithm, even when the minimum support conditions are increased.

Still, for the study in question, the algorithm that registered the best performance was EstMerge.

This algorithm tries to optimize the process by resorting to sampling. First, the algorithm determines the expected support for each item, and then it counts every itemset expected to be frequent.

Yadav et al., in 2015, propose another algorithm, MLTransTrie, that also uses different minimum support values for different abstraction levels. It is a bottom-up approach with two stages.

First, the algorithm finds the frequent 1-itemsets and computes their support; then it filters the database to remove the 1-itemsets that are not frequent and the transactions that contain only infrequent items.

Second, the database is transposed, sorted in descending order, and converted to bit format.

This algorithm showed shorter execution and scanning times than most multiple level association rule mining algorithms, and it also takes less space in memory.

In 2015, Zekić-Sušac et al. also published a paper in which they study a database by applying multiple level association rules and use both objective measures (support, confidence, lift) and subjective measures (heuristic unexpectedness and heuristic actionability, determined by a human expert) to evaluate the derived set of rules.

Their research shows that the hierarchical rule mining method yielded the most interesting rules. Moreover, using both objective and subjective measures allowed them to extract some unexpected but actionable results.

As stated before, most algorithms that search for association rules must perform multiple scans of the database, so they become time consuming and take a great amount of memory when dealing with large databases. In 2014, Xu et al. proposed a new genetic algorithm that attempts to solve this problem.

“Genetic algorithm is a heuristic search approach that mimics the process of natural evolution and generates solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover” (Xu et al. 2014).

The algorithm uses a sub-tree based encoding system. Starting from an initial catalog tree, which is a representation of the concept hierarchy, multiple level association rules are represented as sub-trees of this catalog tree. In each sub-tree, which represents a valid multi-level association rule, each leaf is assigned a value that indicates its role in the rule: 0 if it is an antecedent, 1 if it is a consequent, and -1 if it does not participate in the association rule.
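As a rough illustration of this encoding, the Python sketch below decodes the leaf values of one sub-tree into a rule. The sub-tree is flattened into a dictionary from leaf name to value, which ignores the tree structure used by the authors, and the leaf names and function name are invented for the example.

```python
ANTECEDENT, CONSEQUENT, UNUSED = 0, 1, -1

# Hypothetical leaves of one sub-tree, flattened to {leaf name: assigned value}.
leaf_values = {
    "cork floor": ANTECEDENT,
    "underlay": ANTECEDENT,
    "varnish": CONSEQUENT,
    "cleaner": UNUSED,
}


def decode_rule(leaf_values):
    """Split the leaves into antecedent and consequent; None if either is empty."""
    antecedent = sorted(leaf for leaf, v in leaf_values.items() if v == ANTECEDENT)
    consequent = sorted(leaf for leaf, v in leaf_values.items() if v == CONSEQUENT)
    if not antecedent or not consequent:
        return None
    return antecedent, consequent


rule = decode_rule(leaf_values)
print(rule)
```

Under this encoding the sub-tree above stands for the rule {cork floor, underlay} → {varnish}, and the cleaner leaf plays no part in it.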

To mine for the association rules, genetic operators were put in place to mimic the processes that occur naturally in genetics. The main function to consider is the fitness function, which determines how strong a candidate is to move on to the next generation and reproduce. Selection is the operator that chooses, based on the fitness function, the candidates that move on to the next generation.

Next is the crossover operator, which allows the nodes of the sub-trees to be crossed to generate new offspring.

Finally, there is a mutation operator, which diversifies the generated rules by randomly choosing a node of a sub-tree and either assigning it alternative attribute values, pruning its sub-tree or adding a sub-tree to the node.


The genetic algorithm performed better than the Apriori algorithm, taking less time to execute and producing more accurate results. However, it requires a very specific setup of a catalog tree for the database, without which the algorithm cannot be run.

In 2012, Kost et al. found a new application for multiple level association rule mining. The authors used the Clinical Classifications Software (CCS) and the ICD-9-CM numerical hierarchy to build a generalized database from the 2009 Vermont Uniform Hospital Discharge Data Set, mine for generalized association rules at different levels of abstraction and identify disease co-occurrences.

Even though these algorithms all show acceptable efficiency and efficacy, in this case study the search is simplified by using the built-in algorithms available in the arules package of the R software.


4. The Problem of Association Rule Mining

Association rule mining is a Data Mining technique that extracts relevant and useful information by finding frequent patterns, and it has many areas of application.

The main application is finding purchasing patterns on a transactional database from a certain establishment. This is the application that will be explored in this dissertation.

Let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions from any commercial establishment, let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items sold by that establishment and that form $T$, and let $X, Y \subseteq I$. An association rule is a relationship $X \rightarrow Y$, meaning that the occurrence of $X$ implies the occurrence of $Y$.

Support is a measure of how often a certain itemset is present in the transactions, that is, the proportion of transactions that contain all the items of the itemset. If an itemset is considered frequent, its items appear together in a share of the transactions that is at least as high as the threshold established by the user.

This threshold is called minimum support. Its value is chosen after estimating how many frequent itemsets are generated for different candidate values, and it then serves as the threshold for finding frequent itemsets and association rules.

The support of an association rule is calculated as follows:

$$\mathrm{sup}(X \rightarrow Y) = P(X \cup Y)$$

Another measure that plays an extremely important role where association rules are concerned is the confidence of a rule. Given an association rule $X \rightarrow Y$, the confidence of the rule is calculated as follows:

$$\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)}$$

A rule is considered a strong association if its support and confidence reach the minimum support value and the minimum confidence value, respectively.
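The two measures and the strong rule test can be sketched in plain Python as follows. This is only an illustration on made-up transactions, not the arules implementation used in this study, and the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """sup(X union Y) / sup(X) for the rule X -> Y (sup(X) must be > 0)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def is_strong(x, y, transactions, min_sup, min_conf):
    """True if the rule X -> Y reaches both user-defined thresholds."""
    rule_support = support(set(x) | set(y), transactions)
    return rule_support >= min_sup and confidence(x, y, transactions) >= min_conf


# Made-up transactions, each one a set of item names.
transactions = [
    {"cork", "skirting"},
    {"cork"},
    {"vinyl", "skirting"},
    {"cork", "skirting"},
]
print(support({"cork", "skirting"}, transactions))       # 0.5
print(confidence({"cork"}, {"skirting"}, transactions))  # 2 of the 3 cork transactions
```

Here the rule cork -> skirting holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions that contain cork (confidence of about 0.67).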

In addition to these two measures, a third way of evaluating the quality of any association is lift.

When looking at a X Y association, lift is calculated as follows: ( ( ) ( ) ( ) ) P X lift X Y P Y X P Y   =


Lift measures not only the quality of an association rule, it also provides information on how the items in the rule are related.

From the way it is calculated, it is clear that lift compares the probability that $X$ and $Y$ actually occur together with the probability that they would co-occur if they were independent.

Given this fact, if $\mathrm{lift} = 1$, the observed probability of $X$ and $Y$ occurring together is the same as the probability expected under independence, which means that the antecedent $X$ and the consequent $Y$ of this rule are independent.

If these itemsets are independent, no meaningful association can be derived from this relationship.

Consequently, if $\mathrm{lift} < 1$, $X$ and $Y$ appear together in fewer transactions than expected under independence. The rule that was derived is therefore weak and should not be considered a good association.

On the other hand, if $\mathrm{lift} > 1$, the co-occurrence of $X$ and $Y$ exceeds the expected number of transactions where they would co-occur. The higher the lift, the stronger the association, and we can say that the occurrence of $X$ has a strong effect on the occurrence of $Y$. These are the rules that provide us with strong associations that are likely to be actionable.
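The three cases for lift can be illustrated with a small plain Python sketch. The transactions are made up and the helper names (`lift`, `interpret`) are hypothetical; the sketch only mirrors the definition given above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def lift(x, y, transactions):
    """P(X union Y) / (P(X) * P(Y)); both X and Y must occur at least once."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


def interpret(value, tol=1e-9):
    """Map a lift value to the three cases discussed in the text."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positive association" if value > 1.0 else "negative association"


# Made-up datasets, one per case.
together = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
independent = [{"a", "b"}, {"a"}, {"b"}, set()]
apart = [{"a"}, {"b"}, {"a"}, {"b"}]

for name, data in [("together", together), ("independent", independent), ("apart", apart)]:
    value = lift({"a"}, {"b"}, data)
    print(name, value, interpret(value))
```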

When moving from single level association rules to multi-level association rules, the first stage in the mining process is building the concept hierarchy and encoding the database.

After creating a concept hierarchy, or taxonomy, and encoding the database, it is time to estimate minimum support and minimum confidence so the process of generating frequent itemsets can move on according to the initial conditions established by the user.

When trying to find association rules that include the various abstraction levels of the items, several approaches can be taken. Most of them start by analyzing the higher levels in the hierarchy and move down to the lower levels, generating frequent itemsets on each level from the frequent itemsets previously found on the level immediately above.

When generating frequent itemsets one can assume minimum support to be universal, meaning that when mining the different abstraction levels, the minimum support stays the same as it reaches the lower levels of the taxonomy.


In another approach, the one chosen by Han et al., minimum support varies across the levels of the taxonomy: the lower the abstraction level, the lower the minimum support required for an itemset to be considered frequent. The reason is that specificity increases when moving down the hierarchy, which makes all itemsets less frequent.
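The difference between a universal and a reduced minimum support can be sketched as below. The thresholds are hypothetical values chosen only for illustration; they are not taken from Han et al. or from this study.

```python
# Hypothetical thresholds (illustration only). Level 1 is the most general
# level of the taxonomy and level 4 the most specific one.
UNIFORM_MIN_SUP = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05}
REDUCED_MIN_SUP = {1: 0.05, 2: 0.02, 3: 0.01, 4: 0.005}


def is_frequent(itemset_support, level, thresholds):
    """True if an itemset found at `level` reaches that level's minimum support."""
    return itemset_support >= thresholds[level]


# A level 4 itemset present in 1% of the transactions (made-up figure).
print(is_frequent(0.01, 4, UNIFORM_MIN_SUP))  # False
print(is_frequent(0.01, 4, REDUCED_MIN_SUP))  # True
```

With a universal threshold the specific itemset is discarded, whereas the reduced threshold keeps it, which is the motivation for lowering support at the lower levels.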


5. Case study

This paper is based on the information gathered by the author during an internship at a Portuguese distribution company, where the data for this project originated.

The company in question deals in cork flooring and other types of coverings, including wall coverings, as well as maintenance products for those coverings. It also provides certain services and marketing materials that the sales teams use to promote the products and improve brand awareness. Since most of its items include cork in their basic structure, one of the company's main objectives is to draw the client's attention to the benefits of cork, mainly its sustainability and the fact that the products are eco-friendly.

This company works with very different commercial models: some of its clients are its own sales units, which buy the materials to resell, some are distributors, others are retailers and, finally, a portion of them are end consumers.

One of the rules we will be looking for is whether sales units and distributors also buy marketing materials when they buy flooring materials. This matters because the company is making a great effort to create these marketing tools and to circulate its message, so that consumers become aware of its values and know its materials. Experience shows that the better the consumer knows the products and the sales arguments, the greater the sales.

5.1 Data description and building the dataset

To be able to perform this study, a product concept hierarchy was created based on the knowledge acquired during the time spent working at the company.

The products are structured into a business segment, a category, an item group and, at the most specific level that was considered, an item. Given that this company deals in flooring, the item is, for example, the type of flooring that was bought, excluding the veneer or finishing of the item.

Each item has been encoded with a nine-digit number according to the hierarchy scheme presented in figure 3.


Level 1: Business Segment (2 digits)
Level 2: Category (2 digits)
Level 3: Item Group (2 digits)
Level 4: Item (3 digits)

[Figure: tree in which Business Segment 1 branches into Category 1 and Category 2, each category branches into item groups (Item Group 1 to Item Group 4), and each item group branches into items (Item 1 to Item n).]

Figure 3: Scheme of the hierarchy architecture and encoding

Given the scheme above, below is presented an example of the encoding of an item in the database:


| Business Segment | Category | Item Group | Item |
|---|---|---|---|
| 13 - Coverings | 1321 - Cork Coverings | 132132 - Cork Coverings Glue down | 132132404 - Cork Coverings Glue down Cork o Floor |

Table 1: Example of encoding used for a specific item in the database.

As table 1 shows, the first two digits refer to the business segment, so 13 is Coverings. The second level adds two digits, giving 1321, which corresponds to Cork Coverings. The third level adds two more digits, 32, so 132132 corresponds to Cork Coverings Glue down. Finally, the basic item adds three digits instead of two, because this level has more items than the previous ones, and 404 corresponds to the basic item Cork o Floor. The full code 132132404 therefore identifies an item that is a Glue down Cork Covering and an element of the Cork o Floor family.
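The encoding can be made concrete with a short Python sketch that splits a nine-digit code into its four hierarchy prefixes. The function name `level_codes` is hypothetical; the digit widths (2, 2, 2, 3) are the ones described above.

```python
LEVEL_WIDTHS = (2, 2, 2, 3)  # business segment, category, item group, item


def level_codes(code):
    """Return the cumulative prefix of `code` at each hierarchy level."""
    code = str(code)
    if len(code) != sum(LEVEL_WIDTHS) or not code.isdigit():
        raise ValueError("expected a nine-digit item code")
    prefixes, end = [], 0
    for width in LEVEL_WIDTHS:
        end += width
        prefixes.append(code[:end])
    return prefixes


print(level_codes("132132404"))  # ['13', '1321', '132132', '132132404']
```

Because each level is a prefix of the next, the ancestor of any item at any level can be recovered from the item code alone.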

As was explained before, for the items in Coverings the basic unit was defined by the item's structure and not by the actual basic item. If we were to consider the true basic item, we would have to consider not only the item structure but also the kind of veneer it has, its color, its finishing and the other characteristics that define the basic item at its most specific level. If we considered every characteristic of the items, the hierarchy would get too long and the frequent itemsets would have such a low support that no relevant rules could be found.

After encoding all the items in the database according to the established hierarchy, the database needed to be separated into category, item group and basic item so that frequent itemsets could be mined at each level of the hierarchy.

Before the database could be separated into three different files, some preprocessing was still needed, given that the original data file was the billing information for all orders for the years 2015 and 2016. To transform this file into the transactional database that served as the starting point for the three files used to mine for multilevel association rules, the following changes were performed:

• First, all the customer's information was removed, specifically name, address, telephone and identification number, since it was not relevant when mining for association rules.

• The type of business the customer owns was also removed, given that this information was not relevant to the mining of multilevel association rules.

• All rows in the database that referred to credit notes issued to clients were removed, since only sales were needed.

• All information referring to the invoice number and any other information on the invoice was also removed.

• Information on the sales representative was deleted since it was not relevant to our problem.

• All information on pricing and on the amount of each item purchased in a transaction was also eliminated, since it was not relevant to mining multilevel association rules.

• Finally, all information on where the items were delivered was removed.

The data used for the study contains purchases made by customers over a period of 2 years. The company does not stock most items, so purchases sometimes take months to be completed and some of them carry over to the following year. In addition, since items do not all show the same rotation, production was planned for factory efficiency and some items were dispatched to the customer before all the items on the order were completed, which meant the order itself was still incomplete.

For this study, only completed orders that were placed and dispatched in the years 2015 and 2016 were considered.

After cleaning the dataset and removing all the unusable information, the data was then ready to be separated into 3 transactional datasets:

• transactional database_Cat – the items are encoded with level 1, Business Segment and level 2, Category.

• transactional database_ItGr – the items are encoded with level 1, 2 and 3, which means we can mine for itemsets at the Item Group level.

• transactional database_BUN – the items are encoded to their most specific level, level 4, which is the basic unit, and this is where we will mine for frequent itemsets at the item level.
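The three datasets listed above differ only in how much of each item code is kept. A minimal Python sketch of this projection follows; the study itself built the files in Excel, so the function `project` and the dictionary `PREFIX_LENGTH` are hypothetical names used only for illustration.

```python
# Number of leading digits kept at each level (2 + 2, 2 + 2 + 2, full code).
PREFIX_LENGTH = {"category": 4, "item_group": 6, "basic_item": 9}


def project(transaction, level):
    """Truncate every item code in a transaction to the given hierarchy level."""
    length = PREFIX_LENGTH[level]
    return {code[:length] for code in transaction}


# One made-up transaction built from item codes that appear in this chapter.
transaction = {"132132404", "132131405", "122132402"}
print(sorted(project(transaction, "category")))    # ['1221', '1321']
print(sorted(project(transaction, "item_group")))  # ['122132', '132131', '132132']
```

Since the result is a set, two different items of the same category collapse into a single category entry, which is why the higher levels show higher support.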

This transactional database has a total of 149916 transactions and comprises 108 different items within a set of 5 different business segments, 18 categories and 38 item groups.


The average number of items purchased per transaction is 1.8989 at the basic unit level.

If a single transaction contained more than one unit of any item, the repeated entries were removed until only one of each item remained per transaction.

Since when searching for frequent itemsets, the Apriori algorithm doesn’t account for the quantity of items but rather the number of transactions where they are present, there is no reason to replicate the items on any transaction.

To build this dataset the software used was Excel from Office 365. After all the files were formatted in a way that could be scanned and analyzed, the software used for the analysis was R.

5.2 Scanning and treating data

After the database was created, the datasets had to be read into R's environment so that they could be treated and analyzed. For this, the following command was used:

tr_cat <- read.transactions("transactional dataset_cat.csv", format="single", rm.duplicates=TRUE, sep=",", cols=c("TransactionID", "ItemCat"))

Figure 4: Command used to scan database into R environment.

The command read.transactions is a function from the package arules. It reads the transactional dataset and creates an object of class transactions.

The format argument, set to single as opposed to basket, refers to the way the transactional dataset is put together. If the file is in single format, each row is identified with a transaction number and there are various rows that correspond to the same transaction. An example of this configuration is presented in table 2.

| TransactionID | ItemBUN |
|---|---|
| … | … |
| 3 | 122132402 |
| 3 | 122132402 |
| 4 | 132131405 |
| 4 | 132131405 |
| 4 | 132131405 |
| 5 | 132131405 |
| 6 | 132131405 |
| … | … |

Table 2: Excerpt from transactional database at the basic item level.

The example shown in table 2 was taken from the basic item transactional dataset built for this study. As the table demonstrates, another aspect that needed to be taken into account was the replication of items in the same transaction, which is why the rm.duplicates argument of the read.transactions function was used.

When set to TRUE, the rm.duplicates argument removes all duplicate items in a transaction so that every item only shows up in a transaction once. When mining for frequent itemsets and association rules, the number of times an item was purchased in the same transaction, or the quantity that was bought, has no bearing on the task at hand: it affects neither the frequent itemsets that are generated nor the associations derived from them. This is because the algorithm does not look at the quantity of a certain item that was purchased but rather at whether the item is present in the transaction.

After setting this argument to TRUE the dataset had all these duplicates removed. An example of this can be seen in table 3.

| TransactionID | ItemBUN |
|---|---|
| … | … |
| 3 | 122132402 |
| 4 | 132131405 |
| 5 | 132131405 |
| 6 | 132131405 |
| … | … |
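The effect of reading a single format file with duplicates removed can be sketched in plain Python. This is an illustration of the idea, not of how arules implements it, and the function name `rows_to_transactions` is hypothetical.

```python
from collections import defaultdict


def rows_to_transactions(rows):
    """Group (transaction id, item) rows into one item set per transaction."""
    transactions = defaultdict(set)
    for transaction_id, item in rows:
        transactions[transaction_id].add(item)  # a set ignores repeated items
    return dict(transactions)


# Rows taken from the excerpt of the basic item dataset shown above.
rows = [
    (3, "122132402"), (3, "122132402"),
    (4, "132131405"), (4, "132131405"), (4, "132131405"),
    (5, "132131405"),
    (6, "132131405"),
]
for transaction_id, items in rows_to_transactions(rows).items():
    print(transaction_id, sorted(items))
```

Each transaction ends up with a single copy of each item, which matches the deduplicated excerpt.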


To perform an analysis on the data, the other commands used are presented below:

| Command | Purpose |
|---|---|
| inspect | Display associations and transactions in readable form |
| is.redundant | Find redundant rules |
| is.closed | Find closed frequent itemsets |
| is.maximal | Find maximal frequent itemsets |
| apriori | Mine frequent itemsets or association rules using the Apriori algorithm |
| setwd | Set a working directory |
| length | Get the length of an object |
| plot | Generic function for plotting R objects |


6. Parameter estimation and frequent itemsets

This section of the dissertation deals with the parameters that will be used to mine for hierarchical association rules. As has previously been stated, the support of the itemsets at the basic item level is much lower than the support of the itemsets at higher hierarchical levels. Confidence values should be as high as possible but, given that multilevel association rules will be mined, these values are also expected to be lower the further down the hierarchy.

To derive the multi-level association rules, the same value of minimum support and minimum confidence will be used for all levels of the hierarchy.

To evaluate these rules, even though support and confidence values will also be considered, the main measure that will be used to evaluate the association rules will be lift.

Estimating minimum support and confidence values is a necessary step towards a more satisfying set of rules, since these values determine which itemsets are frequent and, consequently, how relevant and actionable the resulting rules are.

6.1. Parameter Estimation

To mine for multilevel association rules, first, a value for minimum support and minimum confidence needs to be established which will, as has been stated before, serve as a threshold for finding frequent itemsets and, consequently, association rules.

When looking for these rules the structure of the database needs to be taken into account. Seeing as this is a multilevel association rules problem, all the levels in the hierarchy have to be considered when establishing a minimum support, and especially the basic item level, since this is the level where the frequent itemsets with the lowest support and the rules with the lowest confidence are expected.

To look for closed frequent itemsets with different values for minimum support, the following command in R was used.

Closeditemsets_cat <- apriori(tr_cat, parameter=list(target='closed frequent itemsets', support=0.001))

Figure 5: Print screen of the command used in R to find frequent itemsets through the Apriori algorithm at the Category level of the hierarchy with support=0.1%


To choose a minimum support value that leads to a significant number of frequent itemsets, different values for minimum support were tested.

Table 5 shows how various minimum support levels performed at the different hierarchy levels: Category, Item Group and Basic Item.

| Support | Category: number of itemsets | Item Group: number of itemsets | Basic Item: number of itemsets |
|---|---|---|---|
| 0,1 | 3 | 3 | 2 |
| 0,05 | 4 | 5 | 6 |
| 0,025 | 9 | 7 | 9 |
| 0,005 | 19 | 26 | 33 |
| 0,001 | 42 | 59 | 108 |
| 0,0001 | 145 | 270 | 333 |

Table 5: Number of itemsets for each hierarchical level for different support values.

By analysing the table above we can see that the minimum support value that yields the most itemsets is 0,0001. However, this is a very low support and it ends up finding itemsets that are not that frequent, so the sensible choice is 0,001: it is low, but not extremely low, and it still finds a reasonable number of frequent itemsets.
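The kind of sweep summarised in the table can be sketched on toy data. The following Python block is a brute force enumeration that is only practical for a handful of items; it is neither the Apriori algorithm nor the arules call used in this study, and the transactions are made up.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Enumerate itemsets level by level and keep those reaching `min_sup`."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, len(items) + 1):
        level = {}
        for candidate in combinations(items, size):
            sup = sum(set(candidate) <= t for t in transactions) / n
            if sup >= min_sup:
                level[frozenset(candidate)] = sup
        if not level:  # no frequent set of this size, so none of a larger size
            break
        found.update(level)
    return found


# Made-up transactions.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
for min_sup in (0.75, 0.5, 0.25):
    print(min_sup, len(frequent_itemsets(transactions, min_sup)))
```

On this toy data the number of frequent itemsets grows from 2 to 5 to 7 as the threshold is lowered, the same qualitative behaviour seen in the study's datasets.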

To make this choice easier, instead of looking only at the number of frequent itemsets generated versus the minimum support value, the number of rules derived at the different support levels should also be tested, to better evaluate which value shows an acceptable performance.

To do this, various tests were performed combining the support levels mentioned above with different minimum confidence values, to see which pair (min. sup., min. conf.) gives the most satisfactory results.

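Category

| Confidence \ Support | 0,1 | 0,05 | 0,025 | 0,005 | 0,001 |
|---|---|---|---|---|---|
| 0,3 | 0 | 0 | 2 | 9 | 15 |
| 0,1 | 0 | 0 | 2 | 11 | 28 |
| 0,05 | 0 | 0 | 6 | 16 | 46 |
| 0,01 | 0 | 0 | 6 | 19 | 58 |
| 0,001 | 0 | 0 | 6 | 19 | 65 |

Item Group

| Confidence \ Support | 0,1 | 0,05 | 0,025 | 0,005 | 0,001 |
|---|---|---|---|---|---|
| 0,3 | 0 | 0 | 1 | 11 | 14 |
| 0,1 | 0 | 0 | 1 | 15 | 35 |
| 0,05 | 0 | 0 | 4 | 21 | 53 |
| 0,01 | 0 | 0 | 4 | 23 | 70 |
| 0,001 | 0 | 0 | 4 | 23 | 81 |

Basic Item

| Confidence \ Support | 0,1 | 0,05 | 0,025 | 0,005 | 0,001 |
|---|---|---|---|---|---|
| 0,3 | 0 | 0 | 1 | 8 | 58 |
| 0,1 | 0 | 0 | 1 | 14 | 94 |
| 0,05 | 0 | 0 | 4 | 19 | 116 |
| 0,01 | 0 | 0 | 4 | 23 | 143 |
| 0,001 | 0 | 0 | 4 | 23 | 159 |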

Table 6: Number of association rules derived for each hierarchical level for different support and confidence values.

As can be seen from the table above, and as was expected, (min. sup., min. conf.) = (0,001, 0,001) derives the most rules at every hierarchy level. Still, with a minimum support of 0,001 and a minimum confidence of 0,1, the number of derived rules remains high enough to reach meaningful conclusions. Therefore, even though 0,001 is a very low minimum support, 0,1 is still a good value for minimum confidence.

6.2. Closed Frequent Itemsets

Before worrying about finding association rules, there is another problem in pattern mining that has been studied alongside it, especially given how long some pattern mining algorithms take to execute and how many scans of the database they need: finding frequent itemsets.


Most algorithms find frequent itemsets by looking at the support of each set and considering it frequent if this support reaches the threshold established by the user. This is not a bad method at all, and some algorithms even manage to perform very well with it, even on large databases. The best example of this is the Apriori algorithm. Still, some of the frequent itemsets found this way end up being redundant.

A solution for this problem is to search for closed frequent itemsets instead of just any frequent itemset.

A frequent itemset is called closed if there does not exist a proper superset of the itemset that has the same support (Borgelt et al., 2011).

This has been a subject of study for many scientists, and several articles have proposed algorithms for finding these itemsets, such as Zaki et al. in 2001, Borgelt et al. in 2011 and Pasquier et al. in 2015.

Another concept that is important when dealing with frequent itemsets is the maximal frequent itemset.

A frequent itemset is called maximal if it has no superset that is frequent.

This is a more compact representation of frequent itemsets which is very valuable when dealing with algorithms that require a lot of memory and execution time.
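The two definitions can be sketched as simple filters over a set of frequent itemsets. The block below is plain Python on made-up supports, not the `is.closed` and `is.maximal` functions of arules, and the function names are hypothetical. Supports are compared with exact equality, which is safe for these literals; on real data, comparing absolute counts would be more robust.

```python
def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < other and sup == other_sup for other, other_sup in frequent.items())
    }


def maximal_itemsets(frequent):
    """Keep itemsets that have no frequent proper superset."""
    return {s: sup for s, sup in frequent.items() if not any(s < other for other in frequent)}


# Made-up frequent itemsets with their supports.
frequent = {
    frozenset({"a"}): 0.75,
    frozenset({"b"}): 0.75,
    frozenset({"c"}): 0.5,
    frozenset({"a", "b"}): 0.5,
    frozenset({"b", "c"}): 0.5,
}
print(sorted(sorted(s) for s in closed_itemsets(frequent)))
print(sorted(sorted(s) for s in maximal_itemsets(frequent)))
```

In this example {c} is not closed because {b, c} has the same support, and only {a, b} and {b, c} are maximal. Every maximal itemset is also closed, so the maximal sets form the more compact of the two representations.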

6.2.1. Category Frequent Itemsets

Before deriving association rules, a preliminary analysis was made to find closed frequent itemsets and perhaps get a preview of what might happen when searching for associations.

As was stated in the previous section, the minimum support used was 0,001. The top 15 closed frequent itemsets are shown in table 7.

| Items | Description | Support |
|---|---|---|
| {1321} | Coverings - Cork Coverings | 0,4216 |
| {1323} | Coverings - Vinyl Coverings | 0,4199 |
| {1324} | Coverings - Wood Flooring | 0,1049 |
| {1221} | Complementary products - Accessories | 0,0835 |
| {1221,1323} | Complementary products - Accessories; Coverings - Vinyl Coverings | 0,0346 |
| {1121} | Auxiliary Products - Flooring Materials | 0,0326 |
| {1321,1323} | Coverings - Cork Coverings; Coverings - Vinyl Coverings | 0,0318 |
| {1221,1321} | Complementary products - Accessories; Coverings - Cork Coverings | 0,0256 |
| {1523} | Miscellaneous - Services | 0,0244 |
| {1122} | Auxiliary Products - Production Materials | 0,0211 |
| {1224} | Complementary products - Underlays | 0,0153 |
| {1121,1321} | Auxiliary Products - Flooring Materials; Coverings - Cork Coverings | 0,0139 |
| {1122,1321} | Auxiliary Products - Production Materials; Coverings - Cork Coverings | 0,0129 |
| {1321,1523} | Coverings - Cork Coverings; Miscellaneous - Services | 0,0108 |

Table 7: Top 15 closed frequent itemsets at the Category level of the hierarchy for a minimum support of 0,001.

The table above demonstrates something that anyone working closely with the company would expect: most transactions in the database include coverings. In fact, 9 of the top 15 closed frequent itemsets include some sort of product from the Coverings business segment. Usually the coverings that are purchased are flooring, and sometimes they are wall coverings.

The itemset with the highest support is {1321 - Coverings - Cork Coverings}. Cork Coverings are the main product sold by this company. They do not register the highest sales performance, and in a flooring business products are usually bought in large quantities rather than frequently, but they are still the product that holds the company's identity. It makes sense that they are bought frequently, across many transactions, rather than in large quantities: clients do not want to risk holding a large quantity of these items without being able to sell them, so they buy smaller quantities and keep ordering as needed.

Also, since these are among the most promoted items in the company's offer, they are present in most clients' pricelists and shops. This leads to a high support of approximately 42% for this 1-itemset, so almost half of the transactions contain these items.

Similar to Cork Coverings, but now because they are considered best sellers, Vinyl Coverings are bought frequently as well as in high quantities.

Vinyl Coverings are cheaper and are considered more resistant, though this is not always the case, so they are very sought after, and it makes sense that they are present in approximately 42% of transactions, just like Cork Coverings.

Even though this company mainly sells products that have cork as one of the main components of their coverings, and in fact only produces this type of product, it nonetheless sells wood coverings, and there is a great number of customers who still prefer wood flooring.

Still, since this is not a wood flooring company, there is a significant drop in support: at around 10%, these items are present in what is considered a low percentage of transactions for a flooring company.

The other business segment that shows a significant presence in the database is the complementary product segment.

This segment carries the items sold by the company that are often paired with flooring purchases, anything from accessories like skirtings to marketing materials. Support shows that 8% of transactions include some sort of complementary product. In the table, 5 of the top 15 closed frequent itemsets include these items, and 3 of them are in the accessories category. Moreover, 15 of the 42 closed frequent itemsets that were found include complementary products, which represents approximately 36% of frequent itemsets.

Auxiliary products are also a sought-after segment. The surprise is that production materials are among the frequent itemsets, since the company where this study was conducted is the only branch that has a production unit. More information is therefore needed to understand why these items are among those with the highest support, even though they are present in only approximately 2% of transactions. It is important to note that 2% of transactions is still approximately 3163 transactions.
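As a quick check of this figure, assuming the support of 0,0211 reported for {1122} in table 7 and the 149916 transactions of the dataset, the number of transactions is the support multiplied by the size of the database:

$$0{,}0211 \times 149916 \approx 3163$$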

In the next section the closed frequent itemsets at the Item Group level are analyzed and some more information on these items is provided.

6.2.2. Item Group Frequent Itemsets

After the closed frequent itemsets at the category level were analyzed, Item Group was the next level to look at. Seeing the frequent itemsets at the Item Group level provides more information on the most frequently purchased items and what kind of products they are.

If we look at table 8 we can see that some of the insights from the previous section are now more understandable.


| Items | Description | Support |
|---|---|---|
| {132331} | Coverings - Vinyl Coverings - Floating | 0,4144 |
| {132131} | Coverings - Cork Coverings - Floating | 0,3511 |
| {132431} | Coverings - Wood Flooring - Floating | 0,1025 |
| {122132} | Complementary products - Accessories - Skirtings | 0,0824 |
| {132132} | Coverings - Cork Coverings - Glue down | 0,0775 |
| {122132,132331} | Complementary products - Accessories - Skirtings; Coverings - Vinyl Coverings - Floating | 0,0338 |
| {132131,132331} | Coverings - Cork Coverings - Floating; Coverings - Vinyl Coverings - Floating | 0,0302 |
| {152331} | Miscellaneous - Services - Direct Application | 0,0244 |
| {122132,132131} | Complementary products - Accessories - Skirtings; Coverings - Cork Coverings - Floating | 0,0232 |
| {112231} | Auxiliary Products - Production Materials - Chemicals | 0,0211 |
| {112131} | Auxiliary Products - Flooring Materials - Assembly | 0,018 |
| {122331} | Complementary products - Marketing Materials - Advertisement | 0,0167 |
| {112132} | Auxiliary Products - Flooring Materials - Maintenance | 0,0159 |
| {122333} | Complementary products - Marketing Materials - Samples | 0,0159 |
| {112231,132132} | Auxiliary Products - Production Materials - Chemicals; Coverings - Cork Coverings - Glue down | 0,012 |

Table 8: Top 15 closed frequent itemsets at the Item Group level of the hierarchy for a minimum support of 0,001.

Coverings, specifically Cork and Vinyl, are the most purchased kind of product. By analyzing the transactions at the Item Group level we can see that, in addition to this, the kind of flooring that is most frequently purchased is floating floors.

Coverings with a glue down installation are also among the most frequently purchased items but, looking at the support values, there is a big discrepancy between the number of transactions that include floating coverings and the number that include glue down coverings. Comparing two closed frequent itemsets of the same category from the table above, {132131 - Floating Cork Coverings} and {132132 - Glue down Cork Coverings}, both are amongst the top 15, but Floating Cork Coverings have a support of 35% whereas Glue down Cork Coverings have a support of only approximately 8%.

However, 8% of transactions is still around 11579 transactions which include Glue Down Cork Coverings.

When it comes to complementary products we can now see that the most frequently purchased one is {122132 - Complementary Products - Accessories - Skirtings}. As is indicated by their presence in the top 15 closed frequent itemsets, skirtings are not only a frequent purchase by themselves but also in combination with flooring products (Floating Cork and Vinyl Coverings). This may indicate an association.

Here we can see that the Auxiliary Products - Production Materials are mostly Chemicals. This may mean that the items are being purchased for contracting projects, seeing as this is one of the types of market this company caters to.

This Item Group includes items like adhesives that may be used on a construction site.

Marketing materials are also amongst the most frequently purchased items. As was stated in section 5, this company invests a considerable amount of resources (financial and human) in developing and distributing marketing materials to its customers, so it was to be expected that these would be among the most frequently purchased items, be it advertising materials, like brochures and leaflets, gifts or samples of products.

Another kind of auxiliary product that is among the most purchased is maintenance products. This group of items includes products for regular maintenance of flooring products as well as for sporadic maintenance like waxing and polishing.

After investigating the frequent itemsets at the Item Group level, some itemsets appear very likely to lead to associations, quite possibly strong ones. This is examined when looking at single level and multi-level association rules in section 7.

6.2.3. Basic Item Frequent Itemsets

After looking at the frequent itemsets at the category level and the item group level, it is necessary to look at the basic item level to gain further insight into these purchases.

We can see in table 9 the results.

When speaking of Vinyl Floorings, given that the company sells more than one kind of vinyl flooring, these could be very different.

As is demonstrated by table 9, the kind of vinyl flooring that is most popular in most markets is LVT Floating, a product whose top layer is a vinyl sheet, like all Vinyl Coverings, and which has an HDF (High Density Fibreboard) layer.

Even though HDF products show a tendency to disappear, there are still some markets that show very high sales performance for this type of product.

When comparing the LVT Floating items to other products sold by this company, these are the most affordable and, except for real wood flooring, certainly the ones with the most realistic look and feel of wood, which is the most popular flooring visual everywhere.
