Market Basket Analysis and Recommendation Engines using real examples

... Dr Rosaria Silipo/bigdata-madesimple:

Market basket analysis, or a recommendation engine, is what lies behind the recommendations we get when we shop online or receive targeted advertising. The underlying engine collects information about people’s habits and knows that if people buy pasta and wine, they are usually also interested in pasta sauces. So, the next time you go to the supermarket and buy pasta and wine, be ready to get a recommendation for some pasta sauce!

A typical goal when applying market basket analysis is to produce a set of association rules in the following form:

IF {pasta, wine, garlic} THEN pasta-sauce

The first part of the rule is called “antecedent”, the second part is called “consequent”. A few measures, such as support, confidence, and lift, define how reliable each rule is. The most famous algorithm generating these rules is the Apriori algorithm.

We have split this use case into two parts. First, we build the required association rules on a set of example transactions; second, we deploy the rule engine in a production environment to generate recommendations for new basket data and/or new transactions.

The two workflows can be downloaded from the KNIME EXAMPLES server under 050_Applications/050016_MarketBasketAnalysis.

Here we use an artificially constructed data set consisting of two data tables: one containing the transaction data – i.e. the sequences of product IDs in fictitious baskets – and one containing product information – i.e. name and price.

Building the Association Rules

The central part in building a recommendation engine is the Association Rule Learner node, which implements the Apriori algorithm, in either the traditional or the Borgelt implementation. The Borgelt implementation offers a few performance improvements over the traditional algorithm. The output association rule set, however, remains the same.

Both Association Rule Learner nodes work on a collection of product IDs. A collection is a particular data cell type, assembling together data cells. There are many ways of producing a collection data cell from other data cells [4]: we decided to use the Cell Splitter node. The Cell Splitter node splits strings into smaller substrings according to a delimiter character. The output can assume multiple forms. One of these is a collection column (option “as set (remove duplicates)”). In our case, we split the original transaction string into many product ID substrings, using space as the delimiter character, and collect all product IDs into a collection column.
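Outside KNIME, the same preparation step can be sketched in a few lines of Python. This is only an illustration of the idea, not the Cell Splitter node itself: the function name `to_item_set` and the sample basket strings are our own assumptions, and the real data set uses its own product IDs.

```python
def to_item_set(transaction, delimiter=" "):
    """Split a delimited transaction string into a duplicate-free set of product IDs."""
    # Empty tokens (caused by repeated delimiters) are dropped.
    return frozenset(token for token in transaction.split(delimiter) if token)


# Hypothetical basket strings, one per transaction.
baskets = ["12 7 7 31", "7 31", "12  5"]
item_sets = [to_item_set(basket) for basket in baskets]
```

Using a set mirrors the “as set (remove duplicates)” option: a product that appears twice in the transaction string is counted once in the basket.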

After running on a data set with past shopping basket examples, the Association Rule Learner node produces a number of rules. Each rule includes a collection of product IDs as antecedent, one product ID as consequent, and a few quality measures, such as support, confidence, and lift.
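To make the shape of such a rule concrete, here is a minimal Python sketch of a rule record and of one plausible way to use it. The `Rule` class, the `recommend` function, and all numeric values are hypothetical. We also assume that a rule applies when its whole antecedent is contained in the basket, which is a common convention and not necessarily what the KNIME workflow does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset  # collection of product IDs
    consequent: str        # a single product ID
    support: float
    confidence: float
    lift: float


def recommend(basket, rules):
    """Return consequents of matching rules, highest confidence first."""
    basket = frozenset(basket)
    # Assumption: a rule matches if its antecedent is a subset of the basket
    # and the consequent is not already in the basket.
    matches = [r for r in rules
               if r.antecedent <= basket and r.consequent not in basket]
    matches.sort(key=lambda r: r.confidence, reverse=True)
    recommendations = []
    for rule in matches:
        if rule.consequent not in recommendations:
            recommendations.append(rule.consequent)
    return recommendations


# Illustrative rules with made-up quality measures.
rules = [
    Rule(frozenset({"pasta", "wine", "garlic"}), "pasta-sauce", 0.10, 0.80, 2.0),
    Rule(frozenset({"pasta", "wine"}), "cheese", 0.15, 0.60, 1.5),
    Rule(frozenset({"beer"}), "chips", 0.20, 0.70, 1.8),
]
suggestions = recommend({"pasta", "wine", "garlic"}, rules)
```

With these made-up rules, a basket of pasta, wine, and garlic matches the first two rules but not the third, so the suggestions are pasta-sauce followed by cheese.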

In Borgelt’s implementation of the Apriori algorithm, three support measures are available for each rule. If A is the antecedent and C is the consequent, then:

•         Body Set Support = support(A) = # items/transactions containing A

•         Head Set Support = support(C) = # items/transactions containing C

•         Item Set Support = support (A ∪ C) = # items/transactions containing both antecedent and consequent
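The three support measures above can be computed by simple counting. The following Python sketch does this on a tiny, made-up list of transactions. The helper name `support_count` and the sample baskets are our own assumptions, and the counts are absolute numbers of transactions rather than fractions.

```python
def support_count(item_sets, items):
    """Count the transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for item_set in item_sets if items <= item_set)


# Made-up transactions for illustration only.
transactions = [
    frozenset({"pasta", "wine", "garlic", "pasta-sauce"}),
    frozenset({"pasta", "wine", "garlic"}),
    frozenset({"pasta", "pasta-sauce"}),
    frozenset({"wine", "cheese"}),
]

antecedent = {"pasta", "wine", "garlic"}   # A
consequent = {"pasta-sauce"}               # C

body_set_support = support_count(transactions, antecedent)               # 2
head_set_support = support_count(transactions, consequent)               # 2
item_set_support = support_count(transactions, antecedent | consequent)  # 1
```

Note that a transaction contains A ∪ C only if it contains all items of the antecedent and the consequent, which is why the item set support can never exceed either of the other two counts.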

Item Set Support tells us how often antecedent A and consequent C are found together in an item set in the whole data set. However, the same antecedent can produce a number of different consequents. So, another measure of the rule quality is how often that antecedent A produces that consequent C among all possible consequents. This is the rule confidence.

Rule Confidence = support(A ∪ C) /support(A)
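As a small worked example of this formula, the Python sketch below computes the confidence of one rule on a made-up set of transactions. The function name `rule_confidence` and the sample data are assumptions for illustration.

```python
def rule_confidence(item_sets, antecedent, consequent):
    """Confidence = support(A ∪ C) / support(A), using transaction counts."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    count_a = sum(1 for s in item_sets if a <= s)
    if count_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    count_both = sum(1 for s in item_sets if both <= s)
    return count_both / count_a


# Made-up transactions for illustration only.
transactions = [
    frozenset({"pasta", "wine", "garlic", "pasta-sauce"}),
    frozenset({"pasta", "wine", "garlic"}),
    frozenset({"pasta", "pasta-sauce"}),
    frozenset({"wine", "cheese"}),
]

confidence = rule_confidence(
    transactions, {"pasta", "wine", "garlic"}, {"pasta-sauce"}
)  # 1 / 2 = 0.5
```

Here the antecedent appears in two baskets and only one of them also contains pasta-sauce, so the rule holds in half of the cases where it applies.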

One more quality measure ...

