<!DOCTYPE html>
<html>
<head>
<title>Association Mining With R | arules</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3675C5;
color: rgb(25, 145, 248);
color: #4582ec;
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<!--
<ul class="nav nav-pills pull-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Table of contents<b class="caret"></b>
</a>
<ul class="dropdown-menu pull-right" role="menu">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</ul>
-->
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
</form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Association Mining (Market Basket Analysis)</h1>
<blockquote>
<p>Association mining is commonly used to make product recommendations by identifying products that are frequently bought together. But, if you are not careful, the rules can give misleading results in certain cases.</p>
</blockquote>
<p>Association mining is usually done on transaction data from a retail store or an online e-commerce store. Since most transaction data is large, the <code>apriori</code> algorithm makes it practical to find these patterns, or <em>rules</em>, quickly.</p>
<p>So, what is a <em>rule</em>?</p>
<p>A rule is a notation that states which items are frequently bought together with which other items. It has a left-hand side (<em>LHS</em>) and a right-hand side (<em>RHS</em>) and can be represented as follows:</p>
<p><strong>itemset A => itemset B</strong></p>
<p>This means that the items on the right were frequently purchased along with the items on the left.</p>
<h2>How to measure the strength of a rule?</h2>
<p>The <code>apriori()</code> function generates the set of rules that meet the minimum support and confidence you specify for a given set of transactions. It also reports the <em>support</em>, <em>confidence</em> and <em>lift</em> of those rules. These three measures can be used to judge the relative strength of the rules. So what do these terms mean?</p>
<p>Let's consider the rule <strong>A => B</strong> in order to compute these metrics.</p>
<p><br /><span class="math display">$$Support = \frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions} = P\left(A \cap B\right)$$</span><br /></p>
<p><br /><span class="math display">$$Confidence = \frac{Number\ of\ transactions\ with\ both\ A\ and\ B}{Total\ number\ of\ transactions\ with\ A} = \frac{P\left(A \cap B\right)}{P\left(A\right)}$$</span><br /></p>
<p><br /><span class="math display">$$Expected\ Confidence = \frac{Number\ of\ transactions\ with\ B}{Total\ number\ of\ transactions} = P\left(B\right)$$</span><br /></p>
<p><br /><span class="math display">$$Lift = \frac{Confidence}{Expected\ Confidence} = \frac{P\left(A \cap B\right)}{P\left(A\right).P\left(B\right)}$$</span><br /></p>
<p><em>Lift</em> is the factor by which the co-occurrence of A and B exceeds the co-occurrence that would be expected if A and B were independent. A lift above 1 means A and B appear together more often than independence would predict, so the higher the lift, the stronger the association between A and B.</p>
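<p>As a rough illustration, the four measures can be worked out by hand in base R. The five baskets and the item names below are made up for this sketch and are not part of the <code>Groceries</code> data; A is taken to be ‘bread’ and B to be ‘butter’.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy data: five baskets, each a character vector of items
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("milk"),
  c("bread", "milk")
)
n <- length(baskets)

# One TRUE/FALSE per basket: does the basket contain the item?
has <- function(item) vapply(baskets, function(b) item %in% b, logical(1))

p_a  <- sum(has("bread")) / n                   # P(A)       = 4/5
p_b  <- sum(has("butter")) / n                  # P(B)       = 2/5
p_ab <- sum(has("bread") & has("butter")) / n   # P(A and B) = 2/5

support    <- p_ab                    # 0.4
confidence <- p_ab / p_a              # 0.5
expected   <- p_b                     # 0.4 (expected confidence)
lift       <- confidence / expected   # 1.25</code></pre></div>
<p>Here the lift of 1.25 says that ‘butter’ shows up in baskets containing ‘bread’ 1.25 times as often as it shows up in baskets overall.</p>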
<p>Let's see how to get the rules, confidence, lift, etc. using the <code>arules</code> package in R.</p>
<h2>Example</h2>
<h4>Transactions data</h4>
<p>Let's play with the <code>Groceries</code> data that comes with the <code>arules</code> package. Unlike with a data frame, <code>head(Groceries)</code> does not display the items in each transaction. To view the transactions, use the <code>inspect()</code> function instead.</p>
<p>Since association mining deals with transactions, the data has to be of class <code>transactions</code>, which is provided by the <code>arules</code> package. This step is necessary because the <code>apriori()</code> function operates on <code>transactions</code> objects; <code>Groceries</code> is already stored in this class.</p>
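<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(arules)
<span class="kw">data</span>(Groceries) <span class="co"># load the example data set shipped with arules</span>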
<span class="kw">class</span>(Groceries)
<span class="co">#> [1] "transactions"</span>
<span class="co">#> attr(,"package")</span>
<span class="co">#> [1] "arules"</span>
<span class="kw">inspect</span>(<span class="kw">head</span>(Groceries, <span class="dv">3</span>))
<span class="co">#> items </span>
<span class="co">#> 1 {citrus fruit, </span>
<span class="co">#> semi-finished bread, </span>
<span class="co">#> margarine, </span>
<span class="co">#> ready soups} </span>
<span class="co">#> 2 {tropical fruit, </span>
<span class="co">#> yogurt, </span>
<span class="co">#> coffee} </span>
<span class="co">#> 3 {whole milk} </span></code></pre></div>
<p>To read transactions from a file, use <code>read.transactions()</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">tdata <-<span class="st"> </span><span class="kw">read.transactions</span>(<span class="st">"transactions_data.txt"</span>, <span class="dt">sep=</span><span class="st">"</span><span class="ch">\t</span><span class="st">"</span>)</code></pre></div>
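<p>To try this without a data file at hand, the sketch below writes three made-up baskets to a temporary file and reads them back. It assumes the file holds one transaction per line with tab-separated items, which is what <code>format = "basket"</code> expects; the item names are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)

# Write three made-up baskets to a temporary file, one basket per line
tmp <- tempfile(fileext = ".txt")
writeLines(c("bread\tbutter", "bread\tmilk\teggs", "milk"), tmp)

# format = "basket": each line is one transaction, items separated by sep
tdata <- read.transactions(tmp, format = "basket", sep = "\t")
inspect(tdata)   # list the items of each transaction

unlink(tmp)      # clean up the temporary file</code></pre></div>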
<p>If your transactions are already stored as a data frame, you can convert it to class <code>transactions</code> as follows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">tData <-<span class="st"> </span><span class="kw">as</span> (myDataFrame, <span class="st">"transactions"</span>) <span class="co"># convert to 'transactions' class</span></code></pre></div>
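<p>The same coercion also works when the baskets are held in a plain R list of character vectors. The list below is a made-up example, and <code>basketList</code> and <code>tList</code> are names chosen only for this sketch.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)

# Hypothetical baskets held as a plain R list
basketList <- list(
  c("bread", "butter"),
  c("bread", "milk", "eggs"),
  c("milk")
)

tList <- as(basketList, "transactions")   # one transaction per list element
size(tList)                               # number of items in each transaction</code></pre></div>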
<p>Here are a couple more utility functions that are good to know:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">size</span>(<span class="kw">head</span>(Groceries)) <span class="co"># number of items in each observation</span>
<span class="co">#> [1] 4 3 1 4 4 5</span>
<span class="kw">LIST</span>(<span class="kw">head</span>(Groceries, <span class="dv">3</span>)) <span class="co"># convert 'transactions' to a list, note the LIST in CAPS</span>
<span class="co">#> [[1]]</span>
<span class="co">#> [1] "citrus fruit" "semi-finished bread" "margarine" </span>
<span class="co">#> [4] "ready soups" </span>
<span class="co">#> </span>
<span class="co">#> [[2]]</span>
<span class="co">#> [1] "tropical fruit" "yogurt" "coffee" </span>
<span class="co">#> </span>
<span class="co">#> [[3]]</span>
<span class="co">#> [1] "whole milk"</span></code></pre></div>
<h2>How to see the most frequent items?</h2>
<p>The <code>eclat()</code> function takes a transactions object and returns the frequent itemsets, that is, the itemsets whose support is at least the value you pass to the <code>supp</code> argument. The <code>maxlen</code> argument sets the maximum number of items in each frequent itemset.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">frequentItems <-<span class="st"> </span><span class="kw">eclat</span> (Groceries, <span class="dt">parameter =</span> <span class="kw">list</span>(<span class="dt">supp =</span> <span class="fl">0.07</span>, <span class="dt">maxlen =</span> <span class="dv">15</span>)) <span class="co"># calculates support for frequent items</span>
<span class="kw">inspect</span>(frequentItems)
<span class="co">#> items support </span>
<span class="co">#> 1 {other vegetables,whole milk} 0.07483477</span>
<span class="co">#> 2 {whole milk} 0.25551601</span>
<span class="co">#> 3 {other vegetables} 0.19349263</span>
<span class="co">#> 4 {rolls/buns} 0.18393493</span>
<span class="co">#> 5 {yogurt} 0.13950178</span>
<span class="co">#> 6 {soda} 0.17437722</span>
<span class="kw">itemFrequencyPlot</span>(Groceries, <span class="dt">topN=</span><span class="dv">10</span>, <span class="dt">type=</span><span class="st">"absolute"</span>, <span class="dt">main=</span><span class="st">"Item Frequency"</span>) <span class="co"># plot frequent items</span></code></pre></div>
<p><img src='screenshots/item_frequency_plot_arules.png' width='528' height='289' /></p>
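<p>If you only need the support of single items, without going through <code>eclat()</code>, one possible shortcut is <code>itemFrequency()</code>. The sketch below sorts its result; <code>itemSupport</code> is just a name picked for this example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)
data(Groceries)

# Relative frequency (support) of every single item, largest first
itemSupport <- sort(itemFrequency(Groceries, type = "relative"), decreasing = TRUE)
head(itemSupport, 5)   # the five most frequent items</code></pre></div>
<p>The values should agree with the single-item supports reported by <code>eclat()</code> above.</p>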
<h2>How to get the product recommendation rules?</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rules <-<span class="st"> </span><span class="kw">apriori</span> (Groceries, <span class="dt">parameter =</span> <span class="kw">list</span>(<span class="dt">supp =</span> <span class="fl">0.001</span>, <span class="dt">conf =</span> <span class="fl">0.5</span>)) <span class="co"># min support 0.001, min confidence 0.5</span>
rules_conf <-<span class="st"> </span><span class="kw">sort</span> (rules, <span class="dt">by=</span><span class="st">"confidence"</span>, <span class="dt">decreasing=</span><span class="ot">TRUE</span>) <span class="co"># 'high-confidence' rules.</span>
<span class="kw">inspect</span>(<span class="kw">head</span>(rules_conf)) <span class="co"># show the support, confidence and lift of the top 6 rules</span>
<span class="co">#> lhs rhs support confidence lift </span>
<span class="co">#> 113 {rice,sugar} => {whole milk} 0.001220132 1 3.913649</span>
<span class="co">#> 258 {canned fish,hygiene articles} => {whole milk} 0.001118454 1 3.913649</span>
<span class="co">#> 1487 {root vegetables,butter,rice} => {whole milk} 0.001016777 1 3.913649</span>
<span class="co">#> 1646 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1 3.913649</span>
<span class="co">#> 1670 {butter,soft cheese,domestic eggs} => {whole milk} 0.001016777 1 3.913649</span>
<span class="co">#> 1699 {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1 5.168156</span>
rules_lift <-<span class="st"> </span><span class="kw">sort</span> (rules, <span class="dt">by=</span><span class="st">"lift"</span>, <span class="dt">decreasing=</span><span class="ot">TRUE</span>) <span class="co"># 'high-lift' rules.</span>
<span class="kw">inspect</span>(<span class="kw">head</span>(rules_lift)) <span class="co"># show the support, confidence and lift of the top 6 rules</span>
<span class="co">#> lhs rhs support confidence lift </span>
<span class="co">#> 53 {Instant food products,soda} => {hamburger meat} 0.001220 0.6315789 18.995</span>
<span class="co">#> 37 {soda,popcorn} => {salty snack} 0.001220 0.6315789 16.697</span>
<span class="co">#> 444 {flour,baking powder} => {sugar} 0.001016 0.5555556 16.408</span>
<span class="co">#> 327 {ham,processed cheese} => {white bread} 0.001931 0.6333333 15.045</span>
<span class="co">#> 55 {whole milk,Instant food products} => {hamburger meat} 0.001525 0.5000000 15.038</span>
<span class="co">#> 4807 {other vegetables,curd,yogurt,whipped/sour cream} => {cream cheese } 0.001016 0.5882353 14.834</span></code></pre></div>
<p>The rules with a confidence of 1 (see <code>rules_conf</code> above) imply that whenever the LHS items were purchased, the RHS item was also purchased 100% of the time.</p>
<p>A rule with a lift of 18 (see <code>rules_lift</code> above) implies that the items in the LHS and RHS are 18 times more likely to be purchased together than they would be if their purchases were unrelated.</p>
<h2>How To Control The Number Of Rules in Output?</h2>
<p>Adjust the <code>maxlen</code>, <code>supp</code> and <code>conf</code> arguments in the <code>apriori</code> function to control the number of rules generated. You will have to adjust these based on the sparseness of your data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rules <-<span class="st"> </span><span class="kw">apriori</span>(Groceries, <span class="dt">parameter =</span> <span class="kw">list</span> (<span class="dt">supp =</span> <span class="fl">0.001</span>, <span class="dt">conf =</span> <span class="fl">0.5</span>, <span class="dt">maxlen=</span><span class="dv">3</span>)) <span class="co"># maxlen = 3 limits the elements in a rule to 3</span></code></pre></div>
<ol style="list-style-type: decimal">
<li>To get <strong>‘stronger’</strong> rules, increase the value of the <em>‘conf’</em> parameter.</li>
<li>To get <strong>‘longer’</strong> rules, increase <em>‘maxlen’</em>.</li>
</ol>
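<p>To get a feel for how strongly the thresholds affect the output, you could mine the same data at a few confidence levels and compare the rule counts. This is only a sketch; the three levels in <code>confLevels</code> are arbitrary values chosen for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)
data(Groceries)

# Mine rules at several (arbitrary) confidence thresholds and count them
confLevels <- c(0.5, 0.7, 0.9)
ruleCounts <- sapply(confLevels, function(cf) {
  r <- apriori(Groceries,
               parameter = list(supp = 0.001, conf = cf, maxlen = 3),
               control = list(verbose = FALSE))
  length(r)   # number of rules that survive this threshold
})
names(ruleCounts) <- paste0("conf=", confLevels)
ruleCounts</code></pre></div>
<p>The counts can only shrink or stay the same as <code>conf</code> goes up, because every rule that passes a higher threshold also passes a lower one.</p>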
<h2>How To Remove Redundant Rules?</h2>
<p>Sometimes it is desirable to remove rules whose items already contain all the items of another, shorter rule. In the code below, <code>is.subset(rules, rules)</code> compares every rule with every other rule, and a column sum greater than 1 flags a rule that is a superset of at least one rule besides itself.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">subsetRules <-<span class="st"> </span><span class="kw">which</span>(<span class="kw">colSums</span>(<span class="kw">is.subset</span>(rules, rules)) ><span class="st"> </span><span class="dv">1</span>) <span class="co"># indices of rules that contain another rule's items</span>
<span class="kw">length</span>(subsetRules) <span class="co">#> 3913</span>
rules <-<span class="st"> </span>rules[-subsetRules] <span class="co"># remove the flagged rules (skip this step if none were flagged)</span></code></pre></div>
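<p>The <code>arules</code> package also provides <code>is.redundant()</code>, which could be used as an alternative. Note that it works with a different definition: as far as its documentation goes, a rule is flagged when a more general rule with the same RHS does at least as well on the chosen measure (confidence by default). So the result will generally differ from the subset-based filter above. A minimal sketch, assuming your installed version of <code>arules</code> includes this function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)
data(Groceries)

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5, maxlen = 3),
                 control = list(verbose = FALSE))

redundant   <- is.redundant(rules)   # logical vector, one entry per rule
rulesPruned <- rules[!redundant]     # keep only the non-redundant rules
c(before = length(rules), after = length(rulesPruned))</code></pre></div>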
<h2>How to Find Rules Related To Given Items?</h2>
<p>This can be achieved by modifying the <code>appearance</code> parameter in the <code>apriori()</code> function. For example,</p>
<h4>To find what factors influenced purchase of product X</h4>
<p>Suppose you want to find out what customers purchased along with ‘whole milk’. Restricting the RHS to ‘whole milk’ will help you understand the patterns that led to its purchase.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rules <-<span class="st"> </span><span class="kw">apriori</span> (<span class="dt">data=</span>Groceries, <span class="dt">parameter=</span><span class="kw">list</span> (<span class="dt">supp=</span><span class="fl">0.001</span>,<span class="dt">conf =</span> <span class="fl">0.08</span>), <span class="dt">appearance =</span> <span class="kw">list</span> (<span class="dt">default=</span><span class="st">"lhs"</span>,<span class="dt">rhs=</span><span class="st">"whole milk"</span>), <span class="dt">control =</span> <span class="kw">list</span> (<span class="dt">verbose=</span><span class="ot">FALSE</span>)) <span class="co"># get rules that lead to buying 'whole milk'</span>
rules_conf <-<span class="st"> </span><span class="kw">sort</span> (rules, <span class="dt">by=</span><span class="st">"confidence"</span>, <span class="dt">decreasing=</span><span class="ot">TRUE</span>) <span class="co"># 'high-confidence' rules.</span>
<span class="kw">inspect</span>(<span class="kw">head</span>(rules_conf))
<span class="co">#> lhs rhs support confidence lift </span>
<span class="co">#> 196 {rice,sugar} => {whole milk} 0.001220132 1 3.913649</span>
<span class="co">#> 323 {canned fish,hygiene articles} => {whole milk} 0.001118454 1 3.913649</span>
<span class="co">#> 1643 {root vegetables,butter,rice} => {whole milk} 0.001016777 1 3.913649</span>
<span class="co">#> 1705 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1 3.913649</span>
<span class="co">#> 1716 {butter,soft cheese,domestic eggs} => {whole milk} 0.001016777 1 3.913649</span>
<span class="co">#> 1985 {pip fruit,butter,hygiene articles} => {whole milk} 0.001016777 1 3.913649</span></code></pre></div>
<h4>To find out what products were purchased after/along with product X</h4>
<p>This is the case of finding out what <em>customers who bought ‘whole milk’ also bought</em>. In the rule, ‘whole milk’ is on the LHS (left-hand side).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">rules <-<span class="st"> </span><span class="kw">apriori</span> (<span class="dt">data=</span>Groceries, <span class="dt">parameter=</span><span class="kw">list</span> (<span class="dt">supp=</span><span class="fl">0.001</span>,<span class="dt">conf =</span> <span class="fl">0.15</span>,<span class="dt">minlen=</span><span class="dv">2</span>), <span class="dt">appearance =</span> <span class="kw">list</span>(<span class="dt">default=</span><span class="st">"rhs"</span>,<span class="dt">lhs=</span><span class="st">"whole milk"</span>), <span class="dt">control =</span> <span class="kw">list</span> (<span class="dt">verbose=</span><span class="ot">FALSE</span>)) <span class="co"># those who bought 'milk' also bought..</span>
rules_conf <-<span class="st"> </span><span class="kw">sort</span> (rules, <span class="dt">by=</span><span class="st">"confidence"</span>, <span class="dt">decreasing=</span><span class="ot">TRUE</span>) <span class="co"># 'high-confidence' rules.</span>
<span class="kw">inspect</span>(<span class="kw">head</span>(rules_conf))
<span class="co">#> lhs rhs support confidence lift </span>
<span class="co">#> 6 {whole milk} => {other vegetables} 0.07483477 0.2928770 1.5136341</span>
<span class="co">#> 5 {whole milk} => {rolls/buns} 0.05663447 0.2216474 1.2050318</span>
<span class="co">#> 4 {whole milk} => {yogurt} 0.05602440 0.2192598 1.5717351</span>
<span class="co">#> 2 {whole milk} => {root vegetables} 0.04890696 0.1914047 1.7560310</span>
<span class="co">#> 1 {whole milk} => {tropical fruit} 0.04229792 0.1655392 1.5775950</span>
<span class="co">#> 3 {whole milk} => {soda} 0.04006101 0.1567847 0.8991124</span></code></pre></div>
<p>One drawback is that <code>apriori()</code> only generates rules with a single item on the RHS, irrespective of the support, confidence or <code>minlen</code> parameters.</p>
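<p>If the rules have already been mined without an <code>appearance</code> restriction, another option is to filter them afterwards with <code>subset()</code>. The sketch below keeps the rules with ‘whole milk’ on the RHS; <code>allRules</code> and <code>milkRules</code> are names chosen only for this example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(arules)
data(Groceries)

allRules <- apriori(Groceries,
                    parameter = list(supp = 0.001, conf = 0.5),
                    control = list(verbose = FALSE))

# Keep only the rules that have 'whole milk' on the right-hand side
milkRules <- subset(allRules, subset = rhs %in% "whole milk")
inspect(head(sort(milkRules, by = "lift"), 3))   # top 3 of them by lift</code></pre></div>
<p>Filtering after the fact costs more mining time up front, but lets you slice the same set of rules by different items without re-running <code>apriori()</code>.</p>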
<h2>Caveat with using Lift</h2>
<p>The directionality of a rule is lost when <em>lift</em> is used. That is, the lift of the rule <em>A => B</em> and the lift of the rule <em>B => A</em> are the same. See the calculation below:</p>
<h4><em>A -> B</em></h4>
<ul>
<li><p>Support: <span class="math inline"><em>P</em>(<em>A</em>∩<em>B</em>)</span></p></li>
<li><p>Confidence: <span class="math inline">$\frac{P\left( A \cap B \right)}{P\left( A \right)}$</span></p></li>
<li><p>Expected Confidence: <span class="math inline"><em>P</em>(<em>B</em>)</span></p></li>
<li><p>Lift: <span class="math inline">$\frac{Confidence}{Expected\ Confidence}$</span> = <span class="math inline">$\frac{P\left( A \cap B \right)}{P\left( A \right).P\left( B \right)}$</span></p></li>
</ul>
<h4><em>B -> A</em></h4>
<ul>
<li><p>Support: <span class="math inline"><em>P</em>(<em>A</em>∩<em>B</em>)</span></p></li>
<li><p>Confidence: <span class="math inline">$\frac{P\left( A \cap B \right)}{P\left( B \right)}$</span></p></li>
<li><p>Expected Confidence: <span class="math inline"><em>P</em>(<em>A</em>)</span></p></li>
<li><p>Lift: <span class="math inline">$\frac{Confidence}{Expected\ Confidence}$</span> = <span class="math inline">$\frac{P\left( A \cap B \right)}{P\left( A \right).P\left( B \right)}$</span></p></li>
</ul>
<h4>Important Note</h4>
<p>For both rules <em>A -> B</em> and <em>B -> A</em>, the values of <em>lift</em> and <em>support</em> turn out to be the same. This means we cannot use lift to decide the direction of a recommendation. It can merely be used to club frequently bought items into groups.</p>
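<p>A quick numeric check of this symmetry in base R is sketched below. The three probabilities are made-up values chosen only for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical probabilities for two items A and B
p_a  <- 0.40   # P(A)
p_b  <- 0.10   # P(B)
p_ab <- 0.08   # P(A and B)

conf_ab <- p_ab / p_a      # confidence of A -> B: 0.2
conf_ba <- p_ab / p_b      # confidence of B -> A: 0.8

lift_ab <- conf_ab / p_b   # expected confidence of A -> B is P(B)
lift_ba <- conf_ba / p_a   # expected confidence of B -> A is P(A)

c(lift_ab = lift_ab, lift_ba = lift_ba)   # both equal 2 (up to rounding)</code></pre></div>
<p>The two confidences differ by a factor of four, yet both directions end up with the same lift.</p>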
<h2>Caveat with using Confidence</h2>
<p>The <em>confidence</em> of a rule can be a misleading measure while making product recommendations in real-world problems, especially <em>add-on</em> product recommendations. Let's consider the following data with 4 transactions involving iPhones and headsets:</p>
<ol style="list-style-type: decimal">
<li>iPhone, Headset</li>
<li>iPhone, Headset</li>
<li>iPhone</li>
<li>iPhone</li>
</ol>
<p>We can create 2 rules for these transactions as shown below:</p>
<ol style="list-style-type: decimal">
<li><em>iPhone -> Headset</em></li>
<li><em>Headset -> iPhone</em></li>
</ol>
<p>In the real world, it would be realistic to recommend a <em>headset</em> to a person who just bought an <em>iPhone</em>, and not the other way around. Imagine being recommended an iPhone when you have just finished purchasing a headset. Not nice!</p>
<p>While selecting rules from the <code>apriori</code> output, you might guess that the higher the confidence of a rule, the better the rule. But in cases like this, the headset -> iPhone rule has twice the confidence of the iPhone -> headset rule. Can you see why? The calculation below shows how.</p>
<h4>Confidence Calculation:</h4>
<p><strong>iPhone -> Headset</strong>: <span class="math inline">$\frac{P(iPhone\ \cap\ Headset)}{P(iPhone)}$</span> = 0.5 / 1 = <strong>0.5</strong></p>
<p><strong>Headset -> iPhone</strong>: <span class="math inline">$\frac{P(iPhone\ \cap\ Headset)}{P(Headset)}$</span> = 0.5 / 0.5 = <strong>1.0</strong></p>
<p>As you can see, the <em>headset -> iPhone</em> recommendation has the higher confidence, which is misleading and unrealistic. So, confidence should not be the <em>only measure</em> you use to make product recommendations.</p>
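<p>The same numbers can be reproduced in base R from the four transactions listed earlier. The helper <code>support()</code> below is written just for this sketch and is not an <code>arules</code> function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># The four transactions from the example
transactions <- list(
  c("iPhone", "Headset"),
  c("iPhone", "Headset"),
  c("iPhone"),
  c("iPhone")
)

# Hypothetical helper: fraction of transactions containing all given items
support <- function(items) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

both <- support(c("iPhone", "Headset"))              # 0.5

conf_iphone_headset <- both / support("iPhone")      # 0.5 / 1   = 0.5
conf_headset_iphone <- both / support("Headset")     # 0.5 / 0.5 = 1.0
lift <- both / (support("iPhone") * support("Headset"))   # 0.5 / (1 * 0.5) = 1</code></pre></div>
<p>Note that the lift is 1 in both directions: since every transaction contains an iPhone, buying a headset carries no extra information about buying an iPhone, even though the confidence of that rule is 1.</p>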
<p>So, you probably need to check more criteria, such as the price of the products, product types, etc., before recommending items, especially in cross-selling cases.</p>
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>