[PDF] Association Rule - Folie 1 PDF DM07-Association-Analysis-HWS2019-V1.pdf

23 oct 2019 · Association Analysis Heiko Paulheim What is Association Analysis? 2 Frequent Itemset Correlation Analysis in Python • e g , using

[PDF] Classification based on Associations (CBA) - a performance analysis

The same apriori implementation is used, for example, in the popular arules R package [4] The CBA-CB phase for both M1 and M2 was implemented in Python,

[PDF] Python Implementation of Interpretable Decision - CEUR-WSorg

PyIDS – Python Implementation of Interpretable IDS is a classification algorithm based on association rules An association rule Filip, J , Kliegr, T : Classification based on associations (CBA)-a performance analysis In: RuleML+ RR

[PDF] Analysis - Association Rule LEARNING - Zax Rosenberg

Is an item in our “basket” or not? · Often a list of lists in Python, if data fits in memory · Alternatively, preprocessing binarize in scikit-learn

[PDF] Isaiah Hull - DataCamp

MARKET BASKET ANALYSIS IN PYTHON Preparing the data 1 Generate the rules Use Apriori algorithm and association rules 2 Convert antecedents and

[PDF] Market Basket Analysis in Retail - UPCommons

less distance to the centroid, then we found the association rules on that shop and Those tools used in project are Python [9] and the Pandas library for the

Application of Improved Eclat Algorithm in Students Evaluation of

The association algorithm is used to analyze the rules that satisfy a certain Python language and applies it to students' evaluation of data mining This time

Data Mining I

Association Analysis

Heiko Paulheim

10/23/19 Heiko Paulheim 2 Outline

1.What is Association Analysis?

2.Frequent Itemset Generation

3.Rule Generation

4.Interestingness Measures

5.Handling Continuous and Categorical Attributes

10/23/19 Heiko Paulheim 3 Association Analysis

•First algorithms developed in the early 90s at IBM by Agrawal & Srikant •Motivation -Availability of barcode cash registers

10/23/19 Heiko Paulheim 4 Association Analysis

•initially used for Market Basket Analysis -to find how items purchased by customers are related •later extended to more complex data structures -sequential patterns (see Data Mining II) -subgraph patterns •and other application domains -life science -social science -web usage mining

10/23/19 Heiko Paulheim 5 Simple Approaches

•To find out if two items x and y are bought together, we can compute their correlation •E.g., Pearson's correlation coefficient: •Numerical coding: -1: item was bought -0: item was not bought • : average of x (i.e., how often x was bought)∑(xi-x)(yi-y)

10/23/19 Heiko Paulheim 6 Correlation Analysis in RapidMiner

10/23/19 Heiko Paulheim 7 Correlation Analysis in Python

•e.g., using Pandas: import seaborn as sns corr = dataframe.corr() sns.heatmap(corr)

10/23/19 Heiko Paulheim 8 •Given a set of transactions, find rules that will predict the

occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactionsExamples of Association Rules {Diaper} r {Beer}, {Milk, Brea á d á } r {Eggs,Coke}, {Beer, Brea á d á } r {Milk},TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

→ denotes co-occurence, not causality!Association Analysis

10/23/19 Heiko Paulheim 9 Correlation vs. Causality

http://xkcd.com/552/

10/23/19 Heiko Paulheim 10 Definition: Frequent Itemset

•Itemset -A collection of one or more items •Example: {Milk, Bread, Diaper} -k-itemset •An itemset that contains k items •Support (s) -Frequency of occurrence of an itemset •e.g. s({Milk, Bread, Diaper}) = 2/5 •Frequent Itemset -An itemset w/ support ≥ a minimum support threshold (minsup)

10/23/19 Heiko Paulheim 11 Definition: Association Rule

•Association Rule -An implication expression of the form

X → Y, where X and Y are itemsets

•Interpretation: when X occurs,

Y occurs with a certain probability

•More formally, it's a conditional probability -P(Y|X) - given X, what is the probability of Y? •Known as confidence (c) -e.g., for {Bread, Milk} → {Diaper}, the probability is 2/3TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

10/23/19 Heiko Paulheim 12 Definition: Evaluation Metrics

•Given the rule {Milk, Diaper} → {Beer} •Support: -Fraction of total transactions which contain both X and Y •Confidence: -Fraction of transactions containing X which also contain YTID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

4.05 2 |T| 67.03
2 )Diaper,Milk( c

10/23/19 Heiko Paulheim 13 Association Rule Mining Task

•Given a set of transactions T, the goal of association rule mining is to find all rules having -support ≥ minsup threshold -confidence ≥ minconf threshold •minsup and minconf are provided by the user •Brute-force approach: -List all possible association rules -Compute the support and confidence for each rule -Remove rules that fail the minsup and minconf thresholds → Computationally prohibitive due to large number of candidates!

10/23/19 Heiko Paulheim 14 Mining Association Rules

Examples of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67{Milk, Beer} → {Diaper} (s=0.4, c=1.0){Diaper, Beer} → {Milk} (s=0.4, c=0.67){Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper}→ {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5)

s(X→Y):=∣X∪Y∣ ∣T∣Observations •All the above rules are partitions of the same itemset, i.e. {Milk,

Diaper, Beer}

•Rules originating from the same itemset have identical support -but can have different confidence → we may decouple the support and confidence requirements

10/23/19 Heiko Paulheim 15 Apriori Algorithm: Basic Idea

•Two-step approach •First: Frequent Itemset Generation -Generate all itemsets whose support ≥ minsup •Second: Rule Generation -Generate high confidence rules from each frequent itemset -where each rule is a binary partitioning of a frequent itemset •However: Frequent itemset generation is still computationally expensive....

10/23/19 Heiko Paulheim 16 Frequent Itemset Generation

Given d items, there are

2d candidate itemsets!null

ABACADAEBCBDBECDCEDEABCDE

ABCABDABEACDACEADEBCDBCEBDECDE

ABCDABCEABDEACDEBCDE

ABCDE

10/23/19 Heiko Paulheim 17 https://www.scrapehero.com/number-of-products-on-amazon-april-2019/Brute-force Approach

•Example: -Amazon sells 120 million products (Amazon.com, as of April 2019) •That is 2120000000 possible itemsets •As a number: -3.017... × 1036,123,599 -That is: a number with 36 million digits! -Comparison: the largest supercomputer has a capacity of 40 Petabytes (=3.2x1017 bits) •However: -most itemsets will not be important at all -e.g., a book on Chinese calligraphy and an iPhone cover bought together -thus, smarter algorithms should be possible

10/23/19 Heiko Paulheim 18 Brute-force Approach

•Each itemset in the lattice is a candidate frequent itemset •Count the support of each candidate by scanning the database •Match each transaction against every candidate •Complexity ~ O(NMw) → Expensive since M = 2d •A smarter algorithm is required

10/23/19 Heiko Paulheim 19 Anti-Monotonicity of Support

•What happens when an itemset gets larger? •s({Bread}) = 0.8 -s({Bread,Milk}) = 0.6 -s({Bread,Milk,Diaper}) = 0.4 •s({Milk}) = 0.8 -s({Milk,Diaper}) = 0.6 -s({Milk,Diaper,Beer}) = 0.4 •There is a pattern here!TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

10/23/19 Heiko Paulheim 20 Anti-Monotonicity of Support

•There is a pattern here! -It is called anti-monitonicity of support •If X is a subset of Y -s(Y) is at most as large as s(X) •Consequence for frequent itemset search (aka Apriori principle): -If Y is frequent, X also has to be frequent -i.e.: all subsets of frequent itemsets are frequentTID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

)()()(:,YsXsYXYXmBS

10/23/19 Heiko Paulheim 21 Foun á d á to

InfrequentIllustrating the Apriori Principle

Prune á d á

supersetsnull

ABACADAEBCBDBECDCEDEABCDE

ABCABDABEACDACEADEBCDBCEBDECDE

ABCDABCEABDEACDEBCDE

ABCDEnull

ABACADAEBCBDBECDCEDEABCDE

ABCABDABEACDACEADEBCDBCEBDECDE

ABCDABCEABDEACDEBCDE

ABCDE

10/23/19 Heiko Paulheim 22 The Apriori Algorithm

1.Start at k=1

2.Generate frequent itemsets of length k=1

3.Repeat until no new frequent itemsets are identified

1.Generate length (k+1) candidate itemsets from

length k frequent itemsets; increase k

2.Prune candidate itemsets that cannot be

frequent because they contain subsets of length k that are infrequent (Apriori Principle)

3.Count the support of each remaining candidate

by scanning the DB

4.Eliminate candidates that are infrequent, leaving

only those that are frequent

10/23/19 Heiko Paulheim 23 Items (1-itemsets)

Pairs (2-itemsets)

Triplets

(3-itemsets)Minimum Support = 3

No need to generate

quotesdbs_dbs10.pdfusesText_16

[PDF] [PDF] Association Rule - Folie 1