RogueWave

imsl.data_mining.FrequentItemSets

class FrequentItemSets

Aggregate frequent itemsets and compute association rules.

Notes

Function frequent_itemsets() and the methods of class FrequentItemSets perform the Apriori algorithm for association rule discovery. Association rules are statements of the form “if X, then Y”, given with some measure of confidence.

1. Application Fields of Apriori

The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or “market baskets”. The results help sellers learn relationships between different products they sell, supporting better marketing decisions. There are other applications for association rule discovery, such as the problem areas of text mining and bioinformatics. The Apriori algorithm ([R6]) is one of the most popular algorithms for association rule discovery in transactional data sets.

2. Application to a Single Set of Transactions

The Apriori algorithm consists of two stages when applied to a single transaction set. In the first and most critical stage, which is executed via function frequent_itemsets(), the Apriori algorithm mines the transactions for frequent itemsets. An itemset is frequent if it appears in more than a minimum number of transactions. The number of transactions containing an itemset is known as its “support”, and the minimum support (as a percentage of transactions) is a control parameter in the algorithm. The algorithm begins by finding the frequent single itemsets. Then the algorithm generates all two-item itemsets from the frequent single items and determines which among them are frequent. From the collection of frequent pairs, Apriori forms candidate three-item subsets and determines which are frequent, and so on. The algorithm stops, when either a maximum itemset size is reached, or when none of the candidate itemsets are frequent. In this way, the Apriori algorithm exploits the apriori property: for an itemset to be frequent, all of its proper subsets must also be frequent. At each step the problem is reduced to only the frequent subsets.

In the second stage, executed via method association_rules(), the algorithm generates association rules. These are of the form \(X \Rightarrow Y\) (read, “if X, then Y”), where Y and X are disjoint frequent itemsets. The confidence measure associated with the rule is defined as the proportion of transactions containing X that also contain Y. Denote the support of X (the number of transactions containing X) as \(S_X\), and the support of \(Z=X \cup Y\) as \(S_Z\). The confidence of the rule \(X \Rightarrow Y\) is the ratio \(S_Z/S_X\). Note that the confidence ratio is the conditional probability

\[P[X|Y] =\frac{P[XY]}{P[X]}\]

where \(P[XY]\) denotes the probability of both X and Y. The probability of an itemset X is estimated by \(S_X/N\), where N is the total number of transactions.

Another measure of the strength of the association is known as the lift, which is the ratio \((S_ZN)/(S_XS_Y)\). Lift values close to 1.0 suggest the sets are independent, and that they occur together by chance. Large lift values indicate a strong association. A minimum confidence threshold and a lift threshold can be specified.

3. Application to Blocks of Transactions (Aggregation)

Since the Apriori algorithm can operate on subsets of data, it can be used with distributed data or data sets larger than physical memory. Additionally, it may be useful in parallel computing environments where nodes can be programmed to calculate intermediate results in parallel.

For each data set or block of transactions, call function frequent_itemsets() to obtain the frequent itemsets for each block. The same parameter settings, such as minimum support percentage, must be used in each call.

Then, call method union() to obtain the union of all frequent itemsets. The resulting sets serve as the “candidate” itemsets for the global set of transactions.

An itemset which is frequent in one transaction set may or may not be frequent in the larger collection. To find the itemsets that are frequent over the entire set of transactions, use method frequency() to perform another pass through the individual blocks, this time counting the occurrence of each itemset in each transaction set. This step can be done in parallel.

The next step is then to sum up the individual counts before filtering for the frequent itemsets. This is achieved by adding the frequency vectors returned by method frequency(). After this step, the frequencies of each candidate itemset over all of the transactions are known.

In the final step, call method update() to determine all itemsets that meet the threshold to be considered “frequent”. This step completes the aggregation of the transactional data sets.

Once the frequent itemsets are known, the strong association rules can be found using method association_rules().

The aggregation method is due to [R8] and is also summarized and compared with other approaches in [R7].

References

[R6](1, 2) Agrawal, R., and Srikant, R. (1994), Fast algorithms for mining association rules, Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, August 29 - September 1, 1994.
[R7](1, 2) Rajamaran, A., and Ullman, J. D. (2011), Mining of Massive Datasets, Cambridge University Press, Cambridge, UK.
[R8](1, 2) Savasere, A., Omiecinski, E., and Navathe, S. (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, 1995.

Examples

Example 1:

This example applies the Apriori algorithm to a data set consisting of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, giving a minimum required support of 15 transactions. The frequent itemsets and strong association rules are printed.

>>> import imsl.data_mining.apriori as apriori
>>> max_num_products = 5
>>> max_set_size = 10
>>> min_pct_support = 0.30
>>> confidence = 0.8
>>> lift = 2.0
>>> x = [[1,  3], [1,  2], [1,  1], [2,  1], [2,  2], [2,  4], [2,  5],
...     [3,  3], [4,  4], [4,  3], [4,  5], [4,  1], [5,  5], [6,  1],
...     [6,  2], [6,  3], [7,  5], [7,  3], [7,  2], [8,  3], [8,  4],
...     [8,  1], [8,  5], [8,  2], [9,  4], [10, 5], [10, 3], [11, 2],
...     [11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
...     [15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
...     [18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
...     [21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
...     [23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
...     [26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
...     [27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
...     [29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
...     [32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
...     [34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
...     [36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
...     [39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
...     [41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
...     [43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
...     [45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
...     [46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
...     [49, 4], [49, 3], [50, 3], [50, 4]]
>>> itemsets = apriori.frequent_itemsets(max_num_products, x,
...                                      max_set_size=max_set_size,
...                                      min_pct_support=min_pct_support)
>>> # Print frequent itemsets
>>> print(itemsets)  
Frequent Itemsets (Out of 50  Transactions):
Size   Support  Itemset
  1        27   { 1 }
  1        30   { 2 }
  1        33   { 3 }
  1        27   { 4 }
  1        27   { 5 }
  2        20   { 1  2 }
  2        22   { 1  3 }
  2        16   { 1  4 }
  2        19   { 1  5 }
  2        22   { 2  3 }
  2        16   { 2  4 }
  2        15   { 2  5 }
  2        16   { 3  4 }
  2        19   { 3  5 }
  2        17   { 4  5 }
  3        17   { 1  2  3 }
  3        15   { 1  3  5 }
>>> # Print association rules
>>> assoc_rules = itemsets.association_rules(confidence, lift)
>>> for rule in assoc_rules:
...     print(rule)  
Association Rule (itemset X implies itemset Y):
X = {1} ==> Y = {3}
  supp(X)=27, supp(Y)=33, supp(X U Y)=22
  conf=0.81, lift=1.23
Association Rule (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
  supp(X)=20, supp(Y)=33, supp(X U Y)=17
  conf=0.85, lift=1.29

Example 2:

This example shows how to apply the Apriori algorithm to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall. The frequent itemsets and strong association rules are printed.

>>> import imsl.data_mining.apriori as apriori
>>> max_num_products = 5
>>> max_set_size = 4
>>> min_pct_support = 0.30
>>> confidence = 0.8
>>> lift = 2.0
>>> x1 = [[1,  3], [1,  2], [1,  1], [2,  1], [2,  2], [2,  4], [2,  5],
...      [3,  3], [4,  4], [4,  3], [4,  5], [4,  1], [5,  5], [6,  1],
...      [6,  2], [6,  3], [7,  5], [7,  3], [7,  2], [8,  3], [8,  4],
...      [8,  1], [8,  5], [8,  2], [9,  4], [10, 5], [10, 3], [11, 2],
...      [11, 3], [12, 4], [13, 4], [14, 2], [14, 3], [14, 1], [15, 3],
...      [15, 5], [15, 1], [16, 2], [17, 3], [17, 5], [17, 1], [18, 5],
...      [18, 1], [18, 2], [18, 3], [19, 2], [20, 4], [21, 1], [21, 4],
...      [21, 2], [21, 5], [22, 5], [22, 4], [23, 2], [23, 5], [23, 3],
...      [23, 1], [23, 4], [24, 3], [24, 1], [24, 5], [25, 3], [25, 5],
...      [26, 1], [26, 4], [26, 2], [26, 3], [27, 2], [27, 3], [27, 1],
...      [27, 5], [28, 5], [28, 3], [28, 4], [28, 1], [28, 2], [29, 4],
...      [29, 5], [29, 2], [30, 2], [30, 4], [30, 3], [31, 2], [32, 5],
...      [32, 1], [32, 4], [33, 4], [33, 1], [33, 5], [33, 3], [33, 2],
...      [34, 3], [35, 5], [35, 3], [36, 3], [36, 5], [36, 4], [36, 1],
...      [36, 2], [37, 1], [37, 3], [37, 2], [38, 4], [38, 2], [38, 3],
...      [39, 3], [39, 2], [39, 1], [40, 2], [40, 1], [41, 3], [41, 5],
...      [41, 1], [41, 4], [41, 2], [42, 5], [42, 1], [42, 4], [43, 3],
...      [43, 2], [43, 4], [44, 4], [44, 5], [44, 2], [44, 3], [44, 1],
...      [45, 4], [45, 5], [45, 3], [45, 2], [45, 1], [46, 2], [46, 4],
...      [46, 5], [46, 3], [46, 1], [47, 4], [47, 5], [48, 2], [49, 1],
...      [49, 4], [49, 3], [50, 3], [50, 4]]
>>> x2 = [[1,  2], [1,  1], [1,  4], [1,  3], [2,  2], [2,  5], [2,  3],
...      [2,  1], [2,  4], [3,  5], [3,  4], [4,  2], [5,  4], [5,  2],
...      [5,  3], [5,  5], [6,  3], [6,  5], [7,  2], [7,  5], [7,  4],
...      [7,  1], [7,  3], [8,  2], [9,  2], [9,  4], [10, 4], [10, 2],
...      [11, 4], [11, 1], [12, 3], [12, 1], [12, 5], [12, 2], [13, 2],
...      [14, 3], [14, 4], [14, 2], [15, 2], [16, 5], [16, 2], [16, 4],
...      [17, 1], [18, 2], [18, 3], [18, 4], [19, 3], [19, 1], [19, 2],
...      [19, 4], [20, 5], [20, 1], [21, 5], [21, 4], [21, 1], [21, 3],
...      [22, 4], [22, 1], [22, 5], [23, 1], [23, 2], [24, 4], [25, 4],
...      [25, 3], [26, 5], [26, 2], [26, 3], [26, 4], [26, 1], [27, 2],
...      [27, 1], [27, 5], [27, 3], [28, 1], [28, 2], [28, 3], [28, 4],
...      [29, 5], [29, 2], [29, 1], [30, 5], [30, 3], [30, 2], [30, 4],
...      [31, 4], [31, 1], [32, 1], [32, 2], [32, 3], [32, 4], [32, 5],
...      [33, 3], [33, 2], [33, 4], [33, 5], [33, 1], [34, 3], [34, 4],
...      [34, 5], [34, 2], [35, 2], [35, 3], [36, 3], [36, 5], [36, 4],
...      [37, 1], [37, 4], [37, 2], [37, 3], [37, 5], [38, 5], [38, 3],
...      [38, 1], [38, 2], [39, 2], [39, 5], [40, 4], [40, 2], [41, 4],
...      [42, 4], [43, 5], [43, 4], [44, 5], [44, 4], [44, 3], [44, 2],
...      [44, 1], [45, 1], [45, 2], [45, 3], [45, 5], [45, 4], [46, 3],
...      [46, 4], [47, 4], [47, 5], [47, 2], [47, 3], [48, 5], [48, 3],
...      [48, 2], [48, 1], [48, 4], [49, 4], [49, 5], [50, 4], [50, 1]]
>>> # Find frequent itemsets in x1 and x2.
>>> itemsets1 = apriori.frequent_itemsets(max_num_products, x1,
...                                       max_set_size=max_set_size,
...                                       min_pct_support=min_pct_support)
>>> itemsets2 = apriori.frequent_itemsets(max_num_products, x2,
...                                       max_set_size=max_set_size,
...                                       min_pct_support=min_pct_support)
>>> # Take the union of itemsets1 and itemsets2.
>>> cand_itemsets = itemsets1.union(itemsets2)
>>> # Count the frequencies of each candidate itemset in each data set.
>>> freq1 = cand_itemsets.frequency(x1)
>>> freq2 = cand_itemsets.frequency(x2)
>>> # Sum the frequencies.
>>> freq = freq1 + freq2
>>> # Determine which of the candidate itemsets are frequent.
>>> itemsets = cand_itemsets.update(freq)
>>> # Print the aggregated frequent itemsets.
>>> print(itemsets)  
Frequent Itemsets (Out of 100  Transactions):
Size   Support  Itemset
  1        51   { 1 }
  1        63   { 2 }
  1        60   { 3 }
  1        63   { 4 }
  1        54   { 5 }
  2        37   { 1  2 }
  2        38   { 1  3 }
  2        33   { 1  4 }
  2        35   { 1  5 }
  2        44   { 2  3 }
  2        38   { 2  4 }
  2        34   { 2  5 }
  2        38   { 3  4 }
  2        38   { 3  5 }
  2        37   { 4  5 }
  3        32   { 1  2  3 }
  3        31   { 2  3  4 }
>>> # Generate and print the strong association rules.
>>> assoc_rules = itemsets.association_rules(confidence, lift)
>>> for rule in assoc_rules:
...     print(rule)  
Association Rule (itemset X implies itemset Y):
X = {1 2} ==> Y = {3}
  supp(X)=37, supp(Y)=60, supp(X U Y)=32
  conf=0.86, lift=1.44
Association Rule (itemset X implies itemset Y):
X = {1 3} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=32
  conf=0.84, lift=1.34
Association Rule (itemset X implies itemset Y):
X = {2 4} ==> Y = {3}
  supp(X)=38, supp(Y)=60, supp(X U Y)=31
  conf=0.82, lift=1.36
Association Rule (itemset X implies itemset Y):
X = {3 4} ==> Y = {2}
  supp(X)=38, supp(Y)=63, supp(X U Y)=31
  conf=0.82, lift=1.29

Attributes

item_sets Return the itemsets.
max_num_products Return the maximum number of products.
max_set_size Return the maximum itemset size.
min_pct_support Return the minimum percentage of transaction support.
n_trans Return the number of transactions used to construct the itemsets.
support Return the support for each itemset.

Methods

association_rules(confidence, lift) Compute strong association rules among itemsets.
frequency(x) Count the frequency of each itemset in a transaction data set.
union(*itemsets) Compute the union of itemsets from different transaction sets.
update(frequencies) Update the set of frequent itemsets in the candidate itemsets.