aggr_apriori

IMSL C Stat Library

aggr_apriori

Computes the frequent itemsets in a transaction set using aggregation.

Synopsis

#include <imsls.h>

void imsls_f_aggr_apriori (Imsls_keyword step, …, 0)

The type double function is imsls_d_aggr_apriori.

Required Arguments

Imsls_keyword step (Input)
One optional argument must be supplied to indicate which calculation step is to be performed. step is the name of the optional argument, defined as follows:

step	Description
IMSLS_FREQUENT_ITEMSETS	Compute the frequent itemsets.
IMSLS_UNION	Compute the union of two itemsets.
IMSLS_COUNT	Count the occurrence of each itemset in the transaction data set.
IMSLS_SUM	Add the counts of two itemsets.
IMSLS_UPDATE_FREQ_ITEMSETS	Update the set of frequent itemsets.
IMSLS_ASSOCIATION_RULES	Compute the strong association rules.

Synopsis with Optional Arguments

#include <imsls.h>

void imsls_f_aggr_apriori (

IMSLS_FREQUENT_ITEMSETS, int n, int x[], int max_num_products, int max_set_size,

double min_pct_support, Imsls_f_apriori_itemsets **itemsets, or

IMSLS_UNION, Imsls_f_apriori_itemsets *itemsets1,

Imsls_f_apriori_itemsets *itemsets2, Imsls_f_apriori_itemsets **cand_itemsets, or

IMSLS_COUNT, Imsls_f_apriori_itemsets *cand_itemsets, int n, int x[], int **freq, or

IMSLS_SUM, int n_itemsets,int prev_freq1[], int prev_freq2[], int freq[],or

IMSLS_UPDATE_FREQ_ITEMSETS, Imsls_f_apriori_itemsets *cand_itemsets, int n_itemsets, int freq[], Imsls_f_apriori_itemsets **itemsets, or

IMSLS_ASSOCIATION_RULES, Imsls_f_apriori_itemsets *itemsets,float confidence, float lift, Imsls_f_association_rules **assoc_rules,

Optional Arguments

IMSLS_FREQUENT_ITEMSETS, int n, int x[], int max_num_products, int max_set_size, double min_pct_support, Imsls_f_apriori_itemsets **itemsets (Input/Output)

Computes the frequent itemsets in a transaction set.

int n (Input)
Number of (transaction, item) pairs in x.

int x[] (Input)
Array of size n x 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there is no matching of transaction ids between different data sets.

int max_num_products (Input)
Maximum number of items or products that may be present in the aggregation of all transactions.

int max_set_size (Input)
Maximum size of an itemset. Only frequent itemsets with max_set_size or fewer items are considered in the analysis.

double min_pct_support (Input)
Minimum percentage of transactions in which an item or itemset must be present to be considered frequent. min_pct_support must be in the interval [0,1].

Imsls_f_apriori_itemsets **itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets in the transaction set x. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.

IMSLS_UNION, Imsls_f_apriori_itemsets *itemsets1, Imsls_f_apriori_itemsets *itemsets2, Imsls_f_apriori_itemsets **cand_itemsets (Input/Output)

Computes the union of two itemsets.

Imsls_f_apriori_itemsets *itemsets1 (Input)
Pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets for the union.

Imsls_f_apriori_itemsets *itemsets2 (Input)
Pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets for the union.

Imsls_f_apriori_itemsets **cand_itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the union of two itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.

IMSLS_COUNT, Imsls_f_apriori_itemsets *cand_itemsets, int n, int x[], int **freq (Input/Output)

Counts the frequency of each itemset in a transaction data set.

Imsls_f_apriori_itemsets cand_itemsets (Input)
Candidate itemsets and the corresponding number of transactions.

int n (Input)
Number of transaction/item pairs in x.

int x[] (Input)
Array of size n x 2, each row of which represents a transaction id and item id pair. The algorithm assumes that an individual transaction is complete within a single dataset. That is, there are no matching of transaction ids between different data sets.

int **freq (Output)
Address of an internally allocated array of length cand_itemsets->n_itemsets containing the number of occurrences of each itemset in x. To release this space, use imsls_free.

IMSLS_SUM, int n_itemsets, int prev_freq1[],int prev_freq2[], int **freq (Input/Output)

Sum up the itemset frequencies in prev_freq1 and prev_freq2 and return in freq.

int n_itemsets (Input)
Length of prev_freq1 and prev_freq2 which corresponds to the number of itemsets.

int prev_freq1[] (Input)
Array of length n_itemsets containing the itemset frequencies counted over one or more blocks of transaction data.

int prev_freq2[] (Input)
Array of length n_itemsets containing the itemset frequencies counted over a second set of blocks of transaction data.

int **freq (Output)
Array of length n_itemsets containing the sum of the frequencies.

IMSLS_UPDATE_FREQ_ITEMSETS, Imsls_f_apriori_itemsets *cand_itemsets, int n_itemsets, int freq[], Imsls_f_apriori_itemsets **itemsets (Input/Output)

Updates the set of frequent items.

Imsls_f_apriori_itemsets *cand_itemsets (Input)
Candidate itemsets and the corresponding number of transactions.

int n_itemsets (Input)
Length of freq.

int freq[] (Input)
Array of length n_itemsets containing the frequencies for each itemset in cand_itemsets.

Imsls_f_apriori_itemsets **itemsets (Output)
Address of a pointer to an Imsls_f_apriori_itemsets data structure containing the frequent itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_apriori_itemsets.

IMSLS_ASSOCIATION_RULES, Imsls_f_apriori_itemsets *itemsets, float confidence, float lift, Imsls_f_association_rules **assoc_rules (Input/Output)

Computes the strong association rules among itemsets.

Imsls_f_apriori_itemsets *itemsets (Input)
A pointer to an Imsls_f_itemsets data structure containing the itemsets.

float confidence (Input)
The minimum confidence used to determine the strong association rules. confidence must be in the interval [0,1]. lift is the other criterion that determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule is considered “strong.”

float lift (Input)
The minimum lift used to determine the strong association rules. lift must be non-negative. confidence is the other criterion which determines whether an association is “strong.” If either criterion, confidence or lift is exceeded, the association rule will be considered “strong.”

Imsls_f_association_rules **assoc_rules (Output)
Address of a pointer to an Imsls_f_association_rules data structure containing the strong association rules among the itemsets. If no value can be computed, then NULL is returned. To release this space, use imsls_free_association_rules.

Description

The function imsls_f_aggr_apriori performs the Apriori algorithm for association rule discovery. Association rules are statements of the form, "if X, then Y", given with some measure of confidence. The main application for association rule discovery is market basket analysis, where X and Y are products or groups of products, and the occurrences are individual transactions, or "market baskets." The results help sellers learn relationships between the different products they sell, supporting better marketing decisions. There are other applications for association rule discovery, such as the problem areas of text mining and bioinformatics. The Apriori algorithm (Agrawal and Srikant, 1994) is one of the most popular algorithms for association rule discovery in transactional datasets. For a full description of the Apriori algorithm, see imsls_f_apriori.

The interface to the function imsls_f_aggr_apriori is designed to complete the analysis over a series of steps, with each step requiring a call to the function. With this design, Apriori can be applied to separate blocks of transactions. For each dataset or block, call imsls_f_aggr_apriori with IMSLS_FREQUENT_ITEMSETS to obtain the frequent itemsets from each block. The same parameter settings, such as minimum support percentage, must be used in each separate call. Then, call imsls_f_aggr_apriori with the keyword IMSLS_UNION sequentially to obtain the union of all the frequent itemsets. The resulting set serves as the "candidate" itemsets for the global set of transactions.

An itemset which is frequent in one transaction set may or may not be frequent in the larger collection. To find the itemsets that are frequent over the entire set of transactions, imsls_f_aggr_apriori performs another pass through the individual blocks, this time counting the occurrences of each of the itemsets in each of the transaction sets. This step can be done in parallel, using keyword IMSLS_COUNT. The next step is then to sum up the individual counts before filtering for the frequent itemsets. This is achieved with the keyword IMSLS_SUM, applied successively to pairs of previous counts. After this step, the frequencies of each itemset over all of the transactions are known and it remains to be seen if any meet the threshold to be considered "frequent". The final step in determining the frequent itemsets is IMSLS_UPDATE_FREQ_ITEMSETS. Once the frequent itemsets are known, the strong association rules can be found using the step, IMSLS_ASSOCIATION_RULES , although this is not a special step in the aggregation. The method is due to Savasere, Omiecinski, and Navathe (1995) and is also summarized and compared with other approaches in Rajaraman and Ullman (2011).

Since imsls_f_aggr_apriori can operate on subsets of data, it can be used when physical memory cannot hold the entire data set. Additionally, this design may be useful in parallel computing environments where nodes can be programmed to calculate intermediate results in parallel.

Data Structures

The data structures used by imsls_f_aggr_apriori are described below. (For imsls_d_aggr_apriori, the structure names are Imsls_d_apriori_itemsets, Imsls_d_association_rules, and Imsls_d_rule_components where type float becomes double).

Structure definitions are provided for informational purposes and may be subject to change.

Table 13.39 — Imsls_f_apriori_itemsets
Field	Description
int n_itemsets	Length of array itemsets containing the Imsls_apriori_items structures.
Imsls_apriori_items *itemsets	Array of Imsls_apriori_items structures containing the set of frequent items and the support for that set.
int n_trans	Number of transactions.
int max_num_products	Maximum number of products.
int max_set_size	Maximum itemset size.
double min_pct_support	Minimum percentage of transactions.

Table 13.40 — Imsls_apriori_items
Field	Description
int n_items	Length of items.
int *items	Array containing the set of frequent items.
int support	Number of transactions in which the item appears.

Table 13.41 — Imsls_f_association_rules
Field	Description
int n_rules	Length of array rules containing the Imsls_f_rule_components structures.
Imsls_f_rule_components *rules	Array containing the association rules.

Table 13.42 — Imsls_f_rule_components
Field	Description
int n_x	Length of x.
int *x	Array containing the X components of the association rule.
int n_y	Length of y.
int *y	Array containing the Y components of the association rule.
int support[3]	Support for Z, X and Y components of the association rule.
float confidence	Confidence of the association rule.
float lift	Lift of the association rule.

Example

This example shows how to apply Apriori to separate blocks of data and combine results. The data are two separate blocks of 50 transactions involving five different product IDs. The minimum support percentage is set to 0.30, providing a minimum required support of 30 transactions overall.

#include <imsls.h>

#define N1 144

#define N2 147

int main() {

int i;

int max_num_products = 5, max_set_size = 4;

float confidence = 0.8, lift = 2.0;

double min_pct_support = 0.30;

int x1[N1][2] = {

{1, 3}, {1, 2}, {1, 1}, {2, 1}, {2, 2}, {2, 4}, {2, 5},

{3, 3}, {4, 4}, {4, 3}, {4, 5}, {4, 1}, {5, 5}, {6, 1},

{6, 2}, {6, 3}, {7, 5}, {7, 3}, {7, 2}, {8, 3}, {8, 4},

{8, 1}, {8, 5}, {8, 2}, {9, 4}, {10, 5}, {10, 3}, {11, 2},

{11, 3}, {12, 4}, {13, 4}, {14, 2}, {14, 3}, {14, 1}, {15, 3},

{15, 5}, {15, 1}, {16, 2}, {17, 3}, {17, 5}, {17, 1}, {18, 5},

{18, 1}, {18, 2}, {18, 3}, {19, 2}, {20, 4}, {21, 1}, {21, 4},

{21, 2}, {21, 5}, {22, 5}, {22, 4}, {23, 2}, {23, 5}, {23, 3},

{23, 1}, {23, 4}, {24, 3}, {24, 1}, {24, 5}, {25, 3}, {25, 5},

{26, 1}, {26, 4}, {26, 2}, {26, 3}, {27, 2}, {27, 3}, {27, 1},

{27, 5}, {28, 5}, {28, 3}, {28, 4}, {28, 1}, {28, 2}, {29, 4},

{29, 5}, {29, 2}, {30, 2}, {30, 4}, {30, 3}, {31, 2}, {32, 5},

{32, 1}, {32, 4}, {33, 4}, {33, 1}, {33, 5}, {33, 3}, {33, 2},

{34, 3}, {35, 5}, {35, 3}, {36, 3}, {36, 5}, {36, 4}, {36, 1},

{36, 2}, {37, 1}, {37, 3}, {37, 2}, {38, 4}, {38, 2}, {38, 3},

{39, 3}, {39, 2}, {39, 1}, {40, 2}, {40, 1}, {41, 3}, {41, 5},

{41, 1}, {41, 4}, {41, 2}, {42, 5}, {42, 1}, {42, 4}, {43, 3},

{43, 2}, {43, 4}, {44, 4}, {44, 5}, {44, 2}, {44, 3}, {44, 1},

{45, 4}, {45, 5}, {45, 3}, {45, 2}, {45, 1}, {46, 2}, {46, 4},

{46, 5}, {46, 3}, {46, 1}, {47, 4}, {47, 5}, {48, 2}, {49, 1},

{49, 4}, {49, 3}, {50, 3}, {50, 4}

};

int x2[N2][2] = {

{1, 2}, {1, 1}, {1, 4}, {1, 3}, {2, 2}, {2, 5}, {2, 3},

{2, 1}, {2, 4}, {3, 5}, {3, 4}, {4, 2}, {5, 4}, {5, 2},

{5, 3}, {5, 5}, {6, 3}, {6, 5}, {7, 2}, {7, 5}, {7, 4},

{7, 1}, {7, 3}, {8, 2}, {9, 2}, {9, 4}, {10, 4}, {10, 2},

{11, 4}, {11, 1}, {12, 3}, {12, 1}, {12, 5}, {12, 2}, {13, 2},

{14, 3}, {14, 4}, {14, 2}, {15, 2}, {16, 5}, {16, 2}, {16, 4},

{17, 1}, {18, 2}, {18, 3}, {18, 4}, {19, 3}, {19, 1}, {19, 2},

{19, 4}, {20, 5}, {20, 1}, {21, 5}, {21, 4}, {21, 1}, {21, 3},

{22, 4}, {22, 1}, {22, 5}, {23, 1}, {23, 2}, {24, 4}, {25, 4},

{25, 3}, {26, 5}, {26, 2}, {26, 3}, {26, 4}, {26, 1}, {27, 2},

{27, 1}, {27, 5}, {27, 3}, {28, 1}, {28, 2}, {28, 3}, {28, 4},

{29, 5}, {29, 2}, {29, 1}, {30, 5}, {30, 3}, {30, 2}, {30, 4},

{31, 4}, {31, 1}, {32, 1}, {32, 2}, {32, 3}, {32, 4}, {32, 5},

{33, 3}, {33, 2}, {33, 4}, {33, 5}, {33, 1}, {34, 3}, {34, 4},

{34, 5}, {34, 2}, {35, 2}, {35, 3}, {36, 3}, {36, 5}, {36, 4},

{37, 1}, {37, 4}, {37, 2}, {37, 3}, {37, 5}, {38, 5}, {38, 3},

{38, 1}, {38, 2}, {39, 2}, {39, 5}, {40, 4}, {40, 2}, {41, 4},

{42, 4}, {43, 5}, {43, 4}, {44, 5}, {44, 4}, {44, 3}, {44, 2},

{44, 1}, {45, 1}, {45, 2}, {45, 3}, {45, 5}, {45, 4}, {46, 3},

{46, 4}, {47, 4}, {47, 5}, {47, 2}, {47, 3}, {48, 5}, {48, 3},

{48, 2}, {48, 1}, {48, 4}, {49, 4}, {49, 5}, {50, 4}, {50, 1}

};

Imsls_f_apriori_itemsets *itemsets1 = NULL, *itemsets2 = NULL,

*cand_itemsets = NULL, *itemsets = NULL;

int *prev_freq1 = NULL, *prev_freq2 = NULL, *freq = NULL;

Imsls_f_association_rules *assoc_rules = NULL;

/* Find frequent itemsets in x1 and x2. */

imsls_f_aggr_apriori(IMSLS_FREQUENT_ITEMSETS,

N1, &x1[0][0], max_num_products,

max_set_size, min_pct_support, &itemsets1,

0);

imsls_f_aggr_apriori(IMSLS_FREQUENT_ITEMSETS,

N2, &x2[0][0], max_num_products,

max_set_size, min_pct_support, &itemsets2,

0);

/* Take the union of itemsets1 and itemsets2. */

imsls_f_aggr_apriori(IMSLS_UNION,

itemsets1, itemsets2, &cand_itemsets,

0);

/* Count the frequencies of each candidate itemset in

each of the data sets */

imsls_f_aggr_apriori(IMSLS_COUNT, cand_itemsets,

N1, &x1[0][0], &prev_freq1,

0);

imsls_f_aggr_apriori(IMSLS_COUNT, cand_itemsets,

N2, &x2[0][0], &prev_freq2,

0);

/* Sum the frequencies. */

imsls_f_aggr_apriori(IMSLS_SUM, cand_itemsets->n_itemsets,

prev_freq1, prev_freq2, &freq,

0);

/* Determine which of the candidate itemsets are frequent. */

imsls_f_aggr_apriori(IMSLS_UPDATE_FREQ_ITEMSETS,

cand_itemsets, cand_itemsets->n_itemsets, freq, &itemsets,

0);

/* Generate the strong association rules. */

imsls_f_aggr_apriori(IMSLS_ASSOCIATION_RULES,

itemsets, confidence, lift, &assoc_rules,

0);

imsls_f_write_apriori_itemsets(itemsets);

imsls_f_write_association_rules(assoc_rules);

imsls_f_free_apriori_itemsets(itemsets1);

imsls_f_free_apriori_itemsets(itemsets2);

imsls_f_free_apriori_itemsets(cand_itemsets);

imsls_f_free_apriori_itemsets(itemsets);

imsls_f_free_association_rules(assoc_rules);

imsls_free(prev_freq1);

imsls_free(prev_freq2);

imsls_free(freq);

}

Output

Frequent Itemsets (Out of 100 Transactions):

Size Support Itemset

1 51 { 1 }

1 63 { 2 }

1 60 { 3 }

1 63 { 4 }

1 54 { 5 }

2 37 { 1 2 }

2 38 { 1 3 }

2 33 { 1 4 }

2 35 { 1 5 }

2 44 { 2 3 }

2 38 { 2 4 }

2 34 { 2 5 }

2 38 { 3 4 }

2 38 { 3 5 }

2 37 { 4 5 }

3 32 { 1 2 3 }

3 31 { 2 3 4 }

Association Rules (itemset X implies itemset Y):

X = {1 2} ==> Y = {3}

supp(X)=37, supp(Y)=60, supp(X U Y)=32

conf= 0.86, lift=1.44

X = {1 3} ==> Y = {2}

supp(X)=38, supp(Y)=63, supp(X U Y)=32

conf= 0.84, lift=1.34

X = {2 4} ==> Y = {3}

supp(X)=38, supp(Y)=60, supp(X U Y)=31

conf= 0.82, lift=1.36

X = {3 4} ==> Y = {2}

supp(X)=38, supp(Y)=63, supp(X U Y)=31

conf= 0.82, lift=1.29

Warning Errors

IMSLS_MIN_SUPPORT_NOT_MET

No items met minimum support of #.

Fatal Errors

IMSLS_NEED_IARG_GE	"name" = #. "name" must be greater than or equal to #.
IMSLS_NEED_IARG_GT	"name" = #. "name" must be greater than #.
IMSLS_INEQUALITY_VIOLATION_12	"name1" = # must equal "name2" = #.