8 Association

This chapter describes association, the unsupervised mining function for discovering association rules.

This chapter contains the following topics:

About Association
A Sample Association Problem
Algorithm for Association Rules

About Association

Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.

Association rules are often used to analyze sales transactions. For example, it might be noted that customers who buy cereal at the grocery store often buy milk at the same time. In fact, association analysis might find that 85% of the checkout sessions that include cereal also include milk. This relationship could be formulated as the following rule.

Cereal implies milk with 85% confidence

This application of association modeling is called market-basket analysis. It is valuable for direct marketing, sales promotions, and discovering business trends. Market-basket analysis can also be used effectively for store layout, catalog design, and cross-selling.

Association modeling has important applications in other domains as well. For example, in e-commerce applications, association rules may be used for Web page personalization. An association model might find that a user who visits pages A and B is 70% likely to also visit page C in the same session. Based on this rule, a dynamic link could be created for users who are likely to be interested in page C. The association rule could be expressed as follows.

A and B imply C with 70% confidence
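The confidence figure in a rule such as this one can be illustrated with a short calculation. The following Python sketch is not part of Oracle Data Mining; the session data and the function name `rule_confidence` are hypothetical, chosen only to show how the percentage in a rule might be derived from raw sessions.

```python
def rule_confidence(sessions, antecedent, consequent):
    """Return the fraction of sessions containing every antecedent item
    that also contain every consequent item."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [s for s in sessions if antecedent <= set(s)]
    if not with_antecedent:
        return 0.0  # the rule never applies, so report zero confidence
    with_both = [s for s in with_antecedent if consequent <= set(s)]
    return len(with_both) / len(with_antecedent)

# Hypothetical Web sessions (illustrative data only): each set lists pages visited.
web_sessions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "D"},
]

ab_to_c = rule_confidence(web_sessions, {"A", "B"}, {"C"})
print(f"A and B imply C with {ab_to_c:.0%} confidence")
```

In this made-up data, four sessions include both A and B, and three of those also include C, so the computed confidence is three out of four.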

See Also:

"Confidence"

Transactions

Unlike other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction such as a market basket or Web session. The collection of items in the transaction is an attribute of the transaction. Other attributes might be the date, time, location, or user ID associated with the transaction.

The collection of items in the transaction is a multi-record attribute. Transactional data is said to be in multi-record case format. An example is shown in Figure 8-1.

Figure 8-1 Transactional Data

Since Oracle Data Mining requires single-record case format, the column that holds the collection must be transformed to a nested table type prior to mining for association rules. Transactional data in single-record case format is shown in Figure 8-2.

Figure 8-2 Transactional Data Transformed for Mining

The case ID for transactional data may be a multi-column key. For example, the case ID for sales transactions could be a customer ID and a time stamp.
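The change from multi-record to single-record case format can be sketched outside the database. The Python example below is only an illustration of the idea, not the SQL nested table transformation that Oracle Data Mining uses. The item lists are inferred from the itemsets in Table 8-1, and the value `1` used to mark presence is an assumption.

```python
from collections import defaultdict

# Multi-record case format: one row per (transaction ID, item).
# Item lists are inferred from Table 8-1; they are not copied from Figure 8-1.
multi_record_rows = [
    (11, "B"), (11, "D"), (11, "E"),
    (12, "A"), (12, "B"), (12, "C"), (12, "E"),
    (13, "B"), (13, "C"), (13, "D"), (13, "E"),
]

def to_single_record(rows):
    """Group (case_id, item) rows into one record per case.

    Each case maps to a list of (item name, value) pairs, loosely
    mimicking a nested column. The value 1 is an assumed presence marker.
    """
    cases = defaultdict(list)
    for case_id, item in rows:
        cases[case_id].append((item, 1))
    return dict(cases)

single_record = to_single_record(multi_record_rows)
for case_id, items in single_record.items():
    print(case_id, items)
```

Each case ID now appears exactly once, with its items held in a single collection-valued attribute.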

Note:

Oracle Data Miner handles nested table transformation transparently. Instructions for using SQL to transform transactional data are provided in Oracle Data Mining Application Developer's Guide.

See Also:

Figure 4-3, "Sample Build Data for Regression" and Figure 7-2, "Build Data for Clustering" for examples of single-record case format

"Sparse Data"

Items and Collections

In transactional data, a collection of items is associated with each case. The collection could theoretically include all possible members of the collection. For example, all products could theoretically be purchased in a single market-basket transaction. In practice, however, only a tiny subset of all possible items is present in a given transaction; the items in the market basket represent only a small fraction of the items available for sale in the store.

Oracle Data Mining implements collections as nested rows, as shown in Figure 8-2. Each nested row specifies an item name and a value. If an item is present in a collection, it has a non-null value. An item is uniquely identified by its name and its value. Items with the same name but different values may occur across collections. For example, if one transaction includes one gallon of milk and another includes two gallons of milk, the two entries (milk with value 1 and milk with value 2) are interpreted as different items.
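The effect of identifying an item by both name and value can be seen in a small Python sketch. The tuples and transaction contents here are hypothetical and are not an Oracle Data Mining data structure; they only illustrate why the two milk entries do not match.

```python
# Each item is modeled as a (name, value) pair; the data is hypothetical.
transaction_1 = {("milk", 1), ("cereal", 1)}
transaction_2 = {("milk", 2), ("bread", 1)}

# Items are compared on name AND value, so the two milk entries differ.
shared_items = transaction_1 & transaction_2

# Comparing on name alone would treat milk as common to both transactions.
shared_names = {name for name, _ in transaction_1} & {name for name, _ in transaction_2}

print("Shared items:", shared_items)
print("Shared names:", shared_names)
```

If quantities should not distinguish items, one option under this model would be to map every non-null value to the same constant before mining.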

Sparse Data

When an item is not present in a collection, it may have a null value or it may simply be missing. Many of the items may be missing or null, since many of the items that could be in the collection are probably not present in any individual transaction.

Missing rows in a collection indicate sparsity. This means that a high proportion of the nested rows are not populated. The Oracle Data Mining association algorithm is optimized for processing sparse data.

See Also:

Oracle Data Mining Application Developer's Guide for information about Oracle Data Mining and sparse data

Itemsets

The first step in association analysis is the enumeration of itemsets. An itemset is any combination of two or more items in a transaction.

The maximum number of items in an itemset is user-specified. If the maximum is two, all the item pairs will be counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted.

The maximum number of items in an itemset is specified by the ASSO_MAX_RULE_LENGTH setting, which also applies to the rules derived from the itemsets.

See Also:

"Association Rules" to learn about the relationship between itemsets and rules

Oracle Database PL/SQL Packages and Types Reference for descriptions of the build settings for association rules

Table 8-1 shows the itemsets derived from the transactions in Figure 8-2, assuming that ASSO_MAX_RULE_LENGTH is set to 3.

Table 8-1 Itemsets

Transaction	Itemsets
11	(B,D) (B,E) (D,E) (B,D,E)
12	(A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E)
13	(B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E)
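The enumeration in Table 8-1 can be reproduced with a few lines of Python. This is a minimal sketch of exhaustive enumeration, not the way the Oracle algorithm works internally. The constant `ASSO_MAX_RULE_LENGTH` below is an ordinary Python variable that borrows the setting's name, and the function name `enumerate_itemsets` is hypothetical.

```python
from itertools import combinations

# Plain Python constant named after the build setting, for illustration only.
ASSO_MAX_RULE_LENGTH = 3

# Items per transaction, as implied by Table 8-1.
transactions = {
    11: ["B", "D", "E"],
    12: ["A", "B", "C", "E"],
    13: ["B", "C", "D", "E"],
}

def enumerate_itemsets(items, max_length):
    """List every combination of 2..max_length items, smallest sizes first."""
    items = sorted(items)
    itemsets = []
    for size in range(2, min(max_length, len(items)) + 1):
        itemsets.extend(combinations(items, size))
    return itemsets

itemsets_by_transaction = {
    tid: enumerate_itemsets(items, ASSO_MAX_RULE_LENGTH)
    for tid, items in transactions.items()
}

for tid, itemsets in itemsets_by_transaction.items():
    print(tid, " ".join("(" + ",".join(s) + ")" for s in itemsets))
```

Lowering the maximum length to 2 in this sketch leaves only the item pairs, which shows how a smaller setting reduces the number of itemsets to be counted.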

Tip:

Decrease the maximum rule length if you want to decrease the build time for the model and generate simpler rules.

Frequent Itemsets

Association rules are calculated from itemsets. If rules are generated from all possible itemsets, there may be a very high number of rules and the rules may not be very meaningful. Also, the model may take a long time to build. Typically it is desirable to only generate rules from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.

The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for rules.

The ASSO_MIN_SUPPORT setting specifies the minimum frequent itemset support. It also applies to the rules derived from the frequent itemsets.

See Also:

"Association Rules" to learn about the relationship between frequent itemsets and rules

Oracle Database PL/SQL Packages and Types Reference for descriptions of the build settings for association rules

Table 8-2 shows the itemsets from Table 8-1 that are frequent itemsets with support > 66%.

Table 8-2 Frequent Itemsets

Frequent Itemset	Transactions	Support
(B,C)	2 of 3	67%
(B,D)	2 of 3	67%
(B,E)	3 of 3	100%
(C,E)	2 of 3	67%
(D,E)	2 of 3	67%
(B,C,E)	2 of 3	67%
(B,D,E)	2 of 3	67%
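The support values in Table 8-2 can be checked with a short Python sketch. It counts every itemset of two or three items across the three transactions implied by Table 8-1 and keeps those whose support exceeds the threshold. The variable names are hypothetical, and the brute-force counting is for illustration only; it does not represent how Oracle Data Mining computes frequent itemsets.

```python
from collections import Counter
from itertools import combinations

# Items per transaction, as implied by Table 8-1.
transactions = [
    {"B", "D", "E"},
    {"A", "B", "C", "E"},
    {"B", "C", "D", "E"},
]
MAX_LENGTH = 3      # illustrative stand-in for the maximum rule length
MIN_SUPPORT = 0.66  # threshold used for Table 8-2 (support > 66%)

# Count in how many transactions each itemset occurs.
counts = Counter()
for items in transactions:
    for size in range(2, MAX_LENGTH + 1):
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1

# Keep only itemsets whose share of all transactions exceeds the threshold.
frequent = {
    itemset: count / len(transactions)
    for itemset, count in counts.items()
    if count / len(transactions) > MIN_SUPPORT
}

for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(f"({','.join(itemset)})\t{counts[itemset]} of {len(transactions)}\t{support:.0%}")
```

Raising `MIN_SUPPORT` above two thirds in this sketch would leave only (B,E), the one itemset present in all three transactions.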

Tip:

Increase the minimum support if you want to decrease the build time for the model and generate fewer rules.

See Also:

Chapter 10, "Apriori" for information about the calculation of association rules

A Sample Association Problem

This example shows association rules mined from sales transactions in the SH schema. Sales is a fact table linked to products, customers, and other dimension tables through foreign keys. Oracle Data Miner automatically converts the transactional data to single-record case format.

The items in each transaction are products; each transaction is uniquely identified by a customer ID. Figure 8-3 shows the dialog in Oracle Data Miner for selecting transactional data.

Figure 8-3 Select Transactional Data

Figure 8-4 shows the dialog for selecting the unique transaction identifier.

Figure 8-4 Select Transaction Identifier

A model with default settings built on this data generates many rules. One way to limit the number of rules is to raise the support and confidence. Figure 8-5 shows Confidence raised to 65% and Support raised to 75% in the Advanced Settings dialog.

Figure 8-5 Advanced Settings for Association Rules

Figure 8-6 shows the rules that are returned when you increase the confidence and support.

Figure 8-6 Sample Association Rules

You can filter the rules in a number of different ways. The dialog in Figure 8-7 specifies that only rules with "Mouse Pad" in the antecedent, and "Keyboard Wrist Rest" in the consequent should be returned.

Figure 8-7 Filter Rules
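The kind of filtering performed in this dialog can be sketched in Python. The `Rule` class, the `filter_rules` function, and every rule and number in the sample list are hypothetical; they are not output from Oracle Data Miner and only illustrate selecting rules by antecedent and consequent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """A hypothetical in-memory representation of an association rule."""
    antecedent: frozenset
    consequent: frozenset
    confidence: float
    support: float

def filter_rules(rules, antecedent_item=None, consequent_item=None):
    """Keep rules whose antecedent and consequent contain the given items.

    A criterion left as None is not applied.
    """
    selected = []
    for rule in rules:
        if antecedent_item is not None and antecedent_item not in rule.antecedent:
            continue
        if consequent_item is not None and consequent_item not in rule.consequent:
            continue
        selected.append(rule)
    return selected

# Made-up rules and metrics, for illustration only.
sample_rules = [
    Rule(frozenset({"Mouse Pad", "Item X"}), frozenset({"Keyboard Wrist Rest"}), 0.95, 0.80),
    Rule(frozenset({"Mouse Pad"}), frozenset({"Item Y"}), 0.90, 0.78),
    Rule(frozenset({"Item X"}), frozenset({"Keyboard Wrist Rest"}), 0.85, 0.76),
]

matches = filter_rules(sample_rules, "Mouse Pad", "Keyboard Wrist Rest")
for rule in matches:
    print(sorted(rule.antecedent), "=>", sorted(rule.consequent))
```

Only the first made-up rule satisfies both criteria; the other two each fail one of them.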

Figure 8-8 shows the three rules that result from the filtering criteria specified in Figure 8-7. The first rule states that a customer who purchases a mousepad and a 1.44 MB External 3.5 Diskette is likely to also buy a keyboard wrist rest at the same time. The confidence for this rule is 99%, and the support is 77%.

Figure 8-8 Display rules with mousepad in antecedent

See Also:

"Confidence" for a discussion of confidence

Algorithm for Association Rules

Oracle Data Mining uses the Apriori algorithm to calculate association rules for items in frequent itemsets.

See Also:

Chapter 10, "Apriori"