Association Rules Use Case Scenario

A popular movie rental website is being updated. The movie rental company wishes to provide movie recommendations to their customers based on their frequently rented movies and purchase transaction history. They approach you, a data scientist, for assistance with movie recommendations. Using the Apriori algorithm, you solve this problem by analysing popular movies that are frequently watched together.

Before you start your OML4SQL use case journey, ensure that you have the following:

Load Data

Examine the data set and its attributes. Load the data in your database.

In this use case, you will load the data set into your database. If you are using Oracle Autonomous Database, you will use an existing data file from Oracle Cloud Infrastructure (OCI) Object Storage. You will create a sample table, load data into the sample table from files on OCI Object Storage, and explore the data. If you are using an on-premises database, you will use Oracle SQL Developer to import the data set and explore the data.

To understand the data, you will perform the following:
  • Access the data.
  • Examine the various attributes or columns of the data set.
  • Assess data quality (by exploring the data).

Examine Data

The following table displays information about the attributes of MOVIE_SALES_FACT:

Attribute Name Information
ORDER_NUM Specifies the order number
ACTUAL_PRICE Specifies the actual price of the movie
AGE Specifies the age of the customer
AGE_BAND Specifies the age band of the customer. The possible values are 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89 and so on.
APP Specifies the application used for the movie
CITY Specifies the name of the city
CITY_ID Specifies the city ID
COMMUTE_DISTANCE Specifies the commute distance
COMMUTE_DISTANCE_BAND Specifies the commute distance band
CONTINENT Specifies the continent name
COUNTRY Specifies the country name
COUNTRY_CODE Specifies the country code
COUNTRY_ID Specifies the country ID
CREDIT_BALANCE Specifies the credit balance of the customer
CUSTOMER_ID Specifies the customer ID
CUSTOMER_NAME Specifies the customer name
DAY Specifies the date in YYYY-mm-dd hh:mm:ss format
DAY_NAME Specifies the day of the week
DAY_NUM_OF_WEEK Specifies the day number of the week
DEVICE Specifies the device information used by the customer
DISCOUNT_PERCENT Specifies the discount percent
DISCOUNT_TYPE Specifies the discount type availed by the customer. Possible values are referral, coupon, promotion, volume, none
EDUCATION Specifies customer's education
EMAIL Specifies email ID of the customer
FULL_TIME Specifies customer's employment status such as full time, not employed, part time
GENDER Specifies the gender of the customer
GENRE Specifies the genre of the movie
HOUSEHOLD_SIZE Specifies the household size of the customer
HOUSEHOLD_SIZE_BAND Specifies the household size band
INCOME Specifies the income of the customer
INCOME_BAND Specifies the income band of the customer
INSUFF_FUNDS_INCIDENTS Specifies the number of insufficient funds incidents that the customer had
JOB_TYPE Specifies the customer's job
LATE_MORT_RENT_PMTS Specifies if the customer had any late mortgage or rent payments
LIST_PRICE Specifies the list price of the movie
MARITAL_STATUS Specifies the marital status of the customer
MONTH Specifies the month in MON-YYYY format
MONTH_NAME Specifies the month. For example, January.
MONTH_NUM_OF_YEAR Specifies the month number of the year
MORTGAGE_AMT Specifies the mortgage amount
MOVIE_ID Specifies the movie ID
NUM_CARS Specifies the number of the cars that the customer owns
NUM_MORTGAGES Specifies the number of mortgages
OS Specifies the OS information
PAYMENT_METHOD Specifies the payment method
PET Specifies if the customer owns a pet
POSTAL_CODE Specifies the postal code of the address
PROMOTION_RESPONSE Specifies the response of the customer to a promotional offer
QUANTITY_SOLD Specifies the quantity sold
QUARTER_NAME Specifies the quarter name in Qn-YYYY format. For example, Q1-2001.
QUARTER_NUM_OF_YEAR Specifies the quarter number of the year
RENT_OWN Specifies whether the customer lives in a rented or owned place
SEARCH_GENRE Specifies the genre of the movies searched
SEGMENT_DESCRIPTION Describes the population segment
SEGMENT_NAME Specifies the population segment name
SKU Specifies the SKU ID
STATE_PROVINCE Specifies the province
STATE_PROVINCE_ID Specifies the province ID
STREET_ADDRESS Specifies the customer's address
TITLE Specifies the movie title
USERNAME Specifies the username provided by the customer
WORK_EXPERIENCE Specifies the work experience of the customer
WORK_EXPERIENCE_BAND Specifies the work experience band of the customer
YEAR Specifies the year
YEARS_CURRENT_EMPLOYER Specifies the number of years the customer has spent with the current employer
YEARS_CURRENT_EMPLOYER_BAND Specifies the customer's employment band in years with the current employer
YEARS_RESIDENCE Specifies the number of years the customer has been residing at a place
YEARS_RESIDENCE_BAND Specifies the residence band

Create a Table

Create a table called MOVIE_SALES_FACT. This table is used by the DBMS_CLOUD.COPY_DATA procedure to access the data set.

Enter the following code in the OML Notebooks and run the notebook.

%sql
CREATE TABLE MOVIE_SALES_FACT
( ORDER_NUM NUMBER(38,0),
 DAY DATE,
 DAY_NUM_OF_WEEK NUMBER(38,0),
 DAY_NAME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 MONTH VARCHAR2(12 BYTE) COLLATE USING_NLS_COMP,
 MONTH_NUM_OF_YEAR NUMBER(38,0),
 MONTH_NAME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 QUARTER_NAME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 QUARTER_NUM_OF_YEAR NUMBER(38,0),
 YEAR NUMBER(38,0),
 CUSTOMER_ID NUMBER(38,0),
 USERNAME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 CUSTOMER_NAME VARCHAR2(250 BYTE) COLLATE USING_NLS_COMP,
 STREET_ADDRESS VARCHAR2(250 BYTE) COLLATE USING_NLS_COMP,
 POSTAL_CODE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 CITY_ID NUMBER(38,0),
 CITY VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 STATE_PROVINCE_ID NUMBER(38,0),
 STATE_PROVINCE VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 COUNTRY_ID NUMBER(38,0),
 COUNTRY VARCHAR2(126 BYTE) COLLATE USING_NLS_COMP,
 COUNTRY_CODE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 CONTINENT VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 SEGMENT_NAME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 SEGMENT_DESCRIPTION VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 CREDIT_BALANCE NUMBER(38,0),
 EDUCATION VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 EMAIL VARCHAR2(128 BYTE) COLLATE USING_NLS_COMP,
 FULL_TIME VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 GENDER VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 HOUSEHOLD_SIZE NUMBER(38,0),
 HOUSEHOLD_SIZE_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 WORK_EXPERIENCE NUMBER(38,0),
 WORK_EXPERIENCE_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 INSUFF_FUNDS_INCIDENTS NUMBER(38,0),
 JOB_TYPE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 LATE_MORT_RENT_PMTS NUMBER(38,0),
 MARITAL_STATUS VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 MORTGAGE_AMT NUMBER(38,0),
 NUM_CARS NUMBER(38,0),
 NUM_MORTGAGES NUMBER(38,0),
 PET VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 PROMOTION_RESPONSE NUMBER(38,0),
 RENT_OWN VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 YEARS_CURRENT_EMPLOYER NUMBER(38,0),
 YEARS_CURRENT_EMPLOYER_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 YEARS_CUSTOMER NUMBER(38,0),
 YEARS_CUSTOMER_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 YEARS_RESIDENCE NUMBER(38,0),
 YEARS_RESIDENCE_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 AGE NUMBER(38,0),
 AGE_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 COMMUTE_DISTANCE NUMBER(38,0),
 COMMUTE_DISTANCE_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 INCOME NUMBER(38,0),
 INCOME_BAND VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 MOVIE_ID NUMBER(38,0),
 SEARCH_GENRE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 TITLE VARCHAR2(4000 BYTE) COLLATE USING_NLS_COMP,
 GENRE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 SKU NUMBER(38,0),
 LIST_PRICE NUMBER(38,2),
 APP VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 DEVICE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 OS VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 PAYMENT_METHOD VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 DISCOUNT_TYPE VARCHAR2(26 BYTE) COLLATE USING_NLS_COMP,
 DISCOUNT_PERCENT NUMBER(38,1),
 ACTUAL_PRICE NUMBER(38,2),
 QUANTITY_SOLD NUMBER(38,0)
 )
;

Load Data in the Table

Load the data set stored in object storage to the MOVIE_SALES_FACT table.

Before you load this data, ensure that you set Compute Resources to Medium or High. If you select High, set the Memory field to 16 for the High Resource Service. You must have administrator privileges to configure the memory settings. See Compute Resource.

Add a new paragraph in your OML notebook and run the following statement:
%script
BEGIN
 DBMS_CLOUD.COPY_DATA (
    table_name    => 'MOVIE_SALES_FACT',
    file_uri_list => 'https://objectstorage.uk-london-1.oraclecloud.com/n/adwc4pm/b/moviestream_kl/o/d801_movie_sales_fact_m-*.csv',
    format        => '{"delimiter":",", "recorddelimiter":"newline", "skipheaders":"1", "quote":"\\\"", "rejectlimit":"1000", "trimspaces":"rtrim", "ignoreblanklines":"false", "ignoremissingcolumns":"true", "dateformat":"DD-MON-YYYY HH24:MI:SS"}'
 );
END;
/
 

PL/SQL procedure successfully completed.
 
 
---------------------------
Examine the statement:
  • table_name: is the target table’s name.
  • credential_name: is the name of the credential object used to access the object store. This optional parameter is not specified in this statement.
  • file_uri_list: is a comma-delimited list of the source files you want to load. The wildcard character * in the file name d801_movie_sales_fact_m-*.csv means you are bulk loading the MovieStream data set containing sales data for 2018-2020.
  • format: defines the options you can specify to describe the format of the source file, including whether the file is of type text, ORC, Parquet, or Avro.
    • delimiter: Specifies the field delimiter (special character). Here, it is specified as "," (comma)
    • recorddelimiter: Specifies the record delimiter. The default value is newline. By default, DBMS_CLOUD tries to automatically find the correct newline character as the delimiter.
    • skipheaders: Specifies how many rows should be skipped from the start of the file. In this use case, it is 1.
    • quote: Specifies the quote character for the fields.
    • rejectlimit: The operation errors out after the specified number of rows are rejected. Here, the value is 1000.
    • trimspaces: Specifies how the leading and trailing spaces of the fields are trimmed. Here it is rtrim. The rtrim value indicates that you want trailing spaces trimmed.
    • ignoreblanklines: Blank lines are ignored when set to true. The default value is false.
    • ignoremissingcolumns: If there are more columns in the field_list than there are in the source files, the extra columns are stored as null. The default value is false. In this use case, it is set to true.
    • dateformat: Specifies the date format in the source file.

In this example, adwc4pm is the Oracle Cloud Infrastructure object storage namespace and moviestream_kl is the bucket name.

Explore Data

Once the data is loaded into the table, explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.

Assess Data Quality

To assess the data, you must first be able to view the data in your database. For this reason, you will use SQL statements to query the table.

If you are working with Oracle Autonomous Database, you can use Oracle Machine Learning (OML) Notebooks for your data science project, including assessing data quality. If you are using the on-premises Oracle Database, you can use Oracle SQL Developer to assess data quality. Query the data as described in the following steps.

Note:

Each record in the database is called a case and each case is identified by a case_id. In this use case, CUSTOMER_ID is the case_id.

The following steps help you with the exploratory analysis of the data:

  1. View the data in the MOVIE_SALES_FACT table by running the following query:
    SELECT * FROM MOVIE_SALES_FACT;
  2. To find the number of rows in the data set, run the following statement:
    SELECT DISTINCT COUNT(*) from MOVIE_SALES_FACT;
    
    COUNT(*)
     97890562 
    ---------------------------
  3. To find the number of distinct (unique) customers in the table, run the following statement:
    %script
    SELECT COUNT (DISTINCT CUSTOMER_ID) FROM MOVIE_SALES_FACT;
    COUNT(DISTINCTCUSTOMER_ID) 
    4845 
    ---------------------------
  4. To view the data type of the columns, run the following statement:
    %script
    DESCRIBE MOVIE_SALES_FACT;
    
    Name                          Null?    Type
    ----------------------------- -------- --------------
    ORDER_NUM                              NUMBER(38)
    DAY                                    DATE
    DAY_NUM_OF_WEEK                        NUMBER(38)
    DAY_NAME                               VARCHAR2(26)
    MONTH                                  VARCHAR2(12)
    MONTH_NUM_OF_YEAR                      NUMBER(38)
    MONTH_NAME                             VARCHAR2(26)
    QUARTER_NAME                           VARCHAR2(26)
    QUARTER_NUM_OF_YEAR                    NUMBER(38)
    YEAR                                   NUMBER(38)
    CUSTOMER_ID                            NUMBER(38)
    USERNAME                               VARCHAR2(26)
    CUSTOMER_NAME                          VARCHAR2(250)
    STREET_ADDRESS                         VARCHAR2(250)
    POSTAL_CODE                            VARCHAR2(26)
    CITY_ID                                NUMBER(38)
    CITY                                   VARCHAR2(128)
    STATE_PROVINCE_ID                      NUMBER(38)
    STATE_PROVINCE                         VARCHAR2(128)
    COUNTRY_ID                             NUMBER(38)
    COUNTRY                                VARCHAR2(126)
    COUNTRY_CODE                           VARCHAR2(26)
    CONTINENT                              VARCHAR2(128)
    SEGMENT_NAME                           VARCHAR2(26)
    SEGMENT_DESCRIPTION                    VARCHAR2(128)
    CREDIT_BALANCE                         NUMBER(38)
    EDUCATION                              VARCHAR2(128)
    EMAIL                                  VARCHAR2(128)
    FULL_TIME                              VARCHAR2(26)
    GENDER                                 VARCHAR2(26)
    HOUSEHOLD_SIZE                         NUMBER(38)
    HOUSEHOLD_SIZE_BAND                    VARCHAR2(26)
    WORK_EXPERIENCE                        NUMBER(38)
    WORK_EXPERIENCE_BAND                   VARCHAR2(26)
    INSUFF_FUNDS_INCIDENTS                 NUMBER(38)
    JOB_TYPE                               VARCHAR2(26)
    LATE_MORT_RENT_PMTS                    NUMBER(38)
    MARITAL_STATUS                         VARCHAR2(26)
    MORTGAGE_AMT                           NUMBER(38)
    NUM_CARS                               NUMBER(38)
    NUM_MORTGAGES                          NUMBER(38)
    PET                                    VARCHAR2(26)
    PROMOTION_RESPONSE                     NUMBER(38)
    RENT_OWN                               VARCHAR2(26)
    YEARS_CURRENT_EMPLOYER                 NUMBER(38)
    YEARS_CURRENT_EMPLOYER_BAND            VARCHAR2(26)
    YEARS_CUSTOMER                         NUMBER(38)
    YEARS_CUSTOMER_BAND                    VARCHAR2(26)
    YEARS_RESIDENCE                        NUMBER(38)
    YEARS_RESIDENCE_BAND                   VARCHAR2(26)
    AGE                                    NUMBER(38)
    AGE_BAND                               VARCHAR2(26)
    COMMUTE_DISTANCE                       NUMBER(38)
    COMMUTE_DISTANCE_BAND                  VARCHAR2(26)
    INCOME                                 NUMBER(38)
    INCOME_BAND                            VARCHAR2(26)
    MOVIE_ID                               NUMBER(38)
    SEARCH_GENRE                           VARCHAR2(26)
    TITLE                                  VARCHAR2(4000)
    GENRE                                  VARCHAR2(26)
    SKU                                    NUMBER(38)
    LIST_PRICE                             NUMBER(38,2)
    APP                                    VARCHAR2(26)
    DEVICE                                 VARCHAR2(26)
    OS                                     VARCHAR2(26)
    PAYMENT_METHOD                         VARCHAR2(26)
    DISCOUNT_TYPE                          VARCHAR2(26)
    DISCOUNT_PERCENT                       NUMBER(38,1)
    ACTUAL_PRICE                           NUMBER(38,2)
    QUANTITY_SOLD                          NUMBER(38)
     
    ---------------------------
  5. Select the required columns from the MOVIE_SALES_FACT table.
    %sql
     SELECT ORDER_NUM, MONTH, CUSTOMER_ID, MOVIE_ID, TITLE, GENRE, ACTUAL_PRICE, QUANTITY_SOLD FROM MOVIE_SALES_FACT
     ORDER BY CUSTOMER_ID;
     
    Selected columns
  6. Select customers who watched, for example, the movie "Titanic" and check which other movies are popular among those customers.
    %sql
    select title, count(1) cnt
    from movie_sales_fact a
    join (
    select distinct customer_id
    from movie_sales_fact
    where title = 'Titanic' ) b
    on a.customer_id = b.customer_id
    group by title
    having count(1) > 800000;
    List of popular movies watched among people who watched Titanic
  7. The data set is huge, with millions of records. Create a view called MOVIES that selects a smaller data set by restricting the customer ID range.
    %script
    CREATE OR REPLACE VIEW MOVIES AS
    SELECT DISTINCT CUSTOMER_ID, MOVIE_ID, TITLE, GENRE
    FROM MOVIE_SALES_FACT
    WHERE CUSTOMER_ID BETWEEN 1000000 AND 1000120;
    View MOVIES created.
     ---------------------------
  8. You can check the distribution of genre from the new view MOVIES:
    %sql
    SELECT * FROM MOVIES;

    In OML Notebooks, click the bar icon and expand settings. Drag GENRE to keys and CUSTOMER_ID to values and select COUNT.

    Movie genre
  9. Now, check the count of rows by running the following statement:
    %script
    SELECT DISTINCT COUNT (*) FROM MOVIES;
    
    COUNT(*)
     10194
    ---------------------------
  10. To check if there are any missing values (NULL values), run the following statement:
    SELECT COUNT(*) FROM MOVIES WHERE CUSTOMER_ID IS NULL OR MOVIE_ID IS NULL OR TITLE IS NULL OR GENRE IS NULL;
    COUNT(*) 
    0 
    ---------------------------

    NULLs, if found, are automatically handled by the OML algorithms. Alternatively, you can manually replace NULLs with the NVL SQL function, as shown in the sketch below.
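    The following is a minimal sketch of the manual alternative; the placeholder value 'Unknown' is only an example and not part of this use case.
    %sql
    -- Substitute a placeholder value wherever GENRE is NULL.
    SELECT CUSTOMER_ID, MOVIE_ID, TITLE, NVL(GENRE, 'Unknown') GENRE
    FROM MOVIES;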

This completes the data exploration stage. OML supports Automatic Data Preparation (ADP). ADP is enabled through the model settings. When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model. This step is done during the Build Model stage. The commonly used methods of data preparation are binning, normalization, and missing value treatment.

Build Model

Build your model using your data set. Use the DBMS_DATA_MINING.CREATE_MODEL2 procedure to build your model and specify the model settings.

Association Rules is an unsupervised learning technique: there are no labels or predictions against which to measure accuracy or performance, and therefore no objective way to evaluate the model on held-out data. For this reason you do not split the data into training and test sets; the entire data set is used to build the model.

Algorithm Selection

Oracle supports the Apriori algorithm to build an Association Rules model.

Apriori calculates the probability of an item being present in a frequent itemset, given that another item or group of items is present. An itemset is any combination of two or more items in a transaction. Frequent itemsets are those that occur with a minimum frequency that the user specifies. An association rule states that an item or group of items implies the presence of another item with some probability and support.
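To make support concrete, the following is a hand-rolled sketch (not part of the model build) that computes the support of one candidate two-item itemset directly with SQL over the MOVIES view created earlier. The two titles are arbitrary examples and may not both appear in your customer range; Apriori counts every such pair automatically.

%sql
-- Fraction of customers (cases) whose transactions contain both example titles.
SELECT ROUND(
         COUNT(*) / (SELECT COUNT(DISTINCT CUSTOMER_ID) FROM MOVIES), 3
       ) AS PAIR_SUPPORT
FROM (SELECT DISTINCT CUSTOMER_ID FROM MOVIES WHERE TITLE = 'Titanic') a
JOIN (SELECT DISTINCT CUSTOMER_ID FROM MOVIES WHERE TITLE = '101 Dalmatians') b
  ON a.CUSTOMER_ID = b.CUSTOMER_ID;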

The following steps guide you to build your model with the Apriori algorithm.

  1. Build your model using the CREATE_MODEL2 procedure. First, declare a variable to store model settings or hyperparameters. Run the following script:
    %script
     BEGIN DBMS_DATA_MINING.DROP_MODEL('AR_MOVIES');
    EXCEPTION WHEN OTHERS THEN NULL; END;
    /
    DECLARE
     v_setlist DBMS_DATA_MINING.SETTING_LIST;
    BEGIN
     v_setlist('ALGO_NAME')                := 'ALGO_APRIORI_ASSOCIATION_RULES';
     v_setlist('PREP_AUTO')                := 'ON';
     v_setlist('ASSO_MIN_SUPPORT')         := '0.02';
     v_setlist('ASSO_MIN_CONFIDENCE')      := '0.1';
     v_setlist('ASSO_MAX_RULE_LENGTH')     := '2';
     v_setlist('ODMS_ITEM_ID_COLUMN_NAME') := 'TITLE';

     DBMS_DATA_MINING.CREATE_MODEL2(MODEL_NAME          => 'AR_MOVIES',
                                    MINING_FUNCTION     => 'ASSOCIATION',
                                    DATA_QUERY          => 'select * from MOVIES',
                                    SET_LIST            => v_setlist,
                                    CASE_ID_COLUMN_NAME => 'CUSTOMER_ID');
    END;
    /
    
    PL/SQL procedure successfully completed.
    ---------------------------
    PL/SQL procedure successfully completed.
    

    Examine the script:

    • v_setlist is a variable to store SETTING_LIST.
    • DBMS_DATA_MINING is the PL/SQL package used for machine learning. These settings are described in DBMS_DATA_MINING - Model Settings.
    • SETTING_LIST specifies model settings or hyperparameters for our model.
    • ALGO_NAME specifies the algorithm name. Since you are using Apriori as your algorithm, set ALGO_APRIORI_ASSOCIATION_RULES.
    • PREP_AUTO is the setting used for Automatic Data Preparation. Here, enable Automatic Data Preparation. The value of the setting is ON.
    • ASSO_MIN_SUPPORT is the minimum support for association rules (a value between 0 and 1) and limits the number of itemsets used for association rules. An itemset must appear in at least this fraction of all the transactions if it is to be used as a basis for rules. Apriori discovers patterns with frequencies above the minimum support threshold, so this is the minimum threshold that each rule must satisfy. Here, the algorithm finds patterns with frequencies above 0.02, that is, itemsets that appear in at least 2% of the transactions. Increase the minimum support if you want to decrease the build time for the model and generate fewer rules.
    • ASSO_MIN_CONFIDENCE determines the minimum confidence for association rules. Confidence is the conditional probability that the consequent occurs given the occurrence of the antecedent; in other words, it indicates how likely the consequent is to appear in transactions that contain the antecedent. The default value is 0.1.
    • ASSO_MAX_RULE_LENGTH specifies the maximum number of items in an itemset. If the maximum is two, all the item pairs are counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum are counted. Increasing this value increases run time and complexity significantly. In this use case, each customer watches a lot of movies, so if you want to increase the value to 3, consider working with a smaller data set. Hence, for demonstration purposes on this data set, it is recommended to set the value to 2.

      Tip:

      One way to limit the number of rules produced is to raise the support and confidence thresholds. Support is the joint probability of two items being purchased together. For instance, suppose beer and diapers occur together with a probability of 0.1, while vodka and ice cream are purchased together with a probability of 0.05. If you raise the support threshold to 0.1, you will no longer see vodka and ice cream in the rules. Similarly, confidence is the probability of purchasing item A given that item B has been purchased. Suppose the probability of purchasing beer given that a diaper has been purchased is 0.2, and the probability of purchasing ice cream given that vodka has been purchased is 0.6. Using a confidence threshold of 0.6, you remove the rule for purchasing beer given that a diaper was already purchased.
    • ODMS_ITEM_ID_COLUMN_NAME specifies the name of the column that contains the items in a transaction. In this use case, it is TITLE. When this setting is specified, the algorithm expects the data to be presented in a native transactional format consisting of two columns (a sketch of this shape appears at the end of this step):
      • Case ID, either categorical or numeric
      • Item ID, either categorical or numeric

    The CREATE_MODEL2 procedure takes the following parameters:

    • MODEL_NAME: Specify a unique name for your model. The name of the model is in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. Here, the model name is AR_MOVIES.
    • MINING_FUNCTION: Specifies the machine learning function. Since you are solving an association problem in this use case, select ASSOCIATION.
    • DATA_QUERY: A query that provides training data for building the model. Here, the query is SELECT * FROM MOVIES.
    • SET_LIST: Specifies SETTING_LIST variable. Here, it is v_setlist.
    • CASE_ID_COLUMN_NAME: A unique case identifier column in the build data. In this use case, the case_id is CUSTOMER_ID. If there is a composite key, you must create a new attribute before creating the model. This may involve concatenating values from the columns, or mapping a unique identifier to each distinct combination of values. The CASE_ID assists with reproducible results and with joining scores for individual customers with other data in, for example, a scoring data table.

    Note:

    Any parameters or settings that you do not specify are either system-determined or use default values.
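    The following is a minimal sketch of the two-column transactional shape that the algorithm consumes when ODMS_ITEM_ID_COLUMN_NAME is set: one row per case-item pair, with CUSTOMER_ID as the case ID and TITLE as the item.
    %sql
    -- One row per (case, item) pair, as the Apriori build sees the MOVIES data.
    SELECT CUSTOMER_ID, TITLE
    FROM MOVIES
    ORDER BY CUSTOMER_ID, TITLE
    FETCH FIRST 10 ROWS ONLY;

    If the case identifier were a composite key (for example, CUSTOMER_ID combined with ORDER_NUM), you would first derive a single case column, for instance TO_CHAR(CUSTOMER_ID) || '-' || TO_CHAR(ORDER_NUM), and pass that column as CASE_ID_COLUMN_NAME; this concatenation is only an illustration, not part of this use case.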

Evaluate

Evaluate your model by viewing diagnostic metrics and performing quality checks.

Sometimes querying dictionary views and model detail views is sufficient to measure your model's performance. However, you can evaluate your model by computing test metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), confusion matrix, lift statistics, cost matrix, and so on. For Association Rules, you can inspect various rules to see if they reveal new insights for item dependencies (antecedent itemset implying consequent) or for unexpected relationships among items.

Dictionary and Model Views

To obtain information about the model and view model settings, you can query data dictionary views and model detail views. Specific views in model detail views display model statistics which can help you evaluate the model.

By examining various statistics in the model detail views, you can compare models to arrive at one model that satisfies your evaluation criteria.

The results of an association model are the rules that identify patterns of association within the data. Oracle Machine Learning for SQL does not support a scoring operation for association modeling. Instead, support and confidence are the primary metrics for evaluating the quality of the rules that the model generates. These statistical measures can be used to rank the rules and hence the usefulness of the predictions.

You can evaluate association rules with the following measures:

  • Support: How often do these items occur together in the data when you apply Association Rules?
  • Confidence: How frequently does the consequent occur in transactions that contain the antecedent?
  • Value: How much business value is connected to item associations?

Additionally, Oracle Machine Learning for SQL supports lift for association rules. Lift indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent, given their individual support. It provides information about the improvement: the increase in probability of the consequent given the antecedent. Lift is defined as the confidence of the combination of items divided by the support of the consequent. Any rule with a lift of less than 1 does not indicate a real cross-selling opportunity, no matter how high its support and confidence, because it actually offers less ability to predict a purchase than random chance does.
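For example (illustrative numbers only, not taken from this data set): if a rule A implies B has a confidence of 0.6 and the consequent B appears on its own in 30% of all transactions (support 0.3), then the lift of the rule is 0.6 / 0.3 = 2, meaning B is twice as likely to be present when A is present than it is overall. A lift of exactly 1 would mean that A provides no information about B.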

The data dictionary views for Oracle Machine Learning are listed in the following table. A database administrator (DBA) and USER versions of the views are also available.

View Name Description
ALL_MINING_MODELS Provides information about all accessible machine learning models
ALL_MINING_MODEL_ATTRIBUTES Provides information about the attributes of all accessible machine learning models
ALL_MINING_MODEL_SETTINGS Provides information about the configuration settings for all accessible machine learning models
ALL_MINING_MODEL_VIEWS Provides information about the model views for all accessible machine learning models
ALL_MINING_MODEL_XFORMS Provides the user-specified transformations embedded in all accessible machine learning models.
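For example, the following sketch queries the USER_ version of the first view to confirm that the model built earlier is accessible and to show its mining function and algorithm (it assumes the AR_MOVIES model exists in your schema):

%sql
-- List the association model created earlier, as recorded in the data dictionary.
SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM
FROM USER_MINING_MODELS
WHERE MODEL_NAME = 'AR_MOVIES';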

Model detail views are specific to the algorithm. You can obtain more insights about the model you created by viewing the model detail views. The names of model detail views begin with DM$xx where xx corresponds to the view prefix. See Model Detail Views.

  1. You can review the model settings in USER_MINING_MODEL_SETTINGS by running the following query:
    SELECT SETTING_NAME, SETTING_VALUE
     FROM USER_MINING_MODEL_SETTINGS
     WHERE MODEL_NAME = 'AR_MOVIES'
     ORDER BY SETTING_NAME;
    
    SETTING_NAME                   SETTING_VALUE                   
    ALGO_NAME                      ALGO_APRIORI_ASSOCIATION_RULES  
    ASSO_MAX_RULE_LENGTH           2                               
    ASSO_MIN_CONFIDENCE            0.1                             
    ASSO_MIN_REV_CONFIDENCE        0                               
    ASSO_MIN_SUPPORT               0.02                            
    ASSO_MIN_SUPPORT_INT           1                               
    ODMS_DETAILS                   ODMS_ENABLE                     
    ODMS_ITEM_ID_COLUMN_NAME       TITLE                           
    ODMS_MISSING_VALUE_TREATMENT   ODMS_MISSING_VALUE_AUTO         
    ODMS_SAMPLING                  ODMS_SAMPLING_DISABLE           
    PREP_AUTO                      ON                              
     
     
    11 rows selected.
     
     
    ---------------------------
  2. Run the following statement to see information on various views in USER_MINING_MODEL_VIEWS:
    SELECT view_name, view_type FROM USER_MINING_MODEL_VIEWS
     WHERE MODEL_NAME = 'AR_MOVIES'
     ORDER BY VIEW_NAME;
    
    VIEW_NAME        VIEW_TYPE                                         
    DM$VAAR_MOVIES   Association Rules For Transactional Data          
    DM$VGAR_MOVIES   Global Name-Value Pairs                           
    DM$VIAR_MOVIES   Association Rule Itemsets                         
    DM$VRAR_MOVIES   Association Rules                                 
    DM$VSAR_MOVIES   Computed Settings                                 
    DM$VTAR_MOVIES   Association Rule Itemsets For Transactional Data  
    DM$VWAR_MOVIES   Model Build Alerts                                
     
     
    7 rows selected.
     
     
    ---------------------------
  3. To view the Association Rules Itemsets For Transactional Data (DM$VTxx) model detail view, run the following script:
    %script
    SELECT ITEM_NAME, SUPPORT, NUMBER_OF_ITEMS
    FROM DM$VTAR_MOVIES;
     
    
    ITEM_NAME                SUPPORT               NUMBER_OF_ITEMS  
    Dallas Buyers Club                           1                 2
    Dallas Buyers Club         0.66666666666666663                 2
    Dallas Buyers Club         0.33333333333333331                 2
    Elvira's Haunted Hills                       1                 2
    Elvira's Haunted Hills     0.66666666666666663                 2
    Elvira's Haunted Hills     0.33333333333333331                 2
    Elvira's Haunted Hills                       1                 2
    Elvira's Haunted Hills                       1                 2
    Ghostbusters II                              1                 2
    Ghostbusters II            0.66666666666666663                 2
    Ghostbusters II            0.33333333333333331                 2
    Ghostbusters II                              1                 2
    Ghostbusters II                              1                 2
    Hits                       0.33333333333333331                 2
    ...

    This view provides the itemsets information in transactional format. In the first transaction, Dallas Buyers Club and another movie are purchased or rented together with 100% support (support 1).

  4. Now, view the Association Rules for Transactional Data (DM$VAxx) model detail view:
    %sql
    SELECT * FROM DM$VAAR_MOVIES;
    Association Rules for Transactional Data DM$VAxx model detail view

    From this view, you can see that the antecedent and the consequent are purchased together frequently (Support=1) and that you can expect the consequent to be present whenever the listed antecedent is present (Confidence=1). Note that a lift of 1 means the antecedent and the consequent occur together no more often than expected from their individual supports; a lift greater than 1 would indicate that the presence of the antecedent increases the probability of the consequent.

  5. To see the top 10 association rules, run the following query:

    The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common. Oracle Machine Learning for SQL supports association rules that have one or more items in the antecedent and a single item in the consequent.

    %script 
    SELECT * FROM 
     (SELECT RULE_ID, ANTECEDENT_PREDICATE ANTECEDENT,
    CONSEQUENT_PREDICATE CONSEQUENT,
    ROUND(RULE_SUPPORT,3) SUPP, ROUND(RULE_CONFIDENCE,3) CONF, NUMBER_OF_ITEMS NUM_ITEMS
    FROM DM$VAAR_MOVIES
    ORDER BY RULE_CONFIDENCE DESC, RULE_SUPPORT DESC)
    WHERE ROWNUM <= 10 
    ORDER BY RULE_ID;
    
    RULE_ID   ANTECEDENT                     CONSEQUENT   SUPP   CONF   NUM_ITEMS   
        10759 101 Dalmatians                 10                1      1           2 
        10761 12 Years a Slave               10                1      1           2 
        10763 127 Hours                      10                1      1           2 
        10771 1984                           10                1      1           2 
        10773 2-Headed Shark Attack          10                1      1           2 
        10777 20,000 Leagues Under the Sea   10                1      1           2 
        10779 2001: A Space Odyssey          10                1      1           2 
        10781 2012                           10                1      1           2 
        10785 3 Ninjas                       10                1      1           2 
        10787 3 from Hell                    10                1      1           2 
    
    
    10 rows selected. 
    
    
    ---------------------------
    Examine the statement:
    • RULE_ID: the rule identifier.
    • ANTECEDENT_PREDICATE: provides the name of the antecedent.
    • CONSEQUENT_PREDICATE: provides the name of the consequent item.
    • ROUND(RULE_SUPPORT, 3) SUPP: provides the support of the rule, rounded to 3 digits after the decimal.
    • ROUND(RULE_CONFIDENCE, 3) CONF: provides the likelihood of a transaction satisfying the rule when it contains the antecedent, rounded to 3 digits after the decimal.
    • NUMBER_OF_ITEMS NUM_ITEMS: specifies the number of items in a rule.
  6. You can also view which consequent items occur most frequently or which consequent items are included in most rules. To do so, run the following query:
    %sql
    SELECT CONSEQUENT, COUNT(1) CNT FROM
    (SELECT ANTECEDENT_PREDICATE ANTECEDENT,
    CONSEQUENT_PREDICATE CONSEQUENT,
    RULE_SUPPORT SUPP, RULE_CONFIDENCE CONF, NUMBER_OF_ITEMS NUM
    FROM DM$VAAR_MOVIES
    ORDER BY RULE_CONFIDENCE DESC)
    GROUP BY  CONSEQUENT
    ORDER BY CNT;

    In OML Notebooks, click settings and click the Bar Chart icon to visualize the result. Click Rotate to rotate the bar graph to 45 degrees.

    Count and visual representation of the most frequently occurring consequents
  7. To view which antecedent items occur most frequently or which antecedent items are included in most rules, run the following script:
    
    %sql
    SELECT ANTECEDENT, COUNT(1) CNT 
    FROM
    (SELECT ANTECEDENT_PREDICATE ANTECEDENT,
    CONSEQUENT_PREDICATE CONSEQUENT,
    RULE_SUPPORT SUPP, RULE_CONFIDENCE CONF, NUMBER_OF_ITEMS NUM
    FROM DM$VAAR_MOVIES
    ORDER BY RULE_CONFIDENCE DESC)
    GROUP BY ANTECEDENT
    ORDER BY CNT;

    In OML Notebooks, click settings and click the Bar Chart icon to visualize the result. Click Rotate to rotate the bar graph to 45 degrees.

    Count and visual representation of the most frequently occurring antecedents
  8. To check how many rules show up in each band of support, run the following query:
    
    %sql
    SELECT '['|| (SUPP_BIN -1)*0.25 ||','||SUPP_BIN*0.25||']' BUCKET, COUNT(1)
    FROM (
    SELECT ANTECEDENT_PREDICATE ANTECEDENT,
    CONSEQUENT_PREDICATE CONSEQUENT,
    RULE_SUPPORT SUPP, RULE_CONFIDENCE CONF, NUMBER_OF_ITEMS NUM, WIDTH_BUCKET(RULE_SUPPORT, 0, 1, 4) SUPP_BIN 
     FROM DM$VAAR_MOVIES ) a
    GROUP BY SUPP_BIN
    ORDER BY SUPP_BIN;
    
    Examine the query:
    • SELECT '['|| (SUPP_BIN -1)*0.25 ||','||SUPP_BIN*0.25||']' BUCKET, COUNT(1) builds the interval label for each bucket.
    • The WIDTH_BUCKET function lets you construct equiwidth histograms, in which the histogram range is divided into intervals of identical size. Here, WIDTH_BUCKET(RULE_SUPPORT, 0, 1, 4) divides the range 0 to 1 into four buckets of width 0.25, numbered 1 through 4. For instance, the first bucket has the value 1 for the range [0, 0.25]. Values of RULE_SUPPORT equal to the upper bound 1 are assigned to the overflow bucket 5.

    In OML Notebooks, click settings and click the Bar Chart icon to visualize the result.

    Visual representation of how many rules are present in each bucket of support
  9. To check how many rules show up in each band of confidence, run the following query:
    
    %sql
    SELECT '['|| (CONF_BIN -1)*0.25 ||','||CONF_BIN*0.25||']' BUCKET, COUNT(1)
    FROM (
    SELECT ANTECEDENT_PREDICATE ANTECEDENT,
    CONSEQUENT_PREDICATE CONSEQUENT,
    RULE_SUPPORT SUPP, RULE_CONFIDENCE CONF, NUMBER_OF_ITEMS NUM, WIDTH_BUCKET(RULE_CONFIDENCE, 0, 1, 4) CONF_BIN 
     FROM DM$VAAR_MOVIES ) a
    GROUP BY CONF_BIN
    ORDER BY CONF_BIN;
    
    
    How many rules are present in each bucket of confidence

    In OML Notebooks, click settings and click the Bar Chart icon to visualize the result.

    Visual representation of how many rules are present in each bucket of confidence
  10. To recommend the top five movies based on a customer's selection, use NUMBER_OF_ITEMS and EXTRACT as predicates and query the Association Rules model detail view (DM$VRxx).

    Association Rules support only a single consequent item.

    %sql
    
    SELECT ROWNUM RANK,
     CONSEQUENT_NAME RECOMMENDATION,
     NUMBER_OF_ITEMS NUM,
     ROUND(RULE_SUPPORT, 3) SUPPORT, 
     ROUND(RULE_CONFIDENCE, 3) CONFIDENCE,
     ROUND(RULE_LIFT, 3) LIFT,
     ROUND(RULE_REVCONFIDENCE, 3) REVERSE_CONFIDENCE
     FROM (SELECT * FROM DM$VRAR_MOVIES
     WHERE NUMBER_OF_ITEMS = 2
     AND EXTRACT(antecedent, '//item[item_name="101 Dalmatians"]') IS NOT NULL
     ORDER BY NUMBER_OF_ITEMS
     )
     WHERE ROWNUM <= 5;
    Examine the query:
    • ROUND(RULE_LIFT, 3) LIFT: The degree of improvement in the prediction over random chance when the rule is satisfied.
    • ROUND(RULE_REVCONFIDENCE, 3) REVERSE_CONFIDENCE: The number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs rounded to 3 digits after the decimal.
    • NUMBER_OF_ITEMS: Here, this parameter controls the size of the rule.

      Note:

      In this use case, since ASSO_MAX_RULE_LENGTH is set to 2, you can skip this predicate.
    • EXTRACT: Filters on the antecedent. If the antecedent must include "101 Dalmatians", then use EXTRACT(antecedent, '//item[item_name="101 Dalmatians"]') IS NOT NULL.
    EXTRACT predicate

    In this step, if the customer's cart contains the movie 101 Dalmatians, the customer is 66.7% likely to rent or buy Graduation Day, How to Be, and 10 Minutes Gone, and is 100% likely to rent or buy 10.

To conclude, you have successfully examined association rules and provided top movie recommendations to customers based on their frequently purchased and/or rented movies.