Clustering Use Case Scenario
You're a game data scientist. Marketing team wants to promote a new game and want customers who bought a gaming product with a high credit limit. They want to segment customers based on game purchases and credit level. You help them identify target customers and segment the population using k-Means.
Related Content
Topic | Link |
---|---|
OML4SQL GitHub Example | Clustering - k-Means |
CREATE_MODEL2 Procedure
|
CREATE_MODEL2 Procedure |
Generic Model Settings | DBMS_DATA_MINING - Model Settings |
k-Means Settings | DBMS_DATA_MINING - Algorithm Settings: k-Means |
Data Dictionary Settings | Oracle Machine Learning Data Dictionary Views |
k-Means - Model Detail Views | Model Detail Views for k-Means |
About Clustering | About Clustering |
About k-Means | About k-Means |
Before you start your OML4SQL use case journey, ensure that you have the following:
- Data Set
The data set used for this use case is from the SH schema. The SH schema can be readily accessed in Oracle Autonomous Database. For on-premises databases, the schema is installed during the installation or can be manually installed by downloading the scripts. See Installing the Sample Schemas.
- Database
Select or create database out of the following options:
- Get your FREE cloud account. Go to https://cloud.oracle.com/database and select Oracle Database Cloud Service (DBCS), or Oracle Autonomous Database. Create an account and create an instance. See Autonomous Database Quick Start Workshop.
- Download the latest version of Oracle Database (on premises).
- Machine Learning Tools
Depending on your database selection,
- Use OML Notebooks for Oracle Autonomous Database.
- Install and use Oracle SQL Developer connected to an on-premises database or DBCS. See Installing and Getting Started with SQL Developer.
- Other Requirements
Data Mining Privileges (this is automatically set for ADW). See System Privileges for Oracle Machine Learning for SQL.
Related Topics
Load Data
Access the data set from the SH Schema and explore the data to understand the attributes.
Remember:
The data set used for this use case is from the SH schema. The SH schema can be readily accessed in Oracle Autonomous Database. For on-premises databases, the schema is installed during the installation or can be manually installed by downloading the scripts. See Installing the Sample Schemas.
- Access the data.
- Examine the various attributes or columns of the data set.
- Assess data quality (by exploring the data).
Access Data
You will use CUSTOMERS
and SUPPLEMENTARY_DEMOGRAPHICS
table data from the SH schema.
Examine Data
The following table displays information about the attributes from SUPPLEMENTARY_DEMOGRAPHICS
:
Attribute Name | Information |
---|---|
CUST_ID |
The ID of the customer |
EDUCATION |
Educational information of the customer |
OCCUPATION |
Occupation of the customer |
HOUSEHOLD_SIZE |
People per house |
YRS_RESIDENCE |
Number of years of residence |
AFFINITY_CARD |
Whether the customer holds an affinity card |
BULK_PACK_DISKETTES |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
FLAT_PANEL_MONITOR |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
HOME_THEATER_PACKAGE |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
BOOKKEEPING_APPLICATION |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
PRINTER_SUPPLIES |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
Y_BOX_GAMES |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
OS_DOC_SET_KANJI |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
COMMENTS |
Product. Indicates whether the customer already owns the product. 1 means Yes. 0 means No |
Explore Data
Once the data is accessible, explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.
Assess Data Quality
To assess the data, first, you must be able to view the data in your database. For this reason, you will use SQL statements to query the SH.CUSTOMERS
and the SH.SUPPLEMENTARY_DEMOGRAPHICS
table.
SH
schema as described.
Note:
Each record in the database is called a case and each case is identified by acase_id
. In this use case, CUST_ID
is the case_id
.
The following steps help you with the exploratory analysis of the data:
This completes the data exploration stage. OML supports Automatic Data Preparation (ADP). ADP is enabled through the model settings. When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model. This step is done during the Build Model stage. The commonly used methods of data preparation are binning, normalization, and missing value treatment.
Related Topics
Build Model
Build your model using your data set. Use the DBMS_DATA_MINING.CREATE_MODEL2
procedure to build your model and specify the model settings.
Algorithm Selection
Before you build a model, choose the suitable algorithm. You can choose one of the following algorithms to solve a clustering problem:- k-Means
- Expectation Maximization (EM)
- Orthogonal Cluster (O-Cluster)
K-Means does not assume a particular distribution of the data. The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a specified number of clusters. The EM algorithm is a probability density estimation technique. EM method is based on assumption that the data has several clusters and each cluster is distributed according to a certain Gaussian distribution. O-Cluster is a neighbor based method. It identifies areas of high density in the data and separates the dense areas into clusters. It is able to cluster data points that forms a certain shape, which sometimes can be a complex pattern like a circle, spiral, or even a tie shape.
K-Means tends to cluster points only close to each other and does not necessarily cluster the data based on the shapes. Therefore, K-Means method is the one with the simplest assumption. Thus, it is the clustering method to start with.
The following steps guide you to build your model with the selected algorithm.
Evaluate
Evaluate your model by viewing diagnostic metrics and performing quality checks.
Sometimes querying dictionary views and model detail views is sufficient to measure your model's performance. However, you can evaluate your model by computing test metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), confusion matrix, lift statistics, cost matrix, and so on. For Association Rules, you can inspect various rules to see if they reveal new insights for item dependencies (antecedent itemset implying consequent) or for unexpected relationships among items.
Dictionary and Model Views
To obtain information about the model and view model settings, you can query data dictionary views and model detail views. Specific views in model detail views display model statistics which can help you evaluate the model.
The data dictionary views for Oracle Machine Learning are listed in the following table. A database administrator (DBA) and USER versions of the views are also available.
View Name | Description |
---|---|
ALL_MINING_MODELS | Provides information about all accessible machine learning models |
ALL_MINING_MODEL_ATTRIBUTES | Provides information about the attributes of all accessible machine learning models |
ALL_MINING_MODEL_SETTINGS | Provides information about the configuration settings for all accessible machine learning models |
ALL_MINING_MODEL_VIEWS | Provides information about the model views for all accessible machine learning models |
ALL_MINING_MODEL_XFORMS | Provides the user-specified transformations embedded in all accessible machine learning models. |
Model detail views are specific to the algorithm. You can obtain more insights about the model you created by viewing the model detail views. The names of model detail views begin with DM$xx where xx corresponds to the view prefix. See Model Detail Views.
The following steps help you to view different dictionary views and model detail views.
Score
Scoring involves applying the model to the target data. Use CLUSTER_PROBABILITY
function to predict the clusters. For Clustering, "scoring" involves assigning each record to a cluster, with a certain probability. However, one can also obtain the probability of a record belonging to each cluster.
WHERE
clause.