3.3 Clustering Use Case

A retail store has data about its customers' behavior and the purchases they make. With this data, the store would like you to identify similarities among its customers. Use Oracle Machine Learning to segment customers by finding clusters in the data set, which can then be used to support targeted marketing campaigns that increase retail sales. In this use case, you will learn how to identify such segments using the k-Means algorithm.

Related Contents

Table 3-1 Related Contents

Topic                          Link
OML4Py GitHub Example          Clustering k-Means
About Clustering               About Clustering
Model Settings                 About Model Settings
Shared Settings                Shared Settings
k-Means - Model Detail Views   Model Detail Views for k-Means

Before you start your OML4Py use case journey, ensure that you have access to the sample data described in Load Data.

3.3.1 Load Data

Access the data set from the SH Schema and explore the data to understand the attributes.

Remember:

The data set used for this use case is from the SH schema. The SH schema can be readily accessed in Oracle Autonomous AI Database. For on-premises databases, the schema is installed during the installation or can be manually installed by downloading the scripts. See Installing the Sample Schemas.

To understand the data, you will perform the following:
  • Access the data.
  • Examine the various attributes or columns of the data set.
  • Assess data quality (by exploring the data).

Access Data

You will use CUSTOMERS and SUPPLEMENTARY_DEMOGRAPHICS table data from the SH schema.

3.3.2 Explore Data

Once the data is accessible, explore the data to understand and assess its quality. At this stage, assess the data to identify data types and noise. Look for missing values and numeric outlier values.

Assess Data Quality

To access database data from Python using OML4Py, you must first create an oml.DataFrame proxy object in Python, which represents a database table, view, or query. Create an oml.DataFrame proxy object for SUPPLEMENTARY_DEMOGRAPHICS and for CUSTOMERS, and then merge them with an inner join on a key column, for example, CUST_ID. Assess the data to identify data types and noise. Look for missing values, outlier numeric values, or inconsistently labeled categorical values.
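
Note that in OML Notebooks the %python interpreter connects to the database for you. If you run OML4Py from a standalone Python session instead, establish the connection explicitly before creating proxy objects. The following is a minimal sketch; the user, password, and dsn values are placeholders for your environment:

    import oml

    # Connect explicitly when running outside OML Notebooks.
    # The credentials and dsn below are placeholders, not real values.
    oml.connect(user="oml_user", password="<password>", dsn="mydb_low")
    print(oml.isconnected())  # True when the connection is established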

The following steps help you with the exploratory analysis of the data:

  1. Run the following script in a %python interpreter paragraph to import the oml module and the pandas module, and to set the display options:
    import pandas as pd
    import matplotlib.pyplot as plt
    import oml

    # Widen the pandas display settings so wide tables render fully.
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    # Suppress FutureWarning messages in the notebook output.
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
  2. Use the oml.sync function to create the Python object DEMOGRAPHICS as a proxy for the database table SUPPLEMENTARY_DEMOGRAPHICS. The oml.sync function returns an oml.DataFrame object that acts as a proxy for the database object.

    Note:

    Only one environment for a given database schema can exist at a time. If schema=None, then objects are searched for in the current user's schema.
    DEMOGRAPHICS = oml.sync(table = "SUPPLEMENTARY_DEMOGRAPHICS", schema = "SH")
    z.show(DEMOGRAPHICS.head())

    Shows data for demographics

  3. To determine the number of rows and columns in the oml.DataFrame object DEMOGRAPHICS, use DataFrame.shape.
    print("Shape:",DEMOGRAPHICS.shape)
    Shape: (4500, 14)
  4. Use the oml.sync function to create the Python object CUSTOMERS as a proxy for the database table SH.CUSTOMERS. The query argument takes a SQL SELECT statement that specifies the columns to include for use through the proxy object.
    CUSTOMERS = oml.sync(query = 'SELECT CUST_ID, CUST_GENDER, CUST_MARITAL_STATUS, CUST_YEAR_OF_BIRTH, CUST_INCOME_LEVEL, CUST_CREDIT_LIMIT FROM SH.CUSTOMERS')
    z.show(CUSTOMERS.head())

    Shows data for customers

  5. To determine the number of rows and columns in the oml.DataFrame object CUSTOMERS, use DataFrame.shape.
    print("Shape:",CUSTOMERS.shape)
    Shape: (55500, 6)
  6. Create a new oml.DataFrame CUSTOMER_DATA by merging the tables CUSTOMERS and DEMOGRAPHICS with an inner join on the common column CUST_ID. The merge function joins one oml.DataFrame to another oml.DataFrame. The suffixes parameter is used when the two oml.DataFrame objects have conflicting column names.
    CUSTOMER_DATA = CUSTOMERS[["CUST_ID", "CUST_GENDER", "CUST_MARITAL_STATUS", "CUST_YEAR_OF_BIRTH", "CUST_INCOME_LEVEL", "CUST_CREDIT_LIMIT"]].merge(DEMOGRAPHICS[["CUST_ID", "HOUSEHOLD_SIZE","YRS_RESIDENCE", "Y_BOX_GAMES"]], how = "inner", on = 'CUST_ID',suffixes = ["",""])
    
  7. To determine the number of rows and columns in the oml.DataFrame object CUSTOMER_DATA, use DataFrame.shape.
    print("Shape:",CUSTOMER_DATA.shape)
    Shape: (4500, 9)
  8. Use the concat function to add the new column CUST_AGE to the oml.DataFrame object CUSTOMER_DATA. The column CUST_AGE contains the customer's age as of the year 2005, computed from the column CUST_YEAR_OF_BIRTH. Because the information in CUST_YEAR_OF_BIRTH is now captured in CUST_AGE, drop the CUST_YEAR_OF_BIRTH column.
    CUSTOMER_DATA=CUSTOMER_DATA.concat({'CUST_AGE':abs(CUSTOMER_DATA['CUST_YEAR_OF_BIRTH'] -2005)})
    CUSTOMER_DATA=CUSTOMER_DATA.drop('CUST_YEAR_OF_BIRTH')
    CUSTOMER_DATA.head()

    shows customer data

  9. Run the following script to view the data type of each column.
    print("The datatypes of the column: ","\n")
    print(CUSTOMER_DATA.dtypes)

    Shows data types.

  10. To check if there are any missing values, run the following script. The count function returns the number of elements that are not NULL for each column, and the len() function returns the number of rows in the data set.
    print("Number of missing values in each column is : \n")
    print(len(CUSTOMER_DATA)-CUSTOMER_DATA.count())

    Shows the missing values.
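
    If any column reports missing values, one simple option is to drop the affected rows before modeling. The following is a minimal sketch, assuming the oml.DataFrame dropna method with pandas-like semantics; it is a no-op when there are no NULLs:

    # Drop rows that contain a NULL in any column.
    CUSTOMER_DATA = CUSTOMER_DATA.dropna(how='any')
    print("Shape:", CUSTOMER_DATA.shape)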

  11. Use the crosstab method to perform a cross-column analysis of an oml.DataFrame object in the database. The crosstab method computes a cross-tabulation of two or more columns. By default, it computes a frequency table for the columns unless a column and an aggregation function have been passed to it. In this example, crosstab computes the frequency of each unique value of CUST_CREDIT_LIMIT, which z.show displays with the values along the x-axis and their occurrence frequency along the y-axis.
    z.show(CUSTOMER_DATA.crosstab('CUST_CREDIT_LIMIT'))

    Cross tab for customer limit.
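
    If you prefer a client-side plot over z.show, you can pull the frequency table to pandas and chart it with matplotlib. This is a minimal sketch; it assumes the frequency column returned by crosstab is named count, as in the displayed output:

    ct = CUSTOMER_DATA.crosstab('CUST_CREDIT_LIMIT').pull()  # pandas DataFrame on the client
    ct = ct.sort_values('CUST_CREDIT_LIMIT')
    plt.bar(ct['CUST_CREDIT_LIMIT'].astype(str), ct['count'])
    plt.xlabel('CUST_CREDIT_LIMIT')
    plt.ylabel('Frequency')
    plt.show()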

  12. Use the transparency layer method describe to calculate descriptive statistics that summarize the central tendency, dispersion, and shape of each numeric column in the CUSTOMER_DATA table.

    Note:

    All computations are performed in the database, and only the resulting statistics are returned to the Python client, in this case the notebook. Eliminating data movement and using the database as a high-performance compute engine greatly increases scalability.
    CUSTOMER_DATA.describe()

    Describe

  13. Before building the model, it's important to ensure that the data is clean. Data often contains outliers, which can form separate clusters that negatively impact model quality.

    The following script defines a function, IQR, to calculate the interquartile range for a dataframe. It takes two arguments: SUMMARY_DF (which contains summary statistics of the dataframe, generated using the describe method) and a list of features. The IQR function uses a for loop to compute the interquartile range for each feature in the list.

    Run the script to calculate the interquartile range for the specified columns:
    def IQR(SUMMARY_DF, features):
        result = [0]*len(features)   
        for i, feature in enumerate(features):
            result[i] = abs(SUMMARY_DF[feature]['75%'] - SUMMARY_DF[feature]['25%'])
        return result
     
    print(IQR(CUSTOMER_DATA.describe(),['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES']))
    
    [20.0, 6000.0, 2.0, 1.0]

    The user-defined function remove_outlier uses the interquartile range to find outliers in the data and remove them. In a box plot, outliers are points that lie outside the lower and upper quartiles by more than 1.5 times the interquartile range (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR). Another form of outlier treatment is clipping or capping, where extreme values are replaced with a maximum or minimum value, for example, the 1.5 * IQR fence values.
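
    As an illustration of the clipping alternative (not used in the rest of this use case), the following sketch caps values at the 1.5 * IQR fences after pulling the data into a pandas DataFrame on the client:

    df = CUSTOMER_DATA.pull()  # client-side pandas copy, for illustration only
    for col in ['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES']:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        h = 1.5 * (q3 - q1)
        df[col] = df[col].clip(lower=q1 - h, upper=q3 + h)  # cap instead of remove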

    The following function removes rows with outliers of a given feature based on quantiles:
    def remove_outlier(DF, SUMMARY_DF, features):
        iqrs = IQR(SUMMARY_DF, features)   
        for i, iqr in enumerate(iqrs):
             H = 1.5*iqr
             DF = DF[ ( DF[features[i]] > SUMMARY_DF[features[i]]['25%'] - H ) & ( DF[features[i]] < SUMMARY_DF[features[i]]['75%'] + H )]
        print(DF.shape)
        return DF
         
    CUSTOMER_DATA_CLEAN= remove_outlier(CUSTOMER_DATA, CUSTOMER_DATA.describe(), ['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES'])
     
    print("Shape:",CUSTOMER_DATA_CLEAN.shape)
    Shape: (4233, 9)
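
    To visually confirm the effect of outlier removal, you can pull a column to the client and compare box plots before and after. This is a minimal matplotlib sketch:

    before = CUSTOMER_DATA['CUST_CREDIT_LIMIT'].pull()        # pandas Series
    after = CUSTOMER_DATA_CLEAN['CUST_CREDIT_LIMIT'].pull()
    plt.boxplot([before, after], labels=['before', 'after'])
    plt.ylabel('CUST_CREDIT_LIMIT')
    plt.show()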

This completes the data understanding and data preparation stage.

3.3.3 Build Model

To evaluate a model's performance, it is common practice to split the data into training and test sets, which allows you to assess how well the model generalizes to unseen data. However, in unsupervised learning such as clustering, there are no labels, so there is no ground truth against which to calculate accuracy or evaluate performance. As a result, the training-test split is neither applicable nor useful, and you can use the entire data set to build the model.

Algorithm Selection

Using OML4Py, you can choose one of the following algorithms to solve a clustering problem:
  1. Expectation-Maximization (EM)
  2. K-Means (KM)

The Expectation-Maximization (EM) algorithm uses a probabilistic clustering based on a density estimation algorithm. The EM algorithm is used when data contains hidden components or when some data points are absent. In contrast, the k-Means (KM) algorithm is a distance-based clustering algorithm that partitions data into a specified number of clusters. Distance-based algorithms are based on the principle that nearby data points are more closely related to one another than to those that are farther away. This algorithm works iteratively to minimize the within-cluster variance in relation to the nearest cluster centroid.
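
For comparison, OML4Py exposes EM through the same fit interface as k-Means. The following minimal sketch (not pursued further in this use case) builds an EM model on the cleaned data:

    # Illustrative only: an EM clustering model via the same fit interface.
    em_mod = oml.em(n_clusters = 3).fit(CUSTOMER_DATA_CLEAN)
    em_mod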

The k-Means algorithm is chosen, as it is simpler than the Expectation-Maximization (EM) algorithm. Since the optimal number of clusters is unknown, start with one cluster and gradually increase the number of clusters. Use the Elbow method to determine the optimal number of clusters.

To specify model settings and build a k-Means model object that will partition and segment the data, run the following script. The settings are provided as key-value pairs, or dictionary pairs, where each key represents a parameter name and its corresponding value represents the setting. Some of the specified settings include KMNS_ITERATIONS and KMNS_DISTANCE. The k-Means algorithm utilizes the number of clusters (k) along with these settings to configure the algorithm.

The following steps guide you to build your model with the selected algorithm.

  • Use the oml.km algorithm to build your model and specify the model settings. Run the following script:

    try:
        oml.drop(model="CUST_CLUSTER_MODEL")
    except:
        pass
         
    setting = {'KMNS_ITERATIONS': 10,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}
     
    km_mod1 = oml.km(n_clusters = 1, **setting).fit(CUSTOMER_DATA_CLEAN, model_name = "CUST_CLUSTER_MODEL", case_id = 'CUST_ID')
    

    Examine the script:

    • KMNS_ITERATIONS: Specifies the maximum number of allowed iterations, with a default of 20.
    • KMNS_DISTANCE: Specifies the distance function used; by default, the Euclidean distance.
    • KMNS_NUM_BINS: Specifies the number of bins in the attribute histogram produced by k-Means.
    • KMNS_DETAILS: Determines the level of cluster details that are computed during the build. The value KMNS_DETAILS_ALL indicates that the cluster hierarchy, record counts, and descriptive statistics such as variances, modes, histograms, and rules are computed.
    • PREP_AUTO: Used for Automatic Data Preparation. It can be set with the string 'ON', as here, or with the PL/SQL constant DBMS_DATA_MINING.PREP_AUTO_ON, which lets the PL/SQL compiler validate that the constant name is correct.
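
    Because fit persisted the model in the database as CUST_CLUSTER_MODEL, you can confirm the settings actually recorded for it. The following is a minimal sketch that queries the USER_MINING_MODEL_SETTINGS data dictionary view through a proxy object:

    MODEL_SETTINGS = oml.sync(query = """SELECT setting_name, setting_value
                                         FROM user_mining_model_settings
                                         WHERE model_name = 'CUST_CLUSTER_MODEL'""")
    z.show(MODEL_SETTINGS)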

3.3.4 Evaluate

Evaluate a model by assessing its performance using various metrics and techniques to determine how effectively it generalizes to new, unseen data. This process involves comparing predictions to actual outcomes using metrics like accuracy, precision, recall, F1 score, or mean squared error, depending on the model type. The evaluation helps identify the model's strengths and weaknesses, guiding further improvement or tuning.

Information and Model Settings

Note:

To get a complete list of the settings available for the k-Means algorithm, run the following script:
help(oml.algo.km)

The following steps help you to view different model detail views.

  • Use km_mod1 to access the model details available through the k-Means model object, like the model settings, coefficients, fit details, and more.
     km_mod1

    Shows the output for km_mod1 model.

  • Use km_mod1.clusters to list the clusters information.
    z.show(km_mod1.clusters)

    Shows the clusters of km_mod1.

  • Run the following script to display the model's centroids for each column. For numeric attributes, it gives the mean and variance; for categorical attributes, the mode.
    z.show(km_mod1.centroids)

    Shows the centroids of km_mod1

  • To determine the optimal value of k, use the Elbow method. Assuming that k lies in a given range, search for the best k by running k-Means for each k in that range and computing the within-cluster variance for each. As k increases, the within-cluster variance decreases, because with more centers each data point is, on average, closer to its cluster centroid; if every data point were its own center, the within-cluster variance would be zero. Before the optimal value of k, the decrease in within-cluster variance is relatively large, as each new cluster groups increasingly close data points. After the optimal value, the decrease slows, as the new clusters formed are similar to one another. Choose the range of k so that the plot shows the sharp decrease at the optimal value followed by the slow, almost linear decrease afterward. The curve usually looks like an L shape, and the best k lies at the turning point, the elbow of the L.
    incluster_sum = []
    for cluster in range(1, 9):
        setting = {'KMNS_ITERATIONS': 15, 'KMNS_RANDOM_SEED': 1}
        km_mod2 = oml.km(n_clusters = cluster, **setting).fit(CUSTOMER_DATA_CLEAN)
        incluster_sum.append(abs(km_mod2.score(CUSTOMER_DATA_CLEAN)))

    plt.plot(range(1, 9), incluster_sum)
    plt.title('The Elbow Method Graph')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('WCSS')
    plt.show()

    Shows the elbow graph.

    The elbow, and thus the optimal number of clusters, can be observed at k=3.

  • Build the final model according to the optimal value of clusters.
    try:
        oml.drop(model="CUST_CLUSTER_MODEL")
    except:
        pass
          
    setting = {'KMNS_ITERATIONS': 20,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}
    km_mod3 = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN, model_name = "CUST_CLUSTER_MODEL", case_id = 'CUST_ID')
  • Run the following script to display the model's clusters and the parent-child relationship in the cluster hierarchy.
    z.show(km_mod3.clusters)

    Shows the clusters of km_mod3.

    The clusters with cluster_id 3, 4, and 5 are the leaf clusters, containing 1369, 1442, and 1422 rows of data points respectively. The cluster with cluster_id equal to 1 is the root of the binary tree, as its parent_cluster_id is NaN.

  • Run the following script to get only the parent/child relationship.
    z.show(km_mod3.taxonomy)

    Shows the taxonomy.

  • To view the per-cluster attribute centroid information of the leaf clusters, use the model attribute centroids, which displays statistics such as the mean, variance, and mode value for each cluster and attribute.
    km_mod3.centroids[km_mod3.centroids["CLUSTER_ID"]>=3]

    Shows the clustering gt 3.

    Cluster ID 3 has users with the highest mean for Y_BOX_GAMES and CUST_INCOME_LEVEL.

  • For the model km_mod3, use the model attribute cluster_hists to view the cluster histogram details. To see the CUST_INCOME_LEVEL attribute's histogram details for Cluster ID 3, run the following script:
    mod_histogram=km_mod3.cluster_hists
     
    z.show(mod_histogram[(mod_histogram['cluster.id']==3) & 
           ((mod_histogram['variable']=='CUST_INCOME_LEVEL:A: Below 30,000') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:B: 30,000 - 49,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:C: 50,000 - 69,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:D: 70,000 - 89,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:E: 90,000 - 109,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:F: 110,000 - 129,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:G: 130,000 - 149,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:H: 150,000 - 169,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:I: 170,000 - 189,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:J: 190,000 - 249,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:K: 250,000 - 299,999') | 
            (mod_histogram['variable']=='CUST_INCOME_LEVEL:L: 300,000 and above'))])

    Shows the mod histogram for km_mod3

    This histogram groups cluster-id 3 into bins based on the CUST_INCOME_LEVEL. The bin with the highest count of customers earns salaries ranging from 190,000 to 249,999 annually.
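
    To chart these bin counts on the client instead of relying on z.show, you can filter the histogram frame and plot it with matplotlib. This is a minimal sketch; it assumes cluster_hists returns a pandas DataFrame with a count column, consistent with the filtering above:

    hist3 = mod_histogram[(mod_histogram['cluster.id'] == 3) &
                          (mod_histogram['variable'].str.startswith('CUST_INCOME_LEVEL'))]
    plt.barh(hist3['variable'], hist3['count'])   # one bar per income-level bin
    plt.xlabel('Count')
    plt.title('CUST_INCOME_LEVEL histogram for cluster 3')
    plt.show()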

  • Check the support and confidence of the leaf clusters (3, 4, and 5) using the model attribute rules, which gives the conditions for a case to be assigned with some probability to a cluster. Support and confidence are metrics that describe the relationship between clustering rules and cases. Support is the percentage of cases for which the rule holds. Confidence is the probability that a case described by this rule is assigned to the cluster.

    km_mod3.rules

    Shows the rules.

    The column headers in the above data frame specify the following:

    • cluster.id: The ID of a cluster in the model
    • rhs.support: The record count
    • rhs.conf: The record confidence
    • lhs.support: The rule support
    • lhs.conf: The rule confidence
    • lhs.var: The attribute predicate name
    • lhs.var.support: The attribute predicate support
    • lhs.var.conf: The attribute predicate confidence
    • predicate: The attribute predicate
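
    For example, to inspect the rule for a single leaf cluster, filter on the cluster.id column. This minimal sketch assumes rules returns a pandas DataFrame, consistent with the output above:

    rules_df = km_mod3.rules
    z.show(rules_df[rules_df['cluster.id'] == 3])  # rules for leaf cluster 3 only
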
  • Run the following script to get the cluster id and the total number of data points present in each leaf node. The total number of data points at each tree level should always be conserved. The total number of data points present in the root node should equal the sum of all the data points present in the leaf nodes (1369+1442+1422=4233).
    z.show(km_mod3.leaf_cluster_counts)

Score

The clusters discovered by k-Means are used to score a new record by estimating the probabilities that the new record belongs to each of the k clusters. The cluster with the highest probability is assigned to the record.

  1. In this step, you will make predictions on CUSTOMER_DATA_CLEAN and add CUST_ID as a supplemental column so that you can uniquely associate scores with the data. To do so, run the following script:

    pred = km_mod3.predict(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
    z.show(pred)

    Shows the prediction for the test data.

  2. To make predictions that return the probability of membership in each cluster, use the predict_proba function.
    pred = km_mod3.predict_proba(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
    z.show(pred)

    Shows the predict prob.

  3. All of the preceding tasks can also be accomplished with Embedded Python Execution, which invokes user-defined Python functions in Python engines spawned and managed by the database environment. Use the oml.do_eval function to run a user-defined input function that builds a k-Means model, scores records, and displays the results.
    def build_km_1():
        
        setting = {'KMNS_ITERATIONS': 20,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}
       
        # Create a KM model object and fit it.
        try:
            oml.drop(model="CUST_CLUSTER_MODEL_EPE")
        except:
            pass
        km_mod_epe = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN, model_name = "CUST_CLUSTER_MODEL_EPE", case_id = 'CUST_ID')
         
        # Show model details.
        #km_mod_epe
        pred=(km_mod_epe.predict(CUSTOMER_DATA_CLEAN,  supplemental_cols =CUSTOMER_DATA_CLEAN[:, ['CUST_ID']]))
        return pred
     
    z.show(oml.do_eval(func = build_km_1))

    Shows the output for do_eval.

  4. Run the following script to display the probability score of customer c1 with CUST_ID (102308) belonging to each cluster.
    c1=CUSTOMER_DATA_CLEAN[CUSTOMER_DATA_CLEAN['CUST_ID']==102308]
    km_mod3.predict_proba(c1, supplemental_cols =c1['CUST_ID'])

    Shows the predict prob for 1 customer.

To sell a new gaming product, you should target customers who have already purchased Y_BOX_GAMES and have a high credit limit. You have successfully segmented the population into clusters, and the cluster with cluster ID 3 contains the target population: it has the greatest percentage of customers who have already purchased Y_BOX_GAMES, with a mean CUST_CREDIT_LIMIT of 8322. So, you can confidently target the customers in cluster 3 to sell the new gaming product.
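
As a final hedged sketch, assuming the CLUSTER_ID column name shown in the earlier predict output, you can materialize that target list directly from the model's predictions:

    # Score the cleaned data and keep only the customers assigned to cluster 3.
    pred = km_mod3.predict(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
    TARGETS = pred[pred['CLUSTER_ID'] == 3]
    print("Number of customers to target:", len(TARGETS))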