3.3.4 Evaluate
Evaluate a model by assessing its performance using various metrics and techniques to determine how effectively it generalizes to new, unseen data. This process involves comparing predictions to actual outcomes using metrics like accuracy, precision, recall, F1 score, or mean squared error, depending on the model type. The evaluation helps identify the model's strengths and weaknesses, guiding further improvement or tuning.
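As a minimal, hedged illustration of such metrics (this use case evaluates clusters rather than labeled predictions, and scikit-learn is not used elsewhere in this workflow), classification metrics can be computed from made-up actual and predicted labels like this:

# Hypothetical illustration with made-up labels; not part of this use case.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # actual outcomes (made up)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (made up)

print("accuracy :", accuracy_score(y_true, y_pred))    # 5 of 6 correct
print("precision:", precision_score(y_true, y_pred))   # 3 of 3 predicted positives correct
print("recall   :", recall_score(y_true, y_pred))      # 3 of 4 actual positives found
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall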
Information and Model Settings
Note:
To get a complete list of information on the settings available in the k-Means module, run the script below:
help(oml.algo.km)
The following steps help you explore the different model detail views.
- Use km_mod1 to access the model details available through the k-Means model object, such as the model settings, coefficients, fit details, and more.
km_mod1
- Use km_mod1.clusters to list the cluster information.
z.show(km_mod1.clusters)
- Run the following script to display the model's centroid for each column. It gives the mean and variance for a numeric attribute and the mode for a categorical attribute.
z.show(km_mod1.centroids)
- To determine the optimal value of k, use the Elbow method. Assuming that k lies in a given range, search for the best k by running k-Means for each k in that range. For each k, find the within-cluster variance. As the value of k increases, the within-cluster variance should decrease, because with more centers each data point is, on average, closer to its assigned centroid. If every data point were its own center, the within-cluster variance would be zero. Before the optimal value of k, the decrease in within-cluster variance is relatively large, as the new clusters contain increasingly close data points. After the optimal value of k, the decrease slows, because the newly formed clusters are similar to one another. Choose the range of k so that the plot shows the sharp decrease up to the optimal value and the slow, almost linear decrease afterward. The resulting curve usually looks like an L shape, and the best k lies at the turning point, or elbow, of the L.
import matplotlib.pyplot as plt

setting = {'KMNS_ITERATIONS': 15, 'KMNS_RANDOM_SEED': 1}
incluster_sum = []
for cluster in range(1, 9):
    km_mod2 = oml.km(n_clusters = cluster, **setting).fit(CUSTOMER_DATA_CLEAN)
    incluster_sum.append(abs(km_mod2.score(CUSTOMER_DATA_CLEAN)))

plt.plot(range(1, 9), incluster_sum)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('wcss_list')
plt.show()
The elbow, and therefore the optimal number of clusters, can be observed at k=3.
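As an optional, hedged sketch (not part of the original steps), the elbow can also be estimated programmatically from the incluster_sum list computed above, by looking for the point of greatest curvature:

import numpy as np

# Heuristic sketch: the elbow is roughly where the WCSS curve bends most,
# i.e. where the discrete second difference is largest.
wcss = np.array(incluster_sum)                # one value per k in range(1, 9)
second_diff = np.diff(wcss, n=2)              # curvature estimate
elbow_k = int(np.argmax(second_diff)) + 2     # +2: diff is applied twice and k starts at 1
print("Estimated elbow at k =", elbow_k)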
- Build the final model using the optimal number of clusters.
try:
    oml.drop(model = "CUST_CLUSTER_MODEL")
except:
    pass

setting = {'KMNS_ITERATIONS': 20,
           'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
           'KMNS_NUM_BINS': 10,
           'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
           'PREP_AUTO': 'ON'}

km_mod3 = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN,
                                                model_name = "CUST_CLUSTER_MODEL",
                                                case_id = 'CUST_ID')
- Run the following script to display the model's clusters and the parent-child relationship in the cluster hierarchy.
z.show(km_mod3.clusters)
The clusters with cluster_id 3, 4, and 5 are the leaf clusters, containing 1369, 1442, and 1422 data points, respectively. The cluster with cluster_id equal to 1 is the root of the binary tree, as its parent_cluster_id is nan.
- Run the following script to get only the parent/child relationship.
z.show(km_mod3.taxonomy)
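As a hedged sketch, you could rebuild the parent-to-child mapping from the taxonomy output; the column names used here are assumptions, so adjust them to match what z.show displays:

from collections import defaultdict

# Hypothetical sketch: column names are assumptions based on the display above.
tax = km_mod3.taxonomy
children = defaultdict(list)
for parent, child in zip(tax['PARENT_CLUSTER_ID'], tax['CHILD_CLUSTER_ID']):
    children[parent].append(child)
print(dict(children))   # maps each parent cluster to its child clusters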
- To view the per cluster-attribute center (centroid) information of the leaf clusters, use the model attribute centroids, which displays statistics such as the mean, variance, and mode value for each cluster and attribute.
km_mod3.centroids[km_mod3.centroids["CLUSTER_ID"]>=3]
Cluster ID 3 has users with the highest mean for Y_BOX_GAMES and CUST_INCOME_LEVEL.
- For the model km_mod3, use the model attribute cluster_hists to view the cluster histogram details. The following script shows the CUST_INCOME_LEVEL attribute's histogram details for Cluster ID 3.
mod_histogram = km_mod3.cluster_hists
z.show(mod_histogram[(mod_histogram['cluster.id']==3) &
                     ((mod_histogram['variable']=='CUST_INCOME_LEVEL:A: Below 30,000') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:B: 30,000 - 49,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:C: 50,000 - 69,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:D: 70,000 - 89,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:E: 90,000 - 109,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:F: 110,000 - 129,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:G: 130,000 - 149,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:H: 150,000 - 169,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:I: 170,000 - 189,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:J: 190,000 - 249,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:K: 250,000 - 299,999') |
                      (mod_histogram['variable']=='CUST_INCOME_LEVEL:L: 300,000 and above'))])
This histogram groups cluster-id 3 into bins based on CUST_INCOME_LEVEL. The bin with the highest count contains customers earning between 190,000 and 249,999 annually.
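As a hedged alternative, assuming cluster_hists returns a pandas DataFrame, the long chain of comparisons above can be written more compactly with a string-prefix filter:

# Compact equivalent of the filter above (assumes a pandas DataFrame).
hist = km_mod3.cluster_hists
income_bins = hist[(hist['cluster.id'] == 3) &
                   hist['variable'].str.startswith('CUST_INCOME_LEVEL')]
z.show(income_bins)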
- Check the support and confidence of the leaf clusters (3, 4, and 5) using the model attribute rules, which gives the conditions for a case to be assigned with some probability to a cluster. Support and confidence are metrics that describe the relationship between clustering rules and cases. Support is the percentage of cases for which the rule holds. Confidence is the probability that a case described by the rule is assigned to the cluster. A small numeric illustration follows the column descriptions below.
km_mod3.rules
The column headers in the above data frame specify the following:
- cluster.id: The ID of a cluster in the model
- rhs.support: The record count
- rhs.conf: The record confidence
- lhs.support: The rule support
- lhs.conf: The rule confidence
- lhs.var: The attribute predicate name
- lhs.var.support: The attribute predicate support
- lhs.var.conf: The attribute predicate confidence
- predicate: The attribute predicate
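As a small numeric illustration of support, using the leaf-cluster counts listed earlier (the exact rhs.support values come from the rules output itself):

# Illustration only: a leaf cluster's rule support expressed as a fraction
# of all cases, using the counts reported for this model.
leaf_counts = {3: 1369, 4: 1442, 5: 1422}
total = sum(leaf_counts.values())            # 4233 records overall
for cid, n in leaf_counts.items():
    print(f"cluster {cid}: {n} cases, {n / total:.1%} of the data")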
- Run the following script to get the cluster ID and the total number of data points present in each leaf node. The total number of data points at each tree level is conserved: the number of data points in the root node equals the sum of the data points in the leaf nodes (1369 + 1442 + 1422 = 4233).
z.show(km_mod3.leaf_cluster_counts)
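A quick, hedged sanity check of this conservation property, assuming the displayed counts appear in a column named CNT (adjust to the actual column name shown by z.show):

# Sketch: leaf counts should sum to the total number of scored records.
leaf = km_mod3.leaf_cluster_counts
print(leaf['CNT'].sum())   # expected: 1369 + 1442 + 1422 = 4233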
Score
The clusters discovered by k-Means are used to score a new record by estimating the probability that the new record belongs to each of the k clusters. The record is then assigned to the cluster with the highest probability.
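For intuition, here is a tiny sketch with made-up probabilities showing how that assignment works:

# Hypothetical per-cluster probabilities for one new record.
probs = {3: 0.12, 4: 0.05, 5: 0.83}
assigned = max(probs, key=probs.get)            # pick the most probable cluster
print("Record assigned to cluster", assigned)   # cluster 5 here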
- In this step, you make predictions on CUSTOMER_DATA_CLEAN and add CUST_ID as a supplemental column so that you can uniquely associate scores with the data. To do so, run the script below:
pred = km_mod3.predict(CUSTOMER_DATA_CLEAN,
                       supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
z.show(pred)
- To make predictions that return the probability for each cluster, use the predict_proba function.
pred = km_mod3.predict_proba(CUSTOMER_DATA_CLEAN,
                             supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
z.show(pred)
- All of the above tasks can also be performed with Embedded Python Execution, which lets you invoke user-defined Python functions in Python engines spawned and managed by the database environment. Use the oml.do_eval function to run a user-defined function that builds a k-Means model, scores records, and displays the results.
def build_km_1():
    setting = {'KMNS_ITERATIONS': 20,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}
    # Drop any existing model with the same name, then create and fit a KM model.
    try:
        oml.drop(model = "CUST_CLUSTER_MODEL_EPE")
    except:
        pass
    km_mod_epe = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN,
                                                       model_name = "CUST_CLUSTER_MODEL_EPE",
                                                       case_id = 'CUST_ID')
    # Show model details.
    # km_mod_epe
    # Score the records, keeping CUST_ID as a supplemental column.
    pred = km_mod_epe.predict(CUSTOMER_DATA_CLEAN,
                              supplemental_cols = CUSTOMER_DATA_CLEAN[:, ['CUST_ID']])
    return pred

z.show(oml.do_eval(func = build_km_1))
- Run the following script to display the probability of customer c1, with CUST_ID 102308, belonging to each cluster.
c1 = CUSTOMER_DATA_CLEAN[CUSTOMER_DATA_CLEAN['CUST_ID']==102308]
km_mod3.predict_proba(c1, supplemental_cols = c1['CUST_ID'])
To sell a new gaming product, you must target customers who have already purchased Y_BOX_GAMES and have a high credit limit. You have successfully segmented the population into clusters, and the cluster with cluster-id 3 contains the target population: it has the highest percentage of customers who have already purchased Y_BOX_GAMES, with a mean CUST_CREDIT_LIMIT of 8322. You can therefore confidently target the customers in cluster-id 3 to sell the new gaming product.
Parent topic: Clustering Use Case