3.3.4 Evaluate

Evaluating a model means assessing how effectively it generalizes to new, unseen data. This involves comparing predictions to actual outcomes using metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on the model type. Evaluation identifies the model's strengths and weaknesses and guides further improvement or tuning.
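The classification metrics named above can be illustrated with a small, self-contained sketch. This is not OML-specific; the labels below are made up purely to show how the metrics are derived from true/false positives and negatives.

```python
# Toy binary classification result (hypothetical labels).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy  = (tp + tn) / len(y_true)                    # fraction of correct predictions
precision = tp / (tp + fp)                             # how many predicted positives are real
recall    = tp / (tp + fn)                             # how many real positives are found
f1        = 2 * precision * recall / (precision + recall)
```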

Information and Model Settings

Note:

To get a complete list of information on the settings available in the k-Means module, run the following script:
help(oml.algo.km)

The following steps show how to view the different model detail views.

  • Use km_mod1 to access the model details available through the k-Means model object, like the model settings, coefficients, fit details, and more.
     km_mod1

    Shows the output for km_mod1 model.

  • Use km_mod1.clusters to list the clusters information.
    z.show(km_mod1.clusters)

    Shows the clusters of km_mod1.

  • Run the following script to display the model's centroid for each column. It gives the mean and variance for numeric attributes and the mode for categorical attributes.
    z.show(km_mod1.centroids)

    Shows the centroids of km_mod1

  • To determine the optimal value of k, use the Elbow method. Assuming that k lies in a given range, search for the best k by running k-Means for each k in that range and computing the within-cluster variance. As k increases, the within-cluster variance decreases, because with more centers each data point is, on average, closer to its centroid; in the extreme case where every data point is its own center, the within-cluster variance is zero. Before the optimal value of k, each increase produces a relatively large drop in within-cluster variance, as the new clusters contain increasingly close data points. After the optimal value, the decrease slows, because the newly formed clusters are similar to one another. Choose the range of k so that the plot shows the sharp decrease up to the optimal value followed by the slow, almost linear decrease afterward. The resulting curve usually looks like an L shape, and the best k lies at the turning point, or elbow, of the L.
    incluster_sum = []
    for cluster in range(1, 9):
        # Fix the random seed so that runs are reproducible.
        setting = {'KMNS_ITERATIONS': 15, 'KMNS_RANDOM_SEED': 1}
        km_mod2 = oml.km(n_clusters = cluster, **setting).fit(CUSTOMER_DATA_CLEAN)
        # Record the magnitude of the model score as the
        # within-cluster dispersion for this k.
        incluster_sum.append(abs(km_mod2.score(CUSTOMER_DATA_CLEAN)))

    plt.plot(range(1, 9), incluster_sum)
    plt.title('The Elbow Method Graph')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Within-cluster sum of squares')
    plt.show()

    Shows the elbow graph.

    The elbow joint or the number of optimal clusters can be observed at k=3.
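The behavior the Elbow method relies on can be reproduced without OML or plotting. Below is a minimal sketch using a tiny 1-D k-means written in plain Python; the data values and the k range are hypothetical, chosen only to show the within-cluster sum of squares shrinking as k grows.

```python
def kmeans_wcss(data, k, iters=20):
    """Run a simple 1-D Lloyd's k-means and return the final
    within-cluster sum of squares (WCSS)."""
    # Spread the initial centers evenly across the data range.
    centers = [min(data) + (max(data) - min(data)) * i / max(k - 1, 1)
               for i in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    # WCSS: squared distance of every point to its nearest final center.
    return sum(min((x - c) ** 2 for c in centers) for x in data)

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]   # three obvious groups
wcss = [kmeans_wcss(data, k) for k in range(1, 5)]
```

With this data the drop from k=2 to k=3 is large, and from k=3 to k=4 it is small, so the elbow sits at k=3, mirroring the pattern in the graph above.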

  • Build the final model according to the optimal value of clusters.
    # Drop the model if it already exists, then rebuild it with the
    # optimal number of clusters (k = 3).
    try:
        oml.drop(model="CUST_CLUSTER_MODEL")
    except Exception:
        pass

    setting = {'KMNS_ITERATIONS': 20,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}
    km_mod3 = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN, model_name = "CUST_CLUSTER_MODEL", case_id = 'CUST_ID')
  • Run the following script to display the model's clusters and the parent-child relationship in the cluster hierarchy.
    z.show(km_mod3.clusters)

    Shows the clusters of km_mod3.

    The clusters with cluster_id 3, 4, and 5 are the leaf clusters, containing 1369, 1442, and 1422 data points respectively. The cluster with cluster_id 1 is the root of the binary tree, as its parent_cluster_id is nan.
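The parent-child structure described above can be checked with a small sketch. The tree shape below (root 1 with children 2 and 3, node 2 with children 4 and 5) is a hypothetical binary layout consistent with the leaf IDs and row counts quoted in this section.

```python
# Hypothetical parent -> children mapping mirroring the k-Means binary tree.
taxonomy = {1: [2, 3], 2: [4, 5]}
leaf_counts = {3: 1369, 4: 1442, 5: 1422}   # row counts from the model details

# Leaves are nodes that never appear as a parent.
children = {c for kids in taxonomy.values() for c in kids}
leaves = sorted(children - taxonomy.keys())

# The root must account for every data point in the leaves.
root_total = sum(leaf_counts[leaf] for leaf in leaves)
```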

  • Run the following script to get only the parent/child relationship.
    z.show(km_mod3.taxonomy)

    Shows the taxonomy.

  • To view the per-cluster, per-attribute center (centroid) information of the leaf clusters, use the model attribute centroids, which displays statistics such as the mean and variance for numeric attributes and the mode for categorical attributes in each cluster.
    km_mod3.centroids[km_mod3.centroids["CLUSTER_ID"]>=3]

    Shows the centroids for clusters with CLUSTER_ID greater than or equal to 3.

    Cluster ID 3 has users with the highest mean for Y_BOX_GAMES and CUST_INCOME_LEVEL.
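How a mean/variance/mode centroid can be derived per cluster is easy to sketch with the standard library. The rows below are made-up (cluster_id, income, income_level) tuples, and population variance is used purely for illustration.

```python
from statistics import mean, pvariance, mode

# Hypothetical rows: (cluster_id, numeric income, categorical income level).
rows = [(3, 50, 'A'), (3, 60, 'A'), (3, 70, 'B'),
        (4, 20, 'B'), (4, 30, 'B')]

centroids = {}
for cid in sorted({r[0] for r in rows}):
    nums = [r[1] for r in rows if r[0] == cid]
    cats = [r[2] for r in rows if r[0] == cid]
    centroids[cid] = {'mean': mean(nums),          # numeric attribute: mean
                      'variance': pvariance(nums), # numeric attribute: variance
                      'mode': mode(cats)}          # categorical attribute: mode
```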

  • For the model km_mod3, use the model attribute cluster_hists to view the cluster histogram details. The following script shows the histogram details of the CUST_INCOME_LEVEL attribute for Cluster ID 3.
    mod_histogram = km_mod3.cluster_hists

    # Select the CUST_INCOME_LEVEL bins for cluster 3.
    income_levels = ['CUST_INCOME_LEVEL:A: Below 30,000',
                     'CUST_INCOME_LEVEL:B: 30,000 - 49,999',
                     'CUST_INCOME_LEVEL:C: 50,000 - 69,999',
                     'CUST_INCOME_LEVEL:D: 70,000 - 89,999',
                     'CUST_INCOME_LEVEL:E: 90,000 - 109,999',
                     'CUST_INCOME_LEVEL:F: 110,000 - 129,999',
                     'CUST_INCOME_LEVEL:G: 130,000 - 149,999',
                     'CUST_INCOME_LEVEL:H: 150,000 - 169,999',
                     'CUST_INCOME_LEVEL:I: 170,000 - 189,999',
                     'CUST_INCOME_LEVEL:J: 190,000 - 249,999',
                     'CUST_INCOME_LEVEL:K: 250,000 - 299,999',
                     'CUST_INCOME_LEVEL:L: 300,000 and above']

    z.show(mod_histogram[(mod_histogram['cluster.id'] == 3) &
                         (mod_histogram['variable'].isin(income_levels))])

    Shows the mod histogram for km_mod3

    This histogram groups the customers in cluster 3 into bins based on CUST_INCOME_LEVEL. The bin with the highest customer count corresponds to annual incomes from 190,000 to 249,999.
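The binning idea itself can be shown with a runnable sketch. The incomes and the bin boundaries below are hypothetical, loosely mirroring the CUST_INCOME_LEVEL labels.

```python
from collections import Counter

# Hypothetical income bins (label -> inclusive range).
bins = {'B: 30,000 - 49,999': (30000, 49999),
        'E: 90,000 - 109,999': (90000, 109999),
        'J: 190,000 - 249,999': (190000, 249999)}

def to_bin(income):
    # Map an income value to its bin label.
    for label, (lo, hi) in bins.items():
        if lo <= income <= hi:
            return label
    return 'other'

# Hypothetical annual incomes of customers in one cluster.
incomes = [35000, 200000, 210000, 48000, 240000, 95000]
hist = Counter(to_bin(x) for x in incomes)
top_bin, top_count = hist.most_common(1)[0]   # the tallest histogram bar
```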

  • Check the support and confidence of the leaf clusters (3, 4, and 5) using the model attribute rules, which gives the conditions for a case to be assigned with some probability to a cluster. Support and confidence are metrics that describe the relationships between clustering rules and cases. Support is the percentage of cases for which the rule holds. Confidence is the probability that a case described by the rule is assigned to the cluster.

    km_mod3.rules

    Shows the rules.

    The column headers in the above data frame specify the following:

    • cluster.id: The ID of a cluster in the model
    • rhs.support: The record count
    • rhs.conf: The record confidence
    • lhs.support: The rule support
    • lhs.conf: The rule confidence
    • lhs.var: The attribute predicate name
    • lhs.var.support: The attribute predicate support
    • lhs.var.conf: The attribute predicate confidence
    • predicate: The attribute predicate
  • Run the following script to get the cluster id and the total number of data points present in each leaf node. The total number of data points at each tree level should always be conserved. The total number of data points present in the root node should equal the sum of all the data points present in the leaf nodes (1369+1442+1422=4233).
    z.show(km_mod3.leaf_cluster_counts)
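The support and confidence definitions above can be made concrete with a toy computation; the cases and the rule below are hypothetical.

```python
# Each case: (rule_matches, assigned_cluster_id) -- made-up data.
cases = [(True, 3), (True, 3), (True, 4), (False, 3), (False, 5)]
rule_cluster = 3   # the cluster this rule describes

# Support: fraction of all cases for which the rule holds.
matching = [cluster for matches, cluster in cases if matches]
support = len(matching) / len(cases)

# Confidence: probability that a case matching the rule
# is assigned to the rule's cluster.
confidence = matching.count(rule_cluster) / len(matching)
```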

Score

The clusters discovered by k-Means are used to score a new record by estimating the probabilities that the new record belongs to each of the k clusters. The cluster with the highest probability is assigned to the record.
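The assignment rule reduces to an argmax over the per-cluster probabilities, as in this minimal sketch (the probability values are made up):

```python
# Hypothetical per-cluster probabilities for one scored record.
probabilities = {3: 0.12, 4: 0.71, 5: 0.17}

# Assign the record to the cluster with the highest probability.
assigned_cluster = max(probabilities, key=probabilities.get)
```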

  1. In this step, you will make predictions on CUSTOMER_DATA_CLEAN and add CUST_ID as a supplemental column so that you can uniquely associate scores with the data. To do so, run the following script:

    pred = km_mod3.predict(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
    z.show(pred)

    Shows the prediction for the test data.

  2. To make predictions that return the probability for each cluster, use the predict_proba function.
    pred = km_mod3.predict_proba(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[["CUST_ID"]])
    z.show(pred)

    Shows the predict prob.

  3. All of the above tasks can also be accomplished with Embedded Python Execution, which lets you invoke user-defined Python functions in Python engines spawned and managed by the database environment. Use the oml.do_eval function to run a user-defined function that builds a k-Means model, scores the records, and displays the results.
    def build_km_1():
        setting = {'KMNS_ITERATIONS': 20,
                   'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
                   'KMNS_NUM_BINS': 10,
                   'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
                   'PREP_AUTO': 'ON'}

        # Drop any existing model, then create a KM model object and fit it.
        try:
            oml.drop(model="CUST_CLUSTER_MODEL_EPE")
        except Exception:
            pass
        km_mod_epe = oml.km(n_clusters = 3, **setting).fit(CUSTOMER_DATA_CLEAN, model_name = "CUST_CLUSTER_MODEL_EPE", case_id = 'CUST_ID')

        # Score the data, keeping CUST_ID as a supplemental column.
        pred = km_mod_epe.predict(CUSTOMER_DATA_CLEAN, supplemental_cols = CUSTOMER_DATA_CLEAN[:, ['CUST_ID']])
        return pred

    z.show(oml.do_eval(func = build_km_1))

    Shows the output for do_eval.

  4. Run the following script to display the probability score of customer c1 (CUST_ID 102308) belonging to each cluster.
    c1=CUSTOMER_DATA_CLEAN[CUSTOMER_DATA_CLEAN['CUST_ID']==102308]
    km_mod3.predict_proba(c1, supplemental_cols =c1['CUST_ID'])

    Shows the predict prob for 1 customer.

To sell a new gaming product, you must target customers who have already purchased Y_BOX_GAMES and have a high credit limit. You have successfully segmented the population into clusters, and cluster 3 contains the target population: it has the greatest percentage of customers who have already purchased Y_BOX_GAMES, with a mean CUST_CREDIT_LIMIT of 8322. So, you can confidently target the customers in cluster 3 to sell the new gaming product.