This chapter describes clustering, the unsupervised mining function for discovering natural groupings in the data.
See Also:
"Unsupervised Data Mining"
This chapter includes the following topics:
Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
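As a rough illustration of this goal, the following Python sketch compares the average distance between cases inside a cluster with the average distance between cases in different clusters. Low distance stands in for high similarity. The data, the function names, and the choice of Euclidean distance are assumptions made for the example; they are not taken from Oracle Data Mining.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs in one cluster (needs >= 2 points)."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def mean_cross_distance(cluster_a, cluster_b):
    """Average Euclidean distance between members of two different clusters."""
    total = sum(dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical two-attribute cases, already assigned to two clusters.
cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

intra_1 = mean_pairwise_distance(cluster_1)
intra_2 = mean_pairwise_distance(cluster_2)
inter = mean_cross_distance(cluster_1, cluster_2)

# A good segmentation has small intra-cluster and large inter-cluster distances.
print(f"intra: {intra_1:.2f}, {intra_2:.2f}; inter: {inter:.2f}")
```

In this toy data the distances within each cluster are much smaller than the distances between the two clusters, which is the pattern a high-quality clustering is expected to show.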
Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously defined classes, which are specified in a target. Clustering models do not use a target.
Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.
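One simple way to express "does not fit well into any cluster" is a distance threshold: a case is flagged when even its nearest centroid is far away. The sketch below assumes this interpretation. The threshold `max_distance`, the centroids, and the case data are hypothetical, and the approach is not necessarily how Oracle Data Mining identifies outliers.

```python
from math import dist

def nearest_centroid(case, centroids):
    """Return (cluster_id, distance) for the centroid closest to the case."""
    return min(
        ((cluster_id, dist(case, center)) for cluster_id, center in centroids.items()),
        key=lambda pair: pair[1],
    )

def find_outliers(cases, centroids, max_distance):
    """Return IDs of cases whose nearest centroid is farther than max_distance."""
    return [
        case_id
        for case_id, case in cases.items()
        if nearest_centroid(case, centroids)[1] > max_distance
    ]

# Hypothetical centroids of two clusters and three cases to check.
centroids = {1: (1.0, 1.0), 2: (8.0, 8.0)}
cases = {"a": (1.2, 0.9), "b": (7.5, 8.4), "c": (4.5, 15.0)}

outliers = find_outliers(cases, centroids, max_distance=3.0)
print(outliers)
```

Cases `a` and `b` lie close to a centroid. Case `c` is far from both, so it is the only one reported.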
Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making?
You can analyze clusters by examining information generated by the clustering algorithm. Oracle Data Mining generates the following information about each cluster:
Position in the cluster hierarchy, described in "Cluster Rules"
Rule for the position in the hierarchy, described in "Cluster Rules"
Attribute histograms, described in "Attribute Histograms"
Cluster centroid, described in "Centroid of a Cluster"
As with other forms of data mining, the process of clustering may be iterative and may require the creation of several models. The removal of irrelevant attributes or the introduction of new attributes may improve the quality of the segments produced by a clustering model.
There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:
Hierarchical: Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.
Note:
The clustering algorithms supported by Oracle Data Mining perform hierarchical clustering.
Partitioning: Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.
Locality-based: Groups neighboring data objects into clusters based on local conditions.
Grid-based: Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
Reference:
Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.
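The grid-based approach described above can be sketched in a few lines of Python for the two-dimensional case. This is a generic illustration of the three steps (divide into cells, discard low-density cells, merge adjacent high-density cells). It is not the O-Cluster algorithm. The function name and the parameters `cell_size` and `min_count` are assumptions made for the example.

```python
from collections import Counter, deque

def grid_clusters(points, cell_size, min_count):
    """Label 2-D points by merging edge-adjacent dense grid cells.

    Points that fall in low-density (discarded) cells are labeled None.
    """
    def cell_of(point):
        # Integer coordinates of the square cell that contains the point.
        return (int(point[0] // cell_size), int(point[1] // cell_size))

    # Step 1: divide the input space into cells and count cases per cell.
    counts = Counter(cell_of(p) for p in points)
    # Step 2: discard cells with fewer than min_count cases.
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Step 3: flood-fill over dense cells that share an edge.
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for neighbor in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if neighbor in dense and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1

    return [labels.get(cell_of(p)) for p in points]

# Hypothetical data: two dense regions and one isolated case.
points = [(0.1, 0.1), (0.4, 0.6), (1.2, 0.3), (1.7, 0.8),
          (5.1, 5.2), (5.6, 5.9), (9.5, 0.5)]
assignments = grid_clusters(points, cell_size=1.0, min_count=2)
print(assignments)
```

The first four points occupy two neighboring dense cells and are merged into one cluster. The last point sits alone in a low-density cell and receives no cluster.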
Oracle Data Mining performs hierarchical clustering. The leaf clusters are the final clusters generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.
Rules describe the data in each cluster. A rule is a conditional statement that captures the logic used to split a parent cluster into child clusters. A rule describes the conditions for a case to be assigned with some probability to a cluster. For example, the following rule applies to cases that are assigned to cluster 19:
IF
  OCCUPATION in Cleric. AND OCCUPATION in Crafts AND OCCUPATION in Exec. AND OCCUPATION in Prof.
  CUST_GENDER in M
  COUNTRY_NAME in United States of America
  CUST_MARITAL_STATUS in Married
  AFFINITY_CARD in 1.0
  EDUCATION in < Bach. AND EDUCATION in Bach. AND EDUCATION in HS-grad AND EDUCATION in Masters
  CUST_INCOME_LEVEL in B: 30,000 - 49,999 AND CUST_INCOME_LEVEL in E: 90,000 - 109,999
  AGE lessOrEqual 0.7 AND AGE greaterOrEqual 0.2
THEN
  Cluster equal 19.0
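One way to read such a rule is as a set of per-attribute conditions that a case must satisfy. The Python sketch below rests on two assumptions that the text does not confirm: that repeated "in" conditions on one attribute mean "one of these values", and that conditions on different attributes must all hold. It covers only part of the rule and ignores the probability attached to an assignment. The function and variable names are hypothetical.

```python
def matches_rule(case, allowed_values, numeric_ranges):
    """Return True if the case satisfies every condition of the rule."""
    # Categorical conditions: the case value must be one of the listed values.
    for attribute, allowed in allowed_values.items():
        if case.get(attribute) not in allowed:
            return False
    # Numeric conditions: the case value must lie within [low, high].
    for attribute, (low, high) in numeric_ranges.items():
        value = case.get(attribute)
        if value is None or not (low <= value <= high):
            return False
    return True

# A subset of the conditions from the rule for cluster 19, copied as they appear.
allowed_values = {
    "OCCUPATION": {"Cleric.", "Crafts", "Exec.", "Prof."},
    "CUST_GENDER": {"M"},
    "CUST_MARITAL_STATUS": {"Married"},
}
numeric_ranges = {"AGE": (0.2, 0.7)}

# Hypothetical cases; AGE uses the same scale as the bounds in the rule.
case_in = {"OCCUPATION": "Exec.", "CUST_GENDER": "M",
           "CUST_MARITAL_STATUS": "Married", "AGE": 0.45}
case_out = {"OCCUPATION": "Exec.", "CUST_GENDER": "M",
            "CUST_MARITAL_STATUS": "Married", "AGE": 0.9}

print(matches_rule(case_in, allowed_values, numeric_ranges))
print(matches_rule(case_out, allowed_values, numeric_ranges))
```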
In Oracle Data Miner, a histogram represents the distribution of the values of an attribute in a cluster. Figure 7-1 shows a histogram for the distribution of occupations in a cluster of customer data.
In this cluster, about 13% of the customers are craftsmen, about 13% are executives, 2% are farmers, and so on. None of the customers in this cluster are in the armed forces or work in housing sales.
Figure 7-1 Histogram in Oracle Data Miner
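The percentages shown in such a histogram can be derived by counting attribute values among the cases assigned to a cluster. The following Python sketch shows the idea with hypothetical data. The function name is an assumption, and Oracle Data Miner may bin or present values differently.

```python
from collections import Counter

def attribute_histogram(cases, attribute):
    """Return {value: percent of cases} for one attribute within a cluster."""
    counts = Counter(case[attribute] for case in cases)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical cases assigned to a single cluster.
cluster_cases = [
    {"OCCUPATION": "Crafts"},
    {"OCCUPATION": "Exec."},
    {"OCCUPATION": "Exec."},
    {"OCCUPATION": "Prof."},
]

histogram = attribute_histogram(cluster_cases, "OCCUPATION")
print(histogram)
```

Values that never occur in the cluster, such as the armed forces in the example above, simply do not appear in the result.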
The centroid represents the most typical case in a cluster. For example, in a data set of customer ages and incomes, the centroid of each cluster would be a customer of average age and average income in that cluster. If the data set included gender, the centroid would have the gender most frequently represented in the cluster. Figure 7-1 shows the centroid values for a cluster.
The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster. The attribute values for the centroid are the mean of the numerical attributes and the mode of the categorical attributes.
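The definition above (mean of the numerical attributes, mode of the categorical attributes) can be sketched in Python as follows. The case data and the function name are hypothetical, and the sketch does not address ties in the mode or missing values.

```python
from collections import Counter
from statistics import mean

def centroid(cases, numeric_attrs, categorical_attrs):
    """Build a prototype case: mean of numeric and mode of categorical attributes."""
    prototype = {}
    for attribute in numeric_attrs:
        prototype[attribute] = mean(case[attribute] for case in cases)
    for attribute in categorical_attrs:
        values = Counter(case[attribute] for case in cases)
        prototype[attribute] = values.most_common(1)[0][0]
    return prototype

# Hypothetical cases assigned to one cluster.
cluster_cases = [
    {"AGE": 30, "INCOME": 40000, "CUST_GENDER": "M"},
    {"AGE": 40, "INCOME": 50000, "CUST_GENDER": "F"},
    {"AGE": 50, "INCOME": 90000, "CUST_GENDER": "M"},
]

prototype = centroid(cluster_cases, ["AGE", "INCOME"], ["CUST_GENDER"])
print(prototype)
```

The resulting prototype (age 40, income 60000, gender M) matches none of the three cases, which illustrates that the centroid need not describe any actual case in the cluster.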
These examples use the clustering model km_sh_clus_sample, created by one of the Oracle Data Mining sample programs, to show how clustering might be used to find natural groupings in the build data or to score new data.
Figure 7-2 shows six columns and ten rows from the case table used to build the model. Note that no column is designated as a target.
See Also:
Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs
Suppose you want to segment your customer data before performing further analysis. You could analyze the metrics generated for the data by the clustering algorithm. Figure 7-3 shows clustering details displayed in Oracle Data Miner. The details describe the yrs_residence attribute in cluster 3. They show that 20% of customers have been in their current residence for 2 years, almost 25% have been in their current residence for 3 years, and so on.
Figure 7-3 Cluster Information for the Build Data
Suppose you want to segment a database of regional customer data for marketing research purposes. You might experiment by using a clustering model that you developed for a different region. Figure 7-4 shows some of the cluster assignments in the scored customer data. It shows a 95.5% probability that customer 100,001 is a member of cluster 15, an 89% probability that customer 100,002 is in cluster 6, and so on.
Note:
Oracle Data Miner displays the generalized case ID in the DMR$CASE_ID column of the apply output table. The cluster assignment for each case is displayed in the CLUSTER_ID column. The probability of membership in that cluster is displayed in the PROBABILITY column.
The conditions of membership in a cluster are described in a rule. Figure 7-5 shows the rule for cluster 15.
Oracle Data Mining supports two clustering algorithms: an enhanced version of k-means, and an Oracle proprietary algorithm called Orthogonal Partitioning Clustering (O-Cluster). Both algorithms perform hierarchical clustering.
The main characteristics of the enhanced k-means and O-Cluster algorithms are compared in Table 7-1.
Table 7-1 Clustering Algorithms Compared
Feature | Enhanced k-Means | O-Cluster |
---|---|---|
Clustering methodology | Distance-based | Grid-based |
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling |
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes |
Number of clusters | User-specified | Automatically determined |
Hierarchical clustering | Yes | Yes |
Probabilistic cluster assignment | Yes | Yes |
See Also:
Chapter 13, "k-Means" for information about Oracle's implementation of the k-Means algorithm
Chapter 17, "O-Cluster" for information about the Oracle-proprietary algorithm, Orthogonal Partitioning Clustering