Oracle® Data Mining Concepts 11g Release 2 (11.2) Part Number E16808-03


This chapter describes clustering, the unsupervised mining function for discovering natural groupings in the data.
See Also:
"Unsupervised Data Mining"
Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
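The quality goal above can be illustrated with a minimal sketch. It assumes Euclidean distance as the (inverse) measure of similarity and uses hypothetical toy data; the function names are illustrative and are not part of Oracle Data Mining. In a good clustering, the average distance within clusters is small relative to the average distance between clusters.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [
        dist(a, b)
        for points in clusters.values()
        for a, b in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(a, b)
        for points_a, points_b in combinations(clusters.values(), 2)
        for a in points_a
        for b in points_b
    ]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two well-separated groups of 2-D points.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A small intra-cluster distance corresponds to high intra-cluster similarity, and a large inter-cluster distance corresponds to low inter-cluster similarity.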
Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously defined classes, which are specified in a target. Clustering models do not use a target.
Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.
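One simple way to picture this use of clustering is to flag cases that lie far from every cluster centroid. The following is a minimal sketch under stated assumptions: Euclidean distance, a hypothetical `max_distance` threshold, and hypothetical case IDs. It does not describe how Oracle Data Mining identifies outliers.

```python
from math import dist


def find_outliers(cases, centroids, max_distance):
    """Return IDs of cases whose nearest centroid is farther than max_distance."""
    outliers = []
    for case_id, point in cases.items():
        nearest = min(dist(point, centroid) for centroid in centroids)
        if nearest > max_distance:
            outliers.append(case_id)
    return outliers


# Hypothetical centroids and cases (case ID -> attribute values).
centroids = [(0.0, 0.0), (10.0, 10.0)]
cases = {101: (0.5, 0.2), 102: (9.8, 10.1), 103: (5.0, 5.0)}

outliers = find_outliers(cases, centroids, max_distance=2.0)
print(outliers)
```

The threshold is a judgment call that depends on the scale of the data; a stricter value flags more cases as anomalies.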
Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making?
You can analyze clusters by examining information generated by the clustering algorithm. Oracle Data Mining generates the following information about each cluster:
Position in the cluster hierarchy, described in "Cluster Rules"
Rule for the position in the hierarchy, described in "Cluster Rules"
Attribute histograms, described in "Attribute Histograms"
Cluster centroid, described in "Centroid of a Cluster"
As with other forms of data mining, the process of clustering may be iterative and may require the creation of several models. The removal of irrelevant attributes or the introduction of new attributes may improve the quality of the segments produced by a clustering model.
There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:
Hierarchical: Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.
Note:
The clustering algorithms supported by Oracle Data Mining perform hierarchical clustering.
Partitioning: Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.
Locality-based: Groups neighboring data objects into clusters based on local conditions.
Grid-based: Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
Reference:
Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.
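The grid-based approach described above can be sketched in a few lines. This is a generic illustration only, not the O-Cluster algorithm: it assumes fixed-size cells, a hypothetical `min_points` density threshold, and treats two cells as adjacent when they share a face.

```python
from collections import Counter, deque
from math import floor


def grid_clusters(points, cell_size, min_points):
    """Label each point by merging adjacent high-density grid cells."""

    def cell_of(point):
        # Map a point to the index of the hyper-rectangular cell containing it.
        return tuple(floor(x / cell_size) for x in point)

    # 1. Count the points that fall into each cell.
    counts = Counter(cell_of(p) for p in points)
    # 2. Discard low-density cells.
    dense = {cell for cell, n in counts.items() if n >= min_points}

    # 3. Combine face-adjacent dense cells with a breadth-first search.
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if neighbor in dense and neighbor not in labels:
                        labels[neighbor] = next_label
                        queue.append(neighbor)
        next_label += 1

    # Points in discarded cells receive None instead of a cluster label.
    return [labels.get(cell_of(p)) for p in points]


# Hypothetical 2-D data: two dense regions and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.4, 0.4),
          (5.2, 5.1), (5.5, 5.6), (9.0, 0.5)]
assignments = grid_clusters(points, cell_size=1.0, min_points=2)
print(assignments)
```

The first four points fall into two neighboring dense cells and are merged into one cluster, while the isolated point is left unassigned.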
Oracle Data Mining performs hierarchical clustering. The leaf clusters are the final clusters generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.
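As a minimal sketch of this distinction, a cluster hierarchy can be represented as a mapping from each parent cluster to its children. The cluster IDs and the `leaf_clusters` helper are hypothetical and are not an Oracle Data Mining API; clusters without children are the leaf (final) clusters.

```python
def leaf_clusters(children, root):
    """Return the IDs of clusters with no children, in depth-first order."""
    kids = children.get(root, [])
    if not kids:
        return [root]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_clusters(children, kid))
    return leaves


# Hypothetical hierarchy: cluster 1 is the root; clusters 2 and 3 are intermediate.
hierarchy = {1: [2, 3], 2: [4, 5], 3: [6, 7]}

leaves = leaf_clusters(hierarchy, root=1)
intermediate = sorted(hierarchy)  # every cluster that has children
print(leaves, intermediate)
```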
Rules describe the data in each cluster. A rule is a conditional statement that captures the logic used to split a parent cluster into child clusters. A rule describes the conditions for a case to be assigned with some probability to a cluster. For example, the following rule applies to cases that are assigned to cluster 19:
IF
  OCCUPATION in Cleric. AND OCCUPATION in Crafts AND OCCUPATION in Exec. AND OCCUPATION in Prof.
  CUST_GENDER in M
  COUNTRY_NAME in United States of America
  CUST_MARITAL_STATUS in Married
  AFFINITY_CARD in 1.0
  EDUCATION in < Bach. AND EDUCATION in Bach. AND EDUCATION in HS-grad AND EDUCATION in Masters
  CUST_INCOME_LEVEL in B: 30,000 - 49,999 AND CUST_INCOME_LEVEL in E: 90,000 - 109,999
  AGE lessOrEqual 0.7 AND AGE greaterOrEqual 0.2
THEN
  Cluster equal 19.0
Support and confidence are metrics that describe the relationships between clustering rules and cases.
Support is the percentage of cases for which the rule holds.
Confidence is the probability that a case described by this rule will actually be assigned to the cluster.
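The two metrics can be illustrated with a small sketch. It assumes that support is measured over all cases in the data set and that confidence is the fraction of rule-matching cases assigned to the cluster; the case records, the simplified rule, and the function name are hypothetical.

```python
def rule_support_confidence(cases, rule, cluster_id):
    """Return (support, confidence) of a rule for one cluster, as fractions."""
    matching = [case for case in cases if rule(case)]
    # Support: share of all cases for which the rule holds (assumption).
    support = len(matching) / len(cases)
    # Confidence: share of rule-matching cases assigned to the cluster.
    in_cluster = sum(1 for case in matching if case["cluster"] == cluster_id)
    confidence = in_cluster / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical cases with their cluster assignments.
cases = [
    {"CUST_GENDER": "M", "CUST_MARITAL_STATUS": "Married", "cluster": 19},
    {"CUST_GENDER": "M", "CUST_MARITAL_STATUS": "Married", "cluster": 19},
    {"CUST_GENDER": "M", "CUST_MARITAL_STATUS": "Married", "cluster": 7},
    {"CUST_GENDER": "F", "CUST_MARITAL_STATUS": "Married", "cluster": 19},
    {"CUST_GENDER": "M", "CUST_MARITAL_STATUS": "Single", "cluster": 7},
]


def married_men(case):
    # A simplified stand-in for a cluster rule.
    return case["CUST_GENDER"] == "M" and case["CUST_MARITAL_STATUS"] == "Married"


support, confidence = rule_support_confidence(cases, married_men, cluster_id=19)
print(f"support={support:.0%} confidence={confidence:.0%}")
```

In this toy data the rule holds for three of the five cases, and two of those three are assigned to cluster 19.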
In Oracle Data Miner, a histogram represents the distribution of the values of an attribute in a cluster. Figure 7-1 shows a histogram for the distribution of occupations in a cluster of customer data.
In this cluster, about 13% of the customers are craftsmen; about 13% are executives, 2% are farmers, and so on. None of the customers in this cluster are in the armed forces or work in housing sales.
Figure 7-1 Histogram in Oracle Data Miner
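The percentages in such a histogram are relative frequencies of each attribute value within the cluster. A minimal sketch, using hypothetical occupation values rather than the data behind the figure:

```python
from collections import Counter


def attribute_histogram(values):
    """Return each distinct value's share of the cluster, in percent."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.most_common()}


# Hypothetical occupations of the eight customers assigned to one cluster.
occupations = ["Crafts", "Crafts", "Exec.", "Exec.",
               "Prof.", "Prof.", "Prof.", "Farming"]

histogram = attribute_histogram(occupations)
for value, percent in histogram.items():
    print(f"{value:<8} {percent:5.1f}%")
```

Values that never occur in the cluster simply do not appear in the result, which corresponds to an empty bin in the histogram.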
The centroid represents the most typical case in a cluster. For example, in a data set of customer ages and incomes, the centroid of each cluster would be a customer of average age and average income in that cluster. If the data set included gender, the centroid would have the gender most frequently represented in the cluster. Figure 7-1 shows the centroid values for a cluster.
The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster. The attribute values for the centroid are the mean of the numerical attributes and the mode of the categorical attributes.
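The definition above (mean of the numerical attributes, mode of the categorical attributes) can be sketched as follows. The case records and the `centroid` helper are hypothetical; the attribute names are only illustrative.

```python
from statistics import mean, mode


def centroid(cases, numeric, categorical):
    """Build a prototype case: mean of numeric and mode of categorical attributes."""
    prototype = {name: mean([case[name] for case in cases]) for name in numeric}
    prototype.update(
        {name: mode([case[name] for case in cases]) for name in categorical}
    )
    return prototype


# Hypothetical cases assigned to one cluster.
cluster_cases = [
    {"AGE": 30, "INCOME": 40000, "CUST_GENDER": "M"},
    {"AGE": 40, "INCOME": 50000, "CUST_GENDER": "F"},
    {"AGE": 50, "INCOME": 60000, "CUST_GENDER": "M"},
]

prototype = centroid(cluster_cases, numeric=["AGE", "INCOME"],
                     categorical=["CUST_GENDER"])
print(prototype)
```

Note that the resulting prototype (age 40, income 50000, gender M) does not match any of the three cases, which illustrates why the centroid need not describe a real case.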
These examples use the clustering model km_sh_clus_sample, created by one of the Oracle Data Mining sample programs, to show how clustering might be used to find natural groupings in the build data or to score new data.
Figure 7-2 shows six columns and ten rows from the case table used to build the model. Note that no column is designated as a target.
See Also:
Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs
Suppose you want to segment your customer data before performing further analysis. You could analyze the metrics generated for the data by the clustering algorithm. Figure 7-3 shows clustering details displayed in Oracle Data Miner. The details describe the yrs_residence attribute in cluster 3. They show that 20% of customers have been in their current residence for 2 years, almost 25% have been in their current residence for 3 years, and so on.
Figure 7-3 Cluster Information for the Build Data
Suppose you want to segment a database of regional customer data for marketing research purposes. You might experiment by using a clustering model that you developed for a different region. Figure 7-4 shows some of the cluster assignments in the scored customer data. It shows a 95.5% probability that customer 100,001 is a member of cluster 15, an 89% probability that customer 100,002 is in cluster 6, and so on.
Note:
Oracle Data Miner displays the generalized case ID in the DMR$CASE_ID column of the apply output table. The cluster assignment for each case is displayed in the CLUSTER_ID column. The probability of membership in that cluster is displayed in the PROBABILITY column.
The conditions of membership in a cluster are described in a rule. Figure 7-5 shows the rule for cluster 15.
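Once scored data is available with the DMR$CASE_ID, CLUSTER_ID, and PROBABILITY columns, a typical next step is to group the cases by cluster and keep only confident assignments. The sketch below assumes the apply output has been fetched into a list of dictionaries; the `members_by_cluster` helper, the 0.8 threshold, and the third row are hypothetical, while the first two rows use the values quoted in the text.

```python
from collections import defaultdict


def members_by_cluster(rows, min_probability=0.0):
    """Group case IDs by assigned cluster, skipping low-probability assignments."""
    groups = defaultdict(list)
    for row in rows:
        if row["PROBABILITY"] >= min_probability:
            groups[row["CLUSTER_ID"]].append(row["DMR$CASE_ID"])
    return dict(groups)


rows = [
    {"DMR$CASE_ID": 100001, "CLUSTER_ID": 15, "PROBABILITY": 0.955},
    {"DMR$CASE_ID": 100002, "CLUSTER_ID": 6, "PROBABILITY": 0.89},
    # Hypothetical row, added to show a weak assignment being filtered out.
    {"DMR$CASE_ID": 100003, "CLUSTER_ID": 15, "PROBABILITY": 0.52},
]

confident = members_by_cluster(rows, min_probability=0.8)
print(confident)
```

Cases that fall below the threshold might deserve a closer look, since a low membership probability suggests that the case does not fit any cluster well.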
Oracle Data Mining supports two clustering algorithms: an enhanced version of k-means, and an Oracle proprietary algorithm called Orthogonal Partitioning Clustering (O-Cluster). Both algorithms perform hierarchical clustering.
The main characteristics of the enhanced k-means and O-Cluster algorithms are compared in Table 7-1.
Table 7-1 Clustering Algorithms Compared
Feature | Enhanced k-Means | O-Cluster
Clustering methodology | Distance-based | Grid-based
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes
Number of clusters | User-specified | Automatically determined
Hierarchical clustering | Yes | Yes
Probabilistic cluster assignment | Yes | Yes
See Also:
Chapter 13, "k-Means" for information about Oracle's implementation of the k-Means algorithm
Chapter 17, "O-Cluster" for information about the Oracle-proprietary algorithm, Orthogonal Partitioning Clustering