7 Clustering
About Clustering
Identify clusters of similar data objects, useful for exploring and preprocessing data without predefined categories.
Evaluating a Clustering Model
Assess clustering models by examining generated information, such as centroids and hierarchical rules, to ensure reliability for business decisions.
Clustering Algorithms
Learn different clustering algorithms used in Oracle Machine Learning for SQL.
Parent topic: Machine Learning Techniques
7.1 About Clustering
Identify clusters of similar data objects, useful for exploring and preprocessing data without predefined categories.
The members of a cluster are more like each other than they are like members of other clusters. Different clusters can have members in common. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously defined classes, which are specified in a target. Clustering models do not use a target.
Clustering is useful for exploring data. You can use clustering algorithms to find natural groupings when there are many cases and no obvious groupings.
Clustering can serve as a useful data-preprocessing step to identify homogeneous groups on which you can build supervised models.
You can also use clustering for anomaly detection. Once you segment the data into clusters, you may find that some cases do not fit well into any cluster. These cases are anomalies or outliers.
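The idea can be sketched in a few lines: once clusters have been computed, a case whose distance to every cluster centroid exceeds some threshold fits no cluster well and can be flagged as an outlier. This is an illustrative sketch with made-up data and a hand-picked threshold, not Oracle Machine Learning's anomaly detection API.

```python
import math

def nearest_centroid_distance(case, centroids):
    """Distance from a case to its closest cluster centroid."""
    return min(math.dist(case, c) for c in centroids)

# Hypothetical 2-D data: two tight clusters plus one stray point.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cases = [(0.1, -0.2), (9.8, 10.1), (5.0, -6.0)]

# Flag cases whose distance to every centroid exceeds a threshold.
threshold = 3.0
outliers = [p for p in cases
            if nearest_centroid_distance(p, centroids) > threshold]
print(outliers)  # only the stray point (5.0, -6.0) fits neither cluster
```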
How are Clusters Computed?
Compute clusters using density-based, distance-based, or grid-based methods to identify high-density areas, measure similarity, and form clusters.
Scoring New Data
Score new data probabilistically to predict cluster assignments for new cases.
Hierarchical Clustering
Perform hierarchical clustering to generate final leaf clusters and intermediate clusters in the hierarchy.
Parent topic: Clustering
7.1.1 How are Clusters Computed?
Compute clusters using density-based, distance-based, or grid-based methods to identify high-density areas, measure similarity, and form clusters.
There are several different approaches to the computation of clusters. Oracle Machine Learning supports the methods listed here.

Density-based: This type of clustering finds the underlying distribution of the data and estimates how areas of high density in the data correspond to peaks in the distribution. High-density areas are interpreted as clusters. Density-based cluster estimation is probabilistic.

Distance-based: This type of clustering uses a distance metric to determine similarity between data objects. The distance metric measures the distance between actual cases in the cluster and the prototypical case for the cluster. The prototypical case is known as the centroid.

Grid-based: This type of clustering divides the input space into hyper-rectangular cells and identifies adjacent high-density cells to form clusters.
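The distance-based approach can be illustrated with one iteration of the familiar k-Means style update: assign each case to its nearest centroid, then recompute each centroid as the mean of its assigned cases. This is a minimal sketch with toy data, not the enhanced algorithm Oracle Machine Learning actually implements.

```python
import math

def assign_and_update(cases, centroids):
    """One distance-based iteration: assign each case to its nearest
    centroid, then recompute each centroid as the mean of its members."""
    clusters = {i: [] for i in range(len(centroids))}
    for case in cases:
        i = min(range(len(centroids)),
                key=lambda k: math.dist(case, centroids[k]))
        clusters[i].append(case)
    new_centroids = []
    for i in range(len(centroids)):
        members = clusters[i]
        if members:
            # Mean of the members, dimension by dimension.
            new_centroids.append(tuple(sum(dim) / len(members)
                                       for dim in zip(*members)))
        else:
            new_centroids.append(centroids[i])  # keep an empty cluster's prototype
    return clusters, new_centroids

cases = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.4, 7.6)]
clusters, centroids = assign_and_update(cases, [(0.0, 0.0), (9.0, 9.0)])
print(centroids)  # each centroid moves to the mean of its two members
```

Repeating this step until the centroids stop moving yields the converged distance-based clustering.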
Parent topic: About Clustering
7.1.2 Scoring New Data
Score new data probabilistically to predict cluster assignments for new cases.
Although clustering is an unsupervised machine learning technique, Oracle Machine Learning supports the scoring operation for clustering. New data is scored probabilistically.
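One simple way to picture probabilistic scoring is to turn a new case's distance to each centroid into a normalized weight, so that nearer clusters receive higher membership probability. The weighting scheme below (an exponential of the negative distance) is an illustrative assumption, not Oracle Machine Learning's internal scoring formula.

```python
import math

def score(case, centroids):
    """Assign a probability to each cluster: closer centroids get
    exponentially larger weights, normalized so the result sums to 1."""
    weights = [math.exp(-math.dist(case, c)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(0.0, 0.0), (4.0, 0.0)]
probs = score((1.0, 0.0), centroids)
print(probs)  # higher probability for the nearer first cluster
```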
Parent topic: About Clustering
7.1.3 Hierarchical Clustering
Perform hierarchical clustering to generate final leaf clusters and intermediate clusters in the hierarchy.
Oracle Machine Learning supports clustering algorithms that perform hierarchical clustering. The leaf clusters are the final clusters generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.
Rules
Describe data in clusters using conditional statements that capture the logic for cluster assignments.
Support and Confidence
Evaluate clustering rules using support (percentage of applicable cases) and confidence (probability of correct cluster assignment).
Parent topic: About Clustering
7.1.3.1 Rules
Describe data in clusters using conditional statements that capture the logic for cluster assignments.
Rules describe the data in each cluster. A rule is a conditional statement that captures the logic used to split a parent cluster into child clusters. A rule describes the conditions for a case to be assigned with some probability to a cluster.
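A cluster rule can be thought of as a predicate over attribute values: a case satisfying every condition matches the rule for that cluster. The attribute names and bounds below are hypothetical, chosen only to make the shape of a rule concrete.

```python
# A minimal sketch of a cluster rule as a conditional statement over
# attribute values. Attribute names and bounds here are hypothetical.
rule = {
    "cluster_id": 7,
    "conditions": [
        ("AGE", lambda v: 25 <= v <= 40),
        ("INCOME", lambda v: v > 50_000),
    ],
}

def rule_applies(case, rule):
    """True when the case satisfies every condition in the rule."""
    return all(pred(case[attr]) for attr, pred in rule["conditions"])

case = {"AGE": 31, "INCOME": 62_000}
print(rule_applies(case, rule))  # True: the case matches cluster 7's rule
```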
Parent topic: Hierarchical Clustering
7.1.3.2 Support and Confidence
Evaluate clustering rules using support (percentage of applicable cases) and confidence (probability of correct cluster assignment).
Support and confidence are metrics that describe the relationships between clustering rules and cases. Support is the percentage of cases for which the rule holds. Confidence is the probability that a case described by this rule is actually assigned to the cluster.
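These two definitions translate directly into code: support counts how many cases match the rule, and confidence counts how many of those matching cases actually land in the rule's cluster. The cases and the rule below are hypothetical.

```python
def support_and_confidence(cases, rule_holds, in_cluster):
    """Support: fraction of all cases for which the rule holds.
    Confidence: of the cases matching the rule, the fraction actually
    assigned to the rule's cluster."""
    matching = [c for c in cases if rule_holds(c)]
    support = len(matching) / len(cases)
    confidence = sum(1 for c in matching if in_cluster(c)) / len(matching)
    return support, confidence

# Hypothetical cases: (age, assigned_cluster)
cases = [(30, "A"), (35, "A"), (28, "B"), (60, "B"), (45, "A")]
s, c = support_and_confidence(
    cases,
    rule_holds=lambda case: case[0] < 40,    # rule: AGE < 40
    in_cluster=lambda case: case[1] == "A",  # rule targets cluster A
)
print(s, c)  # support 0.6 (3 of 5 match), confidence ~0.67 (2 of 3 in A)
```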
Parent topic: Hierarchical Clustering
7.2 Evaluating a Clustering Model
Assess clustering models by examining generated information, such as centroids and hierarchical rules, to ensure reliability for business decisions.
Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making? Oracle Machine Learning clustering models support a high degree of model transparency. You can evaluate the model by examining information generated by the clustering algorithm: for example, the centroid of a distance-based cluster. Moreover, because the clustering process is hierarchical, you can evaluate the rules and other information related to each cluster's position in the hierarchy.
Parent topic: Clustering
7.3 Clustering Algorithms
Learn different clustering algorithms used in Oracle Machine Learning for SQL.
Oracle Machine Learning for SQL supports these clustering algorithms:

Expectation Maximization
Expectation Maximization is a probabilistic, density-estimation clustering algorithm.

k-Means
k-Means is a distance-based clustering algorithm. Oracle Machine Learning for SQL supports an enhanced version of k-Means.

Orthogonal Partitioning Clustering (O-Cluster)
O-Cluster is a proprietary, grid-based clustering algorithm.
See Also:
Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.
The main characteristics of the three algorithms are compared in the following table.
Table 7-1 Clustering Algorithms Compared

Feature | k-Means | O-Cluster | Expectation Maximization
Clustering methodology | Distance-based | Grid-based | Distribution-based
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases; handles large tables through active sampling | Handles data sets of any size
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes | Appropriate for data sets with many or few attributes
Number of clusters | User-specified | Automatically determined | Automatically determined
Hierarchical clustering | Yes | Yes | Yes
Probabilistic cluster assignment | Yes | Yes | Yes
Note:
Oracle Machine Learning for SQL uses k-Means as the default clustering algorithm.
Parent topic: Clustering