Oracle® Data Mining Concepts 11g Release 2 (11.2) E16808-07


This chapter describes clustering, the unsupervised mining function for discovering natural groupings in the data.
See Also:
"Unsupervised Data Mining"
This chapter includes the following topics:
Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
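The goal stated above can be made concrete with a small sketch. It assumes numeric two-dimensional cases and uses Euclidean distance as a stand-in for dissimilarity; the toy data and the function names are hypothetical and are not part of Oracle Data Mining. A good clustering shows a small average distance within each cluster and a large average distance between clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two clusters of 2-D cases.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}


def mean_intra_distance(points):
    """Average pairwise distance between members of one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(points_a, points_b):
    """Average distance between members of two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))


intra_a = mean_intra_distance(clusters["A"])
intra_b = mean_intra_distance(clusters["B"])
inter_ab = mean_inter_distance(clusters["A"], clusters["B"])

# Members are close to each other (high similarity within a cluster)
# and far from the other cluster (low similarity between clusters).
print(intra_a < inter_ab and intra_b < inter_ab)
```

Distance is the inverse of similarity here: a low distance within a cluster corresponds to high intra-cluster similarity.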
Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously defined classes, which are specified in a target. Clustering models do not use a target.
Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings.
Clustering can serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.
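One simple way to act on this idea is to flag cases that lie far from every cluster center. The sketch below assumes that a clustering model has already produced centroids. The centroid values, the distance threshold, and the function names are illustrative assumptions, not Oracle Data Mining output or API.

```python
from math import dist

# Hypothetical centroids, as if taken from a previously built clustering model.
centroids = {"A": (1.0, 1.0), "B": (8.0, 8.0)}


def nearest_centroid(point, centroids):
    """Return (cluster_id, distance) for the centroid closest to the point."""
    return min(
        ((cluster_id, dist(point, center)) for cluster_id, center in centroids.items()),
        key=lambda pair: pair[1],
    )


def find_outliers(points, centroids, threshold):
    """Return the cases whose nearest centroid is farther away than threshold."""
    return [p for p in points if nearest_centroid(p, centroids)[1] > threshold]


cases = [(1.2, 0.9), (7.5, 8.4), (4.5, 4.5), (0.8, 1.1)]

# The threshold is an assumed cutoff; in practice it depends on the data.
outliers = find_outliers(cases, centroids, threshold=2.0)
print(outliers)  # [(4.5, 4.5)]
```

The case (4.5, 4.5) sits roughly midway between the two centroids and fits neither cluster well, so it is reported as a candidate anomaly.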
There are several different approaches to the computation of clusters. Oracle Data Mining supports distance-based and grid-based clustering:
Distance-based: This type of clustering uses a distance metric to determine similarity between data objects. The distance metric measures the distance between actual cases in the cluster and the prototypical case for the cluster. The prototypical case is known as the centroid.
Oracle Data Mining supports an enhanced version of k-Means, a distance-based clustering algorithm.
Grid-based: This type of clustering divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
Oracle Data Mining supports Orthogonal Partitioning Clustering (O-Cluster), a proprietary grid-based clustering algorithm.
Reference:
Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.
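The general grid-based idea described above (divide the space into cells, discard sparse cells, merge adjacent dense cells) can be sketched as follows. This is a generic illustration for two numeric attributes, not the proprietary O-Cluster algorithm; the fixed cell size, the density threshold `min_count`, and the function names are assumptions made for the example.

```python
from collections import Counter, deque


def cell_of(point, cell_size):
    """Map a 2-D point to the integer coordinates of its grid cell."""
    return (int(point[0] // cell_size), int(point[1] // cell_size))


def grid_clusters(points, cell_size, min_count):
    """Label each point with a cluster ID, or None if its cell is low-density."""
    # 1. Divide the input space into rectangular cells and count cases per cell.
    counts = Counter(cell_of(p, cell_size) for p in points)

    # 2. Discard the low-density cells.
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # 3. Combine adjacent high-density cells (cells sharing a face) into clusters.
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for neighbor in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if neighbor in dense and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1

    return [labels.get(cell_of(p, cell_size)) for p in points]


points = [
    (0.1, 0.1), (0.2, 0.3),   # dense cell (0, 0)
    (1.1, 0.2), (1.3, 0.4),   # dense cell (1, 0), adjacent to (0, 0)
    (5.1, 5.2), (5.3, 5.4),   # dense cell (5, 5), isolated
    (9.5, 0.5),               # lone case in a low-density cell
]
assignments = grid_clusters(points, cell_size=1.0, min_count=2)
print(assignments)  # [0, 0, 0, 0, 1, 1, None]
```

The first four cases fall into two neighboring dense cells and are merged into one cluster, while the lone case in a sparse cell is left unassigned.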
The clustering algorithms supported by Oracle Data Mining perform hierarchical clustering. The leaf clusters are the final clusters generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.
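The distinction between leaf and intermediate clusters can be illustrated with a small tree. The sketch assumes the hierarchy is available as a mapping from each cluster ID to the IDs of its children; this representation, the IDs, and the function name are hypothetical and do not reflect how Oracle Data Mining stores model details.

```python
# Hypothetical cluster hierarchy: cluster ID -> IDs of its child clusters.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}


def leaf_clusters(children, root):
    """Return the IDs of all clusters under root that have no children."""
    kids = children.get(root, [])
    if not kids:
        return [root]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_clusters(children, kid))
    return leaves


leaves = leaf_clusters(children, 1)
intermediate = sorted(set(children) - set(leaves))

print(leaves)        # [4, 5, 3]  final clusters
print(intermediate)  # [1, 2]     clusters higher up in the hierarchy
```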
Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making?
Oracle Data Mining clustering models support a high degree of model transparency. You can evaluate the model by examining information generated by the clustering algorithm: for example, the centroid of a distance-based cluster. Moreover, because the clustering process is hierarchical, you can evaluate the rules and other information related to each cluster's position in the hierarchy.
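As a minimal illustration of what a centroid summarizes, the sketch below computes the per-attribute mean of a cluster's members. It assumes purely numeric attributes; the data and the function name are hypothetical, and the sketch is not the way Oracle Data Mining reports model details.

```python
def centroid(points):
    """Return the per-attribute mean of a non-empty list of numeric cases."""
    n = len(points)
    return tuple(sum(values) / n for values in zip(*points))


# Hypothetical members of one distance-based cluster.
members = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
center = centroid(members)
print(center)  # (1.5, 1.5)
```

Comparing the centroids of different clusters attribute by attribute is one way to judge whether the groupings are meaningful for the business question at hand.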
Oracle Data Mining supports two clustering algorithms: k-Means and O-Cluster. The main characteristics of the two algorithms are compared in Table 7-1.
Table 7-1 Clustering Algorithms Compared