Oracle® Data Mining Concepts 11g Release 1 (11.1) B28129-04


This chapter describes the enhanced k-Means clustering algorithm supported by Oracle Data Mining.
See Also:
Chapter 7, "Clustering"
The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases).
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is Euclidean, Cosine, or Fast Cosine distance. Each data point is assigned to the nearest cluster according to the distance metric used.
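As an illustration of distance-based assignment, the following Python sketch computes Euclidean and cosine distance and assigns each point to its nearest centroid. It is a minimal sketch, not Oracle's implementation: the function names are hypothetical, cosine distance is assumed to be 1 minus the cosine similarity (a common definition), and Fast Cosine is not covered because its formulation is not described here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_centroid(point, centroids, distance=euclidean_distance):
    """Return the index of the centroid closest to point under the metric."""
    return min(range(len(centroids)),
               key=lambda i: distance(point, centroids[i]))

# Hypothetical centroids and points, for illustration only.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0)]
assignments = [nearest_centroid(p, centroids) for p in points]
```

Passing distance=cosine_distance to nearest_centroid switches the metric without changing the assignment logic.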
Oracle Data Mining implements an enhanced version of the k-means algorithm with the following features:
The algorithm builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.
The algorithm grows the tree one node at a time (unbalanced approach). Based on a user setting available in either of the programming interfaces, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached. The maximum number of clusters is specified in the build setting for clustering models, CLUS_NUM_CLUSTERS. (See Chapter 7, "Clustering".)
The algorithm provides probabilistic scoring and assignment of data to clusters.
The algorithm returns, for each cluster, a centroid (cluster prototype), histograms (one for each attribute), and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes or the mean and variance for numerical attributes.
This approach to k-means avoids the need for building multiple k-means models and provides clustering results that are consistently superior to those of traditional k-means.
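The top-down, one-node-at-a-time growth described in the features above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Oracle's implementation: all function names are hypothetical, the seeding of each binary split is a deterministic choice made for this example, variance is taken as the mean squared distance to the centroid, and the final refinement of all nodes and the updating of inner-node centroids are omitted.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variance(points):
    """Mean squared distance of the points to their centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points) / len(points)

def two_means(points, iterations=10):
    """Binary split with a plain 2-means loop (needs 2+ distinct points)."""
    c = mean(points)
    # Assumed seeding: the point farthest from the centroid, then the
    # point farthest from that seed.
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    left, right = points, []
    for _ in range(iterations):
        new_left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        new_right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        if not new_left or not new_right:
            break  # keep the last valid split
        left, right = new_left, new_right
        a, b = mean(left), mean(right)
    return left, right

def grow_tree(points, num_clusters):
    """Split the highest-variance leaf until num_clusters leaves exist."""
    leaves = [points]
    while len(leaves) < num_clusters:
        candidates = [leaf for leaf in leaves if len(set(leaf)) > 1]
        if not candidates:
            break  # not enough distinct cases to split further
        target = max(candidates, key=variance)
        left, right = two_means(target)
        leaves.remove(target)
        leaves.extend([left, right])
    return leaves

# Hypothetical data with three visible groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
leaves = grow_tree(points, 3)
```

Here num_clusters plays the role that CLUS_NUM_CLUSTERS plays as an upper bound: the loop stops early when no leaf contains more than one distinct case.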
The clusters discovered by enhanced k-Means are used to generate a Bayesian probability model that is then used during scoring (model apply) to assign data points to clusters. The k-means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components.
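The mixture-model interpretation can be illustrated with a short Python sketch that turns distances to centroids into cluster membership probabilities. It assumes spherical normal components with one shared variance, so the normalizing constants cancel and only the priors and squared distances matter. The function name, priors, and variance value are hypothetical, and the sketch does not claim to reproduce the probability model that Oracle Data Mining actually builds.

```python
import math

def cluster_probabilities(point, centroids, priors, variance):
    """Posterior P(cluster | point) for equal-variance spherical Gaussians."""
    log_scores = []
    for centroid, prior in zip(centroids, priors):
        sq = sum((x - c) ** 2 for x, c in zip(point, centroid))
        # log prior + log density, dropping terms shared by all clusters
        log_scores.append(math.log(prior) - sq / (2.0 * variance))
    top = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-cluster model.
centroids = [(0.0, 0.0), (4.0, 4.0)]
priors = [0.5, 0.5]
probs = cluster_probabilities((1.0, 1.0), centroids, priors, variance=4.0)
```

With equal priors, the most probable cluster is always the nearest centroid, which is how the hard assignment of k-means falls out of the probabilistic view.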
Automatic Data Preparation performs outlier-sensitive normalization for k-Means.
When there are missing values in columns with simple data types (not nested), k-Means interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.
When there are missing values in nested columns, k-Means interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
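The two missing-value treatments can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, missing values are represented as None, a nested row is modeled as a dictionary of the attributes that are present, and only the numerical case of sparse data is shown.

```python
from collections import Counter

def impute_column(values, categorical):
    """Replace None with the mode (categorical) or the mean (numerical)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate a replacement from
    if categorical:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sum(present) / len(present)
    return [fill if v is None else v for v in values]

def densify_nested(sparse_row, attribute_names):
    """Expand a sparse nested row; absent numerical attributes become 0."""
    return [sparse_row.get(name, 0) for name in attribute_names]

# Hypothetical columns and nested row.
ages = impute_column([1.0, None, 3.0], categorical=False)
colors = impute_column(["a", None, "b", "a"], categorical=True)
dense = densify_nested({"x": 2.5}, ["x", "y", "z"])
```

The distinction matters for clustering: a mean or mode keeps an incomplete case close to the typical case, while a zero states that the attribute is absent.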
If you manage your own data preparation for k-Means, keep in mind that outliers with equi-width binning can prevent k-Means from creating clusters that are different in content. The clusters may have very similar centroids, histograms, and rules.
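A small Python sketch shows why outliers and equi-width binning interact badly. The function below is a generic equi-width binner written for this example (it is not an Oracle routine), and the data values are made up. A single extreme value stretches the range so far that every ordinary value lands in the same bin, which leaves the algorithm with little to tell the cases apart.

```python
def equi_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value is clamped into the last bin.
    return [min(int((v - low) / width), num_bins - 1) for v in values]

# Hypothetical data: the same values with and without one outlier.
clean_bins = equi_width_bins([1.0, 2.0, 3.0, 4.0, 5.0], 4)
outlier_bins = equi_width_bins([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0], 4)
```

Without the outlier, the five values spread across all four bins. With it, they all fall into bin 0 and only the outlier occupies another bin.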
See Also:
Chapter 19, "Automatic and Embedded Data Preparation"
Oracle Data Mining Application Developer's Guide for information about nested columns and missing data