User-Specified Data Preparation for O-Cluster

You can prepare the data for O-Cluster yourself by applying equi-width binning and handling outliers.

Keep the following in mind if you choose to prepare the data for O-Cluster:

O-Cluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000) and reads another batch only if statistical tests indicate that clusters not yet discovered may still exist.
Binary attributes must be declared as categorical.
Automatic equi-width binning is highly recommended. If you bin the data yourself, the bin identifiers are expected to be positive consecutive integers starting at 1.
Outliers can significantly degrade the results of clustering algorithms. With equi-width binning, a few extreme values stretch the bin range so that most of the data falls into very few bins. This can prevent O-Cluster from detecting clusters, so the whole population appears to fall within a single cluster. Use a clipping transformation before binning or normalizing.
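
The effect of outliers on equi-width binning can be illustrated with a minimal sketch. This is plain Python for illustration only, not Oracle's transformation API. The helper names `clip` and `equi_width_bins`, the sample values, and the clipping bounds are all assumptions chosen for the example. In practice the bounds would typically come from percentiles of the data.

```python
def clip(values, lower, upper):
    """Clamp every value into [lower, upper] (hypothetical clipping helper)."""
    return [min(max(v, lower), upper) for v in values]


def equi_width_bins(values, n_bins):
    """Assign each value a bin identifier in 1..n_bins using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in bin 1.
        return [1 for _ in values]
    width = (hi - lo) / n_bins
    ids = []
    for v in values:
        idx = int((v - lo) / width) + 1
        # The maximum value lands exactly on the upper edge; keep it in the last bin.
        ids.append(min(idx, n_bins))
    return ids


# Illustrative data: eight ordinary values and one extreme outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]

# Without clipping, the outlier stretches the range and the rest share bin 1.
raw_bins = equi_width_bins(values, 4)

# With clipping to an assumed range of [1, 10], the values spread across bins.
clipped_bins = equi_width_bins(clip(values, 1.0, 10.0), 4)

print(raw_bins)      # [1, 1, 1, 1, 1, 1, 1, 1, 4]
print(clipped_bins)  # [1, 1, 1, 2, 2, 3, 3, 4, 4]
```

Without clipping, every ordinary value collapses into bin 1, which is the situation in which the population looks like a single cluster. After clipping, the same values occupy all four bins, and the identifiers are consecutive positive integers starting at 1.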
