User-Specified Data Preparation for O-Cluster

You can prepare the data for O-Cluster yourself by applying equi-width binning and handling outliers.

Keep the following in mind if you choose to prepare the data for O-Cluster:

  • O-Cluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It reads another batch only if statistical tests indicate that clusters it has not yet uncovered may still exist.

  • Binary attributes must be declared as categorical.

  • Automatic equi-width binning is highly recommended. The bin identifiers are expected to be positive consecutive integers starting at 1.

  • Outliers can significantly degrade the results of clustering algorithms. Use a clipping transformation before binning or normalizing. With equi-width binning, outliers stretch the binned range so that most values fall into very few bins, which can prevent O-Cluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.
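
The effect of outliers on equi-width binning can be illustrated with a small sketch in plain Python. This is not O-Cluster or any Oracle API: the helper names clip and equi_width_bins, the sample values, the clipping bounds, and the bin count of 5 are all assumptions chosen for illustration. The sketch clips first and then bins, and it numbers the bins with consecutive integers starting at 1.

```python
def clip(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound (winsorizing)."""
    return [min(max(v, lower), upper) for v in values]


def equi_width_bins(values, n_bins):
    """Assign each value a bin id in 1..n_bins using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical, so use a single bin.
        return [1] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins + 1, so cap it at n_bins.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]


# Hypothetical attribute with one extreme outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

# Without clipping, the outlier stretches the range and nine of the
# ten values collapse into bin 1.
raw_bins = equi_width_bins(values, 5)

# Clipping to assumed bounds [1, 9] before binning spreads the values
# across all five bins.
clipped_bins = equi_width_bins(clip(values, 1.0, 9.0), 5)

print(raw_bins)      # [1, 1, 1, 1, 1, 1, 1, 1, 1, 5]
print(clipped_bins)  # [1, 1, 2, 2, 3, 4, 4, 5, 5, 5]
```

In practice the clipping bounds would usually come from the data, for example from tail percentiles, rather than from fixed constants as in this sketch.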