User-Specified Data Preparation for O-Cluster
User-specified data preparation for O-Cluster centers on equi-width binning and outlier handling.
Keep the following in mind if you choose to prepare the data for O-Cluster:
- O-Cluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000) and reads another batch only if statistical tests indicate that clusters it has not yet uncovered may still exist.
- Binary attributes must be declared as categorical.
- Automatic equi-width binning is highly recommended. The bin identifiers are expected to be positive consecutive integers starting at 1.
- Outliers can significantly impact clustering algorithms. Use a clipping transformation before binning or normalizing. With equi-width binning, outliers can prevent O-Cluster from detecting clusters, so that the whole population appears to fall within a single cluster.
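The interaction between outliers and equi-width binning can be illustrated with a small, self-contained Python sketch. This is a conceptual illustration only, not the database's own transformation routines: the helper names `clip` and `equi_width_bin_ids`, the sample data, the hand-picked clipping bounds, and the bin count are all assumptions made for the example.

```python
def clip(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def equi_width_bin_ids(values, n_bins):
    """Map each value to a bin identifier in 1..n_bins using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [1] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins + 1, so cap it.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]


# Two well-separated groups plus a single extreme outlier (made-up data).
data = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5, 1000.0]

# Binning the raw data: the outlier stretches the range, so both groups
# collapse into bin 1.
raw_bins = equi_width_bin_ids(data, n_bins=5)

# Clipping first (bounds chosen by hand for this example) restores the
# separation between the two groups.
clipped_bins = equi_width_bin_ids(clip(data, lower=0.0, upper=10.0), n_bins=5)

print(raw_bins)      # [1, 1, 1, 1, 1, 1, 1, 1, 5]
print(clipped_bins)  # [1, 1, 1, 1, 4, 5, 5, 5, 5]
```

Without clipping, every ordinary value shares bin 1, which is the single-cluster symptom described above. After clipping, the two groups occupy different bins, and the identifiers remain positive integers starting at 1.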