Data Preparation for k-Means

Learn about preparing data for k-Means algorithm.

Normalization is typically required by the k-Means algorithm. Automatic Data Preparation performs normalization for k-Means. If you do not use ADP, you must normalize numeric attributes before creating or applying the model.

When there are missing values in columns with simple data types (not nested), k-Means interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.

When there are missing values in nested columns, k-Means interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.

Data can be constrained in a window size of 6 standard-deviations around the mean value by using the KMNS_WINSORIZE parameter. The KMNS_WINSORIZE parameter can be used whether ADP is set to ON or OFF. Values outside the range are mapped to the range's ends. This parameter is applicable only when the Euclidean distance is used.