oraclesai.clustering.KMeansClustering

class KMeansClustering(n_clusters=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto', init_method='elbow', n_jobs=None)

K-Means clustering algorithm. The algorithm is centroid-based: each observation is assigned to its nearest centroid, and each centroid is then recomputed as the mean of the observations assigned to it. These two steps repeat until the maximum number of iterations (max_iter) is reached or the centroids no longer move between consecutive iterations. If n_clusters is not provided, the number of clusters is estimated with the method given in init_method. Regionalization is not supported, so members of the same cluster can be geographically disconnected.
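The assign-and-update loop described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the library's implementation: the function name kmeans_sketch, the plain random initialization, and the exact-equality stopping test are all assumptions made for the example.

```python
import math
import random


def kmeans_sketch(points, k, max_iter=300, seed=0):
    """Plain Lloyd-style k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: simple random initialization, not the 'k-means++' default of the class.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each observation to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its observations.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop early once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans_sketch(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters. A tolerance-based stopping test (see tol below) would replace the exact equality check used here.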

Parameters:
  • n_clusters – int, default=None. The number of clusters to form, which is also the number of centroids to generate. If None, the number is estimated with the method given in init_method.

  • init – {“k-means++”, “random”}, default=“k-means++”. Method for cluster initialization. If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. If a callable is passed, it should take the arguments X, n_clusters and a random state, and return an initialization.

  • n_init – int, default=10. Number of times k-means is run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia.

  • max_iter – int, default=300. Maximum number of iterations of the k-means algorithm for a single run.

  • tol – float, default=1e-4. Relative tolerance according to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

  • verbose – int, default=0. Verbosity mode.

  • random_state – int, RandomState instance or None, default=None. Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • copy_x – bool, default=True. If True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x=False. If original data is sparse, but not in CSR format, a copy will be made even if copy_x=False.

  • algorithm – {“auto”, “full”, “elkan”}, default=“auto”. K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation uses the triangle inequality and is more efficient on data with well-defined clusters. However, it is more memory-intensive because it allocates an extra array of shape (n_samples, n_clusters).

  • init_method – {“elbow”, “silhouette”}, default=“elbow”. The method used to estimate the number of clusters. Used only when n_clusters is not defined.

  • n_jobs – int, default=None. The maximum number of concurrently running jobs. None is a marker for ‘unset’ that will be interpreted as n_jobs=1.
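To give a feel for what a silhouette-based choice of the number of clusters involves (init_method=“silhouette”), the sketch below scores two candidate labelings with the mean silhouette coefficient and keeps the better one. It is only a sketch under stated assumptions: the helper mean_silhouette and the hand-written candidate labelings are hypothetical, and this reference does not state how the class scores candidates or which range of cluster counts it scans.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a labeling; needs at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            # By convention, a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Hypothetical labelings for k=2 and k=3; in practice each comes from a k-means run.
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}
scores = {k: mean_silhouette(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)  # selects 2 for this toy data
```

A higher mean silhouette indicates tighter, better-separated clusters, which is why the two-cluster labeling wins here.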

Methods

__init__([n_clusters, init, n_init, ...])

fit(X[, y, geometries, spatial_weights, crs])

If the number of clusters is not specified, it is estimated with the method given in the init_method parameter.

fit_predict(X[, y, geometries, ...])

Trains the clustering model and returns the labels assigned to each observation.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

Attributes

DEFAULT_RANGE

DEFAULT_RANGE_2

INIT_TYPE_ELBOW

INIT_TYPE_SILHOUETTE

cluster_centers_

Coordinates of cluster centers.

inertia_

The sum of squared distances of samples to their nearest center.

labels_

Array indicating the nearest centroid to each sample.
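As a small worked illustration of how labels_ and inertia_ relate to cluster_centers_, the following standard-library sketch recomputes both quantities from a set of centers. The sample data and the variable names are made up for the example; it does not call the library.

```python
import math

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]  # stand-in for cluster_centers_

# Index of the nearest center for each sample (the role played by labels_).
labels = [
    min(range(len(centers)), key=lambda j: math.dist(s, centers[j]))
    for s in samples
]

# Sum of squared distances of samples to their nearest center (inertia_).
inertia = sum(math.dist(s, centers[lab]) ** 2 for s, lab in zip(samples, labels))

print(labels)   # [0, 0, 1]
print(inertia)  # 2.0
```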