oraclesai.clustering.KMeansClustering
- class KMeansClustering(n_clusters=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto', init_method='elbow', n_jobs=None)
K-Means clustering algorithm. The algorithm is centroid-based: each observation is assigned to the nearest centroid, and each centroid is then recomputed as the average of all the observations assigned to it. The algorithm stops when a maximum number of iterations is reached, or when the location of the centroids no longer changes from the previous iteration. If the number of clusters is not provided, the algorithm estimates it with the method given in the init_method parameter. Regionalization is not supported, so elements of the same cluster can be geographically disconnected.
- Parameters:
n_clusters – int, default=None. The number of clusters to form, which is also the number of centroids to generate. If None, the number is estimated using init_method.
init – {“k-means++”, “random”}, default=“k-means++”. Method for cluster initialization. If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. If a callable function is passed, it should take the arguments X, n_clusters, and a random state, and return an initialization.
n_init – int, default=10. Number of times k-means will run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
max_iter – int, default=300. Maximum number of iterations of the k-means algorithm for a single run.
tol – float, default=1e-4. Relative tolerance, measured with the Frobenius norm of the difference between the cluster centers of two consecutive iterations, used to declare convergence.
verbose – int, default=0. Verbosity mode.
random_state – int, RandomState instance or None, default=None. Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
copy_x – bool, default=True. If True, the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy is made even if copy_x=False. If the original data is sparse but not in CSR format, a copy is made even if copy_x=False.
algorithm – {“auto”, “full”, “elkan”}, default=“auto”. K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation uses the triangle inequality and is more efficient on data with well-defined clusters. However, it is more memory-intensive because it allocates an extra array of shape (n_samples, n_clusters).
init_method – {“elbow”, “silhouette”}, default=“elbow”. The method used to estimate the number of clusters, used only when n_clusters is not defined.
n_jobs – int, default=None. The maximum number of concurrently running jobs. None is a marker for ‘unset’ that will be interpreted as n_jobs=1.
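The iteration summarized in the class description, together with the roles of max_iter, tol and n_init, can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the helper names nearest, lloyd_kmeans and best_of_runs are hypothetical, initialization is plain random sampling rather than k-means++, and the convergence test compares the squared centroid shift against tol directly instead of applying a relative tolerance.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def lloyd_kmeans(points, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """One k-means run on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Simplified "random" initialization: sample observations as starting centroids.
    centroids = rng.sample(points, n_clusters)
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its assigned observations.
        new_centroids = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # empty cluster: keep old centroid
        # Squared Frobenius norm of the change in the centroid matrix.
        shift = sum(math.dist(a, b) ** 2 for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # assumption: absolute threshold, not a relative one
            break
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia


def best_of_runs(points, n_clusters, n_init=10, **kwargs):
    """Repeat the run with different seeds and keep the lowest-inertia result."""
    runs = [lloyd_kmeans(points, n_clusters, seed=s, **kwargs) for s in range(n_init)]
    return min(runs, key=lambda run: run[2])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, labels, inertia = best_of_runs(points, n_clusters=2)
print(centroids, labels, inertia)
```

The three returned values correspond to the quantities described under Attributes: the cluster centers, the nearest-centroid index of each sample, and the sum of squared distances to the nearest center.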
Methods
__init__([n_clusters, init, n_init, ...])
fit(X[, y, geometries, spatial_weights, crs]) – In case the number of clusters is not specified, it is estimated with the parameter init_method.
fit_predict(X[, y, geometries, ...]) – Trains the clustering model and returns the labels assigned to each observation.
get_params([deep]) – Get parameters for this estimator.
set_params(**params) – Set the parameters of this estimator.
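How the number of clusters might be estimated when n_clusters is not given can be illustrated with one common form of the elbow heuristic. The criterion this library actually applies is not described on this page, so the sketch below is an assumption: the function elbow_k is hypothetical, and it picks the k at which the inertia curve bends most sharply, measured by the discrete second difference over consecutive values of k.

```python
def elbow_k(inertia_by_k):
    """Return the k with the largest second difference of inertia.

    inertia_by_k maps consecutive integer values of k to the inertia
    obtained when clustering with that many clusters.
    """
    ks = sorted(inertia_by_k)
    if len(ks) < 3:
        raise ValueError("need inertia for at least three values of k")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # A large positive value means inertia dropped sharply before k
        # and only slightly after it, which is the "elbow" of the curve.
        bend = inertia_by_k[prev_k] - 2 * inertia_by_k[k] + inertia_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up inertia values for illustration, not measured results.
example = {1: 100.0, 2: 60.0, 3: 10.0, 4: 8.0, 5: 7.0}
chosen = elbow_k(example)
print(chosen)
```

A silhouette-based estimate would instead score each candidate k by how well separated the resulting clusters are and keep the k with the highest score.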
Attributes
DEFAULT_RANGE
DEFAULT_RANGE_2
INIT_TYPE_ELBOW
INIT_TYPE_SILHOUETTE
Coordinates of cluster centers.
The sum of squared distances of samples to their nearest center.
Array indicating the nearest centroid to each sample.