oraclesai.clustering

class AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward', distance_threshold=None, n_jobs=None, spatial_weights_definition=None)

Agglomerative Clustering Algorithm. Each observation starts in its own cluster; then, the two closest clusters are merged to form one cluster; the process is repeated until a stopping condition is met or until one cluster remains. By defining spatial weights, the algorithm executes Regionalization, including a spatial constraint that causes elements of the same cluster to be geographically connected.

Parameters:
  • n_clusters – int, default=2. The number of clusters to form

  • metric – str or callable, default=”euclidean”. The metric to use when calculating the distance between observations.

  • linkage – {‘ward’, ‘complete’, ‘average’, ‘single’}, default=’ward’. Determines which linkage criterion to use, that is, how the distance between two clusters is measured. The algorithm merges the pair of clusters that minimizes this criterion. ‘ward’ minimizes the variance of the clusters being merged. ‘average’ uses the average of the distances between each observation of the two clusters. ‘complete’ uses the maximum distance between all observations of the two clusters. ‘single’ uses the minimum distance between all observations of the two clusters.

  • distance_threshold – float, default=None. The linkage distance threshold. If not None, then n_clusters must be None

  • n_jobs – int, default=None. The number of parallel jobs to run

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. Defines the criteria used to identify neighbors, for example, KNNWeightsDefinition, DistanceBandWeightsDefinition, etc.
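The four linkage options can be compared on a small example. The following is a minimal sketch in plain Python, not the library's implementation; the helper names (pairwise, single_linkage, complete_linkage, average_linkage, ward_increase) are hypothetical, and Euclidean distance is assumed throughout.

```python
import math
from itertools import product


def pairwise(a, b):
    """All Euclidean distances between the points of cluster a and cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    # 'single': minimum distance between observations of the two clusters.
    return min(pairwise(a, b))


def complete_linkage(a, b):
    # 'complete': maximum distance between observations of the two clusters.
    return max(pairwise(a, b))


def average_linkage(a, b):
    # 'average': mean of all between-cluster distances.
    d = pairwise(a, b)
    return sum(d) / len(d)


def ward_increase(a, b):
    # 'ward': increase in within-cluster sum of squares caused by merging a and b.
    ca = [sum(c) / len(a) for c in zip(*a)]
    cb = [sum(c) / len(b) for c in zip(*b)]
    return len(a) * len(b) / (len(a) + len(b)) * math.dist(ca, cb) ** 2


cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(ward_increase(cluster_a, cluster_b))
```

At every merge step the pair of clusters with the smallest value of the chosen criterion is the one that gets merged.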

fit(X, y=None, geometries=None, spatial_weights=None, crs=None)

Initially, every observation is assigned to its own cluster. The two closest clusters, according to the linkage parameter, are then merged, and this is repeated until the number of clusters equals n_clusters or until the distance between the two nearest clusters is greater than distance_threshold.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

self. Fitted estimator.
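The merge loop and its two stopping conditions can be sketched as follows. This is an illustrative, naive version that assumes single linkage and Euclidean distance and ignores spatial weights; the function name agglomerate is hypothetical and is not part of oraclesai.

```python
import math


def agglomerate(points, n_clusters=2, distance_threshold=None):
    """Naive single-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]  # every observation starts alone

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Stopping condition 1: the requested number of clusters is reached.
        if distance_threshold is None and len(clusters) <= n_clusters:
            break
        # Find the closest pair of clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Stopping condition 2: the nearest clusters are farther than the threshold.
        if distance_threshold is not None and d > distance_threshold:
            break
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(points, n_clusters=2))                           # [0, 0, 0, 0, 1]
print(agglomerate(points, n_clusters=None, distance_threshold=2))  # [0, 0, 1, 1, 2]
```

With a distance threshold the number of clusters is a result of the run rather than an input, which is why n_clusters is left as None in the second call.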

fit_predict(X, y=None, geometries=None, spatial_weights=None, crs=None)

Trains the clustering model and returns the labels assigned to each observation.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

The labels associated with each observation.

property isoperimetric_quotient_

The Isoperimetric quotient (IPQ) for the resulting clusters. It measures the “compactness” of a shape, where more compact shapes have an IPQ closer to 1.

property labels_

Array indicating the cluster associated with each sample.

property n_clusters_

The number of clusters.

property silhouette_score_

The Silhouette score for the resulting clusters. It measures how well each observation lies within its cluster. The best value is 1, and the worst value is -1. Values near 0 indicate overlapping clusters.
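To make the range of the score concrete, here is a minimal sketch of the usual silhouette definition: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the sample's coefficient is (b - a) / max(a, b). The function below is a hypothetical helper written for illustration, assuming Euclidean distance; it is not how the property is computed internally.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other sample, grouped by cluster label.
        by_label = {}
        for j, q in enumerate(points):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            # Singleton clusters score 0 by convention; the score is only
            # meaningful when there are at least two clusters.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                              # cohesion
        b = min(sum(d) / len(d) for d in by_label.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that cut across the groups
print(good, bad)
```

The well-separated labeling scores close to 1, while the labeling that mixes the two groups produces a negative score.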

class DBScanClustering(eps=None, min_samples=2, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, spatial_weights_definition=None, use_spatial_weights_distances=True)

DBSCAN is a density-based clustering technique capable of finding clusters of different shapes and sizes from a large amount of data. This algorithm doesn’t require the number of clusters as a parameter. The algorithm starts at any point; if at least min_samples points are within a radius of eps, then all the points in the neighborhood are considered part of the same cluster. Then the process is repeated for all the points in the neighborhood. There are three types of points:

  • Core Point

    Has at least min_samples points in its neighborhood, that is, within the radius eps.

  • Border Point

    It is reachable from a core point, but has fewer than min_samples points within its neighborhood.

  • Noise Point

    It is neither a core point nor a border point; it is a point that is not reachable from any core points.
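The three point types can be illustrated with a short sketch. It assumes Euclidean distance and counts a point as part of its own neighborhood (the convention scikit-learn follows); classify_points is a hypothetical helper, not a function of this module.

```python
import math


def classify_points(points, eps, min_samples):
    """Label every point as 'core', 'border' or 'noise'."""
    # Neighborhood of each point: indices within eps, the point itself included.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_samples}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
print(classify_points(points, eps=1.0, min_samples=3))
# ['core', 'border', 'core', 'border', 'noise']
```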

Regionalization is used to provide a spatial context to the DBSCAN algorithm. This way, observations of the same cluster are similar not only in their attributes, but also in their spatial location.

Parameters:
  • eps – float, default=None. The maximum distance between two samples for one to be considered as in the neighborhood of the other. If it is None, the K-Distance method is used to estimate the best value for eps.

  • min_samples – int, default=None. The number of samples in a neighborhood for a point to be considered as a core point. If it is None, it is estimated using the number of features in the data.

  • metric – str, or callable, default=’euclidean’. The metric used to calculate the distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances. If metric is ‘precomputed’, X is assumed to be the distance matrix and must be square.

  • metric_params – dict, default=None. Additional arguments for the metric function

  • algorithm – {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.

  • leaf_size – int, default=30. Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • p – float, default=None. The power of the Minkowski metric to be used to calculate distance between points. If None, then p=2 (equivalent to the Euclidean distance).

  • n_jobs – int, default=None. The number of parallel jobs to run

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. Defines the criteria used to identify neighbors, for example, KNNWeightsDefinition, DistanceBandWeightsDefinition, etc.

  • use_spatial_weights_distances – bool, default=True. If True, the spatial weights matrix is used as the distance. If False, the distance to all neighbors is set to zero.
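The documentation does not spell out how the K-Distance method picks eps, so the following is only a sketch of one common variant: compute the distance from every point to its k-th nearest neighbor, sort those distances, and take the value at the "knee" of the resulting curve. The knee heuristic used here (the point farthest from the chord joining the ends of the curve) and both function names are assumptions made for illustration.

```python
import math


def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


def knee(values):
    """Value at the point of the sorted curve farthest from the chord joining its ends."""
    n = len(values)
    x0, y0, x1, y1 = 0, values[0], n - 1, values[-1]

    def gap(i):
        # Unnormalized perpendicular distance from (i, values[i]) to the chord.
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (values[i] - y0))

    return values[max(range(n), key=gap)]


# Two dense squares plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
eps_guess = knee(k_distances(points, k=2))
print(eps_guess)  # 1.0
```

The isolated point contributes a very large k-distance, so the knee falls at the last "dense" value, which is a reasonable radius for the two squares.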

property eps_

Maximum distance between two samples for one to be considered a neighbor of the other.

fit(X, y=None, geometries=None, spatial_weights=None, crs=None)

Fits a DBSCAN model with the given data and parameters; in case spatial weights were defined, Regionalization is executed, causing elements of the same cluster to be geographically connected.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).

fit_predict(X, y=None, geometries=None, spatial_weights=None, crs=None)

Trains the clustering model and returns the labels assigned to each observation.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

The labels associated with each observation.

property isoperimetric_quotient_

The isoperimetric quotient (IPQ) for the resulting clusters. This is a metric for geographical coherence. It compares the area of a region to the area of a circle with the same perimeter as the region. For this metric, compact shapes have a value closer to 1, whereas extended or thin shapes have a value closer to zero.
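A circle with perimeter P has area P² / (4π), so for a region with area A the quotient described above is 4πA / P². The sketch below evaluates it for simple polygons given as vertex lists; polygon_ipq is a hypothetical helper, and how the library aggregates the value over the clusters is not covered here.

```python
import math


def polygon_ipq(vertices):
    """Isoperimetric quotient 4*pi*A / P**2 of a simple polygon."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0           # shoelace formula
        perimeter += math.dist((x0, y0), (x1, y1))
    area = abs(twice_area) / 2
    return 4 * math.pi * area / perimeter ** 2


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]  # same area, very elongated
print(polygon_ipq(square))  # pi / 4, about 0.785
print(polygon_ipq(sliver))  # about 0.031
```

Both polygons have area 1, but the thin rectangle needs a much longer boundary to enclose it, which pushes its quotient toward zero.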

property labels_

Array indicating the cluster associated with each sample.

property min_samples_

The number of samples in a neighborhood for a point to be considered as a core point.

property silhouette_score_

The Silhouette score for the resulting clusters. It measures how well each observation lies within its cluster. The best value is 1, and the worst value is -1. Values near 0 indicate overlapping clusters.

class KMeansClustering(n_clusters=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='auto', init_method='elbow', n_jobs=None)

K-Means Clustering Algorithm. The algorithm is based on centroids: each observation is associated with the nearest centroid, and each centroid is then recomputed as the average of all the observations associated with it. The algorithm stops when a maximum number of iterations is reached, or when the location of the centroids doesn’t change from the previous iteration. If the number of clusters is not provided, the algorithm uses the method in the parameter init_method to estimate it. Regionalization is not supported, so elements of the same cluster can be geographically disconnected.
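The assignment and update steps can be sketched in a few lines. This is a simplified illustration with explicit initial centers and an absolute tolerance on the squared centroid shift, not the class's actual implementation; the function names are hypothetical.

```python
import math


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def kmeans(points, centers, max_iter=300, tol=1e-4):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)                 # assignment step
        new_centers = []
        for k, old in enumerate(centers):                # update step
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)                  # keep an empty cluster's center
        shift = sum(math.dist(a, b) ** 2 for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:                                 # centroids stopped moving
            break
    labels = assign(points, centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, inertia


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centers, inertia = kmeans(points, centers=[(0, 0), (10, 0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
print(inertia)  # 4.0
```

The three returned values correspond in spirit to the labels_, cluster_centers_ and inertia_ properties documented for this class.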

Parameters:
  • n_clusters – int, default=None. The number of clusters to form as well as the number of centroids to generate

  • init – {“k-means++”, “random”}, default=”k-means++”. Method for cluster initialization. If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. If a callable function is passed, it should take arguments X, n_clusters and random state, and return an initialization.

  • n_init – int, default=10. Number of times k-means will run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia

  • max_iter – int, default=300. Maximum number of iterations of the k-means algorithm for a single run.

  • tol – float, default=1e-4. Relative tolerance according to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

  • verbose – int, default=0. Verbosity mode.

  • random_state – int, RandomState instance or None, default=None. Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • copy_x – bool, default=True. If True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x=False. If original data is sparse, but not in CSR format, a copy will be made even if copy_x=False.

  • algorithm – {“auto”, “full”, “elkan”}, default=”auto”. K-means algorithm to use. The classical EM-style is “full”. The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However, it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

  • init_method – {“elbow”, “silhouette”}, default=”elbow”. The method used to estimate the number of clusters, used only when n_clusters is not defined.

  • n_jobs – int, default=None. The maximum number of concurrently running jobs. None is a marker for ‘unset’ that will be interpreted as n_jobs=1.

property cluster_centers_

Coordinates of cluster centers.

fit(X, y=None, geometries=None, spatial_weights=None, crs=None)

If the number of clusters is not specified, it is estimated with the method given by the init_method parameter. The K-Means algorithm is then executed to estimate the location of the centroids.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – Not used, present for API consistency.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. Only used when X is a numpy array. It is ignored when CRS information is available in X (i.e. a SpatialDataFrame or GeoDataFrame).

Returns:

self. Fitted estimator.

fit_predict(X, y=None, geometries=None, spatial_weights=None, crs=None)

Trains the clustering model and returns the labels assigned to each observation.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present for API consistency.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – Not used, present for API consistency.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

The labels associated with each observation.

property inertia_

The sum of squared distances of samples to their nearest center.

property labels_

Array indicating the nearest centroid to each sample.

class LISAHotspotClustering(column=None, spatial_weights_definition=None, max_p_value=None, supported_quadrants=None, seed=None, n_jobs=1)

Hotspot clustering implementation. Identifies spatial clusters of features with high or low values, as well as spatial outliers. For each sample it calculates the local Moran’s I, a p-value, and a label representing the cluster type. The p-value represents the statistical significance of the Moran’s I. There are four different labels.

  • 1 (High-High). A high value surrounded by high values, also called hot spots.

  • 2 (Low-High). A low value surrounded by high values.

  • 3 (Low-Low). A low value surrounded by low values, also called cold spots.

  • 4 (High-Low). A high value surrounded by low values.
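The quadrant labels follow from two signs: whether a sample's value is above or below the mean, and whether the average of its neighbors' values is above or below the mean. The sketch below uses one common form of the local Moran's I with row-standardized weights; it omits the permutation-based p-values, and the function name and the neighbor-list input format are assumptions made for illustration.

```python
def local_moran(values, neighbors):
    """Local Moran's I and quadrant label (1=HH, 2=LH, 3=LL, 4=HL) per sample.

    neighbors[i] lists the indices of the neighbors of sample i; every
    neighbor gets the same weight (row-standardized weights).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                 # deviations from the mean
    m2 = sum(d * d for d in z) / n                 # second moment
    lag = [sum(z[j] for j in nb) / len(nb) for nb in neighbors]  # spatial lag
    local_i = [z[i] * lag[i] / m2 for i in range(n)]

    def quadrant(zi, li):
        if zi > 0:
            return 1 if li > 0 else 4
        return 2 if li > 0 else 3

    labels = [quadrant(z[i], lag[i]) for i in range(n)]
    return local_i, labels


# Six locations on a line; each one neighbors the adjacent locations.
values = [10, 9, 8, 1, 2, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
local_i, labels = local_moran(values, neighbors)
print(labels)  # [1, 1, 4, 3, 3, 3]
```

Samples in quadrants 1 and 3 have a positive local I (similar to their surroundings), while samples in quadrants 2 and 4 have a negative one, which is why they are treated as spatial outliers.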

Parameters:
  • column – int, default=None. The column that will be used to compute local correlations. In case of None, a single column in X is expected to fit the model.

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. Defines the criteria used to identify neighbors, for example, KNNWeightsDefinition, DistanceBandWeightsDefinition, etc.

  • max_p_value – float, default=None. Used to label only regions with a p-value below a certain value.

  • supported_quadrants – list of integers, default=None. Only observations from these quadrants will be labeled. Values indicate quadrant location, 1 (High-High), 2 (Low-High), 3 (Low-Low), 4 (High-Low).

  • seed – int, default=None. Seed to ensure reproducibility of conditional randomizations.

  • n_jobs – int, default=None. The maximum number of concurrently running jobs. None is a marker for ‘unset’ that will be interpreted as n_jobs=1.

property Is

Array with the Local Moran’s I for each sample.

property coldspots_

An array of integers representing coldspots. If the i-th observation has a label different from 3, then the i-th element of the array is set to -1. If no labels have been assigned, it returns None.

fit(X, y=None, geometries=None, spatial_weights=None, column_map=None, crs=None)

Calculates the local auto-correlation index based on the column specified. If the data contains a single column, that column is used as reference. Labels and regions are then calculated using the results from the local auto-correlation test.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention.

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • column_map – Dictionary, default=None. A dictionary with column key-value pairs indicating the column name and the column index.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

self. Fitted estimator.

fit_predict(X, y=None, geometries=None, spatial_weights=None, column_map=None, crs=None)

Trains the clustering model and returns the labels assigned to each observation.

Parameters:
  • X – {numpy array, geopandas dataframe, spatial dataframe} of shape (n_samples, n_features). Training instances to cluster.

  • y – Ignored. Not used, present here for API consistency by convention

  • geometries – shapely array, default=None. Geometry data for each sample in X.

  • spatial_weights – SpatialWeights, default=None. A spatial weights matrix.

  • column_map – Dictionary, default=None. A dictionary with column key-value pairs indicating the column name and the column index.

  • crs – pyproj.crs.CRS, default=None. Coordinate reference system. In case of a geodataframe or spatial dataframe, the crs is obtained from the data; otherwise, it can be passed as a parameter.

Returns:

The labels associated with each observation.

property hotspots_

An array of integers representing hotspots. If the i-th observation has a label different from 1, then the i-th element of the array is set to -1. If no labels have been assigned, it returns None.

property labels_

Array of integers indicating the quadrant location for each sample. Values indicate quadrant location, 1 (High-High), 2 (Low-High), 3 (Low-Low), 4 (High-Low).

property outliers_

An array of integers representing outliers. If the i-th observation has a label different from 2 or 4, then the i-th element of the array is set to -1. If no labels have been assigned, it returns None.

property ps

Array with p-values for each sample.

property regions_

Dictionary with quadrants as keys and all contiguous regions formed by samples from the corresponding quadrant as values.
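Conceptually, a contiguous region is a connected component of the neighbor graph restricted to samples that share a quadrant. A minimal sketch of that idea follows; the function name, the neighbor-list input, and the use of -1 for unlabeled samples are assumptions for illustration, and the exact structure of the values stored in regions_ may differ.

```python
def contiguous_regions(labels, neighbors):
    """Group samples into connected components of equal label.

    Returns {label: [sorted index lists]}; samples labeled -1 are skipped.
    """
    regions = {}
    seen = set()
    for start, lab in enumerate(labels):
        if lab == -1 or start in seen:
            continue
        seen.add(start)
        stack, region = [start], []
        while stack:  # depth-first walk over same-label neighbors
            i = stack.pop()
            region.append(i)
            for j in neighbors[i]:
                if j not in seen and labels[j] == lab:
                    seen.add(j)
                    stack.append(j)
        regions.setdefault(lab, []).append(sorted(region))
    return regions


# Seven locations on a line; each one neighbors the adjacent locations.
labels = [1, 1, 4, 3, 3, 3, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
print(contiguous_regions(labels, neighbors))
# {1: [[0, 1], [6]], 4: [[2]], 3: [[3, 4, 5]]}
```

Quadrant 1 yields two separate regions here because samples 0 and 1 are not connected to sample 6 through other quadrant-1 samples.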

property z_values

Array with z-values for each sample.