Agglomerative with Regionalization

Agglomerative clustering performs hierarchical clustering using a bottom-up approach.

Agglomerative clustering starts with one cluster for each observation. In each iteration, the two closest clusters are merged. The algorithm continues until one of the following stopping criteria is met:

  • The number of clusters reaches a specified value.
  • The distance between the two closest clusters is larger than a specified threshold.
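The merge loop and its two stopping criteria can be sketched in plain Python. This is a minimal illustration, not the Oracle Spatial AI implementation: the function names agglomerate and single_linkage are hypothetical, and single linkage (the minimum point-to-point distance) is assumed only to keep the example short.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Smallest point-to-point distance between clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=None, distance_threshold=None):
    """Naive bottom-up clustering with the two stopping criteria."""
    # Start with one cluster per observation.
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Criterion 1: the requested number of clusters has been reached.
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        # Find the indices (i < j) of the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        gap = single_linkage(clusters[i], clusters[j])
        # Criterion 2: even the closest pair is farther apart than the threshold.
        if distance_threshold is not None and gap > distance_threshold:
            break
        # Merge cluster j into cluster i.
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
by_count = agglomerate(points, n_clusters=2)
by_threshold = agglomerate(points, distance_threshold=2.0)
print(len(by_count), len(by_threshold))
```

With n_clusters=2 the two nearby pairs are eventually merged into one cluster and the outlier stays alone. With distance_threshold=2.0 merging stops earlier, leaving three clusters.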

Standard agglomerative clustering does not take the spatial location of the observations into account. When the algorithm is applied to spatial data, the data points of a cluster often end up dispersed across different spatial regions. Regionalization provides spatial context to the agglomerative algorithm. By defining spatial weights, agglomerative clustering with regionalization adds a spatial constraint to the algorithm, so that elements of the same cluster share common characteristics and are also geographically connected.
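The effect of the spatial constraint can be illustrated with a small, self-contained sketch. It is not the library's algorithm: the helper names knn_graph and regionalized_clusters are hypothetical, the data is made up, and clusters are compared by the difference of their mean value on a single feature. The only point it makes is that two clusters may merge only when the neighbor graph links them.

```python
from math import dist


def knn_graph(coords, k):
    """Symmetric k-nearest-neighbor adjacency built from coordinates."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist(coords[i], coords[j]))[:k]
        for j in nearest:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def regionalized_clusters(features, adj, n_clusters):
    """Merge the closest pair of adjacent clusters until n_clusters remain."""
    clusters = [{i} for i in range(len(features))]

    def mean(c):
        return sum(features[i] for i in c) / len(c)

    def touching(a, b):
        # True if any member of a has a graph neighbor inside b.
        return any(adj[i] & b for i in a)

    while len(clusters) > n_clusters:
        candidates = [(abs(mean(a) - mean(b)), x, y)
                      for x, a in enumerate(clusters)
                      for y, b in enumerate(clusters)
                      if x < y and touching(a, b)]
        if not candidates:  # disconnected graph: nothing left to merge
            break
        _, x, y = min(candidates)
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters


# Six locations along a line; both ends have similar feature values.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
features = [10, 11, 50, 51, 10, 11]

constrained = regionalized_clusters(features, knn_graph(coords, k=1), 2)
# Fully connected graph: equivalent to having no spatial constraint.
everyone = {i: set(range(len(coords))) - {i} for i in range(len(coords))}
unconstrained = regionalized_clusters(features, everyone, 2)
print(constrained, unconstrained)
```

Without the constraint, locations 0, 1, 4, and 5 form one cluster even though they sit at opposite ends of the line. With the neighbor graph, every cluster is a contiguous run of locations.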

See the AgglomerativeClustering class in the Python API Reference for Oracle Spatial AI for more information.

The following table describes the main parameters of the AgglomerativeClustering class.

Parameter                   Description
n_clusters                  The algorithm stops when it reaches the specified number of clusters.
distance_threshold          The algorithm stops when the distance between the two closest clusters is greater than this value. If this parameter is set, n_clusters must be set to None.
spatial_weights_definition  Defines the relationship between neighboring locations, which is required to retrieve information from the neighbors. Only KNN and DistanceBand weights are supported.
linkage                     The strategy used to measure the distance between two clusters. The options are:
                              • ward: Minimizes the variance of the clusters being merged.
                              • average: The average of the distances between all pairs of points from two distinct clusters.
                              • complete: The maximum distance between a pair of points from two distinct clusters.
                              • single: The minimum distance between a pair of points from two distinct clusters.
affinity                    The metric used to compute the distance.
n_jobs                      The maximum number of concurrently running jobs.
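To make the linkage options concrete, the following sketch computes the four cluster-to-cluster distances for two tiny clusters. It uses the common textbook definitions, which are assumed rather than taken from the Oracle Spatial AI source, and the points are made up. Ward is expressed as the increase in within-cluster sum of squares caused by the merge.

```python
from math import dist

# Two small, hypothetical clusters of 2-D points.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


# All point-to-point distances between the two clusters.
pairwise = [dist(p, q) for p in a for q in b]

single = min(pairwise)                   # closest pair of points
complete = max(pairwise)                 # farthest pair of points
average = sum(pairwise) / len(pairwise)  # mean over all pairs
# Ward: growth in within-cluster sum of squares if a and b are merged.
ward = (len(a) * len(b) / (len(a) + len(b))
        * dist(centroid(a), centroid(b)) ** 2)

print(f"single={single:.3f} complete={complete:.3f} "
      f"average={average:.3f} ward={ward:.3f}")
```

For these points the single, average, and complete values are ordered from smallest to largest, which is always the case because they are the minimum, mean, and maximum of the same set of pairwise distances.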

The following code uses the block_groups SpatialDataFrame and AgglomerativeClustering to identify locations sharing common characteristics according to certain features. It uses regionalization to keep the clusters geographically connected.

from oraclesai.weights import KNNWeightsDefinition 
from oraclesai.clustering import AgglomerativeClustering 
from oraclesai.pipeline import SpatialPipeline 
from sklearn.preprocessing import StandardScaler 

# Define training features 
X = block_groups[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'geometry']] 

# Reproject to a projected coordinate system (EPSG:3857) so that distances are computed in meters. 
X = X.to_crs('epsg:3857') 

# Create an instance defining stopping criteria and spatial weights 
reg_agglomerative = AgglomerativeClustering(n_clusters=6, spatial_weights_definition=KNNWeightsDefinition(k=5)) 

# Create a spatial pipeline with preprocessing and clustering steps. 
agglomerative_pipeline = SpatialPipeline([('scale', StandardScaler()), ('clustering', reg_agglomerative)]) 

# Train the model 
agglomerative_pipeline.fit(X) 

# Print the labels associated with each observation 
print(f"labels = {agglomerative_pipeline.named_steps['clustering'].labels_[:20]}")

The output shows the labels assigned to the first 20 observations of the data.

labels = [1 4 4 4 4 0 0 0 1 1 1 1 1 1 1 1 0 2 0 2]
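As a quick sanity check, the labels can be summarized by cluster size. The sketch below uses only the 20 labels printed above, copied into a plain list. In practice the same idea would presumably be applied to the full labels_ array from the fitted clustering step.

```python
from collections import Counter

# The first 20 labels, copied from the output above.
labels = [1, 4, 4, 4, 4, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 0, 2]

# Count how many observations fall into each cluster.
sizes = Counter(labels)
for label, count in sorted(sizes.items()):
    print(f"cluster {label}: {count} observations")
```

Clusters that do not appear in this summary (here, labels 3 and 5) simply have no members among the first 20 observations.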