Agglomerative with Regionalization
Agglomerative clustering builds a hierarchy of clusters using a bottom-up approach.
Initially, each observation forms its own cluster. In each iteration, the two closest clusters are merged. The algorithm continues until one of the following stopping criteria is met:
- The specified number of clusters is reached.
- The distance between the two closest clusters is larger than a certain threshold.
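The loop described above can be sketched in a few lines of plain Python. This is a naive illustration only, not the implementation used by the library: the function names `agglomerate` and `single_linkage` are hypothetical, and the choice of single linkage with Euclidean distance is an assumption made to keep the example short.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the smallest pairwise point distance."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=None, distance_threshold=None):
    """Naive bottom-up clustering with the two stopping criteria (illustrative)."""
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters or distance_threshold")

    # Start with one cluster per observation.
    clusters = [[p] for p in points]

    while len(clusters) > 1:
        # Stopping criterion 1: the requested number of clusters is reached.
        if n_clusters is not None and len(clusters) <= n_clusters:
            break

        # Find the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        gap = single_linkage(clusters[i], clusters[j])

        # Stopping criterion 2: the closest pair is farther apart than the threshold.
        if distance_threshold is not None and gap > distance_threshold:
            break

        # Merge cluster j into cluster i (i < j, so index i stays valid after pop).
        clusters[i] = clusters[i] + clusters.pop(j)

    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
by_count = agglomerate(points, n_clusters=3)
by_threshold = agglomerate(points, distance_threshold=2.0)
```

With these sample points, both calls stop at the same three clusters: the two tight pairs and the isolated point at (10, 0).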
Standard agglomerative clustering does not fully consider the spatial location of the observations. When this algorithm is applied to spatial data, it often produces clusters whose data points are dispersed across spatial regions. Regionalization provides spatial context to the agglomerative algorithm. By defining spatial weights, agglomerative clustering with regionalization adds a spatial constraint to the algorithm, so that elements of the same cluster share common characteristics and are geographically connected.
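To make the effect of the spatial constraint concrete, the following sketch restricts merges to clusters that are connected in a k-nearest-neighbor graph built from the coordinates. It is an illustration under stated assumptions, not the library's algorithm: `knn_graph` and `constrained_agglomerate` are hypothetical helpers, each location carries a single numeric attribute, and the distance between clusters is taken as the difference of their attribute means.

```python
from itertools import combinations
from math import dist


def knn_graph(coords, k):
    """Symmetric k-nearest-neighbor adjacency built from coordinates."""
    neighbors = {i: set() for i in range(len(coords))}
    for i, p in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: dist(p, coords[j]),
        )
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def constrained_agglomerate(coords, values, n_clusters, k=1):
    """Merge the most similar clusters, but only if they are spatial neighbors."""
    neighbors = knn_graph(coords, k)
    clusters = [{i} for i in range(len(coords))]

    def connected(a, b):
        # Two clusters are connected if any member of a has a neighbor in b.
        return any(neighbors[i] & b for i in a)

    def mean(c):
        return sum(values[i] for i in c) / len(c)

    while len(clusters) > n_clusters:
        candidates = [
            (a, b)
            for a, b in combinations(range(len(clusters)), 2)
            if connected(clusters[a], clusters[b])
        ]
        if not candidates:
            break  # No connected pair is left to merge.
        a, b = min(
            candidates,
            key=lambda ab: abs(mean(clusters[ab[0]]) - mean(clusters[ab[1]])),
        )
        clusters[a] = clusters[a] | clusters.pop(b)

    return sorted(sorted(c) for c in clusters)


# Four locations on a line; locations 0 and 2 have similar values but are not neighbors.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
values = [10.0, 50.0, 11.0, 51.0]

constrained = constrained_agglomerate(coords, values, n_clusters=2, k=1)
unconstrained = constrained_agglomerate(coords, values, n_clusters=2, k=3)
```

With `k=1` only adjacent locations may merge, so the result is `[[0, 1], [2, 3]]`. With `k=3` every location neighbors every other, the constraint has no effect, and the clusters follow attribute similarity alone: `[[0, 2], [1, 3]]`.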
See the AgglomerativeClustering class in the Python API Reference for Oracle Spatial AI for more information.
The following table describes some of the parameters of the AgglomerativeClustering class.

| Parameter | Description |
|---|---|
| n_clusters | The algorithm stops when it reaches the specified number of clusters. |
| distance_threshold | The algorithm stops if the distance between the two closest clusters is greater than this value. If this parameter is defined, n_clusters must be set to None. |
| spatial_weights_definition | Defines the relationship between neighboring locations. It is required to retrieve information from the neighbors. Only KNN and DistanceBand weights are supported. |
| linkage | The strategy used to determine the distance between two clusters. See the API reference for the available options. |
| affinity | The metric used to compute the distance. |
| n_jobs | The maximum number of concurrently running jobs. |
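As background for the linkage and affinity parameters, the sketch below shows how three linkage strategies that are common in the clustering literature (single, complete, and average) turn point-to-point distances into a cluster-to-cluster distance, and how swapping the metric changes the result. The function names are hypothetical, and these strategies are shown for illustration only; they are not a statement of which options the AgglomerativeClustering class accepts.

```python
from math import dist


def pairwise(a, b, metric=dist):
    """All point-to-point distances between clusters a and b."""
    return [metric(p, q) for p in a for q in b]


def single_linkage(a, b, metric=dist):
    """Distance between the closest pair of points."""
    return min(pairwise(a, b, metric))


def complete_linkage(a, b, metric=dist):
    """Distance between the farthest pair of points."""
    return max(pairwise(a, b, metric))


def average_linkage(a, b, metric=dist):
    """Mean of all pairwise distances."""
    d = pairwise(a, b, metric)
    return sum(d) / len(d)


def manhattan(p, q):
    """An alternative metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

distances = {
    "single": single_linkage(cluster_a, cluster_b),
    "complete": complete_linkage(cluster_a, cluster_b),
    "average": average_linkage(cluster_a, cluster_b),
}

# The same linkage with two different metrics.
euclidean_gap = single_linkage([(0.0, 0.0)], [(3.0, 4.0)])
manhattan_gap = single_linkage([(0.0, 0.0)], [(3.0, 4.0)], metric=manhattan)
```

For the two sample clusters, the pairwise distances are 3, 5, 2, and 4, so single linkage gives 2, complete linkage gives 5, and average linkage gives 3.5. The choice matters because it decides which pair of clusters counts as "closest" in each iteration.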
The following code uses the block_groups SpatialDataFrame and AgglomerativeClustering to identify locations sharing common characteristics according to certain features. It uses regionalization to keep the clusters geographically connected.
```python
from oraclesai.weights import KNNWeightsDefinition
from oraclesai.clustering import AgglomerativeClustering
from oraclesai.pipeline import SpatialPipeline
from sklearn.preprocessing import StandardScaler

# Define training features
X = block_groups[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'geometry']]

# Reproject to a projected coordinate system (EPSG:3857) to calculate distances.
X = X.to_crs('epsg:3857')

# Create an instance defining stopping criteria and spatial weights
reg_agglomerative = AgglomerativeClustering(n_clusters=6, spatial_weights_definition=KNNWeightsDefinition(k=5))

# Create a spatial pipeline with preprocessing and clustering steps.
agglomerative_pipeline = SpatialPipeline([('scale', StandardScaler()), ('clustering', reg_agglomerative)])

# Train the model
agglomerative_pipeline.fit(X)

# Print the labels associated with the first 20 observations
print(f"labels = {agglomerative_pipeline.named_steps['clustering'].labels_[:20]}")
```

The output shows the labels assigned to the first 20 observations of the data.

```
labels = [1 4 4 4 4 0 0 0 1 1 1 1 1 1 1 1 0 2 0 2]
```
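A quick way to inspect the result is to count how many observations fall into each cluster. The sketch below does this for the 20 labels printed above, copied into a plain Python list. The variable names are hypothetical; with the real model, the labels would come from the labels_ attribute, and converting them to a list first is an assumption about its type.

```python
from collections import Counter

# The first 20 labels from the output above, copied as a plain list.
first_labels = [1, 4, 4, 4, 4, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 0, 2]

# Count the observations assigned to each cluster label.
cluster_sizes = Counter(first_labels)

for label, size in sorted(cluster_sizes.items()):
    print(f"cluster {label}: {size} observations")
```

Only four of the six requested clusters appear among these 20 observations, which is expected for a small sample of the data.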