Categorical Lag Transformer

The categorical lag is used for categorical variables and represents the most common value in the neighborhood.

For example, given a feature representing a property type (such as house, apartment, townhouse, and so on), the categorical lag is the most common property in the surroundings.
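The idea can be illustrated with a minimal, library-independent sketch. It is not the oraclesai implementation: the function name categorical_lag, the toy coordinates, and the choice of k nearest neighbors by Euclidean distance are all assumptions made for illustration only.

```python
from collections import Counter
from math import dist


def categorical_lag(points, labels, k):
    """Return, for each point, the most common label among its k nearest neighbors.

    Hypothetical helper for illustration; not part of oraclesai.
    """
    lags = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p (the point itself is excluded).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbor_labels = [labels[j] for j in others[:k]]
        # Keep the label with the highest count among the k neighbors.
        lags.append(Counter(neighbor_labels).most_common(1)[0][0])
    return lags


# Two small clusters of properties (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["house", "house", "apartment", "house",
          "apartment", "apartment", "townhouse", "apartment"]

lags = categorical_lag(points, labels, k=3)
print(lags)
```

In this toy data, the single apartment in the first cluster and the single townhouse in the second cluster both take the dominant property type of their surroundings.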

The categorical lag is also a feature engineering technique: the computed lag values can be used directly to train any machine learning model. The CategoricalLagTransformer class computes the categorical lag of the given training data and replaces each observation's value with the most common value in its neighborhood.

An instance of this class takes the spatial_weights_definition parameter, which defines the relationship between the neighboring observations.

The main methods of the class are described in the following table.

Method         Description
fit            Calculates the spatial weights of the training data using the algorithm associated with the spatial_weights_definition parameter and the geometry column.
transform      Returns the most common value among each location's neighbors. The use_fit_lag parameter controls whether the neighbors are taken from the training set or from the data passed into the transform method. The output is a NumPy array.
fit_transform  Calls the fit and transform methods in sequence with the training set.
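The difference between taking neighbors from the training set and taking them from the data passed to transform can be sketched with a toy stand-in. The ToyCategoricalLag class below is hypothetical and is not the oraclesai class: its signatures, the k-nearest-neighbor rule, and the way it interprets use_fit_lag are assumptions based only on the description in the table above.

```python
from collections import Counter
from math import dist


class ToyCategoricalLag:
    """Hypothetical stand-in for illustration; not the oraclesai API."""

    def __init__(self, k):
        self.k = k

    def fit(self, points, labels):
        # Remember the training locations and their categories.
        self.train_points = list(points)
        self.train_labels = list(labels)
        return self

    def transform(self, points, labels, use_fit_lag=False):
        points = list(points)
        if use_fit_lag:
            # Assumed meaning: look for neighbors in the training set.
            ref_points, ref_labels = self.train_points, self.train_labels
        else:
            # Assumed meaning: look for neighbors in the data passed in.
            ref_points, ref_labels = points, list(labels)
        lags = []
        for i, p in enumerate(points):
            # Exclude the observation itself only when it belongs to the reference set.
            candidates = [j for j in range(len(ref_points)) if use_fit_lag or j != i]
            candidates.sort(key=lambda j: dist(p, ref_points[j]))
            nearest = [ref_labels[j] for j in candidates[: self.k]]
            lags.append(Counter(nearest).most_common(1)[0][0])
        return lags


# Training data: a "Low" cluster near the origin and a "High" cluster far away.
train_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
train_labels = ["Low", "Low", "Low", "High", "High", "High"]

# New data: three "High" observations located inside the "Low" training cluster.
new_points = [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4)]
new_labels = ["High", "High", "High"]

toy = ToyCategoricalLag(k=2).fit(train_points, train_labels)
from_training = toy.transform(new_points, new_labels, use_fit_lag=True)
from_new_data = toy.transform(new_points, new_labels, use_fit_lag=False)
print(from_training, from_new_data)
```

Under these assumptions, the same new observations receive the lag of the surrounding training cluster in the first call and the lag of their own batch in the second, which is why the choice matters when scoring data that was not part of training.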

See the CategoricalLagTransformer class in Python API Reference for Oracle Spatial AI for more information.

The following example uses the block_groups SpatialDataFrame and the CategoricalLagTransformer class to replace each value of the INCOME_CLASS feature with the most common value among the corresponding neighbors.

The INCOME_CLASS column has four categories: High, Medium-High, Medium-Low, Low. These represent the income level for a specific observation. The target variable (MEDIAN_INCOME) and the geometry column are not part of the output.

from oraclesai.weights import KNNWeightsDefinition 
from oraclesai.preprocessing import CategoricalLagTransformer 
import pandas as pd 
 
# Create a categorical variable based on the median income quartiles
labels=['Low', 'Medium-Low', 'Medium-High', 'High'] 
block_groups_extended = block_groups.add_column("INCOME_CLASS", pd.qcut(block_groups["MEDIAN_INCOME"].values, [0, 0.25, 0.5, 0.75, 1], labels=labels).tolist()) 
 
# Define the variables of the training data
X = block_groups_extended[["MEDIAN_INCOME", "INCOME_CLASS", "geometry"]] 
print(f">> Original data:\n {X['INCOME_CLASS'].values[:10]}")
 
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=20) 
 
# Create an instance of CategoricalLagTransformer
categorical_lag_transformer = CategoricalLagTransformer(weights_definition) 
 
# Transforms the training data with the categorical lag 
X_categorical_lag = categorical_lag_transformer.fit_transform(X, y='MEDIAN_INCOME', geometries='geometry') 
 
# Displays the transformed data
print(f"\n>> Transformed data:\n {X_categorical_lag[:10, :]}")

The resulting output is a NumPy array with a single column, representing the categorical lag of the INCOME_CLASS column. Note that neither the target variable (MEDIAN_INCOME) nor the geometry column is part of the output.

>> Original data:
 ['Medium-Low' 'Medium-High' 'Medium-High' 'High' 'High' 'High' 'High'
 'High' 'Medium-High' 'Medium-Low']

>> Transformed data:
 [['High']
 ['High']
 ['High']
 ['High']
 ['High']
 ['High']
 ['High']
 ['Medium-High']
 ['Medium-High']
 ['Medium-High']]