Categorical Lag Transformer
The categorical lag is used for categorical variables and represents the most common value in the neighborhood.
For example, given a feature representing a property type (such as house, apartment, townhouse, and so on), the categorical lag is the most common property type among the neighboring observations.
The categorical lag also serves as a feature engineering method: the resulting values can be used directly to train a machine learning model. The
CategoricalLagTransformer class computes the categorical lag of the
given training data and replaces each observation's value with the most common value in its neighborhood.
An instance of this class takes the
spatial_weights_definition parameter, which defines the
relationship between neighboring observations.
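The idea can be illustrated without the library. The following is a minimal sketch, assuming a k-nearest-neighbors relationship on planar coordinates. The categorical_lag helper and the sample data are hypothetical and are not part of the oraclesai API.

```python
from collections import Counter
from math import dist


def categorical_lag(points, values, k):
    """Hypothetical helper: for each point, return the most common value
    among its k nearest neighbors (the point itself is excluded)."""
    lag = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first
        by_distance = sorted(
            (dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbor_values = [values[j] for _, j in by_distance[:k]]
        lag.append(Counter(neighbor_values).most_common(1)[0][0])
    return lag


# Two well-separated clusters of four properties each (made-up data)
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
property_type = ["house", "house", "house", "apartment",
                 "apartment", "apartment", "townhouse", "apartment"]

lag = categorical_lag(points, property_type, k=3)
print(lag)
```

In this sketch the lone apartment in the first cluster and the lone townhouse in the second are both replaced by the value that dominates their surroundings.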
The main methods of the class are described in the following table.
| Method | Description |
|---|---|
| fit | Calculates the spatial weights of the training data using the algorithm associated with the spatial_weights_definition parameter and the geometry column. |
| transform | Returns the most common value from each location's neighbors. By defining the use_fit_lag parameter, the method can use the neighbors from the training set, or the data passed into the transform method. The output is a NumPy array. |
| fit_transform | Calls the fit and transform methods in sequence with the training set. |
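The two neighbor sources described for the transform method can be sketched in plain Python. This is only one plausible reading of that behavior, not the library's implementation: the lag_from_reference helper, its skip_same_index argument, and the sample data are all hypothetical.

```python
from collections import Counter
from math import dist


def lag_from_reference(query_pts, ref_pts, ref_values, k, skip_same_index=False):
    """Hypothetical helper: most common value among the k reference
    points nearest to each query point."""
    result = []
    for qi, q in enumerate(query_pts):
        candidates = [
            (dist(q, r), ri)
            for ri, r in enumerate(ref_pts)
            # When query and reference are the same set, skip the point itself
            if not (skip_same_index and ri == qi)
        ]
        nearest = sorted(candidates)[:k]
        votes = Counter(ref_values[ri] for _, ri in nearest)
        result.append(votes.most_common(1)[0][0])
    return result


# Made-up training set and new observations
train_pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_vals = ["High", "High", "High", "Low"]
new_pts = [(0.2, 0.2), (5, 5), (5, 6), (6, 5)]
new_vals = ["Low", "Low", "Low", "High"]

# Neighbors taken from the training set
lag_train = lag_from_reference(new_pts, train_pts, train_vals, k=3)

# Neighbors taken from the new data itself
lag_new = lag_from_reference(new_pts, new_pts, new_vals, k=3, skip_same_index=True)

print(lag_train)
print(lag_new)
```

The same new observations receive different lags depending on which set supplies the neighbors, which is why the choice matters when transforming data that was not used in fit.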
See the CategoricalLagTransformer class in the Python API Reference for Oracle Spatial AI for more information.
The following example uses the block_groups
SpatialDataFrame and the CategoricalLagTransformer
class to replace each value of the INCOME_CLASS feature with
the most common value among the corresponding neighbors.
The INCOME_CLASS column has four categories:
High, Medium-High, Medium-Low, and Low. These represent the income
level of a specific observation. The target variable
(MEDIAN_INCOME) and the geometry column are
not part of the output.
from oraclesai.weights import KNNWeightsDefinition
from oraclesai.preprocessing import CategoricalLagTransformer
import pandas as pd
# Create a categorical variable based on the median income
labels=['Low', 'Medium-Low', 'Medium-High', 'High']
block_groups_extended = block_groups.add_column("INCOME_CLASS", pd.qcut(block_groups["MEDIAN_INCOME"].values, [0, 0.25, 0.5, 0.75, 1], labels=labels).tolist())
# Define the variables of the training data
X = block_groups_extended[["MEDIAN_INCOME", "INCOME_CLASS", "geometry"]]
print(f">> Original data:\n {X['INCOME_CLASS'].values[:10]}")
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=20)
# Create an instance of CategoricalLagTransformer
categorical_lag_transformer = CategoricalLagTransformer(weights_definition)
# Transforms the training data with the categorical lag
X_categorical_lag = categorical_lag_transformer.fit_transform(X, y='MEDIAN_INCOME', geometries='geometry')
# Displays the transformed data
print(f"\n>> Transformed data:\n {X_categorical_lag[:10, :]}")

The resulting output is a NumPy array with a single column, representing the
categorical lag of the INCOME_CLASS column. Note that both the
target variable (MEDIAN_INCOME) and the geometries are not part
of the output.
>> Original data:
['Medium-Low' 'Medium-High' 'Medium-High' 'High' 'High' 'High' 'High'
'High' 'Medium-High' 'Medium-Low']
>> Transformed data:
[['High']
['High']
['High']
['High']
['High']
['High']
['High']
['Medium-High']
['Medium-High']
['Medium-High']]
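The INCOME_CLASS column in the example is derived with pd.qcut, which assigns each value to a quantile-based bin. The following minimal sketch shows the same quartile binning on a small set of made-up income values. Only the labels and the quantile edges are taken from the example. The median_income list is an assumption for illustration.

```python
import pandas as pd

labels = ['Low', 'Medium-Low', 'Medium-High', 'High']

# Made-up income values, already sorted for readability
median_income = [10, 20, 30, 40, 50, 60, 70, 80]

# Quartile edges [0, 0.25, 0.5, 0.75, 1] place a quarter of the
# values in each class, from lowest to highest
income_class = pd.qcut(median_income, [0, 0.25, 0.5, 0.75, 1], labels=labels).tolist()

print(income_class)
```

Because the bins are quantile-based, each class holds roughly the same number of observations regardless of how the incomes are distributed.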