SLX Classifier
The Spatial Cross-Regressive (SLX) classification algorithm runs a logistic regression preceded by a feature engineering step that adds spatial context to the data.
The algorithm adds one or more columns containing the spatial lag of selected features, that is, the average of each feature over the neighboring observations.
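As an illustration of what a spatial lag column contains, the following minimal sketch averages a feature over each observation's k nearest neighbors using plain NumPy. It is not the oraclesai implementation: the function name `knn_spatial_lag`, the coordinates, and the values are hypothetical, and it assumes Euclidean distance with equal weights for all neighbors.

```python
import numpy as np

def knn_spatial_lag(coords, values, k):
    """Average `values` over each observation's k nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise Euclidean distances between all observations
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # An observation is not its own neighbor
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest observations for each row
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # With equal weights, the spatial lag is the plain mean over the neighbors
    return values[neighbors].mean(axis=1)

# Hypothetical data: four locations on a line and one feature
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
house_value = [100.0, 200.0, 300.0, 400.0]
lag = knn_spatial_lag(coords, house_value, k=2)
print(lag)  # [250. 200. 150. 250.]
```

Each entry of `lag` is the value that would be added as a new column next to the original feature, so the model sees both an observation's own value and the typical value around it.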
Using the SLXClassifier class requires defining the spatial weights with
the spatial_weights_definition parameter, which establishes the
interaction between neighboring observations.
For a multi-class classification problem, the algorithm uses the
one-vs-rest strategy, training one binary model for each class. One common issue with
this strategy is that it can result in an imbalanced dataset, where one class
has far more observations than the other. To handle this scenario, the
SLXClassifier class provides the following two oversampling
methods:
- Random. This method creates duplicates of random samples (with replacement) from the minority class.
- Synthetic Minority Oversampling Technique (SMOTE). This algorithm selects a random sample of the minority class, a, and from its k nearest neighbors, it selects a random neighbor, b. The vector ab is multiplied by a random number in the range [0, 1], and the result is added to sample a, generating a new synthetic instance. See [5] for more information on SMOTE.
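The two oversampling methods described above can be sketched with NumPy as follows. This is a simplified illustration, not the code used by SLXClassifier: the function names, the sample data, and the choice of Euclidean distance are assumptions. Note that `rng.random()` draws from [0, 1), which differs from the closed range [0, 1] only at the endpoint.

```python
import numpy as np

def random_oversample(minority, n_new, rng):
    """Duplicate randomly chosen minority rows (sampling with replacement)."""
    idx = rng.integers(0, len(minority), size=n_new)
    return minority[idx]

def smote_oversample(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating toward near neighbors."""
    # Pairwise distances within the minority class only
    diffs = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        a_idx = rng.integers(0, len(minority))   # random minority sample a
        b_idx = rng.choice(neighbors[a_idx])     # random neighbor b of a
        a, b = minority[a_idx], minority[b_idx]
        gap = rng.random()                       # random factor for vector ab
        synthetic[i] = a + gap * (b - a)         # new point on segment a-b
    return synthetic

rng = np.random.default_rng(0)
# Hypothetical minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
duplicates = random_oversample(minority, n_new=3, rng=rng)
synthetic = smote_oversample(minority, n_new=5, k=2, rng=rng)
```

Random oversampling only repeats existing rows, while the SMOTE sketch produces new points that lie on the segment between a sample and one of its neighbors.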
The following parameters specify the oversampling method and the number of new samples:
| Parameter | Description |
|---|---|
| balance_method | The oversampling method. The default value is None; the other options are random and smote. |
| balance_ratio | A number in the range [0, 1] representing the desired ratio of observations from the minority class. A value of 1 results in the same number of observations for both classes. |
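To make the effect of the ratio concrete, the sketch below counts how many new minority samples each one-vs-rest problem would need. It rests on an assumption that the source does not spell out: that the ratio is the target minority count divided by the majority count, which is consistent with a value of 1 giving equal class sizes. The helper `samples_to_add` and the label counts are hypothetical.

```python
import math
from collections import Counter

def samples_to_add(n_minority, n_majority, balance_ratio):
    """New minority samples needed so that minority / majority >= balance_ratio.

    Assumption: the ratio is interpreted as minority count over majority count.
    """
    target = math.ceil(balance_ratio * n_majority)
    return max(0, target - n_minority)

# Hypothetical label distribution for a four-class problem
labels = ["low"] * 50 + ["medium-low"] * 30 + ["medium-high"] * 15 + ["high"] * 5
counts = Counter(labels)

needed = {}
for label, n_pos in counts.items():
    # One-vs-rest: the current class against all the other observations
    n_rest = len(labels) - n_pos
    n_min, n_maj = min(n_pos, n_rest), max(n_pos, n_rest)
    needed[label] = samples_to_add(n_min, n_maj, balance_ratio=0.5)

print(needed)  # {'low': 0, 'medium-low': 5, 'medium-high': 28, 'high': 43}
```

Under this reading, a small ratio only affects the rarest classes, while a ratio of 1 oversamples every one-vs-rest problem until both sides are the same size.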
The following table describes the main methods of the
SLXClassifier class.
| Method | Description |
|---|---|
| fit | The parameters of the fit method are the same as for most of the regression algorithms, except for the column_ids parameter, which specifies the columns that are used to compute the spatial lag. The algorithm estimates the parameters of the explanatory variables plus the parameters associated with the spatial lag columns. |
| predict | The predict method calculates the spatial lag of the dataset using the same columns defined in the fit process and returns the category with the highest probability according to the logistic regression. By setting the |
| fit_predict | Calls the fit and predict methods sequentially with the training data. |
| score | Returns the accuracy for the given data. By setting use_fit_lag=True, the algorithm calculates the spatial lag from the training set. Otherwise, it computes the spatial lag from the provided data. |
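The final step of predict, returning the category with the highest probability, can be pictured with the short sketch below. The probability matrix is made up for illustration and stands in for the outputs of the per-class logistic models; it is not produced by SLXClassifier.

```python
import numpy as np

classes = np.array(["low", "medium-low", "medium-high", "high"])

# Hypothetical one-vs-rest probabilities: one row per observation,
# one column per class (rows need not sum to 1 in one-vs-rest)
proba = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.55, 0.30, 0.05],
    [0.05, 0.15, 0.30, 0.60],
])

# Pick, for every observation, the class whose model is most confident
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)  # ['low' 'medium-low' 'high']
```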
See the SLXClassifier class in Python API Reference for Oracle Spatial AI for more information.
The following example uses the block_groups
SpatialDataFrame and performs the following steps:
- Creates a categorical variable, INCOME_LABEL, based on the MEDIAN_INCOME column, to use as the target variable.
- Creates an instance of SLXClassifier specifying the balance_method and balance_ratio parameters.
- Trains the model using a training set.
- Prints the predictions from the model and the model's accuracy using the test set.
import numpy as np
from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import KNNWeightsDefinition
from oraclesai.classification import SLXClassifier
from oraclesai.pipeline import SpatialPipeline
from sklearn.preprocessing import StandardScaler
# Define the categories for the target variable
labels=["low", "medium-low", "medium-high", "high"]
# The target variable comes from the column MEDIAN_INCOME
income_array = block_groups['MEDIAN_INCOME'].values
# Define constants to create the target variable
min_income = np.min(income_array)
max_income = np.max(income_array)
delta = (max_income - min_income) / 4
# Define a function that returns a category based on the median income
def get_label(income):
    if income <= min_income + delta:
        return "low"
    elif min_income + delta < income <= min_income + 2 * delta:
        return "medium-low"
    elif min_income + 2 * delta < income <= min_income + 3 * delta:
        return "medium-high"
    return "high"
# Create a new SpatialDataFrame with the target variable "INCOME_LABEL"
block_groups_extended = block_groups.add_column("INCOME_LABEL", [get_label(income) for income in income_array])
# Define the target and explanatory variables
X = block_groups_extended[['INCOME_LABEL', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'geometry']]
# Split the data into training and test sets
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="INCOME_LABEL", test_size=0.2, random_state=32)
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=20)
# Create the instance of SLXClassifier
slx_classifier = SLXClassifier(spatial_weights_definition=weights_definition,
                               random_state=15,
                               balance_method="smote",
                               balance_ratio=0.05)
# Add the model to a spatial pipeline along with a pre-processing step
classifier_pipeline = SpatialPipeline([('scale', StandardScaler()), ('classifier', slx_classifier)])
# Train the model specifying the target variable and the parameter column_ids
classifier_pipeline.fit(X_train, "INCOME_LABEL", classifier__column_ids=["MEAN_AGE", "HOUSE_VALUE"])
# Print the predictions with the test set
slx_predictions_test = classifier_pipeline.predict(X_test.drop("INCOME_LABEL")).flatten()
print(f"\n>> predictions (X_test):\n {slx_predictions_test[:10]}")
# Print the accuracy with the test set
slx_accuracy_test = classifier_pipeline.score(X_test, "INCOME_LABEL")
print(f"\n>> accuracy (X_test):\n {slx_accuracy_test}")
The output consists of the predictions of the first ten observations and the model's accuracy using the test set.
>> predictions (X_test):
['medium-low' 'medium-low' 'low' 'low' 'high' 'low' 'medium-low' 'low'
'low' 'low']
>> accuracy (X_test):
0.7438136826783115
The summary property displays the statistics of the trained model, or of each
model in the multi-class case, along with the mean value of the estimated
parameters.
Multi-Class Logistic Model Results
---------------------------------------------------------------------------
label deviance llf aic bic D2 adj_D2
high 342.409525 -171.204763 356.409525 -21710.554931 0.553921 0.552958
low 1855.672117 -927.836058 1869.672117 -20197.292339 0.498926 0.497844
medium-low 2506.593561 -1253.296780 2520.593561 -19546.370895 0.249868 0.248249
medium-high 840.588033 -420.294016 854.588033 -21212.376424 0.357128 0.355741
Parameters (Average Results)
Variable Est. STD Min Median Max
------------------------------- ---------- ---------- ---------- ---------- ----------
constant -3.740 4.007 -9.350 -3.432 1.256
MEAN_AGE -0.043 0.275 -0.345 -0.062 0.296
MEAN_EDUCATION_LEVEL 0.960 1.229 -1.077 1.472 1.974
HOUSE_VALUE -0.037 0.867 -1.175 -0.047 1.119
INTERNET 0.652 1.369 -1.435 0.824 2.394
SLX-MEAN_AGE 0.018 0.016 -0.006 0.022 0.035
SLX-HOUSE_VALUE 0.001 0.026 -0.017 -0.012 0.047
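As a quick consistency check on the summary, the printed llf and aic columns appear to follow from the deviance as llf = -deviance / 2 and aic = deviance + 2k, with k = 7 estimated parameters (the constant, four explanatory variables, and two spatial lag columns). This relationship is inferred from the numbers shown above and is not stated by the library documentation; the sketch below simply recomputes both columns from the deviance values.

```python
# Deviance values copied from the summary above
deviance = {
    "high": 342.409525,
    "low": 1855.672117,
    "medium-low": 2506.593561,
    "medium-high": 840.588033,
}

# Constant + 4 explanatory variables + 2 spatial lag columns
n_params = 7

stats = {}
for label, dev in deviance.items():
    llf = -dev / 2              # log-likelihood implied by the deviance
    aic = dev + 2 * n_params    # equivalently -2 * llf + 2 * n_params
    stats[label] = (llf, aic)
    print(f"{label:12s} llf={llf:.6f} aic={aic:.6f}")
```

The recomputed values match the llf and aic columns of the summary to the printed precision.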