SLX Classifier

The Spatial Cross-Regressive (SLX) classification algorithm runs a logistic regression preceded by a feature engineering step that adds features providing spatial context to the data.

The algorithm adds one or more columns containing the spatial lag of selected features. The spatial lag of a feature is the average of that feature's values over the neighboring observations.
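As an illustration of what a spatial lag column contains, the following minimal sketch averages a feature over each observation's k nearest neighbors using only NumPy. The function name `knn_spatial_lag` and the toy data are hypothetical and are not part of the oraclesai library, which may weight or select neighbors differently.

```python
import numpy as np

def knn_spatial_lag(coords, values, k):
    """Average `values` over each observation's k nearest neighbors (excluding itself).

    Hypothetical helper for illustration only; not the oraclesai implementation.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise Euclidean distances between all observations
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # An observation is not its own neighbor
    np.fill_diagonal(dist, np.inf)
    # Indices of the k closest observations for each row
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return values[neighbors].mean(axis=1)

# Toy data: four locations and one feature
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
house_value = np.array([100.0, 200.0, 300.0, 400.0])

lag = knn_spatial_lag(coords, house_value, k=2)
print(lag)  # [250. 200. 150. 250.]
```

In this sketch, the first observation's lag is 250 because its two nearest neighbors have values 200 and 300.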

Using the SLXClassifier class requires defining the spatial weights with the spatial_weights_definition parameter. The spatial weights establish how neighboring observations interact.

For a multi-class classification problem, the algorithm uses the one-vs-rest strategy, training one binary model for each class. One common issue with this strategy is that each binary problem can be imbalanced, with far more observations in one class than in the other. To handle this scenario, the SLXClassifier class provides the following two oversampling methods:

Random. This method creates duplicates of random samples (with replacement) from the minority class.
Synthetic Minority Oversampling Technique (SMOTE). This algorithm selects a random sample of the minority class, a, and from its k nearest neighbors, it selects a random neighbor, b. The vector ab is multiplied by a random number in the range [0, 1], and the result is added to sample a, generating a new synthetic instance. See [5] for more information on SMOTE.
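The two oversampling methods described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code used by SLXClassifier: the function names `random_oversample` and `smote_oversample` are hypothetical, neighbors are found with a brute-force distance matrix, and the random factor is drawn from the half-open interval [0, 1).

```python
import numpy as np

def random_oversample(minority, n_new, rng):
    """Duplicate randomly chosen minority rows (sampling with replacement)."""
    idx = rng.integers(0, len(minority), size=n_new)
    return minority[idx]

def smote_oversample(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating toward random near neighbors."""
    # Brute-force k nearest neighbors within the minority class
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        a_idx = rng.integers(0, len(minority))         # random sample a
        b_idx = neighbors[a_idx, rng.integers(0, k)]   # random neighbor b of a
        a, b = minority[a_idx], minority[b_idx]
        gap = rng.random()                             # random factor in [0, 1)
        synthetic[i] = a + gap * (b - a)               # point on the segment ab
    return synthetic

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

duplicates = random_oversample(minority, n_new=5, rng=rng)
synthetic = smote_oversample(minority, n_new=6, k=2, rng=rng)
print(duplicates.shape, synthetic.shape)  # (5, 2) (6, 2)
```

Because every synthetic row lies on a segment between two existing minority samples, SMOTE adds variety without leaving the region the minority class already occupies, whereas random oversampling only repeats existing rows.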

The following parameters specify the oversampling method and the number of new samples:

Parameter	Description
`balance_method`	The oversampling method. The default value is `None`; the other options are `random` and `smote`.
`balance_ratio`	A number in the range `[0, 1]` representing the desired ratio of observations from the minority class. A value of `1` will result in the same number of observations for both classes.
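To make the effect of `balance_ratio` concrete, the following sketch computes how many minority samples would need to be added for a given ratio. It assumes the ratio is interpreted as the number of minority observations divided by the number of majority observations after oversampling, which is one plausible reading of the description above; the helper `samples_to_add` is hypothetical and the library's exact rounding may differ.

```python
import math

def samples_to_add(n_minority, n_majority, balance_ratio):
    """Number of new minority samples needed to reach the desired ratio.

    Assumes balance_ratio = minority count / majority count after oversampling.
    Hypothetical helper for illustration only.
    """
    target = math.ceil(balance_ratio * n_majority)
    # Never remove samples: if the ratio is already met, add nothing
    return max(0, target - n_minority)

print(samples_to_add(50, 1000, 1.0))    # 950 -> both classes end with 1000 rows
print(samples_to_add(50, 1000, 0.5))    # 450 -> minority ends with 500 rows
print(samples_to_add(300, 1000, 0.25))  # 0   -> ratio already satisfied
```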

The following table describes the main methods of the SLXClassifier class.

Method	Description
`fit`	The parameters of the `fit` method are the same as for most of the regression algorithms, except for the `column_ids` parameter, which specifies the columns used to compute the spatial lag. The algorithm estimates the parameters of the explanatory variables plus the parameters of the spatial lag features that were added.
`predict`	The `predict` method calculates the spatial lag of the dataset using the same columns defined in the fit process and returns the category with the highest probability according to the logistic regression. When `use_fit_lag=True`, the algorithm calculates the spatial lag from the training set instead. This is helpful when the prediction dataset contains few observations.
`fit_predict`	Calls the `fit` and `predict` methods sequentially with the training data.
`score`	Returns the accuracy for the given data. When `use_fit_lag=True`, the algorithm calculates the spatial lag from the training set. Otherwise, it computes the spatial lag from the provided data.

See the SLXClassifier class in Python API Reference for Oracle Spatial AI for more information.

The following example uses the block_groups SpatialDataFrame and performs the following steps:

Creates a categorical variable, INCOME_LABEL, based on the MEDIAN_INCOME column, to use as the target variable.
Creates an instance of SLXClassifier specifying the balance_method and balance_ratio parameters.
Trains the model using a training set.
Prints the predictions from the model and the model's accuracy using the test set.

import numpy as np 
from oraclesai.preprocessing import spatial_train_test_split 
from oraclesai.weights import KNNWeightsDefinition 
from oraclesai.classification import SLXClassifier 
from oraclesai.pipeline import SpatialPipeline 
from sklearn.preprocessing import StandardScaler 

# Define the categories for the target variable 
labels=["low", "medium-low", "medium-high", "high"] 

# The target variable comes from the column MEDIAN_INCOME 
income_array = block_groups['MEDIAN_INCOME'].values 

# Define constants to create the target variable 
min_income = np.min(income_array) 
max_income = np.max(income_array) 
delta = (max_income - min_income) / 4 

# Define a function that returns a category based on the median income 
def get_label(income): 
    if income <= min_income + delta: 
        return "low" 
    elif min_income + delta < income <= min_income + 2 * delta: 
        return "medium-low" 
    elif min_income + 2 * delta < income <= min_income + 3 * delta: 
        return "medium-high" 
    return "high" 

# Create a new SpatialDataFrame with the target variable "INCOME_LABEL" 
block_groups_extended = block_groups.add_column("INCOME_LABEL", [get_label(income) for income in income_array]) 

# Define the target and explanatory variables 
X = block_groups_extended[['INCOME_LABEL', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'geometry']] 

# Split the data into training and test sets 
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="INCOME_LABEL", test_size=0.2, random_state=32) 

# Define the spatial weights 
weights_definition = KNNWeightsDefinition(k=20) 

# Create the instance of SLXClassifier 
slx_classifier = SLXClassifier(spatial_weights_definition=weights_definition,
                               random_state=15,
                               balance_method="smote",
                               balance_ratio=0.05)

# Add the model to a spatial pipeline along with a pre-processing step 
classifier_pipeline = SpatialPipeline([('scale', StandardScaler()), ('classifier', slx_classifier)]) 

# Train the model specifying the target variable and the parameter column_ids 
classifier_pipeline.fit(X_train, "INCOME_LABEL", classifier__column_ids=["MEAN_AGE", "HOUSE_VALUE"]) 

# Print the predictions with the test set 
slx_predictions_test = classifier_pipeline.predict(X_test.drop("INCOME_LABEL")).flatten() 
print(f"\n>> predictions (X_test):\n {slx_predictions_test[:10]}") 

# Print the accuracy with the test set 
slx_accuracy_test = classifier_pipeline.score(X_test, "INCOME_LABEL") 
print(f"\n>> accuracy (X_test):\n {slx_accuracy_test}")

The output consists of the predictions of the first ten observations and the model's accuracy using the test set.

>> predictions (X_test):
 ['medium-low' 'medium-low' 'low' 'low' 'high' 'low' 'medium-low' 'low'
 'low' 'low']

>> accuracy (X_test):
 0.7438136826783115

The summary property displays the statistics of the trained model, or of each per-class model in the multi-class case, along with the mean value of the estimated parameters.

Multi-Class Logistic Model Results
---------------------------------------------------------------------------
      label    deviance          llf         aic           bic       D2   adj_D2
       high  342.409525  -171.204763  356.409525 -21710.554931 0.553921 0.552958
        low 1855.672117  -927.836058 1869.672117 -20197.292339 0.498926 0.497844
 medium-low 2506.593561 -1253.296780 2520.593561 -19546.370895 0.249868 0.248249
medium-high  840.588033  -420.294016  854.588033 -21212.376424 0.357128 0.355741

Parameters (Average Results)
Variable                              Est.        STD        Min     Median        Max
------------------------------- ---------- ---------- ---------- ---------- ----------
constant                            -3.740      4.007     -9.350     -3.432      1.256
MEAN_AGE                            -0.043      0.275     -0.345     -0.062      0.296
MEAN_EDUCATION_LEVEL                 0.960      1.229     -1.077      1.472      1.974
HOUSE_VALUE                         -0.037      0.867     -1.175     -0.047      1.119
INTERNET                             0.652      1.369     -1.435      0.824      2.394
SLX-MEAN_AGE                         0.018      0.016     -0.006      0.022      0.035
SLX-HOUSE_VALUE                      0.001      0.026     -0.017     -0.012      0.047