Spatial Cross-Regressive Model

The Spatial Cross-Regressive (SLX) regression model runs a regular linear regression after a feature engineering step that adds features providing spatial context to the data.

This follows Tobler's first law of geography: near things are more related than distant things. The algorithm adds one or more columns with the spatial lag of selected features, where each lagged value represents the average of the neighboring observations.
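To illustrate the spatial lag itself (a minimal NumPy sketch, not the library's implementation), the lag of a feature is the row-standardized weights matrix applied to that feature, that is, the average of each observation's neighbors:

```python
import numpy as np

# Hypothetical 4-observation example: a binary adjacency matrix marks
# which pairs of observations are neighbors (1) or not (0).
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-standardize so each row sums to 1; the spatial lag then becomes
# the average of the neighbors' values.
W = adjacency / adjacency.sum(axis=1, keepdims=True)

# Feature values for the 4 observations (e.g., house values).
x = np.array([100.0, 200.0, 300.0, 400.0])

# Spatial lag: each entry is the mean of that observation's neighbors,
# i.e., 250, 200, 233.33..., 300.
spatial_lag = W @ x
print(spatial_lag)
```

The lagged column is what gets appended to the dataset before the regression runs.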

The SLXRegressor class requires you to define the spatial weights through the spatial_weights_definition parameter, which establishes how neighboring observations interact. The following table describes the main methods of the SLXRegressor class.

Method        Description

fit           The parameters of the fit method are the same as for most of the
              regression algorithms, except for the column_ids parameter, which
              specifies the columns used to compute the spatial lag.

              The algorithm estimates the parameters of the explanatory variables
              plus the parameters associated with the columns added through the
              spatial lag.

predict       The predict method calculates the spatial lag of the dataset using
              the same columns defined in the fit process and returns the value
              of the OLS equation evaluated on the extended dataset.

              By setting the use_fit_lag=True parameter, the algorithm calculates
              the spatial lag from the training set. This is helpful when the
              prediction dataset contains few observations.

fit_predict   Calls the fit and predict methods sequentially with the training
              data.

score         Returns the R-squared statistic for the given data.

              By setting the use_fit_lag=True parameter, the algorithm calculates
              the spatial lag from the training set. Otherwise, it computes the
              spatial lag from the provided data.
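Conceptually, fit appends the spatial lag of the selected columns to the design matrix and estimates an ordinary least squares model on the result. The following NumPy sketch illustrates that idea with synthetic data and a simple chain neighbor structure (both are assumptions for illustration, not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic data: two explanatory variables for n observations.
X = rng.normal(size=(n, 2))

# Hypothetical neighbor structure: a chain where each observation neighbors
# the previous and next one; row-standardize so each row of W averages
# over the neighbors.
A = np.eye(n, k=1) + np.eye(n, k=-1)
W = A / A.sum(axis=1, keepdims=True)

# Spatial lag of the selected columns (here: both columns of X).
WX = W @ X

# Generate a target that depends on X and on the neighbors' averages.
beta = np.array([2.0, -1.0])    # coefficients of the explanatory variables
gamma = np.array([0.5, 0.3])    # coefficients of their spatial lags
y = 1.0 + X @ beta + WX @ gamma + rng.normal(scale=0.1, size=n)

# Fitting the SLX model amounts to OLS on the extended design matrix
# [1, X, WX]; the estimates recover both sets of coefficients.
design = np.column_stack([np.ones(n), X, WX])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs.round(2))  # intercept, then the beta estimates, then the gamma estimates
```

Prediction follows the same pattern: compute the spatial lag of the new data (or of the training data when use_fit_lag=True), extend the dataset, and evaluate the fitted OLS equation.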

See the SLXRegressor class in Python API Reference for Oracle Spatial AI for more information.

The following example uses the block_groups SpatialDataFrame and the SLXRegressor class to train an SLX regression model with training data (X_train) using the MEDIAN_INCOME column as the target variable. The MEAN_AGE, MEAN_EDUCATION_LEVEL, and HOUSE_VALUE columns are used to calculate the spatial lag.

Using the test set (X_test), the code calls the predict and score methods to estimate the values of the target variable and the R-squared metric, respectively.

from oraclesai.preprocessing import spatial_train_test_split 
from oraclesai.weights import KNNWeightsDefinition 
from oraclesai.regression import SLXRegressor 
from oraclesai.pipeline import SpatialPipeline 
from sklearn.preprocessing import StandardScaler 

# Define the explanatory variables 
X = block_groups[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'geometry']] 

# Define the training and test sets 
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.2, random_state=32) 

# Define the spatial weights 
weights_definition = KNNWeightsDefinition(k=10) 

# Create an SLX regressor model 
slx_model = SLXRegressor(spatial_weights_definition=weights_definition) 

# Add the model to a pipeline along with a preprocessing step
slx_pipeline = SpatialPipeline([('scale', StandardScaler()), ('slx_regression', slx_model)]) 

# Train the model 
slx_pipeline.fit(X_train, "MEDIAN_INCOME", slx_regression__column_ids=["MEAN_AGE", "MEAN_EDUCATION_LEVEL", "HOUSE_VALUE"]) 

# Print the predictions with the test set 
slx_predictions_test = slx_pipeline.predict(X_test.drop(["MEDIAN_INCOME"])).flatten()
print(f"\n>> predictions (X_test):\n {slx_predictions_test[:10]}") 

# Print the score with the test set 
slx_r2_score = slx_pipeline.score(X_test, y="MEDIAN_INCOME") 
print(f"\n>> r2_score (X_test):\n {slx_r2_score}")

The program produces the following output:

>> predictions (X_test):
 [102070.14467552 103393.34495125  18080.13247972  28780.88885959
 166553.11466239  47847.19216301  97311.05264284  28621.06664768
  86030.99787827  18315.17778001]

>> r2_score (X_test):
 0.6520502048458249

Note that printing the summary property of the trained model displays new parameters associated with the spatial lag of the columns specified during training.

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:        2750
Mean dependent var  :  69703.4815                Number of Variables   :           8
S.D. dependent var  :  39838.5789                Degrees of Freedom    :        2742
R-squared           :      0.6404
Adjusted R-squared  :      0.6395
Sum squared residual:1569034862694.453                F-statistic           :    697.5148
Sigma-square        :572222779.976                Prob(F-statistic)     :           0
S.E. of regression  :   23921.178                Log likelihood        :  -31625.004
Sigma-square ML     :570558131.889                Akaike info criterion :   63266.007
S.E of regression ML:  23886.3587                Schwarz criterion     :   63313.362

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT    69454.0719691     458.7116277     151.4111868       0.0000000
            MEAN_AGE    3407.3842392     632.8239483       5.3844110       0.0000001
MEAN_EDUCATION_LEVEL    11619.0976034    1254.9099676       9.2589093       0.0000000
         HOUSE_VALUE    20550.0723247     970.0583796      21.1843666       0.0000000
            INTERNET    10089.1251192     670.1690078      15.0545982       0.0000000
        SLX-MEAN_AGE     106.5803082     136.9729582       0.7781120       0.4365701
SLX-MEAN_EDUCATION_LEVEL    -995.5040769     172.6431756      -5.7662521       0.0000000
     SLX-HOUSE_VALUE       3.1809763     136.4013684       0.0233207       0.9813962
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER            9.435

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1258.500           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                7        1083.843           0.0000
Koenker-Bassett test              7         436.439           0.0000

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Moran's I (error)              0.2586        31.945           0.0000
Lagrange Multiplier (lag)         1        1044.952           0.0000
Robust LM (lag)                   1          55.266           0.0000
Lagrange Multiplier (error)       1         997.181           0.0000
Robust LM (error)                 1           7.495           0.0062
Lagrange Multiplier (SARMA)       2        1052.447           0.0000

================================ END OF REPORT =====================================