Spatial Regimes

In the spatial regimes algorithm, the parameters of the regression equation are estimated separately for each value of a categorical variable called the regime.

This categorical variable can represent different things, such as a region in a spatial context. Neighborhoods, such as district or block names, can be used to define regimes. The model reflects spatial heterogeneity across regions, with different regions having their own regression models.

The SpatialRegimesRegressor class consists of linear regression models where the terms of the linear equation vary depending on the regime. The following table describes the main methods of the SpatialRegimesRegressor class.

Method       Description
-----------  -----------
fit          Trains the model. The regimes parameter specifies the categorical variable used as the regime. An OLS regression is run for each regime, yielding a separate set of parameters per regime.
predict      Predicts new values using the parameters associated with the regime of each observation in the prediction data.
fit_predict  Calls the fit and predict methods sequentially with the training data.
score        Returns the R-squared statistic for the given data. For each observation, it uses the model associated with the corresponding regime.
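
The per-regime behavior described in the table can be illustrated with a minimal NumPy sketch. It is not the SpatialRegimesRegressor implementation: the function names fit_regimes, predict_regimes, and r2_score are hypothetical, and the data is made up so that each regime follows its own exact line.

```python
import numpy as np

def fit_regimes(X, y, regimes):
    """Fit one OLS model (with intercept) per regime; return {regime: coefficients}."""
    X, y, regimes = np.asarray(X, float), np.asarray(y, float), np.asarray(regimes)
    params = {}
    for r in np.unique(regimes).tolist():
        mask = regimes == r
        design = np.column_stack([np.ones(mask.sum()), X[mask]])
        params[r] = np.linalg.lstsq(design, y[mask], rcond=None)[0]
    return params

def predict_regimes(params, X, regimes):
    """Predict each observation with the coefficients of its own regime."""
    X = np.asarray(X, float)
    design = np.column_stack([np.ones(len(X)), X])
    labels = np.asarray(regimes).tolist()
    return np.array([design[i] @ params[r] for i, r in enumerate(labels)])

def r2_score(y_true, y_pred):
    """R-squared computed over all observations, whatever their regime."""
    y_true = np.asarray(y_true, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up data: regime 0 follows y = 1 + 2x, regime 1 follows y = 10 - x
X = [[0], [1], [2], [0], [1], [2]]
y = [1, 3, 5, 10, 9, 8]
regimes = [0, 0, 0, 1, 1, 1]

params = fit_regimes(X, y, regimes)
preds = predict_regimes(params, [[3], [3]], [0, 1])
r2 = r2_score(y, predict_regimes(params, X, regimes))
print(params, preds, r2)
```

The same input value (x = 3) is predicted with a different intercept and slope depending on its regime, which is the heterogeneity the model is meant to capture.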

Although the SpatialRegimesRegressor class does not use spatial weights during training, it uses the spatial_weights_definition parameter to compute spatial diagnostics.

See the SpatialRegimesRegressor class in Python API Reference for Oracle Spatial AI for more information.

The following example uses the block_groups SpatialDataFrame and the SpatialRegimesRegressor class. Before running the regression task, the example must define a categorical variable to serve as the regime. The following functions split the geographical area of a SpatialDataFrame into a grid with a given number of rows and columns, where each grid cell is identified by an integer that serves as the categorical variable.

import bisect

def get_cell_id(array_x, array_y, point, ncols):
    # Locate the grid column and row whose boundaries contain the point
    point_x, point_y = point.x, point.y
    grid_x = bisect.bisect_left(array_x, point_x) - 1
    grid_y = bisect.bisect_left(array_y, point_y) - 1

    # Number the cells row by row, starting at 0
    return grid_y * ncols + grid_x

def create_grid(pdf_data, grid_column, nrows=2, ncols=2):
    # Bounding box of all geometries and the centroid of each geometry
    min_x, min_y, max_x, max_y = pdf_data.total_bounds
    geometries = pdf_data["geometry"].values
    centroids = [geom.centroid for geom in geometries]

    # Width and height of a grid cell
    step_x = (max_x - min_x) / ncols
    step_y = (max_y - min_y) / nrows

    # Cell boundaries along each axis
    split_x = [min_x + step_x * i for i in range(ncols + 1)]
    split_y = [min_y + step_y * i for i in range(nrows + 1)]

    # Assign each geometry to the cell that contains its centroid
    column_values = [get_cell_id(split_x, split_y, centroid, ncols) for centroid in centroids]

    # Return a SpatialDataFrame with the cell IDs stored in grid_column
    return pdf_data.add_column(grid_column, column_values)
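
To make the cell numbering concrete, the following standalone sketch applies the same bisect-based lookup to a 2 x 2 grid over a made-up square. The Point tuple is a hypothetical stand-in for a geometry centroid, and get_cell_id is repeated so the snippet runs on its own.

```python
import bisect
from collections import namedtuple

# Hypothetical stand-in for a centroid; only the x and y attributes are needed
Point = namedtuple("Point", ["x", "y"])

def get_cell_id(array_x, array_y, point, ncols):
    grid_x = bisect.bisect_left(array_x, point.x) - 1
    grid_y = bisect.bisect_left(array_y, point.y) - 1
    return grid_y * ncols + grid_x

# Cell boundaries of a 2 x 2 grid over the square [0, 10] x [0, 10]
split_x = [0.0, 5.0, 10.0]
split_y = [0.0, 5.0, 10.0]

sample_points = [(2, 2), (7, 2), (2, 7), (7, 7)]
cells = [get_cell_id(split_x, split_y, Point(x, y), ncols=2) for x, y in sample_points]
print(cells)  # [0, 1, 2, 3]

# A point on an interior boundary falls in the lower-numbered cell
on_boundary = get_cell_id(split_x, split_y, Point(5.0, 2.0), ncols=2)

# A point exactly on the minimum bound gets column index -1
on_min_bound = get_cell_id(split_x, split_y, Point(0.0, 2.0), ncols=2)
print(on_boundary, on_min_bound)  # 0 -1
```

Cells are numbered row by row from the lower-left corner. The last case shows that a centroid lying exactly on the minimum x or y bound is assigned an index of -1 along that axis; this could occur with point geometries, so clamping the index to 0 may be worth considering for such data.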

Using the preceding functions, the following code:

  1. Creates another instance of SpatialDataFrame with a categorical variable, GRID_ID, representing the grid cells that will serve as the regimes.
  2. Stores the regimes into a separate variable and removes the categorical variable from the dataset.
  3. Trains the SpatialRegimesRegressor model with the training set (X_train) by calling the fit method, passing the regime values through the regimes parameter, and using the MEDIAN_INCOME column as the target variable.
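  4. Calls the predict and score methods with the test set (X_test) to estimate the target variable and obtain the R-squared metric.

# spatial_train_test_split is called below; the import path is assumed to be oraclesai.preprocessing
from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import KNNWeightsDefinition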
from oraclesai.regression import SpatialRegimesRegressor 
from oraclesai.pipeline import SpatialPipeline 
from sklearn.preprocessing import StandardScaler 

# Create a categorical variable by splitting the geographic region in a grid 
block_groups_grid = create_grid(block_groups, "GRID_ID", nrows=3, ncols=3) 

# Select the target variable, the explanatory variables, the regime column, and the geometry
X = block_groups_grid[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'GRID_ID', 'geometry']] 

# Define the training and test sets 
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.2, random_state=32) 

# Get the regime values 
regimes_train = X_train["GRID_ID"].values.tolist() 
regimes_test = X_test["GRID_ID"].values.tolist()

# Discard the categorical variable 
X_train = X_train.drop("GRID_ID") 
X_test = X_test.drop("GRID_ID") 

# Define the spatial weights 
weights_definition = KNNWeightsDefinition(k=10) 

# Create a Spatial Regimes Regressor model 
spatial_regimes_model = SpatialRegimesRegressor(spatial_weights_definition=weights_definition) 

# Add the model to a spatial pipeline along with a preprocessing step 
spatial_regimes_pipeline = SpatialPipeline([('scale', StandardScaler()), ('spatial_regimes', spatial_regimes_model)]) 

# Train the model using "MEDIAN_INCOME" as the target variable and specifying the regime values 
spatial_regimes_pipeline.fit(X_train, "MEDIAN_INCOME", spatial_regimes__regimes=regimes_train) 

# Print the predictions with the test set
spatial_regimes_predictions_test = spatial_regimes_pipeline.predict(X_test.drop(["MEDIAN_INCOME"]), spatial_regimes__regimes=regimes_test).flatten() 
print(f"\n>> predictions (X_test):\n {spatial_regimes_predictions_test[:10]}") 

# Print the score with the test set 
spatial_regimes_r2_score = spatial_regimes_pipeline.score(X_test, y="MEDIAN_INCOME", spatial_regimes__regimes=regimes_test) 
print(f"\n>> r2_score (X_test):\n {spatial_regimes_r2_score}")

The output of this program is as follows:

>> predictions (X_test):
 [ 99973.28903064 119316.0422925   21627.0522275   26862.24033126
 176529.76909922  55563.36270093 115297.87445691  33401.15374394
  63827.11873494  26992.92679579]

>> r2_score (X_test):
 0.67377148094271

Because the spatial_weights_definition parameter was set when creating the SpatialRegimesRegressor instance, the summary property of the trained model includes spatial statistics. The report lists a separate set of parameters for each regime (each variable name is prefixed with the regime value), along with spatial statistics such as Moran’s I and Lagrange multiplier tests for spatial dependence.

REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES - REGIMES
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:        2750
Mean dependent var  :  69703.4815                Number of Variables   :          40
S.D. dependent var  :  39838.5789                Degrees of Freedom    :        2710
R-squared           :      0.6974
Adjusted R-squared  :      0.6930
Sum squared residual:1320270117156.439                F-statistic           :    160.1405
Sigma-square        :487184545.076                Prob(F-statistic)     :           0
S.E. of regression  :   22072.257                Log likelihood        :  -31387.645
Sigma-square ML     :480098224.421                Akaike info criterion :   62855.290
S.E of regression ML:  21911.1438                Schwarz criterion     :   63092.065

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
          1_CONSTANT    67301.4371567    1953.6056568      34.4498578       0.0000000
          1_MEAN_AGE    -787.8377162    1485.8378441      -0.5302313       0.5959950
1_MEAN_EDUCATION_LEVEL    19399.3182180    3114.5763711       6.2285576       0.0000000
       1_HOUSE_VALUE    18607.2342406    1584.1781459      11.7456703       0.0000000
          1_INTERNET    13025.7000079    2370.5082392       5.4948976       0.0000000
          2_CONSTANT    70316.2663016    3128.3635757      22.4770122       0.0000000
          2_MEAN_AGE    4475.1151552    1602.7038604       2.7922283       0.0052714
2_MEAN_EDUCATION_LEVEL    6155.3917348    2436.3442043       2.5264869       0.0115775
       2_HOUSE_VALUE    8287.3366860    4847.3374558       1.7096678       0.0874418
          2_INTERNET    9610.2177802    1714.0106903       5.6068599       0.0000000
          3_CONSTANT    24528.5879950    5872.7675236       4.1766659       0.0000305
          3_MEAN_AGE    4605.8239137    1904.1647555       2.4188159       0.0156366
3_MEAN_EDUCATION_LEVEL    22124.7054269    5152.1353075       4.2942788       0.0000181
       3_HOUSE_VALUE    22528.7956619    1505.5002005      14.9643259       0.0000000
          3_INTERNET    22442.8115822    3672.8299785       6.1104956       0.0000000
          4_CONSTANT    60346.7138163    1011.7946534      59.6432424       0.0000000
          4_MEAN_AGE    2025.4934828    1131.5366834       1.7900378       0.0735594
4_MEAN_EDUCATION_LEVEL    12613.8139792    1879.7592801       6.7103347       0.0000000
       4_HOUSE_VALUE    15802.2959953    1094.1149414      14.4429944       0.0000000
          4_INTERNET    7544.7984901    1423.9963625       5.2983271       0.0000001
          5_CONSTANT    60570.6305539    1375.4910298      44.0356420       0.0000000
          5_MEAN_AGE    4004.8956000    1338.2798927       2.9925695       0.0027914
5_MEAN_EDUCATION_LEVEL    7093.5634835    1762.1713354       4.0254675       0.0000584
       5_HOUSE_VALUE    4973.1688262    2760.5550262       1.8015105       0.0717336
          5_INTERNET    5212.2336124    1092.1003496       4.7726691       0.0000019
          6_CONSTANT    74193.6261803    1593.8110537      46.5510802       0.0000000
          6_MEAN_AGE    8804.9736797    1830.8258733       4.8092906       0.0000016
6_MEAN_EDUCATION_LEVEL    -1282.6669985    2732.9823394      -0.4693287       0.6388725
       6_HOUSE_VALUE    24763.0330906    2724.0892923       9.0903896       0.0000000
          6_INTERNET    14378.1718270    2116.0137823       6.7949330       0.0000000
          7_CONSTANT    72053.1153887    1522.5496169      47.3239851       0.0000000
          7_MEAN_AGE    3957.0149819    1885.0370696       2.0991709       0.0358941
7_MEAN_EDUCATION_LEVEL    -1604.5557316    2759.9951560      -0.5813618       0.5610450
       7_HOUSE_VALUE    25077.4167626    3315.7621906       7.5630927       0.0000000
          7_INTERNET    11840.4394166    2062.1006321       5.7419309       0.0000000
          8_CONSTANT    58026.3709199    3699.6679150      15.6842107       0.0000000
          8_MEAN_AGE    4496.6200307    2673.0921045       1.6821792       0.0926493
8_MEAN_EDUCATION_LEVEL    17341.3083231    5737.4485722       3.0224773       0.0025306
       8_HOUSE_VALUE    35050.3546911    3390.1281391      10.3389469       0.0000000
          8_INTERNET    15125.8210946    3364.7884860       4.4953260       0.0000072
------------------------------------------------------------------------------------
Regimes variable: unknown

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           10.296

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1869.657           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               39        1548.245           0.0000
Koenker-Bassett test             39         544.999           0.0000

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Moran's I (error)              0.1497        19.689           0.0000
Lagrange Multiplier (lag)         1         174.856           0.0000
Robust LM (lag)                   1           1.572           0.2099
Lagrange Multiplier (error)       1         334.438           0.0000
Robust LM (error)                 1         161.155           0.0000
Lagrange Multiplier (SARMA)       2         336.010           0.0000


REGIMES DIAGNOSTICS - CHOW TEST
                 VARIABLE        DF        VALUE           PROB
                 CONSTANT         7         141.366           0.0000
              HOUSE_VALUE         7          74.722           0.0000
                 INTERNET         7          41.075           0.0000
                 MEAN_AGE         7          19.445           0.0069
     MEAN_EDUCATION_LEVEL         7          54.041           0.0000
              Global test        35         566.146           0.0000
================================ END OF REPORT =====================================
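
The Moran's I (error) entry in the report measures spatial autocorrelation in the residuals. As a rough illustration of the index itself (not of the inference reported in the VALUE and PROB columns, and not of the library's implementation), the following sketch computes Moran's I with a dense, row-standardized k-nearest-neighbor weights matrix. The functions knn_weights and morans_i and the coordinates are hypothetical.

```python
import numpy as np

def knn_weights(coords, k):
    """Dense row-standardized k-nearest-neighbor weights matrix (illustration only)."""
    coords = np.asarray(coords, float)
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an observation is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], neighbors] = 1.0 / k
    return W

def morans_i(values, W):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the mean-centered values."""
    z = np.asarray(values, float)
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Made-up locations forming two well-separated groups along a line
coords = [[0, 0], [1, 0], [2.5, 0], [10, 0], [11, 0], [12.5, 0]]
W = knn_weights(coords, k=1)

clustered = [1, 1, 1, -1, -1, -1]    # similar values sit next to each other
alternating = [1, -1, 1, -1, 1, -1]  # each value differs from its nearest neighbor

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
print(i_clustered, i_alternating)
```

Values near 1 indicate that neighboring observations have similar residuals, values near -1 indicate that neighbors tend to differ, and values near 0 suggest no spatial pattern. A positive, significant Moran's I on the residuals, as in the report above, suggests that spatial dependence remains after fitting the regimes.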