Spatial Regimes
In the spatial regimes algorithm, the regression equation parameters are estimated according to a categorical variable called regime.
This categorical variable can represent different things, such as a region in a spatial context. Neighborhoods, such as district or block names, can be used to define regimes. The model reflects spatial heterogeneity across regions, with different regions having their own regression models.
The SpatialRegimesRegressor class consists of linear regression models where the terms of the linear equation vary depending on the regime.
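The idea of one linear model per regime can be illustrated with a minimal NumPy sketch. This is not the SpatialRegimesRegressor implementation; the helper names fit_regimes and predict_regimes are hypothetical, and the sketch only assumes that each regime gets its own ordinary least squares fit with an intercept.

```python
import numpy as np


def fit_regimes(X, y, regimes):
    """Fit one OLS model (with intercept) per regime value (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    regimes = list(regimes)
    params = {}
    for regime in set(regimes):
        # Select the rows that belong to this regime
        mask = np.array([r == regime for r in regimes])
        # Prepend a column of ones so the first coefficient is the intercept
        design = np.column_stack([np.ones(mask.sum()), X[mask]])
        coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        params[regime] = coef
    return params


def predict_regimes(params, X, regimes):
    """Predict each row with the coefficients of that row's regime."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    return np.array([design[i] @ params[r] for i, r in enumerate(regimes)])


# Toy data: regime 0 follows y = 1 + 2x, regime 1 follows y = 10 - x
X = np.array([[0.0], [1.0], [2.0], [0.0], [1.0], [2.0]])
regimes = [0, 0, 0, 1, 1, 1]
y = np.array([1.0, 3.0, 5.0, 10.0, 9.0, 8.0])

params = fit_regimes(X, y, regimes)
predictions = predict_regimes(params, np.array([[3.0], [4.0]]), [0, 1])
print(predictions)  # approximately [7. 6.]
```

The same input value can produce different predictions depending on the regime, which is how the model captures spatial heterogeneity.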
The following table describes the main methods of the SpatialRegimesRegressor class.
Method | Description |
---|---|
fit | The regimes parameter indicates the categorical variable used as regime. An OLS regression is run for each regime, obtaining a different set of parameters for each regime. |
predict | To predict new values, the algorithm uses the parameters associated with the regimes of the prediction data. |
fit_predict | Calls the fit and predict methods sequentially with the training data. |
score | Returns the R-squared statistic for the given data. For each observation, it uses the model associated with the corresponding regime. |
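To make the score description concrete, the following sketch computes R-squared when every observation is predicted with the coefficients of its own regime. It assumes the usual definition of R-squared (one minus the ratio of the residual sum of squares to the total sum of squares); the function regimes_r2 and the coefficient dictionary are hypothetical and are not part of the library.

```python
import numpy as np


def regimes_r2(params, X, y, regimes):
    """R-squared where each row uses its own regime's coefficients.

    params maps a regime value to [intercept, slope_1, ..., slope_k].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = np.array([
        params[r][0] + X[i] @ np.asarray(params[r][1:], dtype=float)
        for i, r in enumerate(regimes)
    ])
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Hypothetical coefficients for two regimes
params = {"north": [1.0, 2.0], "south": [10.0, -1.0]}
X = np.array([[0.0], [1.0], [0.0], [1.0]])
regimes = ["north", "north", "south", "south"]
# The last observation is off by 1 from the "south" model
y = np.array([1.0, 3.0, 10.0, 8.0])

r2 = regimes_r2(params, X, y, regimes)
print(round(r2, 4))  # 0.9811
```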
Although the SpatialRegimesRegressor class does not consider spatial weights in the training process, it uses the spatial_weights_definition parameter to obtain spatial diagnostics.
See the SpatialRegimesRegressor class in Python API Reference for Oracle Spatial AI for more information.
The following example uses the block_groups SpatialDataFrame and the SpatialRegimesRegressor class. However, before executing the regression task, the example needs to define a categorical variable to use as the regime. The following functions split the geographical area of a SpatialDataFrame into a grid with a given number of rows and columns, where each grid cell is represented by an integer that serves as the categorical variable.
import bisect


def get_cell_id(array_x, array_y, point, ncols):
    # Locate the grid column and row whose interval contains the point
    point_x, point_y = point.x, point.y
    grid_x = bisect.bisect_left(array_x, point_x) - 1
    grid_y = bisect.bisect_left(array_y, point_y) - 1
    # Row-major cell identifier
    return grid_y * ncols + grid_x


def create_grid(pdf_data, grid_column, nrows=2, ncols=2):
    # Bounding box of all the geometries
    min_x, min_y, max_x, max_y = pdf_data.total_bounds
    geometries = pdf_data["geometry"].values
    centroids = [geom.centroid for geom in geometries]
    # Width and height of each grid cell
    step_x = (max_x - min_x) / ncols
    step_y = (max_y - min_y) / nrows
    # Cell boundaries along each axis
    split_x = [min_x + step_x * i for i in range(ncols + 1)]
    split_y = [min_y + step_y * i for i in range(nrows + 1)]
    # Assign each geometry to the cell that contains its centroid
    column_values = []
    for centroid in centroids:
        column_values.append(get_cell_id(split_x, split_y, centroid, ncols))
    return pdf_data.add_column(grid_column, column_values)
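The cell numbering used by get_cell_id can be checked on a small 2 x 2 grid without any spatial library. In the sketch below, Point is a hypothetical stand-in for a geometry centroid, and cell_id_clamped is an illustrative variant, not part of the original example, that keeps points lying exactly on the lower boundary inside the grid.

```python
import bisect
from collections import namedtuple

# Hypothetical stand-in for a centroid with x and y attributes
Point = namedtuple("Point", ["x", "y"])


def cell_id(split_x, split_y, point, ncols):
    """Row-major cell index, using the same arithmetic as get_cell_id."""
    grid_x = bisect.bisect_left(split_x, point.x) - 1
    grid_y = bisect.bisect_left(split_y, point.y) - 1
    return grid_y * ncols + grid_x


def cell_id_clamped(split_x, split_y, point, ncols):
    """Variant that clamps the row and column to the valid range."""
    nrows = len(split_y) - 1
    grid_x = min(max(bisect.bisect_left(split_x, point.x) - 1, 0), ncols - 1)
    grid_y = min(max(bisect.bisect_left(split_y, point.y) - 1, 0), nrows - 1)
    return grid_y * ncols + grid_x


# Boundaries of a 2 x 2 grid over the square [0, 2] x [0, 2]
split_x = [0.0, 1.0, 2.0]
split_y = [0.0, 1.0, 2.0]

points = [Point(0.5, 0.5), Point(1.5, 0.5), Point(0.5, 1.5), Point(1.5, 1.5)]
ids = [cell_id(split_x, split_y, p, ncols=2) for p in points]
print(ids)  # [0, 1, 2, 3]

# A point exactly on the minimum x boundary falls outside the grid
edge = Point(0.0, 0.5)
edge_id = cell_id(split_x, split_y, edge, ncols=2)
edge_id_clamped = cell_id_clamped(split_x, split_y, edge, ncols=2)
print(edge_id, edge_id_clamped)  # -1 0
```

Centroids rarely sit exactly on the bounding box, so the unclamped arithmetic is usually sufficient, but the clamped variant avoids a negative identifier in that edge case.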
Using the preceding functions, the following code:
- Creates another instance of SpatialDataFrame with a categorical variable, GRID_ID, representing the grid cells that will serve as the regimes.
- Stores the regimes into a separate variable and removes the categorical variable from the dataset.
- Trains the SpatialRegimesRegressor model with the training set (X_train) by calling the fit method, setting the regimes parameter, and using the MEDIAN_INCOME column as the target variable.
- Calls the predict and score methods using the test set (X_test), to estimate the target variable and obtain the R-squared metric.
from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import KNNWeightsDefinition
from oraclesai.regression import SpatialRegimesRegressor
from oraclesai.pipeline import SpatialPipeline
from sklearn.preprocessing import StandardScaler
# Create a categorical variable by splitting the geographic region in a grid
block_groups_grid = create_grid(block_groups, "GRID_ID", nrows=3, ncols=3)
# Select the target variable, the explanatory variables, the regime column, and the geometry
X = block_groups_grid[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'GRID_ID', 'geometry']]
# Define the training and test sets
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.2, random_state=32)
# Get the regime values
regimes_train = X_train["GRID_ID"].values.tolist()
regimes_test = X_test["GRID_ID"].values.tolist()
# Discard the categorical variable
X_train = X_train.drop("GRID_ID")
X_test = X_test.drop("GRID_ID")
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=10)
# Create a Spatial Regimes Regressor model
spatial_regimes_model = SpatialRegimesRegressor(spatial_weights_definition=weights_definition)
# Add the model to a spatial pipeline along with a preprocessing step
spatial_regimes_pipeline = SpatialPipeline([('scale', StandardScaler()), ('spatial_regimes', spatial_regimes_model)])
# Train the model using "MEDIAN_INCOME" as the target variable and specifying the regime values
spatial_regimes_pipeline.fit(X_train, "MEDIAN_INCOME", spatial_regimes__regimes=regimes_train)
# Print the predictions with the test set
spatial_regimes_predictions_test = spatial_regimes_pipeline.predict(X_test.drop(["MEDIAN_INCOME"]), spatial_regimes__regimes=regimes_test).flatten()
print(f"\n>> predictions (X_test):\n {spatial_regimes_predictions_test[:10]}")
# Print the score with the test set
spatial_regimes_r2_score = spatial_regimes_pipeline.score(X_test, y="MEDIAN_INCOME", spatial_regimes__regimes=regimes_test)
print(f"\n>> r2_score (X_test):\n {spatial_regimes_r2_score}")
The output of this program is as follows:
>> predictions (X_test):
[ 99973.28903064 119316.0422925 21627.0522275 26862.24033126
176529.76909922 55563.36270093 115297.87445691 33401.15374394
63827.11873494 26992.92679579]
>> r2_score (X_test):
0.67377148094271
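Because a separate OLS is estimated for each regime, every regime needs more observations than coefficients. The following sketch counts the observations per regime and flags the regimes that are too small; regime_counts is a hypothetical helper, the sample list stands in for a list such as regimes_train, and the value 5 assumes a constant plus four explanatory variables. How the library itself handles sparse or empty regimes is not covered here.

```python
from collections import Counter


def regime_counts(regimes, n_coefficients):
    """Count observations per regime and list regimes with too few rows."""
    counts = Counter(regimes)
    # An OLS fit needs more observations than estimated coefficients
    too_small = sorted(r for r, n in counts.items() if n <= n_coefficients)
    return counts, too_small


# Hypothetical regime labels; regime 2 has only two observations
regimes_example = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2]
counts, too_small = regime_counts(regimes_example, n_coefficients=5)

print(dict(counts))  # {0: 6, 1: 7, 2: 2}
print(too_small)     # [2]
```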
Since the spatial_weights_definition parameter was set when creating the SpatialRegimesRegressor instance, the summary property of the trained model displays spatial statistics. Note that there is a set of parameters for each regime, as well as some spatial statistics, such as Moran’s I and Lagrange Multipliers for spatial dependence.
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES - REGIMES
---------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : dep_var Number of Observations: 2750
Mean dependent var : 69703.4815 Number of Variables : 40
S.D. dependent var : 39838.5789 Degrees of Freedom : 2710
R-squared : 0.6974
Adjusted R-squared : 0.6930
Sum squared residual:1320270117156.439 F-statistic : 160.1405
Sigma-square :487184545.076 Prob(F-statistic) : 0
S.E. of regression : 22072.257 Log likelihood : -31387.645
Sigma-square ML :480098224.421 Akaike info criterion : 62855.290
S.E of regression ML: 21911.1438 Schwarz criterion : 63092.065
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
1_CONSTANT 67301.4371567 1953.6056568 34.4498578 0.0000000
1_MEAN_AGE -787.8377162 1485.8378441 -0.5302313 0.5959950
1_MEAN_EDUCATION_LEVEL 19399.3182180 3114.5763711 6.2285576 0.0000000
1_HOUSE_VALUE 18607.2342406 1584.1781459 11.7456703 0.0000000
1_INTERNET 13025.7000079 2370.5082392 5.4948976 0.0000000
2_CONSTANT 70316.2663016 3128.3635757 22.4770122 0.0000000
2_MEAN_AGE 4475.1151552 1602.7038604 2.7922283 0.0052714
2_MEAN_EDUCATION_LEVEL 6155.3917348 2436.3442043 2.5264869 0.0115775
2_HOUSE_VALUE 8287.3366860 4847.3374558 1.7096678 0.0874418
2_INTERNET 9610.2177802 1714.0106903 5.6068599 0.0000000
3_CONSTANT 24528.5879950 5872.7675236 4.1766659 0.0000305
3_MEAN_AGE 4605.8239137 1904.1647555 2.4188159 0.0156366
3_MEAN_EDUCATION_LEVEL 22124.7054269 5152.1353075 4.2942788 0.0000181
3_HOUSE_VALUE 22528.7956619 1505.5002005 14.9643259 0.0000000
3_INTERNET 22442.8115822 3672.8299785 6.1104956 0.0000000
4_CONSTANT 60346.7138163 1011.7946534 59.6432424 0.0000000
4_MEAN_AGE 2025.4934828 1131.5366834 1.7900378 0.0735594
4_MEAN_EDUCATION_LEVEL 12613.8139792 1879.7592801 6.7103347 0.0000000
4_HOUSE_VALUE 15802.2959953 1094.1149414 14.4429944 0.0000000
4_INTERNET 7544.7984901 1423.9963625 5.2983271 0.0000001
5_CONSTANT 60570.6305539 1375.4910298 44.0356420 0.0000000
5_MEAN_AGE 4004.8956000 1338.2798927 2.9925695 0.0027914
5_MEAN_EDUCATION_LEVEL 7093.5634835 1762.1713354 4.0254675 0.0000584
5_HOUSE_VALUE 4973.1688262 2760.5550262 1.8015105 0.0717336
5_INTERNET 5212.2336124 1092.1003496 4.7726691 0.0000019
6_CONSTANT 74193.6261803 1593.8110537 46.5510802 0.0000000
6_MEAN_AGE 8804.9736797 1830.8258733 4.8092906 0.0000016
6_MEAN_EDUCATION_LEVEL -1282.6669985 2732.9823394 -0.4693287 0.6388725
6_HOUSE_VALUE 24763.0330906 2724.0892923 9.0903896 0.0000000
6_INTERNET 14378.1718270 2116.0137823 6.7949330 0.0000000
7_CONSTANT 72053.1153887 1522.5496169 47.3239851 0.0000000
7_MEAN_AGE 3957.0149819 1885.0370696 2.0991709 0.0358941
7_MEAN_EDUCATION_LEVEL -1604.5557316 2759.9951560 -0.5813618 0.5610450
7_HOUSE_VALUE 25077.4167626 3315.7621906 7.5630927 0.0000000
7_INTERNET 11840.4394166 2062.1006321 5.7419309 0.0000000
8_CONSTANT 58026.3709199 3699.6679150 15.6842107 0.0000000
8_MEAN_AGE 4496.6200307 2673.0921045 1.6821792 0.0926493
8_MEAN_EDUCATION_LEVEL 17341.3083231 5737.4485722 3.0224773 0.0025306
8_HOUSE_VALUE 35050.3546911 3390.1281391 10.3389469 0.0000000
8_INTERNET 15125.8210946 3364.7884860 4.4953260 0.0000072
------------------------------------------------------------------------------------
Regimes variable: unknown
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 10.296
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1869.657 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 39 1548.245 0.0000
Koenker-Bassett test 39 544.999 0.0000
DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST MI/DF VALUE PROB
Moran's I (error) 0.1497 19.689 0.0000
Lagrange Multiplier (lag) 1 174.856 0.0000
Robust LM (lag) 1 1.572 0.2099
Lagrange Multiplier (error) 1 334.438 0.0000
Robust LM (error) 1 161.155 0.0000
Lagrange Multiplier (SARMA) 2 336.010 0.0000
REGIMES DIAGNOSTICS - CHOW TEST
VARIABLE DF VALUE PROB
CONSTANT 7 141.366 0.0000
HOUSE_VALUE 7 74.722 0.0000
INTERNET 7 41.075 0.0000
MEAN_AGE 7 19.445 0.0069
MEAN_EDUCATION_LEVEL 7 54.041 0.0000
Global test 35 566.146 0.0000
================================ END OF REPORT =====================================
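The Moran's I (error) line in the report measures spatial autocorrelation in the regression residuals. As a minimal sketch of the standard formula, I = (n / S0) * (e' W e) / (e' e), where e holds the centered residuals, W is the spatial weights matrix, and S0 is the sum of all weights, the following function computes the statistic from a dense matrix. The function name and the toy weights are hypothetical; the library derives its weights from the spatial_weights_definition parameter, and its exact computation is not shown in this document.

```python
import numpy as np


def morans_i(residuals, weights):
    """Moran's I of the residuals for a dense spatial weights matrix."""
    e = np.asarray(residuals, dtype=float)
    w = np.asarray(weights, dtype=float)
    e = e - e.mean()  # center the residuals
    n = e.size
    s0 = w.sum()      # sum of all the weights
    return (n / s0) * (e @ w @ e) / (e @ e)


# Four locations on a line; each one neighbors the adjacent locations
weights = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Similar residuals next to each other: positive autocorrelation
clustered = morans_i([1.0, 1.0, -1.0, -1.0], weights)
# Alternating residuals: negative autocorrelation
alternating = morans_i([1.0, -1.0, 1.0, -1.0], weights)

print(round(clustered, 4), round(alternating, 4))  # 0.3333 -1.0
```

A positive value, such as the 0.1497 reported above, indicates that neighboring observations tend to have similar residuals.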