Geographical Regressor

A GeographicalRegressor is a spatial machine learning algorithm that performs regression using existing scikit-learn regression algorithms. It trains a global model on all the observations in the training data and a local model for each observation, and then predicts by summing the weighted results of the global model and a local model.

Both GeographicalRegressor and GeographicalClassifier extend the Geographical Random Forest algorithm in two ways: they allow underlying machine learning algorithms other than Random Forest, and they support training the local models in parallel, which helps the approach scale. See [4] for more information on the Geographical Random Forest algorithm.

This algorithm is useful when the data shows a high degree of spatial heterogeneity, that is, when it behaves differently in different geographical regions and a single model is therefore hard to fit well. The algorithm supports any scikit-learn regression algorithm, so you can choose the underlying regressor to suit the application, for example random forest, support vector, gradient boosting, or decision tree regression.
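To make spatial heterogeneity concrete, here is a minimal sketch in plain Python. It does not use oraclesai or scikit-learn. The data, the region names, and the helper functions `fit_line` and `mean_squared_error` are hypothetical and exist only for this illustration. It fits one line to two regions whose trends point in opposite directions, then fits one line per region.

```python
# Illustration only: plain-Python least squares, not the oraclesai implementation.

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(xs, ys, line):
    """Average squared residual of the fitted line on the given data."""
    a, b = line
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the same feature drives the target in opposite
# directions in two regions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
west_ys = [2.0 * x for x in xs]          # target rises with the feature
east_ys = [12.0 - 2.0 * x for x in xs]   # target falls with the feature

# One model for everything versus one model per region.
global_line = fit_line(xs + xs, west_ys + east_ys)
west_line = fit_line(xs, west_ys)
east_line = fit_line(xs, east_ys)

global_mse = mean_squared_error(xs + xs, west_ys + east_ys, global_line)
local_mse = (mean_squared_error(xs, west_ys, west_line)
             + mean_squared_error(xs, east_ys, east_line)) / 2

print(f"global model MSE: {global_mse:.2f}")
print(f"regional models MSE: {local_mse:.2f}")
```

The single line fitted to both regions comes out flat and misses both trends, while the regional fits reproduce their data exactly. Real data is rarely this clean, but the same effect is the motivation for combining a global model with local ones.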

The GeographicalRegressor class implements this regressor. You specify the scikit-learn regression algorithm when you create an instance of the class. During training, a global model is first built using the parameters provided at creation time. If the spatial relationship is not specified (by providing a SpatialWeightsDefinition or bandwidth information), it is computed at this point. Once the spatial relationship is defined, the local models are built. A prediction is made by locating the local model closest to the data to be predicted and summing the weighted results of the global model and that local model. Specifically, the returned prediction is calculated as follows:

local_model_prediction * local_weight + global_model_prediction * (1.0 - local_weight)

In the preceding calculation, local_weight can be a default or specified value.
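The calculation above can be sketched in plain Python. This is a minimal illustration and not the oraclesai implementation: only `local_weight` comes from the text, while `blend`, `nearest_local_model`, the dictionary layout of the local models, and the constant stand-in models are hypothetical. Choosing the closest local model by Euclidean distance is also an assumption.

```python
import math


def blend(local_prediction, global_prediction, local_weight):
    """Weighted sum of the local and global predictions."""
    return local_prediction * local_weight + global_prediction * (1.0 - local_weight)


def nearest_local_model(point, local_models):
    """Pick the local model whose center is closest to the query point
    (Euclidean distance is assumed for this sketch)."""
    return min(local_models, key=lambda m: math.dist(point, m["center"]))


# Hypothetical stand-ins for trained models: each returns a fixed value.
global_model = lambda features: 200.0
local_models = [
    {"center": (0.0, 0.0), "predict": lambda features: 100.0},
    {"center": (10.0, 10.0), "predict": lambda features: 300.0},
]

query_point = (1.0, 2.0)
features = [3, 1500]  # ignored by the stand-in models

local = nearest_local_model(query_point, local_models)
prediction = blend(local["predict"](features), global_model(features), local_weight=0.25)
print(prediction)
```

With `local_weight=1.0` the result is the local prediction alone, and with `local_weight=0.0` it is the global prediction alone. Values in between trade local detail for global stability.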

The following table describes the principal methods of the GeographicalRegressor class.

Method	Description
`fit`	First, the global model is built using the parameters provided at creation time. If the spatial relationship is not specified (either by the `spatial_weights_definition` or the `bandwidth` parameter), it is internally computed. Then, several local models are trained.
`predict`	The prediction is made by locating the local model closest to the data to be predicted and summing the weighted results of the global and local models. The returned prediction is calculated as follows: `local_model_prediction * local_weight + global_model_prediction * (1.0 - local_weight)`. In the preceding calculation, `local_weight` is a parameter that can be specified by the user.
`fit_predict`	Calls the `fit` and `predict` methods sequentially with the training data.
`score`	Returns the R-squared statistic for the given data.

See the GeographicalRegressor class in Python API Reference for Oracle Spatial AI for more information.
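For reference, the R-squared statistic returned by `score` can be sketched as follows. This assumes the standard coefficient of determination, which is what scikit-learn regressors report. The helper name `r_squared` and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2 = r_squared(y_true, y_pred)
print(round(r2, 3))
```

A value of 1.0 means the predictions match the observations exactly. A value of 0.0 means the model does no better than always predicting the mean of the observed values, and negative values are possible for worse models.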

The following example uses the houses_full SpatialDataFrame, which is backed by the la_median_house_values database table. This table contains housing information for the city of Los Angeles.

from oraclesai import SpatialDataFrame, DBSpatialDataset
import oml

houses_full = SpatialDataFrame.create(DBSpatialDataset(table='la_median_house_values', schema='oml_user'))
houses_full = houses_full.to_crs('epsg:3857')

The code then performs the following steps:

1. Defines the target variable (HOUSE_VALUE_MEDIAN) and the explanatory variables for the regression model.
2. Splits the data into training and test sets.
3. Defines the spatial weights and creates an instance of GeographicalRegressor, which runs multiple Random Forest regressors locally.
4. Calls the predict and score methods to estimate the target variable for the test set and to compute the R-squared statistic on the same test set.

from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import SpatialWeights, KNNWeightsDefinition
from sklearn.ensemble import RandomForestRegressor
from oraclesai.regression import GeographicalRegressor

# Define explanatory variables
feature_columns = [
    'BEDROOMS_TOTAL',
    'EDU_LEVEL_SCORE_MEDIAN',
    'POPULATION_DENSITY',
    'ROOMS_TOTAL',
    'COMPLETE_PLUMBING_PERC',
    'COMPLETE_KITCHEN_PERC',
    'HOUSE_AGE_MEDIAN',
    'RENTED_PERC',
    'UNITS_TOTAL'
]

# Define the target variable
target_column = 'HOUSE_VALUE_MEDIAN'

# Select a subset of columns
houses = houses_full[[target_column] + feature_columns]

# Remove rows with null values
houses = houses.dropna()

# Define the training and test sets
X_train, X_test, y_train, y_test, geom_train, geom_test = spatial_train_test_split(
    houses,
    y=target_column,
    test_size=0.33,
    numpy_result=True,
    random_state=32,
)

# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=3)
train_weights = SpatialWeights.create(geom_train, weights_definition)
test_weights = SpatialWeights.create(geom_test, weights_definition)

# Create an instance of GeographicalRegressor
grf_model = GeographicalRegressor(model_cls=RandomForestRegressor, n_estimators=10, random_state=32)

# Train the model
grf_model.fit(X_train, geometries=geom_train, y=y_train, spatial_weights=train_weights)

# Print the predictions with the test set
grf_predictions_test = grf_model.predict(X_test, geometries=geom_test).flatten()
print(f"\n>> predictions (X_test):\n {grf_predictions_test[:10]}")

# Print the score with the test set
grf_r2_score = grf_model.score(X_test, y_test, geometries=geom_test)
print(f"\n>> r2_score (X_test):\n {grf_r2_score}")

The output of the program is as follows:

>> predictions (X_test):
 [622135.  422560.  426457.5 749530.  925412.5 469420.  526467.5 880195.
 460922.5 421930. ]

>> r2_score (X_test):
 0.6774993920744854
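The example above defines its spatial relationship with KNNWeightsDefinition(k=3). As a rough illustration of the k-nearest-neighbors idea, the sketch below links each location to its three closest neighbors. It is plain Python and not the oraclesai implementation. The function `k_nearest_neighbors` and the sample centroids are hypothetical, and treating each geometry as a single point with Euclidean distance is an assumption.

```python
import math


def k_nearest_neighbors(points, k):
    """Map each point index to the indices of its k nearest other points,
    ordered from nearest to farthest."""
    neighbors = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors[i] = others[:k]
    return neighbors


# Hypothetical centroids in a projected coordinate system.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]

neighbors = k_nearest_neighbors(centroids, k=3)
for index, linked in neighbors.items():
    print(index, linked)
```

Every observation gets exactly k neighbors regardless of how densely the data is sampled, which is the usual reason to prefer this kind of definition over a fixed distance threshold when observation density varies across the study area.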