Geographical Regressor
A GeographicalRegressor is a spatial machine learning algorithm that performs
regression using the existing scikit-learn regression algorithms. It trains a
global model on all the observations in the training data and a local model for
each observation, and it predicts by summing the weighted results of the global model
and the local model.
Both GeographicalRegressor and GeographicalClassifier extend the Geographical
Random Forest algorithm in two ways: they allow underlying machine learning algorithms
other than Random Forest, and they support parallel training of the local models for
robust and scalable performance. See [4] for more information on the Geographical Random
Forest algorithm.
This algorithm is useful when there is a high degree of spatial heterogeneity in the
data, that is, when the data behaves differently in different geographical regions and a
single model is therefore hard to fit well. The algorithm supports any
scikit-learn regression algorithm, so you can choose the underlying regressor to suit
the application, for example random forest, support vector, gradient boosting, or
decision tree.
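The interchangeability described above rests on the shared scikit-learn estimator interface. The following minimal sketch uses only scikit-learn and synthetic data (the dictionary name, the hyperparameters, and the data are illustrative assumptions) to show that the four regressor families mentioned can be built from a class plus keyword arguments and then fitted and scored the same way:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical candidates: each entry is (estimator class, keyword arguments).
candidate_models = {
    "random forest": (RandomForestRegressor, {"n_estimators": 10, "random_state": 0}),
    "support vector": (SVR, {"C": 1.0}),
    "gradient boosting": (GradientBoostingRegressor, {"n_estimators": 10, "random_state": 0}),
    "decision tree": (DecisionTreeRegressor, {"random_state": 0}),
}

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Every candidate is created, trained, and scored through the same calls.
scores = {}
for name, (model_cls, params) in candidate_models.items():
    model = model_cls(**params).fit(X, y)
    scores[name] = model.score(X, y)  # R-squared on the training data

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because each estimator exposes the same fit, predict, and score methods, a wrapper only needs the class and its parameters, not any algorithm-specific code.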
The GeographicalRegressor class implements this regressor. You can
specify the scikit-learn regression algorithm when you create an
instance of this class. First, a global model is built using the parameters provided at
creation time. If the spatial relationship is not specified (by providing
SpatialWeightsDefinition or bandwidth information), it is computed first. After the
spatial relationship is defined, several local models are built. A prediction is made by
locating the local model closest to the data to be predicted and summing the weighted
results of the global model and the local model. Specifically, the returned prediction is
calculated as follows:

local_model_prediction * local_weight + global_model_prediction * (1.0 - local_weight)

In the preceding calculation, local_weight can be a default or specified value.
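As a worked illustration of this weighted sum, the following sketch applies the formula to two made-up predictions with NumPy. The function name, the sample values, and the local_weight of 0.25 are assumptions chosen for the example; they are not library defaults.

```python
import numpy as np


def blend_predictions(local_pred, global_pred, local_weight):
    """Weighted sum of local and global predictions, as in the formula above."""
    local_pred = np.asarray(local_pred, dtype=float)
    global_pred = np.asarray(global_pred, dtype=float)
    return local_pred * local_weight + global_pred * (1.0 - local_weight)


# Made-up predictions for two observations.
local_pred = np.array([100.0, 200.0])
global_pred = np.array([200.0, 400.0])

blended = blend_predictions(local_pred, global_pred, local_weight=0.25)
print(blended)  # [175. 350.]
```

A local_weight of 1.0 returns the local prediction unchanged, and a local_weight of 0.0 returns the global prediction unchanged.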
The following table describes the principal methods of this class.
| Method | Description |
|---|---|
| fit | First, the global model is built using the parameters provided at creation time. If the spatial relationship is not specified (either by the spatial_weights_definition or the bandwidth parameter), it is computed internally. Then, several local models are trained. |
| predict | The prediction is made by locating the local model closest to the data to be predicted and summing the weighted results of the global and local models. The returned prediction is calculated as local_model_prediction * local_weight + global_model_prediction * (1.0 - local_weight), where local_weight is a parameter that can be specified by the user. |
| fit_predict | Calls the fit and predict methods sequentially with the training data. |
| score | Returns the R-squared statistic for the given data. |
See the GeographicalRegressor class in Python API Reference for Oracle Spatial AI for more information.
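To make the fit and predict steps concrete, the following is a conceptual sketch written with scikit-learn only. It is not the GeographicalRegressor implementation: the class name LocalGlobalSketch, the use of k nearest neighbors to define each local neighborhood, the default values, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeRegressor


class LocalGlobalSketch:
    """Illustrative only: one global model plus one local model per training point."""

    def __init__(self, model_cls=DecisionTreeRegressor, k=5, local_weight=0.5, **model_params):
        self.model_cls = model_cls
        self.k = k
        self.local_weight = local_weight
        self.model_params = model_params

    def fit(self, X, y, coords):
        X, y, coords = np.asarray(X), np.asarray(y), np.asarray(coords)
        # Global model: trained on every observation.
        self.global_model_ = self.model_cls(**self.model_params).fit(X, y)
        # Local models: one per observation, trained on its k nearest
        # observations (the neighborhood includes the observation itself).
        self.locator_ = NearestNeighbors(n_neighbors=self.k).fit(coords)
        _, neighbors = self.locator_.kneighbors(coords)
        self.local_models_ = [
            self.model_cls(**self.model_params).fit(X[idx], y[idx]) for idx in neighbors
        ]
        return self

    def predict(self, X, coords):
        X, coords = np.asarray(X), np.asarray(coords)
        # Locate the training observation, and so the local model, closest to each query point.
        _, nearest = self.locator_.kneighbors(coords, n_neighbors=1)
        local_pred = np.array([
            self.local_models_[i].predict(row.reshape(1, -1))[0]
            for i, row in zip(nearest[:, 0], X)
        ])
        global_pred = self.global_model_.predict(X)
        # Weighted sum of the local and global predictions.
        return local_pred * self.local_weight + global_pred * (1.0 - self.local_weight)

    def fit_predict(self, X, y, coords):
        return self.fit(X, y, coords).predict(X, coords)


# Synthetic data: 60 points with 2D coordinates and 3 features.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(60, 2))
X = rng.normal(size=(60, 3))
y = X[:, 0] + 0.5 * coords[:, 0] + rng.normal(scale=0.1, size=60)

sketch = LocalGlobalSketch(k=8, local_weight=0.5, random_state=0)
preds = sketch.fit_predict(X, y, coords)
print(preds.shape)
```

The sketch trains the local models sequentially for readability; the text above notes that the actual class supports parallel training of the local models.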
The following example uses houses_full, a SpatialDataFrame with a connection to the
la_median_house_values database table. This table contains housing
information from the city of Los Angeles.
from oraclesai import SpatialDataFrame, DBSpatialDataset
import oml
houses_full = SpatialDataFrame.create(DBSpatialDataset(table='la_median_house_values', schema='oml_user'))
houses_full = houses_full.to_crs('epsg:3857')

The code then performs the following steps:
- Defines the target variable (HOUSE_VALUE_MEDIAN) and the explanatory variables for the regression model.
- Splits the data into training and test sets.
- Defines the spatial weights and creates an instance of GeographicalRegressor, which runs multiple Random Forest regressors locally.
- Calls the predict and score methods to estimate the target variable of the test set and the R-squared statistic from the same test set.
from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import SpatialWeights, KNNWeightsDefinition
from sklearn.ensemble import RandomForestRegressor
from oraclesai.regression import GeographicalRegressor
# Define explanatory variables
feature_columns = [
    'BEDROOMS_TOTAL',
    'EDU_LEVEL_SCORE_MEDIAN',
    'POPULATION_DENSITY',
    'ROOMS_TOTAL',
    'COMPLETE_PLUMBING_PERC',
    'COMPLETE_KITCHEN_PERC',
    'HOUSE_AGE_MEDIAN',
    'RENTED_PERC',
    'UNITS_TOTAL'
]
# Define the target variable
target_column = 'HOUSE_VALUE_MEDIAN'
# Select a subset of columns
houses = houses_full[[target_column] + feature_columns]
# Remove rows with null values
houses = houses.dropna()
# Define the training and test sets
X_train, X_test, y_train, y_test, geom_train, geom_test = spatial_train_test_split(
    houses,
    y=target_column,
    test_size=0.33,
    numpy_result=True,
    random_state=32)
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=3)
train_weights = SpatialWeights.create(geom_train, weights_definition)
test_weights = SpatialWeights.create(geom_test, weights_definition)
# Create an instance of GeographicalRegressor
grf_model = GeographicalRegressor(model_cls=RandomForestRegressor, n_estimators=10, random_state=32)
# Train the model
grf_model.fit(X_train, geometries=geom_train, y=y_train, spatial_weights=train_weights)
# Print the predictions with the test set
grf_predictions_test = grf_model.predict(X_test, geometries=geom_test).flatten()
print(f"\n>> predictions (X_test):\n {grf_predictions_test[:10]}")
# Print the score with the test set
grf_r2_score = grf_model.score(X_test, y_test, geometries=geom_test)
print(f"\n>> r2_score (X_test):\n {grf_r2_score}")

The output of the program is as follows:
>> predictions (X_test):
[622135. 422560. 426457.5 749530. 925412.5 469420. 526467.5 880195.
460922.5 421930. ]
>> r2_score (X_test):
0.6774993920744854
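
To help interpret the score, the following sketch computes R-squared by hand. It assumes the score method follows the usual definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are made up for illustration and are unrelated to the housing data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up values: three exact predictions and one that is off by 1.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])

r2 = r_squared(y_true, y_pred)
print(r2)  # 0.8
```

Under this definition, a score of about 0.68 means that the model accounts for roughly 68% of the variance of the target variable in the test set.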