Geographical Classifier

Similar to GeographicalRegressor, the GeographicalClassifier class trains a global model and multiple local models and predicts by combining the weighted results from both models.

By defining the global_model and model_cls parameters, you can specify the scikit-learn global and local classifiers respectively. The classifiers can be any scikit-learn classifiers, including Random Forest, Support Vector, Gradient Boosting, Decision Trees, and so on.

Both GeographicalClassifier and GeographicalRegressor extend the Geographical Random Forest algorithm in two ways: they allow underlying machine learning algorithms other than Random Forest, and they support training the local models in parallel, which helps the training scale. See [4] for more information on the Geographical Random Forest algorithm.

The following table describes the main methods of the GeographicalClassifier class.

Method Description
fit First, the global model is built using the parameters provided at creation time. If the spatial relationship is not specified (either by the spatial_weights_definition or the bandwidth parameter), it is internally computed. Then, several local models are trained.
predict The following steps describe the prediction method:
  1. The local model closest to the observation to be predicted is located.
  2. A weighted average of the predictions from the global model and that local model gives, for each class, the estimated probability that the observation belongs to it.
  3. The class with the highest probability is returned as the predicted value.
fit_predict Calls the fit and predict methods sequentially with the training data.
score Returns the model's accuracy for the given data.
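
The weighting and selection steps of predict can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name combine_probabilities, the sample probabilities, and the exact formula (local_weight applied to the local model, the remainder to the global model) are assumptions made for illustration only.

```python
import numpy as np

def combine_probabilities(global_proba, local_proba, local_weight):
    """Blend per-class probabilities from a global and a local model.

    Hypothetical helper: assumes the weighted average is
    local_weight * local + (1 - local_weight) * global.
    """
    global_proba = np.asarray(global_proba, dtype=float)
    local_proba = np.asarray(local_proba, dtype=float)
    return local_weight * local_proba + (1.0 - local_weight) * global_proba

# Made-up class probabilities for two observations and three classes
global_proba = np.array([[0.2, 0.5, 0.3],
                         [0.6, 0.3, 0.1]])
local_proba = np.array([[0.1, 0.2, 0.7],
                        [0.5, 0.4, 0.1]])

combined = combine_probabilities(global_proba, local_proba, local_weight=0.8)

# The class with the highest combined probability is the prediction
predicted = combined.argmax(axis=1)
print(predicted)
```

With a local weight of 0.8, the local model dominates the result, so the first observation is assigned class 2 even though the global model alone favors class 1.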

See the Geographical Classifier class in Python API Reference for Oracle Spatial AI for more information.

The following code uses the houses_full SpatialDataFrame, containing housing information for the city of Los Angeles. The example performs the following steps:

  1. Creates a categorical variable based on the HOUSE_VALUE_MEDIAN column.
  2. Defines the training and test sets.
  3. Creates an instance of GeographicalClassifier.
  4. Trains the model, using RandomForestClassifier from scikit-learn for the local models.
  5. Calls the predict and score methods to estimate the target variable and the model's accuracy on the test set, respectively.
from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import DistanceBandWeightsDefinition
from sklearn.ensemble import RandomForestClassifier
from oraclesai.classification import GeographicalClassifier

# Define explanatory variables
feature_columns = [
    'BEDROOMS_TOTAL',
    'EDU_LEVEL_SCORE_MEDIAN',
    'POPULATION_DENSITY',
    'ROOMS_TOTAL',
    'COMPLETE_PLUMBING_PERC',
    'COMPLETE_KITCHEN_PERC',
    'HOUSE_AGE_MEDIAN',
    'RENTED_PERC',
    'UNITS_TOTAL'
]

# The target variable will be built from this column
target_column = 'HOUSE_VALUE_MEDIAN'

# Select a subset of columns
houses = houses_full[[target_column] + feature_columns]

# Remove rows with null values
houses = houses.dropna()

# Define training and test sets
X_train, X_test, y_train, y_test, geom_train, geom_test = spatial_train_test_split(houses,
                                                                                   y=target_column, 
                                                                                   test_size=0.33,
                                                                                   numpy_result=True,
                                                                                   random_state=32)

# Define constants to create a categorical variable
y = houses[target_column].values
y_mean = y.mean()
y_std = y.std()

# House values below the mean minus 0.5 std are considered low-value
# House values above the mean plus 0.5 std are considered high-value
mid_low_price = y_mean - y_std * 0.5
mid_hi_price = y_mean + y_std * 0.5

# Define the function that generates the target variable based on the house value:
# 0.0 = low value, 1.0 = mid value, 2.0 = high value
def classify_house_value(value):
    if value < mid_low_price:
        return 0.0
    if value > mid_hi_price:
        return 2.0
    return 1.0

# Generate the target variable for the training and test sets
y_c_train = [classify_house_value(value) for value in y_train]
y_c_test = [classify_house_value(value) for value in y_test]

# Define the spatial weights
weights_definition = DistanceBandWeightsDefinition(threshold=2388.51)

# Create an instance of GeographicalClassifier
grfc_model = GeographicalClassifier(model_cls=RandomForestClassifier, 
                                    n_estimators=10, 
                                    local_weight=0.80, 
                                    spatial_weights_definition=weights_definition, 
                                    random_state=32)

# Train the model
grfc_model.fit(X_train, y=y_c_train, geometries=geom_train, n_jobs=-1)

# Print the predictions with the test set
grfc_predictions_test = grfc_model.predict(X_test, geometries=geom_test).flatten()
print(f"\n>> predictions (X_test):\n {grfc_predictions_test[:10]}")

# Print the score with the test set
grfc_accuracy = grfc_model.score(X_test, y_c_test, geometries=geom_test)
print(f"\n>> accuracy (X_test):\n {grfc_accuracy}")

The output consists of the predictions for the first 10 observations of the test set and the model's accuracy on the same test set.

>> predictions (X_test):
 [1 1 0 2 2 1 1 0 0 0]

>> accuracy (X_test):
 0.7343004295345901