Spatial Imputer

The SpatialImputer class fills in the missing values of an observation using data from its neighboring observations.

According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things. SpatialImputer therefore uses spatial weights to compute each missing value from nearby observations.

The following table describes the parameters of the SpatialImputer class.

Parameter                    Description
spatial_weights_definition   Defines the relationship between neighboring locations. It is required to retrieve information from the neighbors.
missing_values               All occurrences of missing_values will be imputed.
strategy                     The strategy used to fill in the missing values. The default is "mean", which replaces each missing value with the weighted average of the neighboring observations. The other options are "median", "maximum", and "minimum".
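
To make the role of the neighbors and of the strategy parameter concrete, the following is a minimal NumPy sketch of neighbor-based imputation. It is not the oraclesai implementation: the knn_impute function and its arguments are hypothetical, and it assumes that each of the k nearest observed neighbors carries the same weight.

```python
import numpy as np

def knn_impute(coords, values, k=3, strategy="mean"):
    """Fill NaNs in `values` from the k nearest observed neighbors (hypothetical helper)."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Map each strategy name to the function that summarizes the neighbor values
    reducers = {"mean": np.mean, "median": np.median,
                "maximum": np.max, "minimum": np.min}
    reduce = reducers[strategy]

    filled = values.copy()
    observed = np.flatnonzero(~np.isnan(values))
    for i in np.flatnonzero(np.isnan(values)):
        # Euclidean distance from location i to every location with an observed value
        dist = np.linalg.norm(coords[observed] - coords[i], axis=1)
        nearest = observed[np.argsort(dist)[:k]]
        filled[i] = reduce(values[nearest])
    return filled

# Four observed locations and one location with a missing value
coords = [(0, 0), (1, 0), (0, 1), (10, 10), (0.5, 0.5)]
income = [10.0, 20.0, 30.0, 500.0, np.nan]

print(float(knn_impute(coords, income, k=3, strategy="mean")[4]))     # 20.0
print(float(knn_impute(coords, income, k=3, strategy="maximum")[4]))  # 30.0
```

The distant observation at (10, 10) does not influence the result because it is not among the three nearest neighbors of the location being filled.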

The SpatialImputer class is a transformer, and its main methods are described in the following table.

Method          Description
fit             Calculates the spatial lag from the training data.
transform       Returns a NumPy array in which the missing values of the given data are filled in according to the specified strategy. The use_fit_lag parameter determines whether to use the neighbors from the training set.
fit_transform   Calls the fit and transform methods sequentially with the training data.
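
The split between fit and transform can be pictured with the toy transformer below. It is a sketch under assumptions, not the oraclesai class: ToySpatialImputer and its arguments are hypothetical, and it reflects one reading of "use the neighbors from the training set", in which fit remembers the observed training locations and transform fills missing values in any data from those remembered neighbors.

```python
import numpy as np

class ToySpatialImputer:
    """Hypothetical stand-in that illustrates the fit/transform pattern only."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, coords, values):
        coords = np.asarray(coords, dtype=float)
        values = np.asarray(values, dtype=float)
        # Remember only the training locations that have an observed value
        keep = ~np.isnan(values)
        self.coords_ = coords[keep]
        self.values_ = values[keep]
        return self

    def transform(self, coords, values):
        coords = np.asarray(coords, dtype=float)
        out = np.asarray(values, dtype=float).copy()
        for i in np.flatnonzero(np.isnan(out)):
            # Look up the k nearest neighbors among the training locations
            dist = np.linalg.norm(self.coords_ - coords[i], axis=1)
            nearest = np.argsort(dist)[: self.k]
            out[i] = self.values_[nearest].mean()
        return out

    def fit_transform(self, coords, values):
        # Same as calling fit and then transform on the training data
        return self.fit(coords, values).transform(coords, values)

train_coords = [(0, 0), (1, 0), (5, 5)]
train_values = [1.0, 3.0, 100.0]
imputer = ToySpatialImputer(k=2).fit(train_coords, train_values)

# A new location with a missing value is filled from its training neighbors
print(imputer.transform([(0.5, 0)], [np.nan]))
```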

See the SpatialImputer class in the Python API Reference for Oracle Spatial AI for more information.

The following example uses the block_groups SpatialDataFrame that was created earlier and performs the following:

  1. Randomly replaces 10% of the values in the INTERNET column with missing values.
  2. Defines the spatial weights using the K-Nearest Neighbors method.
  3. Calls the fit_transform method of the SpatialImputer to fill in the missing values of the training set.

Note that the target column (MEDIAN_INCOME) and the geometry column are not part of the output.

import random
import numpy as np
from oraclesai import GeoDataFrameDataset, SpatialDataFrame
from oraclesai.preprocessing import SpatialImputer
from oraclesai.weights import KNNWeightsDefinition

random.seed(32) 
block_groups_missing_df = block_groups.as_geodataframe() 

# Randomly set 10% of the rows in the INTERNET column to NaN
ix = list(range(block_groups.shape[0]))
for row in random.sample(ix, int(round(.1*len(ix)))):
    block_groups_missing_df.loc[row, "INTERNET"] = np.nan

# Create a SpatialDataFrame with the data containing missing values 
block_groups_missing_pdf = SpatialDataFrame.create(GeoDataFrameDataset(block_groups_missing_df)) 

# Define the variables of the model 
X = block_groups_missing_pdf[["MEDIAN_INCOME", "MEAN_AGE", "HOUSE_VALUE", "INTERNET", "geometry"]] 

# Define the spatial weights 
weights_definition = KNNWeightsDefinition(k=10) 

# Print the total number of missing values 
print(f"Missing Values Before Imputation = {np.sum(np.isnan(X.get_values()))}") 

# Create an instance of SpatialImputer 
spatial_imputer = SpatialImputer(missing_values=np.nan, spatial_weights_definition=weights_definition) 

# Fill the missing values of the training data 
X_imputed = spatial_imputer.fit_transform(X, y="MEDIAN_INCOME") 

# Print the total number of missing values (0 is expected) 
print(f"Missing Values After Imputation = {np.sum(np.isnan(X_imputed))}")

The resulting output shows the number of missing values before and after imputation.

Missing Values Before Imputation = 344
Missing Values After Imputation = 0
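
Beyond counting missing values, it can be useful to confirm that imputation changed only the entries that were missing. The helper below is a hypothetical sketch in plain NumPy and is not part of oraclesai. It assumes that both arrays hold the same columns in the same order, so with the example above the input would first need to be restricted to the columns present in the output (without MEDIAN_INCOME and geometry).

```python
import numpy as np

def check_imputation(before, after):
    """Summarize what changed between two arrays of identical shape (hypothetical helper)."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    was_missing = np.isnan(before)
    return {
        # Entries that were NaN before and hold a value afterwards
        "filled": int(was_missing.sum() - np.isnan(after[was_missing]).sum()),
        # Entries that are still NaN after imputation
        "still_missing": int(np.isnan(after).sum()),
        # True when every originally observed entry kept its value
        "observed_unchanged": bool(
            np.array_equal(before[~was_missing], after[~was_missing])
        ),
    }

before = np.array([[1.0, np.nan], [3.0, 4.0]])
after = np.array([[1.0, 2.5], [3.0, 4.0]])
print(check_imputation(before, after))
```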