Spatial Imputer
The SpatialImputer class fills in the missing values of an
observation using data from its neighbors.
According to Tobler's law, near things are more related than distant things. The imputer therefore uses spatial weights to compute each missing value from the neighboring observations.
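As a rough illustration of this idea, and not of how SpatialImputer is implemented internally, the following sketch replaces a missing value with the weighted average of its observed neighbors. The function name weighted_neighbor_mean, the location labels, and the weights are all made up for the example.

```python
def weighted_neighbor_mean(values, neighbor_weights):
    """Return the weighted average of the observed neighbor values.

    values: dict mapping a location label to its value (None if missing).
    neighbor_weights: dict mapping a neighbor label to its spatial weight.
    Both structures are hypothetical and only serve this illustration.
    """
    numerator = 0.0
    denominator = 0.0
    for label, weight in neighbor_weights.items():
        value = values[label]
        if value is None:
            # A neighbor that is itself missing contributes nothing
            continue
        numerator += weight * value
        denominator += weight
    # With no observed neighbor there is nothing to impute from
    return numerator / denominator if denominator else None


# Location "a" is missing; "b" is closer to it than "c", so it gets more weight
values = {"a": None, "b": 10.0, "c": 20.0, "d": None}
weights_for_a = {"b": 3.0, "c": 1.0, "d": 1.0}

imputed_a = weighted_neighbor_mean(values, weights_for_a)
print(imputed_a)  # 12.5
```

The weights are renormalized over the neighbors that have a value, so a missing neighbor ("d" here) does not pull the estimate toward zero.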
The following table describes the parameters of the SpatialImputer
class.
| Parameter | Description |
|---|---|
| spatial_weights_definition | Defines the relationship between the neighboring locations. It is necessary to retrieve information from the neighbors. |
| missing_values | All occurrences of missing_values will be imputed. |
| strategy | By default, SpatialImputer uses "mean" to fill in the missing values. In other words, the weighted average from the neighboring observations replaces the missing values. The other options are "median", "maximum", and "minimum". |
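To make the strategy options concrete, the sketch below applies each of the four named strategies to the values of a handful of neighbors. It is only an approximation under stated assumptions: the neighbors are weighted equally, the STRATEGIES mapping and the aggregate_neighbors helper are hypothetical, and this section does not describe how the library applies weights for the strategies other than "mean".

```python
import statistics

# Hypothetical mapping from strategy name to an aggregation function
STRATEGIES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "maximum": max,
    "minimum": min,
}


def aggregate_neighbors(neighbor_values, strategy="mean"):
    """Reduce the observed neighbor values to a single fill value."""
    return STRATEGIES[strategy](neighbor_values)


# Values observed at the neighbors of a location with a missing value
neighbor_values = [2.0, 4.0, 9.0]

results = {name: aggregate_neighbors(neighbor_values, name) for name in STRATEGIES}
print(results)
# {'mean': 5.0, 'median': 4.0, 'maximum': 9.0, 'minimum': 2.0}
```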
The SpatialImputer class is a transformer, and its main methods are
described in the following table.
| Method | Description |
|---|---|
| fit | Calculates the spatial lag from the training data. |
| transform | Returns a NumPy array in which the missing values of the given data are filled in according to the specified strategy. The use_fit_lag parameter determines whether to use the neighbors from the training set. |
| fit_transform | Calls the fit and transform methods sequentially with the training data. |
See the SpatialImputer class in Python API Reference for Oracle Spatial AI for more information.
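The split between fit, transform, and fit_transform can be sketched with a toy class. ToyNeighborImputer below is a hypothetical stand-in written in plain Python, not the oraclesai implementation: it picks the k nearest training locations by Euclidean distance, averages them with equal weights, and always takes its neighbors from the training set, which is only one of the behaviors that use_fit_lag selects between.

```python
import math


class ToyNeighborImputer:
    """Hypothetical illustration of the fit/transform pattern."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, points, values):
        # Keep only the training locations that have an observed value
        self.train_ = [(p, v) for p, v in zip(points, values) if v is not None]
        return self

    def transform(self, points, values):
        filled = []
        for point, value in zip(points, values):
            if value is None:
                # Take the k training locations closest to this point
                nearest = sorted(
                    self.train_, key=lambda pv: math.dist(point, pv[0])
                )[: self.k]
                value = sum(v for _, v in nearest) / len(nearest)
            filled.append(value)
        return filled

    def fit_transform(self, points, values):
        # Same as calling fit and then transform on the training data
        return self.fit(points, values).transform(points, values)


train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_values = [None, 4.0, 6.0, 50.0]

imputer = ToyNeighborImputer(k=2)
train_filled = imputer.fit_transform(train_points, train_values)
print(train_filled)  # [5.0, 4.0, 6.0, 50.0]

# A new observation is filled from its nearest training neighbors
new_filled = imputer.transform([(5.0, 4.0)], [None])
print(new_filled)  # [27.0]
```

The sketch assumes that at least one training location has an observed value; an empty training set would need separate handling.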
The following example uses the block_groups
SpatialDataFrame that was created earlier and performs the
following:
- Adds missing values to the INTERNET column.
- Defines the spatial weights using the K-Nearest Neighbors method.
- Calls the fit_transform method of the SpatialImputer to fill in the missing values of the training set.
Note that the target column (MEDIAN_INCOME) and the geometry
column are not part of the output.
import random
import numpy as np
from oraclesai import GeoDataFrameDataset, SpatialDataFrame
from oraclesai.preprocessing import SpatialImputer
from oraclesai.weights import KNNWeightsDefinition
random.seed(32)
block_groups_missing_df = block_groups.as_geodataframe()
# Assign missing values randomly to the internet column
ix = [row for row in range(block_groups.shape[0])]
for row in random.sample(ix, int(round(.1*len(ix)))):
    block_groups_missing_df.loc[row, "INTERNET"] = np.nan
# Create a SpatialDataFrame with the data containing missing values
block_groups_missing_pdf = SpatialDataFrame.create(GeoDataFrameDataset(block_groups_missing_df))
# Define the variables of the model
X = block_groups_missing_pdf[["MEDIAN_INCOME", "MEAN_AGE", "HOUSE_VALUE", "INTERNET", "geometry"]]
# Define the spatial weights
weights_definition = KNNWeightsDefinition(k=10)
# Print the total number of missing values
print(f"Missing Values Before Imputation = {np.sum(np.isnan(X.get_values()))}")
# Create an instance of SpatialImputer
spatial_imputer = SpatialImputer(missing_values=np.nan, spatial_weights_definition=weights_definition)
# Fill the missing values of the training data
X_imputed = spatial_imputer.fit_transform(X, y="MEDIAN_INCOME")
# Print the total number of missing values (0 is expected)
print(f"Missing Values After Imputation = {np.sum(np.isnan(X_imputed))}")
The resulting output shows the number of missing values before and after imputation.
Missing Values Before Imputation = 344
Missing Values After Imputation = 0