Splitting Datasets

Spatial AI provides two ways of splitting a dataset into different subsets.

The following sections describe both supported methods:

Using the SpatialDataFrame.split Function

A SpatialDataFrame can be split into two or more subsets by calling the split method. The split method takes a tuple containing the size ratio of each subset.

The number of elements in the ratio tuple determines the number of SpatialDataFrame subsets returned.

See the SpatialDataFrame.split method in Python API Reference for Oracle Spatial AI for more information.

The following example splits the given SpatialDataFrame into training, test, and validation subsets containing 50%, 30%, and 20% of the rows of the original SpatialDataFrame, respectively.

# Print the size of the SpatialDataFrame defined as X
print(f"\n>> X (shape):\n {X.shape}")

# Split X into smaller SpatialDataFrames
X_train, X_test, X_validation = X.split(ratio=(0.5, 0.3, 0.2))

# Print the size of the resulting datasets
print(f"\n>> X_train (shape):\n {X_train.shape}")
print(f"\n>> X_test (shape):\n {X_test.shape}")
print(f"\n>> X_validation (shape):\n {X_validation.shape}")

The output for the preceding example is:

>> X (shape):
 (3437, 5)

>> X_train (shape):
 (1718, 5)

>> X_test (shape):
 (1031, 5)

>> X_validation (shape):
 (688, 5)
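
The subset sizes in the output above can be reproduced with simple arithmetic. The following is a minimal sketch, not the library's implementation: the helper subset_sizes is hypothetical, and it assumes that each subset except the last is rounded down and that the last subset receives the remaining rows. That assumption is consistent with the sizes shown above, but the actual rounding rule used by SpatialDataFrame.split is not confirmed here.

```python
def subset_sizes(n_rows, ratio):
    """Return one subset size per ratio entry (hypothetical helper).

    Assumption: every subset except the last is rounded down, and the
    last subset takes whatever rows remain, so the sizes always sum to n_rows.
    """
    sizes = [int(n_rows * r) for r in ratio[:-1]]  # round down all but the last
    sizes.append(n_rows - sum(sizes))              # remainder goes to the last subset
    return sizes


# 3437 rows split with ratio (0.5, 0.3, 0.2)
print(subset_sizes(3437, (0.5, 0.3, 0.2)))  # [1718, 1031, 688]
```

Because 3437 is not evenly divisible by the ratios, the subsets hold approximately, rather than exactly, 50%, 30%, and 20% of the rows.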

Using the spatial_train_test_split Function

The spatial_train_test_split function receives an instance of the SpatialDataFrame class and splits it into the training and test subsets.

Each subset is divided into the explanatory variables X, the geometries, and the target variable y. X is a matrix of size (n-samples * n-features), while the geometries and y are vectors of length n-samples. The training subset can then be further split into training and validation subsets by calling the same function.

See the spatial_train_test_split function in Python API Reference for Oracle Spatial AI for more information.

The following example splits the data stored in the block_groups_missing_pdf SpatialDataFrame into training and test sets. X_train contains 90% of the original data, and X_test contains the remaining 10%. The proportion is set by the test_size parameter.

from oraclesai.preprocessing import spatial_train_test_split 
 
# Define variables
X = block_groups_missing_pdf[["MEDIAN_INCOME", "MEAN_AGE", "HOUSE_VALUE", "INTERNET", "geometry"]] 
 
# Print the size of the data
print(f"\n>> X (shape):\n {X.shape}") 
 
# Split the data into training and test sets, using 10% for testing
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.1) 
 
# Print the size of the training and test sets
print(f"\n>> X_train (shape):\n {X_train.shape}") 
print(f"\n>> X_test (shape):\n {X_test.shape}")

The code prints the original size of the data and the size of the two subsets from the split. The number of features in both subsets remains the same after the split.

>> X (shape):
 (3437, 5)

>> X_train (shape):
 (3093, 5)

>> X_test (shape):
 (344, 5)
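
As noted earlier, the training subset can be split again to obtain a validation subset. The following sketch illustrates that two-stage logic on row indices using only the Python standard library. It does not call spatial_train_test_split: the helper train_test_indices is hypothetical, the 20% validation share is an arbitrary choice for illustration, and rounding the test count up is an assumption chosen to match the 344 test rows shown above.

```python
import math
import random


def train_test_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper. Assumption: the test count is rounded up.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)    # number of rows held out
    return indices[n_test:], indices[:n_test]


# Stage 1: hold out 10% of the 3437 rows for testing
train_idx, test_idx = train_test_indices(3437, test_size=0.1)
print(len(train_idx), len(test_idx))  # 3093 344

# Stage 2: hold out 20% of the training rows for validation
keep_pos, val_pos = train_test_indices(len(train_idx), test_size=0.2, seed=1)
val_idx = [train_idx[i] for i in val_pos]
final_train_idx = [train_idx[i] for i in keep_pos]
print(len(final_train_idx), len(val_idx), len(test_idx))  # 2474 619 344
```

The three index lists are disjoint and together cover every row, so no row appears in more than one subset.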