Splitting Datasets
Spatial AI provides two ways of splitting a dataset into subsets: the SpatialDataFrame.split method and the spatial_train_test_split function.
The following sections describe both methods:
Using the SpatialDataFrame.split Method
A SpatialDataFrame can be split into two or more subsets by calling the split method. The split method takes a tuple containing the size ratio of each subset. The number of elements in the ratio tuple determines the number of SpatialDataFrame subsets returned.
See the SpatialDataFrame.split method in Python API Reference for Oracle Spatial AI for more information.
The following example splits the given SpatialDataFrame into train, test, and validation subsets containing 50%, 30%, and 20% of the elements of the original SpatialDataFrame, respectively.
# Print the size of the SpatialDataFrame defined as X
print(f"\n>> X (shape):\n {X.shape}")
# Split X into smaller SpatialDataFrames
X_train, X_test, X_validation = X.split(ratio=(0.5, 0.3, 0.2))
# Print the size of the resulting datasets
print(f"\n>> X_train (shape):\n {X_train.shape}")
print(f"\n>> X_test (shape):\n {X_test.shape}")
print(f"\n>> X_validation (shape):\n {X_validation.shape}")
The output for the preceding example is:
>> X (shape):
(3437, 5)
>> X_train (shape):
(1718, 5)
>> X_test (shape):
(1031, 5)
>> X_validation (shape):
(688, 5)
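The subset sizes in the preceding output can be checked with simple arithmetic. The following is a minimal sketch in plain Python, not the library's implementation: subset_sizes is a hypothetical helper, and the rounding rule it uses (truncate every subset except the last, which takes the remaining rows) is an assumption chosen because it reproduces the sizes shown above.

```python
def subset_sizes(n_rows, ratio):
    """Return one row count per entry in `ratio` (hypothetical helper)."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    # Assumption: every subset except the last is truncated to whole rows.
    sizes = [int(n_rows * r) for r in ratio[:-1]]
    # Assumption: the last subset receives whatever rows remain.
    sizes.append(n_rows - sum(sizes))
    return sizes


# 3437 rows split with the ratio tuple used in the example above
sizes = subset_sizes(3437, (0.5, 0.3, 0.2))
print(sizes)  # [1718, 1031, 688]
```

A ratio tuple with three entries yields three sizes, which matches the three SpatialDataFrame subsets returned in the example.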
Using the spatial_train_test_split Function
The spatial_train_test_split function receives an instance of the SpatialDataFrame class and splits it into training and test subsets. Each subset is divided into the explanatory variables X, the geometries, and the target variable y. X has shape (n_samples, n_features), while the geometries and y are vectors of length n_samples. The training subset can then be further split into training and validation subsets by calling the same function on it.
See the spatial_train_test_split function in Python API Reference for Oracle Spatial AI for more information.
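To illustrate the shapes described above, the following plain-Python sketch splits three parallel sequences with the same shuffled row indices, so each row of X stays aligned with its geometry and target value. It is only an illustration under stated assumptions: aligned_split is a hypothetical helper, the geometries are stand-in WKT strings, and the order of the returned values is chosen for this sketch rather than taken from the library.

```python
import random


def aligned_split(X, geometries, y, test_size, seed=0):
    """Hypothetical helper: split parallel sequences with shared row indices."""
    n_samples = len(X)
    if not (len(geometries) == len(y) == n_samples):
        raise ValueError("X, geometries, and y must have the same number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_samples * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(values, idx):
        return [values[i] for i in idx]

    return (take(X, train_idx), take(X, test_idx),
            take(geometries, train_idx), take(geometries, test_idx),
            take(y, train_idx), take(y, test_idx))


# Toy data: 10 samples with 2 features each
X = [[i, i * 2] for i in range(10)]
geometries = [f"POINT ({i} {i})" for i in range(10)]
y = [i * 10 for i in range(10)]

X_train, X_test, geom_train, geom_test, y_train, y_test = aligned_split(
    X, geometries, y, test_size=0.2
)
print(len(X_train), len(X_train[0]))  # samples and features in the training X
print(len(geom_train), len(y_train))  # one geometry and one target per sample
```

In this sketch the training X holds 8 rows of 2 features, and the training geometries and targets each hold 8 entries.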
The following example splits the data stored in the block_groups SpatialDataFrame into training and test sets. X_train contains 90% of the original data, and X_test contains the remaining 10%. The proportion is set with the test_size parameter.
from oraclesai.preprocessing import spatial_train_test_split
# Define variables
X = block_groups_missing_pdf[["MEDIAN_INCOME", "MEAN_AGE", "HOUSE_VALUE", "INTERNET", "geometry"]]
# Print the size of the data
print(f"\n>> X (shape):\n {X.shape}")
# Split the data into training and test sets, using 10% for testing
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.1)
# Print the size of the training and test sets
print(f"\n>> X_train (shape):\n {X_train.shape}")
print(f"\n>> X_test (shape):\n {X_test.shape}")
The code prints the original size of the data and the size of the two subsets from the split. The number of features in both subsets remains the same after the split.
>> X (shape):
(3437, 5)
>> X_train (shape):
(3093, 5)
>> X_test (shape):
(344, 5)
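When the training subset is split again to obtain a validation subset, the second test_size applies to the rows that remain after the first split, not to the original data. The following sketch shows the arithmetic only; second_stage_test_size is a hypothetical helper and is not part of the library.

```python
def second_stage_test_size(validation_fraction, first_test_size):
    """Fraction of the remaining rows that yields the desired overall validation share.

    Hypothetical helper: both arguments are fractions of the ORIGINAL dataset.
    """
    remaining = 1.0 - first_test_size
    if not 0 < validation_fraction < remaining:
        raise ValueError("validation_fraction must fit in the rows left after the first split")
    return validation_fraction / remaining


# 10% of the data was held out for testing; 20% of the original is wanted for validation
val_size = second_stage_test_size(0.2, 0.1)
print(round(val_size, 4))  # 0.2222
```

The resulting value could then be passed as test_size in a second call to spatial_train_test_split on the training subset.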