Splitting Datasets


Spatial AI provides two ways to split a dataset into different subsets.

The following sections describe both supported methods for splitting a dataset:

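Using the SpatialDataFrame.split Function

Calling the split method divides a SpatialDataFrame into two or more subsets. The method takes a tuple that contains the size ratio of each subset.

The number of elements in the ratio tuple determines the number of SpatialDataFrame subsets that are returned.

For more information, see the SpatialDataFrame.split method in the Oracle Spatial AI Python API Reference.

The following example splits the given SpatialDataFrame into training, test, and validation subsets that contain 50%, 30%, and 20% of the rows of the original SpatialDataFrame, respectively.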

# Print the size of the SpatialDataFrame defined as X
print(f"\n>> X (shape):\n {X.shape}")

# Split X into smaller SpatialDataFrames
X_train, X_test, X_validation = X.split(ratio=(0.5, 0.3, 0.2))

# Print the size of the resulting datasets
print(f"\n>> X_train (shape):\n {X_train.shape}")
print(f"\n>> X_test (shape):\n {X_test.shape}")
print(f"\n>> X_validation (shape):\n {X_validation.shape}")

The output of the preceding example is as follows:

>> X (shape):
(3437, 5)

>> X_train (shape):
(1718, 5)

>> X_test (shape):
(1031, 5)

>> X_validation (shape):
(688, 5)
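
The subset sizes in this output can be reproduced with simple arithmetic. The following is a minimal sketch in plain Python, not the library's implementation. It assumes that every subset except the last receives the floor of ratio times row count and that the last subset receives the remaining rows. The helper name subset_sizes is hypothetical, and the rounding rule actually used by SpatialDataFrame.split may differ.

```python
import math


def subset_sizes(n_rows, ratio):
    """Return one row count per entry in ratio (assumed rounding rule)."""
    # Assumption: round down for every subset except the last one.
    sizes = [math.floor(n_rows * r) for r in ratio[:-1]]
    # Assumption: the last subset takes whatever rows are left over,
    # so the counts always add up to n_rows.
    sizes.append(n_rows - sum(sizes))
    return sizes


# Three ratios produce three subset sizes.
print(subset_sizes(3437, (0.5, 0.3, 0.2)))  # [1718, 1031, 688]
```

Under these assumptions the three counts add up to the 3437 rows of the original SpatialDataFrame, so no row is dropped or duplicated.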

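Using the spatial_train_test_split Function

The spatial_train_test_split function receives an instance of the SpatialDataFrame class and splits it into a training subset and a test subset.

Each subset is split into the explanatory variables X, the geometries, and the target variable y. X is a matrix of shape (n_samples, n_features), while the geometries and y are vectors of length n_samples. You can then call the same function again to further split the training subset into training and validation subsets.

For more information, see the spatial_train_test_split function in the Oracle Spatial AI Python API Reference.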
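
When the function is called a second time to carve a validation subset out of the training subset, the fraction passed to that second call applies to the training subset, not to the original data. The following plain-Python sketch shows the conversion. It assumes that test_size is interpreted as a fraction of the rows passed in, and the helper name second_split_test_size is hypothetical and not part of the library.

```python
def second_split_test_size(test_fraction, validation_fraction):
    """Fraction to request on the second call.

    Both arguments are fractions of the ORIGINAL dataset.
    """
    # Share of the original data that is left after the first split.
    remaining = 1.0 - test_fraction
    if not 0.0 < validation_fraction < remaining:
        raise ValueError("validation_fraction must be between 0 and the remaining share")
    # Rescale so the validation subset ends up with the intended
    # share of the original data.
    return validation_fraction / remaining


# Hold out 10% for testing, then 20% of the original data for validation.
print(round(second_split_test_size(0.1, 0.2), 4))  # 0.2222
```

In this example the second call would request roughly 22% of the training subset, which corresponds to 20% of the original rows.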

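The following example splits the data stored in the block_groups SpatialDataFrame into two variables. X_train contains 90% of the original data and X_test contains the remaining 10%. The proportion is specified by the test_size parameter.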

from oraclesai.preprocessing import spatial_train_test_split 
 
# Define variables
X = block_groups_missing_pdf[["MEDIAN_INCOME", "MEAN_AGE", "HOUSE_VALUE", "INTERNET", "geometry"]] 
 
# Print the size of the data
print(f"\n>> X (shape):\n {X.shape}") 
 
# Split the data into training and test sets, using 10% for testing
X_train, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.1) 
 
# Print the size of the training and test sets
print(f"\n>> X_train (shape):\n {X_train.shape}") 
print(f"\n>> X_test (shape):\n {X_test.shape}")

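The code prints the original size of the data and the sizes of the two subsets produced by the split. After the split, the number of features in both subsets remains the same.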

>> X (shape):
 (3437, 5)

>> X_train (shape):
 (3093, 5)

>> X_test (shape):
 (344, 5)
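
The row counts above are consistent with rounding the test subset up and assigning the remaining rows to the training subset. The following plain-Python sketch illustrates that arithmetic. The rounding rule is an assumption that happens to match this output, not a documented behavior of spatial_train_test_split, and the helper name train_test_sizes is hypothetical.

```python
import math


def train_test_sizes(n_rows, test_size):
    """Return (n_train, n_test) under an assumed rounding rule."""
    # Assumption: the test subset is rounded up to a whole row.
    n_test = math.ceil(n_rows * test_size)
    # Assumption: the training subset receives every remaining row.
    return n_rows - n_test, n_test


print(train_test_sizes(3437, 0.1))  # (3093, 344)
```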