8.1.5 Splitting Data
Use the split and KFold methods to sample data or to partition it randomly.
A common operation when analyzing large data sets is to randomly partition the data set into subsets for training and testing. Both methods can do this. The split method can also be used to sample the data.
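To make the idea of a random partition by ratio concrete, here is a minimal sketch in plain Python that does not use OML4Py. It assumes that each row is assigned to a subset independently, with probabilities given by the ratios, which is one way subset sizes can end up only approximately proportional to the ratios. The names random_split and seed are hypothetical, and the sketch is not a description of how the split method is implemented.

```python
import bisect
import random
from itertools import accumulate


def random_split(rows, ratio, seed=0):
    """Assign each row independently to one of len(ratio) subsets.

    Hypothetical helper for illustration only; not part of OML4Py.
    """
    rng = random.Random(seed)
    bounds = list(accumulate(ratio))        # cumulative thresholds, e.g. [0.2, 1.0]
    subsets = [[] for _ in ratio]
    for row in rows:
        u = rng.random() * bounds[-1]       # uniform draw scaled to the ratio total
        # Find the first threshold above u; clamp to guard against rounding.
        i = min(bisect.bisect_right(bounds, u), len(ratio) - 1)
        subsets[i].append(row)
    return subsets


rows = list(range(1797))                    # stand-in for 1797 row identifiers
parts = random_split(rows, (0.2, 0.8))
sizes = [len(p) for p in parts]
print(sizes)                                # roughly 20% and 80% of the rows
```

Because every row is placed by its own random draw, the subsets are disjoint and together cover all rows, but their sizes vary from run to run unless the seed is fixed.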
Example 8-8 Splitting Data into Multiple Sets
This example shows how to split data into multiple sets and into k consecutive folds. The folds can be used for k-fold cross-validation.
import oml
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
pd_digits = pd.DataFrame(digits.data,
                         columns=['IMG'+str(i) for i in
                                  range(digits['data'].shape[1])])
pd_digits = pd.concat([pd_digits,
                       pd.Series(digits.target,
                                 name = 'target')],
                      axis = 1)
oml_digits = oml.push(pd_digits)
# Sample 20% and 80% of the data.
splits = oml_digits.split(ratio=(.2, .8), use_hash = False)
[len(split) for split in splits]
# Split the data into four sets.
splits = oml_digits.split(ratio = (.25, .25, .25, .25),
                          use_hash = False)
[len(split) for split in splits]
# Perform stratification on the target column.
splits = oml_digits.split(strata_cols=['target'])
[split.shape for split in splits]
# Verify that the stratified sampling generates splits in which
# all of the different categories of digits (digits 0~9)
# are present in each split.
[split['target'].drop_duplicates().sort_values().pull()
 for split in splits]
# Hash on the target column.
splits = oml_digits.split(hash_cols=['target'])
[split.shape for split in splits]
# Verify that each category of digit (0~9) is present in only one
# of the splits generated by hashing on the category column.
[split['target'].drop_duplicates().sort_values().pull()
 for split in splits]
# Split the data randomly into 4 consecutive folds.
folds = oml_digits.KFold(n_splits=4)
[(len(fold[0]), len(fold[1])) for fold in folds]
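The KFold call above yields one pair per fold, indexed as fold[0] and fold[1]. As a rough illustration of how such pairs serve k-fold cross-validation, the following plain-Python sketch builds k (training, test) pairs in which every row appears in exactly one test portion. It assumes that the first element of each pair is the training portion and the second is the held-out portion. The function k_fold is hypothetical, and it produces evenly sized folds, which need not match what OML4Py returns.

```python
import random


def k_fold(rows, n_splits, seed=0):
    """Return n_splits (train, test) pairs covering rows.

    Hypothetical helper for illustration only; not part of OML4Py.
    """
    rng = random.Random(seed)
    shuffled = rows[:]                      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    # Deal the shuffled rows into n_splits buckets of near-equal size.
    buckets = [shuffled[i::n_splits] for i in range(n_splits)]
    folds = []
    for i, test in enumerate(buckets):
        # Training portion: every bucket except the one held out for testing.
        train = [r for j, b in enumerate(buckets) if j != i for r in b]
        folds.append((train, test))
    return folds


rows = list(range(1797))                    # stand-in for 1797 row identifiers
folds = k_fold(rows, n_splits=4)
print([(len(train), len(test)) for train, test in folds])
```

In cross-validation, a model would be fitted on each training portion and scored on the matching test portion, and the k scores would then be averaged.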
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> digits = datasets.load_digits()
>>> pd_digits = pd.DataFrame(digits.data,
...                          columns=['IMG'+str(i) for i in
...                                   range(digits['data'].shape[1])])
>>> pd_digits = pd.concat([pd_digits,
...                        pd.Series(digits.target,
...                                  name = 'target')],
...                       axis = 1)
>>> oml_digits = oml.push(pd_digits)
>>>
>>> # Sample 20% and 80% of the data.
... splits = oml_digits.split(ratio=(.2, .8), use_hash = False)
>>> [len(split) for split in splits]
[351, 1446]
>>>
>>> # Split the data into four sets.
... splits = oml_digits.split(ratio = (.25, .25, .25, .25),
...                           use_hash = False)
>>> [len(split) for split in splits]
[432, 460, 451, 454]
>>>
>>> # Perform stratification on the target column.
... splits = oml_digits.split(strata_cols=['target'])
>>> [split.shape for split in splits]
[(1285, 65), (512, 65)]
>>>
>>> # Verify that the stratified sampling generates splits in which
... # all of the different categories of digits (digits 0~9)
... # are present in each split.
... [split['target'].drop_duplicates().sort_values().pull()
...  for split in splits]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>>
>>> # Hash on the target column
... splits = oml_digits.split(hash_cols=['target'])
>>> [split.shape for split in splits]
[(899, 65), (898, 65)]
>>>
>>> # Verify that each category of digit (0~9) is present in only one
... # of the splits generated by hashing on the category column.
... [split['target'].drop_duplicates().sort_values().pull()
...  for split in splits]
[[0, 1, 3, 5, 8], [2, 4, 6, 7, 9]]
>>>
>>> # Split the data randomly into 4 consecutive folds.
... folds = oml_digits.KFold(n_splits=4)
>>> [(len(fold[0]), len(fold[1])) for fold in folds]
[(1352, 445), (1336, 461), (1379, 418), (1325, 472)]
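The listing shows two contrasting outcomes: stratifying on target leaves all ten digits in every split, while hashing on target places each digit in exactly one split. The plain-Python sketch below reproduces that contrast on toy data. It is an illustration under stated assumptions, not OML4Py's implementation: the helpers hash_split and stratified_split are hypothetical, CRC32 stands in for whatever hash function the database uses, and the (0.7, 0.3) ratio is chosen only for the example.

```python
import random
import zlib
from collections import defaultdict


def hash_split(rows, key, n_parts=2):
    """Route each row by a hash of row[key]; equal keys share a part."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))   # deterministic hash
        parts[h % n_parts].append(row)
    return parts


def stratified_split(rows, key, ratio=(0.7, 0.3), seed=0):
    """Split each key group separately so every part sees every key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    parts = [[] for _ in ratio]
    total = sum(ratio)
    for members in groups.values():
        rng.shuffle(members)
        start, cum = 0, 0.0
        for i, r in enumerate(ratio):
            cum += r
            # The last part takes the remainder so no row is dropped.
            end = len(members) if i == len(ratio) - 1 else round(len(members) * cum / total)
            parts[i].extend(members[start:end])
            start = end
    return parts


# Toy data: 200 rows, 20 rows for each of the labels 0 through 9.
rows = [{"id": i, "target": i % 10} for i in range(200)]
hashed = hash_split(rows, "target")
strat = stratified_split(rows, "target")
hashed_labels = [sorted({r["target"] for r in p}) for p in hashed]
strat_labels = [sorted({r["target"] for r in p}) for p in strat]
print(hashed_labels)   # each label appears in exactly one part
print(strat_labels)    # every label appears in every part
```

Hashing on a column is useful when all rows sharing a value must stay together, whereas stratification preserves the mix of values in each subset, which is usually what a training and test split of a classification target needs.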