10.4.4 Run a Python Function on Data Grouped By Column Values

Use the oml.group_apply function to group the values in a database table by one or more columns and then run a user-defined Python function on each group.

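The oml.group_apply function runs a user-defined Python function in a Python engine spawned and managed by the database environment. It passes the oml.DataFrame specified by the data argument to the user-defined func function as its first argument. The index argument specifies the columns of the oml.DataFrame by which the database groups the data before the user-defined Python function processes it. The oml.group_apply function can use data-parallel execution, in which one or more Python engines run the same Python function on different groups of data.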
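
The grouping behavior described above can be approximated locally with pandas. The following is a minimal sketch, not the OML4Py implementation: local_group_apply is a hypothetical helper that runs in the client Python session rather than in database-spawned Python engines, it groups by a single column name, and it assumes that the result is a dict keyed by group value, as in the example output later in this section.

```python
import pandas as pd


def local_group_apply(data, index_col, func, **kwargs):
    """Hypothetical local stand-in for oml.group_apply (illustration only).

    Splits `data` by the column named `index_col`, calls `func` once per
    group with the group's rows as the first argument, and collects the
    results in a dict keyed by group value.
    """
    results = {}
    for key, group in data.groupby(index_col):
        # Give each group a fresh 0-based index before passing it to func.
        results[key] = func(group.reset_index(drop=True), **kwargs)
    return results


def group_count(dat):
    # Return a one-row DataFrame holding the group label and its row count.
    return pd.DataFrame([(dat["Species"][0], dat.shape[0])],
                        columns=["Species", "COUNT"])


# Small made-up data set; the column names follow this section's example.
df = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica"],
    "Petal_Width": [0.2, 0.4, 2.5],
})
counts = local_group_apply(df, "Species", group_count)
```

Here counts holds one small DataFrame per species. The sketch processes the groups one after another, whereas oml.group_apply may distribute them across several Python engines.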

The syntax of the function is the following:

oml.group_apply(data, index, func, func_owner=None, parallel=None, orderby=None, graphics=False, **kwargs)

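The data argument is an oml.DataFrame that contains the in-database data that the func function operates on.

The index argument is an oml.DataFrame object whose columns are used to group the data before it is sent to the func function.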

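The func argument is the function to run. It may be one of the following:

A Python function
A string that is the name of a user-defined Python function in the OML4Py script repository
A string that defines a Python function
An oml.script.script.Callable object returned by the oml.script.load function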
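
As an illustration of the third form, the following sketch holds a function definition in a string. It reuses the group_count function from the example later in this section. The exec call is only a local sanity check that the string is valid Python and defines a single callable; it says nothing about how OML4Py itself evaluates such a string.

```python
# A string that defines a Python function, one of the forms that the
# func argument is described as accepting.
group_count_src = """def group_count(dat):
    import pandas as pd
    return pd.DataFrame([(dat["Species"][0], dat.shape[0])],
                        columns=["Species", "COUNT"])
"""

# Local sanity check only: execute the definition in a scratch namespace
# and list the callables that it defines.
namespace = {}
exec(group_count_src, namespace)
defined = [name for name, obj in namespace.items()
           if callable(obj) and not name.startswith("__")]
```

With a database connection, such a string would be supplied as the func argument in place of the function object; that call is not run here.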

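The optional func_owner argument is a string or None (the default). When the func argument is the name of a registered user-defined Python function, func_owner specifies the owner of that function.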

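The parallel argument is a boolean, an int, or None (the default) that specifies the preferred degree of parallelism to use in the Embedded Python Execution job. The value may be one of the following:

A positive integer greater than or equal to 1 for a specific degree of parallelism
False, None, or 0 for no parallelism
True for the default data parallelism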
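
The accepted values can be summarized in a small classifier. This is a sketch only: describe_parallel is a hypothetical helper, not part of OML4Py, and raising ValueError for any other value is an assumption of the sketch, since this section does not say how oml.group_apply treats invalid input.

```python
def describe_parallel(parallel=None):
    """Hypothetical helper: classify a value for the parallel argument."""
    # bool is a subclass of int in Python, so handle True and False
    # explicitly before the general int check.
    if parallel is None or parallel is False:
        return "no parallelism"
    if parallel is True:
        return "default data parallelism"
    if isinstance(parallel, int):
        if parallel == 0:
            return "no parallelism"
        if parallel >= 1:
            return f"degree of parallelism {parallel}"
    # Assumption of this sketch: anything else is rejected.
    raise ValueError(f"unsupported parallel value: {parallel!r}")
```

The order of the checks matters because True == 1 in Python; testing for bool first keeps True (default data parallelism) distinct from the integer 1 (a degree of parallelism of exactly 1).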

The optional orderby argument is an oml.DataFrame, oml.Float, or oml.String that specifies the ordering of the group partitions.

The graphics argument is a boolean that specifies whether to look for images. The default value is False.

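With the **kwargs parameter, you can pass additional arguments to the func function. Special control arguments, which start with oml_, are not passed to the function specified by func. Instead, they control what happens before or after the function runs.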

See Also: About Special Control Arguments
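
The split between the two kinds of keyword arguments can be pictured as a filter on the oml_ prefix. The sketch below is an illustration of that rule, not OML4Py code: split_kwargs is a hypothetical helper and min_rows is a made-up argument name standing in for any argument intended for func.

```python
def split_kwargs(**kwargs):
    """Hypothetical helper: separate oml_ control arguments from the rest."""
    # Control arguments start with oml_ and are not forwarded to func.
    control = {k: v for k, v in kwargs.items() if k.startswith("oml_")}
    # Everything else is what func would receive as keyword arguments.
    passed_to_func = {k: v for k, v in kwargs.items()
                      if not k.startswith("oml_")}
    return control, passed_to_func


control, passed_to_func = split_kwargs(oml_input_type="pandas.DataFrame",
                                       min_rows=10)
```

In this call, oml_input_type lands in control and min_rows lands in passed_to_func.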

The oml.group_apply function returns a dict of Python objects or a dict of oml.embed.data_image._DataImage objects. If the user-defined Python function renders no image, oml.group_apply returns a dict of the Python objects returned by the function. Otherwise, it returns a dict of oml.embed.data_image._DataImage objects.

See Also: About Output
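
Because the result is an ordinary dict, it can be post-processed with standard Python and pandas. The following sketch assumes that func returned one pandas.DataFrame per group; res is a hand-built stand-in whose values mirror the output shown in this section's example, so no database connection is needed to follow it.

```python
import pandas as pd

# Stand-in for the dict that oml.group_apply returns when func returns
# a pandas.DataFrame for each group.
res = {
    "setosa": pd.DataFrame({"Species": ["setosa"], "COUNT": [50]}),
    "versicolor": pd.DataFrame({"Species": ["versicolor"], "COUNT": [50]}),
    "virginica": pd.DataFrame({"Species": ["virginica"], "COUNT": [50]}),
}

# Stack the per-group frames, in sorted key order, into a single table.
combined = pd.concat([res[k] for k in sorted(res)], ignore_index=True)
```

Sorting the keys makes the row order of combined deterministic without relying on the order in which the groups were inserted into the dict.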

Example 10-8 Using the oml.group_apply Function

This example defines some functions and calls oml.group_apply for each function.

import pandas as pd
from sklearn import datasets 
import oml

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()

x = pd.DataFrame(iris.data, 
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: 
                           {0: 'setosa', 1: 'versicolor', 
                            2:'virginica'}[x], iris.target)), 
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except Exception:
    pass

# Create the IRIS database table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Define a function that counts the number of rows and returns a
# dataframe with the species and the count.
def group_count(dat):
    import pandas as pd
    return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
                        columns = ["Species", "COUNT"])

# Select the Species column to use as the index argument.
index = oml.DataFrame(oml_iris['Species'])

# Group the data by the Species column and run the user-defined 
# function for each species.
res = oml.group_apply(oml_iris, index, func=group_count,
                      oml_input_type="pandas.DataFrame")
res

# Define a function that builds a linear regression model, with  
# Petal_Width as the feature and Petal_Length as the target value, 
# and that returns the model after fitting the values.
def build_lm(dat):
    from sklearn import linear_model
    lm = linear_model.LinearRegression()
    X = dat[["Petal_Width"]]
    y = dat[["Petal_Length"]]
    lm.fit(X, y)
    return lm

# Run the model for each species and return an objectList in
# dict format with a model for each species.
mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
                                  "Species"]], index, func=build_lm)

# The output is a dict of key-value pairs for each species and model.
type(mod)

# Sort dict by the key species.
{k: mod[k] for k in sorted(mod.keys())}

Listing for This Example

>>> import pandas as pd
>>> from sklearn import datasets
>>> import oml
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>>
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
...     oml.drop('IRIS')
... except Exception:
...     pass
>>>
>>> # Create the IRIS database table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Define a function that counts the number of rows and returns a
... # dataframe with the species and the count.
... def group_count(dat):
...     import pandas as pd
...     return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
...                         columns = ["Species", "COUNT"])
...
>>> # Select the Species column to use as the index argument.
... index = oml.DataFrame(oml_iris['Species'])
>>>
>>> # Group the data by the Species column and run the user-defined 
... # function for each species.
... res = oml.group_apply(oml_iris, index, func=group_count,
...                       oml_input_type="pandas.DataFrame")
>>> res
{'setosa':   Species  COUNT 
0  setosa     50, 'versicolor':       Species  COUNT 
0  versicolor     50, 'virginica':      Species  COUNT 
0  virginica     50}
>>>
>>> # Define a function that builds a linear regression model, with 
... # Petal_Width  as the feature and Petal_Length as the target value, 
... # and that returns the model after fitting the values.
... def build_lm(dat):
...     from sklearn import linear_model
...     lm = linear_model.LinearRegression()
...     X = dat[["Petal_Width"]]
...     y = dat[["Petal_Length"]]
...     lm.fit(X, y)
...     return lm
...
>>> # Run the model for each species and return an objectList in
... # dict format with a model for each species.
... mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
...                                   "Species"]], index, func=build_lm)
>>>
>>> # The output is a dict of key-value pairs for each species and model.
... type(mod)
<class 'dict'>
>>>
>>> # Sort dict by the key species.
... {k: mod[k] for k in sorted(mod.keys())}
{'setosa': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), 'versicolor': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), 'virginica': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)}
