Run a User-Defined Python Function on Sets of Rows

11.5.5 Run a User-Defined Python Function on Sets of Rows

Use the oml.row_apply function to chunk data into sets of rows and then run a user-defined Python function on each chunk.

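The oml.row_apply function passes the oml.DataFrame specified by the data argument to the user-defined Python function func as its first argument. The rows argument specifies the maximum number of rows of the oml.DataFrame to assign to each chunk. The last chunk may contain fewer rows than the number specified.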
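
The chunking behavior described above can be pictured locally with plain pandas. This is only a sketch under stated assumptions: local_row_apply and chunk_size are hypothetical helpers, not part of OML4Py, and the code runs in the client process rather than in database-spawned Python engines.

```python
import pandas as pd


def local_row_apply(data, func, rows=1, **kwargs):
    """Hypothetical local stand-in for the chunking done by oml.row_apply."""
    results = []
    for start in range(0, len(data), rows):
        # Each chunk holds at most `rows` rows; the last one may hold fewer.
        chunk = data.iloc[start:start + rows].reset_index(drop=True)
        # The chunk is passed to func as its first argument.
        results.append(func(chunk, **kwargs))
    # Combine the per-chunk results into a single DataFrame.
    return pd.concat(results, ignore_index=True)


def chunk_size(dat):
    # Report how many rows this chunk received.
    return pd.DataFrame({"n_rows": [len(dat)]})


df = pd.DataFrame({"value": range(10)})
sizes = local_row_apply(df, chunk_size, rows=4)
print(sizes["n_rows"].tolist())  # prints [4, 4, 2]
```

With 10 input rows and rows=4, the helper produces two full chunks and a final chunk of 2 rows.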

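The oml.row_apply function runs the Python function in a database-spawned Python engine. The function can use data-parallel execution, in which one or more Python engines run the same Python function on different chunks of the data.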

The syntax of the function is the following:

oml.row_apply(data, func, func_owner=None, rows=1, parallel=None, graphics=False, **kwargs)

The data argument is an oml.DataFrame that contains the data that the func function operates on.

The func argument is the function to run. It may be one of the following:

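A Python function
A string that is the name of a user-defined Python function in the OML4Py script repository
A string that defines a Python function
An oml.script.script.Callable object returned by the oml.script.load function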

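The optional func_owner argument is a string or None (the default). When the func argument is the name of a registered user-defined Python function, func_owner specifies the owner of that function.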

The rows argument is an int that specifies the maximum number of rows to include in each chunk.

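The parallel argument is a boolean, an int, or None (the default) that specifies the preferred degree of parallelism to use in the Embedded Python Execution job. The value may be one of the following: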

A positive integer greater than or equal to 1 for a specific degree of parallelism
False, None, or 0 for no parallelism
True for the default data parallelism
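
The accepted values can be summarized in a small sketch. The describe_parallel helper below is hypothetical and only mirrors the list above; how OML4Py itself treats values outside that list is not stated here, so raising ValueError is an assumption.

```python
def describe_parallel(parallel):
    """Hypothetical helper: describe how a `parallel` value is interpreted."""
    # Test None and the booleans first, because bool is a subclass of int
    # in Python and True would otherwise be treated as the integer 1.
    if parallel is None or parallel is False:
        return "no parallelism"
    if parallel is True:
        return "default data parallelism"
    if isinstance(parallel, int):
        if parallel == 0:
            return "no parallelism"
        if parallel >= 1:
            return f"degree of parallelism {parallel}"
    # Assumption: anything else (negative ints, other types) is rejected.
    raise ValueError(f"unsupported parallel value: {parallel!r}")


for value in (None, False, 0, True, 1, 4):
    print(repr(value), "->", describe_parallel(value))
```

Note that True and 1 are distinct settings: True requests the default data parallelism, while 1 requests a specific degree of 1.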

The graphics argument is a boolean that specifies whether to look for images. The default value is False.

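With the **kwargs parameter, you can pass additional arguments to the func function. Special control arguments, which start with oml_, are not passed to the function specified by func. Instead, they control what happens before or after the function runs.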
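
The routing of keyword arguments can be illustrated with a short sketch. The split_control_args helper is hypothetical and is not how OML4Py is implemented; the argument name oml_na_omit serves only as an example of a name with the oml_ prefix.

```python
def split_control_args(kwargs, prefix="oml_"):
    """Hypothetical helper: separate control arguments from func arguments."""
    # Names starting with the prefix are treated as control arguments.
    control = {k: v for k, v in kwargs.items() if k.startswith(prefix)}
    # Everything else would be forwarded to the user-defined function.
    passed = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return control, passed


control, passed = split_control_args({"regr": "model", "oml_na_omit": True})
print(control)  # prints {'oml_na_omit': True}
print(passed)   # prints {'regr': 'model'}
```

In the example later in this section, regr=regr is an ordinary keyword argument, so it reaches make_pred as its regr parameter.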

The oml.row_apply function returns a pandas.DataFrame or a list of oml.embed.data_image._DataImage objects. If the user-defined Python function renders no images, oml.row_apply returns a pandas.DataFrame. Otherwise, it returns a list of oml.embed.data_image._DataImage objects.

Example 11-9 Using the oml.row_apply Function

This example creates the x and y variables from the iris data set. It then creates the persistent database table IRIS and the oml.DataFrame object oml_iris as a proxy for the table.

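The example builds a regression model based on the iris data. It defines a function that predicts the Petal_Width values from the Sepal_Length, Sepal_Width, and Petal_Length columns of the input data. The function then concatenates the Species column, the Petal_Width column, and the predicted Petal_Width as the object to return. Finally, the example calls the oml.row_apply function to apply the make_pred() function to each 4-row chunk of the input data.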

import oml
import pandas as pd
from sklearn import datasets
from sklearn import linear_model

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, 
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: 
                           {0: 'setosa', 1: 'versicolor', 
                            2:'virginica'}[x], iris.target)), 
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except: 
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Build a regression model to predict Petal_Width using in-memory 
# data.
iris = oml_iris.pull()
regr = linear_model.LinearRegression()
regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
         iris[['Petal_Width']])
regr.coef_

# Define a Python function.
def make_pred(dat, regr):
    import pandas as pd
    import numpy as np
    pred = regr.predict(dat[['Sepal_Length', 
                             'Sepal_Width',
                             'Petal_Length']])
    return pd.concat([dat[['Species', 'Petal_Width']], 
                     pd.DataFrame(pred, 
                                  columns=['Pred_Petal_Width'])], 
                                  axis=1)

input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
input_data.crosstab(index = 'Species').sort_values('Species')

res = oml.row_apply(input_data, rows=4, func=make_pred, 
                    regr=regr, parallel=2)
type(res)
res

Listing for This Example

>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn import linear_model
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
...     oml.drop('IRIS')
... except: 
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
>>> oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Build a regression model to predict Petal_Width using in-memory
... # data.
... iris = oml_iris.pull()
>>> regr = linear_model.LinearRegression()
>>> regr.fit(iris[['Sepal_Length', 'Sepal_Width', 'Petal_Length']],
...          iris[['Petal_Width']])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
>>> regr.coef_
array([[-0.20726607,  0.22282854,  0.52408311]])
>>> 
>>> # Define a Python function.
... def make_pred(dat, regr):
...     import pandas as pd
...     import numpy as np
...     pred = regr.predict(dat[['Sepal_Length', 
...                              'Sepal_Width',
...                              'Petal_Length']])
...     return pd.concat([dat[['Species', 'Petal_Width']], 
...                      pd.DataFrame(pred, 
...                                   columns=['Pred_Petal_Width'])], 
...                                   axis=1)
>>>
>>> input_data = oml_iris.split(ratio=(0.9, 0.1), strata_cols='Species')[1]
>>> input_data.crosstab(index = 'Species').sort_values('Species')
      SPECIES  count
0      setosa      7
1  versicolor      8
2   virginica      4
>>> res = oml.row_apply(input_data, rows=4, func=make_pred,
...                     regr=regr, parallel=2)
>>> type(res)
<class 'pandas.core.frame.DataFrame'>
>>> res
       Species  Petal_Width  Pred_Petal_Width
0       setosa          0.4          0.344846
1       setosa          0.3          0.335509
2       setosa          0.2          0.294117
3       setosa          0.2          0.220982
4       setosa          0.2          0.080937
5   versicolor          1.5          1.504615
6   versicolor          1.3          1.560570
7   versicolor          1.0          1.008352
8   versicolor          1.3          1.131905
9   versicolor          1.3          1.215622
10  versicolor          1.3          1.272388
11   virginica          1.8          1.623561
12   virginica          1.8          1.878132
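
The make_pred() function concatenates two frames with axis=1, and pandas aligns such a concatenation on index labels. Whether the chunks passed by oml.row_apply always carry a zero-based index is not stated here, so the following sketch shows a defensive variant that reuses the chunk's own index. The chunk contents and prediction values are made up for illustration.

```python
import pandas as pd

# A chunk whose index does not start at 0 (hypothetical situation).
dat = pd.DataFrame({"Species": ["setosa", "setosa"],
                    "Petal_Width": [0.2, 0.4]},
                   index=[7, 8])

# Stand-in for the output of regr.predict (made-up values).
pred = [[0.25], [0.35]]

# A default 0..1 index does not match 7..8, so the rows are not paired up.
misaligned = pd.concat(
    [dat, pd.DataFrame(pred, columns=["Pred_Petal_Width"])], axis=1)

# Reusing the chunk's index keeps each prediction on its own row.
aligned = pd.concat(
    [dat, pd.DataFrame(pred, columns=["Pred_Petal_Width"], index=dat.index)],
    axis=1)

print(misaligned.shape)  # prints (4, 3)
print(aligned.shape)     # prints (2, 3)
```

The misaligned result contains four rows padded with missing values, while the aligned result keeps the two original rows.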
