Run a Python Function on Data Grouped By Column Values

10.4.4 Run a Python Function on Data Grouped By Column Values

Use the oml.group_apply function to group the values in a database table by one or more columns and then run a user-defined Python function on each group.

The oml.group_apply function runs a user-defined Python function in a Python engine spawned and managed by the database environment. The oml.group_apply function passes the oml.DataFrame specified by the data argument to the user-defined func function as its first argument. The index argument to oml.group_apply specifies the columns of the oml.DataFrame by which the database groups the data for processing by the user-defined Python function. The oml.group_apply function can use data-parallel execution, in which one or more Python engines perform the same Python function on different groups of data.

The syntax of the function is the following.

oml.group_apply(data, index, func, func_owner=None, parallel=None, orderby=None, graphics=False, **kwargs)

The data argument is an oml.DataFrame that contains the in-database data that the func function operates on.

The index argument is an oml.DataFrame object, the columns of which are used to group the data before sending it to the func function.

The func argument is the function to run. It may be one of the following:

A Python function
A string that is the name of a user-defined Python function in the OML4Py script repository
A string that defines a Python function
An oml.script.script.Callable object returned by the oml.script.load function

The optional func_owner argument is a string or None (the default) that specifies the owner of the registered user-defined Python function when argument func is a registered user-defined Python function name.

The parallel argument is a boolean, an int, or None (the default) that specifies the preferred degree of parallelism to use in the Embedded Python Execution job. The value may be one of the following:

A positive integer greater than or equal to 1 for a specific degree of parallelism
False, None, or 0 for no parallelism
True for the default data parallelism

The optional orderby argument is an oml.DataFrame, oml.Float, or oml.String that specifies the ordering of the group partitions.

The graphics argument is a boolean that specifies whether to look for images. The default value is False.

With the **kwargs parameter, you can pass additional arguments to the func function. Special control arguments, which start with oml_, are not passed to the function specified by func, but instead control what happens before or after the running of the function.

The oml.group_apply function returns a dict of Python objects or a dict of oml.embed.data_image._DataImage objects. If no image is rendered in the user-defined Python function, oml.group_apply returns a dict of Python object returned by the function. Otherwise, it returns a dict of oml.embed.data_image._DataImage objects.

See Also: About Output

Example 10-8 Using the oml.group_apply Function

This example defines some functions and calls oml.group_apply for each function.

import pandas as pd
from sklearn import datasets 
import oml

# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()

x = pd.DataFrame(iris.data, 
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x: 
                           {0: 'setosa', 1: 'versicolor', 
                            2:'virginica'}[x], iris.target)), 
                 columns = ['Species'])

# Drop the IRIS database table if it exists.
try:
    oml.drop('IRIS')
except: 
    pass

# Create the IRIS database table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Define a function that counts the number of rows and returns a
# dataframe with the species and the count.
def group_count(dat):
    import pandas as pd
    return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
                        columns = ["Species", "COUNT"])

# Select the Species column to use as the index argument.
index = oml.DataFrame(oml_iris['Species'])

# Group the data by the Species column and run the user-defined 
# function for each species.
res = oml.group_apply(oml_iris, index, func=group_count,
                      oml_input_type="pandas.DataFrame")
res

# Define a function that builds a linear regression model, with  
# Petal_Width as the feature and Petal_Length as the target value, 
# and that returns the model after fitting the values.
def build_lm(dat):
    from sklearn import linear_model
    lm = linear_model.LinearRegression()
    X = dat[["Petal_Width"]]
    y = dat[["Petal_Length"]]
    lm.fit(X, y)
    return lm

# Run the model for each species and return an objectList in
# dict format with a model for each species.
mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
                                  "Species"]], index, func=build_lm)

# The output is a dict of key-value pairs for each species and model.
type(mod)

# Sort dict by the key species.
{k: mod[k] for k in sorted(mod.keys())}

Listing for This Example

>>> import pandas as pd
>>> from sklearn import datasets
>>> import oml
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>>
>>> x = pd.DataFrame(iris.data, 
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: 
...                            {0: 'setosa', 1: 'versicolor', 
...                             2:'virginica'}[x], iris.target)), 
...                  columns = ['Species'])
>>>
>>> # Drop the IRIS database table if it exists.
... try:
...     oml.drop('IRIS')
... except: 
...     pass
>>>
>>> # Create the IRIS database table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Define a function that counts the number of rows and returns a
... # dataframe with the species and the count.
... def group_count(dat):
...     import pandas as pd
...     return pd.DataFrame([(dat["Species"][0], dat.shape[0])],\
...                         columns = ["Species", "COUNT"])
...
>>> # Select the Species column to use as the index argument.
... index = oml.DataFrame(oml_iris['Species'])
>>>
>>> # Group the data by the Species column and run the user-defined 
... # function for each species.
... res = oml.group_apply(oml_iris, index, func=group_count,
...                       oml_input_type="pandas.DataFrame")
>>> res
{'setosa':   Species  COUNT 
0  setosa     50, 'versicolor':       Species  COUNT 
0  versicolor     50, 'virginica':      Species  COUNT 
0  virginica     50}
>>>
>>> # Define a function that builds a linear regression model, with 
... # Petal_Width  as the feature and Petal_Length as the target value, 
... # and that returns the model after fitting the values.
... def build_lm(dat):
...     from sklearn import linear_model
...     lm = linear_model.LinearRegression()
...     X = dat[["Petal_Width"]]
...     y = dat[["Petal_Length"]]
...     lm.fit(X, y)
...     return lm
...
>>> # Run the model for each species and return an objectList in
... # dict format with a model for each species.
... mod = oml.group_apply(oml_iris[:,["Petal_Length", "Petal_Width",
...                                   "Species"]], index, func=build_lm)
>>>
>>> # The output is a dict of key-value pairs for each species and model.
... type(mod)
<class 'dict'>
>>>
>>> # Sort dict by the key species.
... {k: mod[k] for k in sorted(mod.keys())}
{'setosa': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,normalize=False), 'versicolor': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), 'virginica': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)}

Parent topic: Python API for Embedded Python Execution