PGX 20.1.1
Documentation

PgxFrame (Tabular Data-Structure)

PgxFrame is in beta stage

Please note that the PgxFrames are released as beta, and the APIs may change in future releases, without any further notice.

Overview

PgxFrame is a data-structure to load/store and manipulate tabular data. It contains rows and columns. A PgxFrame can contain multiple columns where each column consist of elements of the same data type, and has a name. The list of the columns with their names and data types defines the schema of the frame. (The number of rows in the PgxFrame is not part of the schema of the frame.)

PgxFrame provides some operations that also output PgxFrames (described later in the tutorial). Those operations can be performed in-place (meaning that the frame is mutated during the operation) in order to save memory. In place operations should be used whenever possible. However, we provide out-place variants, i.e., a new frame is created during the operation. For all the following operations, we mention the respective out-place operations:

In-place operations Out-place operations
headInPlace head
tailInPlace tail
flattenAllInPlace flattenAll
renameColumnInPlace renameColumn
renameColumnsInPlace renameColumns
selectInPlace select

Functionalities

We show here the current functionalities of PgxFrames using some toy examples.

Loading a PgxFrame (with Multiple Data Types) from Some Specified Path

First, create a session:

cd $PGX_HOME
./bin/pgx-jshell
// starting the shell will create an implicit session

pgx> import static oracle.pgx.api.beta.frames.functions.ColumnRenaming.renaming
pgx> import static oracle.pgx.api.beta.frames.schema.ColumnDescriptor.columnDescriptor
pgx> import oracle.pgx.api.beta.frames.schema.*
pgx> import oracle.pgx.api.beta.frames.schema.datatypes.*
import oracle.pgx.api.*;
import oracle.pgx.api.beta.frames.*;
import oracle.pgx.api.beta.frames.functions.*;
import oracle.pgx.api.beta.frames.schema.*;
import oracle.pgx.api.beta.frames.schema.datatypes.*;

import static oracle.pgx.api.beta.frames.functions.ColumnRenaming.renaming;
import static oracle.pgx.api.beta.frames.schema.ColumnDescriptor.columnDescriptor;
...

PgxSession session = Pgx.createSession("my-session");

We use the following sample data (in CSV format, with a space separator instead of comma) in the next examples of our tutorial:

"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"
"Albert" 23 5813000.5 false 12.0 124343142 "2000-01-14"
"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"
"Emily" 24 9380080.5 false 13.0 128973221 "1910-07-30"
"""D'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"

A frame schema is necessary to load a PgxFrame. An example frame schema with various data types can be defined as follows:

pgx> var exampleFrameSchema = List.of(
  columnDescriptor("name", DataTypes.STRING_TYPE),
  columnDescriptor("age", DataTypes.INTEGER_TYPE),
  columnDescriptor("salary", DataTypes.DOUBLE_TYPE),
  columnDescriptor("married", DataTypes.BOOLEAN_TYPE),
  columnDescriptor("tax_rate", DataTypes.FLOAT_TYPE),
  columnDescriptor("random", DataTypes.LONG_TYPE),
  columnDescriptor("date_of_birth", DataTypes.LOCAL_DATE_TYPE)
).toArray(new ColumnDescriptor[0])
ColumnDescriptor[] exampleFrameSchema = {
    columnDescriptor("name", DataTypes.STRING_TYPE),
    columnDescriptor("age", DataTypes.INTEGER_TYPE),
    columnDescriptor("salary", DataTypes.DOUBLE_TYPE),
    columnDescriptor("married", DataTypes.BOOLEAN_TYPE),
    columnDescriptor("tax_rate", DataTypes.FLOAT_TYPE),
    columnDescriptor("random", DataTypes.LONG_TYPE),
    columnDescriptor("date_of_birth", DataTypes.LOCAL_DATE_TYPE)
};

Loading the CSV file with the above-mentioned schema can be performed as follows:

pgx> var exampleFrame = session.readFrame().
  name("simple frame").
  columns(exampleFrameSchema).
  csv().
  separator(' ').
  load("<path>/simple_frame.csv")
exampleFrame = session.readFrame()
    .name("simple frame")
    .columns(exampleFrameSchema)
    .csv()
    .separator(' ')
    .load("<path>/simple_frame.csv");

Printing the Content of a PgxFrame

Now, we can also observe the frame contents using print functionality as follows:

pgx> exampleFrame.print()
exampleFrame.print();

The output looks like:

+-------------------------------------------------------------------------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+-------------------------------------------------------------------------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
+-------------------------------------------------------------------------------+

Destroying a PgxFrame

As PgxFrames can take a lot of memory on the PGX server if they have a lot of rows or columns, it may be necessary to close them with the close() operation. After this operation, the content of the PgxFrame is not available anymore.

pgx> exampleFrame.close()
exampleFrame.close();

For the rest of this tutorial, we reload the PgxFrame, as specified in the previous sub-section.

Storing a PgxFrame to Some Specified Path

We can store the PgxFrame in CSV format as follows:

pgx> exampleFrame.write().
    overwrite(true).
    csv("<path>/stored_simple_frame.csv")
exampleFrame.write()
    .overwrite(true)
    .csv("<path>/stored_simple_frame.csv");

We can also store PgxFrames in PGB binary format using a pgb storer instead of the csv storer in the above-mentioned example.

pgx> exampleFrame.write().
    overwrite(true).
    pgb("<path>/stored_simple_frame.pgb")
exampleFrame.write()
    .overwrite(true)
    .pgb("<path>/stored_simple_frame.pgb");

Loading and Storing Vector Properties

With PgxFrames, we can load/store vector properties which are fundamental for PgxML functionality in PGX. We use the following sample CSV (space separated) data consisting of a 3-dimensional and 2-dimensional vector properties along with two integer properties and a string property.

0 2 0.1;0.2;0.3 "testProp0" 0.1;0.2
1 1 0.1;0.2;0.3 "testProp10" 0.1;0.2
1 2 0.1;0.2;0.3 "testProp20" 0.1;0.2
2 3 0.1;0.2;0.3 "testProp30" 0.1;0.2
3 1 0.1;0.2;0.3 "testProp40" 0.1;0.2

We load this PgxFrame with vector properties as follows. First, we specify the PgxFrame schema:

pgx> var vecFrameSchema = List.of(
  columnDescriptor("intProp", DataTypes.INTEGER_TYPE),
  columnDescriptor("intProp2", DataTypes.INTEGER_TYPE),
  columnDescriptor("vectProp", DataTypes.vector(DataTypes.FLOAT_TYPE, 3)),
  columnDescriptor("stringProp", DataTypes.STRING_TYPE),
  columnDescriptor("vectProp2", DataTypes.vector(DataTypes.FLOAT_TYPE, 2))
).toArray(new ColumnDescriptor[0])
ColumnDescriptor[] vecFrameSchema = {
    columnDescriptor("intProp", DataTypes.INTEGER_TYPE),
    columnDescriptor("intProp2", DataTypes.INTEGER_TYPE),
    columnDescriptor("vectProp", DataTypes.vector(DataTypes.FLOAT_TYPE, 3)),
    columnDescriptor("stringProp", DataTypes.STRING_TYPE),
    columnDescriptor("vectProp2", DataTypes.vector(DataTypes.FLOAT_TYPE, 2))
};

Then, we load the PgxFrame with the given schema from the specified path:

pgx> var vecFrame = session.readFrame().
    name("vector PgxFrame").
    columns(vecFrameSchema).
    csv().
    separator(' ').
    load("<path>/vec_rows.csv");
PgxFrame vecFrame = session.readFrame()
    .named("vector PgxFrame")
    .withColumns(vecFrameSchema)
    .csv()
    .separator(' ')
    .load("<path>/vec_rows.csv");

Lets see how the PgxFrame looks using the print() functionality again.

+-----------------------------------------------------------+
| intProp | intProp2 | vectProp    | stringProp | vectProp2 |
+-----------------------------------------------------------+
| 0       | 2        | 0.1;0.2;0.3 | testProp0  | 0.1;0.2   |
| 1       | 1        | 0.1;0.2;0.3 | testProp10 | 0.1;0.2   |
| 1       | 2        | 0.1;0.2;0.3 | testProp20 | 0.1;0.2   |
| 2       | 3        | 0.1;0.2;0.3 | testProp30 | 0.1;0.2   |
| 3       | 1        | 0.1;0.2;0.3 | testProp40 | 0.1;0.2   |
+-----------------------------------------------------------+

Flattening Vector Properties

It might be useful in some use-cases to split the vector properties into multiple columns. We support this functionality using our flattenAll() operation. If we flatten the above PgxFrame, we get the following flattened PgxFrame:

+----------------------------------------------------------------------------------------------------+
| intProp | intProp2 | vectProp_0 | vectProp_1 | vectProp_2 | stringProp | vectProp2_0 | vectProp2_1 |
+----------------------------------------------------------------------------------------------------+
| 0       | 2        | 0.1        | 0.2        | 0.3        | testProp0  | 0.1         | 0.2         |
| 1       | 1        | 0.1        | 0.2        | 0.3        | testProp10 | 0.1         | 0.2         |
| 1       | 2        | 0.1        | 0.2        | 0.3        | testProp20 | 0.1         | 0.2         |
| 2       | 3        | 0.1        | 0.2        | 0.3        | testProp30 | 0.1         | 0.2         |
| 3       | 1        | 0.1        | 0.2        | 0.3        | testProp40 | 0.1         | 0.2         |
+----------------------------------------------------------------------------------------------------+

One use-case of this flattening is in our MLlib where we export the embeddings using this flattening operation as classical features in a CSV file that can be easily used for post-processing in PGX or other frameworks.

Union of PGX Frames

If we have two PgxFrames that have compatible columns (i.e. same type and order) we are able to union them.

Let's say we have another frame secondExampleFrame, besides the exampleFrame described above, with the following content.

pgx> secondExampleFrame = session.readFrame()
    .name("another simple frame")
    .columns(exampleFrameSchema)
    .csv()
    .separator(' ' as char)
    .load("<path>/more_frame.csv")
pgx> secondExampleFrame.print()
PgxFrame secondExampleFrame = session.readFrame()
    .name("another simple frame")
    .columns(exampleFrameSchema)
    .csv()
    .separator(' ' as char)
    .load("<path>/more_frame.csv");
exampleFrame.print();
+-------------------------------------------------------------------------+
| name | age | salary    | married | tax_rate | random    | date_of_birth |
+-------------------------------------------------------------------------+
| Mary | 25  | 6821092.0 | false   | 11.0     | 88231223  | 1995-12-23    |
| Anca | 23  | 5813000.5 | false   | 12.0     | 124343142 | 2000-01-14    |
+-------------------------------------------------------------------------+

Now, if we want to create the union of exampleFrame with the secondExampleFrame, we only need to execute the following:

pgx> exampleFrame.union(secondExampleFrame).print()
exampleFrame.union(secondExampleFrame).print();
+-------------------------------------------------------------------------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+-------------------------------------------------------------------------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
| Mary     | 25  | 6821092.0   | false   | 11.0     | 88231223  | 1995-12-23    |
| Anca     | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
+-------------------------------------------------------------------------------+

We can observe that the rows of the resulting PgxFrame are the union of the rows from the two original frames. One thing to note here is that the union operation will not remove duplicate rows resulted from the union operation.

Joining PGX Frames

It might happen that we have two frames whose rows are correlated through one of the columns. This is the case of many machine learning problems where we have to join embeddings coming from different sources. For this, we have the join functionality that allows us to combine frames by checking for equality between rows for a specific column.

Let's say we have another frame moreInfoFrame that contains additional information about the people in the exampleFrame.

pgx> moreInfoFrame.print()
moreInfoFrame.print();
+-------------------------------------------------+
| name   | title                        | reports |
+-------------------------------------------------+
| John   | Software Engineering Manager | 5       |
| Albert | Sales Manager                | 10      |
| Emily  | Operations Manager           | 20      |
+-------------------------------------------------+

Now, if we want to combine this frame with the exampleFrame on the name column, we only need to call the join method.

pgx> exampleFrame.join(moreInfoFrame, "name", "leftFrame", "rightFrame").print()
exampleFrame.join(moreInfoFrame, "name", "leftFrame", "rightFrame").print();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| leftFrame_name | leftFrame_age | leftFrame_salary | leftFrame_married | leftFrame_tax_rate | leftFrame_random | leftFrame_date_of_birth | rightFrame_name | rightFrame_title             | rightFrame_reports |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| John           | 27            | 4133300.0        | true              | 11.0               | 123456782        | 1985-10-18              | John            | Software Engineering Manager | 5                  |
| Albert         | 23            | 5813000.5        | false             | 12.0               | 124343142        | 2000-01-14              | Albert          | Sales Manager                | 10                 |
| Emily          | 24            | 9380080.5        | false             | 13.0               | 128973221        | 1910-07-30              | Emily           | Operations Manager           | 20                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We can see that the joined frame contains the columns of the two frames involved in the operation for the rows with the same name. Also note the column prefixes specified in the call, leftFrame and rightFrame.

PgxFrame Helpers

We also support operations on PgxFrame such as head, tail, select as follows.

Head Operation

The head operation can be used to only keep the first rows of a PgxFrame. (The result is deterministic only for ordered PgxFrame) Here, we apply the head operation on the PgxFrame used above and print it:

pgx> vecFrame.head(2).print()
vecFrame.head(2).print();

The output looks as follows:

+-----------------------------------------------------------+
| intProp | intProp2 | vectProp    | stringProp | vectProp2 |
+-----------------------------------------------------------+
| 0       | 2        | 0.1;0.2;0.3 | testProp0  | 0.1;0.2   |
| 1       | 1        | 0.1;0.2;0.3 | testProp10 | 0.1;0.2   |
+-----------------------------------------------------------+

Tail Operation

The tail operation can be used to only keep the last rows of a PgxFrame. (The result is deterministic only for ordered PgxFrame) Next, we apply the tail operation on the PgxFrame used above and print it:

pgx> vecFrame.tail(2).print()
vecFrame.tail(2).print();

The output looks as follows:

+-----------------------------------------------------------+
| intProp | intProp2 | vectProp    | stringProp | vectProp2 |
+-----------------------------------------------------------+
| 2       | 3        | 0.1;0.2;0.3 | testProp30 | 0.1;0.2   |
| 3       | 1        | 0.1;0.2;0.3 | testProp40 | 0.1;0.2   |
+-----------------------------------------------------------+

Select Operation

The select operation can be used to keep only a specified list of columns of an input PgxFrame. We now apply the select operation on the PgxFrame used above and print it:

pgx> var vecFrame_selected = vecFrame.select("vectProp2", "vectProp", "stringProp")
PgxFrame vecFrame_selected = vecFrame.select("vectProp2", "vectProp", "stringProp");

We take a look at how the selected PgxFrame looks like (using vecFrame_selected.print()):

+--------------------------------------+
| vectProp2 | vectProp    | stringProp |
+--------------------------------------+
| 0.1;0.2   | 0.1;0.2;0.3 | testProp0  |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp10 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp20 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp30 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp40 |
+--------------------------------------+

Rename PgxFrame Columns

We can rename the columns to customized names as follows:

pgx> var vecFrame_renamed = vecFrame.renameColumns(
  renaming("vectProp2", "vectProp2_renamed"),
  renaming("vectProp", "vectProp_renamed"),
  renaming("stringProp", "stringProp_renamed")
)
vecFrame_renamed = vecFrame.renameColumns(renaming("vectProp2", "vectProp2_renamed"),
                                            renaming("vectProp", "vectProp_renamed"),
                                            renaming("stringProp", "stringProp_renamed"));

We take a look at how the renamed PgxFrame looks like (using vecFrame_renamed.print()):

+--------------------------------------------------------------------------------+
| intProp | intProp2 | vectProp_renamed | stringProp_renamed | vectProp2_renamed |
+--------------------------------------------------------------------------------+
| 0       | 2        | 0.1;0.2;0.3      | testProp0          | 0.1;0.2           |
| 1       | 1        | 0.1;0.2;0.3      | testProp10         | 0.1;0.2           |
| 1       | 2        | 0.1;0.2;0.3      | testProp20         | 0.1;0.2           |
| 2       | 3        | 0.1;0.2;0.3      | testProp30         | 0.1;0.2           |
| 3       | 1        | 0.1;0.2;0.3      | testProp40         | 0.1;0.2           |
+--------------------------------------------------------------------------------+

PgxFrame-PgqlResultSet Conversions

We now explain the conversions between PgxFrames and PgqlResultSets.

PgxFrame to PgqlResultSet

We convert a PgxFrame to PgqlResultSet as follows:

pgx> var resultSet = exampleFrame.toPgqlResultSet()
PgqlResultSet resultSet = exampleFrame.toPgqlResultSet();

We now have a look at the content of the resultSet using resultSet.print() as follows:

+-------------------------------------------------------------------------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+-------------------------------------------------------------------------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
+-------------------------------------------------------------------------------+

The content of the result set can be accessed through the usual PgqlResultSet APIs.

PgqlResultSet to PgxFrame

We convert a PgqlResultSet to PgxFrame as follows:

pgx> var query = ...;
pgx> var graph = ...;
pgx> var resultSet = graph.queryPgql(query);
pgx> resultSet.toFrame();
String query = ...;
PgxGraph graph = ...;
PgqlResultSet resultSet = graph.queryPgql(query);
resultSet.toFrame();