PgxFrame (Tabular Data-Structure)

Overview

PgxFrame is a data structure to load, store, and manipulate tabular data. It contains rows and columns. A PgxFrame can contain multiple columns, where each column has a name and consists of elements of the same data type. The list of columns with their names and data types defines the schema of the frame. (The number of rows in the PgxFrame is not part of the schema of the frame.)

PgxFrame provides some operations that also output PgxFrames (described later in the tutorial). Those operations can be performed in place (meaning that the frame is mutated during the operation) in order to save memory. In-place operations should be used whenever possible. However, out-of-place variants are also provided, in which a new frame is created during the operation. The following table lists each in-place operation together with its out-of-place counterpart:

In-place operations	Out-of-place operations
headInPlace	head
tailInPlace	tail
flattenAllInPlace	flattenAll
renameColumnInPlace	renameColumn
renameColumnsInPlace	renameColumns
selectInPlace	select
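The difference between the two variants can be illustrated without a PGX server. The following is a minimal plain-Python sketch, not the PgxFrame API: `head` and `head_in_place` are hypothetical helper functions operating on a list of rows, written only to show which object gets mutated.

```python
# Plain-Python illustration only. These helpers are hypothetical and are
# not part of PgxFrame; they mimic the two calling conventions on a list.

def head(rows, n):
    """Out-of-place: return a new list and leave `rows` untouched."""
    return rows[:n]


def head_in_place(rows, n):
    """In-place: truncate `rows` itself and return the same object."""
    del rows[n:]
    return rows


rows = ["John", "Albert", "Heather", "Emily"]

# Out-of-place: `rows` still has four elements afterwards.
first_two = head(rows, 2)

# In-place: `rows` is shortened, and no second copy is allocated.
same_rows = head_in_place(rows, 2)
```

The in-place variant avoids holding two copies of the data at once, which is why it is the memory-saving choice; the out-of-place variant is the one to use when the original rows are still needed afterwards.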

Functionalities

We show here the current functionalities of PgxFrames using some toy examples.

Loading a PgxFrame (with multiple data types) from some specified path

First, create a session:

session = pypgx.create_session(session_name="my-session")

We use the following sample data (in CSV format, with a space separator instead of comma) in the next examples of our tutorial:

"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"
"Albert" 23 5813000.5 false 12.0 124343142 "2000-01-14"
"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"
"Emily" 24 9380080.5 false 13.0 128973221 "1910-07-30"
"""D'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"
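The quoting in the last row deserves a note: a doubled quote inside a quoted field conventionally stands for one literal quote character. The sketch below uses Python's standard `csv` module as a stand-in to show how two of the sample rows would be split under that convention. It is an assumption that the PGX CSV loader follows the same quoting rules in every detail.

```python
import csv
import io

# Two rows copied from the sample data above.
sample = (
    '"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"\n'
    '"""D\'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"\n'
)

# A space is the field separator; a doubled quote inside a quoted
# field is read as a single literal quote character.
reader = csv.reader(io.StringIO(sample), delimiter=" ", quotechar='"')
rows = list(reader)

# Every row splits into seven fields, one per schema column.
field_counts = [len(row) for row in rows]

# The first name has its surrounding quotes stripped, while the second
# keeps one pair of literal double quotes as part of the value.
names = [row[0] for row in rows]
print(names)
```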

A frame schema is necessary to load a PgxFrame. An example frame schema with various data types can be defined as follows:

example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]

Loading the CSV file with the above-mentioned schema can be performed as follows:

example_frame = session.read_frame()
example_frame = example_frame.name("simple frame")
example_frame = example_frame.columns(example_frame_schema)
example_frame = example_frame.csv()
example_frame = example_frame.separator(' ')
example_frame = example_frame.load("<path>/simple_frame.csv")

Loading a PgxFrame from client-side data

PgxFrames can also be loaded directly from client-side data. As with loading from a file, a frame schema is necessary. An example frame schema with various data types can be defined as follows:

example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]

Once the schema is defined, we define the data as a dictionary that maps each column name to the list of its values:

from datetime import date

example_frame_data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 27, 29],
    "salary": [10000.0, 15000.0, 20000.0],
    "married": [False, False, True],
    "tax_rate": [0.21, 0.26, 0.32],
    "random": [2394293898324, 45640604960495, 12312323409087654],
    "date_of_birth": [
        date(1990, 9, 15),
        date(1991, 11, 4),
        date(1993, 10, 4)
    ]
}
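Because the data is column-oriented, every column list must have the same length and hold values that fit the declared type. A client-side sanity check along these lines can catch mistakes before anything is sent to the server. This is only a sketch: `check_frame_data` and the `PYTHON_TYPES` mapping are hypothetical helpers, not part of pypgx, and the mapping of type names to Python types is an assumption.

```python
from datetime import date

# Hypothetical mapping from schema type names to Python types (an assumption).
PYTHON_TYPES = {
    "STRING_TYPE": str,
    "INTEGER_TYPE": int,
    "DOUBLE_TYPE": float,
    "BOOLEAN_TYPE": bool,
    "FLOAT_TYPE": float,
    "LONG_TYPE": int,
    "LOCAL_DATE_TYPE": date,
}


def check_frame_data(schema, data):
    """Return a list of problems; an empty list means the data fits."""
    problems = []
    if {name for name, _ in schema} != set(data):
        problems.append("column names differ from the schema")
    if len({len(values) for values in data.values()}) > 1:
        problems.append("columns have different lengths")
    for name, type_name in schema:
        expected = PYTHON_TYPES[type_name]
        for value in data.get(name, []):
            # Exact type match, so that True is not accepted as an integer.
            if type(value) is not expected:
                problems.append(f"{name}: {value!r} does not match {type_name}")
    return problems


schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE"),
]
good = {
    "name": ["Alice", "Bob"],
    "age": [25, 27],
    "date_of_birth": [date(1990, 9, 15), date(1991, 11, 4)],
}
# Two mistakes: columns of unequal length, and a string in an integer column.
bad = {
    "name": ["Alice"],
    "age": ["25", 27],
    "date_of_birth": [date(1990, 9, 15)],
}

good_problems = check_frame_data(schema, good)
bad_problems = check_frame_data(schema, bad)
```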

We can now load the frame as follows:

example_frame = session.create_frame(
    example_frame_schema,
    example_frame_data,
    'example frame'
)

We can also load the frame incrementally as we receive more data:

example_frame_builder = session.create_frame_builder(example_frame_schema)
example_frame_builder.add_rows(example_frame_data)
example_frame_data_part_2 = {
    "name": ["Dave"],
    "age": [26],
    "salary": [18000.0],
    "married": [True],
    "tax_rate": [0.30],
    "random": [456783423423],
    "date_of_birth": [date(1989, 9, 15)]
}
example_frame_builder.add_rows(example_frame_data_part_2)
example_frame = example_frame_builder.build("example frame")

Finally, we can also load a frame from a pandas DataFrame in Python:

import pandas as pd
example_pandas_dataframe = pd.DataFrame(data=example_frame_data)
example_frame = session.pandas_to_pgx_frame(
    example_pandas_dataframe,
    "example frame"
)

Printing the content of a PgxFrame

We can inspect the contents of the frame with the print() method:

example_frame.print()

The output looks like:

name	age	salary	married	tax_rate	random	date_of_birth
John	27	4133300.0	true	11.0	123456782	1985-10-18
Albert	23	5813000.5	false	12.0	124343142	2000-01-14
Heather	28	1.0130302E7	true	10.5	827520917	1985-10-18
Emily	24	9380080.5	false	13.0	128973221	1910-07-30
"D'Juan"	27	1582093.0	true	11.0	92384	1955-12-01

Destroying a PgxFrame

As PgxFrames can take a lot of memory on the PGX server if they have a lot of rows or columns, it may be necessary to close them with the close() operation. After this operation, the content of the PgxFrame is not available anymore.

example_frame.close()

For the rest of this tutorial, we reload the PgxFrame, as specified in the previous sub-section.

Storing a PgxFrame to some specified path

We can store the PgxFrame in CSV format as follows:

example_frame.store("<path>/stored_simple_frame.csv", file_format="csv", overwrite=True)

We can also store PgxFrames in the PGB binary format by passing file_format="pgb" instead of file_format="csv":

example_frame.store("<path>/stored_simple_frame.pgb", file_format="pgb", overwrite=True)

Flattening vector properties

In some use cases it is useful to split vector properties into multiple columns. This is supported by the flatten_all() operation. If we flatten a PgxFrame that contains vector columns (here vec_frame, with the vector columns vectProp and vectProp2), we get the following flattened PgxFrame:

intProp	intProp2	vectProp_0	vectProp_1	vectProp_2	stringProp	vectProp2_0	vectProp2_1
0	2	0.1	0.2	0.3	testProp0	0.1	0.2
1	1	0.1	0.2	0.3	testProp10	0.1	0.2
1	2	0.1	0.2	0.3	testProp20	0.1	0.2
2	3	0.1	0.2	0.3	testProp30	0.1	0.2
3	1	0.1	0.2	0.3	testProp40	0.1	0.2

One use-case of this flattening is in our MLlib where we export the embeddings using this flattening operation as classical features in a CSV file that can be easily used for post-processing in PGX or other frameworks.
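The column naming seen in the flattened output (`vectProp_0`, `vectProp_1`, ...) can be reproduced with a short plain-Python sketch. `flatten_rows` is a hypothetical helper, not the PgxFrame implementation, and it assumes a non-empty frame in which every vector in a column has the same length.

```python
def flatten_rows(columns, rows):
    """Expand every list-valued cell into `<column>_<index>` columns."""
    # Derive the flattened column names from the first row.
    flat_columns = []
    for name, value in zip(columns, rows[0]):
        if isinstance(value, list):
            flat_columns.extend(f"{name}_{i}" for i in range(len(value)))
        else:
            flat_columns.append(name)

    # Splice the vector elements into each row, keeping column order.
    flat_rows = []
    for row in rows:
        flat = []
        for value in row:
            if isinstance(value, list):
                flat.extend(value)
            else:
                flat.append(value)
        flat_rows.append(flat)
    return flat_columns, flat_rows


columns = ["intProp", "intProp2", "vectProp", "stringProp", "vectProp2"]
rows = [
    [0, 2, [0.1, 0.2, 0.3], "testProp0", [0.1, 0.2]],
    [1, 1, [0.1, 0.2, 0.3], "testProp10", [0.1, 0.2]],
]

flat_columns, flat_rows = flatten_rows(columns, rows)
```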

Union of PGX Frames

If we have two PgxFrames that have compatible columns (i.e., the same types in the same order), we can compute their union. Let’s say we have another frame, second_example_frame, besides the example_frame described above. It is loaded as follows, and its content is shown after the code.

second_example_frame = session.read_frame()
second_example_frame = second_example_frame.name("another simple frame")
second_example_frame = second_example_frame.columns(example_frame_schema)
second_example_frame = second_example_frame.csv()
second_example_frame = second_example_frame.separator(' ')
second_example_frame = second_example_frame.load("<path>/more_frame.csv")

name	age	salary	married	tax_rate	random	date_of_birth
Mary	25	6821092.0	false	11.0	88231223	1995-12-23
Anca	23	5813000.5	false	12.0	124343142	2000-01-14

Now, if we want to create the union of example_frame with the second_example_frame, we only need to execute the following:

example_frame.union(second_example_frame).print()

name	age	salary	married	tax_rate	random	date_of_birth
John	27	4133300.0	true	11.0	123456782	1985-10-18
Albert	23	5813000.5	false	12.0	124343142	2000-01-14
Heather	28	1.0130302E7	true	10.5	827520917	1985-10-18
Emily	24	9380080.5	false	13.0	128973221	1910-07-30
"D'Juan"	27	1582093.0	true	11.0	92384	1955-12-01
Mary	25	6821092.0	false	11.0	88231223	1995-12-23
Anca	23	5813000.5	false	12.0	124343142	2000-01-14

We can observe that the rows of the resulting PgxFrame are the union of the rows from the two original frames. Note that the union() operation does not remove duplicate rows: a row that appears in both frames appears twice in the result.
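In other words, the operation behaves like a concatenation of rows rather than a set union. A plain-Python sketch of that behavior follows. It does not use the PgxFrame API, and the order-preserving de-duplication step at the end is only an illustration of what a caller would have to do separately.

```python
# Rows are modeled as tuples; ("Albert", 23) occurs in both inputs.
frame_a = [("Albert", 23), ("Emily", 24)]
frame_b = [("Mary", 25), ("Albert", 23)]

# Concatenation keeps every row, including the duplicate.
union_rows = frame_a + frame_b

# Removing duplicates is a separate step. dict.fromkeys keeps the
# first occurrence of each row and preserves the original order.
distinct_rows = list(dict.fromkeys(union_rows))
```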

Joining PGX Frames

It might happen that we have two frames whose rows are correlated through one of the columns. This is the case of many machine learning problems where we have to join embeddings coming from different sources. For this, we have the join() functionality that allows us to combine frames by checking for equality between rows for a specific column.

Let’s say we have another frame more_info_frame that contains additional information about the people in the example_frame.

more_info_frame.print()

name	title	reports
John	Software Engineering Manager	5
Albert	Sales Manager	10
Emily	Operations Manager	20

Now, if we want to combine this frame with the example_frame on the name column, we only need to call the join() method.

example_frame.join(more_info_frame, "name", "leftFrame", "rightFrame").print()

leftFrame_name	leftFrame_age	leftFrame_salary	leftFrame_married	leftFrame_tax_rate	leftFrame_random	leftFrame_date_of_birth	rightFrame_name	rightFrame_title	rightFrame_reports
John	27	4133300.0	true	11.0	123456782	1985-10-18	John	Software Engineering Manager	5
Albert	23	5813000.5	false	12.0	124343142	2000-01-14	Albert	Sales Manager	10
Emily	24	9380080.5	false	13.0	128973221	1910-07-30	Emily	Operations Manager	20

We can see that the joined frame contains the columns of the two frames involved in the operation for the rows with the same name. Also note the column prefixes specified in the call, leftFrame and rightFrame.
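The semantics shown in the output (only rows with a match on both sides, and every column renamed with its frame's prefix) can be sketched in plain Python. `join_rows` is a hypothetical nested-loop helper written for illustration. It is not how PGX implements join(), and the `<prefix>_<column>` naming is inferred from the output above.

```python
def join_rows(left_rows, right_rows, column, left_prefix, right_prefix):
    """Equality join on `column`; unmatched rows are dropped."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            if left[column] == right[column]:
                # Prefix every column so that names from both sides stay distinct.
                row = {f"{left_prefix}_{k}": v for k, v in left.items()}
                row.update({f"{right_prefix}_{k}": v for k, v in right.items()})
                joined.append(row)
    return joined


people = [
    {"name": "John", "age": 27},
    {"name": "Albert", "age": 23},
    {"name": "Heather", "age": 28},
    {"name": "Emily", "age": 24},
]
more_info = [
    {"name": "John", "title": "Software Engineering Manager", "reports": 5},
    {"name": "Albert", "title": "Sales Manager", "reports": 10},
    {"name": "Emily", "title": "Operations Manager", "reports": 20},
]

joined = join_rows(people, more_info, "name", "leftFrame", "rightFrame")

# Heather has no row in `more_info`, so she is absent from the result.
joined_names = [row["leftFrame_name"] for row in joined]
```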

PgxFrame helpers

PgxFrame also supports the helper operations head(), tail(), and select(), described below.

Head operation

The head() operation can be used to keep only the first rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.) Here, we apply the head() operation to the vec_frame PgxFrame and print the result:

vec_frame.head(2).print()

The output looks as follows:

intProp	intProp2	vectProp	stringProp	vectProp2
0	2	0.1;0.2;0.3	testProp0	0.1;0.2
1	1	0.1;0.2;0.3	testProp10	0.1;0.2

Tail operation

The tail() operation can be used to keep only the last rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.) Next, we apply the tail() operation to the vec_frame PgxFrame and print the result:

vec_frame.tail(2).print()

The output looks as follows:

intProp	intProp2	vectProp	stringProp	vectProp2
2	3	0.1;0.2;0.3	testProp30	0.1;0.2
3	1	0.1;0.2;0.3	testProp40	0.1;0.2

Select operation

The select() operation can be used to keep only a specified list of columns of an input PgxFrame. The columns of the result appear in the order in which they are listed in the call. We now apply the select() operation to the vec_frame PgxFrame:

vec_frame_selected = vec_frame.select("vectProp2", "vectProp", "stringProp")

We now look at the selected PgxFrame (using vec_frame_selected.print()):

vectProp2	vectProp	stringProp
0.1;0.2	0.1;0.2;0.3	testProp0
0.1;0.2	0.1;0.2;0.3	testProp10
0.1;0.2	0.1;0.2;0.3	testProp20
0.1;0.2	0.1;0.2;0.3	testProp30
0.1;0.2	0.1;0.2;0.3	testProp40
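Note that the output columns follow the order of the arguments, not the order in the original frame. The following plain-Python sketch mirrors that projection on rows stored as lists. `select_columns` is a hypothetical helper, not the PgxFrame implementation.

```python
def select_columns(columns, rows, *wanted):
    """Keep only the `wanted` columns, in the order they are requested."""
    indexes = [columns.index(name) for name in wanted]
    return list(wanted), [[row[i] for i in indexes] for row in rows]


columns = ["intProp", "intProp2", "vectProp", "stringProp", "vectProp2"]
rows = [
    [0, 2, [0.1, 0.2, 0.3], "testProp0", [0.1, 0.2]],
    [1, 1, [0.1, 0.2, 0.3], "testProp10", [0.1, 0.2]],
]

selected_columns, selected_rows = select_columns(
    columns, rows, "vectProp2", "vectProp", "stringProp"
)
```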

PgxFrame-PgqlResultSet conversions

We now explain the conversions between PgxFrames and PgqlResultSets.

PgxFrame to PgqlResultSet

We convert a PgxFrame to PgqlResultSet as follows:

result_set = example_frame.to_pgql_result_set()

We now have a look at the content of the result_set using result_set.print() as follows:

name	age	salary	married	tax_rate	random	date_of_birth
John	27	4133300.0	true	11.0	123456782	1985-10-18
Albert	23	5813000.5	false	12.0	124343142	2000-01-14
Heather	28	1.0130302E7	true	10.5	827520917	1985-10-18
Emily	24	9380080.5	false	13.0	128973221	1910-07-30
"D'Juan"	27	1582093.0	true	11.0	92384	1955-12-01

The content of the result set can be accessed through the usual PgqlResultSet APIs.

PgqlResultSet to PgxFrame

We convert a PgqlResultSet to PgxFrame as follows:

query = ...
graph = ...
result_set = graph.query_pgql(query)
frame = result_set.to_frame()

Creating a graph from multiple PgxFrame instances

We can create a PgxGraph from vertex PgxFrame(s) and edge PgxFrame(s). Given the following PgxFrame instances:

people:

id	name
1	Alice
2	Bob
3	Charlie

houses:

identification	location
1	Road 1
2	Street 5
3	Avenue 4

knows:

src	dst
1	1
2	3
3	2

lives:

source	destination
1	2
2	1
3	3

We can now create a PgxGraph as follows:

vertex_providers_from_frames = [
    session.vertex_provider_from_frame(
        "person",
        people
    ),
    session.vertex_provider_from_frame(
        "house",
        frame=houses,
        vertex_key_column="identification"
    )
]

edge_providers_from_frames = [
    session.edge_provider_from_frame(
        "person_knows_person",
        source_provider="person",
        destination_provider="person",
        frame=knows),
    session.edge_provider_from_frame(
        "person_lives_at_house",
        source_provider="person",
        destination_provider="house",
        frame=lives,
        source_vertex_column="source",
        destination_vertex_column="destination"
    )
]

graph = session.graph_from_frames(
    "example graph",
    vertex_providers_from_frames,
    edge_providers_from_frames,
    partitioned=True
)
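Before creating a graph this way, it can help to check on the client side that every edge row refers to vertex keys that actually exist. The sketch below models the four frames as lists of dictionaries and looks for dangling edges. `dangling_edges` is a hypothetical helper. It is also an assumption, inferred from the calls above that omit the column arguments, that the default column names are "id" for vertex keys and "src"/"dst" for edge endpoints.

```python
people = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"},
          {"id": 3, "name": "Charlie"}]
houses = [{"identification": 1, "location": "Road 1"},
          {"identification": 2, "location": "Street 5"},
          {"identification": 3, "location": "Avenue 4"}]
knows = [{"src": 1, "dst": 1}, {"src": 2, "dst": 3}, {"src": 3, "dst": 2}]
lives = [{"source": 1, "destination": 2}, {"source": 2, "destination": 1},
         {"source": 3, "destination": 3}]


def dangling_edges(edges, src_col, dst_col, src_keys, dst_keys):
    """Return the edge rows whose source or destination key is unknown."""
    return [
        edge for edge in edges
        if edge[src_col] not in src_keys or edge[dst_col] not in dst_keys
    ]


# Vertex keys, read from the key column of each vertex frame.
person_keys = {row["id"] for row in people}
house_keys = {row["identification"] for row in houses}

bad_knows = dangling_edges(knows, "src", "dst", person_keys, person_keys)
bad_lives = dangling_edges(lives, "source", "destination",
                           person_keys, house_keys)

# An edge to a person with key 9 would be reported, since no such key exists.
bad_example = dangling_edges([{"src": 1, "dst": 9}], "src", "dst",
                             person_keys, person_keys)
```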