PgxFrame (Tabular Data-Structure)
Overview
PgxFrame is a data structure for loading, storing, and manipulating tabular data. It consists of rows and columns.
A PgxFrame can contain multiple columns, where each column has a name and consists of elements of the same data type.
The list of the columns with their names and data types defines the schema of the frame.
(The number of rows in the PgxFrame is not part of the schema of the frame.)
PgxFrame provides some operations that also output PgxFrames (described later in the tutorial).
These operations can be performed in-place (meaning that the frame is mutated during the operation) in order to save memory.
In-place operations should be used whenever possible.
However, out-place variants, which create a new frame instead of mutating the input, are also provided.
The following table lists each in-place operation with its respective out-place operation:
| In-place operations | Out-place operations |
|---|---|
| headInPlace | head |
| tailInPlace | tail |
| flattenAllInPlace | flattenAll |
| renameColumnInPlace | renameColumn |
| renameColumnsInPlace | renameColumns |
| selectInPlace | select |
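The difference between the two variants can be illustrated with a minimal plain-Python sketch. It models a frame as a dictionary of columns and is not the PgxFrame implementation; the function names `rename_column` and `rename_column_in_place` are hypothetical stand-ins chosen to mirror the table above.

```python
def rename_column(columns, old, new):
    """Out-place: build and return a new mapping; `columns` is left untouched."""
    return {(new if name == old else name): list(values)
            for name, values in columns.items()}


def rename_column_in_place(columns, old, new):
    """In-place: mutate `columns` itself (keeping column order) and return it."""
    items = [((new if name == old else name), values)
             for name, values in columns.items()]
    columns.clear()
    columns.update(items)
    return columns


frame = {"name": ["John", "Albert"], "age": [27, 23]}

# The out-place variant allocates a second frame; the original keeps its schema.
renamed = rename_column(frame, "age", "years")

# The in-place variant returns the very same object, now with the new schema.
same = rename_column_in_place(frame, "name", "first_name")
```

The out-place call holds two copies of the data at once, which is why the in-place variants are preferable when memory matters.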
Functionalities
We show here the current functionalities of PgxFrames using some toy examples.
Loading a PgxFrame (with multiple data types) from some specified path
First, create a session:
import pypgx
session = pypgx.get_session(session_name="my-session")
We use the following sample data (in CSV format, with a space separator instead of comma) in the next examples of our tutorial:
"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"
"Albert" 23 5813000.5 false 12.0 124343142 "2000-01-14"
"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"
"Emily" 24 9380080.5 false 13.0 128973221 "1910-07-30"
"""D'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"
A frame schema is necessary to load a PgxFrame.
An example frame schema with various data types can be defined as follows:
example_frame_schema = [
    ("name", "STRING_TYPE"),  # columnDescriptor
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]
Loading the CSV file with the above-mentioned schema can be performed as follows:
# simple_frame_csv holds the path to the CSV file with the sample data above
example_frame = session.read_frame()
example_frame = example_frame.name("simple frame")
example_frame = example_frame.columns(example_frame_schema)
example_frame = example_frame.csv()
example_frame = example_frame.separator(' ')
example_frame = example_frame.load(simple_frame_csv)
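To see how the space-separated sample rows map onto the schema, here is a minimal plain-Python sketch that parses a few of them with the standard `csv` module. It is not the PGX loader and the PGX parser may differ in details; `CONVERTERS` and `parse_rows` are hypothetical names introduced only for this illustration.

```python
import csv
import io
from datetime import date

# Hypothetical converters keyed by the type names used in the schema above.
CONVERTERS = {
    "STRING_TYPE": str,
    "INTEGER_TYPE": int,
    "LONG_TYPE": int,
    "DOUBLE_TYPE": float,
    "FLOAT_TYPE": float,
    "BOOLEAN_TYPE": lambda text: text == "true",
    "LOCAL_DATE_TYPE": date.fromisoformat,
}

schema = [
    ("name", "STRING_TYPE"), ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"), ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"), ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE"),
]

sample = (
    '"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"\n'
    '"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"\n'
    '"""D\'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"\n'
)


def parse_rows(text, schema, separator=" "):
    """Yield one dict per line, converting each field by its schema type."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    for fields in reader:
        if len(fields) != len(schema):
            raise ValueError(
                f"expected {len(schema)} fields, got {len(fields)}")
        yield {name: CONVERTERS[type_name](value)
               for (name, type_name), value in zip(schema, fields)}


rows = list(parse_rows(sample, schema))
```

Under standard CSV quoting, the doubled quotes in the last row decode to literal quote characters, so the parsed name is `"D'Juan"` including the quotes.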
Loading a PgxFrame from client-side data
PgxFrames can also be loaded directly from client-side data. As with loading from a file, a frame schema is necessary.
An example frame schema with various data types can be defined as follows:
example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]
Once the schema is defined, we define the data, with one list of values per column:
from datetime import date

example_frame_data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 27, 29],
    "salary": [10000.0, 15000.0, 20000.0],
    "married": [False, False, True],
    "tax_rate": [0.21, 0.26, 0.32],
    "random": [2394293898324, 45640604960495, 12312323409087654],
    "date_of_birth": [
        date(1990, 9, 15),
        date(1991, 11, 4),
        date(1993, 10, 4)
    ]
}
We can now load the frame as follows:
example_frame = session.create_frame(
    example_frame_schema,
    example_frame_data,
    'example frame'
)
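Before sending client-side data to the server, it can help to check locally that it fits the schema. The following is a minimal sketch in plain Python, not part of pypgx; `PYTHON_TYPES` and `check_frame_data` are hypothetical names, and the mapping from schema type names to Python types is an assumption made for this illustration.

```python
from datetime import date

# Assumed mapping from schema type names to the Python types accepted for them.
PYTHON_TYPES = {
    "STRING_TYPE": str,
    "INTEGER_TYPE": int,
    "LONG_TYPE": int,
    "DOUBLE_TYPE": float,
    "FLOAT_TYPE": float,
    "BOOLEAN_TYPE": bool,
    "LOCAL_DATE_TYPE": date,
}


def check_frame_data(schema, data):
    """Raise ValueError if `data` does not fit `schema`; return the row count."""
    names = [name for name, _ in schema]
    if set(data) != set(names):
        raise ValueError(
            f"columns {sorted(data)} do not match schema {sorted(names)}")
    lengths = {len(data[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    for name, type_name in schema:
        expected = PYTHON_TYPES[type_name]
        for value in data[name]:
            # bool is a subclass of int in Python: reject it in numeric columns.
            if isinstance(value, bool) and expected is not bool:
                raise ValueError(f"{name}: {value!r} is not {type_name}")
            if not isinstance(value, expected):
                raise ValueError(f"{name}: {value!r} is not {type_name}")
    return lengths.pop() if lengths else 0


schema = [("name", "STRING_TYPE"), ("age", "INTEGER_TYPE"),
          ("date_of_birth", "LOCAL_DATE_TYPE")]
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 27, 29],
    "date_of_birth": [date(1990, 9, 15), date(1991, 11, 4), date(1993, 10, 4)],
}
row_count = check_frame_data(schema, data)
```

The check enforces the two properties the schema description implies: every column holds values of one type, and all columns have the same number of rows.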
We can also load the frame incrementally as we receive more data:
example_frame_builder = session.create_frame_builder(
    example_frame_schema)
example_frame_builder.add_rows(example_frame_data)
example_frame_data_part_2 = {
    "name": ["Dave"],
    "age": [26],
    "salary": [18000.0],
    "married": [True],
    "tax_rate": [0.30],
    "random": [456783423423],
    "date_of_birth": [date(1989, 9, 15)]
}
example_frame_builder.add_rows(example_frame_data_part_2)
example_frame2 = example_frame_builder.build("example_frame")
Finally, we can also load a frame from a pandas DataFrame in Python:
import pandas as pd
example_pandas_dataframe = pd.DataFrame(data=example_frame_data)
example_frame = session.pandas_to_pgx_frame(
    example_pandas_dataframe,
    "pandas frame"
)
Printing the content of a PgxFrame
We can inspect the contents of a frame with the print() method:
example_frame.print()
For the frame loaded from the CSV file above, the output looks like:
| name | age | salary | married | tax_rate | random | date_of_birth |
|---|---|---|---|---|---|---|
| John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
| Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
| Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
| Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
| "D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
Destroying a PgxFrame
PgxFrames with many rows or columns can take a lot of memory on the PGX server, so it may be necessary to release them with the close() operation.
After this operation, the content of the PgxFrame is no longer available.
example_frame.close()
For the rest of this tutorial, we reload the PgxFrame as described in the loading sections above.
Storing a PgxFrame to some specified path
We can store the PgxFrame in CSV format as follows:
path = "/tmp/stored_simple_frame.csv"
example_frame2.store(path, file_format="csv", overwrite=True)
We can also store PgxFrames in the PGB binary format by passing file_format="pgb" instead of "csv":
pgb_path = "/tmp/stored_simple_frame.pgb"
example_frame2.store(pgb_path, file_format="pgb", overwrite=True)
Flattening vector properties
In some use cases it is useful to split vector properties into multiple columns, one per vector element.
The flatten_all() operation supports this. If we flatten a PgxFrame that has the vector columns vectProp (three elements) and vectProp2 (two elements), we get the following flattened PgxFrame:
| intProp | intProp2 | vectProp_0 | vectProp_1 | vectProp_2 | stringProp | vectProp2_0 | vectProp2_1 |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.1 | 0.2 | 0.3 | testProp0 | 0.1 | 0.2 |
| 1 | 1 | 0.1 | 0.2 | 0.3 | testProp10 | 0.1 | 0.2 |
| 1 | 2 | 0.1 | 0.2 | 0.3 | testProp20 | 0.1 | 0.2 |
| 2 | 3 | 0.1 | 0.2 | 0.3 | testProp30 | 0.1 | 0.2 |
| 3 | 1 | 0.1 | 0.2 | 0.3 | testProp40 | 0.1 | 0.2 |
One use case of flattening is in our MLlib, where embeddings are exported as classical features in a CSV file that can easily be post-processed in PGX or other frameworks.
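The column naming visible in the flattened table (`vectProp_0`, `vectProp_1`, and so on) can be reproduced with a small plain-Python sketch. This only mimics the observable result on a dictionary of columns; it is not how PGX implements flatten_all(), and the function below is a hypothetical stand-in.

```python
def flatten_all(columns):
    """Split every list-valued column into scalar columns named <column>_<index>."""
    flat = {}
    for name, values in columns.items():
        if values and isinstance(values[0], (list, tuple)):
            # Assumes all vectors in one column have the same length.
            width = len(values[0])
            for i in range(width):
                flat[f"{name}_{i}"] = [vector[i] for vector in values]
        else:
            flat[name] = list(values)
    return flat


frame = {
    "intProp": [0, 1],
    "vectProp": [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]],
    "stringProp": ["testProp0", "testProp10"],
    "vectProp2": [[0.1, 0.2], [0.1, 0.2]],
}
flat = flatten_all(frame)
column_names = list(flat)
```

Scalar columns pass through unchanged, and each vector column is replaced, in position, by one column per element.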
Union of PGX Frames
If we have two PgxFrames with compatible columns (i.e., the same types in the same order), we can union them.
Suppose we have another frame, second_example_frame, in addition to the example_frame described above. It is loaded as follows and has the content shown after the code:
second_example_frame = session.read_frame()
second_example_frame = second_example_frame.name("another simple frame")
second_example_frame = second_example_frame.columns(
    example_frame_schema)
second_example_frame = second_example_frame.csv()
second_example_frame = second_example_frame.separator(' ')
second_example_frame = second_example_frame.load(
    second_example_frame_path)
| name | age | salary | married | tax_rate | random | date_of_birth |
|---|---|---|---|---|---|---|
| Mary | 25 | 6821092.0 | false | 11.0 | 88231223 | 1995-12-23 |
| Anca | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
Now, if we want to create the union of example_frame with the second_example_frame, we only need to execute the following:
example_frame.union(second_example_frame).print()
| name | age | salary | married | tax_rate | random | date_of_birth |
|---|---|---|---|---|---|---|
| John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
| Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
| Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
| Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
| "D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
| Mary | 25 | 6821092.0 | false | 11.0 | 88231223 | 1995-12-23 |
| Anca | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
We can observe that the rows of the resulting PgxFrame are the union of the rows from the two original frames. Note that the union() operation does not remove duplicate rows: a row that appears in both inputs appears twice in the result.
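The duplicate-preserving behavior can be pictured with a minimal plain-Python sketch, in which frames are lists of row tuples. This is an illustration of the semantics only, not pypgx code; the `distinct` helper is hypothetical and shows one way to remove duplicates afterwards on the client side.

```python
def union(left_rows, right_rows):
    """Concatenate the rows of both inputs; duplicates are kept."""
    return list(left_rows) + list(right_rows)


def distinct(rows):
    """Drop repeated rows while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


left = [("Albert", 23), ("Emily", 24)]
right = [("Mary", 25), ("Albert", 23)]

combined = union(left, right)      # ("Albert", 23) appears twice
deduplicated = distinct(combined)  # ("Albert", 23) appears once
```

In the tutorial data no row is duplicated, since Anca and Albert differ in the name column even though their other values match.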
Joining PGX Frames
Sometimes we have two frames whose rows are correlated through one of the columns.
This is the case in many machine learning problems, where we have to join embeddings coming from different sources.
For this, the join() operation combines two frames by matching rows that have equal values in a specific column.
Let’s say we have another frame more_info_frame that contains additional information about the people in the example_frame.
more_info_frame.print()
| name | title | reports |
|---|---|---|
| John | Software Engineering Manager | 5 |
| Albert | Sales Manager | 10 |
| Emily | Operations Manager | 20 |
Now, if we want to combine this frame with the example_frame on the name column, we only need to call the join() method.
example_frame\
    .join(more_info_frame, "name", left_prefix="leftFrame", right_prefix="rightFrame")\
    .print()
| leftFrame_name | leftFrame_age | leftFrame_salary | leftFrame_married | leftFrame_tax_rate | leftFrame_random | leftFrame_date_of_birth | rightFrame_name | rightFrame_title | rightFrame_reports |
|---|---|---|---|---|---|---|---|---|---|
| John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 | John | Software Engineering Manager | 5 |
| Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 | Albert | Sales Manager | 10 |
| Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 | Emily | Operations Manager | 20 |
We can see that the joined frame contains the columns of both frames for the rows that have the same name in both. Rows of example_frame without a match in more_info_frame (Heather and "D'Juan") do not appear. Also note the column prefixes specified in the call, leftFrame and rightFrame.
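The behavior visible in the joined table, where only matching rows survive and every column is prefixed, corresponds to an inner equi-join. The following plain-Python sketch reproduces it on lists of dictionaries. It is an illustration under that assumption, not the pypgx implementation, and the `join` function here is a hypothetical stand-in.

```python
def join(left_rows, right_rows, column, left_prefix, right_prefix):
    """Inner equi-join of two lists of dict rows on `column`."""
    # Index the right side by join key so each left row is matched in O(1).
    index = {}
    for row in right_rows:
        index.setdefault(row[column], []).append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[column], []):
            out = {f"{left_prefix}_{key}": value for key, value in left.items()}
            out.update(
                {f"{right_prefix}_{key}": value for key, value in right.items()})
            joined.append(out)
    return joined


people = [
    {"name": "John", "age": 27},
    {"name": "Heather", "age": 28},
    {"name": "Emily", "age": 24},
]
more_info = [
    {"name": "John", "title": "Software Engineering Manager"},
    {"name": "Emily", "title": "Operations Manager"},
]

result = join(people, more_info, "name", "leftFrame", "rightFrame")
```

Heather has no entry on the right side, so she is absent from `result`, just as in the joined frame above.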
PgxFrame helpers
PgxFrame also supports helper operations such as head(), tail(), and select(), described below.
Head operation
The head() operation keeps only the first rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.)
Here, we apply head(2) and print the result. The sample output below was produced on the frame with vector properties from the flattening section, before flattening:
example_frame.head(2).print()
The output looks as follows:
| intProp | intProp2 | vectProp | stringProp | vectProp2 |
|---|---|---|---|---|
| 0 | 2 | 0.1;0.2;0.3 | testProp0 | 0.1;0.2 |
| 1 | 1 | 0.1;0.2;0.3 | testProp10 | 0.1;0.2 |
Tail operation
The tail() operation keeps only the last rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.)
Next, we apply tail(2) and print the result. As for head(), the sample output below was produced on the frame with vector properties:
example_frame.tail(2).print()
The output looks as follows:
| intProp | intProp2 | vectProp | stringProp | vectProp2 |
|---|---|---|---|---|
| 2 | 3 | 0.1;0.2;0.3 | testProp30 | 0.1;0.2 |
| 3 | 1 | 0.1;0.2;0.3 | testProp40 | 0.1;0.2 |
Select operation
The select() operation can be used to keep only a specified list of columns of an input PgxFrame.
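We now apply the select() operation to example_frame:
vec_frame_selected = example_frame.select(
    "name", "age", "date_of_birth")
We can inspect the selected PgxFrame with vec_frame_selected.print(). The sample output below comes from the frame with vector properties used in the head() and tail() examples, with the columns vectProp2, vectProp, and stringProp selected: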
| vectProp2 | vectProp | stringProp |
|---|---|---|
| 0.1;0.2 | 0.1;0.2;0.3 | testProp0 |
| 0.1;0.2 | 0.1;0.2;0.3 | testProp10 |
| 0.1;0.2 | 0.1;0.2;0.3 | testProp20 |
| 0.1;0.2 | 0.1;0.2;0.3 | testProp30 |
| 0.1;0.2 | 0.1;0.2;0.3 | testProp40 |
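The three helpers can be summarized with a minimal plain-Python sketch on a dictionary of columns. It only mirrors the behavior described above and is not pypgx code. The assumption that select() returns columns in the order the names are passed is inferred from the sample output, where vectProp2 comes first.

```python
def head(columns, n):
    """Keep the first n rows of every column."""
    return {name: values[:n] for name, values in columns.items()}


def tail(columns, n):
    """Keep the last n rows of every column."""
    return {name: values[max(len(values) - n, 0):]
            for name, values in columns.items()}


def select(columns, *names):
    """Keep only the named columns, in the order given (assumed behavior)."""
    return {name: columns[name] for name in names}


frame = {
    "name": ["John", "Albert", "Heather", "Emily"],
    "age": [27, 23, 28, 24],
    "married": [True, False, True, False],
}

first_rows = head(frame, 2)
last_rows = tail(frame, 2)
picked = select(frame, "married", "name")
```

All three return new mappings and leave `frame` unchanged, which corresponds to the out-place variants.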
PgxFrame-PgqlResultSet conversions
We now explain the conversions between PgxFrames and PgqlResultSets.
PgxFrame to PgqlResultSet
We convert a PgxFrame to PgqlResultSet as follows:
result_set = example_frame.to_pgql_result_set()
Printing the result set with result_set.print() shows the same content as the frame:
| name | age | salary | married | tax_rate | random | date_of_birth |
|---|---|---|---|---|---|---|
| John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
| Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
| Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
| Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
| "D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
The content of the result set can be accessed through the usual PgqlResultSet APIs.
PgqlResultSet to PgxFrame
We convert a PgqlResultSet to PgxFrame as follows:
# graph_config: configuration of the graph to query (assumed to be defined elsewhere)
query = "SELECT v.age FROM MATCH (v)"
graph = session.read_graph_with_properties(graph_config)
result_set = graph.query_pgql(query)
frame = result_set.to_frame()
Creating a graph from multiple PgxFrame instances
We can create a PgxGraph from one or more vertex PgxFrames and one or more edge PgxFrames.
Given the following PgxFrame instances:
people:
| id | name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |
houses:
| identification | location |
|---|---|
| 1 | Road 1 |
| 2 | Street 5 |
| 3 | Avenue 4 |
knows:
| src | dst |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 2 |
lives:
| source | destination |
|---|---|
| 1 | 2 |
| 2 | 1 |
| 3 | 3 |
We can now create a PgxGraph as follows:
vertex_providers_from_frames = [
    session.vertex_provider_from_frame(
        "person",
        people_frame
    ),
    session.vertex_provider_from_frame(
        "house",
        frame=houses_frame,
        vertex_key_column="identification"
    )
]

edge_providers_from_frames = [
    session.edge_provider_from_frame(
        "person_knows_person",
        source_provider="person",
        destination_provider="person",
        frame=knows_frame),
    session.edge_provider_from_frame(
        "person_lives_at_house",
        source_provider="person",
        destination_provider="house",
        frame=lives_frame,
        source_vertex_column="source",
        destination_vertex_column="destination"
    )
]

graph = session.graph_from_frames(
    "example graph",
    vertex_providers_from_frames,
    edge_providers_from_frames,
    partitioned=True
)
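Each edge frame refers to vertices by key, so every source and destination value should exist in the key column of the corresponding vertex frame. The following plain-Python sketch checks this for the four example frames, modeled as dictionaries of columns. It is a client-side sanity check written for this illustration, not a pypgx API, and `dangling_edges` is a hypothetical helper.

```python
people = {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]}
houses = {"identification": [1, 2, 3],
          "location": ["Road 1", "Street 5", "Avenue 4"]}
knows = {"src": [1, 2, 3], "dst": [1, 3, 2]}
lives = {"source": [1, 2, 3], "destination": [2, 1, 3]}


def dangling_edges(edge_frame, src_col, dst_col, src_keys, dst_keys):
    """Return the (source, destination) pairs that reference a missing vertex."""
    src_keys, dst_keys = set(src_keys), set(dst_keys)
    return [
        (src, dst)
        for src, dst in zip(edge_frame[src_col], edge_frame[dst_col])
        if src not in src_keys or dst not in dst_keys
    ]


# person -> person edges: both endpoints must be keys of the people frame.
bad_knows = dangling_edges(knows, "src", "dst", people["id"], people["id"])

# person -> house edges: sources are people, destinations are houses.
bad_lives = dangling_edges(lives, "source", "destination",
                           people["id"], houses["identification"])
```

Both lists are empty for the example data. The column names passed here match the frames above: the houses and lives frames use non-default names, which is why the provider calls name them explicitly.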