PgxFrame (Tabular Data-Structure)
Overview
PgxFrame is a data structure to load, store, and manipulate tabular data. It contains rows and columns.
A PgxFrame can contain multiple columns, where each column consists of elements of the same data type and has a name.
The list of the columns with their names and data types defines the schema of the frame.
(The number of rows in the PgxFrame is not part of the schema of the frame.)
PgxFrame provides some operations that also output PgxFrames (described later in the tutorial).
Those operations can be performed in place (meaning that the frame is mutated during the operation) in order to save memory.
In-place operations should be used whenever possible.
However, we also provide out-place variants, in which a new frame is created during the operation.
The following table lists each in-place operation together with its out-place counterpart:
In-place operations | Out-place operations |
---|---|
headInPlace | head |
tailInPlace | tail |
flattenAllInPlace | flattenAll |
renameColumnInPlace | renameColumn |
renameColumnsInPlace | renameColumns |
selectInPlace | select |
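The difference between the two variants can be illustrated with a small plain-Python sketch. `ToyFrame` and its two methods are hypothetical stand-ins written for this illustration and are not part of the PGX API; the sketch only mirrors the behavior described above, where the in-place variant mutates the frame and the out-place variant returns a new one.

```python
# Hypothetical toy class for illustration only; it is not the PGX API.
class ToyFrame:
    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        """Out-place variant: return a new frame, leave this one unchanged."""
        return ToyFrame(self.rows[:n])

    def head_in_place(self, n):
        """In-place variant: mutate this frame and return it."""
        del self.rows[n:]
        return self


frame = ToyFrame(["John", "Albert", "Heather", "Emily"])

first_two = frame.head(2)            # new object, original keeps 4 rows
rows_after_out_place = len(frame.rows)

same_frame = frame.head_in_place(2)  # original is truncated to 2 rows
rows_after_in_place = len(frame.rows)

print(first_two.rows, rows_after_out_place, rows_after_in_place)
```

The out-place call temporarily holds two copies of the selected rows, which is the memory cost the in-place variant avoids.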
Functionalities
We show here the current functionalities of PgxFrames using some toy examples.
Loading a PgxFrame (with multiple data types) from some specified path
First, create a session:
import pypgx
session = pypgx.get_session(session_name="my-session")
We use the following sample data (in CSV format, with a space separator instead of a comma) in the next examples of our tutorial:
"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"
"Albert" 23 5813000.5 false 12.0 124343142 "2000-01-14"
"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"
"Emily" 24 9380080.5 false 13.0 128973221 "1910-07-30"
"""D'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"
A frame schema is necessary to load a PgxFrame.
An example frame schema with various data types can be defined as follows:
example_frame_schema = [
    ("name", "STRING_TYPE"),  # column descriptor: (name, type)
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]
Loading the CSV file with the above-mentioned schema can be performed as follows:
# simple_frame_csv holds the path to the CSV file shown above
example_frame = session.read_frame()
example_frame = example_frame.name("simple frame")
example_frame = example_frame.columns(example_frame_schema)
example_frame = example_frame.csv()
example_frame = example_frame.separator(' ')
example_frame = example_frame.load(simple_frame_csv)
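To see how the space-separated sample lines split into fields, the following sketch parses two of them with Python's standard `csv` module. This is not the PGX loader, only an illustration of the file format: the separator is a single space, and a doubled quote inside a quoted field stands for one literal quote character, which is consistent with the quoted name in the printed frame.

```python
import csv
import io

# Two rows of the sample data, separated by single spaces.
sample = (
    '"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"\n'
    '"""D\'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"\n'
)

# delimiter=" " replaces the default comma; quoting rules stay the same.
rows = list(csv.reader(io.StringIO(sample), delimiter=" "))

for row in rows:
    print(row)
```

Every field arrives as a string here; in the PGX case it is the frame schema that assigns a type to each column.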
Loading a PgxFrame from client-side data
PgxFrames can also be loaded directly from client-side data. As with loading from a file, a frame schema is necessary.
An example frame schema with various data types can be defined as follows:
example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]
Once the schema is defined, we define the data as a dictionary that maps each column name to the list of its values:
from datetime import date

example_frame_data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 27, 29],
    "salary": [10000.0, 15000.0, 20000.0],
    "married": [False, False, True],
    "tax_rate": [0.21, 0.26, 0.32],
    "random": [2394293898324, 45640604960495, 12312323409087654],
    "date_of_birth": [
        date(1990, 9, 15),
        date(1991, 11, 4),
        date(1993, 10, 4)
    ]
}
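Before sending column-oriented data to the server, it can help to check on the client that it fits the schema. The sketch below is plain Python and does not call PGX; the helper `check_columns` and the `PYTHON_TYPES` mapping from schema type names to Python types are assumptions made for this illustration.

```python
from datetime import date

# Maps the type names used in the schema to Python types.
# This mapping is an assumption made for the sketch, not a PGX definition.
PYTHON_TYPES = {
    "STRING_TYPE": str,
    "INTEGER_TYPE": int,
    "DOUBLE_TYPE": float,
    "BOOLEAN_TYPE": bool,
    "FLOAT_TYPE": float,
    "LONG_TYPE": int,
    "LOCAL_DATE_TYPE": date,
}


def check_columns(schema, data):
    """Return a list of problems; an empty list means the data fits."""
    problems = []
    names = [name for name, _ in schema]
    if set(names) != set(data):
        problems.append("column names differ from the schema")
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        problems.append("columns have different lengths")
    for name, type_name in schema:
        expected = PYTHON_TYPES[type_name]
        for value in data.get(name, []):
            # bool is a subclass of int, so reject it for numeric columns
            if isinstance(value, bool) and expected is not bool:
                problems.append(f"{name}: unexpected bool {value!r}")
            elif not isinstance(value, expected):
                problems.append(f"{name}: unexpected value {value!r}")
    return problems


schema = [("name", "STRING_TYPE"), ("age", "INTEGER_TYPE"),
          ("date_of_birth", "LOCAL_DATE_TYPE")]
good = {"name": ["Alice", "Bob"], "age": [25, 27],
        "date_of_birth": [date(1990, 9, 15), date(1991, 11, 4)]}
bad = {"name": ["Alice", "Bob"], "age": [25, "27"],
       "date_of_birth": [date(1990, 9, 15)]}

print(check_columns(schema, good))
print(check_columns(schema, bad))
```

The check reflects the two properties stated in the overview: every column has a single data type, and the row count is a property of the data rather than of the schema, so only equal column lengths are required.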
We can now load the frame as follows:
example_frame = session.create_frame(
    example_frame_schema,
    example_frame_data,
    'example frame'
)
We can also load the frame incrementally as we receive more data:
example_frame_builder = session.create_frame_builder(
    example_frame_schema)
example_frame_builder.add_rows(example_frame_data)
example_frame_data_part_2 = {
    "name": ["Dave"],
    "age": [26],
    "salary": [18000.0],
    "married": [True],
    "tax_rate": [0.30],
    "random": [456783423423],
    "date_of_birth": [date(1989, 9, 15)]
}
example_frame_builder.add_rows(example_frame_data_part_2)
example_frame2 = example_frame_builder.build("example_frame")
Finally, we can also load a frame from a pandas DataFrame in Python:
import pandas as pd
example_pandas_dataframe = pd.DataFrame(data=example_frame_data)
example_frame = session.pandas_to_pgx_frame(
    example_pandas_dataframe,
    "pandas frame"
)
Printing the content of a PgxFrame
We can inspect the contents of a frame with print(). For the frame loaded from the CSV file above:
example_frame.print()
The output looks like:
name | age | salary | married | tax_rate | random | date_of_birth |
---|---|---|---|---|---|---|
John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
"D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
Destroying a PgxFrame
Because PgxFrames can take a lot of memory on the PGX server if they have many rows or columns, it may be necessary to release them with the close() operation.
After this operation, the content of the PgxFrame is no longer available.
example_frame.close()
For the rest of this tutorial, we reload the PgxFrame as specified in the loading sub-section above.
Storing a PgxFrame to some specified path
We can store the PgxFrame in CSV format as follows:
path = "/tmp/stored_simple_frame.csv"
example_frame2.store(path, file_format="csv", overwrite=True)
We can also store PgxFrames in PGB binary format by passing file_format="pgb" instead of file_format="csv":
pgb_path = "/tmp/stored_simple_frame.pgb"
example_frame2.store(pgb_path, file_format="pgb", overwrite=True)
Flattening vector properties
In some use cases it is useful to split vector properties into multiple columns.
We support this with the flatten_all() operation. For a PgxFrame with the vector columns vectProp (three components) and vectProp2 (two components), flattening produces one column per component:
intProp | intProp2 | vectProp_0 | vectProp_1 | vectProp_2 | stringProp | vectProp2_0 | vectProp2_1 |
---|---|---|---|---|---|---|---|
0 | 2 | 0.1 | 0.2 | 0.3 | testProp0 | 0.1 | 0.2 |
1 | 1 | 0.1 | 0.2 | 0.3 | testProp10 | 0.1 | 0.2 |
1 | 2 | 0.1 | 0.2 | 0.3 | testProp20 | 0.1 | 0.2 |
2 | 3 | 0.1 | 0.2 | 0.3 | testProp30 | 0.1 | 0.2 |
3 | 1 | 0.1 | 0.2 | 0.3 | testProp40 | 0.1 | 0.2 |
One use case of this flattening is in our MLlib, where we export the embeddings using this flattening operation as classical features in a CSV file that can be easily used for post-processing in PGX or other frameworks.
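The column naming in the flattened output (`vectProp_0`, `vectProp_1`, and so on) can be reproduced with a short plain-Python sketch. The function below is a hypothetical client-side helper, not the PGX implementation, and it assumes that all vectors in a column have the same length.

```python
def flatten_all(columns):
    """Split every list-valued column into one column per component."""
    flat = {}
    for name, values in columns.items():
        if values and isinstance(values[0], (list, tuple)):
            # Assumes every vector in the column has the same width.
            width = len(values[0])
            for i in range(width):
                flat[f"{name}_{i}"] = [vector[i] for vector in values]
        else:
            flat[name] = list(values)
    return flat


vector_columns = {
    "intProp": [0, 1],
    "vectProp": [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]],
    "stringProp": ["testProp0", "testProp10"],
}

flattened = flatten_all(vector_columns)
print(list(flattened))
```

Scalar columns keep their name and position, while each vector column is replaced in place by its component columns.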
Union of PGX Frames
If we have two PgxFrames that have compatible columns (i.e., the same types in the same order), we can take their union.
Let's say we have another frame, second_example_frame, besides the example_frame described above. We load it with the same schema:
second_example_frame = session.read_frame()
second_example_frame = second_example_frame.name("another simple frame")
second_example_frame = second_example_frame.columns(
    example_frame_schema)
second_example_frame = second_example_frame.csv()
second_example_frame = second_example_frame.separator(' ')
second_example_frame = second_example_frame.load(
    second_example_frame_path)
It has the following content:
name | age | salary | married | tax_rate | random | date_of_birth |
---|---|---|---|---|---|---|
Mary | 25 | 6821092.0 | false | 11.0 | 88231223 | 1995-12-23 |
Anca | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
Now, to create the union of example_frame with second_example_frame, we only need to execute the following:
example_frame.union(second_example_frame).print()
name | age | salary | married | tax_rate | random | date_of_birth |
---|---|---|---|---|---|---|
John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
"D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
Mary | 25 | 6821092.0 | false | 11.0 | 88231223 | 1995-12-23 |
Anca | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
We can observe that the rows of the resulting PgxFrame are the union of the rows from the two original frames. Note that the union() operation does not remove duplicate rows.
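Because duplicates are kept, the union behaves like a concatenation of the two row lists. The following plain-Python sketch illustrates this and shows one possible way to drop exact duplicates afterwards on the client; both the `union` helper and the deduplication step are illustrations written for this tutorial, not PGX calls.

```python
def union(left_rows, right_rows):
    """Concatenate rows of two frames with the same schema; keep duplicates."""
    return list(left_rows) + list(right_rows)


first = [("Albert", 23), ("Emily", 24)]
second = [("Mary", 25), ("Albert", 23)]

combined = union(first, second)

# Optional post-processing: drop exact duplicates, keeping first occurrences.
deduplicated = list(dict.fromkeys(combined))

print(combined)
print(deduplicated)
```

In the frames above, Albert and Anca share every value except the name, so they are distinct rows and both would survive a deduplication step.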
Joining PGX Frames
We may have two frames whose rows are related through one of the columns.
This is the case in many machine learning problems where we have to join embeddings coming from different sources.
For this, we have the join() functionality, which combines two frames by matching rows that have equal values in a specified column.
Let's say we have another frame, more_info_frame, that contains additional information about the people in the example_frame.
more_info_frame.print()
name | title | reports |
---|---|---|
John | Software Engineering Manager | 5 |
Albert | Sales Manager | 10 |
Emily | Operations Manager | 20 |
Now, to combine this frame with the example_frame on the name column, we only need to call the join() method.
example_frame\
    .join(more_info_frame, "name", left_prefix="leftFrame", right_prefix="rightFrame")\
    .print()
leftFrame_name | leftFrame_age | leftFrame_salary | leftFrame_married | leftFrame_tax_rate | leftFrame_random | leftFrame_date_of_birth | rightFrame_name | rightFrame_title | rightFrame_reports |
---|---|---|---|---|---|---|---|---|---|
John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 | John | Software Engineering Manager | 5 |
Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 | Albert | Sales Manager | 10 |
Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 | Emily | Operations Manager | 20 |
We can see that the joined frame contains the columns of both frames for the rows with the same name. Rows of example_frame without a matching name in more_info_frame (here, Heather and "D'Juan") do not appear in the result. Also note the column prefixes specified in the call, leftFrame and rightFrame.
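The behavior visible in this output (only matching rows are kept, and every column name receives its frame's prefix) corresponds to an inner equi-join. A minimal plain-Python sketch of that logic follows; the `join` function here is a hypothetical illustration over lists of dictionaries and says nothing about how PGX implements the operation.

```python
def join(left, right, column, left_prefix, right_prefix):
    """Inner equi-join of two lists of row dicts on a shared column."""
    joined = []
    for l_row in left:
        for r_row in right:
            if l_row[column] == r_row[column]:
                # Prefix every column so names from both sides stay distinct.
                row = {f"{left_prefix}_{k}": v for k, v in l_row.items()}
                row.update({f"{right_prefix}_{k}": v for k, v in r_row.items()})
                joined.append(row)
    return joined


people = [
    {"name": "John", "age": 27},
    {"name": "Heather", "age": 28},
    {"name": "Emily", "age": 24},
]
more_info = [
    {"name": "John", "reports": 5},
    {"name": "Emily", "reports": 20},
]

result = join(people, more_info, "name", "leftFrame", "rightFrame")
for row in result:
    print(row)
```

The nested loop is quadratic and only meant to make the semantics explicit; Heather has no partner row and is therefore absent from `result`.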
PgxFrame helpers
We also support operations on PgxFrame such as head(), tail(), and select(), as follows.
Head operation
The head() operation can be used to keep only the first rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.)
Here, we apply head() to keep the first two rows of a frame and print them:
example_frame.head(2).print()
For the frame with vector properties from the flattening sub-section, the output looks as follows:
intProp | intProp2 | vectProp | stringProp | vectProp2 |
---|---|---|---|---|
0 | 2 | 0.1;0.2;0.3 | testProp0 | 0.1;0.2 |
1 | 1 | 0.1;0.2;0.3 | testProp10 | 0.1;0.2 |
Tail operation
The tail() operation can be used to keep only the last rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.)
Next, we apply tail() to keep the last two rows of a frame and print them:
example_frame.tail(2).print()
For the frame with vector properties from the flattening sub-section, the output looks as follows:
intProp | intProp2 | vectProp | stringProp | vectProp2 |
---|---|---|---|---|
2 | 3 | 0.1;0.2;0.3 | testProp30 | 0.1;0.2 |
3 | 1 | 0.1;0.2;0.3 | testProp40 | 0.1;0.2 |
Select operation
The select() operation can be used to keep only a specified list of columns of an input PgxFrame.
We now apply the select() operation on the PgxFrame used above:
vec_frame_selected = example_frame.select(
    "name", "age", "date_of_birth")
Printing the result with vec_frame_selected.print() shows only the selected columns. For the frame with vector properties from the flattening sub-section, selecting vectProp2, vectProp, and stringProp gives:
vectProp2 | vectProp | stringProp |
---|---|---|
0.1;0.2 | 0.1;0.2;0.3 | testProp0 |
0.1;0.2 | 0.1;0.2;0.3 | testProp10 |
0.1;0.2 | 0.1;0.2;0.3 | testProp20 |
0.1;0.2 | 0.1;0.2;0.3 | testProp30 |
0.1;0.2 | 0.1;0.2;0.3 | testProp40 |
PgxFrame-PgqlResultSet conversions
We now explain the conversions between PgxFrames and PgqlResultSets.
PgxFrame to PgqlResultSet
We convert a PgxFrame to a PgqlResultSet as follows:
result_set = example_frame.to_pgql_result_set()
We now have a look at the content of result_set using result_set.print():
name | age | salary | married | tax_rate | random | date_of_birth |
---|---|---|---|---|---|---|
John | 27 | 4133300.0 | true | 11.0 | 123456782 | 1985-10-18 |
Albert | 23 | 5813000.5 | false | 12.0 | 124343142 | 2000-01-14 |
Heather | 28 | 1.0130302E7 | true | 10.5 | 827520917 | 1985-10-18 |
Emily | 24 | 9380080.5 | false | 13.0 | 128973221 | 1910-07-30 |
"D'Juan" | 27 | 1582093.0 | true | 11.0 | 92384 | 1955-12-01 |
The content of the result set can be accessed through the usual PgqlResultSet APIs.
PgqlResultSet to PgxFrame
We convert a PgqlResultSet to a PgxFrame as follows:
query = "SELECT v.age FROM MATCH (v)"
# graph_config is a placeholder for the configuration of a graph with an "age" vertex property
graph = session.read_graph_with_properties(graph_config)
result_set = graph.query_pgql(query)
result_frame = result_set.to_frame()
Creating a graph from multiple PgxFrame instances
We can create a PgxGraph from one or more vertex PgxFrames and one or more edge PgxFrames.
Given the following PgxFrame instances:
people:
id | name |
---|---|
1 | Alice |
2 | Bob |
3 | Charlie |
houses:
identification | location |
---|---|
1 | Road 1 |
2 | Street 5 |
3 | Avenue 4 |
knows:
src | dst |
---|---|
1 | 1 |
2 | 3 |
3 | 2 |
lives:
source | destination |
---|---|
1 | 2 |
2 | 1 |
3 | 3 |
We can now create a PgxGraph as follows:
vertex_providers_from_frames = [
    session.vertex_provider_from_frame(
        "person",
        people_frame
    ),
    # the houses frame names its key column explicitly
    session.vertex_provider_from_frame(
        "house",
        frame=houses_frame,
        vertex_key_column="identification"
    )
]

edge_providers_from_frames = [
    session.edge_provider_from_frame(
        "person_knows_person",
        source_provider="person",
        destination_provider="person",
        frame=knows_frame),
    # the lives frame names its source and destination columns explicitly
    session.edge_provider_from_frame(
        "person_lives_at_house",
        source_provider="person",
        destination_provider="house",
        frame=lives_frame,
        source_vertex_column="source",
        destination_vertex_column="destination"
    )
]

graph = session.graph_from_frames(
    "example graph",
    vertex_providers_from_frames,
    edge_providers_from_frames,
    partitioned=True
)
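Each edge frame refers to vertices by key, so every source and destination value is expected to exist in the key column of the corresponding vertex frame. The sketch below checks this on the client for the four example frames. It is plain Python with a hypothetical helper, `dangling_edges`, and it makes no claim about how PGX itself reacts to an edge whose endpoint is missing.

```python
people_ids = {1, 2, 3}   # "id" column of the people frame
house_ids = {1, 2, 3}    # "identification" column of the houses frame

knows = [(1, 1), (2, 3), (3, 2)]   # (src, dst): person -> person
lives = [(1, 2), (2, 1), (3, 3)]   # (source, destination): person -> house


def dangling_edges(edges, source_keys, destination_keys):
    """Return edges whose endpoints are missing from the vertex key sets."""
    return [
        (src, dst) for src, dst in edges
        if src not in source_keys or dst not in destination_keys
    ]


print(dangling_edges(knows, people_ids, people_ids))
print(dangling_edges(lives, people_ids, house_ids))
print(dangling_edges([(1, 4)], people_ids, house_ids))
```

Both example edge frames pass the check; the last call shows what a reference to a non-existent house with key 4 would look like.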