PgxFrame (Tabular Data-Structure)

Overview

PgxFrame is a data structure to load, store, and manipulate tabular data. It contains rows and columns. A PgxFrame can contain multiple columns, where each column has a name and consists of elements of the same data type. The list of the columns with their names and data types defines the schema of the frame. (The number of rows in the PgxFrame is not part of the schema of the frame.)
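To make the notion of a schema concrete, here is a minimal plain-Python sketch that models a frame as a schema plus equally long columns. It does not use the PGX API: the dictionary layout and the helper row_count are hypothetical, and the type names are borrowed from the schema examples later in this tutorial.

```python
# Plain-Python model of a frame: a schema plus equally long columns.
# This only illustrates the concept; it does not use the PGX API.

schema = [("name", "STRING_TYPE"), ("age", "INTEGER_TYPE")]

small = {"schema": schema, "columns": {"name": ["Alice"], "age": [25]}}
large = {
    "schema": schema,
    "columns": {"name": ["Alice", "Bob", "Charlie"], "age": [25, 27, 29]},
}


def row_count(frame):
    """Return the number of rows, checking that all columns agree."""
    lengths = {len(values) for values in frame["columns"].values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return lengths.pop() if lengths else 0


# Both frames share one schema although their row counts differ.
same_schema = small["schema"] == large["schema"]
counts = (row_count(small), row_count(large))
```

The two frames compare equal on schema while holding one and three rows, which is what "the number of rows is not part of the schema" means in practice.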

PgxFrame provides some operations that also output PgxFrames (described later in the tutorial). Those operations can be performed in-place (meaning that the frame is mutated during the operation) in order to save memory, and in-place operations should be used whenever possible. We also provide out-place variants, in which a new frame is created during the operation. The following operations are available in both variants:

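+----------------------+----------------------+
| In-place operations  | Out-place operations |
+----------------------+----------------------+
| headInPlace          | head                 |
| tailInPlace          | tail                 |
| flattenAllInPlace    | flattenAll           |
| renameColumnInPlace  | renameColumn         |
| renameColumnsInPlace | renameColumns        |
| selectInPlace        | select               |
+----------------------+----------------------+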
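The difference between the two variants can be sketched with plain Python lists standing in for frames. The functions head and head_in_place below are hypothetical stand-ins, not the PGX methods; they only show which object is mutated and which is returned.

```python
# Hypothetical stand-ins for one frame operation in its two variants.
# Rows are modeled as a plain list; this is not the PGX API.

def head(rows, n):
    """Out-place: return a new list and leave the input untouched."""
    return rows[:n]


def head_in_place(rows, n):
    """In-place: shrink the given list itself and return it."""
    del rows[n:]
    return rows


rows = ["John", "Albert", "Heather", "Emily"]

copy = head(rows, 2)            # new object; rows still has 4 entries
length_after_out_place = len(rows)

same = head_in_place(rows, 2)   # same object; rows now has 2 entries
length_after_in_place = len(rows)
```

The in-place variant avoids allocating a second copy of the data, which is the memory saving mentioned above, at the price of losing the original content.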

Functionalities

We show here the current functionalities of PgxFrames using some toy examples.

Loading a PgxFrame (with multiple data types) from some specified path

First, create a session:

import pypgx

session = pypgx.create_session(session_name="my-session")

We use the following sample data (in CSV format, with a space separator instead of comma) in the next examples of our tutorial:

"John" 27 4133300.0 true 11.0 123456782 "1985-10-18"
"Albert" 23 5813000.5 false 12.0 124343142 "2000-01-14"
"Heather" 28 1.0130302E7 true 10.5 827520917 "1985-10-18"
"Emily" 24 9380080.5 false 13.0 128973221 "1910-07-30"
"""D'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"
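The last sample row uses doubled quotes inside a quoted field. As a rough illustration of how such a line splits into values, the sketch below parses it with Python's standard csv module. This follows the csv module's quoting conventions, which are assumed here to match the file; the PGX loader may differ in details.

```python
import csv
import io

# The last sample row, exactly as it appears in the file.
line = '"""D\'Juan""" 27 1582093.0 true 11.0 92384 "1955-12-01"\n'

# A space is the field separator; a doubled quote inside a quoted
# field stands for one literal quote character.
reader = csv.reader(io.StringIO(line), delimiter=" ", quotechar='"')
fields = next(reader)

print(fields[0])    # the name, including its literal double quotes
print(len(fields))  # number of values found in the row
```

The first field keeps one pair of double quotes as part of the name, while the quotes around the date are consumed as field delimiters.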

A frame schema is necessary to load a PgxFrame. An example frame schema with various data types can be defined as follows:

example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]

Loading the CSV file with the above-mentioned schema can be performed as follows:

example_frame = session.read_frame()
example_frame = example_frame.name("simple frame")
example_frame = example_frame.columns(example_frame_schema)
example_frame = example_frame.csv()
example_frame = example_frame.separator(' ')
example_frame = example_frame.load("<path>/simple_frame.csv")

Loading a PgxFrame from client-side data

PgxFrames can also be loaded directly from client-side data. As before, a frame schema is necessary. An example frame schema with various data types can be defined as follows:

example_frame_schema = [
    ("name", "STRING_TYPE"),
    ("age", "INTEGER_TYPE"),
    ("salary", "DOUBLE_TYPE"),
    ("married", "BOOLEAN_TYPE"),
    ("tax_rate", "FLOAT_TYPE"),
    ("random", "LONG_TYPE"),
    ("date_of_birth", "LOCAL_DATE_TYPE")
]

Once the schema is defined, we need to define our data:

from datetime import date

example_frame_data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 27, 29],
    "salary": [10000.0, 15000.0, 20000.0],
    "married": [False, False, True],
    "tax_rate": [0.21, 0.26, 0.32],
    "random": [2394293898324, 45640604960495, 12312323409087654],
    "date_of_birth": [
        date(1990, 9, 15),
        date(1991, 11, 4),
        date(1993, 10, 4)
    ]
}
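Before sending column-oriented data to the server, it can help to check it against the schema on the client. The sketch below is one possible check in plain Python. The helper check_data and the mapping PYTHON_TYPES from schema type names to Python types are assumptions made for illustration; they are not part of the PGX API or taken from its documentation.

```python
from datetime import date

# ASSUMPTION: this mapping from schema type names to Python types is
# for illustration only; it is not taken from the PGX documentation.
PYTHON_TYPES = {
    "STRING_TYPE": str,
    "INTEGER_TYPE": int,
    "LONG_TYPE": int,
    "FLOAT_TYPE": float,
    "DOUBLE_TYPE": float,
    "BOOLEAN_TYPE": bool,
    "LOCAL_DATE_TYPE": date,
}


def check_data(schema, data):
    """Return a list of problems found in column-oriented data."""
    problems = []
    names = [name for name, _ in schema]
    if set(data) != set(names):
        problems.append("column names do not match the schema")
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        problems.append("columns have different lengths")
    for name, type_name in schema:
        expected = PYTHON_TYPES[type_name]
        for value in data.get(name, []):
            # type() is used instead of isinstance() so that a bool
            # is not accepted where an int is expected.
            if type(value) is not expected:
                problems.append(f"{name}: unexpected value {value!r}")
    return problems


schema = [("name", "STRING_TYPE"), ("age", "INTEGER_TYPE"),
          ("date_of_birth", "LOCAL_DATE_TYPE")]
good = {"name": ["Alice"], "age": [25], "date_of_birth": [date(1990, 9, 15)]}
bad = {"name": ["Bob"], "age": ["27"], "date_of_birth": [date(1991, 11, 4)]}

print(check_data(schema, good))  # no problems expected
print(check_data(schema, bad))   # reports the string in the age column
```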

We can now load the frame as follows:

example_frame = session.create_frame(
    example_frame_schema,
    example_frame_data,
    'example frame'
)

We can also load the frame incrementally as we receive more data:

example_frame_builder = session.create_frame_builder(example_frame_schema)
example_frame_builder.add_rows(example_frame_data)
example_frame_data_part_2 = {
    "name": ["Dave"],
    "age": [26],
    "salary": [18000.0],
    "married": [True],
    "tax_rate": [0.30],
    "random": [456783423423],
    "date_of_birth": [date(1989, 9, 15)]
}
example_frame_builder.add_rows(example_frame_data_part_2)
example_frame = example_frame_builder.build("example frame")
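Conceptually, incremental loading appends each new chunk to the columns collected so far. The class ColumnBuffer below is a hypothetical plain-Python sketch of that idea. It is not how the PGX frame builder is implemented, and the validation it performs is an assumption.

```python
class ColumnBuffer:
    """Hypothetical helper that accumulates column-oriented chunks."""

    def __init__(self, schema):
        self.names = [name for name, _ in schema]
        self.columns = {name: [] for name in self.names}

    def add_rows(self, chunk):
        # Reject chunks whose columns differ from the schema.
        if set(chunk) != set(self.names):
            raise ValueError("chunk columns do not match the schema")
        # Reject ragged chunks.
        if len({len(values) for values in chunk.values()}) > 1:
            raise ValueError("chunk columns have different lengths")
        for name in self.names:
            self.columns[name].extend(chunk[name])

    def build(self):
        # Hand out copies so later add_rows calls do not alter the result.
        return {name: list(values) for name, values in self.columns.items()}


buffer = ColumnBuffer([("name", "STRING_TYPE"), ("age", "INTEGER_TYPE")])
buffer.add_rows({"name": ["Alice", "Bob", "Charlie"], "age": [25, 27, 29]})
buffer.add_rows({"name": ["Dave"], "age": [26]})
combined = buffer.build()
```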

Finally, we can also load a frame from a pandas DataFrame in Python:

import pandas as pd
example_pandas_dataframe = pd.DataFrame(data=example_frame_data)
example_frame = session.pandas_to_pgx_frame(
    example_pandas_dataframe,
    "example frame"
)

Printing the content of a PgxFrame

We can inspect the contents of a frame with the print() method:

example_frame.print()

The output looks like:

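+----------+-----+-------------+---------+----------+-----------+---------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+----------+-----+-------------+---------+----------+-----------+---------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
+----------+-----+-------------+---------+----------+-----------+---------------+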

Destroying a PgxFrame

As PgxFrames can take a lot of memory on the PGX server if they have a lot of rows or columns, it may be necessary to close them with the close() operation. After this operation, the content of the PgxFrame is not available anymore.

example_frame.close()

For the rest of this tutorial, we reload the PgxFrame, as specified in the previous sub-section.

Storing a PgxFrame to some specified path

We can store the PgxFrame in CSV format as follows:

example_frame.store("<path>/stored_simple_frame.csv", file_format="csv", overwrite=True)

We can also store PgxFrames in the PGB binary format by passing file_format="pgb" instead of file_format="csv":

example_frame.store("<path>/stored_simple_frame.pgb", file_format="pgb", overwrite=True)

Flattening vector properties

In some use cases it is useful to split vector properties into multiple columns. We support this with the flatten_all() operation. If we flatten a PgxFrame vec_frame that has the vector columns vectProp (three elements) and vectProp2 (two elements), we get the following flattened PgxFrame:

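+---------+----------+------------+------------+------------+------------+-------------+-------------+
| intProp | intProp2 | vectProp_0 | vectProp_1 | vectProp_2 | stringProp | vectProp2_0 | vectProp2_1 |
+---------+----------+------------+------------+------------+------------+-------------+-------------+
| 0       | 2        | 0.1        | 0.2        | 0.3        | testProp0  | 0.1         | 0.2         |
| 1       | 1        | 0.1        | 0.2        | 0.3        | testProp10 | 0.1         | 0.2         |
| 1       | 2        | 0.1        | 0.2        | 0.3        | testProp20 | 0.1         | 0.2         |
| 2       | 3        | 0.1        | 0.2        | 0.3        | testProp30 | 0.1         | 0.2         |
| 3       | 1        | 0.1        | 0.2        | 0.3        | testProp40 | 0.1         | 0.2         |
+---------+----------+------------+------------+------------+------------+-------------+-------------+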

One use-case of this flattening is in our MLlib where we export the embeddings using this flattening operation as classical features in a CSV file that can be easily used for post-processing in PGX or other frameworks.
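The column naming used by flattening (vectProp becomes vectProp_0, vectProp_1, and so on) can be reproduced with a short plain-Python sketch. The function flatten_all below is a local helper written for illustration, not the PGX method, and it assumes that all vectors in one column have the same length.

```python
def flatten_all(columns):
    """Split every list-valued column into <name>_<index> columns."""
    flat = {}
    for name, values in columns.items():
        if values and isinstance(values[0], (list, tuple)):
            # ASSUMPTION: every vector in this column has the same width.
            width = len(values[0])
            for i in range(width):
                flat[f"{name}_{i}"] = [vector[i] for vector in values]
        else:
            # Scalar columns are copied unchanged.
            flat[name] = list(values)
    return flat


vec_columns = {
    "intProp": [0, 1],
    "vectProp": [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]],
    "stringProp": ["testProp0", "testProp10"],
}
flat = flatten_all(vec_columns)
print(list(flat))
```

Each flattened column holds plain scalars, which is why the result can be written to CSV and read by tools that have no notion of vector-valued cells.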

Union of PGX Frames

If we have two PgxFrames that have compatible columns (i.e. same type and order), we are able to union them. Let’s say we have another frame, second_example_frame, in addition to the example_frame described above. It is loaded as follows and has the content shown after the code.

second_example_frame = session.read_frame()
second_example_frame = second_example_frame.name("another simple frame")
second_example_frame = second_example_frame.columns(example_frame_schema)
second_example_frame = second_example_frame.csv()
second_example_frame = second_example_frame.separator(' ')
second_example_frame = second_example_frame.load("<path>/more_frame.csv")

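+------+-----+-----------+---------+----------+-----------+---------------+
| name | age | salary    | married | tax_rate | random    | date_of_birth |
+------+-----+-----------+---------+----------+-----------+---------------+
| Mary | 25  | 6821092.0 | false   | 11.0     | 88231223  | 1995-12-23    |
| Anca | 23  | 5813000.5 | false   | 12.0     | 124343142 | 2000-01-14    |
+------+-----+-----------+---------+----------+-----------+---------------+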

Now, if we want to create the union of example_frame with the second_example_frame, we only need to execute the following:

example_frame.union(second_example_frame).print()

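+----------+-----+-------------+---------+----------+-----------+---------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+----------+-----+-------------+---------+----------+-----------+---------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
| Mary     | 25  | 6821092.0   | false   | 11.0     | 88231223  | 1995-12-23    |
| Anca     | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
+----------+-----+-------------+---------+----------+-----------+---------------+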

We can observe that the rows of the resulting PgxFrame are the union of the rows from the two original frames. Note that the union() operation does not remove duplicate rows from its result.
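The behavior described above, compatible column types in the same order and no removal of duplicates, can be sketched in plain Python. The function union and the dictionary layout below are hypothetical and only model the semantics; they are not the PGX implementation.

```python
def union(left, right):
    """Concatenate the rows of two frames with compatible columns."""
    left_types = [type_name for _, type_name in left["schema"]]
    right_types = [type_name for _, type_name in right["schema"]]
    if left_types != right_types:
        raise ValueError("frames must have the same column types and order")
    # Rows are appended as they are; nothing is deduplicated.
    return {"schema": left["schema"], "rows": left["rows"] + right["rows"]}


schema = [("name", "STRING_TYPE"), ("age", "INTEGER_TYPE")]
first = {"schema": schema, "rows": [("Albert", 23), ("Emily", 24)]}
second = {"schema": schema, "rows": [("Mary", 25), ("Albert", 23)]}

result = union(first, second)

# Removing duplicates is a separate, explicit step in this sketch.
# dict.fromkeys keeps the first occurrence and the original order.
distinct = list(dict.fromkeys(result["rows"]))
```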

Joining PGX Frames

It might happen that we have two frames whose rows are correlated through one of the columns. This is the case of many machine learning problems where we have to join embeddings coming from different sources. For this, we have the join() functionality that allows us to combine frames by checking for equality between rows for a specific column.

Let’s say we have another frame more_info_frame that contains additional information about the people in the example_frame.

more_info_frame.print()

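+--------+------------------------------+---------+
| name   | title                        | reports |
+--------+------------------------------+---------+
| John   | Software Engineering Manager | 5       |
| Albert | Sales Manager                | 10      |
| Emily  | Operations Manager           | 20      |
+--------+------------------------------+---------+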

Now, if we want to combine this frame with the example_frame on the name column, we only need to call the join() method.

example_frame.join(more_info_frame, "name", "leftFrame", "rightFrame").print()

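+----------------+---------------+------------------+-------------------+--------------------+------------------+-------------------------+-----------------+------------------------------+--------------------+
| leftFrame_name | leftFrame_age | leftFrame_salary | leftFrame_married | leftFrame_tax_rate | leftFrame_random | leftFrame_date_of_birth | rightFrame_name | rightFrame_title             | rightFrame_reports |
+----------------+---------------+------------------+-------------------+--------------------+------------------+-------------------------+-----------------+------------------------------+--------------------+
| John           | 27            | 4133300.0        | true              | 11.0               | 123456782        | 1985-10-18              | John            | Software Engineering Manager | 5                  |
| Albert         | 23            | 5813000.5        | false             | 12.0               | 124343142        | 2000-01-14              | Albert          | Sales Manager                | 10                 |
| Emily          | 24            | 9380080.5        | false             | 13.0               | 128973221        | 1910-07-30              | Emily           | Operations Manager           | 20                 |
+----------------+---------------+------------------+-------------------+--------------------+------------------+-------------------------+-----------------+------------------------------+--------------------+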

We can see that the joined frame contains the columns of the two frames involved in the operation for the rows with the same name. Also note the column prefixes specified in the call, leftFrame and rightFrame.
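The result above keeps only rows whose name appears in both frames and prefixes every column with the given frame label. A minimal plain-Python sketch of such an equality join follows. The function join and the row representation as dictionaries are hypothetical; they are meant to mirror the shape of the output, not the PGX implementation.

```python
def join(left_rows, right_rows, column, left_prefix, right_prefix):
    """Inner equality join of two lists of dict rows on one column."""
    # Index the right side by join key so each lookup is cheap.
    index = {}
    for row in right_rows:
        index.setdefault(row[column], []).append(row)

    joined = []
    for left in left_rows:
        # Rows without a partner on the right side are dropped.
        for right in index.get(left[column], []):
            combined = {f"{left_prefix}_{k}": v for k, v in left.items()}
            combined.update(
                {f"{right_prefix}_{k}": v for k, v in right.items()}
            )
            joined.append(combined)
    return joined


people = [{"name": "John", "age": 27}, {"name": "Heather", "age": 28}]
info = [{"name": "John", "title": "Software Engineering Manager"}]

result = join(people, info, "name", "leftFrame", "rightFrame")
print(result)
```

Heather has no matching row in info and therefore does not appear in the result, just as Heather and "D'Juan" are absent from the joined frame shown above.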

PgxFrame helpers

We also support helper operations on PgxFrame such as head(), tail(), and select(), described below.

Head operation

The head() operation can be used to keep only the first rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.) Here, we apply the head() operation to vec_frame, the PgxFrame with vector columns from the flattening section, and print it:

vec_frame.head(2).print()

The output looks as follows:

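+---------+----------+-------------+------------+-----------+
| intProp | intProp2 | vectProp    | stringProp | vectProp2 |
+---------+----------+-------------+------------+-----------+
| 0       | 2        | 0.1;0.2;0.3 | testProp0  | 0.1;0.2   |
| 1       | 1        | 0.1;0.2;0.3 | testProp10 | 0.1;0.2   |
+---------+----------+-------------+------------+-----------+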

Tail operation

The tail() operation can be used to keep only the last rows of a PgxFrame. (The result is deterministic only for an ordered PgxFrame.) Next, we apply the tail() operation to the same vec_frame and print it:

vec_frame.tail(2).print()

The output looks as follows:

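+---------+----------+-------------+------------+-----------+
| intProp | intProp2 | vectProp    | stringProp | vectProp2 |
+---------+----------+-------------+------------+-----------+
| 2       | 3        | 0.1;0.2;0.3 | testProp30 | 0.1;0.2   |
| 3       | 1        | 0.1;0.2;0.3 | testProp40 | 0.1;0.2   |
+---------+----------+-------------+------------+-----------+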
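The remark that head() and tail() are deterministic only for an ordered frame can be illustrated with plain Python lists. In the sketch below, tail is a hypothetical helper and the two lists hold the same rows in different storage orders.

```python
def tail(rows, n):
    """Return the last n rows (fewer if there are not enough rows)."""
    # rows[-n:] would return every row for n == 0, so compute the start.
    return rows[max(len(rows) - n, 0):]


# The same three rows, stored in two different orders.
rows_a = [(2, "testProp30"), (0, "testProp0"), (3, "testProp40")]
rows_b = [(3, "testProp40"), (2, "testProp30"), (0, "testProp0")]

# Without a defined order, "the last two rows" depends on storage order.
unordered_differs = tail(rows_a, 2) != tail(rows_b, 2)

# Once both are sorted, the answer is the same.
ordered_same = tail(sorted(rows_a), 2) == tail(sorted(rows_b), 2)
```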

Select operation

The select() operation can be used to keep only a specified list of columns of an input PgxFrame. We now apply the select() operation on the PgxFrame used above and print it:

vec_frame_selected = vec_frame.select("vectProp2", "vectProp", "stringProp")

Let us look at the selected PgxFrame (using vec_frame_selected.print()):

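+-----------+-------------+------------+
| vectProp2 | vectProp    | stringProp |
+-----------+-------------+------------+
| 0.1;0.2   | 0.1;0.2;0.3 | testProp0  |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp10 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp20 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp30 |
| 0.1;0.2   | 0.1;0.2;0.3 | testProp40 |
+-----------+-------------+------------+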
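Note that the selected frame lists its columns in the order given in the select() call, not in the order of the original frame. The plain-Python sketch below mirrors that behavior on a dictionary of columns. The function select here is a hypothetical local helper, not the PGX method, and its error handling for unknown columns is an assumption.

```python
def select(columns, *names):
    """Keep only the named columns, in the order they are requested."""
    missing = [name for name in names if name not in columns]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return {name: columns[name] for name in names}


vec_columns = {
    "intProp": [0, 1],
    "intProp2": [2, 1],
    "vectProp": ["0.1;0.2;0.3", "0.1;0.2;0.3"],
    "stringProp": ["testProp0", "testProp10"],
    "vectProp2": ["0.1;0.2", "0.1;0.2"],
}

selected = select(vec_columns, "vectProp2", "vectProp", "stringProp")
print(list(selected))
```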

PgxFrame-PgqlResultSet conversions

We now explain the conversions between PgxFrames and PgqlResultSets.

PgxFrame to PgqlResultSet

We convert a PgxFrame to PgqlResultSet as follows:

result_set = example_frame.to_pgql_result_set()

We now have a look at the content of the result_set using result_set.print() as follows:

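+----------+-----+-------------+---------+----------+-----------+---------------+
| name     | age | salary      | married | tax_rate | random    | date_of_birth |
+----------+-----+-------------+---------+----------+-----------+---------------+
| John     | 27  | 4133300.0   | true    | 11.0     | 123456782 | 1985-10-18    |
| Albert   | 23  | 5813000.5   | false   | 12.0     | 124343142 | 2000-01-14    |
| Heather  | 28  | 1.0130302E7 | true    | 10.5     | 827520917 | 1985-10-18    |
| Emily    | 24  | 9380080.5   | false   | 13.0     | 128973221 | 1910-07-30    |
| "D'Juan" | 27  | 1582093.0   | true    | 11.0     | 92384     | 1955-12-01    |
+----------+-----+-------------+---------+----------+-----------+---------------+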

The content of the result set can be accessed through the usual PgqlResultSet APIs.

PgqlResultSet to PgxFrame

We convert a PgqlResultSet to PgxFrame as follows:

query = ...
graph = ...
result_set = graph.query_pgql(query)
result_set.to_frame()

Creating a graph from multiple PgxFrame instances

We can create a PgxGraph from vertex PgxFrame(s) and edge PgxFrame(s). Given the following PgxFrame instances:

people:

+----+---------+
| id | name    |
+----+---------+
| 1  | Alice   |
| 2  | Bob     |
| 3  | Charlie |
+----+---------+

houses:

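+----------------+----------+
| identification | location |
+----------------+----------+
| 1              | Road 1   |
| 2              | Street 5 |
| 3              | Avenue 4 |
+----------------+----------+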

knows:

+-----+-----+
| src | dst |
+-----+-----+
| 1   | 1   |
| 2   | 3   |
| 3   | 2   |
+-----+-----+

lives:

+--------+-------------+
| source | destination |
+--------+-------------+
| 1      | 2           |
| 2      | 1           |
| 3      | 3           |
+--------+-------------+

We can now create a PgxGraph as follows:

vertex_providers_from_frames = [
    session.vertex_provider_from_frame(
        "person",
        people
    ),
    session.vertex_provider_from_frame(
        "house",
        frame=houses,
        vertex_key_column="identification"
    )
]

edge_providers_from_frames = [
    session.edge_provider_from_frame(
        "person_knows_person",
        source_provider="person",
        destination_provider="person",
        frame=knows
    ),
    session.edge_provider_from_frame(
        "person_lives_at_house",
        source_provider="person",
        destination_provider="house",
        frame=lives,
        source_vertex_column="source",
        destination_vertex_column="destination"
    )
]

graph = session.graph_from_frames(
    "example graph",
    vertex_providers_from_frames,
    edge_providers_from_frames,
    partitioned=True
)
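Each edge frame refers to vertices through key columns, so every source and destination value should match a key in the corresponding vertex frame. The plain-Python sketch below checks this for the four example frames. The helper dangling_edges is hypothetical, and whether or how PGX itself reports such mismatches is not covered here.

```python
# The four example frames, written as column-oriented dictionaries.
people = {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]}
houses = {
    "identification": [1, 2, 3],
    "location": ["Road 1", "Street 5", "Avenue 4"],
}
knows = {"src": [1, 2, 3], "dst": [1, 3, 2]}
lives = {"source": [1, 2, 3], "destination": [2, 1, 3]}


def dangling_edges(edges, src_col, dst_col, src_keys, dst_keys):
    """Return (source, destination) pairs that reference missing keys."""
    src_keys, dst_keys = set(src_keys), set(dst_keys)
    return [
        (s, d)
        for s, d in zip(edges[src_col], edges[dst_col])
        if s not in src_keys or d not in dst_keys
    ]


# person -> person edges use the people keys on both ends.
knows_problems = dangling_edges(
    knows, "src", "dst", people["id"], people["id"]
)
# person -> house edges use people keys as source and house keys as target.
lives_problems = dangling_edges(
    lives, "source", "destination", people["id"], houses["identification"]
)
```

Both lists are empty for the example data, so every edge row points at existing vertices.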