Configless Loading

PGX can load graphs from certain file formats without a configuration file. This guide shows how to load a graph from CSV files in this way.

Vertex table

Consider a CSV file containing personal data, with one record per row. The first column contains each person's social security number, the second their age, the third their name, and the fourth the vertex label. This file will be the vertex file.

To load the vertices into PGX, this file needs a header. The header must mark the column holding the vertex IDs with the :VID keyword. The other columns are loaded as properties, so each of their names must be suffixed with the type of the data the column contains. The annotated file looks as follows:

ssn:string,age:integer,name:VID(string),:LABEL
555-55-5555,45,"John Doe",Person
666-66-6666,29,"Jane Smith",Person
...

The :VID keyword must be parameterized with the type of the data contained in the column. Because name is specified before the colon, the column is also loaded as a property with that name. The :LABEL keyword marks the column containing the vertex label.
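
The header convention described above can be sketched with a small parser. This is an illustration only, not PGX's own implementation: ColumnSpec, KEYWORDS, parse_header_field, and parse_header are hypothetical names, and the keyword set is limited to the keywords mentioned in this guide.

```python
import csv
from typing import List, NamedTuple, Optional

# Keywords mentioned in this guide; PGX may recognise others (assumption).
KEYWORDS = {"VID", "EID", "SRC", "DST", "LABEL", "IGNORE"}


class ColumnSpec(NamedTuple):
    name: Optional[str]     # text before the colon, None if empty
    keyword: Optional[str]  # e.g. "VID" or "LABEL", None for a plain property
    args: List[str]         # keyword arguments, e.g. ["string"]
    dtype: Optional[str]    # declared property type, None if omitted


def parse_header_field(field: str) -> ColumnSpec:
    """Split one header field of the form name:spec or name:KEYWORD(arg;arg)."""
    name, _, spec = field.strip().partition(":")
    args: List[str] = []
    if "(" in spec and spec.endswith(")"):
        # Separate the keyword from its parenthesised arguments.
        spec, _, arg_text = spec[:-1].partition("(")
        args = [a.strip() for a in arg_text.split(";")]
    if spec in KEYWORDS:
        return ColumnSpec(name or None, spec, args, None)
    # Anything else after the colon is treated as a property type.
    return ColumnSpec(name or None, None, args, spec or None)


def parse_header(line: str) -> List[ColumnSpec]:
    """Parse a whole CSV header line into column specifications."""
    fields = next(csv.reader([line]))
    return [parse_header_field(f) for f in fields]


for col in parse_header("ssn:string,age:integer,name:VID(string),:LABEL"):
    print(col)
```

In this sketch, name:VID(string) yields a column named name with keyword VID and the single argument string, while :LABEL yields an unnamed LABEL column.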

Edge table

Consider also relationship data between people, stored in another CSV file. The first and third columns contain the people involved in the relationship, the second column holds the relationship type, and the fourth column holds an ID for the relationship. This file will be the edge table.

The two name columns will be the source and destination vertex columns, respectively, and the type column will be the edge label. The source and destination columns are specified with the :SRC and :DST keywords, and the label with the :LABEL keyword. The :EID keyword marks the column containing the edge IDs. The resulting file is:

:SRC,:LABEL,:DST,:EID
"John Doe",friendsWith,"Jane Smith",1
"Jane Smith",friendsWith,"John Doe",2
"John Doe",employs,"Jack Brown",3
...
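
Because edges refer to vertices by the values in the :VID column, it can be useful to check the two files against each other before loading. The following is a minimal sketch of such a check, not part of PGX: column_index and dangling_endpoints are hypothetical helpers, and the data is limited to the rows shown in this guide (the elided rows may well contain the missing vertex).

```python
import csv
import io

# Only the rows shown in this guide; the elided rows are not included.
VERTICES = '''ssn:string,age:integer,name:VID(string),:LABEL
555-55-5555,45,"John Doe",Person
666-66-6666,29,"Jane Smith",Person
'''

EDGES = ''':SRC,:LABEL,:DST,:EID
"John Doe",friendsWith,"Jane Smith",1
"Jane Smith",friendsWith,"John Doe",2
"John Doe",employs,"Jack Brown",3
'''


def column_index(header, keyword):
    """Return the index of the first column marked with the given keyword."""
    for i, field in enumerate(header):
        spec = field.partition(":")[2]
        if spec.split("(")[0] == keyword:
            return i
    raise ValueError(f"no :{keyword} column in header {header!r}")


def dangling_endpoints(vertex_text, edge_text):
    """List edge endpoints that do not appear in the vertex ID column."""
    v_rows = list(csv.reader(io.StringIO(vertex_text)))
    e_rows = list(csv.reader(io.StringIO(edge_text)))
    vid = column_index(v_rows[0], "VID")
    known = {row[vid] for row in v_rows[1:]}
    src = column_index(e_rows[0], "SRC")
    dst = column_index(e_rows[0], "DST")
    missing = set()
    for row in e_rows[1:]:
        for i in (src, dst):
            if row[i] not in known:
                missing.add(row[i])
    return sorted(missing)


print(dangling_endpoints(VERTICES, EDGES))  # prints ['Jack Brown']
```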

Loading the graph

Assuming the vertex data file is named people.csv, the edge data file is named relationships.csv and both files are in the current directory, loading the graph from the PGX shell is done by the following API call:

# Both files are assumed to be in the current directory.
people_csv = "people.csv"
relationship_csv = "relationships.csv"
session.read_graph_files(
    people_csv,
    edge_file_paths=relationship_csv,
    graph_name="tutorial"
)

The graph_name argument specifies the name of the loaded graph.
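
The call above can be wrapped in a small helper that builds the file paths from a directory and fails early if a file is missing. This is only a sketch: load_tutorial_graph is a hypothetical name, session is assumed to be the same PGX session object used above, and the existence check is an added convenience rather than something PGX requires.

```python
from pathlib import Path


def load_tutorial_graph(session, directory="."):
    """Load people.csv and relationships.csv from `directory` as one graph."""
    base = Path(directory)
    people_csv = base / "people.csv"
    relationship_csv = base / "relationships.csv"

    # Fail with a clear error before contacting PGX if a file is missing.
    for path in (people_csv, relationship_csv):
        if not path.is_file():
            raise FileNotFoundError(path)

    # Same call as in this guide, with the paths passed as strings.
    return session.read_graph_files(
        str(people_csv),
        edge_file_paths=str(relationship_csv),
        graph_name="tutorial",
    )
```

With both files in the current directory, load_tutorial_graph(session) performs the same load as the call shown above.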

Partitioned graph example

It is also possible to load graphs with multiple vertex tables and edge tables. Consider another vertex file, universities.csv, which contains the name, location, and founding year of several universities.

The header for this file is very similar to the one for non-partitioned graphs. The only difference is that :VID takes a second argument, separated from the type by a semicolon, which specifies the table name:

name:VID(string;universities),location,founding_year:integer
"MIT","Boston, MA",1861
"Carnegie Mellon","Pittsburgh, PA",1900
"Stanford","Stanford, CA",1891
"UC Berkeley","Berkeley, CA",1868
...

The header doesn’t specify a property type for location, so the type defaults to string when the file is loaded. The people.csv file used above can be used as is in a partitioned graph. Its table name is inferred from the file name, so the data is loaded into a table named people.
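
The naming rule described above (an explicit table name in :VID, otherwise the file name) can be sketched as follows. This is an illustration under that stated reading, not PGX's actual inference logic, and vertex_table_name is a hypothetical helper.

```python
from pathlib import Path


def vertex_table_name(file_path, header_line):
    """Return the table name for a vertex file.

    Uses the second :VID argument when present, otherwise the file name
    without its extension.
    """
    for field in header_line.split(","):
        spec = field.strip().partition(":")[2]
        if spec.startswith("VID(") and spec.endswith(")"):
            args = [a.strip() for a in spec[4:-1].split(";")]
            if len(args) > 1 and args[1]:
                return args[1]
    return Path(file_path).stem


print(vertex_table_name(
    "universities.csv",
    "name:VID(string;universities),location,founding_year:integer",
))
print(vertex_table_name(
    "people.csv",
    "ssn:string,age:integer,name:VID(string),:LABEL",
))
```

The first call takes the name from the :VID argument; the second falls back to the file name and gives people.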

The edges of the partitioned graph are in the studiesAt.csv file. The file records who attends which university, along with the respective student ID numbers. These numbers are not loaded into the graph, because the :IGNORE keyword skips them. Unlike in non-partitioned graphs, the edge table header must state which table each end of an edge belongs to, by passing the table name as an argument to the :SRC and :DST keywords.

studentId:IGNORE,:SRC(people),:DST(universities)
792,"John Doe","MIT"
4289,"Jane Smith","Stanford"
...
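
To make the roles of :SRC, :DST, and :IGNORE in such a header explicit, the sketch below extracts the referenced tables and the skipped columns from an edge header. It is illustrative only: edge_header_summary is a hypothetical helper, not a PGX function.

```python
def edge_header_summary(header_line):
    """Summarise the endpoint tables and ignored columns of an edge header."""
    summary = {"src_table": None, "dst_table": None, "ignored": []}
    for field in header_line.split(","):
        name, _, spec = field.strip().partition(":")
        # Split KEYWORD(arg) into the keyword and its optional argument.
        keyword, _, arg = spec.partition("(")
        arg = arg.rstrip(")").strip()
        if keyword == "SRC":
            summary["src_table"] = arg or None
        elif keyword == "DST":
            summary["dst_table"] = arg or None
        elif keyword == "IGNORE":
            summary["ignored"].append(name)
    return summary


print(edge_header_summary("studentId:IGNORE,:SRC(people),:DST(universities)"))
```

For the header above, the summary names people as the source table, universities as the destination table, and studentId as the only ignored column. For a non-partitioned header such as :SRC,:LABEL,:DST,:EID, both table entries stay None.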

The graph is then loaded as follows, again assuming that the files are in the current directory:

# As above, the files are assumed to be in the current directory.
universities_csv = "universities.csv"
studiesat_csv = "studiesAt.csv"
session.read_graph_files(
    [people_csv, universities_csv],
    edge_file_paths=[studiesat_csv]
)