PGX can load graphs from certain file formats without a configuration file. This guide illustrates how to load a graph from CSV files in this way.
For a complete overview of the configuration detection capabilities of PGX, see the reference page.
Consider a CSV file containing personal data. Each row contains one record. The first column contains the person's social security number, the second their age, and the third their name. This file will be the vertex file.
To load the vertices into PGX, this file needs a header.
The header must mark the column holding the vertex IDs with the :VID keyword.
The other columns are loaded as properties, so their names must be suffixed with the type of the data they contain.
The annotated file looks as follows:
ssn:string,age:integer,name:VID(string),:LABEL
555-55-5555,45,"John Doe",Person
666-66-6666,29,"Jane Smith",Person
...
The :VID keyword needs to be parameterized with the type of the data contained in the column.
If a name is specified before the colon, as with name here, the column is also loaded as a property with that name.
The :LABEL keyword marks the column containing the vertex label.
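To make the header convention concrete, the following Java sketch splits a header line into columns and separates each column's optional name from its annotation at the first colon. It only illustrates how to read the header and is not PGX's own parser; the class VertexHeaderSketch and its methods are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the PGX API.
final class VertexHeaderSketch {
    /** One header column: the part before the first colon and the part after it. */
    static final class Column {
        final String name;        // property name, empty if none was given
        final String annotation;  // type or keyword, e.g. "integer" or "VID(string)"

        Column(String name, String annotation) {
            this.name = name;
            this.annotation = annotation;
        }
    }

    /** Splits "name:annotation" at the first colon; either side may be empty. */
    static Column parseColumn(String raw) {
        int colon = raw.indexOf(':');
        if (colon < 0) {
            return new Column(raw, ""); // no annotation given
        }
        return new Column(raw.substring(0, colon), raw.substring(colon + 1));
    }

    /** Parses a whole header line; header fields are assumed not to contain commas. */
    static List<Column> parseHeader(String headerLine) {
        List<Column> columns = new ArrayList<>();
        for (String raw : headerLine.split(",")) {
            columns.add(parseColumn(raw.trim()));
        }
        return columns;
    }

    public static void main(String[] args) {
        String header = "ssn:string,age:integer,name:VID(string),:LABEL";
        for (Column c : parseHeader(header)) {
            System.out.println("name='" + c.name + "' annotation='" + c.annotation + "'");
        }
    }
}
```

In this reading, the name column carries both a property name and the VID(string) annotation, while the :LABEL column has an empty name and is therefore not loaded as a named property.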
Consider also relationship data between people, in another CSV file. The first and third columns contain the people involved in the relationship, and the second column holds the relationship type. This file will be the edge table.
The two name columns will be the source and destination vertex columns, respectively, and the type column will be the edge label.
The source and destination columns are specified with the :SRC and :DST keywords, and the label with the :LABEL keyword.
The :EID keyword marks the column with edge IDs. The resulting file is:
:SRC,:LABEL,:DST,:EID
"John Doe",friendsWith,"Jane Smith",1
"Jane Smith",friendsWith,"John Doe",2
"John Doe",employs,"Jack Brown",3
...
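Because edges refer to vertices by their ID values, it can be useful to check before loading that every source and destination value also appears in the vertex ID column. The Java sketch below does this on in-memory lines with a minimal quote-aware CSV splitter. The class EdgeEndpointCheck and its methods are hypothetical, and the sketch makes no claim about how PGX itself treats unmatched endpoints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-load sanity check; not part of the PGX API.
final class EdgeEndpointCheck {
    /** Splits one CSV line on commas that are outside double quotes. */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char ch : line.toCharArray()) {
            if (ch == '"') {
                inQuotes = !inQuotes;            // quotes delimit a field and are dropped
            } else if (ch == ',' && !inQuotes) {
                fields.add(current.toString());  // field boundary
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Returns endpoint values used by edges that do not occur as vertex IDs. */
    static Set<String> danglingEndpoints(List<String> vertexLines, int vidCol,
                                         List<String> edgeLines, int srcCol, int dstCol) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : vertexLines.subList(1, vertexLines.size())) { // skip header
            ids.add(splitCsv(line).get(vidCol));
        }
        Set<String> dangling = new LinkedHashSet<>();
        for (String line : edgeLines.subList(1, edgeLines.size())) {     // skip header
            List<String> fields = splitCsv(line);
            if (!ids.contains(fields.get(srcCol))) dangling.add(fields.get(srcCol));
            if (!ids.contains(fields.get(dstCol))) dangling.add(fields.get(dstCol));
        }
        return dangling;
    }

    public static void main(String[] args) {
        List<String> people = Arrays.asList(
            "ssn:string,age:integer,name:VID(string),:LABEL",
            "555-55-5555,45,\"John Doe\",Person",
            "666-66-6666,29,\"Jane Smith\",Person");
        List<String> relationships = Arrays.asList(
            ":SRC,:LABEL,:DST,:EID",
            "\"John Doe\",friendsWith,\"Jane Smith\",1",
            "\"Jane Smith\",friendsWith,\"John Doe\",2",
            "\"John Doe\",employs,\"Jack Brown\",3");
        // VID is column 2 of the vertex file; SRC and DST are columns 0 and 2 of the edge file.
        System.out.println(danglingEndpoints(people, 2, relationships, 0, 2));
    }
}
```

Run on only the rows shown in the excerpts, the check reports Jack Brown, because the vertex excerpt is truncated and his row is among the elided ones.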
Assuming the vertex data file is named people.csv, the edge data file is named relationships.csv, and both files are in the current directory, the graph is loaded from the PGX shell with the following API call:
session.readGraphFiles("people.csv", "relationships.csv", "tutorial")
The third argument specifies the name of the loaded graph.
It is also possible to load graphs with multiple vertex tables and edge tables, referred to below as partitioned graphs.
Consider another vertex file, universities.csv, which contains the name, location, and founding year of several universities.
The header for this file is very similar to the one used for non-partitioned graphs.
The only difference is that :VID takes a second argument, separated from the type by a semicolon, that specifies the table name.
name:VID(string;universities),location,founding_year:integer
"MIT","Boston, MA",1861
"Carnegie Mellon","Pittsburgh, PA",1900
"Stanford","Stanford, CA",1891
"UC Berkeley","Berkeley, CA",1868
...
The header doesn't specify a property type for location, so it defaults to string at loading.
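The two rules just described, the two-argument form of :VID and the string default for untyped properties, can be sketched in Java as follows. This is an illustration of how to read such a header rather than PGX's implementation; PartitionedHeaderSketch and its methods are hypothetical names.

```java
// Hypothetical helper for illustration only; not part of the PGX API.
final class PartitionedHeaderSketch {
    /**
     * Returns the declared type of a plain property column such as
     * "founding_year:integer", or "string" when no type is given.
     * Not meant for keyword columns such as :VID or :LABEL.
     */
    static String propertyType(String headerColumn) {
        int colon = headerColumn.indexOf(':');
        return colon < 0 ? "string" : headerColumn.substring(colon + 1);
    }

    /**
     * Splits the arguments of "VID(type)" or "VID(type;table)".
     * Returns {type, table}; table is null when no table name is given.
     */
    static String[] vidArguments(String annotation) {
        int open = annotation.indexOf('(');
        int close = annotation.lastIndexOf(')');
        if (!annotation.startsWith("VID") || open < 0 || close < open) {
            throw new IllegalArgumentException("not a VID annotation: " + annotation);
        }
        String[] parts = annotation.substring(open + 1, close).split(";", 2);
        String table = parts.length > 1 ? parts[1].trim() : null;
        return new String[] { parts[0].trim(), table };
    }

    public static void main(String[] args) {
        String[] vid = vidArguments("VID(string;universities)");
        System.out.println("ID type: " + vid[0] + ", table: " + vid[1]);
        System.out.println("location -> " + propertyType("location"));
        System.out.println("founding_year -> " + propertyType("founding_year:integer"));
    }
}
```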
The people.csv file used above can be used as is in a partitioned graph.
The table name will be inferred from the file name, and the data will be loaded into a table named people.
The edges of the partitioned graph will be in the studiesAt.csv file.
The file contains information about who goes to which university, as well as the respective student ID numbers.
These numbers will not be loaded into the graph, as they will be skipped with the :IGNORE keyword.
Unlike for non-partitioned graphs, the edge table header must specify which table each end of the edges belongs to, by giving the table name as an argument to the :SRC and :DST keywords.
studentId:IGNORE,:SRC(people),:DST(universities)
792,"John Doe","MIT"
4289,"Jane Smith","Stanford"
...
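The link between vertex tables and edge endpoints can be checked with a short Java sketch. It assumes that an inferred table name is the file name without its directory and extension, which matches the people.csv example but is otherwise an assumption, and it extracts the table arguments of :SRC and :DST from the edge header. EndpointTableSketch and its methods are hypothetical names, not PGX API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the PGX API.
final class EndpointTableSketch {
    /** Assumed inference rule: file name without directory ('/' separated) and extension. */
    static String tableNameFromFile(String fileName) {
        String base = fileName.substring(fileName.lastIndexOf('/') + 1);
        int dot = base.lastIndexOf('.');
        return dot > 0 ? base.substring(0, dot) : base;
    }

    /** Extracts the table argument of a keyword, e.g. ":SRC(people)" gives "people". */
    static String endpointTable(String header, String keyword) {
        Matcher m = Pattern.compile(":" + keyword + "\\(([^)]*)\\)").matcher(header);
        return m.find() ? m.group(1) : null; // null when the keyword has no argument
    }

    public static void main(String[] args) {
        // people is inferred from people.csv; universities is declared in VID(string;universities).
        List<String> vertexTables = Arrays.asList(tableNameFromFile("people.csv"), "universities");
        String edgeHeader = "studentId:IGNORE,:SRC(people),:DST(universities)";
        for (String keyword : Arrays.asList("SRC", "DST")) {
            String table = endpointTable(edgeHeader, keyword);
            System.out.println(keyword + " -> " + table
                + " (known vertex table: " + vertexTables.contains(table) + ")");
        }
    }
}
```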
The graph is then loaded as follows, assuming, as above, that the files are in the current directory:
session.readGraphFiles(Arrays.asList("people.csv", "universities.csv"), Arrays.asList("studiesAt.csv"))
A complete description of the configuration detection capabilities of PGX is available on the reference page.