The following plain-text formats are supported:
PGX supports three types of vertex identifiers (IDs): integer, long and string. The type defaults to integer, but can be configured through the vertex_id_type option in the graph config.
Of the various formats and protocols supported by PGX, only CSV and flat file parsing support edge identifiers. For all other data sources, the ID of an edge is PGX's internal ID, which is an integer from zero to num_edges - 1.
string properties, spatial properties (currently only point2d) and temporal properties (date, local_date, time, timestamp, time_with_timezone and timestamp_with_timezone) must be quoted ("<string>") only if they contain a separator character (usually , for CSV and ' ' for Edge List and Adjacency List) or if they contain " or \n (see the CSV specifications).
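The quoting rule above can be sketched as a small helper. This is an illustration only, not PGX code: the function name quote_if_needed is hypothetical, and escaping an embedded quote by doubling it is an assumption taken from the usual CSV convention.

```python
def quote_if_needed(value: str, separator: str = ",") -> str:
    """Quote a value only if it contains the separator, a quote or a newline."""
    needs_quotes = separator in value or '"' in value or "\n" in value
    if not needs_quotes:
        return value
    # Assumption: embedded quotes are escaped by doubling them (CSV convention).
    return '"' + value.replace('"', '""') + '"'


print(quote_if_needed("Alice"))       # no special character, left unquoted
print(quote_if_needed("a b", " "))    # contains the separator, gets quoted
```

With a space as separator, as in Edge List and Adjacency List files, any value containing a space ends up quoted, which is why the date values in the examples on this page appear in quotes.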
date properties are parsed using Java's SimpleDateFormat utility, instantiated with the format string yyyy-MM-dd HH:mm:ss unless specified otherwise in the graph config.
All other types of temporal properties are parsed using Java's DateTimeFormatter utility.
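To show what the default date pattern accepts, here is a rough Python analogue. PGX itself uses Java's SimpleDateFormat; the strptime pattern below is an assumed counterpart of yyyy-MM-dd HH:mm:ss, and the function name parse_date is hypothetical. Java's parser may be more lenient than this sketch.

```python
from datetime import datetime

# Assumed Python counterpart of the Java pattern yyyy-MM-dd HH:mm:ss.
DEFAULT_DATE_PATTERN = "%Y-%m-%d %H:%M:%S"


def parse_date(text: str, pattern: str = DEFAULT_DATE_PATTERN) -> datetime:
    """Parse a date value whose surrounding quotes were already removed."""
    return datetime.strptime(text, pattern)


print(parse_date("1985-10-18 10:00:00"))
```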
A point2d value is specified by its longitude followed by its latitude, separated by a space. Both longitude and latitude are doubles. For example, "-74.0445 40.6892" represents the location of the Statue of Liberty as a point2d instance.
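A minimal sketch of reading such a value, assuming the optional surrounding quotes are still present. The function name parse_point2d is hypothetical and this is not how PGX implements it.

```python
from typing import Tuple


def parse_point2d(text: str) -> Tuple[float, float]:
    """Return (longitude, latitude) from a 'lon lat' string."""
    longitude_text, latitude_text = text.strip('"').split()
    return float(longitude_text), float(latitude_text)


longitude, latitude = parse_point2d('"-74.0445 40.6892"')
print(longitude, latitude)
```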
Boolean values are interpreted as true if the value is true (ignoring case), Y (ignoring case) or 1, and as false otherwise. The suggested notation for false is false (ignoring case), N (ignoring case) or 0.
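The boolean rule is simple enough to state in a few lines. The sketch below mirrors the description above; the function name parse_boolean is hypothetical.

```python
def parse_boolean(text: str) -> bool:
    """True for 'true', 'Y' (both ignoring case) or '1'; False for anything else."""
    return text.lower() in {"true", "y", "1"}


print(parse_boolean("TRUE"), parse_boolean("N"), parse_boolean("yes"))
```

Note that under this rule any unrecognized value, such as yes, is read as false rather than rejected, which is a good reason to stick to the suggested notations.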
All other types are parsed using the parseXXX() function of the corresponding Java type, e.g. Integer.parseInt(...) for integer types.
Vector properties are supported in the Adjacency List (ADJ_LIST), Comma-Separated Values (CSV), Edge List (EDGE_LIST), and Two Tables text (TWO_TABLES) formats. Vector properties with components of type integer, long, float and double can be loaded from these formats. To specify that a vertex or edge property is a vector property, set the dimension field of the graph property configuration to the dimension of the vector, which must be a strictly positive integer.
In the supported text formats, a vector value is represented by the list of its component values separated by the vector component delimiter. The default delimiter is ; and it can be changed through the vector_component_delimiter graph config entry.
For example, a 3-dimensional vector of doubles could look like 0.1;0.0004;3.14 in the text file if the vector component delimiter is ;.
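As a sketch of how such a value maps to a vector of doubles, the snippet below splits on the delimiter and checks the configured dimension. The function name parse_vector is hypothetical, and rejecting a value with the wrong number of components is an assumption made for this illustration, not documented PGX behavior.

```python
from typing import List


def parse_vector(text: str, dimension: int, delimiter: str = ";") -> List[float]:
    """Split a vector value into its double components."""
    if dimension <= 0:
        raise ValueError("dimension must be a strictly positive integer")
    components = [float(part) for part in text.split(delimiter)]
    # Assumption: a component count that differs from the dimension is an error.
    if len(components) != dimension:
        raise ValueError(f"expected {dimension} components, got {len(components)}")
    return components


print(parse_vector("0.1;0.0004;3.14", dimension=3))
```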
When using single-file formats, IDs and properties are separated by a tab or a single space ("\t ") by default; multiple-file formats use a comma (",") instead. The separator string is configurable.
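The following formats support parallel loading from multiple files:
- Comma-Separated Values (CSV) (vertex_uris and/or edge_uris)
- Adjacency List (ADJ_LIST) (uris)
- Edge List (EDGE_LIST) (uris)
- Two Tables text (TWO_TABLES) (vertex_uris and/or edge_uris)
- Flat File (vertex_uris and/or edge_uris)

The following abbreviations are used to specify text formats: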
For example, <V-2, VG-4> denotes the 4th neighbor of the 2nd vertex.
The CSV format is a text file format with vertices and edges stored in different files. Each line of the files represents a vertex or an edge. The vertex key and labels, the edge key, source, destination and label, and the attached properties are stored in the order specified by the file header (first line) and the configuration.
A graph with V vertices, having N vertex properties and K neighbors each, and E edges, having M edge properties, would be represented in CSV like this:
vertices.csv
    <V-1>,<VL-1>,<V-1, NP-1>,...,<V-1, NP-N>
    <V-2>,<VL-2>,<V-2, NP-1>,...,<V-2, NP-N>
    ...
    <V-V>,<VL-N>,<V-V, NP-1>,...,<V-V, NP-N>
edges.csv
    <E-1>,<V-1>,<V-1, VG-1>,<EL-1>,<E-1, EP-1>,...,<E-1, EP-M>
    ...
    <E-K>,<V-1>,<V-1, VG-K>,<EL-N>,<E-K, EP-1>,...,<E-K, EP-M>
    <E-K+1>,<V-2>,<V-2, VG-1>,<EL-N+1>,<E-K+1, EP-1>,...,<E-K+1, EP-M>
    ...
    <E-V*K>,<V-V>,<V-V, VG-K>,<EL-V*K>,<E-V*K, EP-1>,...,<E-V*K, EP-M>
Here is a graph with two vertices and two edges, with two properties each.
vertices.csv
    key,integer_prop,string_prop
    1,33,"Alice"
    2,42,"Bob"
edges.csv
    source,dest,integer_prop,string_prop
    1,2,0,"baz"
    2,2,-12,"bat"
The corresponding configuration file is as follows:
    {
      "format": "csv",
      "header": true,
      "vertex_id_column": "key",
      "edge_source_column": "source",
      "edge_destination_column": "dest",
      "vertex_uris": ["vertices.csv"],
      "edge_uris": ["edges.csv"],
      "vertex_props": [
        { "name": "integer_prop", "type": "integer" },
        { "name": "string_prop", "type": "string" }
      ],
      "edge_props": [
        { "name": "integer_prop", "type": "integer" },
        { "name": "string_prop", "type": "string" }
      ]
    }
See the graph configuration page for the complete configuration specification.
Here we load the same graph, but with no header included in the file.
vertices.csv
    1,33,"Alice"
    2,42,"Bob"
edges.csv
    1,2,0,"baz"
    2,2,-12,"bat"
The corresponding configuration file is as follows. The column indices are given in place of the column names.
    {
      "format": "csv",
      "header": false,
      "vertex_id_column": 1,
      "edge_source_column": 1,
      "edge_destination_column": 2,
      "vertex_uris": ["vertices.csv"],
      "edge_uris": ["edges.csv"],
      "vertex_props": [
        { "name": "integer_prop", "type": "integer", "column": 2 },
        { "name": "string_prop", "type": "string", "column": 3 }
      ],
      "edge_props": [
        { "name": "integer_prop", "type": "integer", "column": 3 },
        { "name": "string_prop", "type": "string", "column": 4 }
      ]
    }
If no column indices are set in the configuration file, the columns are assumed to be in the following order:
For vertex files:
- Vertex ID
- Vertex labels (if present)
- Vertex properties in the order they are declared in the configuration

For edge files:
- Edge ID (if present)
- Edge source
- Edge destination
- Edge label (if present)
- Edge properties in the order they are declared in the configuration
Therefore the previous configuration is equivalent to:
    {
      "format": "csv",
      "header": false,
      "vertex_uris": ["vertices.csv"],
      "edge_uris": ["edges.csv"],
      "vertex_props": [
        { "name": "integer_prop", "type": "integer" },
        { "name": "string_prop", "type": "string" }
      ],
      "edge_props": [
        { "name": "integer_prop", "type": "integer" },
        { "name": "string_prop", "type": "string" }
      ]
    }
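The default column order can be illustrated by reading the headerless example files with Python's csv module. This is a sketch under stated assumptions, not the PGX loader: the function names are hypothetical, vertex IDs are taken to be integers, and the files are assumed to contain no vertex labels, edge IDs or edge labels.

```python
import csv
import io

# Property declarations in configuration order: (name, converter).
VERTEX_PROPS = [("integer_prop", int), ("string_prop", str)]
EDGE_PROPS = [("integer_prop", int), ("string_prop", str)]


def _props(values, declared):
    """Pair each remaining column with the property declared at that position."""
    return {name: convert(value) for (name, convert), value in zip(declared, values)}


def read_vertices(text, declared=VERTEX_PROPS):
    """Default vertex order: ID, then properties as declared."""
    rows = csv.reader(io.StringIO(text))
    return {int(row[0]): _props(row[1:], declared) for row in rows}


def read_edges(text, declared=EDGE_PROPS):
    """Default edge order: source, destination, then properties as declared."""
    rows = csv.reader(io.StringIO(text))
    return [(int(row[0]), int(row[1]), _props(row[2:], declared)) for row in rows]


print(read_vertices('1,33,"Alice"\n2,42,"Bob"\n'))
print(read_edges('1,2,0,"baz"\n2,2,-12,"bat"\n'))
```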
The Adjacency List format is a text file format in which each line lists one vertex followed by its neighbors. The format is extended to encode properties. A graph with V vertices, having N vertex properties and M edge properties, would look like this:
    <V-1> <V-1, VP-1> ... <V-1, VP-N> <V-1, VG-1> <EP-1> ... <EP-M> <V-1, VG-2> <EP-1> ... <EP-M>
    <V-2> <V-2, VP-1> ... <V-2, VP-N> <V-2, VG-1> <EP-1> ... <EP-M> <V-2, VG-2> <EP-1> ... <EP-M>
    ...
    <V-V> <V-V, VP-1> ... <V-V, VP-N> <V-V, VG-1> <EP-1> ... <EP-M> <V-V, VG-2> <EP-1> ... <EP-M>
Trailing Separators
Trailing separators are treated as errors, because they instruct the parser to expect another property when none is given. For example, if whitespace is used to separate the properties, any trailing whitespace causes an exception to be raised.
Here is a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Adjacency List format:
    1 8.0 "foo"
    2 4.3 "bar" 1 false "1985-10-18 10:00:00"
    3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"
    4 17.78 "f00"
See the graph configuration examples page for a JSON configuration example.
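To make the layout of an Adjacency List line concrete, the following sketch splits one line into the vertex ID, its N vertex properties, and a list of neighbors with M edge properties each. It is an illustration, not the PGX parser: parse_adj_line is a hypothetical name, values are left as strings, and shlex is used as an approximate tokenizer (unlike PGX, it silently accepts trailing whitespace).

```python
import shlex


def parse_adj_line(line, num_vertex_props, num_edge_props):
    """Return (vertex_id, vertex_props, [(neighbor_id, edge_props), ...])."""
    tokens = shlex.split(line)  # keeps quoted values with spaces as one token
    vertex_id = tokens[0]
    vertex_props = tokens[1:1 + num_vertex_props]
    rest = tokens[1 + num_vertex_props:]
    step = 1 + num_edge_props  # one neighbor ID plus its edge properties
    if len(rest) % step != 0:
        raise ValueError("incomplete neighbor entry")
    edges = [(rest[i], rest[i + 1:i + step]) for i in range(0, len(rest), step)]
    return vertex_id, vertex_props, edges


line = '3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"'
print(parse_adj_line(line, num_vertex_props=2, num_edge_props=2))
```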
The Edge List format is a text file format starting with a section with one vertex per line, followed by a section with one edge per line. If a vertex does not have any labels or properties, it is possible to omit the vertex in the first section, but still specify edges for the vertex in the second section.
The grammar is as follows:
    EdgeList := {Vertex '\n'}* '\n' {Edge '\n'}*
    Vertex := VertexId '*' VertexLabels? PropertyValue*
    VertexId := Integer | Long | String
    VertexLabels := '{' String* '}'
    Edge := SrcVertex DstVertex EdgeLabel? PropertyValue*
    SrcVertex := VertexId
    DstVertex := VertexId
    EdgeLabel := String
    PropertyValue := Integer | Long | Double | Float | Boolean | String | Date
The vertices start with an identifier (VertexId), followed by a *, an optional set of vertex labels (VertexLabels?) and the vertex properties (PropertyValue*). A vertex identifier is either an Integer, a Long, or a String. Furthermore, vertex labels are zero or more Strings between curly braces ('{' String* '}').
The edges start with source and destination vertex identifiers (SrcVertex DstVertex), followed by an optional edge label (EdgeLabel?) and the edge properties (PropertyValue*). The edge label is a String.
Here is a graph with two vertices and two edges, with labels and properties:
    1 * { "Person" "Male" } "Mario" 15
    2 * { "Person" "Male" } "Luigi" 14
    1 2 "likes" 3.5
    2 1 "likes" 2.1
The two vertices (lines 1-2) have identifiers 1 and 2 and both have the labels "Person" and "Male", a string property ("Mario" and "Luigi") and an integer property (15 and 14). There is an edge from vertex 1 to vertex 2 (line 3) with label "likes" and a double property with value 3.5, and another edge from vertex 2 to vertex 1 with label "likes" and a double property with value 2.1.
A corresponding graph configuration is as follows:
    {
      "format": "edge_list",
      "uri": "example.edgelist",
      "vertex_id_type": "long",
      "vertex_labels": true,
      "edge_label": true,
      "vertex_props": [
        { "name": "name", "type": "string" },
        { "name": "age", "type": "int" }
      ],
      "edge_props": [
        { "name": "rating", "type": "double" }
      ],
      "loading_options": {
        "load_vertex_labels": true,
        "load_edge_label": true
      },
      "separator": " "
    }
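The Edge List grammar can be exercised with a short sketch that separates vertex lines (marked by *) from edge lines. It is not the PGX parser: parse_edge_list is a hypothetical name, values stay strings, and since an edge label cannot be told apart from a property without the graph config, everything after the two vertex IDs is returned together.

```python
import shlex


def parse_edge_list(text):
    """Return (vertices, edges) parsed from Edge List text."""
    vertices, edges = {}, []
    for line in text.splitlines():
        tokens = shlex.split(line)
        if not tokens:
            continue  # blank line between the vertex and edge sections
        if len(tokens) > 1 and tokens[1] == "*":
            rest, labels = tokens[2:], []
            if rest and rest[0] == "{":
                close = rest.index("}")
                labels, rest = rest[1:close], rest[close + 1:]
            vertices[tokens[0]] = {"labels": labels, "props": rest}
        else:
            # Label (if any) and properties are kept together in "rest".
            edges.append({"src": tokens[0], "dst": tokens[1], "rest": tokens[2:]})
    return vertices, edges


example = (
    '1 * { "Person" "Male" } "Mario" 15\n'
    '2 * { "Person" "Male" } "Luigi" 14\n'
    '1 2 "likes" 3.5\n'
    '2 1 "likes" 2.1\n'
)
print(parse_edge_list(example))
```

One limitation of this sketch: because the tokenizer drops quotes, a quoted string value equal to * or a brace would be misread. A real parser has to track quoting.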
When configured to use file as its datastore, the Two Tables format becomes a text file format similar to the Edge List format, with the difference that the vertices and edges are stored in two different files.
The vertices file contains vertex IDs followed by vertex properties. The edges file contains the source vertices and target vertices, followed by edge properties.
A graph with V vertices, having N vertex properties and M edge properties would be represented in two files like this:
vertices.ttt:
    <V-1> <V-1, NP-1> ... <V-1, NP-N>
    <V-2> <V-2, NP-1> ... <V-2, NP-N>
    ...
    <V-V> <V-V, NP-1> ... <V-V, NP-N>
edges.ttt:
    <V-1> <V-1, VG-1> <EP-1> ... <EP-M>
    <V-1> <V-1, VG-2> <EP-1> ... <EP-M>
    ...
    <V-V> <V-V, VG-1> <EP-1> ... <EP-M>
The following example shows the graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Two Tables Text format:
vertices.ttt:
    1 8.0 "foo"
    2 4.3 "bar"
    3 6.1 "bax"
    4 17.78 "f00"
edges.ttt:
    2 1 false "1985-10-18 10:00:00"
    3 2 true "1961-12-30 14:45:14"
    3 4 false "2001-01-15 07:00:43"
See the graph configuration examples page for a JSON configuration example.
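The Two Tables example describes the same graph as the Adjacency List example earlier on this page. The sketch below shows the relationship by appending each edge to the line of its source vertex. The function name is hypothetical, and the code assumes unquoted vertex IDs that contain no spaces.

```python
def two_tables_to_adj_list(vertex_lines, edge_lines):
    """Merge Two Tables vertex and edge lines into Adjacency List lines."""
    outgoing = {}
    for line in edge_lines:
        # The first token is the source vertex; the remainder is the
        # destination vertex followed by the edge properties.
        source, rest = line.split(" ", 1)
        outgoing.setdefault(source, []).append(rest)
    adjacency = []
    for line in vertex_lines:
        vertex_id = line.split(" ", 1)[0]
        adjacency.append(" ".join([line] + outgoing.get(vertex_id, [])))
    return adjacency


vertices = ['1 8.0 "foo"', '2 4.3 "bar"', '3 6.1 "bax"', '4 17.78 "f00"']
edges = [
    '2 1 false "1985-10-18 10:00:00"',
    '3 2 true "1961-12-30 14:45:14"',
    '3 4 false "2001-01-15 07:00:43"',
]
for adj_line in two_tables_to_adj_list(vertices, edges):
    print(adj_line)
```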
Edge List vs Adjacency List
ADJ_LIST is more space-efficient than EDGE_LIST. In EDGE_LIST, vertices are defined first and edges are listed afterwards, so each vertex identifier is repeated at least once. EDGE_LIST is considerably easier to read, but if you plan to store large graphs, consider using ADJ_LIST to save disk space.
The Flat File format is a text file format consisting of two description files, one for vertices and one for edges. Each file consists of a list of properties with the following format:
vertices.opv:
    vertex_ID, key_name, value_type, value, value, value
    <V-1> <V-1, VPK-1> <V-1, VPT-1> [<V-1, VP-1> <V-1, VP-1> <V-1, VP-1>]
    ...
    <V-1> <V-1, VPK-N> <V-1, VPT-N> [<V-1, VP-N> <V-1, VP-N> <V-1, VP-N>]
    <V-2> <V-2, VPK-1> <V-2, VPT-1> [<V-2, VP-1> <V-2, VP-1> <V-2, VP-1>]
    ...
    <V-2> <V-2, VPK-N> <V-2, VPT-N> [<V-2, VP-N> <V-2, VP-N> <V-2, VP-N>]
    ...
    <V-V> <V-V, VPK-N> <V-V, VPT-N> [<V-V, VP-N> <V-V, VP-N> <V-V, VP-N>]
edges.ope:
    edge_ID, source_vertex_ID, destination_vertex_ID, edge_label, key_name, value_type, value, value, value
    <E-1> <V-1, VG-1> <E-1, EL-1> <E-1, EPK-1> <E-1, EPT-1> [<E-1, EP-1> <E-1, EP-1> <E-1, EP-1>]
    ...
    <E-1> <V-N, VG-N> <E-1, EL-N> <E-1, EPK-N> <E-1, EPT-N> [<E-1, EP-N> <E-1, EP-N> <E-1, EP-N>]
    <E-2> <V-1, VG-1> <E-2, EL-1> <E-2, EPK-1> <E-2, EPT-1> [<E-2, EP-1> <E-2, EP-1> <E-2, EP-1>]
    ...
    <E-2> <V-N, VG-N> <E-2, EL-N> <E-2, EPK-N> <E-2, EPT-N> [<E-2, EP-N> <E-2, EP-N> <E-2, EP-N>]
    ...
    <E-E> <V-N, VG-N> <E-E, EL-N> <E-E, EPK-N> <E-E, EPT-N> [<E-E, EP-N> <E-E, EP-N> <E-E, EP-N>]
No properties
When no properties are defined for a certain vertex or edge, %20 is used instead of the key name:
Vertices: 1,%20,,,,
Edges: 1,2,1,"label",%20,,,,
Value fields
Values that are neither numeric nor dates go in the first value field, numeric values go in the second, and dates go in the third.
Value types
Mapping from PGX property type to flat file value_type
PGX property type | Flat file value_type |
---|---|
STRING | 1 |
INTEGER | 2 |
FLOAT | 3 |
DOUBLE | 4 |
DATE | 5 |
LOCAL_DATE | 5 |
TIME | 5 |
TIMESTAMP | 5 |
TIME_WITH_TIMEZONE | 5 |
TIMESTAMP_WITH_TIMEZONE | 5 |
BOOLEAN | 6 |
LONG | 7 |
POINT2D | 20 |
When loading a graph in flat file format into PGX, the graph config is used to find the right temporal or spatial type.
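Combining the value_type table with the value-field rule, a vertex line of a flat file could be assembled as sketched below. The names are hypothetical and this is not PGX code. Two assumptions go beyond the text above: all temporal types are placed in the third field like dates, and point2d values are placed in the first field. The value is expected to be already encoded as text.

```python
VALUE_TYPES = {
    "string": 1, "integer": 2, "float": 3, "double": 4,
    "date": 5, "local_date": 5, "time": 5, "timestamp": 5,
    "time_with_timezone": 5, "timestamp_with_timezone": 5,
    "boolean": 6, "long": 7, "point2d": 20,
}
NUMERIC_TYPES = {"integer", "float", "double", "long"}


def value_fields(pgx_type, encoded_value):
    """Place the value in the first, second or third value field."""
    if pgx_type in NUMERIC_TYPES:
        return ["", encoded_value, ""]
    if VALUE_TYPES[pgx_type] == 5:  # assumption: every temporal type goes third
        return ["", "", encoded_value]
    return [encoded_value, "", ""]


def vertex_line(vertex_id, key, pgx_type, encoded_value):
    """Build one vertices-file line: ID, key, value_type, then three value fields."""
    head = [str(vertex_id), key, str(VALUE_TYPES[pgx_type])]
    return ",".join(head + value_fields(pgx_type, encoded_value))


print(vertex_line(1, "doubleProp", "double", "8.0"))
print(vertex_line(1, "stringProp", "string", "foo"))
```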
Comma delimiter
The standard for the flat file format defines the comma as the only valid delimiter. Any delimiter set in the graph config is therefore ignored and a comma is used instead.
Quoted Strings
Strings must not be quoted. However, the following characters must be encoded:
- '%' -> '%25'
- '\t' -> '%09'
- ' ' -> '%20'
- '\n' -> '%0A'
- ',' -> '%2C'
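The character encoding above can be sketched as a pair of helper functions. The names are hypothetical. The one detail that matters is ordering: % has to be encoded first and decoded last, otherwise the % signs produced by the other replacements would be encoded a second time.

```python
# Order matters: '%' is encoded first and decoded last.
ENCODINGS = [("%", "%25"), ("\t", "%09"), (" ", "%20"), ("\n", "%0A"), (",", "%2C")]


def encode_flat_file_string(value: str) -> str:
    """Replace the reserved characters with their encoded form."""
    for raw, encoded in ENCODINGS:
        value = value.replace(raw, encoded)
    return value


def decode_flat_file_string(value: str) -> str:
    """Reverse the encoding, undoing '%25' last."""
    for raw, encoded in reversed(ENCODINGS):
        value = value.replace(encoded, raw)
    return value


print(encode_flat_file_string("1985-10-18 10:00:00"))
print(decode_flat_file_string("100%25%2C%20ok"))
```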
Storing
When storing a graph into flat file format, vertex labels will be ignored. Also, when a graph has no edge label, an empty string ("") will be stored instead.
Parallel loading
When loading a graph in parallel using the flat file format, all information regarding a specific vertex or edge must be contained in the same partition; otherwise unexpected behavior might occur.
The following example shows a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Flat File format:
vertices.opv:
    1,doubleProp,4,,8.0,
    1,stringProp,1,foo,,
    2,doubleProp,4,,4.3,
    2,stringProp,1,bar,,
    3,doubleProp,4,,6.1,
    3,stringProp,1,bax,,
    4,doubleProp,4,,17.78,
    4,stringProp,1,f00,,
edges.ope:
    1,2,1,label,boolProp,6,false,,
    1,2,1,label,dateProp,5,,,1985-10-18%2010:00:00
    2,3,2,label,boolProp,6,true,,
    2,3,2,label,dateProp,5,,,1961-12-30%2014:45:14
    3,3,4,label,boolProp,6,false,,
    3,3,4,label,dateProp,5,,,2001-01-15%2007:00:43
See the graph configuration examples page for a JSON configuration example.