PGX 2.6.1
Documentation

Plain-Text Formats

Common Facts

How vertices are parsed

PGX supports three types of vertex identifiers (IDs): integer, long and string. The type can be configured in the graph config and defaults to integer.

How edges are parsed

Of the various formats and protocols supported by PGX, only flat file parsing supports edge identifiers. For all other data sources, the id of an edge is PGX's internal id, which is an integer from zero to num_edges - 1.

How properties are parsed

string properties, spatial properties (currently only point2d) and temporal properties (date, local_date, time, timestamp, time_with_timezone and timestamp_with_timezone) must be quoted: "<string>".

date properties are parsed using Java's SimpleDateFormat utility, instantiated with the format string yyyy-MM-dd HH:mm:ss unless specified otherwise in the graph config. All other types of temporal properties are parsed using Java's DateTimeFormatter utility.
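As a rough illustration of the default pattern, the Python sketch below parses a quoted date value with what is assumed to be the strptime equivalent of yyyy-MM-dd HH:mm:ss. It is not PGX code (PGX uses Java's SimpleDateFormat), and the names DEFAULT_DATE_FORMAT and parse_date are hypothetical.

```python
from datetime import datetime

# Assumed strptime equivalent of the Java pattern "yyyy-MM-dd HH:mm:ss".
DEFAULT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_date(quoted_value, fmt=DEFAULT_DATE_FORMAT):
    """Parse a quoted date property such as "1985-10-18 10:00:00"."""
    is_quoted = (len(quoted_value) >= 2
                 and quoted_value.startswith('"')
                 and quoted_value.endswith('"'))
    if not is_quoted:
        raise ValueError("date properties must be quoted: %r" % quoted_value)
    # Strip the surrounding quotes before handing the text to strptime.
    return datetime.strptime(quoted_value[1:-1], fmt)

parsed = parse_date('"1985-10-18 10:00:00"')
```

A different pattern set in the graph config would correspond to passing another fmt value here.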

point2d can be specified by its longitude followed by its latitude, separated by a space. Both longitude and latitude are doubles. As an example, "-74.0445 40.6892" is the representation of a point2d instance representing the location of the Statue of Liberty.
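The point2d notation can be read as in the following Python sketch, which only illustrates the documented layout (longitude, one space, latitude, all inside quotes). The function name parse_point2d is hypothetical and is not part of PGX.

```python
def parse_point2d(quoted_value):
    """Return (longitude, latitude) from a quoted "<longitude> <latitude>" string."""
    is_quoted = (len(quoted_value) >= 2
                 and quoted_value.startswith('"')
                 and quoted_value.endswith('"'))
    if not is_quoted:
        raise ValueError("point2d properties must be quoted: %r" % quoted_value)
    parts = quoted_value[1:-1].split(" ")
    if len(parts) != 2:
        raise ValueError("expected longitude and latitude: %r" % quoted_value)
    # Both coordinates are doubles; longitude comes first.
    return float(parts[0]), float(parts[1])

statue_of_liberty = parse_point2d('"-74.0445 40.6892"')
```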

Boolean values are interpreted as true if the value is true (ignoring case), Y (ignoring case) or 1, and as false otherwise. The suggested notation for false is false (ignoring case), N (ignoring case) or 0. All other types are parsed using the parseXXX() functions of their corresponding Java types, e.g. Integer.parseInt(...) for integer types.
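The boolean rule above can be sketched in a few lines of Python. This is an illustration of the documented behavior only, and parse_boolean is a hypothetical name.

```python
def parse_boolean(text):
    """True for "true" or "Y" (any case) or "1"; every other value is False."""
    return text.lower() in ("true", "y") or text == "1"

accepted = [parse_boolean(v) for v in ("true", "TRUE", "y", "Y", "1")]
rejected = [parse_boolean(v) for v in ("false", "N", "0", "yes", "")]
```

Note that, under this rule, an unexpected value such as yes is silently read as false rather than rejected, which is why the suggested notations for false are worth following.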

Separators

In single-file formats, IDs and properties are separated by a tab or a single space ("\t ") by default; multiple-file formats use a comma (",") instead. The separator string can be changed in the graph config.

Parallel Loading

The following formats support parallel loading from multiple files:

  • Flat File (specify multiple files in vertex_uris and/or edge_uris)
  • Two Tables (specify multiple files in vertex_uris and/or edge_uris)
  • Adjacency List (specify multiple files in uris)
  • Edge List (specify multiple files in uris)

Legend

The following abbreviations are used to specify text formats:

  • V = Vertex Key
  • VG = Neighbor Vertex
  • VP = Vertex Property
  • VPK = Vertex Property Key
  • VPT = Vertex Property Type
  • EL = Edge Label
  • EP = Edge Property
  • EPK = Edge Property Key
  • EPT = Edge Property Type

For example, <V-2, VG-4> denotes the 4th neighbor of the 2nd vertex.


Adjacency List (ADJ_LIST)

The Adjacency List format is a text file format in which each line describes one vertex followed by the list of its neighbors. The format is extended to encode properties. A graph with V vertices, having N vertex properties and M edge properties, would look like this:

<V-1> <V-1, VP-1> ... <V-1, VP-N> <V-1, VG-1> <EP-1> ... <EP-M> <V-1, VG-2> <EP-1> ... <EP-M>
<V-2> <V-2, VP-1> ... <V-2, VP-N> <V-2, VG-1> <EP-1> ... <EP-M> <V-2, VG-2> <EP-1> ... <EP-M>
...
<V-V> <V-V, VP-1> ... <V-V, VP-N> <V-V, VG-1> <EP-1> ... <EP-M> <V-V, VG-2> <EP-1> ... <EP-M>

Trailing Separators

Trailing separators are treated as errors, because they instruct the parser to expect another property when none is given. For example, if whitespace is used to separate the properties, any trailing whitespace causes an exception to be raised.

Example

Here is a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Adjacency List format:

1 8.0 "foo"
2 4.3 "bar" 1 false "1985-10-18 10:00:00"
3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"
4 17.78 "f00"

See the graph configuration examples page for a JSON configuration example.
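To make the line layout concrete, the following Python sketch splits one Adjacency List line into its vertex ID, vertex properties and outgoing edges. It is not PGX's parser: parse_adj_line is a hypothetical helper, it assumes whitespace separators and that the line terminator has already been removed, and it relies on shlex.split to keep quoted values together.

```python
import shlex

def parse_adj_line(line, num_vertex_props, num_edge_props):
    """Split one ADJ_LIST line into (vertex_id, vertex_props, edges).

    Values are returned as strings with their quotes removed; converting
    them to typed values is left out of this sketch.
    """
    # Mirror the documented rule: a trailing separator is an error.
    if line != line.rstrip(" \t"):
        raise ValueError("trailing separator in line: %r" % line)
    tokens = shlex.split(line)  # whitespace split that keeps "quoted text" together
    if not tokens:
        raise ValueError("empty line")
    vertex_id = tokens[0]
    vertex_props = tokens[1:1 + num_vertex_props]
    rest = tokens[1 + num_vertex_props:]
    group = 1 + num_edge_props  # neighbor ID plus its edge properties
    if len(vertex_props) != num_vertex_props or len(rest) % group != 0:
        raise ValueError("unexpected number of fields in line: %r" % line)
    edges = [(rest[i], rest[i + 1:i + group]) for i in range(0, len(rest), group)]
    return vertex_id, vertex_props, edges

example = parse_adj_line(
    '3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"', 2, 2)
```

With the example line for vertex 3, the result contains two edges, one to vertex 2 and one to vertex 4, each carrying a boolean and a date value.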


Edge List (EDGE_LIST)

The Edge List format is a text file format starting with a section with one vertex per line, followed by a section with one edge per line. If a vertex does not have any labels or properties, it is possible to omit the vertex in the first section, but still specify edges for the vertex in the second section.

The grammar is as follows:

EdgeList      := {Vertex '\n'}* '\n' {Edge '\n'}*

Vertex        := VertexId '*' VertexLabels? PropertyValue*
VertexId      := Integer | Long | String
VertexLabels  := '{' String* '}'

Edge          := SrcVertex DstVertex EdgeLabel? PropertyValue*
SrcVertex     := VertexId
DstVertex     := VertexId
EdgeLabel     := String

PropertyValue := Integer | Long | Double | Float | Boolean | String | Date

The vertices start with an identifier (VertexId), followed by a *, an optional set of vertex labels (VertexLabels?) and the vertex properties (PropertyValue*). A vertex identifier is either an Integer, a Long, or a String. Furthermore, vertex labels are zero or more Strings between curly braces ('{' String* '}').

The edges start with source and destination vertex identifiers (SrcVertex DstVertex), followed by an optional edge label (EdgeLabel?) and the edge properties (PropertyValue*). The edge label is a String.

Example

Here is a graph with two vertices and two edges, with labels and properties:

1 * { "Person" "Male" } "Mario" 15
2 * { "Person" "Male" } "Luigi" 14
1 2 "likes" 3.5
2 1 "likes" 2.1

The two vertices (lines 1-2) have identifiers 1 and 2 and both have the labels "Person" and "Male", a string property ("Mario" and "Luigi") and an integer property (15 and 14). There is an edge from vertex 1 to vertex 2 (line 3) with label "likes" and a double property with value 3.5, and another edge from vertex 2 to vertex 1 with label "likes" and a double property with value 2.1.

A corresponding graph configuration is as follows:

{
  "format":"edge_list",
  "uri":"example.edgelist",
  "vertex_id_type":"long",
  "vertex_labels":true,
  "edge_label":true,
  "vertex_props":[
    {
      "name":"name",
      "type":"string"
    },
    {
      "name":"age",
      "type":"int"
    }
  ],
  "edge_props":[
    {
      "name":"rating",
      "type":"double"
    }
  ],
  "loading": {
    "load_vertex_labels":true,
    "load_edge_label":true
  },
  "separator":" "
}
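The grammar can be exercised with a small Python sketch that classifies a line as a vertex or an edge and pulls out labels and property values. This is an illustration under assumptions, not PGX's parser: parse_edge_list_line and its has_edge_label flag are hypothetical, whitespace separators are assumed, and because shlex.split removes quotes, a quoted value that is literally * or { would be misread.

```python
import shlex

def parse_edge_list_line(line, has_edge_label=True):
    """Classify one EDGE_LIST line and return its parts as strings."""
    tokens = shlex.split(line)
    if len(tokens) < 2:
        raise ValueError("not a vertex or edge line: %r" % line)
    if tokens[1] == "*":
        # Vertex := VertexId '*' VertexLabels? PropertyValue*
        labels, rest = [], tokens[2:]
        if rest and rest[0] == "{":
            close = rest.index("}")  # raises ValueError if the brace is missing
            labels, rest = rest[1:close], rest[close + 1:]
        return ("vertex", tokens[0], labels, rest)
    # Edge := SrcVertex DstVertex EdgeLabel? PropertyValue*
    rest = tokens[2:]
    label = rest.pop(0) if has_edge_label and rest else None
    return ("edge", tokens[0], tokens[1], label, rest)

vertex = parse_edge_list_line('1 * { "Person" "Male" } "Mario" 15')
edge = parse_edge_list_line('1 2 "likes" 3.5')
```

The has_edge_label flag plays the role that "edge_label" plays in the configuration above: without it, the grammar alone cannot tell an edge label from a first string property.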

Two Tables (TWO_TABLES)

When configured to use files as its datastore, the Two Tables format is a text file format similar to the Edge List format, with the difference that the vertices and edges are stored in two different files. The vertices file contains vertex IDs followed by vertex properties. The edges file contains the source vertices and target vertices, followed by edge properties.

A graph with V vertices, having N vertex properties and M edge properties would be represented in two files like this:

vertices.ttt:

<V-1> <V-1, VP-1> ... <V-1, VP-N>
<V-2> <V-2, VP-1> ... <V-2, VP-N>
...
<V-V> <V-V, VP-1> ... <V-V, VP-N>

edges.ttt:

<V-1> <V-1, VG-1> <EP-1> ... <EP-M>
<V-1> <V-1, VG-2> <EP-1> ... <EP-M>
...
<V-V> <V-V, VG-1> <EP-1> ... <EP-M>

Example

The following example shows the graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Two Tables Text format:

vertices.ttt:

1 8.0 "foo"
2 4.3 "bar"
3 6.1 "bax"
4 17.78 "f00"

edges.ttt:

2 1 false "1985-10-18 10:00:00"
3 2 true "1961-12-30 14:45:14"
3 4 false "2001-01-15 07:00:43"

See the graph configuration examples page for a JSON configuration example.

Edge List vs Adjacency List

ADJ_LIST is more space efficient than EDGE_LIST. In EDGE_LIST, vertices are defined first and edges are created afterwards, so each vertex ID is repeated at least once. EDGE_LIST is a lot easier to read, but if you plan to store big graphs you should consider ADJ_LIST in order to save disk space.

Flat File (FLAT_FILE)

The Flat File format is a text file format consisting of two description files, one for vertices and one for edges. Each file consists of a list of properties with the following format:

vertices.opv:

vertex_ID, key_name, value_type, value, value, value

<V-1> <V-1, VPK-1> <V-1, VPT-1> [<V-1, VP-1> <V-1, VP-1> <V-1, VP-1>]
...
<V-1> <V-1, VPK-N> <V-1, VPT-N> [<V-1, VP-N> <V-1, VP-N> <V-1, VP-N>]
<V-2> <V-2, VPK-1> <V-2, VPT-1> [<V-2, VP-1> <V-2, VP-1> <V-2, VP-1>]
...
<V-2> <V-2, VPK-N> <V-2, VPT-N> [<V-2, VP-N> <V-2, VP-N> <V-2, VP-N>]
...
<V-V> <V-V, VPK-N> <V-V, VPT-N> [<V-V, VP-N> <V-V, VP-N> <V-V, VP-N>]

edges.ope:

edge_ID, source_vertex_ID, destination_vertex_ID, edge_label, key_name, value_type, value, value, value


<E-1> <V-1, VG-1> <E-1, EL-1> <E-1, EPK-1> <E-1, EPT-1> [<E-1, EP-1> <E-1, EP-1> <E-1, EP-1>]
...
<E-1> <V-N, VG-N> <E-1, EL-N> <E-1, EPK-N> <E-1, EPT-N> [<E-1, EP-N> <E-1, EP-N> <E-1, EP-N>]
<E-2> <V-1, VG-1> <E-2, EL-1> <E-2, EPK-1> <E-2, EPT-1> [<E-2, EP-1> <E-2, EP-1> <E-2, EP-1>]
...
<E-2> <V-N, VG-N> <E-2, EL-N> <E-2, EPK-N> <E-2, EPT-N> [<E-2, EP-N> <E-2, EP-N> <E-2, EP-N>]
...
<E-E> <V-N, VG-N> <E-E, EL-N> <E-E, EPK-N> <E-E, EPT-N> [<E-E, EP-N> <E-E, EP-N> <E-E, EP-N>]

No properties

When no properties are defined for a certain vertex or edge, %20 is used instead of the key name:

Vertices: 1,%20,,,,
Edges: 1,2,1,"label",%20,,,,

Value fields

Values that are neither numeric nor dates go in the first value field; numeric values go in the second, and dates in the third.

Value types

Mapping from PGX property type to flat file value_type:

PGX property type          Flat file value_type
STRING                     1
INTEGER                    2
FLOAT                      3
DOUBLE                     4
DATE                       5
LOCAL_DATE                 5
TIME                       5
TIMESTAMP                  5
TIME_WITH_TIMEZONE         5
TIMESTAMP_WITH_TIMEZONE    5
BOOLEAN                    6
LONG                       7
POINT2D                    20

When loading a graph in flat file format into PGX, the graph config is used to find the right temporal or spatial type.
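The value-field and value-type rules can be combined in a short Python sketch that builds the last five fields of a flat file row (key_name, value_type and the three value fields). The names VALUE_TYPE, NUMERIC_TYPES and property_fields are hypothetical. The sketch also assumes that every type mapped to value_type 5 is placed in the third value field; the text above only states this for dates.

```python
# PGX property type -> flat file value_type, taken from the table above.
VALUE_TYPE = {
    "STRING": 1, "INTEGER": 2, "FLOAT": 3, "DOUBLE": 4,
    "DATE": 5, "LOCAL_DATE": 5, "TIME": 5, "TIMESTAMP": 5,
    "TIME_WITH_TIMEZONE": 5, "TIMESTAMP_WITH_TIMEZONE": 5,
    "BOOLEAN": 6, "LONG": 7, "POINT2D": 20,
}
NUMERIC_TYPES = {"INTEGER", "FLOAT", "DOUBLE", "LONG"}

def property_fields(key_name, pgx_type, encoded_value):
    """Return "key_name,value_type,value,value,value" for one property."""
    code = VALUE_TYPE[pgx_type]
    values = ["", "", ""]
    if pgx_type in NUMERIC_TYPES:
        values[1] = encoded_value   # numeric values: second field
    elif code == 5:
        values[2] = encoded_value   # assumption: all value_type 5 types use the third field
    else:
        values[0] = encoded_value   # everything else: first field
    return ",".join([key_name, str(code)] + values)

vertex_row = "1," + property_fields("doubleProp", "DOUBLE", "8.0")
edge_row = "1,2,1,label," + property_fields("dateProp", "DATE", "1985-10-18%2010:00:00")
```

The caller prefixes the vertex ID, or the edge ID, source, destination and label, so the same helper serves both files.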

Comma delimiter

The standard for the flat file format defines the comma as the only valid delimiter; therefore, any delimiter set in the graph config is ignored and a comma is used instead.

Quoted Strings

Strings must not be quoted; however, the following characters must be encoded:

  • '%' -> '%25'
  • '\t' -> '%09'
  • ' ' -> '%20'
  • '\n' -> '%0A'
  • ',' -> '%2C'
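The substitutions listed above can be applied as in the following Python sketch. The helper names encode_flat_file and decode_flat_file are hypothetical. The one ordering constraint worth noting is that '%' has to be encoded first (and decoded last), otherwise the percent signs introduced by the other substitutions would be encoded a second time.

```python
# (raw character, encoded form) pairs from the list above; '%' must come first.
ENCODING = [("%", "%25"), ("\t", "%09"), (" ", "%20"), ("\n", "%0A"), (",", "%2C")]

def encode_flat_file(text):
    """Encode a string value for use in a flat file field."""
    for raw, encoded in ENCODING:
        text = text.replace(raw, encoded)
    return text

def decode_flat_file(text):
    """Reverse encode_flat_file; '%25' is restored last."""
    for raw, encoded in reversed(ENCODING):
        text = text.replace(encoded, raw)
    return text

encoded_date = encode_flat_file("1985-10-18 10:00:00")
```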

Storing

When storing a graph into flat file format, vertex labels will be ignored. Also, when a graph has no edge label, an empty string ("") will be stored instead.

Parallel loading

When loading a graph in parallel using flat file format, all information regarding a specific vertex or edge must be contained in the same partition; otherwise, unexpected behavior might occur.
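One way to satisfy this constraint when splitting an existing file is to choose the partition from the row's ID, so that every row for a given vertex or edge lands in the same place. The Python sketch below shows the idea; partition_rows is a hypothetical helper, the use of a CRC32 hash is an arbitrary choice, and writing each partition to its own file is left out.

```python
import zlib

def partition_rows(rows, num_partitions):
    """Group flat file rows so that rows sharing an ID share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        entity_id = row.split(",", 1)[0]  # vertex_ID or edge_ID is the first field
        # A stable hash of the ID picks the partition, so equal IDs always collide.
        index = zlib.crc32(entity_id.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

vertex_rows = [
    "1,doubleProp,4,,8.0,", "1,stringProp,1,foo,,",
    "2,doubleProp,4,,4.3,", "2,stringProp,1,bar,,",
    "3,doubleProp,4,,6.1,", "3,stringProp,1,bax,,",
    "4,doubleProp,4,,17.78,", "4,stringProp,1,f00,,",
]
partitions = partition_rows(vertex_rows, 3)
```

Each resulting partition would then be stored as a separate file and listed in vertex_uris (or edge_uris for edge rows).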

Example

The following example shows a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Flat File format:

vertices.opv:

1,doubleProp,4,,8.0,
1,stringProp,1,foo,,
2,doubleProp,4,,4.3,
2,stringProp,1,bar,,
3,doubleProp,4,,6.1,
3,stringProp,1,bax,,
4,doubleProp,4,,17.78,
4,stringProp,1,f00,,

edges.ope:

1,2,1,label,boolProp,6,false,,
1,2,1,label,dateProp,5,,,1985-10-18%2010:00:00
2,3,2,label,boolProp,6,true,,
2,3,2,label,dateProp,5,,,1961-12-30%2014:45:14
3,3,4,label,boolProp,6,false,,
3,3,4,label,dateProp,5,,,2001-01-15%2007:00:43

See the graph configuration examples page for a JSON configuration example.