PGX 1.2.0
Documentation

Plain-Text Formats

Common Facts

How vertices are parsed

PGX supports three types of vertex identifiers (IDs): integer, long and string. The type can be configured in the graph config and defaults to integer.

How edges are parsed

Of the various formats and protocols supported by PGX, only flat file parsing supports edge identifiers. For all other data sources, the id of an edge is PGX's internal id, which is an integer from zero to num_edges - 1.

How properties are parsed

Both string and date properties must be quoted: "<string>". Date properties are parsed using Java's SimpleDateFormat utility, instantiated with the format string yyyy-MM-dd HH:mm:ss. All other types are parsed using the parseXXX() function of their corresponding Java type, e.g. Boolean.parseBoolean(...) for boolean types or Integer.parseInt(...) for integer types.
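The parsing rules above can be sketched in Python. This is only an illustration, not PGX's parser: the function names and the type labels ("string", "date", "boolean", "integer", "long", "float", "double") are assumptions, the strptime pattern is an assumed equivalent of the Java format string, and Python's int() and float() are more lenient than their Java counterparts.

```python
from datetime import datetime

# Assumed Python equivalent of the Java pattern "yyyy-MM-dd HH:mm:ss".
DATE_PATTERN = "%Y-%m-%d %H:%M:%S"


def unquote(token):
    """Strip the mandatory double quotes from a string or date token."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError(f"expected a quoted value, got {token!r}")
    return token[1:-1]


def parse_property(token, prop_type):
    """Convert one raw token to a Python value (hypothetical helper)."""
    if prop_type == "string":
        return unquote(token)
    if prop_type == "date":
        return datetime.strptime(unquote(token), DATE_PATTERN)
    if prop_type == "boolean":
        # Boolean.parseBoolean yields true only for "true", ignoring case.
        return token.lower() == "true"
    if prop_type in ("integer", "long"):
        return int(token)
    if prop_type in ("float", "double"):
        return float(token)
    raise ValueError(f"unsupported property type: {prop_type}")
```

Note that, following Boolean.parseBoolean, any token other than "true" (in any letter case) becomes false instead of raising an error.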

Separators

By default, IDs and properties are separated by a single space. The separator string is configurable.
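As a rough sketch of what splitting on a configurable separator involves, the following Python function breaks a line into tokens while leaving quoted values intact, so that a quoted date containing a space stays in one piece. The function name and its behavior are assumptions for illustration; PGX's actual tokenizer may differ.

```python
def tokenize(line, separator=" "):
    """Split a line on the separator, ignoring separators inside quotes."""
    tokens, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '"':
            # Toggle quoted mode and keep the quote as part of the token.
            in_quotes = not in_quotes
            current.append(ch)
            i += 1
        elif not in_quotes and line.startswith(separator, i):
            # A separator outside quotes ends the current token.
            tokens.append("".join(current))
            current = []
            i += len(separator)
        else:
            current.append(ch)
            i += 1
    tokens.append("".join(current))
    return tokens
```

With this approach a trailing separator produces an empty final token, which a caller can report as a missing property.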

Legend

The following abbreviations are used to specify text formats:

  • V = Vertex Key
  • VG = Neighbor Vertex
  • VP = Vertex Property
  • VPK = Vertex Property Key
  • VPT = Vertex Property Type
  • EL = Edge Label
  • EP = Edge Property
  • EPK = Edge Property Key
  • EPT = Edge Property Type

For example, <V-2, VG-4> denotes the 4th neighbor of the 2nd vertex.


Adjacency List (ADJ_LIST)

The Adjacency List format is a text file format in which each line contains a vertex followed by the list of its neighbors. The format is extended to encode properties. A graph with V vertices, having N vertex properties and M edge properties, would look like this:

<V-1> <V-1, VP-1> ... <V-1, VP-N> <V-1, VG-1> <EP-1> ... <EP-M> <V-1, VG-2> <EP-1> ... <EP-M>
<V-2> <V-2, VP-1> ... <V-2, VP-N> <V-2, VG-1> <EP-1> ... <EP-M> <V-2, VG-2> <EP-1> ... <EP-M>
...
<V-V> <V-V, VP-1> ... <V-V, VP-N> <V-V, VG-1> <EP-1> ... <EP-M> <V-V, VG-2> <EP-1> ... <EP-M>

Trailing Separators

Trailing separators are treated as errors, because they instruct the parser to expect another property but none is given. For example, if whitespace is used to separate the properties, any trailing whitespace causes an exception to be raised.

Example

Here is a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Adjacency List format:

1 8.0 "foo"
2 4.3 "bar" 1 false "1985-10-18 10:00:00"
3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"
4 17.78 "f00"
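To make the line layout concrete, here is a minimal Python sketch that splits one Adjacency List line into the vertex ID, its vertex properties and its outgoing edges. It assumes the default single-space separator and that the caller knows the property counts N and M; the function name and the returned tuple shape are hypothetical, and property values are left as raw tokens.

```python
import re

# A token is either a double-quoted value or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def parse_adj_line(line, n_vertex_props, n_edge_props):
    """Return (vertex, vertex_props, edges) for one ADJ_LIST line."""
    if line != line.rstrip():
        # Mirrors the rule that trailing separators are errors.
        raise ValueError("trailing separator")
    tokens = TOKEN.findall(line)
    if not tokens:
        raise ValueError("empty line")
    vertex = tokens[0]
    vertex_props = tokens[1:1 + n_vertex_props]
    rest = tokens[1 + n_vertex_props:]
    step = 1 + n_edge_props  # one neighbor ID plus its edge properties
    if len(vertex_props) != n_vertex_props or len(rest) % step != 0:
        raise ValueError("unexpected number of fields")
    edges = [
        (vertex, rest[i], rest[i + 1:i + step])
        for i in range(0, len(rest), step)
    ]
    return vertex, vertex_props, edges
```

Applied to the third line of the example with N = 2 and M = 2, this yields vertex 3, two vertex properties, and two edges (to vertices 2 and 4).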

Edge List (EDGE_LIST)

The Edge List format is a text file format with two sections. The first section contains one vertex per line: the vertex ID, followed by *, followed by the vertex properties. The second section contains one edge per line: the source and destination vertex, followed by the edge properties. The format is extended to encode properties. Note that declaring a vertex in the first section is optional.

SNAP data set

Most of the data sets available for download on the SNAP website are encoded in Edge List format and can be loaded into PGX directly.

A graph with V vertices, having N vertex properties and M edge properties would look like this:

<V-1> * <V-1, VP-1> ... <V-1, VP-N>
<V-2> * <V-2, VP-1> ... <V-2, VP-N>
...
<V-V> * <V-V, VP-1> ... <V-V, VP-N>
<V-1> <V-1, VG-1> <EP-1> ... <EP-M>
<V-1> <V-1, VG-2> <EP-1> ... <EP-M>
...
<V-V> <V-V, VG-1> <EP-1> ... <EP-M>
...

File format limitation

PGX 1.2.0 requires the edge entries in the file to be grouped by source vertex: all edges with the same source vertex must appear on consecutive lines. This restriction will be removed in future versions.
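A file can be checked against this grouping requirement before loading. The following Python sketch takes the sequence of source vertex IDs in file order and reports whether every source forms one contiguous run; the function name is hypothetical and the check is not part of PGX.

```python
def is_grouped_by_source(sources):
    """Return True if equal source IDs only ever appear consecutively."""
    seen = set()
    previous = None
    for src in sources:
        if src != previous:
            if src in seen:
                # This source already had a run earlier in the file.
                return False
            seen.add(src)
            previous = src
    return True
```

For the edge order 2, 3, 3 the check passes, whereas 3, 2, 3 fails because the edges of vertex 3 are split into two runs.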

Example

Here is a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Edge List format:

1 * 8.0 "foo"
2 * 4.3 "bar"
3 * 6.1 "bax"
4 * 17.78 "f00"
2 1 false "1985-10-18 10:00:00"
3 2 true "1961-12-30 14:45:14"
3 4 false "2001-01-15 07:00:43"
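The two sections can be told apart by the * marker in the second position of a vertex line. The Python sketch below reads Edge List text along those lines, assuming the default single-space separator; the function name and return shape are hypothetical, and property values are kept as raw tokens. Vertices that only appear in edge lines are added with no properties, reflecting that declaring a vertex is optional.

```python
import re

# A token is either a double-quoted value or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def parse_edge_list(text):
    """Return (vertices, edges) parsed from EDGE_LIST text."""
    vertices = {}  # vertex ID -> list of raw property tokens
    edges = []     # (source, destination, raw property tokens)
    for line in text.splitlines():
        tokens = TOKEN.findall(line)
        if not tokens:
            continue
        if len(tokens) >= 2 and tokens[1] == "*":
            # Vertex line: <id> * <properties...>
            vertices[tokens[0]] = tokens[2:]
        elif len(tokens) >= 2:
            # Edge line: <source> <destination> <properties...>
            src, dst = tokens[0], tokens[1]
            vertices.setdefault(src, [])
            vertices.setdefault(dst, [])
            edges.append((src, dst, tokens[2:]))
        else:
            raise ValueError(f"malformed line: {line!r}")
    return vertices, edges
```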

Two Tables Text (TWO_TABLES_TEXT)

The Two Tables Text format is a text file format similar to the Edge List format, with the difference that the vertices and edges are stored in two different files. The vertices file contains vertex IDs followed by vertex properties. The edges file contains the source vertices and target vertices, followed by edge properties.

A graph with V vertices, having N vertex properties and M edge properties would be represented in two files like this:

vertices.ttt:

<V-1> <V-1, VP-1> ... <V-1, VP-N>
<V-2> <V-2, VP-1> ... <V-2, VP-N>
...
<V-V> <V-V, VP-1> ... <V-V, VP-N>

edges.ttt:

<V-1> <V-1, VG-1> <EP-1> ... <EP-M>
<V-1> <V-1, VG-2> <EP-1> ... <EP-M>
...
<V-V> <V-V, VG-1> <EP-1> ... <EP-M>

Example

The following example shows the graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Two Tables Text format:

vertices.ttt:

1 8.0 "foo"
2 4.3 "bar"
3 6.1 "bax"
4 17.78 "f00"

edges.ttt:

2 1 false "1985-10-18 10:00:00"
3 2 true "1961-12-30 14:45:14"
3 4 false "2001-01-15 07:00:43"
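Because the two formats differ mainly in how vertex lines are marked, the contents of the two files can be merged into Edge List text by inserting the * marker after each vertex ID. The Python sketch below shows this under the assumptions that both files use the same separator and that edge lines carry no * marker, as in the examples; the function name is hypothetical.

```python
def two_tables_to_edge_list(vertices_text, edges_text, separator=" "):
    """Merge TWO_TABLES_TEXT contents into a single EDGE_LIST string."""
    lines = []
    for line in vertices_text.splitlines():
        if not line:
            continue
        # Split off the vertex ID; the remainder holds the properties.
        vertex_id, _, props = line.partition(separator)
        parts = [vertex_id, "*"] + ([props] if props else [])
        lines.append(separator.join(parts))
    # Edge lines are copied unchanged after the vertex section.
    lines.extend(line for line in edges_text.splitlines() if line)
    return "\n".join(lines)
```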

Edge List vs Adjacency List

ADJ_LIST is more space-efficient than EDGE_LIST. In EDGE_LIST, vertices are defined first and edges are listed afterwards, so each vertex ID is repeated at least once. EDGE_LIST is much easier to read, but if you plan to store large graphs, consider using ADJ_LIST to save disk space.
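The difference can be observed on the small example graph used in this document. The Python sketch below stores both encodings as strings and compares their sizes in bytes; the variable and function names are chosen for illustration only, and the gap on real graphs depends on ID lengths and property counts.

```python
ADJ_LIST = "\n".join([
    '1 8.0 "foo"',
    '2 4.3 "bar" 1 false "1985-10-18 10:00:00"',
    '3 6.1 "bax" 2 true "1961-12-30 14:45:14" 4 false "2001-01-15 07:00:43"',
    '4 17.78 "f00"',
])

EDGE_LIST = "\n".join([
    '1 * 8.0 "foo"',
    '2 * 4.3 "bar"',
    '3 * 6.1 "bax"',
    '4 * 17.78 "f00"',
    '2 1 false "1985-10-18 10:00:00"',
    '3 2 true "1961-12-30 14:45:14"',
    '3 4 false "2001-01-15 07:00:43"',
])


def size_in_bytes(text):
    """Size of the text when stored as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    print("ADJ_LIST bytes: ", size_in_bytes(ADJ_LIST))
    print("EDGE_LIST bytes:", size_in_bytes(EDGE_LIST))
```

The Edge List encoding is larger here because every vertex line carries the * marker and every edge line repeats its source vertex ID.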

Flat file (FLAT_FILE)

The Flat File format is a text file format consisting of two description files, one for vertices and one for edges. Each file consists of a list of properties with the following format:

vertices.opv:

vertex_ID, key_name, value_type, value, value, value

<V-1> <V-1, VPK-1> <V-1, VPT-1> [<V-1, VP-1> <V-1, VP-1> <V-1, VP-1>]
...
<V-1> <V-1, VPK-N> <V-1, VPT-N> [<V-1, VP-N> <V-1, VP-N> <V-1, VP-N>]
<V-2> <V-2, VPK-1> <V-2, VPT-1> [<V-2, VP-1> <V-2, VP-1> <V-2, VP-1>]
...
<V-2> <V-2, VPK-N> <V-2, VPT-N> [<V-2, VP-N> <V-2, VP-N> <V-2, VP-N>]
...
<V-V> <V-V, VPK-N> <V-V, VPT-N> [<V-V, VP-N> <V-V, VP-N> <V-V, VP-N>]

edges.ope:

edge_ID, source_vertex_ID, destination_vertex_ID, edge_label, key_name, value_type, value, value, value


<E-1> <V-1, VG-1> <E-1, EL-1> <E-1, EPK-1> <E-1, EPT-1> [<E-1, EP-1> <E-1, EP-1> <E-1, EP-1>]
...
<E-1> <V-N, VG-N> <E-1, EL-N> <E-1, EPK-N> <E-1, EPT-N> [<E-1, EP-N> <E-1, EP-N> <E-1, EP-N>]
<E-2> <V-1, VG-1> <E-2, EL-1> <E-2, EPK-1> <E-2, EPT-1> [<E-2, EP-1> <E-2, EP-1> <E-2, EP-1>]
...
<E-2> <V-N, VG-N> <E-2, EL-N> <E-2, EPK-N> <E-2, EPT-N> [<E-2, EP-N> <E-2, EP-N> <E-2, EP-N>]
...
<E-E> <V-N, VG-N> <E-E, EL-N> <E-E, EPK-N> <E-E, EPT-N> [<E-E, EP-N> <E-E, EP-N> <E-E, EP-N>]

No properties

When no properties are defined for a certain vertex or edge, %20 is used instead of the key name:

Vertices: 1,%20,,,,
Edges: 1,2,1,"label",%20,,,,

Value fields

Values that are neither numeric nor dates go in the first value field; numeric values go in the second, and dates in the third.

Quoted Strings

Strings must not be quoted. However, the following characters must be encoded:

  • '%' -> '%25'
  • '\t' -> '%09'
  • ' ' -> '%20'
  • '\n' -> '%0A'
  • ',' -> '%2C'
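The encoding table above can be applied with plain string replacement, as in the following Python sketch. The function names are hypothetical. The percent sign is encoded first and decoded last so that the escape sequences themselves are not encoded or decoded twice.

```python
# (raw character, escape sequence) pairs from the table above.
ENCODINGS = [
    ("%", "%25"),
    ("\t", "%09"),
    (" ", "%20"),
    ("\n", "%0A"),
    (",", "%2C"),
]


def encode_flat_file_string(value):
    """Escape the characters that are reserved in flat files."""
    # '%' comes first so that escapes added later are left untouched.
    for raw, escaped in ENCODINGS:
        value = value.replace(raw, escaped)
    return value


def decode_flat_file_string(value):
    """Reverse encode_flat_file_string."""
    # '%25' is handled last so that a literal "%20" survives decoding.
    for raw, escaped in reversed(ENCODINGS):
        value = value.replace(escaped, raw)
    return value
```

For instance, encoding the date string 1985-10-18 10:00:00 gives 1985-10-18%2010:00:00, the form used in the edges file of a flat-file graph.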

Example

The following example shows a graph of 4 vertices (1, 2, 3 and 4), each having a double and a string property, and 3 edges, each having a boolean and a date property, encoded in Flat File format:

vertices.opv:

1,doubleProp,4,,8.0,
1,stringProp,1,foo,,
2,doubleProp,4,,4.3,
2,stringProp,1,bar,,
3,doubleProp,4,,6.1,
3,stringProp,1,bax,,
4,doubleProp,4,,17.78,
4,stringProp,1,f00,,

edges.ope:

1,2,1,label,boolProp,6,false,,
1,2,1,label,dateProp,5,,,1985-10-18%2010:00:00
2,3,2,label,boolProp,6,true,,
2,3,2,label,dateProp,5,,,1961-12-30%2014:45:14
3,3,4,label,boolProp,6,false,,
3,3,4,label,dateProp,5,,,2001-01-15%2007:00:43
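As an illustration of how the vertex file is laid out, the following Python sketch groups the lines of a vertices file by vertex ID. It assumes the comma separator shown in the example and exactly six fields per line; the function name and the returned structure are hypothetical. Values are kept in their raw, still-encoded form together with the value type code, since the meaning of the codes is not described here.

```python
def parse_opv(text):
    """Return {vertex_id: {key_name: (value_type, raw_value)}}."""
    vertices = {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(",")
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        vertex_id, key, value_type, text_val, num_val, date_val = fields
        props = vertices.setdefault(vertex_id, {})
        if key == "%20":
            # Placeholder key: the vertex has no properties.
            continue
        # Exactly one of the three value fields is expected to be filled.
        props[key] = (value_type, text_val or num_val or date_val)
    return vertices
```

Lines of the edges file could be handled the same way after splitting off the three additional leading fields (source vertex ID, destination vertex ID and edge label) that follow the edge ID.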