PGX 20.1.1

Documentation

Partitioned graph (beta version)

Not all PGX features and APIs may be available for partitioned graphs, and partitioned graphs are not supported in distributed execution mode. Please refer to the "Unsupported features" section for the current limitations.

The PGX partitioned graph model is a representation of a property graph particularly suited for graphs having vertices and edges of different "types", where each type of vertex or edge has a different set of properties. For example, in a graph that represents people and locations, vertices of type "Person" would have properties such as "Name" or "Birthday", while vertices of type "Place" would have properties such as "Address". Similarly, edges from "Person" to "Person" may have properties like "Meeting date", while edges from "Person" to "Place" may have properties such as "Lives at floor".

Loading such graphs as partitioned graphs in PGX can result in potentially large memory savings, thanks to the specific memory layout, optimized for different types of vertices and edges.
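To get a feel for where the savings can come from, the following sketch counts property cells under a deliberately simplified model. It is not PGX's actual memory layout: it assumes that a non-partitioned graph reserves one cell per vertex for every property in the graph, that a partitioned graph reserves cells only for the properties of each vertex's own type, and that property names do not overlap between types. The class, its methods, and the vertex counts are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of property storage; not the PGX implementation.
class PropertyCellCountSketch {

    // One vertex type: how many vertices it has and how many properties each carries.
    static final class VertexType {
        final long vertexCount;
        final int propertyCount;

        VertexType(long vertexCount, int propertyCount) {
            this.vertexCount = vertexCount;
            this.propertyCount = propertyCount;
        }
    }

    // Non-partitioned model: every vertex gets a cell for every property of every type.
    // Assumes property names are disjoint across types.
    static long nonPartitionedCells(Map<String, VertexType> types) {
        long vertices = 0;
        long properties = 0;
        for (VertexType t : types.values()) {
            vertices += t.vertexCount;
            properties += t.propertyCount;
        }
        return vertices * properties;
    }

    // Partitioned model: each vertex only gets cells for the properties of its own type.
    static long partitionedCells(Map<String, VertexType> types) {
        long cells = 0;
        for (VertexType t : types.values()) {
            cells += t.vertexCount * t.propertyCount;
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, VertexType> types = new LinkedHashMap<>();
        types.put("Person", new VertexType(1_000_000, 2)); // "Name", "Birthday"
        types.put("Place", new VertexType(10_000, 1));     // "Address"

        System.out.println("non-partitioned cells: " + nonPartitionedCells(types));
        System.out.println("partitioned cells:     " + partitionedCells(types));
    }
}
```

The gap between the two counts grows with the number of types and with how few properties the types share; real memory usage also depends on property data types and on the graph topology, which this sketch ignores.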

PGX enforces by default the existence of a unique identifier for each vertex and edge in a graph, so that they can be retrieved with the `PgxGraph.getVertex(ID id)` and `PgxGraph.getEdge(ID id)` methods or with PGQL queries using the built-in `id()` method. That remains the case for partitioned graphs.

The default strategy to generate the vertex IDs is to use the keys provided during loading of the graph. In that case each vertex should have a vertex key that is unique across all the types of vertices. For edges, by default no keys are required in the edge data, and edge IDs will be automatically generated by PGX. Please note that the generation of edge IDs is not guaranteed to be deterministic. If required, it is also possible to load edge keys as IDs.
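Because vertex keys must be unique across all vertex types under the default strategy, it can be useful to check the input data before loading. The following is a minimal sketch of such a check and is independent of the PGX API: the class name, the method, and the representation of each provider as a list of `Long` keys are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-loading check; not part of the PGX API.
class VertexKeyCheckSketch {

    // Returns every key that appears more than once, within one provider or across providers.
    static Set<Long> duplicateKeys(Map<String, List<Long>> keysByProvider) {
        Set<Long> seen = new HashSet<>();
        Set<Long> duplicates = new TreeSet<>();
        for (List<Long> keys : keysByProvider.values()) {
            for (Long key : keys) {
                if (!seen.add(key)) {
                    duplicates.add(key);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> keysByProvider = new LinkedHashMap<>();
        keysByProvider.put("Person", Arrays.asList(1L, 2L, 3L));
        keysByProvider.put("Place", Arrays.asList(3L, 4L)); // key 3 clashes with a "Person" key

        System.out.println("duplicate keys: " + duplicateKeys(keysByProvider));
    }
}
```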

However, as it may be cumbersome to define such identifiers for partitioned graphs, it is possible to disable that requirement for the vertices and/or edges by setting the `vertex_id_strategy` and `edge_id_strategy` graph configuration fields to the value `no_ids`. When vertex (resp. edge) IDs are disabled, PGX forbids calls to APIs that use vertex (resp. edge) IDs, including the ones indicated previously.

Please refer to the Java API documentation of the `IdStrategy` enumeration, the Graph Loading Guide, and the Graph Configuration Guide to learn more about the possible ID strategies and how to specify them in graph configurations.

In partitioned graphs, vertices and edges are typed, meaning that they have a defined set of properties. Vertices and edges of a partitioned graph are loaded from "providers", where each provider is a data source that provides vertices (or edges) of a specific type (i.e., with a specific set of properties).

See the partitioned graph loading documentation for more information on loading partitioned graphs.

In addition to loading partitioned graphs directly from a partitioned graph configuration, it is possible for some non-partitioned graph formats (currently `CSV`, `TWO_TABLES RDBMS`, and the `PG` formats) to let PGX detect the vertex and edge types while loading the non-partitioned graph data, and create a partitioned graph. To do that, PGX relies on the vertex and/or edge labels present in the non-partitioned graph data to find the vertex and edge types.

Loading partitioned graphs in this way presents the advantage of requiring few changes if a non-partitioned graph is already available in a supported format, while giving the memory improvements of partitioned graphs.

Loading a partitioned graph in this way is described further in the Auto-Heterogenization Guide.
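The idea of deriving types from labels can be pictured as grouping the input by label. The sketch below only illustrates that idea and makes no claim about how PGX implements the detection: it assumes each vertex carries exactly one label, and the class, the record type, and the method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of label-based type detection; not the PGX implementation.
class LabelPartitioningSketch {

    // A vertex as it might appear in non-partitioned input data (assumption: one label each).
    static final class LabeledVertex {
        final long key;
        final String label;

        LabeledVertex(long key, String label) {
            this.key = key;
            this.label = label;
        }
    }

    // Groups vertex keys by label; each distinct label becomes one detected vertex type.
    static Map<String, List<Long>> partitionByLabel(List<LabeledVertex> vertices) {
        Map<String, List<Long>> partitions = new TreeMap<>();
        for (LabeledVertex v : vertices) {
            partitions.computeIfAbsent(v.label, label -> new ArrayList<>()).add(v.key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<LabeledVertex> vertices = Arrays.asList(
                new LabeledVertex(1L, "Person"),
                new LabeledVertex(2L, "Place"),
                new LabeledVertex(3L, "Person"));

        System.out.println(partitionByLabel(vertices));
    }
}
```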

It is possible to add or remove vertex and edge providers from a partitioned graph by applying a graph alteration mutation. To get more information about how to apply graph alterations on partitioned graphs, please read the dedicated documentation available at the graph alteration reference documentation.

Because a partitioned graph models precisely the types of the vertices and edges and their associated properties, its memory consumption can be very different from that of a non-partitioned graph. The memory consumption documentation page provides more information on the memory requirements for loading partitioned graphs.

Partitioned graphs can be modified by using changesets, subject to some constraints. Because partitioned graphs are made of vertices and edges of specific types, the changes in a graph changeset on a partitioned graph have to conform to the types defined when the partitioned graph was initially created or loaded. To get more information about how to create and apply graph changesets on partitioned graphs, please read the dedicated documentation available at the graph change set reference documentation.

All the features of the PGQL language available for non-partitioned graphs are supported for partitioned graphs.

Furthermore, since partitioned graphs associate vertices and edges with specific types, PGQL queries can execute faster by applying some specific optimizations. In order to benefit from all possible optimizations, we recommend enabling the creation of a label histogram when loading partitioned graphs. Please refer to the documentation of the `create_label_histogram` configuration field at the graph config reference documentation.

In partitioned graphs, not all the vertices or edges may have all properties. If a property access is attempted for a vertex or an edge that does not have this property, the PGQL query engine will continue the query by giving a `NULL` value as result of this access. If this `NULL` value is used in the rest of the query in an expression, the same rules as the SQL Three-valued logic are used to evaluate the expression. Sorting of `NULL` values with an `ORDER BY` clause is supported, and the `NULL` values will be placed after any other non-`NULL` value when using an ascending ordering, and before any non-`NULL` value when using descending ordering.
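The `NULL` rules above can be mimicked in plain Java, which may help when reasoning about query results. The following sketch does not use the PGQL engine: it models an unknown truth value as a `null` `Boolean`, implements the standard SQL three-valued `AND`, `OR`, and `NOT`, and reproduces the described `ORDER BY` placement of `NULL` values with `Comparator.nullsLast`. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of SQL-style NULL handling; not the PGQL engine.
class NullSemanticsSketch {

    // Ascending order with NULL values after every non-NULL value.
    static final Comparator<Integer> ASCENDING =
            Comparator.nullsLast(Comparator.<Integer>naturalOrder());

    // Reversing the whole comparator yields descending values with NULL values first.
    static final Comparator<Integer> DESCENDING = ASCENDING.reversed();

    // Three-valued AND: FALSE dominates, otherwise any unknown operand makes the result unknown.
    static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
            return Boolean.FALSE;
        }
        return (a == null || b == null) ? null : Boolean.TRUE;
    }

    // Three-valued OR: TRUE dominates, otherwise any unknown operand makes the result unknown.
    static Boolean or(Boolean a, Boolean b) {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
            return Boolean.TRUE;
        }
        return (a == null || b == null) ? null : Boolean.FALSE;
    }

    // Three-valued NOT: the negation of unknown is unknown.
    static Boolean not(Boolean a) {
        return a == null ? null : Boolean.valueOf(!a);
    }

    // A comparison with a NULL operand evaluates to unknown.
    static Boolean greaterThan(Integer a, Integer b) {
        return (a == null || b == null) ? null : Boolean.valueOf(a > b);
    }

    static List<Integer> sorted(List<Integer> values, Comparator<Integer> order) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of does not accept null elements.
        List<Integer> floors = Arrays.asList(3, null, 1, 2);

        System.out.println("ascending:  " + sorted(floors, ASCENDING));
        System.out.println("descending: " + sorted(floors, DESCENDING));
        System.out.println("NULL > 1 AND FALSE: " + and(greaterThan(null, 1), false));
        System.out.println("NULL > 1 OR FALSE:  " + or(greaterThan(null, 1), false));
    }
}
```

A filter keeps a row only when its condition evaluates to true, so a condition that evaluates to unknown excludes the row just as a false one does.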

Current limitation when grouping by NULL values

The current PGQL engine may not function correctly for queries that execute a `GROUP BY` aggregation on keys that contain `NULL` values.

For more information about the PGQL language, please refer to the PGQL reference documentation.

`INSERT`/`UPDATE`/`DELETE` queries are also supported for partitioned graphs. `UPDATE` and `DELETE` queries can be executed without limitations. In the case of `INSERT` queries, the type of the inserted entity is determined by its label(s). For this reason, vertices inserted through PGQL must have their labels defined, and those labels must correspond to exactly one vertex type. For an edge insert, the label of the inserted edge must refer to an edge type of the graph that is defined between the type of the source vertex and the type of the destination vertex. Furthermore, the assigned properties must be defined for the type of the inserted entity.
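The `INSERT` rules above amount to a small set of schema checks. The sketch below expresses them against a hypothetical in-memory schema model and is not part of the PGX API: it assumes each vertex type is identified by a single label, and the class, its methods, and the `LivesAt` edge label are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical schema model for illustration; not part of the PGX API.
class InsertTypeCheckSketch {

    static final class EdgeType {
        final String label;
        final String sourceType;
        final String destinationType;
        final Set<String> properties;

        EdgeType(String label, String sourceType, String destinationType, String... properties) {
            this.label = label;
            this.sourceType = sourceType;
            this.destinationType = destinationType;
            this.properties = new HashSet<>(Arrays.asList(properties));
        }
    }

    // Vertex type label -> properties defined for that type (assumption: one label per type).
    final Map<String, Set<String>> vertexTypes = new LinkedHashMap<>();
    final List<EdgeType> edgeTypes = new ArrayList<>();

    void addVertexType(String label, String... properties) {
        vertexTypes.put(label, new HashSet<>(Arrays.asList(properties)));
    }

    // Returns the vertex type selected by the labels, or throws if the insert is invalid.
    String resolveVertexInsert(Set<String> labels, Set<String> assignedProperties) {
        if (labels.size() != 1 || !vertexTypes.containsKey(labels.iterator().next())) {
            throw new IllegalArgumentException("labels must select exactly one vertex type: " + labels);
        }
        String type = labels.iterator().next();
        requireDefined(vertexTypes.get(type), assignedProperties, type);
        return type;
    }

    // Returns the edge type with this label between the two vertex types, or throws.
    EdgeType resolveEdgeInsert(String label, String sourceType, String destinationType,
                               Set<String> assignedProperties) {
        for (EdgeType e : edgeTypes) {
            if (e.label.equals(label) && e.sourceType.equals(sourceType)
                    && e.destinationType.equals(destinationType)) {
                requireDefined(e.properties, assignedProperties, label);
                return e;
            }
        }
        throw new IllegalArgumentException(
                "no edge type '" + label + "' from " + sourceType + " to " + destinationType);
    }

    private static void requireDefined(Set<String> defined, Set<String> assigned, String type) {
        for (String property : assigned) {
            if (!defined.contains(property)) {
                throw new IllegalArgumentException(
                        "property '" + property + "' is not defined for type " + type);
            }
        }
    }

    public static void main(String[] args) {
        InsertTypeCheckSketch schema = new InsertTypeCheckSketch();
        schema.addVertexType("Person", "Name", "Birthday");
        schema.addVertexType("Place", "Address");
        schema.edgeTypes.add(new EdgeType("LivesAt", "Person", "Place", "Lives at floor"));

        System.out.println(schema.resolveVertexInsert(
                Collections.singleton("Person"), Collections.singleton("Name")));
        try {
            // "Address" belongs to "Place", so assigning it to a "Person" is rejected.
            schema.resolveVertexInsert(Collections.singleton("Person"), Collections.singleton("Address"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```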

More details on how to run `INSERT`/`UPDATE`/`DELETE` queries on a graph can be found here. For the exact syntax and semantics of `INSERT`/`UPDATE`/`DELETE` queries, please refer to the corresponding section of the PGQL specification.

The methods provided in the PGX Analyst API support partitioned graphs in the same way as non-partitioned graphs.

Current limitation when using the Analyst API

Most of the algorithms are supported on partitioned graphs. Among the currently non-supported algorithms are the community-label-propagation and infomap algorithms.

Partitioned graphs can also be used for custom algorithms written in Green-Marl or using PGX Algorithm.

Current limitation when using custom algorithms

Currently, custom algorithms cannot be executed on partitioned graphs if they use certain features, including local procedures and ordered iterations.

We list here the notable features that are currently not supported on partitioned graphs (potentially non-exhaustive list):

- distributed runtime support (PGX.D) is not implemented
- graph mutations (subgraph, undirect, sort by degree and others) except `PgxGraph.clone()` are not supported
- the `PgxMap` APIs are not supported
- the `GraphBuilder` APIs are not supported
- Analyst algorithms and custom Green-Marl/PGX Algorithm programs using local procedures and ordered iterations are not supported
- Delta-refresh of partitioned graph is not supported: a full snapshot is created by reading the entire graph again