PGX 20.1.1
Documentation

Memory Consumption

PGX is an in-memory graph analytic engine optimized for performance. Before PGX runs any analysis on a graph, it requires the whole graph and the properties needed for the analysis to be loaded into main memory (except for properties offloaded to external stores). The memory consumed by PGX for a graph is split between the memory that stores the topology of the graph (the information about which vertices and edges exist in the graph, without their attached properties) and the memory for the properties attached to the vertices and edges. Internally, PGX stores the graph topology in compressed sparse row (CSR) format, a data structure with a minimal memory footprint that still provides very fast read access.

Non-partitioned Vs. Partitioned Graphs

The shared-memory version of PGX supports both the non-partitioned and the partitioned graph model, which differ in how properties are associated with vertices and edges. In the non-partitioned graph model, all the vertices (resp. edges) of a PGX graph have the same properties. In that model, if a graph contains vertices that represent persons and books, to which we would like to attach the properties 'PersonName' and 'PersonAge', and 'BookISBN' and 'BookGenre' respectively, all four properties are in fact defined for every vertex in the graph.

In the case of a graph using the partitioned graph model, vertex properties (resp. edge properties) are specified per vertex provider (resp. edge provider). For example, using a partitioned graph model, it is possible to define the properties Name and Age only for vertices loaded from a Person vertex provider and define the properties ISBN and Genre only for vertices loaded from a Book vertex provider. Therefore, using the partitioned graph model can reduce the memory consumption required for the vertex and edge properties of a graph by associating only the necessary properties to the entities based on what the entities represent.

However, storing the topology of a partitioned graph can require more memory, depending on how the edges are specified in edge providers: for each edge provider, PGX shared memory requires CSR indices that relate the source and destination vertices of the edges.

Memory Consumption of Properties

Each property associated to a vertex or an edge consumes memory depending on the type of the property. The following table indicates the memory consumption for each property type:

Property Type              Size in Bytes
int                        4
float                      4
long                       8
double                     8
boolean                    1
date                       8
local date                 4
time                       4
timestamp                  8
time with timezone         8
timestamp with timezone    12
point 2d                   16
string                     variable

How Much Memory Do You Need for Your Non-partitioned Graph?

Here is a simple formula you can use to estimate whether your graph data fits in memory:

number of bytes = 48 * V + 16 * E

where V is the number of vertices and E is the number of edges in the graph.

Assuming 8-byte vertex keys

The formula presented above assumes vertices are identified using 8-byte long IDs (see vertex_id_type in the graph configuration guide), and that no edge IDs are used. If you load edge keys, add 4 * E or 8 * E, depending on whether Integer or Long keys are used.

If your graph has properties, add V * type-size for each vertex property and E * type-size for each edge property, where type-size is the size indicated by the table in the Memory Consumption of Properties section.

Example: A graph with 10M vertices, 100M edges and one double edge property (cost) consumes at least 48 * 10M + 16 * 100M + 8 * 100M bytes = 2.88 GB of memory.

Note that this estimate only refers to the amount of memory that is required to hold the graph in main memory after it has been successfully loaded. Depending on the format, PGX might allocate temporary data structures during loading that consume additional memory. We therefore recommend that you have at least twice as much memory available as the sum of all the graph data you plan to load.
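The estimate above can be sketched as a small helper. This is only an illustration of the formula and the property table from this page, not part of PGX; the function name, the dictionary name and the parameter names are hypothetical. String properties are left out because their size is variable.

```python
# Sizes in bytes of the fixed-size property types, copied from the table above.
PROPERTY_SIZE_BYTES = {
    "int": 4, "float": 4, "long": 8, "double": 8, "boolean": 1,
    "date": 8, "local date": 4, "time": 4, "timestamp": 8,
    "time with timezone": 8, "timestamp with timezone": 12, "point 2d": 16,
}


def estimate_non_partitioned_bytes(num_vertices, num_edges,
                                   vertex_props=(), edge_props=(),
                                   edge_key_bytes=0):
    """Estimate the memory held by a loaded non-partitioned graph.

    Assumes 8-byte vertex keys. edge_key_bytes is 0 (no edge keys),
    4 (Integer keys) or 8 (Long keys).
    """
    total = 48 * num_vertices + 16 * num_edges
    total += edge_key_bytes * num_edges
    total += sum(PROPERTY_SIZE_BYTES[t] for t in vertex_props) * num_vertices
    total += sum(PROPERTY_SIZE_BYTES[t] for t in edge_props) * num_edges
    return total


# 10M vertices, 100M edges, one double edge property.
graph_bytes = estimate_non_partitioned_bytes(
    10_000_000, 100_000_000, edge_props=["double"])
# Recommended headroom for temporary loading structures: twice the graph size.
recommended_bytes = 2 * graph_bytes
print(graph_bytes, recommended_bytes)
```

The second number reflects the recommendation to keep twice the graph size available while loading.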

How Much Memory Do You Need for Your Partitioned Graph?

For partitioned graphs, the memory required for the entire graph is the sum of the memory required to load the vertices (resp. edges) from all the vertex (resp. edge) providers, including their properties.

number of bytes = SUM(memory_vertex_provider_i) + SUM(memory_edge_provider_j)

The following sections explain how to compute the memory consumption for the vertex and edge providers.

How Much Memory Do You Need to Load a Vertex Provider?

Here is a simple formula to determine the memory required for the vertices loaded from a vertex provider containing V vertices:

number of bytes for vertex provider = 32 * V

Assuming 8-byte vertex keys

The formula presented above assumes vertices are identified using 8-byte long keys.

If your vertex provider has properties, you have to add for each property V * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.
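As a rough illustration of the vertex provider formula, the following sketch adds the per-property cost to the 32 bytes per vertex. The function name and the example provider are hypothetical; property sizes are passed in bytes as listed in the property table, and string properties are not covered because their size is variable.

```python
def estimate_vertex_provider_bytes(num_vertices, property_sizes=()):
    """Estimate memory for one vertex provider, assuming 8-byte vertex keys.

    property_sizes holds one size in bytes per fixed-size vertex property.
    """
    return (32 + sum(property_sizes)) * num_vertices


# Hypothetical Person provider: 1M vertices with an int Age property (4 bytes).
person_bytes = estimate_vertex_provider_bytes(1_000_000, property_sizes=[4])
print(person_bytes)
```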

How Much Memory Do You Need to Load an Edge Provider?

Edges loaded from each edge provider have their own CSR representation, referring to the vertices loaded from the source and destination vertex providers. For this reason, the memory requirement for edge providers depends on the number of vertices in the source (V_src) and destination (V_dst) vertex providers, as well as on the number of edges in the edge provider (E), as the following formula illustrates.

number of bytes for edge provider = 8 * V_src + 8 * V_dst + 16 * E

Assuming no edge keys are loaded

The formula presented above assumes that no edge keys are used. If you load edge keys, add 4 * E or 8 * E, depending on whether Integer or Long keys are used.

If your edge provider has properties, you have to add for each property E * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.
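The provider formulas can be combined into an estimate for a whole partitioned graph. The sketch below is an illustration under the assumptions stated on this page (8-byte vertex keys, fixed-size properties only); all function names and the example providers are hypothetical.

```python
def estimate_vertex_provider_bytes(num_vertices, property_sizes=()):
    """32 bytes per vertex plus the size of each fixed-size vertex property."""
    return (32 + sum(property_sizes)) * num_vertices


def estimate_edge_provider_bytes(num_src_vertices, num_dst_vertices, num_edges,
                                 property_sizes=(), edge_key_bytes=0):
    """CSR cost per edge provider plus edge keys and fixed-size edge properties.

    edge_key_bytes is 0 (no edge keys), 4 (Integer keys) or 8 (Long keys).
    """
    topology = 8 * num_src_vertices + 8 * num_dst_vertices + 16 * num_edges
    per_edge = edge_key_bytes + sum(property_sizes)
    return topology + per_edge * num_edges


def estimate_partitioned_graph_bytes(vertex_provider_bytes, edge_provider_bytes):
    """Sum of the estimates of all vertex providers and all edge providers."""
    return sum(vertex_provider_bytes) + sum(edge_provider_bytes)


# Hypothetical graph: 1M persons, 2M books, 5M Person -> Book edges
# carrying one double property (8 bytes).
persons = estimate_vertex_provider_bytes(1_000_000)
books = estimate_vertex_provider_bytes(2_000_000)
reads = estimate_edge_provider_bytes(1_000_000, 2_000_000, 5_000_000,
                                     property_sizes=[8])
total_bytes = estimate_partitioned_graph_bytes([persons, books], [reads])
print(total_bytes)
```

With a single vertex provider whose edges point back to the same provider, the sum reduces to 32 * V + 8 * V + 8 * V + 16 * E = 48 * V + 16 * E, which matches the non-partitioned formula.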

On and Off Heap Memory

Although PGX 20.1.1 is mainly a Java application, it stores some of its graph data in off-heap memory, meaning in memory locations outside the control of the Java virtual machine. This is because of Java's 32-bit array-length limitation. PGX 20.1.1 uses off-heap memory to store all vertices, edges and properties (with the exception of properties of the non-primitive type string).

String Properties and String Pooling

As indicated in the previous section, string properties are stored in on-heap memory by PGX.

PGX implements a "string pooling" optimization to save memory when a string property stores only a few distinct values that are repeated many times (e.g., a color property storing just a few categorical values: red, green, blue, ...). When PGX applies that technique, the memory consumption of the string property follows a different rule. In that case, the memory consumed can be approximated by the memory consumed by the distinct strings (each individual string still consumes memory as indicated for the string type in the Memory Consumption of Properties section), plus the memory consumed by all the references to those strings (one per vertex or edge, depending on whether the property is a vertex or edge property). The memory consumed by a reference in Java is generally 4 bytes on 32-bit JVMs and 8 bytes on 64-bit JVMs.

String pooling behavior can be configured at the PGX server level with the PGX runtime fields string_pooling_strategy and max_distinct_strings_per_pool. The user can override these settings for a given string vertex or edge property when first loading a graph into memory; an example of how to set string pooling for properties can be found in Handling Graph Config in Application. For more information on the configuration fields, please see the Engine and Runtime Configuration Guide or the Graph Config Guide.

In addition, the runtime configuration field pooling_factor is relevant in the context of graph mutation. It defines a factor that guards against cases where string pooling is ineffective, such as when the number of distinct property values is large and the cost of the structures used for the string pools adds up. The default value is 0.25, an estimate of the lowest pooling factor above which pooling may actually help save memory.
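The approximation for a pooled string property can be written as a short sketch. Because the size of an individual string is variable, the caller supplies the size of each distinct string; the 56-byte figure used in the example is purely hypothetical, as are the function and parameter names.

```python
def estimate_pooled_string_property_bytes(distinct_string_bytes, num_entities,
                                          reference_bytes=8):
    """Approximate memory of a pooled string property.

    distinct_string_bytes: size in bytes of each distinct string (caller-supplied).
    num_entities: number of vertices or edges carrying the property.
    reference_bytes: generally 4 on 32-bit JVMs and 8 on 64-bit JVMs.
    """
    return sum(distinct_string_bytes) + reference_bytes * num_entities


# Hypothetical color property on 10M vertices with three distinct values,
# each assumed to occupy 56 bytes on the heap.
pooled_bytes = estimate_pooled_string_property_bytes([56, 56, 56], 10_000_000)
print(pooled_bytes)
```

With few distinct values, the references dominate the total, so the cost is close to reference_bytes per vertex or edge.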

What Happens If PGX Runs out of Memory

PGX memory allocation requests can fail for several reasons:

  • The maximum Java heap size is reached and PGX tries to allocate more on-heap memory. Then the underlying JVM will throw an OutOfMemoryError.
  • The maximum PGX off-heap size is reached and PGX tries to allocate more off-heap memory. Then the PGX runtime will throw an OutOfMemoryError.
  • The maximum PGX off-heap size is not yet reached but the underlying OS is running out of memory and PGX tries to allocate more off-heap memory. Then the OS might reject the allocation request, which will result in an OutOfMemoryError being thrown. However, the OS might also simply kill the whole process. We therefore recommend that you always specify an upper off-heap memory allocation limit in order to prevent PGX from trying to allocate more memory than physically available on the current machine, running the risk of the JVM being accidentally shut down by the OS.

If an OutOfMemoryError is thrown while processing a user request (like loading a graph), the request stops and the corresponding Future of the request is completed exceptionally, with an OutOfMemoryError as its cause. The PGX engine remains fully operational and continues to accept and process other incoming requests. The user can try again later, when more memory becomes available.

If an OutOfMemoryError is thrown on the engine's main thread, PGX will shut down. The engine's main thread is mainly responsible for dispatching incoming user requests to the thread pools. It only allocates small objects, so if those allocations fail, it usually means that the Java heap is completely full. The Java virtual machine is probably spending most of its time in garbage collection cycles, and the thread pools are not able to complete any tasks without an OutOfMemoryError. Hence, an OutOfMemoryError on the engine's main thread is treated as a critical error, causing PGX to reject any further incoming requests, clean up and terminate.

In both cases, the OutOfMemoryError is logged accordingly, so administrators can understand what happened.

Defaults and Configuration of Memory Limits

You can configure both on- and off-heap memory limits. If you don't explicitly set either, both the maximum on-heap size and the maximum off-heap size default to the maximum on-heap size determined by Java HotSpot, which is based on various factors, including the total amount of physical memory available.

This implies that — by default — the total amount of memory PGX is allowed to allocate is twice the default maximum on-heap size.
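To make the default behavior concrete, the sketch below checks an estimate against the default limits. It assumes you already know the default maximum heap size chosen by the JVM on your machine and have split your estimate into an on-heap part (string properties) and an off-heap part (vertices, edges and primitive properties). The function names are hypothetical, and the check is necessary rather than sufficient, since the JVM heap also holds other objects.

```python
GIB = 1024 ** 3


def default_total_limit_bytes(default_max_heap_bytes):
    """By default the on-heap and off-heap limits both equal the default max heap."""
    return 2 * default_max_heap_bytes


def fits_default_limits(on_heap_bytes, off_heap_bytes, default_max_heap_bytes):
    """Each part must stay within its own limit; the limits are not shared."""
    return (on_heap_bytes <= default_max_heap_bytes
            and off_heap_bytes <= default_max_heap_bytes)


# Hypothetical machine where the JVM picks a 16 GiB default maximum heap.
total_limit = default_total_limit_bytes(16 * GIB)
ok = fits_default_limits(on_heap_bytes=2 * GIB, off_heap_bytes=12 * GIB,
                         default_max_heap_bytes=16 * GIB)
too_big = fits_default_limits(on_heap_bytes=2 * GIB, off_heap_bytes=20 * GIB,
                              default_max_heap_bytes=16 * GIB)
print(total_limit, ok, too_big)
```

Note that a graph needing 20 GiB off-heap does not fit by default even though the combined limit is 32 GiB, because each limit applies separately.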

Configure On-Heap Limits

The maximum on-heap size of a Java application is controlled via the -Xmx command-line option.

  • If you start PGX as a local Java application, simply pass -Xmx<SIZE> to your java command.
  • If you're using the local PGX shell, set the JAVA_OPTS environment variable before starting the shell, for example:
      export JAVA_OPTS="-Xmx128g"
      $PGX_HOME/bin/pgx-jshell
  • If you deploy PGX as a web application, consult the documentation of your target web server on how to specify Java command-line options.

Configure Off-Heap Limits

You can specify the off-heap limit by setting the max_off_heap_size field in the PGX config.

Warning: Max Off-Heap Precision

The off-heap limit is not guaranteed to never be exceeded because of rounding and synchronization trade-offs.
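One way to pick a value for max_off_heap_size is to subtract the Java heap and a safety reserve from the physical memory of the machine, so that the OS is unlikely to run out of memory before PGX rejects an allocation. The sketch below only illustrates that arithmetic and prints a JSON fragment. It assumes the field is expressed in megabytes, which you should verify against the Engine and Runtime Configuration Guide; the function name and the reserve parameter are hypothetical.

```python
import json


def choose_off_heap_limit_mb(physical_mb, max_heap_mb, reserve_mb):
    """Off-heap budget = physical memory - Java heap (-Xmx) - reserve.

    reserve_mb is a hypothetical margin for the OS, other processes and the
    fact that the off-heap limit may be slightly exceeded.
    """
    limit = physical_mb - max_heap_mb - reserve_mb
    if limit <= 0:
        raise ValueError("no memory left for off-heap data")
    return limit


# Hypothetical 256 GB machine running PGX with -Xmx128g and an 8 GB reserve.
limit_mb = choose_off_heap_limit_mb(256 * 1024, 128 * 1024, 8 * 1024)
# Assumption: max_off_heap_size takes a value in megabytes.
config_fragment = json.dumps({"max_off_heap_size": limit_mb}, indent=2)
print(config_fragment)
```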