PGX 21.1.1
Documentation

Memory Consumption

PGX is an in-memory graph analytic engine optimized for performance. Before PGX runs any analysis on graphs, it requires the whole graph and the properties needed for the analysis to be loaded into main memory (except for properties offloaded to external stores). The memory consumed by PGX for a graph is split between the memory to store the topology of the graph (the information to indicate what are the vertices and edges in the graph without their attached properties), and the memory for the properties attached to the vertices and edges. Internally, PGX stores the graph topology in compressed sparse row (CSR) format, a data structure which has minimal memory footprint while providing very fast read access.

Non-partitioned Vs. Partitioned Graphs

The shared-memory version of PGX supports both non-partitioned and partitioned graph models, which differ in how properties are associated with vertices and edges. In the non-partitioned graph model, all the vertices (resp. edges) of a PGX graph have the same properties. In that model, if a graph contains vertices representing both persons and books, to which we would like to attach the properties 'PersonName' and 'PersonAge', and 'BookISBN' and 'BookGenre' respectively, all four properties are in fact defined for every vertex in the graph.

In the case of a graph using the partitioned graph model, vertex properties (resp. edge properties) are specified per vertex provider (resp. edge provider). For example, using a partitioned graph model, it is possible to define the properties Name and Age only for vertices loaded from a Person vertex provider and define the properties ISBN and Genre only for vertices loaded from a Book vertex provider. Therefore, using the partitioned graph model can reduce the memory consumption required for the vertex and edge properties of a graph by associating only the necessary properties to the entities based on what the entities represent.

However, storing the topology of a partitioned graph can require more memory depending on how the edges are specified in edge providers: for each edge provider, shared-memory PGX requires CSR indices that relate the source and destination vertices of its edges.

Memory Consumption of Properties

Each property associated to a vertex or an edge consumes memory depending on the type of the property. The following table indicates the memory consumption for each property type:

Property Type Size in Bytes
int 4
float 4
long 8
double 8
boolean 1
date 8
local date 4
time 4
timestamp 8
time with timezone 8
timestamp with timezone 12
point 2d 16
string variable (refer to the String Properties and String Pooling section)
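
The per-type sizes above can also be captured in a small helper for quick estimates. This is a minimal sketch in plain Java (it can be pasted into a jshell session); the method name is illustrative and is not part of the PGX API.

// Hypothetical helper: per-element size in bytes for a fixed-size PGX property type,
// following the table above. String is variable-size and handled separately.
long propertySizeInBytes(String type) {
    switch (type) {
        case "boolean":                 return 1;
        case "int":
        case "float":
        case "local date":
        case "time":                    return 4;
        case "long":
        case "double":
        case "date":
        case "timestamp":
        case "time with timezone":      return 8;
        case "timestamp with timezone": return 12;
        case "point 2d":                return 16;
        default: throw new IllegalArgumentException("variable-size or unknown type: " + type);
    }
}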

How Much Memory Do You Need for Your Non-partitioned Graph?

Here is a simple formula you can use to estimate whether your graph data fits in memory:

number of bytes = 48 * V + 16 * E

where V is the number of vertices and E is the number of edges in the graph.

Assuming 8-byte vertex keys

The formula presented above assumes vertices are identified using 8-byte long IDs (see vertex_id_type in the graph configuration guide), and that no edge IDs are used. If you load edge keys, add 4 * E or 8 * E bytes, depending on whether Integer or Long keys are used.

If your graph has properties, you have to add for each property V * type-size or E * type-size — depending on whether it is a vertex or edge property — to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.

Example: A 10M vertex 100M edge graph with one double edge cost property consumes at least 48 * 10M + 16 * 100M + 100M * 8 bytes = 2.880 GB of memory.
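
The arithmetic of this example can be reproduced with a few lines of plain Java (for instance in a jshell session); the variable names below are illustrative and the fragment is not a PGX API call.

// Estimate for a non-partitioned graph: 10M vertices, 100M edges,
// 8-byte vertex keys, no edge keys, one double (8-byte) edge property.
long v = 10_000_000L;
long e = 100_000_000L;
long topologyBytes = 48L * v + 16L * e;          // 2,080,000,000 bytes
long propertyBytes = 8L * e;                     // 800,000,000 bytes (double edge cost)
long totalBytes = topologyBytes + propertyBytes; // 2,880,000,000 bytes = 2.880 GB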

Note that this estimate only refers to the amount of memory that is required to hold the graph in main memory after it has been successfully loaded. Depending on the format, PGX might allocate temporary data structures during loading that will consume additional memory. We therefore recommend that you have at least twice the size of memory available than the sum of all the graph data you plan to load.

How Much Memory Do You Need for Your Partitioned Graph?

For partitioned graphs, the memory required for the entire graph is the sum of the memory required to load the vertices (resp. edges) from all the vertex (resp. edge) providers, including their properties.

number of bytes = SUM(memory_vertex_provider_i) + SUM(memory_edge_provider_j)

The following sections explain how to compute the memory consumption for the vertex and edge providers.

How Much Memory Do You Need to Load a Vertex Provider?

Here is a simple formula to determine the memory required for the vertices loaded from a vertex provider containing V vertices:

number of bytes vertex provider = 32 * V

Assuming 8-byte vertex keys

The formula presented above assumes vertices are identified using 8 byte long keys.

If your vertex provider has properties, you have to add for each property V * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.

How Much Memory Do You Need to Load an Edge Provider?

Edges loaded from each edge provider have their own CSR representation, referring to the vertices loaded from the source and destination vertex providers. For this reason, the memory requirement for edge providers depends on the number of vertices in the source (V_src) and destination (V_dst) vertex providers, as well as on the number of edges in the edge provider (E), as the following formula illustrates.

number of bytes edge provider = 8 * V_src + 8 * V_dst + 16 * E

Assuming no edge keys are loaded

The formula presented above assumes that no edge keys are used. If you load edge keys, add 4 * E or 8 * E bytes, depending on whether Integer or Long keys are used.

If your edge provider has properties, you have to add for each property E * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.
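
The two provider formulas can be combined into the overall partitioned-graph estimate given above. The following is a minimal sketch in plain Java with hypothetical helper names, assuming 8-byte vertex keys and no edge keys; it is not a PGX API.

// Bytes for one vertex provider with V vertices and a given total of per-vertex property bytes.
long vertexProviderBytes(long v, long perVertexPropertyBytes) {
    return 32L * v + perVertexPropertyBytes * v;
}

// Bytes for one edge provider with E edges between source/destination providers holding
// V_src and V_dst vertices, and a given total of per-edge property bytes.
long edgeProviderBytes(long vSrc, long vDst, long e, long perEdgePropertyBytes) {
    return 8L * vSrc + 8L * vDst + 16L * e + perEdgePropertyBytes * e;
}

// Total for the partitioned graph: sum over all vertex providers plus sum over all edge providers.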

On and Off Heap Memory

Although PGX 21.1.1 is mainly a Java application, it stores some of its graph data in off-heap memory, meaning in memory locations outside the control of the Java virtual machine. This is because of Java's 32-bit array-length limitation. PGX 21.1.1 uses off-heap memory to store all vertices, edges and properties (with the exception of the non-primitive string property type).

The breakdown is as follows:

  • Graph indexes and graph topology are stored off-heap
  • All primitive properties (integer, long, double, float, boolean, date, local_date, timestamp, time, point2d) are stored off-heap.
  • String properties are stored on-heap

String Properties and String Pooling

As indicated in the previous section, string properties are stored in on-heap memory by PGX. For a string, the amount of memory required to store it is:

number of bytes string = 64 + 2 * string_length + padding
This calculation includes:

  • 8 bytes for the reference (inside the property array) to the String object
  • 32 bytes for the String object (which points to an array of characters)
  • 24 bytes of header for the character array
  • 2 bytes per character of the string, because of the UTF-16 encoding in Java 8 (only 1 byte per character in Java 11)
  • padding to round the total up to a multiple of 8 bytes (objects need to be aligned on the reference size: 8 bytes)
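
That formula is easy to evaluate directly. The following is a minimal sketch in plain Java (for instance in a jshell session), assuming Java 8 (2 bytes per character) and 8-byte references; the helper name is illustrative, not a PGX API.

// On-heap bytes for one unpooled string property value.
long stringBytes(int length) {
    long raw = 64L + 2L * length;     // reference + String object + char[] header + characters
    long padding = (8 - raw % 8) % 8; // align the total to 8 bytes
    return raw + padding;
}

// stringBytes(10) == 88, stringBytes(7) == 80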

PGX implements a "string pooling" optimization to save memory in case there are just a few repeated values stored in a string property (e.g., a color property storing just a few categorical values: red, green, blue, ...). When PGX applies that technique, the memory consumption of the string properties follows a different rule. In that case the memory consumed can be approximated by the memory consumed by the distinct strings (each individual string still consumes memory as indicated in string memory consumption), plus the memory consumed by all the references to those strings (one per vertex/edge depending on if the property is a vertex or edge property). The memory consumed by a reference in java is generally 4 bytes on 32-bits JVMs and 8 bytes on 64-bits JVMs.

String pooling behavior can be configured at the PGX server level with the PGX runtime fields string_pooling_strategy and max_distinct_strings_per_pool. These settings can be overridden by the user for a given string EDGE/VERTEX property when first loading a graph into memory; an example of how to set string pooling for properties can be found in Handling Graph Config in Application. For more information on the configuration fields, please see the Engine and Runtime Configuration Guide or the Graph Config Guide.

In addition, the runtime configuration field pooling_factor is relevant in the graph mutation context. It defines a factor that guards against cases where string pooling can be ineffective, such as when the number of distinct property values is so large that the cost of the structures used for the string pools adds up. The default value is 0.25, which estimates the lowest pooling factor above which pooling may actually help save memory.
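
Under the approximation described above, the footprint of a pooled string property is the memory for the distinct values plus one reference per vertex or edge. The following minimal sketch reuses the stringBytes helper from the previous fragment; it assumes a 64-bit JVM (8-byte references), and the names are illustrative, not a PGX API.

// Approximate on-heap bytes for a pooled string property.
long pooledStringPropertyBytes(long entityCount, long distinctValues, int averageLength) {
    long distinctBytes = distinctValues * stringBytes(averageLength); // each distinct string stored once
    long referenceBytes = 8L * entityCount;                           // one reference per vertex/edge
    return distinctBytes + referenceBytes;
}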

Example of Graph Size Calculation

Let's assume we have a graph that is composed of:

  • 2 vertex providers:

    • companies: 1,000,000 vertices with properties:
      • name: string
      • number_of_employees: long
    • employees: 100,000,000 vertices with properties:
      • last_name: string
      • first_name: string
      • country: string
      • hired_since: local_date
  • 1 edge provider:

    • works_in: 100,000,000 edges from 'employees' to 'companies' with properties:
      • salary: integer
      • contract_id: string

The length of a contract_id is 10, the average length of the name of a company is 10 and the average length of the last_name of an employee is 7. We will now show how one can estimate on-heap and off-heap memory needed for this graph.

On-heap Graph Size

The string properties are the only part of the graph that is stored on-heap.

  • name (vertex property on companies): the string has many different values (all companies have different names), so the string will not be pooled. Size of one element: 64 + 2 * average_length(name) + padding = 64 + 2 * 10 + 4 = 88 bytes. Size of the property: 88 bytes * 1,000,000 vertices = 88,000,000 bytes.
  • last_name (vertex property on employees): there are many different last names, so the string will not be pooled. Size of one element: 64 + 2 * average_length(last_name) + padding = 64 + 2 * 7 + 2 = 80 bytes. Size of the property: 80 bytes * 100,000,000 vertices = 8,000,000,000 bytes.
  • first_name (vertex property on employees): there are not so many different first names, so the string will be pooled. Size of one element (reference): 8 bytes. Size of the property: 8 bytes * 100,000,000 vertices = 800,000,000 bytes.
  • country (vertex property on employees): there are only around 300 countries, so the string will be pooled. Size of one element (reference): 8 bytes. Size of the property: 8 bytes * 100,000,000 vertices = 800,000,000 bytes.
  • contract_id (edge property on works_in): contract_id values are all unique, so the string will not be pooled. Size of one element: 64 + 2 * average_length(contract_id) + padding = 64 + 2 * 10 + 4 = 88 bytes. Size of the property: 88 bytes * 100,000,000 edges = 8,800,000,000 bytes.

The approximation of the on-heap graph size will be: size = 18,488,000,000 bytes = 17.22 GB.
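
The same total can be checked with the string helper sketched earlier (again plain Java with illustrative names, not a PGX API). For the pooled properties only the references are counted here, since the few distinct pooled values are negligible at this scale.

// On-heap estimate for the example graph's string properties.
long companies = 1_000_000L, employees = 100_000_000L, worksIn = 100_000_000L;
long name       = companies * stringBytes(10); //    88,000,000 bytes (not pooled)
long lastName   = employees * stringBytes(7);  // 8,000,000,000 bytes (not pooled)
long firstName  = employees * 8L;              //   800,000,000 bytes (pooled: references only)
long country    = employees * 8L;              //   800,000,000 bytes (pooled: references only)
long contractId = worksIn * stringBytes(10);   // 8,800,000,000 bytes (not pooled)
long onHeapBytes = name + lastName + firstName + country + contractId; // 18,488,000,000 bytes = 17.22 GB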

Off-heap Graph Size

The other properties of the vertices and edges are stored off-heap:

  • number_of_employees (long, vertex property on companies): 8 bytes per element. Size of the property: 8 bytes * 1,000,000 vertices = 8,000,000 bytes.
  • hired_since (local_date, vertex property on employees): 4 bytes per element. Size of the property: 4 bytes * 100,000,000 vertices = 400,000,000 bytes.
  • salary (integer, edge property on works_in): 4 bytes per element. Size of the property: 4 bytes * 100,000,000 edges = 400,000,000 bytes.

Off-heap memory also stores the topology and key-mapping information of the graph:

  • vertices (refer to dedicated section for an explanation):
    • size = 32 * number_of_all_vertices = 32 * (1,000,000 + 100,000,000) = 3,232,000,000 bytes
  • edges (refer to dedicated section for an explanation):
    • 8 * V_src + 8 * V_dst + 16 * number_of_edges = 8 * (1,000,000 + 100,000,000) + 16 * 100,000,000 = 2,408,000,000 bytes

The off-heap size of the graph will be: size = 8,000,000 + 400,000,000 + 400,000,000 + 3,232,000,000 + 2,408,000,000 = 6,448,000,000 bytes = 6 GB
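
As with the on-heap part, this total follows from the provider formulas given earlier. The following plain Java fragment (illustrative names, not a PGX API) just evaluates them for the example graph, reusing the vertex and edge counts defined in the previous fragment.

// Off-heap estimate for the example graph: fixed-size properties plus topology/key mappings.
long offHeapProperties = 8L * companies   // number_of_employees (long)
                       + 4L * employees   // hired_since (local_date)
                       + 4L * worksIn;    // salary (integer); total 808,000,000 bytes
long vertexTopology = 32L * (companies + employees);                   // 3,232,000,000 bytes
long edgeTopology = 8L * employees + 8L * companies + 16L * worksIn;   // 2,408,000,000 bytes
long offHeapBytes = offHeapProperties + vertexTopology + edgeTopology; // 6,448,000,000 bytes = 6 GB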

What Happens If PGX Runs out of Memory

PGX memory allocation requests can fail for any of the following reasons:

  • The maximum Java heap size is reached and PGX tries to allocate more on-heap memory. Then the underlying JVM will throw an OutOfMemoryError.
  • The maximum PGX off-heap size is reached and PGX tries to allocate more off-heap memory. Then the PGX runtime will throw an OutOfMemoryError.
  • The maximum PGX off-heap size is not yet reached but the underlying OS is running out of memory and PGX tries to allocate more off-heap memory. Then the OS might reject the allocation request, which will result in an OutOfMemoryError being thrown. However, the OS might also simply kill the whole process. We therefore recommend that you always specify an upper off-heap memory allocation limit in order to prevent PGX from trying to allocate more memory than physically available on the current machine, running the risk of the JVM being accidentally shut down by the OS.

If an OutOfMemoryError is thrown while processing a user request (like loading a graph), the request stops and the corresponding Future of the request will be completed exceptionally, having an OutOfMemoryError as cause. The PGX engine will remain fully operational and continue to accept and process other incoming requests. The user can try again later, when more memory becomes available.

If an OutOfMemoryError is thrown on the engine's main thread, PGX will shut down. The engine's main thread is mainly responsible for dispatching incoming user requests to the thread pools. It only allocates small objects, so if those allocations fail, it usually means that the Java heap is completely full. The Java virtual machine is probably spending most of its time in garbage collection cycles and the thread pools are not able to complete any tasks without an OutOfMemoryError. Hence, an OutOfMemoryError on the engine's main thread is treated as a critical error, causing PGX to reject any further incoming requests, clean up and terminate.

In both cases, the OutOfMemoryError is logged accordingly, so administrators can understand what happened.

Defaults and Configuration of Memory Limits

You can configure both on- and off-heap memory limits. If you don't explicitly set either, both maximum on- and off-heap size will default to the maximum on-heap size determined by Java Hotspot, which is based on various factors, including the total amount of physical memory available.

This implies that — by default — the total amount of memory PGX is allowed to allocate is twice the default maximum on-heap size.

Configure On-Heap Limits

On-heap limits can be configured by command-line options.

The available options are:

  • -Xmx: to set the maximum on-heap size of the JVM.
  • -Xms: to set the initial on-heap size of the JVM.
  • -XX:NewSize: to set the initial size of the young generation (refer to the Advanced Configuration section for details)
  • -XX:MaxNewSize: to set the maximum size of the young generation (refer to the Advanced Configuration section for details)

To set these options, pass them to the JVM that runs PGX. For example:

java -Xmx<size_mb>m -Xms<size_mb>m -XX:MaxNewSize=<size_mb>m -XX:NewSize=<size_mb>m

  • If you're using the local PGX shell, set the JAVA_OPTS environment variable before starting the shell, for example:
export JAVA_OPTS="-Xmx<size_gb>g -Xms<size_gb>g -XX:MaxNewSize=<size_gb>g -XX:NewSize=<size_gb>g"
$PGX_HOME/bin/pgx-jshell
  • If you deploy PGX as a web application, consult the documentation of your target web server on how to specify Java command-line options.

Configure Off-Heap Limits

You can specify the off-heap limit by setting the max_off_heap_size field in the PGX config.

Warning: Max Off-Heap Precision

The off-heap limit is not guaranteed to never be exceeded because of rounding and synchronization trade-offs.

The off-heap limit can also be set as a JVM argument with -Dpgx.max_off_heap_size=<size_in_mb>.

Advanced Configuration

This section gives an overview of how PGX uses memory and how it can be configured for best performance.

Memory Requirements to Load a Graph

As explained in the previous section about datatypes, graph properties are stored on-heap for string properties and off-heap for all other datatypes. You can therefore estimate the memory that the graph will occupy. In addition to the memory required to store the topology and the properties, PGX also requires temporary memory during the loading process.

Off-heap

The required off-heap memory is the estimated total of the graph's off-heap properties plus the topology bytes. Allocating more off-heap memory than this will not improve the performance of the PGX server.

On-heap

PGX uses the default JVM garbage collector, Parallel GC. With this garbage collector, the total on-heap memory is split into two categories:

  • the young generation
  • the old generation.

The default setting is 2/3 to the old generation and 1/3 to the young generation.

When loading a graph, PGX instantiates many temporary objects that are discarded after some time. All instantiated objects are allocated in the young generation. When the young generation is full, a minor garbage collection runs: it moves objects that have been alive for some time into the old generation and removes the remaining unused objects. To learn more about garbage collection in Java, refer to the JVM garbage collection documentation.

In order to ensure best performance during loading, the on-heap part of the graph needs to fit into the old generation. This minimizes the number of expensive major GC runs that would occur if the old generation were full.

Therefore, a minimum of 1.05 times the on-heap size of the graph should be allocated to the old generation, while keeping a reasonable amount of memory for the young generation (to avoid too many minor GC runs). The size of the old generation can be chosen by setting the size of the young generation (-XX:NewSize and -XX:MaxNewSize); the remaining memory goes to the old generation.

A suggested minimal amount of memory to load a graph would be:

  • 1.2 * on-heap size of the graph for the total JVM memory
  • 0.15 * on-heap size of the graph for the young generation (so the old generation gets the remaining 1.05)

For a graph estimated at 20 GB on-heap, the configuration would be:

java -Xmx24g -Xms24g -XX:MaxNewSize=3g -XX:NewSize=3g 
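
Those flag values follow directly from the factors above. The following plain Java fragment (illustrative names, not a PGX API) just applies them to the 20 GB estimate.

// Derive suggested JVM sizes from an estimated on-heap graph size.
double onHeapGraphGb = 20.0;
double totalHeapGb = 1.2 * onHeapGraphGb;  // 24 -> -Xmx24g -Xms24g
double youngGenGb  = 0.15 * onHeapGraphGb; //  3 -> -XX:NewSize=3g -XX:MaxNewSize=3g
// The old generation gets the remainder: 24 - 3 = 21 GB >= 1.05 * 20 GB.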

Memory Requirements to Perform Queries / Algorithms on a Graph

Both on-heap and off-heap memory are used when PGQL queries or algorithms are run on a graph. To estimate memory consumption, one can check the amount of off-heap and on-heap memory used while running the workload. To measure the workload's required off-heap memory, one can either check the server logs or gradually increase the off-heap limit setting of the PGX server until the workload succeeds.

Following is an overview of how memory is consumed by PGQL queries.

Off-heap Requirements

The off-heap requirements will be proportional to the size of intermediate results: the more intermediate results are generated, the more memory is consumed. Note that the amount of intermediate results depends on the query and its degree of filtering.

On-heap Requirements

While typical PGQL queries do not use a lot of on-heap memory, GROUP BY and ORDER BY do require it. Grouping requires an amount of on-heap memory proportional to the number of groups, while ordering requires an amount of on-heap memory proportional to the number of results to be sorted.