PGX is an in-memory graph analytics engine optimized for performance. Before PGX runs any analysis, the whole graph and the properties required for the analysis must be loaded into main memory (except for properties offloaded to external stores). The memory PGX consumes for a graph is split between the memory for the topology of the graph (the information describing which vertices and edges the graph contains, without their attached properties) and the memory for the properties attached to the vertices and edges. Internally, PGX stores the graph topology in compressed sparse row (CSR) format, a data structure with a minimal memory footprint that provides very fast read access.
The shared-memory version of PGX supports both the non-partitioned and the partitioned graph model, which differ in how properties are associated with vertices and edges. In the non-partitioned graph model, all vertices (resp. edges) of a PGX graph have the same properties. In that model, if a graph contained vertices representing persons and books, to which we would like to attach the properties 'PersonName' and 'PersonAge', and 'BookISBN' and 'BookGenre' respectively, all four properties would in fact be defined for every vertex in the graph.
In the partitioned graph model, vertex properties (resp. edge properties) are specified per vertex provider (resp. edge provider). For example, using a partitioned graph model it is possible to define the properties `Name` and `Age` only for vertices loaded from a `Person` vertex provider, and the properties `ISBN` and `Genre` only for vertices loaded from a `Book` vertex provider. Using the partitioned graph model can therefore reduce the memory consumed by the vertex and edge properties of a graph, since only the necessary properties are associated with each entity, based on what the entity represents.
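To make the saving concrete, here is a minimal back-of-the-envelope sketch in Java. All counts and per-property sizes are assumptions made up for illustration (string footprints in particular vary, as discussed later); the point is only the difference between the two models.

```java
// Hypothetical comparison of property memory under both graph models.
public class PartitioningSavings {
    public static void main(String[] args) {
        long persons = 1_000_000L, books = 250_000L; // assumed vertex counts

        // Assumed per-vertex property footprints in bytes: 4 for an int age,
        // 80 as a rough average for each string property.
        long personName = 80, personAge = 4, bookIsbn = 80, bookGenre = 80;

        // Non-partitioned model: every vertex carries all four properties.
        long nonPartitioned =
                (persons + books) * (personName + personAge + bookIsbn + bookGenre);

        // Partitioned model: each provider carries only its own properties.
        long partitioned = persons * (personName + personAge)
                         + books * (bookIsbn + bookGenre);

        System.out.printf("non-partitioned: %,d bytes%n", nonPartitioned);
        System.out.printf("partitioned:     %,d bytes%n", partitioned);
    }
}
```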
However, storing the topology of a partitioned graph can require more memory, depending on how the edges are specified in the edge providers: for each edge provider, PGX shared memory requires CSR indices that relate the source and destination vertices of the edges.
Each property associated with a vertex or an edge consumes memory depending on the type of the property. The following table indicates the memory consumption of each property type:
Property Type | Size in Bytes |
---|---|
int | 4 |
float | 4 |
long | 8 |
double | 8 |
boolean | 1 |
date | 8 |
local date | 4 |
time | 4 |
timestamp | 8 |
time with timezone | 8 |
timestamp with timezone | 12 |
point 2d | 16 |
string | variable (refer to the String section) |
Here is a simple formula you can use to estimate whether your graph data fits in memory:
number of bytes = 48 * V + 16 * E
where V is the number of vertices and E is the number of edges in the graph.
Assuming 8-byte vertex keys

The formula above assumes that vertices are identified by 8-byte `long` IDs (see `vertex_id_type` in the graph configuration guide) and that no edge IDs are used. If you load edge keys, add 4 * E or 8 * E bytes, depending on whether `Integer` or `Long` keys are used.
If your graph has properties, for each property you have to add V * type-size or E * type-size to that number, depending on whether it is a vertex or an edge property, where type-size is the size indicated by the table in the Memory consumption of properties section.
Example: a graph with 10M vertices and 100M edges and a single `double` edge cost property consumes at least 48 * 10M + 16 * 100M + 100M * 8 bytes = 2.880 GB of memory.
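As a sanity check, the following Java sketch evaluates the formula for this example (the class and variable names are illustrative; the constants come from the formula and the property-size table above):

```java
// Estimate the in-memory size of a 10M-vertex, 100M-edge graph with one
// double-valued edge property, using the formula above.
public class MemoryEstimate {
    public static void main(String[] args) {
        long v = 10_000_000L;   // number of vertices
        long e = 100_000_000L;  // number of edges
        long doubleSize = 8;    // bytes per double property value

        long bytes = 48 * v + 16 * e + e * doubleSize;
        System.out.printf("estimate: %,d bytes (%.3f GB)%n", bytes, bytes / 1e9);
        // prints: estimate: 2,880,000,000 bytes (2.880 GB)
    }
}
```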
Note that this estimate only refers to the amount of memory required to hold the graph in main memory after it has been successfully loaded. Depending on the format, PGX might allocate temporary data structures during loading that consume additional memory. We therefore recommend having at least twice as much memory available as the sum of all the graph data you plan to load.
For partitioned graphs, the memory required for the entire graph is the sum of the memory required to load the vertices (resp. edges) from all the vertex (resp. edge) providers, including their properties.
number of bytes = SUM(memory_vertex_provider_i) + SUM(memory_edge_provider_j)
The following sections explain how to compute the memory consumption for the vertex and edge providers.
Here is a simple formula to determine the memory required for the vertices loaded from a vertex provider containing V vertices:

number of bytes for a vertex provider = 32 * V
Assuming 8-byte vertex keys

The formula above assumes that vertices are identified by 8-byte `long` keys.
If your vertex provider has properties, for each property you have to add V * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.
Edges loaded from each edge provider have their own CSR representation, referring to the vertices loaded from the source and destination vertex providers. For this reason, the memory requirement for an edge provider depends on the number of vertices in the source (V_src) and destination (V_dst) vertex providers, as well as on the number of edges in the edge provider (E), as the following formula illustrates:

number of bytes for an edge provider = 8 * V_src + 8 * V_dst + 16 * E
Assuming no edge keys are loaded

The formula above assumes that no edge keys are used. If you load edge keys, add 4 * E or 8 * E bytes, depending on whether `Integer` or `Long` keys are used.
If your edge provider has properties, for each property you have to add E * type-size to that number, where type-size is the size indicated by the table in the Memory consumption of properties section.
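Combining the provider formulas, a short Java sketch like the following estimates the topology memory of a partitioned graph. The provider names and counts are illustrative (they match the worked example later in this section), and the constants assume 8-byte vertex keys and no edge keys:

```java
// Estimate the topology memory of a partitioned graph by summing the
// per-provider formulas from this section.
public class PartitionedTopologyEstimate {
    // 32 bytes per vertex, assuming 8-byte vertex keys.
    static long vertexProviderBytes(long v) { return 32 * v; }

    // Per-provider CSR: 8 bytes per source and destination vertex plus
    // 16 bytes per edge, assuming no edge keys are loaded.
    static long edgeProviderBytes(long vSrc, long vDst, long e) {
        return 8 * vSrc + 8 * vDst + 16 * e;
    }

    public static void main(String[] args) {
        long companies = 1_000_000L, employees = 100_000_000L;
        long total = vertexProviderBytes(companies)
                   + vertexProviderBytes(employees)
                   + edgeProviderBytes(employees, companies, 100_000_000L);
        System.out.printf("topology: %,d bytes%n", total); // 5,640,000,000 bytes
    }
}
```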
Although PGX 21.1.1 is mainly a Java application, it stores some of its graph data in off-heap memory, that is, in memory locations outside the control of the Java virtual machine. This is because of Java's 32-bit array-length limitation. PGX 21.1.1 uses off-heap memory to store all vertices, edges, and properties, with the exception of non-primitive property types such as `string`. In other words, the topology and all primitive-typed properties live off-heap, while string properties live on the Java heap.
As indicated in the previous section, string properties are stored in on-heap memory by PGX. The amount of memory required to store a string is:

number of bytes for a string = 64 + 2 * string_length + padding

This calculation accounts for the JVM object headers, the characters themselves (2 bytes each), and the padding needed to align the object in memory.
PGX implements a "string pooling" optimization to save memory when a string property stores only a few repeated values (e.g., a `color` property storing just a few categorical values: `red`, `green`, `blue`, ...). When PGX applies that technique, the memory consumption of the string property follows a different rule: it can be approximated by the memory consumed by the distinct strings (each individual string still consumes memory as indicated in string memory consumption), plus the memory consumed by all the references to those strings (one per vertex or edge, depending on whether the property is a vertex or an edge property). A reference in Java generally consumes 4 bytes on 32-bit JVMs and 8 bytes on 64-bit JVMs.
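A minimal Java sketch of both rules, assuming padding aligns each string to an 8-byte boundary and that a reference costs 8 bytes (64-bit JVM); the counts are illustrative:

```java
// Approximate on-heap cost of a string property, with and without pooling.
public class StringPropertyEstimate {
    // 64 + 2 * length + padding, with padding rounding up to a multiple of 8.
    static long stringBytes(long length) {
        long raw = 64 + 2 * length;
        return (raw + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        long vertices = 100_000_000L; // illustrative vertex count

        // Without pooling: every vertex stores its own string (average length 7).
        long unpooled = vertices * stringBytes(7); // 80 bytes per string

        // With pooling: ~300 distinct values plus one 8-byte reference per vertex.
        long distinct = 300;
        long pooled = distinct * stringBytes(7) + vertices * 8;

        System.out.printf("unpooled: %,d bytes%n", unpooled); // 8,000,000,000
        System.out.printf("pooled:   %,d bytes%n", pooled);   //   800,024,000
    }
}
```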
String pooling behavior can be configured at the PGX server level with the runtime fields `string_pooling_strategy` and `max_distinct_strings_per_pool`. These settings can be overridden per string vertex or edge property when first loading a graph into memory; an example of how to set string pooling for properties can be found in Handling Graph Config in Application.
For more information on the configuration fields, please see the Engine and Runtime Configuration Guide or the Graph Config Guide.
In addition, the runtime configuration field `pooling_factor` is relevant in the context of graph mutation. It defines a factor that guards against cases where string pooling would be ineffective, such as when the number of distinct property values is so large that the cost of the structures used for the string pools adds up. The default value of 0.25 is an estimate of the lowest pooling factor above which pooling may actually help to save memory.
Let's assume we have a graph composed of:

- 2 vertex providers:
  - `companies`: 1,000,000 vertices with properties:
    - `name`: string
    - `number_of_employees`: long
  - `employees`: 100,000,000 vertices with properties:
    - `last_name`: string
    - `first_name`: string
    - `country`: string
    - `hired_since`: local_date
- 1 edge provider:
  - `works_in`: 100,000,000 edges from `employees` to `companies` with properties:
    - `salary`: integer
    - `contract_id`: string

The length of a `contract_id` is 10, the average length of the `name` of a company is 10, and the average length of the `last_name` of an employee is 7.
We will now show how to estimate the on-heap and off-heap memory needed for this graph. The only part of the graph stored on-heap is the string properties.
Property name | Entity name | Entity type | Pooled? | Size of one element | Size of the property |
---|---|---|---|---|---|
`name` | `companies` | vertex | the strings have many different values (all companies have different names), so the property is not pooled | 64 + 2 * average_length(name) + padding = 64 + 2 * 10 + 4 = 88 bytes | size of the string for one vertex * number of these vertices = 88,000,000 bytes |
`last_name` | `employees` | vertex | there are many different last names, so the property is not pooled | 64 + 2 * average_length(last_name) + padding = 64 + 2 * 7 + 2 = 80 bytes | size of the string for one vertex * number of these vertices = 8,000,000,000 bytes |
`first_name` | `employees` | vertex | there are not so many different first names, so the property is pooled | 8 bytes (one reference) | size of one reference * number of these vertices = 800,000,000 bytes |
`country` | `employees` | vertex | there are only around 300 countries, so the property is pooled | 8 bytes (one reference) | size of one reference * number of these vertices = 800,000,000 bytes |
`contract_id` | `works_in` | edge | contract IDs are all unique, so the property is not pooled | 64 + 2 * average_length(contract_id) + padding = 64 + 2 * 10 + 4 = 88 bytes | size of the string for one edge * number of edges = 8,800,000,000 bytes |
The approximation of the on-heap graph size is: size = 18,488,000,000 bytes = 17.22 GB.
The other properties of the vertices and edges are stored off-heap:
Property name | Property type | Entity name | Entity type | Size of one element | Size of the property |
---|---|---|---|---|---|
`number_of_employees` | long | `companies` | vertex | 8 bytes | size of one long * number of these vertices = 8,000,000 bytes |
`hired_since` | local_date | `employees` | vertex | 4 bytes | size of one local_date * number of these vertices = 400,000,000 bytes |
`salary` | integer | `works_in` | edge | 4 bytes | size of one integer * number of these edges = 400,000,000 bytes |
The off-heap memory also stores the topology and key-mapping information of the graph:

vertex providers: size = 32 * number_of_all_vertices = 32 * (1,000,000 + 100,000,000) = 3,232,000,000 bytes

edge provider: size = 8 * V_src + 8 * V_dst + 16 * number_of_edges = 8 * (1,000,000 + 100,000,000) + 16 * 100,000,000 = 2,408,000,000 bytes
The off-heap size of the graph will be:
size = 8,000,000 + 400,000,000 + 400,000,000 + 3,232,000,000 + 2,408,000,000 = 6,448,000,000 bytes = 6 GB
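The arithmetic can be double-checked with a few lines of Java; the GB figures use binary gigabytes (1024³ bytes), matching the totals in the text:

```java
// Verify the on-heap and off-heap totals of the example graph.
public class ExampleGraphTotals {
    public static void main(String[] args) {
        double GB = 1024.0 * 1024 * 1024;

        // On-heap: the five string properties from the first table.
        long onHeap = 88_000_000L + 8_000_000_000L + 800_000_000L
                    + 800_000_000L + 8_800_000_000L;

        // Off-heap: primitive properties, topology, and key mappings.
        long offHeap = 8_000_000L + 400_000_000L + 400_000_000L
                     + 3_232_000_000L + 2_408_000_000L;

        System.out.printf("on-heap:  %,d bytes (%.2f GB)%n", onHeap, onHeap / GB);
        System.out.printf("off-heap: %,d bytes (%.2f GB)%n", offHeap, offHeap / GB);
        // on-heap:  18,488,000,000 bytes (17.22 GB)
        // off-heap:  6,448,000,000 bytes (6.00 GB)
    }
}
```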
PGX memory allocation requests can fail for a couple of reasons:
If an OutOfMemoryError is thrown while processing a user request (such as loading a graph), the request stops and the corresponding Future of the request is completed exceptionally, with an OutOfMemoryError as cause. The PGX engine remains fully operational and continues to accept and process other incoming requests. The user can try again later, when more memory becomes available.
If an OutOfMemoryError is thrown on the engine's main thread, PGX shuts down. The engine's main thread is mainly responsible for dispatching incoming user requests to the thread pools. It only allocates small objects, so if those allocations fail, it usually means that the Java heap is completely full. The Java virtual machine is probably spending most of its time in garbage collection cycles, and the thread pools are not able to complete any tasks without an OutOfMemoryError. Hence, an OutOfMemoryError on the engine's main thread is treated as a critical error, causing PGX to reject any further incoming requests, clean up, and terminate.
In both cases, the OutOfMemoryError is logged accordingly, so administrators can understand what happened.
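For illustration, a client could handle the first case roughly as follows. This is only a sketch under assumptions: it uses the blocking `PgxSession.readGraphWithProperties(GraphConfig)` call of the PGX Java API and assumes the failure surfaces as an `ExecutionException` whose cause is the `OutOfMemoryError`, as described above; `loadOrNull` is a hypothetical helper name:

```java
import java.util.concurrent.ExecutionException;
import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;
import oracle.pgx.config.GraphConfig;

public class GraphLoading {
    // Hypothetical helper: load a graph, treating engine-side memory pressure
    // as a retryable condition rather than a fatal error.
    static PgxGraph loadOrNull(PgxSession session, GraphConfig config)
            throws InterruptedException, ExecutionException {
        try {
            return session.readGraphWithProperties(config);
        } catch (ExecutionException e) {
            if (e.getCause() instanceof OutOfMemoryError) {
                // The engine stays operational; back off and retry later.
                System.err.println("load failed under memory pressure: " + e.getCause());
                return null;
            }
            throw e; // any other failure is not memory-related
        }
    }
}
```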
You can configure both on- and off-heap memory limits. If you don't explicitly set either, both maximum on- and off-heap size will default to the maximum on-heap size determined by Java Hotspot, which is based on various factors, including the total amount of physical memory available.
This implies that — by default — the total amount of memory PGX is allowed to allocate is twice the default maximum on-heap size.
On-heap limits can be configured by command-line options.
The available options are:
- `-Xmx`: sets the maximum on-heap size of the JVM.
- `-Xms`: sets the initial on-heap size of the JVM.
- `-XX:NewSize`: sets the initial size of the young generation (refer to the advanced configuration section for details).
- `-XX:MaxNewSize`: sets the maximum size of the young generation (refer to the advanced configuration section for details).

To set these options:

- On the `java` command line:
  `java -Xmx<size_mb>m -Xms<size_mb>m -XX:MaxNewSize=<size_mb>m -XX:NewSize=<size_mb>m`
- When starting the PGX shell:
  `export JAVA_OPTS="-Xmx<size_gb>g -Xms<size_gb>g -XX:MaxNewSize=<size_gb>g -XX:NewSize=<size_gb>g"`
  `$PGX_HOME/bin/pgx-jshell`
You can specify the off-heap limit by setting the `max_off_heap_size` field in the PGX config.
Warning: Max Off-Heap Precision
Because of rounding and synchronization trade-offs, the off-heap limit is not guaranteed to never be exceeded.
The off-heap limit can also be set via the JVM argument `-Dpgx.max_off_heap_size=<size_in_mb>`.
This section gives an overview of how PGX uses memory and how it can be configured for best performance.

As explained in the previous section about data types, graph properties are stored on-heap for string properties and off-heap for all other data types. You can therefore estimate how much memory a graph will occupy. In addition to the memory required to store the topology and the properties, PGX also requires temporary memory during the loading process.

The amount of off-heap memory needed is the estimated total of the graph's off-heap properties plus the topology bytes. Allocating more off-heap memory than that will not improve the performance of the PGX server.
PGX uses the default JVM garbage collector, Parallel GC. With this garbage collector, the total on-heap memory is split into two categories: the young generation and the old generation. The default split assigns 2/3 to the old generation and 1/3 to the young generation.
When loading a graph, PGX instantiates many temporary objects that are deleted after some time. All instantiated objects are allocated in the young generation. When the young generation is full, a minor garbage collection runs, which promotes objects that have been alive for some time to the old generation and removes the remaining unused objects. To learn more about garbage collection in Java, see this link.
To ensure the best performance during loading, the on-heap part of the graph needs to fit into the old generation. This minimizes the number of expensive major GC runs that would occur if the old generation were full. Therefore, a minimum of 1.05 times the on-heap size of the graph should be allocated to the old generation, while keeping a reasonable amount of memory for the young generation (to avoid too many minor GCs).
The size of the old generation can be chosen by setting the size of the young generation (`-XX:NewSize` and `-XX:MaxNewSize`); the remaining memory goes to the old generation.
A suggested minimal amount of memory to load a graph would be:

- 1.2 * the on-heap size of the graph for the total JVM memory
- 0.15 * the on-heap size of the graph for the young generation (so the old generation gets the remaining 1.05)

For a graph estimated at 20 GB on-heap, the configuration would be:

java -Xmx24g -Xms24g -XX:MaxNewSize=3g -XX:NewSize=3g
Both on-heap and off-heap memory are used when PGQL queries or algorithms are run on a graph. To estimate memory consumption, one can check the amount of off-heap and on-heap memory used while running the workload. To measure the workload's required off-heap memory, one can either check the server logs or gradually increase the off-heap limit setting of the PGX server until the workload succeeds.
Following is an overview of how memory is consumed by PGQL queries.
The off-heap requirements are proportional to the size of intermediate results: the more intermediate results a query generates, the more memory is consumed. Note that the amount of intermediate results depends on the query and its degree of filtering.
While typical PGQL queries do not use much on-heap memory, `GROUP BY` and `ORDER BY` do: grouping requires on-heap memory proportional to the number of groups, while ordering requires on-heap memory proportional to the number of results to be sorted.
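For instance, against the example graph from earlier, a grouping query produces at most roughly 300 groups (one per country), so its grouping state stays small, whereas sorting every employee by salary must hold all result rows on-heap. The sketch below assumes the `PgxGraph.queryPgql` entry point of the PGX Java API; the queries themselves are hypothetical and reuse the provider and property names from the example:

```java
import oracle.pgx.api.PgqlResultSet;
import oracle.pgx.api.PgxGraph;

public class QueryFootprints {
    static void run(PgxGraph graph) throws Exception {
        // At most ~300 groups: on-heap grouping state stays small.
        PgqlResultSet perCountry = graph.queryPgql(
            "SELECT e.country, COUNT(*) AS n FROM MATCH (e:employees) GROUP BY e.country");

        // Sorts all matched rows: on-heap memory proportional to the result size.
        PgqlResultSet bySalary = graph.queryPgql(
            "SELECT e.last_name FROM MATCH (e:employees)-[w:works_in]->(:companies) "
          + "ORDER BY w.salary DESC");
    }
}
```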