Partitioned graphs
The information on this page refers to graph configuration for loading "non-partitioned" graphs. Read the partitioned graph configuration reference for information on partitioned graph configurations.
For loading graph data, PGX requires Graph Configs, i.e. the meta-information about the graph data. A Graph Config includes the following information about the data:
For instance, the following shell snippet loads the graph that is specified in mygraph.json:
pgx> var G = session.readGraphWithProperties("/path/to/mygraph.json", "my-graph")
Note that a Graph Config is typically given as a JSON file. The Java Properties format can be used instead of JSON; check this document for an example.
Loading when in Remote mode
When loading a graph from file in Remote mode (Server-Client), the JSON file containing the graph configuration should be on the client side and the file containing the actual graph data on the server side.
It is also possible to create a Graph Config programmatically. See the related document for details.
Some of the graph formats supported by PGX are partially or fully self-describing. For a subset of those formats, graph configurations can be automatically generated by PGX. For details, take a look at the configuration detection document.
All graph configs have the following JSON fields in common:
Field | Type | Description | Default |
---|---|---|---|
array_compaction_threshold | number | [only relevant if the graph is optimized for updates] threshold used to determine when to compact the delta-logs into a new array. If lower than the engine min_array_compaction_threshold value, min_array_compaction_threshold will be used instead | 0.2 |
attributes | object | additional attributes needed to read/write the graph data | null |
edge_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the edges of this graph. If not specified (or set to null), the strategy will be determined during loading or using a default value | null |
edge_id_type | enum[long] | type of the edge ID. For homogeneous graphs, if not specified (or set to null), it will default to long. | null |
edge_props | array of object | specification of edge properties associated with graph | [] |
error_handling | object | error handling configuration | null |
external_stores | array of object | Specification of the external stores where external string properties reside. | [] |
format | enum[pgb, edge_list, adj_list, graphml, pg, rdf, two_tables] | graph format | null |
keystore_alias | string | alias to the keystore to use when connecting to database | null |
loading | object | loading-specific configuration | null |
local_date_format | array of string | array of local_date formats to use when loading and storing local_date properties. Please see DateTimeFormatter for documentation of the format string | [] |
optimized_for | enum[read, updates] | Indicates if the graph should use data-structures optimized for read-intensive scenarios or for fast updates | read |
partition_while_loading | enum[by_label, no] | Indicates if the graph should be partitioned while loading | null |
password | string | password to use when connecting to database | null |
point2d | string | longitude and latitude as floating point values separated by a space | 0.0 0.0 |
time_format | array of string | the time format to use when loading and storing time properties. Please see DateTimeFormatter for documentation of the format string | [] |
time_with_timezone_format | array of string | the time with timezone format to use when loading and storing time with timezone properties. Please see DateTimeFormatter for documentation of the format string | [] |
timestamp_format | array of string | the timestamp format to use when loading and storing timestamp properties. Please see DateTimeFormatter for documentation of the format string | [] |
timestamp_with_timezone_format | array of string | the timestamp with timezone format to use when loading and storing timestamp with timezone properties. Please see DateTimeFormatter for documentation of the format string | [] |
vector_component_delimiter | character | delimiter for the different components of vector properties | ; |
vertex_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the vertices of this graph. If not specified (or set to null), the strategy will be automatically detected | null |
vertex_id_type | enum[int, integer, long, string] | type of the vertex ID. For homogeneous graphs, if not specified (or set to null), it will default to a specific value (depending on the origin of the data). | null |
vertex_props | array of object | specification of vertex properties associated with graph | [] |
Security warning
Vertex/edge IDs can be part of REST calls, and so they are visible to others. PGX highly recommends designing your graph model not to use any sensitive information as vertex/edge IDs.
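As an illustration of the common fields listed in the table above, a Graph Config might start out like the following sketch. It uses only field names and enum values from that table; the chosen values are arbitrary examples, and the data-source-specific fields (for instance, where the graph data is located) are deliberately left out, so this fragment is not loadable as-is.

```json
{
  "format": "adj_list",
  "vertex_id_type": "long",
  "optimized_for": "read",
  "vector_component_delimiter": ";",
  "vertex_props": [],
  "edge_props": []
}
```

Fields that are omitted fall back to the defaults shown in the table.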
Each entry of vertex_props and edge_props is an object with the following JSON fields:
Field | Type | Description | Default |
---|---|---|---|
name | string | name of property | required |
type | enum[boolean, integer, vertex, edge, float, long, double, string, date, local_date, time, timestamp, time_with_timezone, timestamp_with_timezone, point2d] | type of property (Note: date is deprecated, use one of local_date / time / timestamp / time_with_timezone / timestamp_with_timezone instead). vertex/edge are place-holders for the type specified in vertex_id_type/edge_id_type fields. | required |
aggregate | enum[identity, group_key, min, max, avg, sum, concat, count] | [currently unsupported] which aggregation function to use, aggregation always happens by vertex key | null |
column | value | name or index (starting from 0) of the column holding the property data. If it is not specified, the loader will try to use the property name as column name (for CSV format only) | null |
default | value | default value to be assigned to this property if datasource does not provide it. In case of date type: string is expected to be formatted with yyyy-MM-dd HH:mm:ss . If no default is present (null ), non-existent properties will contain default Java types (primitives) or empty string (string) or 01.01.1970 00:00 (date). | null |
dimension | integer | dimension of property | 0 |
drop_after_loading | boolean | [currently unsupported] indicating helper properties only used for aggregation, which are dropped after loading | false |
field | value | name of the JSON field holding the property data. Nesting is denoted by dot-separation. Field names containing dots are possible; in this case the dots need to be escaped using backslashes to resolve ambiguities. Only the exactly specified objects are loaded; if they do not exist, the default value is used | null |
format | array of string | array of formats of property | [] |
group_key | string | [currently unsupported] can only be used if the property / key is part of the grouping expression | null |
max_distinct_strings_per_pool | integer | [only relevant if string_pooling_strategy is indexed] amount of distinct strings per property after which to stop pooling. If the limit is reached an exception is thrown. If set to null, the default value from the global PGX configuration will be used. | null |
stores | array of object | A list of storage identifiers that indicate where this property resides. | [] |
string_pooling_strategy | enum[indexed, on_heap, none] | which string pooling strategy to use. If set to null, the default value from the global PGX configuration will be used. | null |
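The property fields above could be combined as in the following sketch. The property names (name, age, weight) and their values are hypothetical and chosen only for illustration; the field names and types come from the table above. Only name and type are required, everything else is optional.

```json
{
  "vertex_props": [
    { "name": "name", "type": "string", "default": "unknown" },
    { "name": "age", "type": "integer", "column": 2 }
  ],
  "edge_props": [
    { "name": "weight", "type": "double", "default": 1.0 }
  ]
}
```

Here the age property is read from the column with index 2 (indices start from 0), while the other properties rely on the loader's default column resolution.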
The loading field is a JSON object with the following JSON fields:
Field | Type | Description | Default |
---|---|---|---|
auto_refresh | boolean | if true, the graph is refreshed automatically at periodic intervals. Note: Depending on the global settings, only fixed (pre-loaded) graphs can be auto-refreshed | false |
create_edge_id_index | boolean | if true, an index is prepared during loading which enables retrieval of edge paths | false |
create_edge_id_mapping | boolean | if true, a mapping is prepared during loading which enables edge key arguments and filters containing edge keys | false |
create_label_histogram | boolean | whether a label histogram needs to be generated when the graph is loaded | false |
create_vertex_id_index | boolean | if true, an index is prepared during loading which enables retrieval of vertex paths | true |
create_vertex_id_mapping | boolean | if true, a mapping is prepared during loading which enables vertex arguments and vertex filters | true |
fetch_interval_sec | integer | (only relevant if the format supports delta updates) the interval in which the graph source is queried for changes | -1 |
filter | object | if not null, load the subgraph specified by this filter | null |
filter_strategy | enum[DB, STREAM, POST, AUTO] | the strategy to process the filter | auto |
load_edge_label | boolean | whether or not to load the edge label if it is available | false |
load_vertex_labels | boolean | whether or not to load the vertex label if it is available | false |
loading_progress_reporting_frequency | integer | indicates at what frequency the loading of vertices and edges should be logged. The frequency will be rounded up to the next multiple of 10,000. | 10000000 |
partition_discard_default_values | boolean | [relevant for partition_while_loading] when partition_while_loading is set to by_label and this option is true, the properties that contain only default values are removed from vertex and edge providers. | false |
property_value_delimiter | string | if null, read the whole string value as label. Otherwise, split the string using the specified delimiter and use all values as vertex labels | null |
skip_edges | boolean | whether or not to load the edges | false |
skip_vertices | boolean | whether or not to load the vertices | false |
snapshots_source | enum[REFRESH, CHANGE_SET] | source of graph snapshots: if REFRESH, new snapshots can be created only by reading the graph again via this config (e.g., with `readGraphWithProperties`), or equivalently via auto-refresh if enabled; if CHANGE_SET, new snapshots can be added only via changesets by any session. Note: CHANGE_SET is not compatible with auto-refresh | refresh |
strict_mode | boolean | if true, exceptions are thrown and logged with ERROR level whenever the loader encounters problems with the input file, such as invalid format, repeated keys, missing fields, mismatches and other potential errors. If false, the loader may use less memory during the loading phase, but may behave unexpectedly with erroneous input files | true |
update_interval_sec | integer | the interval in which a new snapshot is created, either by reloading the entire graph or if the format supports delta-updates, out of the cached changes. (only relevant if the format supports delta updates) Set to -1 if you want to disable periodic snapshot creation. Note: one of update_interval_sec and update_threshold must be set | 60 |
update_properties_in_place | boolean | if true, non-structural updates get applied to the graph in-place, else non-structural updates also cause new snapshots of the graph to be created. | false |
update_threshold | integer | (only relevant if the format supports delta updates) the maximum number of changes that are cached before a new snapshot is created. Set to -1 if you want to disable the threshold for snapshot creation. Note: one of update_interval_sec and update_threshold must be set | -1 |
use_vertex_property_value_as_label | string | load the given property as vertex label. Currently only available for loading from PG | null |
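A loading object built from the fields above might look like the following sketch. The values are illustrative assumptions: it asks the loader to read vertex and edge labels, keeps strict mode on, and enables auto-refresh with a snapshot interval of 30 seconds. Whether auto-refresh is actually permitted depends on the global settings, as noted in the table.

```json
{
  "loading": {
    "load_vertex_labels": true,
    "load_edge_label": true,
    "strict_mode": true,
    "auto_refresh": true,
    "update_interval_sec": 30
  }
}
```

Because update_interval_sec is set, the requirement that one of update_interval_sec and update_threshold be set is met even though update_threshold stays at its default of -1.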
The error_handling field is a JSON object with the following JSON fields:
Field | Type | Description | Default |
---|---|---|---|
on_missed_prop_key | enum[silent, log_warn, log_warn_once, error] | what to do when a missing property key is encountered | log_warn_once |
on_missing_vertex | enum[ignore_edge, ignore_edge_log, ignore_edge_log_once, create_vertex, create_vertex_log, create_vertex_log_once, error] | what to do when a source or destination vertex of an edge is not found in a vertex data source. | error |
on_parsing_issue | enum[silent, log_warn, log_warn_once, error] | what to do when the data cannot be parsed correctly. If set to silent, log_warn or log_warn_once, will attempt to continue loading. Some parsing issues may not be recoverable and provoke the end of loading. | error |
on_prop_conversion | enum[silent, log_warn, log_warn_once, error] | what to do when a property type different from the specified one is encountered, but coercion is possible | log_warn_once |
on_type_mismatch | enum[silent, log_warn, log_warn_once, error] | what to do when a property type different from the specified one is encountered, and coercion is not possible | error |
on_vector_length_mismatch | enum[silent, log_warn, log_warn_once, error] | what to do when a vector property does not have the correct dimension | error |
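As a sketch of how these options fit together, the following error_handling object relaxes two of the defaults while leaving type mismatches fatal. The particular choice of values is only an example; every value is taken from the enums in the table above.

```json
{
  "error_handling": {
    "on_missing_vertex": "ignore_edge_log_once",
    "on_parsing_issue": "log_warn",
    "on_type_mismatch": "error"
  }
}
```

With these settings, an edge whose source or destination vertex is missing is skipped and logged once, and parsing issues are logged as warnings while loading attempts to continue. As the table notes, some parsing issues may still be unrecoverable.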
However, each Graph Config may contain additional JSON fields that are specific to the type of the data source. See Loading from Files and Loading from DB for details.
PGX remote limitation
PGX does not support loading graphs from the local file system in the remote use case by default. The allow_local_filesystem engine configuration option can enable this feature at the expense of security. If enabled, directories from which loading should be allowed must be specified with the datasource_dir_whitelist engine configuration option, and permission must be granted to the user / role that needs to load graphs from the file location.
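The two engine options named above might be set as in the following sketch. It assumes that the engine configuration is itself a JSON file and that datasource_dir_whitelist takes an array of directory paths; neither is confirmed on this page, and the path shown is a placeholder. Granting permission to the user or role is a separate step that is not shown.

```json
{
  "allow_local_filesystem": true,
  "datasource_dir_whitelist": ["/path/to/graph/data"]
}
```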
Further details: