PGX 20.1.1
Documentation

Loading Graph Data from Files

PGX supports loading graph data from files in various data formats.

Graph Config for Loading from File

In order to load a graph from supported files, the client needs to set the following additional fields in the Graph Config:

Field | Type | Description | Default
array_compaction_threshold | number | [only relevant if the graph is optimized for updates] threshold used to determine when to compact the delta-logs into a new array. If lower than the engine min_array_compaction_threshold value, min_array_compaction_threshold will be used instead | 0.2
attributes | object | additional attributes needed to read/write the graph data | null
detect_gzip | boolean | enable/disable automatic gzip compression detection when loading graphs | true
edge_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the edges of this graph. If not specified (or set to null), the strategy will be determined during loading or using a default value | null
edge_id_type | enum[long] | type of the edge ID. For homogeneous graphs, if not specified (or set to null), it will default to long. | null
edge_props | array of object | specification of edge properties associated with graph | []
edge_uris | array of string | list of uniform resource identifiers | []
error_handling | object | error handling configuration | null
external_stores | array of object | Specification of the external stores where external string properties reside. | []
format | enum[pgb, edge_list, adj_list, graphml, pg, rdf, two_tables] | graph format | null
header | boolean | first line of file is meant for headers, e.g. 'EdgeId, SourceId, DestId, EdgeProp1, EdgeProp2' | false
keystore_alias | string | alias to the keystore to use when connecting to database | null
loading | object | loading-specific configuration | null
local_date_format | array of string | array of local_date formats to use when loading and storing local_date properties. See DateTimeFormatter for documentation of the format string | []
optimized_for | enum[read, updates] | Indicates if the graph should use data-structures optimized for read-intensive scenarios or for fast updates | read
partition_while_loading | enum[by_label, no] | Indicates if the graph should be partitioned while loading | null
password | string | password to use when connecting to database | null
point2d | string | longitude and latitude as floating point values separated by a space | 0.0 0.0
separator | string | a series of single-character separators for tokenizing. The characters ", {, } and \n cannot be used as separators. Default value is "," for CSV files, and "\t " for other formats. The first character will be used as a separator when storing. | null
storing | object | storing-specific configuration | null
time_format | array of string | the time format to use when loading and storing time properties. See DateTimeFormatter for documentation of the format string | []
time_with_timezone_format | array of string | the time with timezone format to use when loading and storing time with timezone properties. See DateTimeFormatter for documentation of the format string | []
timestamp_format | array of string | the timestamp format to use when loading and storing timestamp properties. See DateTimeFormatter for documentation of the format string | []
timestamp_with_timezone_format | array of string | the timestamp with timezone format to use when loading and storing timestamp with timezone properties. See DateTimeFormatter for documentation of the format string | []
vector_component_delimiter | character | delimiter for the different components of vector properties | ;
vertex_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the vertices of this graph. If not specified (or set to null), the strategy will be automatically detected | null
vertex_id_type | enum[int, integer, long, string] | type of the vertex ID. For homogeneous graphs, if not specified (or set to null), it will default to a specific value (depending on the origin of the data). | null
vertex_props | array of object | specification of vertex properties associated with graph | []
vertex_uris | array of string | list of uniform resource identifiers | []
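
As a rough illustration of how these fields combine, the following sketch shows a Graph Config for a single edge-list file. The file path and the property names (name, weight) are hypothetical. The uris field is the one described under "How to Specify the Path to the File". The name and type keys of each property object are an assumption, since this section does not spell out the layout of a property object.

```json
{
  "format": "edge_list",
  "uris": ["path/to/graph.edge_list"],
  "separator": ",",
  "detect_gzip": true,
  "vertex_id_type": "long",
  "optimized_for": "read",
  "vertex_props": [
    { "name": "name", "type": "string" }
  ],
  "edge_props": [
    { "name": "weight", "type": "double" }
  ]
}
```

Fields that are left out, such as error_handling or loading, keep the defaults listed in the table.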

In the CSV format, the columns that hold the vertex ID, vertex labels, edge ID, edge source ID, edge destination ID and edge label can be configured with the following CSV-specific fields:

Field | Type | Description | Default
array_compaction_threshold | number | [only relevant if the graph is optimized for updates] threshold used to determine when to compact the delta-logs into a new array. If lower than the engine min_array_compaction_threshold value, min_array_compaction_threshold will be used instead | 0.2
attributes | object | additional attributes needed to read/write the graph data | null
detect_gzip | boolean | enable/disable automatic gzip compression detection when loading graphs | true
edge_destination_column | value | name or index (starting from 1) of column corresponding to edge destination (for CSV format only) | null
edge_id_column | value | name or index (starting from 1) of column corresponding to edge id (for CSV format only) | null
edge_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the edges of this graph. If not specified (or set to null), the strategy will be determined during loading or using a default value | null
edge_id_type | enum[long] | type of the edge ID. For homogeneous graphs, if not specified (or set to null), it will default to long. | null
edge_label_column | value | name or index (starting from 1) of column corresponding to edge label (for CSV format only) | null
edge_props | array of object | specification of edge properties associated with graph | []
edge_source_column | value | name or index (starting from 1) of column corresponding to edge source (for CSV format only) | null
error_handling | object | error handling configuration | null
external_stores | array of object | Specification of the external stores where external string properties reside. | []
format | enum[pgb, edge_list, adj_list, graphml, pg, rdf, two_tables] | graph format | null
header | boolean | first line of file is meant for headers, e.g. 'EdgeId, SourceId, DestId, EdgeProp1, EdgeProp2' | false
keystore_alias | string | alias to the keystore to use when connecting to database | null
loading | object | loading-specific configuration | null
local_date_format | array of string | array of local_date formats to use when loading and storing local_date properties. See DateTimeFormatter for documentation of the format string | []
optimized_for | enum[read, updates] | Indicates if the graph should use data-structures optimized for read-intensive scenarios or for fast updates | read
partition_while_loading | enum[by_label, no] | Indicates if the graph should be partitioned while loading | null
password | string | password to use when connecting to database | null
point2d | string | longitude and latitude as floating point values separated by a space | 0.0 0.0
separator | string | a series of single-character separators for tokenizing. The characters ", {, } and \n cannot be used as separators. Default value is "," for CSV files, and "\t " for other formats. The first character will be used as a separator when storing. | null
storing | object | storing-specific configuration | null
time_format | array of string | the time format to use when loading and storing time properties. See DateTimeFormatter for documentation of the format string | []
time_with_timezone_format | array of string | the time with timezone format to use when loading and storing time with timezone properties. See DateTimeFormatter for documentation of the format string | []
timestamp_format | array of string | the timestamp format to use when loading and storing timestamp properties. See DateTimeFormatter for documentation of the format string | []
timestamp_with_timezone_format | array of string | the timestamp with timezone format to use when loading and storing timestamp with timezone properties. See DateTimeFormatter for documentation of the format string | []
vector_component_delimiter | character | delimiter for the different components of vector properties | ;
vertex_id_column | value | name or index (starting from 1) of column corresponding to vertex id (for CSV format only) | null
vertex_id_strategy | enum[no_ids, keys_as_ids, unstable_generated_ids] | Indicates what ID strategy should be used for the vertices of this graph. If not specified (or set to null), the strategy will be automatically detected | null
vertex_id_type | enum[int, integer, long, string] | type of the vertex ID. For homogeneous graphs, if not specified (or set to null), it will default to a specific value (depending on the origin of the data). | null
vertex_labels_column | value | name or index (starting from 1) of column corresponding to vertex labels (for CSV format only) | null
vertex_props | array of object | specification of vertex properties associated with graph | []
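
A minimal sketch of a CSV configuration that selects the ID, label, source and destination columns by header name may make the column fields more concrete. The format value csv is an assumption (the enum listed in the table does not include it), as are the file names and all column names. The vertex_uris and edge_uris fields are the ones described under "How to Specify the Path to the File".

```json
{
  "format": "csv",
  "header": true,
  "separator": ",",
  "vertex_uris": ["vertices.csv"],
  "edge_uris": ["edges.csv"],
  "vertex_id_column": "id",
  "vertex_labels_column": "labels",
  "edge_id_column": "edge_id",
  "edge_source_column": "src",
  "edge_destination_column": "dst",
  "edge_label_column": "label"
}
```

Because each column field accepts either a name or an index starting from 1, the vertex ID column could also be given as "vertex_id_column": 1, which is likely the more natural form for files without a header line.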

How to Specify the Path to the File

For formats that contain vertices and edges specified in one file (e.g. EdgeList), use uris:

{ "uris": ["path/to/file.format"] }

For formats that require separate files for edges and vertices (e.g. FlatFile), use vertex_uris and edge_uris:

{ "vertex_uris": ["vertices1.format", "vertices2.format"], "edge_uris": ["edges1.format", "edges2.format"] }

PGX will parse graphs in most of the plain text formats in parallel if the graph data is split into multiple files, for example:

{ "uris": ["file1.format", "file2.format", ..., "fileN.format"] }

Supported File Systems

PGX supports loading graph configuration files and graph data files over various protocols and virtual file systems. The type of file system or protocol is determined by the scheme of the uniform resource identifier (URI):

  • local file system (file:) - this is also the default if the given URI does not contain any scheme
  • classpath (classpath: or res:)
  • HDFS (hdfs:)
  • HTTPS (https:)
  • FTPS (ftps:)
  • various archive formats (zip:, jar:, tar:, tgz:, tbz2:, gz: and bz2:). The URI format is scheme://arch-file-uri[!absolute-path] (if you would like to use the ! as a literal file-name character it must be escaped using %21). Example: jar:../lib/classes.jar!/META-INF/graph.json. Paths may be nested - for example: tar:gz:https://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/graph.data.
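
To show how a scheme appears inside a Graph Config, the sketch below points uris at a file stored inside a zip archive, following the same pattern as the jar: example in the list above. The archive name, the path inside the archive and the format value are hypothetical.

```json
{
  "format": "edge_list",
  "uris": ["zip:archives/graphs.zip!/graph.edge_list"]
}
```

The part before the ! names the archive file, and the part after it is the absolute path of the data file within that archive.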

Note that relative paths are always resolved relative to the parent directory of the configuration file. See this document for examples.

PGX remote limitation

By default, PGX does not support loading graphs from the local file system in the remote use case. The allow_local_filesystem engine configuration option enables this feature at the expense of security. If it is enabled, the directories from which loading is allowed must be listed in the datasource_dir_whitelist engine configuration option, and permission must be granted to the user or role that needs to load graphs from the file location.
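
A sketch of the corresponding engine configuration entries, assuming the engine configuration is written as JSON like the Graph Config. The directory /data/graphs is a hypothetical location, and the value types (boolean and array of string) are inferred from the description above rather than stated in this section.

```json
{
  "allow_local_filesystem": true,
  "datasource_dir_whitelist": ["/data/graphs"]
}
```

Granting the permission to the user or role is a separate step that is not shown here.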

Loading graphs from remote locations

By default, PGX does not allow loading graphs from remote locations (https, ftps, s3 and hdfs). Administrators can list the locations to enable via the allowed_remote_loading_locations engine configuration option and can also enable all remote locations at the expense of security with ["*"]. Note that this restriction does not apply to pre-loaded graphs, which are loaded from any location regardless of the value of allowed_remote_loading_locations.
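
A sketch of the engine configuration entry that enables all remote locations, assuming the engine configuration is written as JSON like the Graph Config. Only the ["*"] value is shown because it is the only value this section spells out; as noted above, it trades security for convenience.

```json
{
  "allowed_remote_loading_locations": ["*"]
}
```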

As of PGX 19.4.0, ftp and http are no longer supported for loading or storing data, because these protocols are unencrypted and thus highly insecure.
