External stores (beta version)
Please note that this is a beta feature.
An external store can be used to:
Currently, external store support is limited to:
The main reason to offload vertex string properties to an external store is to allow graphs with large string properties to fit into PGX (shared-memory). The modular design allows users to plug in their own custom implementation that connects PGX to the external store where the strings are stored. An additional reason to offload string properties is to leverage specialized capabilities of an external store, such as text search.
The design is centered around a new connection layer between PGX and an external system. The motivation for this design choice is to allow the user to choose where the offloaded string properties are stored. An advantage of this design is that if the data is already in an external store, the user can provide a suitable plugin that implements the API of the connection layer and immediately load a graph in PGX with the external string properties. The general architecture is shown in the following figure.
PGX does not manage the offloaded string properties. The external store must be set up correctly and a suitable plugin must be provided that implements the API of the connection layer. It is important that the vertex identifiers of the offloaded vertex string properties match the vertex identifiers of the corresponding loaded graph in PGX. Another implication is that updates on the graph within PGX are not automatically propagated to the external system; they need to be manually propagated to the external store.
The API of the new connection layer describes how PGX accesses the string properties in an external store. The most important endpoints are fetching a single string or a batch of strings. The possibility of fetching a whole batch of strings allows for some optimizations in PGX to speed up overall performance when accessing external strings. Additionally there are endpoints which enable PGX to use the specialized capabilities of external systems:
Graph configuration in a .json is extended, as explained below, to support new external stores with offloaded graph properties.
Current support is limited to Edge List and CSV file formats.
When PGX reads an offloaded graph property from these file formats, it skips reading its string in the file.
The new field external_stores can be added to the top level of the graph configuration .json; it is a list of the external stores that the graph uses for offloaded graph properties.
Each external store has two required properties: a name and an identifier.
The name is later used in a vertex property description to reference the external store in which the property is stored.
The identifier is used to identify a suitable plugin for the corresponding external store.
An optional property, options, is a freely formatted JSON object that is passed to the plugin on initialization.
{ "external_stores": [{ "name": "es1", "identifier": "plugin1", "options": {} }, { "name": "es2", "identifier": "plugin2", "options": {} }] }
For vertex properties there is a new optional field stores that defines where the property resides.
If it is not provided, the property is loaded into memory.
Otherwise, this field is a list of entries, each naming an external store where the property resides.
The name memory is reserved and can be added to the list to signify that the property should also be loaded in memory.
{ "vertex_props": [{ "name":"name", "type":"string" }, { "name":"emailAddress", "type":"string", "stores": [{ "name": "memory" }, { "name": "es2" }] }, { "name":"researchInterest", "type":"string", "stores": [{ "name": "es1" }] }] }
In the above, the vertex string property name resides in memory.
emailAddress resides both in memory and in the es2 external store.
researchInterest resides only in the es1 external store.
This section outlines the API that a custom plugin needs to implement in order to connect a new external store to PGX. Below is an overview of the interfaces and the classes of the API.
// Classes to implement:
public interface ExternalStore extends AutoCloseable {
  void initialize(Map<String, Object> options);
  ExternalPropertyAccessor getVertexProperty(String propertyName);
  void close();
}

public interface ExternalPropertyAccessor {
  long getBatchSize();
  Object getById(Object vertexId);
  Map<Object, Object> getByIds(Set<Object> ids);
  Set<Object> getIdsByOperation(Operator operator);
  Map<Object, Double> getScoresByOperation(ScoreOperator operator);
}

public interface ExternalStoreService {
  ExternalStore instantiateNewExternalStore();
  String getIdentifier();
}

// Provided classes:
public abstract class Operator {
  Set<Object> filterIds; // optional
}

public class StringLiteralEqualityOperator extends Operator {
  String literal;
}

public abstract class ScoreOperator extends Operator {
  long limit = 0; // optional
  boolean desc = false; // ascending (default) or descending
}

public class TextSearchScoreOperator extends ScoreOperator {
  String literal;
}
A plugin needs to implement the interfaces ExternalStore, ExternalPropertyAccessor and ExternalStoreService.
You can find the above interfaces and classes in the pgx-external_stores_api.jar in PGX's OTN server distribution.
The plugin is identified through the ServiceLoader Java API and the unique identifier returned by the getIdentifier() method of the ExternalStoreService, which should correspond to the identifier field of the corresponding external_stores entry in a graph configuration file.
Please note that in order for the ServiceLoader to correctly identify your plugin, you need to include the fully-qualified name of your implementation class of ExternalStoreService in the resources/META-INF/services/oracle.pgx.externalstores.api.ExternalStoreService file of your plugin .jar.
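The lookup described above can be sketched as follows. This is only an illustration of identifier-based discovery with the standard ServiceLoader API, not PGX's actual loading code. ExternalStoreService is redeclared here as a trimmed local stand-in so that the sketch compiles on its own, and PluginLookup is a hypothetical helper name.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Trimmed local stand-in for the interface shipped in the API jar (assumption:
// only getIdentifier() is needed for this illustration).
interface ExternalStoreService {
  String getIdentifier();
}

// Hypothetical helper showing how a matching plugin could be selected.
final class PluginLookup {
  // Returns the first service whose identifier equals the requested one.
  static Optional<ExternalStoreService> findByIdentifier(
      Iterable<ExternalStoreService> services, String identifier) {
    for (ExternalStoreService service : services) {
      if (identifier.equals(service.getIdentifier())) {
        return Optional.of(service);
      }
    }
    return Optional.empty();
  }

  // ServiceLoader reads the META-INF/services entries of every jar on the
  // classpath and lazily instantiates the listed implementation classes.
  static Optional<ExternalStoreService> findOnClasspath(String identifier) {
    return findByIdentifier(ServiceLoader.load(ExternalStoreService.class), identifier);
  }
}
```

If no jar on the classpath registers a service with the requested identifier, findOnClasspath returns an empty Optional, which is a convenient place to report a misconfigured identifier field.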
ExternalStore represents the actual connection to the storage system and contains an initialize() and a close() method to initialize and close the connection.
The method getVertexProperty(String propertyName) returns an ExternalPropertyAccessor for the corresponding offloaded vertex property.
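As a rough sketch of this lifecycle, the following hypothetical InMemoryStore keeps all values in a map instead of connecting to a real system. The two interfaces are trimmed local stand-ins for the ones in the API jar, and the put() and isOpen() helpers are assumptions added only for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Trimmed local stand-ins for the API interfaces (assumption: one accessor
// method is enough to illustrate the store lifecycle).
interface ExternalPropertyAccessor {
  Object getById(Object vertexId);
}

interface ExternalStore extends AutoCloseable {
  void initialize(Map<String, Object> options);
  ExternalPropertyAccessor getVertexProperty(String propertyName);
  @Override
  void close();
}

// Hypothetical store that holds every property in a nested in-memory map.
final class InMemoryStore implements ExternalStore {
  private final Map<String, Map<Object, Object>> properties = new HashMap<>();
  private boolean open;

  @Override
  public void initialize(Map<String, Object> options) {
    // A real plugin would open its connection here, driven by the options.
    open = true;
  }

  // Helper for populating the sketch; not part of the API.
  void put(String propertyName, Object vertexId, Object value) {
    properties.computeIfAbsent(propertyName, k -> new HashMap<>()).put(vertexId, value);
  }

  @Override
  public ExternalPropertyAccessor getVertexProperty(String propertyName) {
    if (!open) {
      throw new IllegalStateException("store is not initialized");
    }
    Map<Object, Object> values = properties.get(propertyName);
    if (values == null) {
      throw new IllegalArgumentException("unknown property: " + propertyName);
    }
    return values::get; // one accessor per offloaded property
  }

  @Override
  public void close() {
    // A real plugin would release its connection here.
    open = false;
    properties.clear();
  }

  boolean isOpen() {
    return open;
  }
}
```

Because ExternalStore extends AutoCloseable, an implementation like this can also be used in a try-with-resources statement.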
An instance of ExternalPropertyAccessor represents a handler that operates on a single offloaded external vertex property.
This interface contains the operations that are currently supported on external properties and a helper function that returns the batch size which PGX should use when fetching property values in a batch.
The result of getIdsByOperation() is a set containing all the IDs of vertices that have an external property which matches the operation.
The type of these IDs must be the same as the vertex id type specified in the graph configuration file.
Finally, getScoresByOperation() returns a Map with vertex IDs as keys and the corresponding scores as values.
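To make the role of the batch size concrete, here is a minimal sketch of an accessor together with a caller that fetches values in chunks of getBatchSize(). It does not show how PGX itself batches requests. The interface is a trimmed local stand-in, InMemoryPropertyAccessor and BatchedFetch are hypothetical names, and leaving unknown IDs out of the result map is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Trimmed local stand-in for the API interface (fetch methods only).
interface ExternalPropertyAccessor {
  long getBatchSize();
  Object getById(Object vertexId);
  Map<Object, Object> getByIds(Set<Object> ids);
}

// Hypothetical accessor backed by a map instead of an external system.
final class InMemoryPropertyAccessor implements ExternalPropertyAccessor {
  private final Map<Object, Object> values;
  private final long batchSize;
  int batchCalls; // counts getByIds() calls, for demonstration only

  InMemoryPropertyAccessor(Map<Object, Object> values, long batchSize) {
    this.values = values;
    this.batchSize = batchSize;
  }

  @Override
  public long getBatchSize() {
    return batchSize;
  }

  @Override
  public Object getById(Object vertexId) {
    return values.get(vertexId);
  }

  @Override
  public Map<Object, Object> getByIds(Set<Object> ids) {
    batchCalls++;
    Map<Object, Object> result = new HashMap<>();
    for (Object id : ids) {
      if (values.containsKey(id)) { // assumption: unknown IDs are omitted
        result.put(id, values.get(id));
      }
    }
    return result;
  }
}

// Hypothetical caller that splits a list of IDs into batches.
final class BatchedFetch {
  static Map<Object, Object> fetchAll(ExternalPropertyAccessor accessor, List<Object> ids) {
    int batchSize = (int) Math.min(Integer.MAX_VALUE, Math.max(1L, accessor.getBatchSize()));
    Map<Object, Object> result = new HashMap<>();
    for (int from = 0; from < ids.size(); from += batchSize) {
      int to = Math.min(ids.size(), from + batchSize);
      result.putAll(accessor.getByIds(new HashSet<>(ids.subList(from, to))));
    }
    return result;
  }
}
```

With five IDs and a batch size of two, fetchAll issues three getByIds() calls instead of five single lookups, which is the kind of saving batching is meant to provide.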
Additionally, there are a few provided classes.
These are containers that represent an operation which is passed to the plugin to execute in the external system.
The base class Operator has an optional field filterIds.
When this is set, the result of the operation should only contain vertices from this set.
The StringLiteralEqualityOperator represents an exact-match search operation.
The field literal contains the term that should be searched for.
The ScoreOperator class represents a general set of operations that return a score for each hit.
Typically, a user may ask only for the top scores in some order.
Therefore the class also contains the optional fields limit and desc to control which results should be returned.
The TextSearchScoreOperator is a score operator which represents a text search operation.
The implementation of the plugin defines how the text search is done to return the final scores to PGX.
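The following sketch shows one way a plugin might evaluate these operators against in-memory data. The operator classes are local stand-ins mirroring the provided classes, and InMemoryOperations is a hypothetical name. Several details are assumptions rather than documented behavior: the score is a simple case-insensitive occurrence count, a limit of 0 is treated as "no limit", and the result map preserves the requested ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Local stand-ins mirroring the provided operator classes.
abstract class Operator { Set<Object> filterIds; }
class StringLiteralEqualityOperator extends Operator { String literal; }
abstract class ScoreOperator extends Operator { long limit = 0; boolean desc = false; }
class TextSearchScoreOperator extends ScoreOperator { String literal; }

// Hypothetical evaluator for a single string property held in memory.
final class InMemoryOperations {
  private final Map<Object, String> values;

  InMemoryOperations(Map<Object, String> values) {
    this.values = values;
  }

  // A vertex passes when no filter is set or the filter contains its ID.
  private static boolean accepted(Operator op, Object id) {
    return op.filterIds == null || op.filterIds.contains(id);
  }

  Set<Object> getIdsByOperation(Operator operator) {
    if (!(operator instanceof StringLiteralEqualityOperator)) {
      throw new UnsupportedOperationException("unsupported operator");
    }
    String literal = ((StringLiteralEqualityOperator) operator).literal;
    Set<Object> ids = new HashSet<>();
    for (Map.Entry<Object, String> e : values.entrySet()) {
      if (accepted(operator, e.getKey()) && e.getValue().equals(literal)) {
        ids.add(e.getKey());
      }
    }
    return ids;
  }

  Map<Object, Double> getScoresByOperation(ScoreOperator operator) {
    if (!(operator instanceof TextSearchScoreOperator)) {
      throw new UnsupportedOperationException("unsupported operator");
    }
    String needle = ((TextSearchScoreOperator) operator).literal.toLowerCase();
    List<Map.Entry<Object, Double>> hits = new ArrayList<>();
    for (Map.Entry<Object, String> e : values.entrySet()) {
      if (!accepted(operator, e.getKey())) {
        continue;
      }
      int count = countOccurrences(e.getValue().toLowerCase(), needle);
      if (count > 0) {
        hits.add(Map.entry(e.getKey(), Double.valueOf(count)));
      }
    }
    Comparator<Map.Entry<Object, Double>> byScore = Map.Entry.comparingByValue();
    hits.sort(operator.desc ? byScore.reversed() : byScore);
    Map<Object, Double> result = new LinkedHashMap<>();
    for (Map.Entry<Object, Double> hit : hits) {
      if (operator.limit > 0 && result.size() >= operator.limit) {
        break; // assumption: limit 0 means "return all hits"
      }
      result.put(hit.getKey(), hit.getValue());
    }
    return result;
  }

  // Counts non-overlapping occurrences of needle in text.
  private static int countOccurrences(String text, String needle) {
    if (needle.isEmpty()) {
      return 0;
    }
    int count = 0;
    for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
      count++;
    }
    return count;
  }
}
```

A plugin backed by a real search engine would normally delegate both the matching and the scoring to that engine; the counting logic here only stands in for it.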
To use text search, you can use the new text_search function in PGQL as in the following example:
SELECT text_search(v.external_property, 'literal') FROM MATCH(v)
Support of text_search is limited to certain simple cases of PGQL, such as selecting, filtering with, or ordering by a text search score.
The instructions for loading a custom plugin depend on whether you run PGX in local mode or remote mode.
If you run PGX in local mode (local shell mode or local Java mode) you need to add the plugin .jar file to the PGX classpath.
For the local shell mode you can add the .jar to the classpath by setting the CLASSPATH environment variable before starting the shell:
export CLASSPATH=/tmp/plugin.jar
./bin/pgx-jshell
For local Java mode you need to add the plugin .jar to the classpath when you run your user application.
The instructions on how to do this depend on the way you run your Java application.
If you run PGX in remote mode you need to add the generated .jar file to the WEB-INF/lib directory inside the .war file.
The following shell commands illustrate how this is done:
cd shared-memory/server
mkdir -p WEB-INF/lib
cp /path/to/plugin.jar WEB-INF/lib
zip -ur pgx-webapp-21.1.1.war WEB-INF/lib
PGX includes a plugin to connect to Elasticsearch as an external store.
Elasticsearch is a distributed RESTful search and analytics engine; more information can be found on the official web site: https://www.elastic.co/.
Elasticsearch organizes contents in documents that have multiple fields.
One such field should be the vertex id, which has to be unique, while other fields contain the actual graph properties.
These properties must have the same name as in the graph configuration file.
Furthermore, the type of these fields should be keyword; otherwise, it is not guaranteed that exact matches are returned by the plugin.
For the properties where text search should be supported, it is necessary to define a sub field with type text.
Below is an example configuration JSON for Elasticsearch.
{ "mapping": { "properties": { "vid": { "type": "keyword" }, "name": { "type": "keyword", "fields": { "text_sub_field": { "type": "text" } } }, "email": { "type": "keyword", "fields": { "text_sub_field": { "type": "text" } } }, "address": { "type": "keyword", "fields": { "text_sub_field": { "type": "text" } } }, "city_code": { "type": "keyword" } } } }
In the above, vid represents the vertex id.
The text_sub_field is the sub field that should be used for full text search queries.
The city_code field above does not support full text search queries.
Once Elasticsearch is set up, the data has to be loaded.
When the data is loaded, it is possible to load a graph in PGX that references these external properties.
To use the provided Elasticsearch plugin, some options have to be passed in the options field of the external_stores section of the graph configuration file.
For this plugin the options are as follows:
uri specifies where the Elasticsearch instance is running, including the endpoint to use for REST communication.
vertex_id_name denotes the field name of the vertex ids in Elasticsearch.
batch_size indicates how many strings can be fetched in one batch.
text_search_sub_field is the name of the sub field that is used for text searching.
Below is an example of the options to be passed to the plugin for an Elasticsearch server running on the same machine as PGX.
{ "external_stores": [{ "name": "es", "identifier": "elastic-search", "options": { "uri": "https://localhost:8080/graph1", "vertex_id_name": "vid", "batch_size": 1000, "text_search_sub_field": "text_sub_field" } }] }
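For readers writing their own plugin, the sketch below shows how an initialize(Map<String, Object> options) implementation might validate options shaped like the ones above. It is not the code of the bundled Elasticsearch plugin. ElasticOptions is a hypothetical class, and treating all four options as required and batch_size as a positive number are assumptions.

```java
import java.util.Map;

// Hypothetical value class holding the validated plugin options.
final class ElasticOptions {
  final String uri;
  final String vertexIdName;
  final long batchSize;
  final String textSearchSubField;

  private ElasticOptions(String uri, String vertexIdName, long batchSize, String textSearchSubField) {
    this.uri = uri;
    this.vertexIdName = vertexIdName;
    this.batchSize = batchSize;
    this.textSearchSubField = textSearchSubField;
  }

  // Builds the options from the free-form map passed to initialize().
  static ElasticOptions from(Map<String, Object> options) {
    String uri = requireString(options, "uri");
    String vertexIdName = requireString(options, "vertex_id_name");
    String subField = requireString(options, "text_search_sub_field");
    // JSON parsers may deliver numbers as Integer or Long, so accept any Number.
    Object raw = options.get("batch_size");
    if (!(raw instanceof Number)) {
      throw new IllegalArgumentException("batch_size must be a number");
    }
    long batchSize = ((Number) raw).longValue();
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batch_size must be positive");
    }
    return new ElasticOptions(uri, vertexIdName, batchSize, subField);
  }

  private static String requireString(Map<String, Object> options, String key) {
    Object value = options.get(key);
    if (!(value instanceof String) || ((String) value).isEmpty()) {
      throw new IllegalArgumentException("missing or empty option: " + key);
    }
    return (String) value;
  }
}
```

Failing fast on a malformed options object during initialization surfaces configuration mistakes before any property values are requested.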