PGX 20.2.2
Documentation

External stores (beta version)

Please note that this is a beta feature.

An external store can be used to:

  • Offload graph properties to an external store.
  • Enable the use of an external store's specialized operations, e.g., text search.

Currently, external store support is limited to:

  • PGX (shared-memory),
  • offloading vertex string properties,
  • text search of the external vertex string properties.

Introduction

The main reason to offload vertex string properties to an external store is to allow graphs with large string properties to fit into PGX (shared-memory). We introduce a modular design that allows users to plug in their own custom implementation that connects PGX to an external store where the strings are stored. An additional reason to offload string properties is to leverage specialized capabilities of an external store, such as text search.

General Architecture

The design is centered around a new connection layer between PGX and an external system. The motivation for this design choice is to allow the user to choose where the offloaded string properties are stored. An advantage of this design is that if the data is already in an external store, the user can provide a suitable plugin that implements the API of the connection layer and immediately load a graph in PGX with the external string properties. The general architecture is shown in the following figure.

[Figure: general architecture]

PGX does not manage the offloaded string properties. The external store must be set up correctly and a suitable plugin must be provided that implements the API of the connection layer. It is important that the vertex identifiers of the offloaded vertex string properties match the vertex identifiers of the corresponding loaded graph in PGX. Another implication is that updates on the graph within PGX are not automatically propagated to the external system; they need to be manually propagated to the external store.

The API of the new connection layer describes how PGX accesses the string properties in an external store. The most important endpoints fetch a single string or a batch of strings. Fetching a whole batch of strings lets PGX reduce the number of requests to the external store, which speeds up overall performance when accessing external strings. Additionally, there are endpoints that enable PGX to use the specialized capabilities of external systems:

  • Fetching string properties by a certain operation, e.g., exact match.
  • Executing queries that return a score, e.g., text search.
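The batch endpoint described above can be pictured with the following minimal sketch. It is not PGX's internal logic: the class name BatchFetchSketch, the method fetchInBatches and the fetchBatch callback are hypothetical names introduced only for illustration, and the sketch simply assumes that a caller splits the requested vertex IDs into chunks no larger than the batch size and issues one request per chunk.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of batched fetching; not part of the PGX API.
final class BatchFetchSketch {

  // Splits the ids into chunks of at most batchSize and calls fetchBatch once per chunk.
  static Map<Object, Object> fetchInBatches(
      List<?> ids, int batchSize, Function<Set<Object>, Map<Object, Object>> fetchBatch) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    Map<Object, Object> result = new HashMap<>();
    for (int start = 0; start < ids.size(); start += batchSize) {
      int end = Math.min(start + batchSize, ids.size());
      Set<Object> chunk = new LinkedHashSet<Object>(ids.subList(start, end));
      result.putAll(fetchBatch.apply(chunk));
    }
    return result;
  }

  public static void main(String[] args) {
    // A plain map stands in for the external store (assumption for the demo).
    Map<Object, Object> store = Map.of(1L, "alice", 2L, "bob", 3L, "carol");
    int[] roundTrips = {0};
    Map<Object, Object> values = fetchInBatches(List.of(1L, 2L, 3L), 2, chunk -> {
      roundTrips[0]++;
      Map<Object, Object> out = new HashMap<>();
      for (Object id : chunk) {
        out.put(id, store.get(id));
      }
      return out;
    });
    System.out.println(values.size() + " values fetched in " + roundTrips[0] + " round trips");
  }
}
```

With a larger batch size the same set of IDs needs fewer round trips, which is the motivation for exposing a batch endpoint next to the single-string one.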

New Graph Configuration Options

The graph configuration .json file is extended, as explained below, to support external stores with offloaded graph properties. Current support is limited to the Edge List and CSV file formats. When PGX reads an offloaded graph property from these file formats, it skips reading the property's string in the file.

A new field external_stores can be added to the top level of the graph configuration .json. It is a list of the external stores that the graph uses for offloaded graph properties. Each external store has two required properties: a name and an identifier. The name is used later in a vertex property description to reference the external store in which the property is stored. The identifier is used to identify a suitable plugin for the corresponding external store. An optional property options holds a freely formatted JSON object that is passed to the plugin on initialization.

{
  "external_stores": [{
    "name": "es1",
    "identifier": "plugin1",
    "options": {}
  }, {
    "name": "es2",
    "identifier": "plugin2",
    "options": {}
  }]
}

For vertex properties there is a new optional field stores that defines where the property resides. If it is not provided, the property is loaded into memory. Otherwise, this field is a list of the names of the external stores where the property resides. The name memory is reserved and can be added to the list to signify that the property should also be loaded into memory.

{
  "vertex_props": [{
    "name":"name",
    "type":"string"
  }, {
    "name":"emailAddress",
    "type":"string",
    "stores": [{
      "name": "memory"
    }, {
      "name": "es2"
    }]
  }, {
    "name":"researchInterest",
    "type":"string",
    "stores": [{
      "name": "es1"
    }]
  }]
}

In the above, the vertex string property name resides in memory. emailAddress resides both in memory and in the es2 external store. researchInterest resides only in the es1 external store.
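The rule for the stores field can be restated as a small sketch. This is only an illustration of the semantics described above, not code from PGX: the class PropertyStoresSketch and the method resolveStores are hypothetical, and the configuration entries are modeled as plain Java maps rather than parsed from a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper that mirrors the documented meaning of the "stores" field.
final class PropertyStoresSketch {
  static final String MEMORY = "memory";

  // Returns the store names for one vertex property entry.
  // A missing "stores" field means the property is loaded into memory only.
  @SuppressWarnings("unchecked")
  static List<String> resolveStores(Map<String, Object> vertexProp) {
    Object stores = vertexProp.get("stores");
    if (stores == null) {
      return List.of(MEMORY);
    }
    List<String> names = new ArrayList<>();
    for (Map<String, Object> entry : (List<Map<String, Object>>) stores) {
      names.add((String) entry.get("name"));
    }
    return names;
  }

  public static void main(String[] args) {
    Map<String, Object> name = Map.of("name", "name", "type", "string");
    Map<String, Object> email = Map.of(
        "name", "emailAddress", "type", "string",
        "stores", List.of(Map.of("name", "memory"), Map.of("name", "es2")));
    Map<String, Object> interest = Map.of(
        "name", "researchInterest", "type", "string",
        "stores", List.of(Map.of("name", "es1")));
    for (Map<String, Object> prop : List.of(name, email, interest)) {
      System.out.println(prop.get("name") + " resides in " + resolveStores(prop));
    }
  }
}
```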

Public API for Custom Plugins

This section outlines the API that a custom plugin needs to implement in order to connect a new external store to PGX. Below is an overview of the interfaces and the classes of the API.

// Classes to implement:

public interface ExternalStore extends AutoCloseable {
  void initialize(Map<String, Object> options);
  ExternalPropertyAccessor getVertexProperty(String propertyName);
  void close();
}

public interface ExternalPropertyAccessor {
  long getBatchSize();
  Object getById(Object vertexId);
  Map<Object, Object> getByIds(Set<Object> ids);
  Set<Object> getIdsByOperation(Operator operator);
  Map<Object, Double> getScoresByOperation(ScoreOperator operator);
}

public interface ExternalStoreService {
  ExternalStore instantiateNewExternalStore();
  String getIdentifier();
}

// Provided classes:

public abstract class Operator {
  Set<Object> filterIds;     // optional
}

public class StringLiteralEqualityOperator extends Operator {
  String literal;
}

public abstract class ScoreOperator extends Operator {
  long limit = 0;            // optional
  boolean desc = false;      // ascending (default) or descending
}

public class TextSearchScoreOperator extends ScoreOperator {
  String literal;
}

A plugin needs to implement the interfaces ExternalStore, ExternalPropertyAccessor and ExternalStoreService. You can find the above interfaces and classes in the pgx-external_stores_api.jar in PGX's OTN server distribution. PGX discovers the plugin through the Java ServiceLoader API and selects it by the unique identifier returned by the getIdentifier() method of ExternalStoreService, which should correspond to the identifier field of the corresponding external_stores entry in the graph configuration file. For the ServiceLoader to correctly identify your plugin, you need to include the fully-qualified name of your ExternalStoreService implementation class in the resources/META-INF/services/oracle.pgx.externalstores.api.ExternalStoreService file of your plugin .jar.
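The lookup by identifier can be sketched as follows. How PGX performs the lookup internally is not documented here, so this is an assumption-laden illustration: the nested ExternalStoreService interface is a local stand-in that only declares getIdentifier(), and PluginLookupSketch and findByIdentifier are hypothetical names. The only real API used is java.util.ServiceLoader.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical illustration of selecting a ServiceLoader provider by identifier.
final class PluginLookupSketch {

  // Local stand-in for the ExternalStoreService interface from the API overview.
  interface ExternalStoreService {
    String getIdentifier();
  }

  // Returns the first service whose identifier matches the configured one.
  static Optional<ExternalStoreService> findByIdentifier(
      Iterable<ExternalStoreService> services, String identifier) {
    for (ExternalStoreService service : services) {
      if (identifier.equals(service.getIdentifier())) {
        return Optional.of(service);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // ServiceLoader only finds providers that are listed in a META-INF/services file
    // on the classpath; without such a file the loader is simply empty.
    ServiceLoader<ExternalStoreService> loader = ServiceLoader.load(ExternalStoreService.class);
    Optional<ExternalStoreService> match = findByIdentifier(loader, "plugin1");
    System.out.println(match.isPresent()
        ? "found provider for plugin1"
        : "no provider registered for plugin1");
  }
}
```

For a real plugin, the services file would contain a single line with the implementation's fully-qualified class name, for example a hypothetical com.example.MyStoreService.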

ExternalStore represents the actual connection to the storage system and provides the initialize() and close() methods to open and close that connection. The method getVertexProperty(String propertyName) returns an ExternalPropertyAccessor for the corresponding offloaded vertex property. An instance of ExternalPropertyAccessor represents a handler that operates on a single offloaded external vertex property. This interface contains the operations that are currently supported on external properties and a helper function that returns the batch size which PGX should use when fetching property values in a batch. The result of getIdsByOperation() is a set containing the IDs of all vertices that have an external property which matches the operation. The type of these IDs must be the same as the vertex id type specified in the graph configuration file. Finally, getScoresByOperation() returns a Map with vertex IDs as keys and the corresponding scores as values.
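To make the accessor contract more concrete, here is a minimal sketch of an accessor backed by an in-memory map instead of a real external store. The nested types are local stand-ins that copy the shapes from the API overview so the example compiles on its own; a real plugin would implement the types from pgx-external_stores_api.jar instead. MapBackedAccessor, its batch size of 1000 and the choice to omit missing IDs from getByIds() are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory accessor; stand-in types mirror the API overview.
final class InMemoryAccessorSketch {

  abstract static class Operator {
    Set<Object> filterIds; // optional
  }

  static class StringLiteralEqualityOperator extends Operator {
    String literal;
  }

  interface ExternalPropertyAccessor {
    long getBatchSize();
    Object getById(Object vertexId);
    Map<Object, Object> getByIds(Set<Object> ids);
    Set<Object> getIdsByOperation(Operator operator);
  }

  static final class MapBackedAccessor implements ExternalPropertyAccessor {
    private final Map<Object, String> values;

    MapBackedAccessor(Map<Object, String> values) {
      this.values = new HashMap<>(values);
    }

    @Override public long getBatchSize() {
      return 1000; // arbitrary value chosen for the sketch
    }

    @Override public Object getById(Object vertexId) {
      return values.get(vertexId);
    }

    @Override public Map<Object, Object> getByIds(Set<Object> ids) {
      Map<Object, Object> out = new HashMap<>();
      for (Object id : ids) {
        if (values.containsKey(id)) {
          out.put(id, values.get(id)); // assumption: unknown ids are left out
        }
      }
      return out;
    }

    @Override public Set<Object> getIdsByOperation(Operator operator) {
      if (!(operator instanceof StringLiteralEqualityOperator)) {
        throw new UnsupportedOperationException("only exact match is sketched here");
      }
      String literal = ((StringLiteralEqualityOperator) operator).literal;
      Set<Object> hits = new HashSet<>();
      for (Map.Entry<Object, String> e : values.entrySet()) {
        boolean allowed = operator.filterIds == null || operator.filterIds.contains(e.getKey());
        if (allowed && e.getValue().equals(literal)) {
          hits.add(e.getKey());
        }
      }
      return hits;
    }
  }

  public static void main(String[] args) {
    // Vertex ids are Long here; they must match the id type of the loaded graph.
    MapBackedAccessor accessor =
        new MapBackedAccessor(Map.of(1L, "graphs", 2L, "databases", 3L, "graphs"));
    StringLiteralEqualityOperator op = new StringLiteralEqualityOperator();
    op.literal = "graphs";
    System.out.println("exact matches: " + accessor.getIdsByOperation(op));
    System.out.println("batch lookup: " + accessor.getByIds(Set.of(1L, 2L)));
  }
}
```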

Additionally, there are a few provided classes. These are containers that represent an operation which is passed to the plugin to execute in the external system. The base class Operator has an optional field filterIds. When this is set, the result of the operation should only contain vertices from this set. The StringLiteralEqualityOperator represents a search operation. The field literal contains the term that should be searched for. The ScoreOperator class represents a general set of operations that return a score for each hit. Typically, a user may ask only for the top scores in some order. Therefore the class also contains the optional fields limit and desc to control which results should be returned. The TextSearchScoreOperator is a score operator which represents a text search operation. The implementation of the plugin defines how the text search is done to return the final scores to PGX. To use text search, you can use the new text_search in PGQL as in the following example:

SELECT text_search(v.external_property, 'literal') FROM MATCH(v)

Support of text_search is limited to certain simple cases of PGQL, such as selecting, filtering with, or ordering by a text search score.
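The limit and desc fields of ScoreOperator, described earlier in this section, could be honored by a plugin along the lines of the sketch below. The exact semantics are not spelled out in this document, so the sketch assumes that a limit of 0 means "no limit" and that returning the entries in score order is acceptable; ScoreLimitSketch and topScores are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing of scores according to limit and desc.
final class ScoreLimitSketch {

  // Sorts by score (ascending unless desc is set) and keeps at most `limit` entries.
  // Assumption: limit == 0 means that all scores are returned.
  static Map<Object, Double> topScores(Map<Object, Double> scores, long limit, boolean desc) {
    Comparator<Map.Entry<Object, Double>> byScore = Map.Entry.comparingByValue();
    if (desc) {
      byScore = byScore.reversed();
    }
    List<Map.Entry<Object, Double>> entries = new ArrayList<>(scores.entrySet());
    entries.sort(byScore);
    Map<Object, Double> out = new LinkedHashMap<>(); // keeps the sorted order
    for (Map.Entry<Object, Double> e : entries) {
      if (limit > 0 && out.size() >= limit) {
        break;
      }
      out.put(e.getKey(), e.getValue());
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Object, Double> scores = Map.of(1L, 0.2, 2L, 0.9, 3L, 0.5);
    System.out.println("top 2 descending: " + topScores(scores, 2, true));
    System.out.println("all ascending: " + topScores(scores, 0, false));
  }
}
```

A store such as Elasticsearch can usually apply the ordering and the limit on the server side, which avoids transferring scores that would be discarded anyway.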

Loading a Custom Plugin

The instructions for loading a custom plugin depend on whether you run PGX in local mode or remote mode.

Local Mode

If you run PGX in local mode (local shell mode or local Java mode), you need to add the plugin .jar file to the PGX classpath. For the local shell mode, you can add the .jar to the classpath by setting the CLASSPATH environment variable before starting the shell:

export CLASSPATH=/tmp/plugin.jar
./bin/pgx-jshell

For local Java mode you need to add the plugin .jar to the classpath when you run your user application. The instructions on how to do this depend on the way you run your Java application.

Remote Mode

If you run PGX in remote mode, you need to add the plugin .jar file to the WEB-INF/lib directory inside the .war file. The following shell commands illustrate how this is done:

cd shared-memory/server
mkdir -p WEB-INF/lib
cp /path/to/plugin.jar WEB-INF/lib
zip -ur pgx-webapp-20.2.2.war WEB-INF/lib

Elasticsearch Plugin

PGX includes a plugin to connect to Elasticsearch as an external store. Elasticsearch is a distributed RESTful search and analytics engine; more information can be found on the official website: https://www.elastic.co/. Elasticsearch organizes content in documents that have multiple fields. One such field should be the vertex id, which has to be unique, while the other fields contain the actual graph properties. These fields must have the same names as the properties in the graph configuration file. Furthermore, the type of these fields should be keyword; otherwise, the plugin is not guaranteed to return exact matches. For the properties where text search should be supported, it is necessary to define a sub field with type text. Below is an example configuration JSON for Elasticsearch.

{
  "mapping": {
    "properties": {
      "vid": {
        "type": "keyword"
      },
      "name": {
        "type": "keyword",
        "fields": {
          "text_sub_field": {
            "type": "text"
          }
        }
      },
      "email": {
        "type": "keyword",
        "fields": {
          "text_sub_field": {
            "type": "text"
          }
        }
      },
      "address": {
        "type": "keyword",
        "fields": {
          "text_sub_field": {
            "type": "text"
          }
        }
      },
      "city_code": {
        "type": "keyword"
      }
    }
  }
}

In the above, vid represents the vertex id. The text_sub_field is the sub field that should be used for full text search queries. The city_code field above does not support full text search queries.

Once Elasticsearch is set up, the data has to be loaded. When the data is loaded, it is possible to load a graph in PGX that references these external properties. To use the provided Elasticsearch plugin there are some options that have to be passed in the options field of the external_stores section of the graph configuration file. For this plugin the options are as follows:

  • uri specifies where the Elasticsearch instance is running, including the endpoint to use for REST communication.
  • vertex_id_name denotes the field name of the vertex ids in Elasticsearch.
  • batch_size indicates how many strings can be fetched in one batch.
  • text_search_sub_field is the name of the sub field that is used for text searching.

Below is an example of the options to be passed to the plugin for an Elasticsearch server running on the same machine as PGX.

{
  "external_stores": [{
    "name": "es",
    "identifier": "elastic-search",
    "options": {
      "uri": "https://localhost:8080/graph1",
      "vertex_id_name": "vid",
      "batch_size": 1000,
      "text_search_sub_field": "text_sub_field"
    }
  }]
}