Auto-refreshing graphs

This guide describes how to auto-refresh graphs at a periodic interval to keep the in-memory graph in sync with the data source. It shows:

How to configure the PGX server to allow every user to set up auto-refresh
How to set up auto-refresh for a graph
How to reference the snapshots created from auto-refresh
How auto-refresh works with data sources that support delta updates (a.k.a. “delta-refresh”)

Warning

The interval the user sets for periodic refresh in the graph configuration should be understood as a best-effort hint to PGX, which cannot guarantee that auto-refresh occurs at exactly the requested frequency.

When setting the periodic interval for auto-refresh (either update_interval_sec for basic auto-refresh or fetch_interval_sec for delta-refresh), consider factors such as the I/O characteristics of the data source, the size of the updates, and the capabilities of the machine PGX runs on. If PGX cannot gather and apply updates within the specified interval, the actual auto-refresh interval will be longer. Furthermore, the PGX administrator can enforce a lower bound on the refresh interval (see the following section), in which case shorter interval values are ignored.

Configuring the PGX server

Because auto-refresh can create many snapshots and therefore lead to high memory usage, the option to enable auto-refresh for graphs is by default available only to administrators. To allow all users to auto-refresh graphs, add the following field to your pgx.conf:

{
  "allow_user_auto_refresh": true
}

Because users can set custom auto-refresh intervals, the PGX server can become overloaded by overly frequent auto-refresh activity. To limit this activity, you can set lower bounds on the auto-refresh interval in pgx.conf, with separate thresholds for basic auto-refresh (min_update_interval_sec) and for delta-refresh (min_fetch_interval_sec), as in the following example:

{
  "min_update_interval_sec": 10,
  "min_fetch_interval_sec": 5
}

Any refresh interval lower than these thresholds is ignored, and the threshold value is enforced instead.
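The rule above amounts to taking the larger of the requested interval and the configured threshold. The following sketch only illustrates that rule; it is not PGX code, and the function name effective_interval_sec is made up for this example.

```python
def effective_interval_sec(requested_sec, minimum_sec):
    """Return the interval that would be enforced for a requested value."""
    # Requests below the administrator's threshold are replaced by the threshold.
    return max(requested_sec, minimum_sec)


# Thresholds taken from the pgx.conf example above.
MIN_UPDATE_INTERVAL_SEC = 10
MIN_FETCH_INTERVAL_SEC = 5

# A 60-second update interval is above the threshold and is kept as requested.
print(effective_interval_sec(60, MIN_UPDATE_INTERVAL_SEC))
# A 2-second fetch interval is below the threshold, so 5 seconds is enforced.
print(effective_interval_sec(2, MIN_FETCH_INTERVAL_SEC))
```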

Configuring basic auto-refresh

Auto-refresh is configured in the loading section of the graph config. We will set up auto-refresh to check for updates every minute and to create a new snapshot when the data source has changed. The following block shows how to enable the auto-refresh feature in the configuration file of the sample graph.

conf = '''
{
    "format": "csv",
    "vertex_uris": ["sample.vertices.csv"],
    "edge_uris": ["sample.edges.csv"],
    "vertex_props": [{
        "name": "prop",
        "type": "integer"
    }],
    "edge_props": [{
        "name": "cost",
        "type": "double"
    }],
    "loading": {
        "auto_refresh": true,
        "update_interval_sec": 60
    }
}
'''

Notice the additional loading section containing the auto-refresh settings.
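If the graph configuration is built in code rather than written by hand, the loading section can be added to an existing configuration before it is serialized. The sketch below uses only the standard json module; the helper with_auto_refresh is a hypothetical name introduced for this example, and the keys are the ones shown in the configuration above.

```python
import json


def with_auto_refresh(graph_config, update_interval_sec):
    """Return a copy of graph_config with auto-refresh enabled."""
    updated = dict(graph_config)
    # Keep any loading settings already present and add the auto-refresh ones.
    loading = dict(updated.get("loading", {}))
    loading["auto_refresh"] = True
    loading["update_interval_sec"] = update_interval_sec
    updated["loading"] = loading
    return updated


base_config = {
    "format": "csv",
    "vertex_uris": ["sample.vertices.csv"],
    "edge_uris": ["sample.edges.csv"],
    "vertex_props": [{"name": "prop", "type": "integer"}],
    "edge_props": [{"name": "cost", "type": "double"}],
}

# Serialize the configuration with a 60-second update interval.
conf = json.dumps(with_auto_refresh(base_config, 60), indent=4)
print(conf)
```

The helper copies the input instead of modifying it, so the same base configuration can be reused with and without auto-refresh.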

Reading the graph using the Python client

After you have modified the graph config, you can load the graph into PGX. Once the graph is loaded, PGX automatically starts a background task that periodically checks the data source for updates.

import pypgx

session = pypgx.get_session(session_name="my-session")
# cfg stands for the graph configuration shown above
g = session.read_graph_with_properties(cfg)

Checking out the newest version of the graph

The data source is queried every minute for updates. If the data source changed, the graph is reloaded and a new snapshot is created automatically.

Let us try this out by editing the vertices and edges files to add one vertex and one edge. For example, add the vertex “42” with property “7” and an edge from “42” to “333” with property “10.0”. To do this, add the line 42,7 at the end of examples/graphs/sample.vertices.csv, and the line 42,333,10.0 at the end of examples/graphs/sample.edges.csv.

If you wait for one minute, a new snapshot will be created and placed in the snapshot cache for the graph automatically.

You can check the available snapshots of the graph using the PgxSession.get_available_snapshots() method.

snapshots = session.get_available_snapshots(g)

After one minute has passed, the list should contain two entries: one for the originally loaded graph with 4 vertices and 4 edges, and one for the graph created by auto-refresh with 5 vertices and 5 edges.

To check out the latest snapshot (or any available snapshot), you can use the PgxSession.set_snapshot() method. The constant PgxSession.LATEST_SNAPSHOT is provided to check out the latest snapshot of a graph, as in the following example.

session.set_snapshot(g, creation_timestamp=PgxSession.LATEST_SNAPSHOT)

Similarly, you can check out any other version by calling set_snapshot() with the creation_timestamp of the desired snapshot, as in the following example.

# you can check out any other version by calling set_snapshot()
# with the creation_timestamp of the desired snapshot
session.set_snapshot(
    g, creation_timestamp=snapshots[0].get_creation_timestamp())

Note that this last call has the same effect as the previous one, since the passed timestamp (1453315122685) is that of the latest snapshot.

Warning

Updates to file sources should happen between PGX auto-refresh periods, because data corruption can occur if the source files are updated during an auto-refresh. This is a temporary limitation. To avoid data races between PGX’s auto-refresh mechanism and updates to the source files, keep the time during which the source files are being written as short as possible, for example by writing the updates to temporary files and then renaming the temporary files to the original file names.
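The write-then-rename approach mentioned above can be sketched with the Python standard library. This is a minimal illustration, not a PGX utility: append_lines_atomically is a hypothetical helper, and it assumes the source file is small enough to copy in full and that the temporary file can be created in the same directory.

```python
import os
import shutil
import tempfile


def append_lines_atomically(path, new_lines):
    """Append new_lines to path by writing a temporary copy and renaming it."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding="utf-8") as source:
        content = source.read()
    # Make sure the existing content ends with a newline before appending.
    if content and not content.endswith("\n"):
        content += "\n"
    # Create the temporary file in the same directory so that the rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            for line in new_lines:
                tmp.write(line + "\n")
        # Keep the permission bits of the original file.
        shutil.copymode(path, tmp_path)
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With the sample files from this guide, the vertex and edge from the earlier example could be added by calling append_lines_atomically("examples/graphs/sample.vertices.csv", ["42,7"]) and append_lines_atomically("examples/graphs/sample.edges.csv", ["42,333,10.0"]). Each file is replaced in one step, but the two files are still updated one after the other, so the updates should still be scheduled between refresh periods.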

Auto-refresh with delta update (“delta-refresh”)

Some data sources support delta updates, which means that the data source automatically keeps track of the changes made to it. For example, a graph loaded from an RDBMS that supports the Oracle Property Graph schema supports delta updates. If the data source supports delta updates, the auto-refresh mechanism does not reload the whole graph; it loads only the deltas since the last update and applies them to the latest snapshot in the cache to form a new snapshot.

Differences to normal auto-refresh

Delta-refresh provides two timers: one for fetching and caching the deltas from the data source, and another for applying the deltas and creating a new snapshot.

Additionally, you can specify a threshold for the number of cached changes. If the number of cached changes grows above this threshold, a new snapshot is created as well. The number of cached changes is the number of vertex changes plus the number of edge changes.

The deltas are fetched periodically and cached on the PGX server for two reasons:

Speeding up the actual snapshot creation process.
Accounting for data-sources that “forget” changes after a while.

You can specify both a threshold and an update timer, in which case both conditions are checked and either one can trigger the creation of a new snapshot. At least one of these parameters (threshold or update timer) must be specified to prevent the delta cache from becoming too large. The interval at which the source is queried for changes (fetch_interval_sec) must not be omitted.
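The interaction between the fetch timer, the update timer, and the threshold can be pictured with a small toy model. This is not how PGX is implemented: the class DeltaRefreshModel and its methods are invented for illustration, the update timer is only checked at fetch time, a snapshot is assumed to be created only when changes are cached, and the threshold is assumed to trigger once the cached count reaches it.

```python
class DeltaRefreshModel:
    """Toy model of the update timer and the change threshold."""

    def __init__(self, fetch_interval_sec, update_interval_sec=None,
                 update_threshold=None):
        # At least one snapshot trigger is needed to bound the delta cache.
        if update_interval_sec is None and update_threshold is None:
            raise ValueError("specify update_interval_sec, update_threshold, or both")
        self.fetch_interval_sec = fetch_interval_sec
        self.update_interval_sec = update_interval_sec
        self.update_threshold = update_threshold
        self.cached_changes = 0
        self.last_snapshot_at = 0

    def fetch(self, now_sec, vertex_changes, edge_changes):
        """Cache fetched deltas; return True if a new snapshot is created."""
        # Cached changes are vertex changes plus edge changes.
        self.cached_changes += vertex_changes + edge_changes
        timer_due = (self.update_interval_sec is not None
                     and now_sec - self.last_snapshot_at >= self.update_interval_sec)
        # Assumption: the threshold triggers when the count reaches it.
        threshold_hit = (self.update_threshold is not None
                         and self.cached_changes >= self.update_threshold)
        if self.cached_changes > 0 and (timer_due or threshold_hit):
            self.cached_changes = 0
            self.last_snapshot_at = now_sec
            return True
        return False


model = DeltaRefreshModel(fetch_interval_sec=300, update_interval_sec=1200,
                          update_threshold=1000)
# (time in seconds, vertex changes, edge changes) for four fetches.
for now_sec, v, e in [(300, 60, 40), (600, 500, 450), (900, 5, 5), (1800, 5, 5)]:
    print(now_sec, model.fetch(now_sec, v, e))
```

In this run, the second fetch creates a snapshot because the cache reaches the threshold, and the last fetch creates one because the update interval has elapsed since the previous snapshot.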

Delta-refresh requirements

In order to use the delta-refresh mechanism, the graph needs to have the vertex IDs and edge IDs loaded from the data-source. The vertex / edge IDs are necessary in order to correctly identify the modified vertices and edges during the delta-refresh process.

By default, PGX loads the vertex keys into memory as vertex IDs, while retaining edge IDs is disabled. Therefore, the graph configuration has to be modified to specify that the edge keys from the data source should be loaded and used as edge IDs. To do so, both the create_edge_id_index and create_edge_id_mapping flags must be set to true, as illustrated in the example below.

Example configuration

The following configuration queries the data source for new deltas every 5 minutes. New snapshots are created every 20 minutes, or when the cached deltas reach a size of 1000 changes.

{
  "format": "pg",
  "jdbc_url": "jdbc:oracle:thin:@mydatabaseserver:1521/dbName",
  "username": "username",
  "password": "password",
  "name": "my_graph",

  "loading_options": {
    "auto_refresh": true,
    "fetch_interval_sec": 300,
    "update_interval_sec": 1200,
    "update_threshold": 1000,
    "create_edge_id_index": true,
    "create_edge_id_mapping": true
  }
}
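The requirements from the previous sections can be checked before a configuration is handed to PGX. The sketch below is a plain-Python sanity check, not a PGX API: check_delta_refresh_options is a hypothetical helper that takes the refresh settings as a dictionary and reports which of the rules described in this guide are not met.

```python
def check_delta_refresh_options(options):
    """Return a list of problems found in the delta-refresh settings."""
    problems = []
    if options.get("auto_refresh") is not True:
        problems.append("auto_refresh must be true")
    # The fetch interval must always be given.
    if "fetch_interval_sec" not in options:
        problems.append("fetch_interval_sec must not be omitted")
    # At least one snapshot trigger is required.
    if "update_interval_sec" not in options and "update_threshold" not in options:
        problems.append("set update_interval_sec, update_threshold, or both")
    # Edge IDs must be retained for delta-refresh.
    for flag in ("create_edge_id_index", "create_edge_id_mapping"):
        if options.get(flag) is not True:
            problems.append(flag + " must be true")
    return problems


# Settings copied from the example configuration above.
example_options = {
    "auto_refresh": True,
    "fetch_interval_sec": 300,
    "update_interval_sec": 1200,
    "update_threshold": 1000,
    "create_edge_id_index": True,
    "create_edge_id_mapping": True,
}
print(check_delta_refresh_options(example_options))
```

An empty list means that none of the rules checked here is violated; it does not guarantee that PGX accepts the configuration.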