Graph Optimized for Updates vs. Graph Optimized for Reads

PGX can store a graph in a representation tuned either for read performance or for update performance (updates being applied through graph changesets). This page discusses the differences and trade-offs between these two storage representations and how to select one or the other. It also provides guidance on how to optimize a graph schema for faster updates.

Graphs optimized for updates

Not all PGX features and APIs are available for graphs optimized for updates, and such graphs are not supported in distributed execution mode. Please refer to the "Unsupported Features and Limitations" section below for the current limitations.

Differences and Trade-offs

Graphs Optimized for Reads

Graphs optimized for reads provide the best performance for graph analytics and PGQL queries. The downside is potentially higher latency when updating the graph (adding or removing vertices and edges, or updating the property values of existing vertices or edges through a graph changeset) and higher memory consumption. With graphs optimized for reads, each updated graph or graph snapshot consumes memory proportional to the size of the whole graph in terms of vertices and edges.

Graphs Optimized for Updates

Graphs optimized for updates leverage a representation that enables low-latency graph updates. With this representation, PGX can reach millisecond-scale latencies when updating graphs with millions of vertices and edges (this is indicative and will vary depending on the hardware configuration).

To achieve such fast update operations, PGX avoids, as much as possible, fully duplicating the previous graph (snapshot) when creating a new graph (snapshot). In typical scenarios this also reduces memory consumption: new snapshots (or new graphs) only consume additional memory roughly proportional to the size of the changes applied.

However, faster updates and lower memory consumption currently come at the cost of lower performance for graph analytics and PGQL queries.
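
For illustration, the following is a minimal Java sketch of updating such a graph through a changeset. It assumes a graph with integer vertex keys and properties named balance and amount (as in the examples later on this page); the exact changeset methods used here (createChangeSet, addVertex, addEdge, setProperty, buildNewSnapshot) should be verified against the API reference of your PGX version:

import oracle.pgx.api.GraphChangeSet;
import oracle.pgx.api.PgxGraph;

public class ChangeSetSketch {
    // Applies a small batch of changes and returns the resulting snapshot.
    public static PgxGraph applyUpdates(PgxGraph graph) throws Exception {
        GraphChangeSet<Integer> changeSet = graph.createChangeSet();

        // Add a new vertex (the key 1000 is a made-up example) and set a property.
        changeSet.addVertex(1000).setProperty("balance", 250L);

        // Add a new edge between two existing vertices (keys are examples).
        changeSet.addEdge(42, 1000).setProperty("amount", 25);

        // Build a new snapshot; with a graph optimized for updates, the
        // unchanged parts of the previous snapshot are not duplicated.
        return changeSet.buildNewSnapshot();
    }
}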

Selecting the Graph Storage Representation

The choice between a graph optimized for reads and a graph optimized for updates is made when reading the graph, by setting the optimized_for flag in the graph configuration to either reads or updates.

Both partitioned graphs and homogeneous graphs can use the representation optimized for updates.

Partitioned Graphs

The following is an example configuration for a partitioned graph with several providers that uses the representation optimized for updates:

{
    "name": "transaction_graph",
    "optimized_for": "updates",
    "vertex_id_strategy" : "keys_as_ids",
    "edge_id_strategy" : "keys_as_ids",
    "vertex_id_type" : "integer",
    "edge_id_type" : "long",
    "vertex_providers": [{
        "format": "csv",
        "name": "account",
        "key_type":"integer",
        "props": [
          {"name": "balance", "type": "long"}
        ],
        "uris": ["accounts.csv"]
    }],
    "edge_providers": [{
        "name": "transaction",
        "source_vertex_provider": "account",
        "destination_vertex_provider": "account",
        "props": [
          {"name": "amount", "type": "integer"},
          {"name": "date", "type": "local_date"}
        ],
        "uris": ["transactions.csv"],
        "format": "csv"
    }]
}

For more information about partitioned graph configurations in general, please refer to the documentation on the partitioned graph configuration APIs and the partitioned graph configuration examples.
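
Once such a configuration file is available, the graph can be read into PGX as usual; it will then use the update-optimized representation. The following is a minimal Java sketch (the file name transaction_graph.json and the session name are assumptions):

import oracle.pgx.api.Pgx;
import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;

public class LoadUpdatableGraph {
    public static void main(String[] args) throws Exception {
        // Create a session and read the graph using the configuration above,
        // which sets "optimized_for": "updates".
        PgxSession session = Pgx.createSession("update-example");
        PgxGraph graph = session.readGraphWithProperties("transaction_graph.json");
        System.out.println("Loaded " + graph.getNumVertices() + " vertices");
        session.close();
    }
}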

Homogeneous Graphs

Homogeneous graphs can also use the representation optimized for updates, as in the following example:

{
    "format": "csv",
    "optimized_for" : "updates",
    "vertex_uris" : ["accounts.csv"],
    "vertex_props":[
      {"name": "balance", "type": "long"}
    ],
    "edge_uris" : ["transactions.csv"],
    "edge_props":[
      {"name": "amount", "type": "integer"},
      {"name": "date", "type": "local_date"}
    ]
}

For more information about homogeneous graph configurations in general, please refer to the documentation on the graph configuration APIs and the graph configuration examples.
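
The configuration can also be parsed explicitly before reading the graph, which can be convenient when the configuration is produced programmatically. The following is a hedged sketch; the file name is an assumption, and GraphConfigFactory.forAnyFormat() is assumed to be the appropriate factory in your PGX version:

import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;
import oracle.pgx.config.GraphConfig;
import oracle.pgx.config.GraphConfigFactory;

public class LoadHomogeneousGraph {
    // Parses the JSON configuration first, then reads the graph with it.
    public static PgxGraph load(PgxSession session) throws Exception {
        GraphConfig config = GraphConfigFactory.forAnyFormat()
            .fromFilePath("homogeneous_transactions.json");
        return session.readGraphWithProperties(config);
    }
}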

Optimizing the Graph Schema for Faster Updates

Even though graphs optimized for updates can make graph updates much faster, the graph schema itself can have a big impact on update performance. We provide here a collection of advice that can help improve the performance of updates.

Partitioning the Vertices and Edges by Insertion Order

Most, if not all, applications requiring low-latency updates have a time aspect (timestamped data or simply insertion time) for the vertices and edges being inserted or updated, and this can be leveraged for faster updates: partitioning the vertices or edges by insertion time into several vertex or edge providers keeps the updates localized in a few providers, which reduces the work necessary to build a new graph or graph snapshot.

The following is a potential graph configuration for a banking use case. In this graph, there are persons that have accounts, and transactions can be made between accounts. Persons and accounts are represented by vertices, while transactions are edges.

{
  "name": "transaction_graph",
  "optimized_for": "updates",
  "vertex_id_strategy" : "keys_as_ids",
  "edge_id_strategy" : "keys_as_ids",
  "vertex_id_type" : "integer",
  "edge_id_type" : "long",
  "vertex_providers": [
    {
      "format": "csv",
      "name": "customer",
      "key_type":"integer",
      "props": [
        {
          "name": "name",
          "type": "string"
        },
        {
          "name": "date_of_birth",
          "type": "local_date"
        }
      ],
      "separator": " ",
      "uris": ["customers.csv"]
    },
    {
      "format": "csv",
      "name": "account",
      "key_type":"integer",
      "props": [
        {
          "name": "balance",
          "type": "long"
        }
      ],
      "uris": ["accounts.csv"]
    }
    ],
    "edge_providers": [
        {
            "name": "has_account",
            "source_vertex_provider": "customer",
            "destination_vertex_provider": "account",
            "props": [],
            "uris": ["customer_has_account.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        },
        {
            "name": "transaction",
            "source_vertex_provider": "account",
            "destination_vertex_provider": "account",
            "props": [
              {
                "name": "amount",
                "type": "integer"
              },{
                "name": "date",
                "type": "local_date"
              }
            ],
            "uris": ["transactions.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        }
    ],
    "error_handling":{ "on_missing_vertex" :  "ignore_edge"}
}

In such a use case, we can expect that transactions should be added to the graph as quickly as possible so that they can be validated. To improve the speed at which transaction edges can be added to the graph, we can partition the transaction edges so that insertions happen in a single, specific edge provider for the current time period. There are two approaches to achieve this:

  • Modulus-partitioning, where a fixed number of partitions covers a recurring period. For example, partitions corresponding to quarters can cover several years: each partition stores the edges of one quarter, but across the years 2020, 2019, 2018, and so on.
  • The "one active provider and several archive providers" approach, where insertions happen in one specific provider and new providers are created as the active provider grows or as time passes. For example, there could be a partition for the first quarter of 2020, one for the fourth quarter of 2019, and so on. In this example, insertions would only happen in the provider for the current time period (the first quarter of 2020), and the archive partitions correspond to the other providers. Archive partitions can be deleted by removing entire providers, for example by removing the provider holding the edges for the transactions that happened in the first quarter of 2013.

Each approach is presented in more detail below.

Modulus-partitioning of Providers

In the previous banking example, we can use a partitioning scheme that leverages a yearly periodicity, for example by partitioning by quarter. In that case, on a given day, edges are inserted into the provider for the current quarter; if edges are removed after a retention period that is a multiple of a year, those removals happen in that same edge provider.

The following picture illustrates this idea:

Figure: Modulus-partitioning of providers
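
As a small illustration of the scheme, the edge provider that should receive a new transaction can be derived from the transaction date alone with simple modulus arithmetic; the provider names transaction_q1 to transaction_q4 match the configuration shown below:

import java.time.LocalDate;

public class QuarterPartitioning {
    // Returns the name of the edge provider that should receive a transaction
    // made on the given date, under a quarter-based modulus partitioning.
    static String targetProvider(LocalDate transactionDate) {
        int quarter = (transactionDate.getMonthValue() - 1) / 3 + 1; // 1..4
        return "transaction_q" + quarter;
    }

    public static void main(String[] args) {
        System.out.println(targetProvider(LocalDate.of(2020, 2, 14))); // transaction_q1
        System.out.println(targetProvider(LocalDate.of(2019, 11, 3))); // transaction_q4
    }
}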

The graph configuration to achieve that would look as follows:

{
  "name": "transaction_graph",
  "optimized_for": "updates",
  "vertex_id_strategy" : "keys_as_ids",
  "edge_id_strategy" : "keys_as_ids",
  "vertex_id_type" : "integer",
  "edge_id_type" : "long",
  "vertex_providers": [
    {
      "format": "csv",
      "name": "customer",
      "key_type":"integer",
      "props": [
        {
          "name": "name",
          "type": "string"
        },
        {
          "name": "date_of_birth",
          "type": "local_date"
        }
      ],
      "separator": " ",
      "uris": ["customers.csv"]
    },
    {
      "format": "csv",
      "name": "account",
      "key_type":"integer",
      "props": [
        {
          "name": "balance",
          "type": "long"
        }
      ],
      "uris": ["accounts.csv"]
    }
    ],
    "edge_providers": [
        {
            "name": "has_account",
            "source_vertex_provider": "customer",
            "destination_vertex_provider": "account",
            "props": [],
            "uris": ["customer_has_account.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        },
        {
            "name": "transaction_q1",
            "source_vertex_provider": "account",
            "destination_vertex_provider": "account",
            "props": [
              {
                "name": "amount",
                "type": "integer"
              },{
                "name": "date",
                "type": "local_date"
              }
            ],
            "uris": ["transactions_a1.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        },
        {
            "name": "transaction_q2",
            "source_vertex_provider": "account",
            "destination_vertex_provider": "account",
            "props": [
              {
                "name": "amount",
                "type": "integer"
              },{
                "name": "date",
                "type": "local_date"
              }
            ],
            "uris": ["transactions_q2.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        },
        {
            "name": "transaction_q3",
            "source_vertex_provider": "account",
            "destination_vertex_provider": "account",
            "props": [
              {
                "name": "amount",
                "type": "integer"
              },{
                "name": "date",
                "type": "local_date"
              }
            ],
            "uris": ["transactions_q3.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        },
        {
            "name": "transaction_q4",
            "source_vertex_provider": "account",
            "destination_vertex_provider": "account",
            "props": [
              {
                "name": "amount",
                "type": "integer"
              },{
                "name": "date",
                "type": "local_date"
              }
            ],
            "uris": ["transactions_q4.csv"],
            "format": "csv",
            "loading" : {
              "create_key_mapping" : true
            }
        }
    ],
    "error_handling":{ "on_missing_vertex" :  "ignore_edge"}
}

One Active Provider and Several Archive Providers

Another approach is to perform insertions into an "active provider" until that provider reaches either a certain size or a maximal age (for example, after data has been inserted into it for at most a day). Then a new provider is added and becomes the new active provider. The archive providers are all the providers other than the active one. In typical use cases, old archive providers may be removed because of data-retention policies, because the data has expired, or simply to save memory.

The following image illustrates that method:

Figure: Insertion in active partition

Old data can be removed in bulk when desired by removing entire providers from the graph (for example by using the alterGraph mutation), as shown in the following illustration:

Figure: Bulk deletion by removing provider
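
As a rough sketch of such a bulk removal through the graph alteration API, assuming an archive provider named transaction_2013_q1, that alterGraph is exposed on PgxGraph, and that the alteration builder exposes a removeEdgeProvider method (check the API reference of your PGX version for the exact method names):

import oracle.pgx.api.PgxGraph;

public class ArchiveCleanup {
    // Removes the entire edge provider holding the transactions of Q1 2013.
    // The builder methods used here (removeEdgeProvider, build) are assumptions.
    public static PgxGraph dropOldTransactions(PgxGraph graph) throws Exception {
        return graph.alterGraph()
                .removeEdgeProvider("transaction_2013_q1")
                .build();
    }
}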

Optimizing the Critical Vertex and Edge Providers

For use cases requiring fast graph updates, it is important that the vertices and edges which are inserted, modified, or removed at high frequency have as few properties as possible. Doing so minimizes the memory that has to be allocated and modified during updates. Avoiding properties of type STRING on these frequently updated providers can also be important for maximal performance.

Unsupported Features and Limitations

The following is a (potentially non-exhaustive) list of notable features that are currently not supported for graphs optimized for updates, together with current limitations:

  • Distributed runtime support (PGX.D) is not implemented
  • Properties of type TIME_WITH_TIMEZONE, TIMESTAMP_WITH_TIMEZONE, and POINT2D are not supported for graphs optimized for updates
  • Analyst algorithms and custom Green-Marl/PGX algorithms may have reduced performance
  • PGQL queries may have reduced performance