PGX 20.1.1
Documentation

Loading Graph Data in Parallel

In this tutorial, you will learn how to load a graph in parallel from multiple files. We will split the graph data into several files, each holding a partition of the graph, and create a graph configuration that lists all of the file URIs so that PGX loads the graph data in parallel. This tutorial focuses on the flat file format. For more information about graph file formats, refer to the file format reference; in particular, CSV files can be loaded in parallel in the same way.

For this tutorial we will consider the following file contents. The first seven lines form the vertex file and the remaining six lines form the edge file.

1,Color,1,red,,
2,Color,1,yellow,,
3,Color,1,blue,,
4,Color,1,green,,
5,Color,1,orange,,
6,Color,1,white,,
7,Color,1,black,,
1,1,2,edge1,Weight,4,,1.0,
2,2,3,edge2,Weight,4,,2.0,
3,3,4,edge3,Weight,4,,3.0,
4,4,5,edge4,Weight,4,,4.0,
5,5,6,edge5,Weight,4,,5.0,
6,6,7,edge6,Weight,4,,6.0,
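
To make the listing above easier to read, the following minimal sketch splits one vertex line and one edge line into fields. The field layout used here is an assumption inferred from the sample data (vertex: ID, property key, type code, string value, numeric value, date value; edge: ID, source, destination, label, property key, type code, string value, numeric value, date value); check the file format reference before relying on it. The class and method names are hypothetical and not part of PGX.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: reads the sample lines under an assumed field layout. */
class FlatFileLineSketch {

    // Splits one line on the separator, keeping empty trailing fields.
    static List<String> fields(String line, String separator) {
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    // Assumed vertex layout: id, key, type, string value, numeric value, date value.
    static String describeVertex(String line) {
        List<String> f = fields(line, ",");
        return "vertex " + f.get(0) + ": " + f.get(1) + "=" + f.get(3);
    }

    // Assumed edge layout: id, source, destination, label, key, type,
    // string value, numeric value, date value.
    static String describeEdge(String line) {
        List<String> f = fields(line, ",");
        return "edge " + f.get(0) + " (" + f.get(1) + " -> " + f.get(2) + "): "
                + f.get(4) + "=" + Double.parseDouble(f.get(7));
    }

    public static void main(String[] args) {
        System.out.println(describeVertex("1,Color,1,red,,"));
        System.out.println(describeEdge("1,1,2,edge1,Weight,4,,1.0,"));
    }
}
```

In the sample data, the string value of Color sits in the fourth vertex field and the double value of Weight sits in the eighth edge field, which is why the sketch reads those positions.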

Splitting the Files into Multiple File Partitions

To let PGX load the data in parallel, we must split the files into multiple partitions. We will split the vertex file into four partitions and the edge file into two, to demonstrate how PGX handles different numbers of partitions. Concatenated in order, the partitions contain the same lines as the original files:

1,Color,1,red,,
2,Color,1,yellow,,
3,Color,1,blue,,
4,Color,1,green,,
5,Color,1,orange,,
6,Color,1,white,,
7,Color,1,black,,
1,1,2,edge1,Weight,4,,1.0,
2,2,3,edge2,Weight,4,,2.0,
3,3,4,edge3,Weight,4,,3.0,
4,4,5,edge4,Weight,4,,4.0,
5,5,6,edge5,Weight,4,,5.0,
6,6,7,edge6,Weight,4,,6.0,
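
One possible way to produce such partitions is sketched below. It cuts a list of lines into contiguous chunks whose sizes differ by at most one. This particular split is an assumption; the tutorial does not prescribe how lines are assigned to partitions. The sketch also assumes that each vertex or edge occupies exactly one line, as in the sample data. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits lines into a fixed number of contiguous partitions. */
class PartitionSketch {

    // Returns `parts` chunks; the first (size % parts) chunks get one extra line.
    static List<List<String>> split(List<String> lines, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        int base = lines.size() / parts;
        int extra = lines.size() % parts;
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < extra ? 1 : 0);
            result.add(new ArrayList<>(lines.subList(start, start + size)));
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> vertices = List.of(
                "1,Color,1,red,,", "2,Color,1,yellow,,", "3,Color,1,blue,,",
                "4,Color,1,green,,", "5,Color,1,orange,,", "6,Color,1,white,,",
                "7,Color,1,black,,");
        List<List<String>> partitions = split(vertices, 4);
        for (int i = 0; i < partitions.size(); i++) {
            // Each chunk would be written to its own file, e.g. vertex_file1.
            System.out.println("vertex_file" + (i + 1) + " <- " + partitions.get(i));
        }
    }
}
```

With seven vertex lines and four partitions, this yields chunks of two, two, two, and one line; the six edge lines split into two chunks of three.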

Create a Graph Configuration with Multiple File Partitions

Now we will create the graph configuration. Since we are using the flat file format, we set the format to flat_file and specify one list of URIs for vertices and one for edges. We also need to set the proper separator for the files and declare the corresponding properties. In this tutorial we split the vertex file into four partitions and the edge file into two, and we want to load all of this data into the same graph. To do so, we have to specify all of the URIs in the graph configuration. For more configuration options, have a look at the graph config reference.

{
  "format": "flat_file",
  "vertex_uris": ["vertex_file1", "vertex_file2", "vertex_file3", "vertex_file4"],
  "edge_uris": ["edge_file1", "edge_file2"],
  "separator": ",",
  "edge_props": [
    {
      "name": "Weight",
      "type": "double"
    }
  ],
  "vertex_props": [
    {
      "name": "Color",
      "type": "string"
    }
  ]
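}

The same configuration can be built with the Java API:

import oracle.pgx.common.types.PropertyType;
import oracle.pgx.config.FileGraphConfig;
import oracle.pgx.config.Format;
import oracle.pgx.config.GraphConfigBuilder;

// One URI per partition: four vertex files and two edge files.
FileGraphConfig config = GraphConfigBuilder
   .forMultipleFileFormat(Format.FLAT_FILE)
   .setSeparator(",")
   .addVertexUri("vertex_file1")
   .addVertexUri("vertex_file2")
   .addVertexUri("vertex_file3")
   .addVertexUri("vertex_file4")
   .addEdgeUri("edge_file1")
   .addEdgeUri("edge_file2")
   .addVertexProperty("Color", PropertyType.STRING)
   .addEdgeProperty("Weight", PropertyType.DOUBLE)
   .build();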

Note that we also added one double edge property named "Weight" and one string vertex property named "Color".
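
When the number of partitions grows, listing every URI by hand becomes tedious. The following sketch, which uses only the Java standard library, generates numbered file names and renders them as JSON arrays suitable for the vertex_uris and edge_uris entries. The naming scheme (prefix plus a 1-based index) simply mirrors the file names used in this tutorial, and the helper names are hypothetical. It assumes the names contain no characters that need JSON escaping.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Illustrative only: builds the URI lists for a multi-file graph config. */
class UriListSketch {

    // Produces prefix1, prefix2, ..., prefixN.
    static List<String> numbered(String prefix, int count) {
        return IntStream.rangeClosed(1, count)
                .mapToObj(i -> prefix + i)
                .collect(Collectors.toList());
    }

    // Renders the values as a JSON array of strings (no escaping performed).
    static String jsonArray(List<String> values) {
        return values.stream()
                .map(v -> "\"" + v + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println("\"vertex_uris\": " + jsonArray(numbered("vertex_file", 4)) + ",");
        System.out.println("\"edge_uris\": " + jsonArray(numbered("edge_file", 2)) + ",");
    }
}
```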

Parallel Loading

With the graph files and the graph config, we can load the graph into PGX.

When loading the graph defined above, PGX automatically loads it in parallel, using one thread per file. A graph can therefore be loaded with as many threads as there are files, subject to the parallelism configured for the PGX instance.
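
The relationship between file count and parallelism can be illustrated with a plain Java thread pool. This is not how PGX is implemented internally; it is only a sketch of the idea of one task per file, with the pool size capped by a parallelism setting. "Loading" a partition is simulated by counting its lines, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: one task per partition, bounded by a parallelism limit. */
class ParallelLoadSketch {

    // One thread per file, but never more than the configured parallelism.
    static int threadsFor(int fileCount, int parallelism) {
        return Math.max(1, Math.min(fileCount, parallelism));
    }

    // Processes each partition in its own task and sums the per-partition line counts.
    static int countLinesInParallel(List<List<String>> partitions, int parallelism) {
        ExecutorService pool =
                Executors.newFixedThreadPool(threadsFor(partitions.size(), parallelism));
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<String> partition : partitions) {
                tasks.add(partition::size); // stand-in for parsing one file
            }
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("loading failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Six files (four vertex, two edge) with a parallelism of 4 use four threads.
        System.out.println(threadsFor(6, 4));
    }
}
```

Under this model, splitting the data into more files than the configured parallelism does not add threads; the extra partitions wait for a free worker.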

Special Notes

Since the graph config applies to all of the specified files, it is crucial that all of them use the same format: the same separator, the same defined properties, the same format specification, and so on.
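
A simple pre-flight check can catch a partition that uses a different separator before it is handed to PGX. The sketch below verifies that every line splits into the expected number of fields. The expected counts used in the example (six for vertex lines, nine for edge lines) are assumptions taken from the sample data in this tutorial, and the class is hypothetical rather than part of PGX.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: checks that all lines share one separator and field count. */
class FormatCheckSketch {

    // True if every line has exactly `expected` separator-delimited fields.
    static boolean allLinesHaveFieldCount(List<String> lines, String separator, int expected) {
        String regex = Pattern.quote(separator);
        for (String line : lines) {
            // The -1 limit keeps empty trailing fields such as the ",," suffix.
            if (line.split(regex, -1).length != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> good = List.of("1,Color,1,red,,", "2,Color,1,yellow,,");
        List<String> mixed = List.of("1,Color,1,red,,", "2\tColor\t1\tyellow\t\t");
        System.out.println(allLinesHaveFieldCount(good, ",", 6));
        System.out.println(allLinesHaveFieldCount(mixed, ",", 6));
    }
}
```

The second call reports a mismatch because the tab-separated line does not split into six fields on a comma.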

The parallel loading capabilities can be used in conjunction with the Exporting to Multiple Files feature.

Feel free to try out other options, for example, by adding additional properties and by splitting files into more partitions.