PGX supports the Hadoop Distributed File System (HDFS). In this tutorial you will learn how to load and store graph data from and to HDFS via the PGX APIs. PGX Hadoop support is designed to work with any Cloudera CDH 5.x- or 6.x-compatible Hadoop cluster.
Conceptually, you have to

- make sure $HADOOP_CONF_DIR is on the classpath, so that your Hadoop configuration is found (the run command at the end of this tutorial shows this)
- use hdfs: as the path prefix in the uri graph configuration field when referring to files located in HDFS

Graph configuration files are parsed client-side: in PGX client/server mode you also need to have Hadoop available on the client side ($HADOOP_CONF_DIR set) if not only the graph data but also the graph configuration files are located in HDFS. This is because the configuration files are parsed on the client side before they are sent to the server.
Let's assume that the connections.edge_list graph data and its configuration file from the load-a-graph tutorial are stored in HDFS instead of the local file system. First, copy the graph data into HDFS:
cd $PGX_HOME
hadoop fs -mkdir -p /user/pgx
hadoop fs -copyFromLocal examples/graphs/connections.edge_list /user/pgx
Next, edit the uri field of the sample graph configuration file to point to the newly created HDFS resource:
{ "uri": "hdfs:/user/pgx/connections.edge_list", "format": "adj_list", "vertex_props": [{ "name": "prop", "type": "integer" }], "edge_props": [{ "name": "cost", "type": "double" }], "separator": " " }
Copy the configuration file into HDFS as well:
cd $PGX_HOME
hadoop fs -copyFromLocal examples/graphs/connections.edge_list.json /user/pgx
To load the sample graph from HDFS into PGX, do:

JShell:

var g = session.readGraphWithProperties("hdfs:/user/pgx/connections.edge_list.json")

Java:

import oracle.pgx.api.*;
...
PgxGraph g = session.readGraphWithProperties("hdfs:/user/pgx/connections.edge_list.json");

Python:

g = session.read_graph_with_properties("hdfs:/user/pgx/connections.edge_list.json")
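If keeping a separate JSON configuration file in HDFS is not desirable, the same configuration can also be built programmatically. The following is a minimal Java sketch, assuming the standard GraphConfigBuilder API and an existing PgxSession named session; the hdfs: prefix is used in the uri exactly as in the JSON file:

import oracle.pgx.api.*;
import oracle.pgx.common.types.PropertyType;
import oracle.pgx.config.Format;
import oracle.pgx.config.GraphConfig;
import oracle.pgx.config.GraphConfigBuilder;

// Programmatic equivalent of connections.edge_list.json;
// the hdfs: prefix in the URI works the same as in the JSON file.
GraphConfig config = GraphConfigBuilder.forFileFormat(Format.ADJ_LIST)
    .setUri("hdfs:/user/pgx/connections.edge_list")
    .addVertexProperty("prop", PropertyType.INTEGER)
    .addEdgeProperty("cost", PropertyType.DOUBLE)
    .setSeparator(" ")
    .build();

PgxGraph g = session.readGraphWithProperties(config);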
Let's store our loaded sample graph back into HDFS in PGB format.
JShell:

var config = g.store(Format.PGB, "hdfs:/user/pgx/connections.pgb")

Java:

import oracle.pgx.api.*;
import oracle.pgx.config.*;

GraphConfig pgbGraphConfig = g.store(Format.PGB, "hdfs:/user/pgx/connections.pgb");

Python:

config = g.store("pgb", "hdfs:/user/pgx/connections.pgb")
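Note that store returns the graph configuration of the newly written PGB data, so the stored graph can be reloaded directly from the returned configuration, as the full example class at the end of this tutorial also does:

import oracle.pgx.api.*;
import oracle.pgx.config.GraphConfig;

// pgbGraphConfig was returned by g.store(...) above
PgxGraph g2 = session.readGraphWithProperties(pgbGraphConfig);
System.out.println("N = " + g2.getNumVertices() + " E = " + g2.getNumEdges());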
Verify that the PGB file was created:
hadoop fs -ls /user/pgx
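Because $HADOOP_CONF_DIR is on the classpath, the same check can also be done programmatically through the plain Hadoop FileSystem API. This is an illustrative sketch, not part of the PGX API, and assumes the Hadoop client libraries shipped under third-party/ are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPgxDir {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from $HADOOP_CONF_DIR on the classpath
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus status : fs.listStatus(new Path("/user/pgx"))) {
        System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
      }
    }
  }
}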
PGX supports compilation of Green-Marl code stored in HDFS. Example:
JShell:

var p = session.compileProgram("hdfs:/user/pgx/max_degree.gm")

Java:

import oracle.pgx.api.*;

CompiledProgram p = session.compileProgram("hdfs:/user/pgx/max_degree.gm");

Python:

p = session.compile_program("hdfs:/user/pgx/max_degree.gm")
As with graph configuration files, the Green-Marl code is read from HDFS client-side if running in client/server mode.
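Once compiled, the program can be executed against a loaded graph. The following is a hypothetical sketch: the argument list assumes max_degree.gm declares a procedure taking a single graph parameter, which this tutorial does not specify:

import oracle.pgx.api.*;

PgxGraph g = session.readGraphWithProperties("hdfs:/user/pgx/connections.edge_list.json");
CompiledProgram p = session.compileProgram("hdfs:/user/pgx/max_degree.gm");

// Hypothetical invocation: the actual arguments must match the
// procedure signature declared in max_degree.gm.
p.run(g);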
Here is the full Java class combining the above examples:
import oracle.pgx.api.CompiledProgram;
import oracle.pgx.api.Pgx;
import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;
import oracle.pgx.config.Format;
import oracle.pgx.config.GraphConfig;

public class HdfsExample {
  public static void main(String[] mainArgs) throws Exception {
    PgxSession session = Pgx.createSession("my-session");
    PgxGraph g1 = session.readGraphWithProperties("hdfs:/user/pgx/connections.edge_list.json");
    GraphConfig pgbConfig = g1.store(Format.PGB, "hdfs:/user/pgx/sample.pgb");
    PgxGraph g2 = session.readGraphWithProperties(pgbConfig);

    System.out.println("g1 N = " + g1.getNumVertices() + " E = " + g1.getNumEdges());
    System.out.println("g2 N = " + g2.getNumVertices() + " E = " + g2.getNumEdges());

    CompiledProgram p = session.compileProgram("hdfs:/user/pgx/max_degree.gm");
    System.out.println("compiled " + p.getName());
  }
}
To compile the above class, do:

cd $PGX_HOME
mkdir classes
javac -cp lib/common/*:lib/embedded/*:third-party/* examples/java/HdfsExample.java -d classes
To run it, do:
java -cp lib/common/*:lib/embedded/*:third-party/*:classes:conf:$HADOOP_CONF_DIR HdfsExample