PGX 20.1.1
Documentation

Pg2vec

Overview of the Algorithm

Pg2vec learns representations of graphlets (partitions inside a graph) by using edges as the principal learning units. Compared with earlier approaches that use vertices as learning units, this packs more information into each learning unit for the representation learning task. The algorithm consists of three main steps:

  1. We generate random walks for each vertex (with pre-defined length per walk and pre-defined number of walks per vertex).

  2. Each edge in a random walk is mapped to a property edge-word in the generated document (whose label is the graph-id). A property edge-word is the concatenation of the properties of the edge's source and destination vertices.

  3. We feed the generated documents (with their attached document labels) to a doc2vec algorithm which generates the vector representation for each document (which is a graph in this case).
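
To make the first two steps concrete, here is a minimal plain-Java sketch. It is not the PGX implementation: the class `EdgeWordSketch`, its methods, the `_` separator inside an edge-word, and the tiny three-vertex graphlet are all assumptions made for illustration. The doc2vec step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only: not the PGX implementation of Pg2vec.
final class EdgeWordSketch {

    // Performs one undirected random walk of at most walkLength edges from `start`.
    // `adjacency` lists the neighbours of each vertex (both directions included).
    static List<Integer> randomWalk(Map<Integer, List<Integer>> adjacency,
                                    int start, int walkLength, Random rng) {
        List<Integer> walk = new ArrayList<>();
        walk.add(start);
        int current = start;
        for (int step = 0; step < walkLength; step++) {
            List<Integer> neighbours = adjacency.getOrDefault(current, List.of());
            if (neighbours.isEmpty()) {
                break; // isolated vertex: the walk ends early
            }
            current = neighbours.get(rng.nextInt(neighbours.size()));
            walk.add(current);
        }
        return walk;
    }

    // Maps every edge of the walk to an edge-word: the source vertex property
    // concatenated with the destination vertex property ("_" is an assumed separator).
    static List<String> edgeWords(List<Integer> walk, Map<Integer, String> category) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i + 1 < walk.size(); i++) {
            words.add(category.get(walk.get(i)) + "_" + category.get(walk.get(i + 1)));
        }
        return words;
    }

    public static void main(String[] args) {
        // A hypothetical graphlet: the path 0 - 1 - 2 with a "category" property.
        Map<Integer, List<Integer>> adjacency = Map.of(
            0, List.of(1),
            1, List.of(0, 2),
            2, List.of(1));
        Map<Integer, String> category = Map.of(0, "C", 1, "N", 2, "O");

        Random rng = new Random(42);
        int walksPerVertex = 2;
        int walkLength = 3;
        List<String> document = new ArrayList<>(); // one document per graphlet
        for (int vertex : List.of(0, 1, 2)) {
            for (int w = 0; w < walksPerVertex; w++) {
                List<Integer> walk = randomWalk(adjacency, vertex, walkLength, rng);
                document.addAll(edgeWords(walk, category));
            }
        }
        System.out.println(document);
    }
}
```

The resulting list of edge-words, labeled with the graph-id, plays the role of one document in step 3.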

Pg2vec creates graphlet embeddings for a specific set of graphlets and cannot be updated to incorporate modifications to these graphlets. Instead, a new Pg2vec model should be trained on the modified graphlets. Lastly, it is important to note that the memory consumption of a Pg2vec model is O(2(n+m)*d), where n is the number of vertices in the graph, m is the number of graphlets in the graph, and d is the embedding length.
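
As a rough worked example of this bound, the sketch below counts the vector entries implied by 2(n+m)*d. The class and method names are hypothetical, the vertex count and embedding length are made-up inputs, and the conversion to bytes assumes 4-byte floats, which this page does not state. Since a big-O bound hides constant factors, the result is an order-of-magnitude estimate at best.

```java
// Rough estimate only: the big-O bound hides constant factors and other overheads.
final class Pg2vecMemoryEstimate {

    // Number of stored vector entries suggested by O(2(n+m)*d).
    static long vectorEntries(long numVertices, long numGraphlets, long embeddingLength) {
        return 2L * (numVertices + numGraphlets) * embeddingLength;
    }

    public static void main(String[] args) {
        long n = 100_000L; // hypothetical number of vertices
        long m = 4_127L;   // number of graphs in NCI109, as given on this page
        long d = 200L;     // hypothetical embedding length
        long entries = vectorEntries(n, m, d);
        long bytes = entries * Float.BYTES; // assumption: one 4-byte float per entry
        System.out.printf("%d entries, about %.1f MiB%n", entries, bytes / (1024.0 * 1024.0));
    }
}
```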

Functionalities

Here we present the main functionalities of our implementation of Pg2vec in PGX, using the NCI109 dataset (which contains 4127 graphs) as an example.

Loading a Graph

First, we create a session and an analyst:

cd $PGX_HOME
./bin/pgx-jshell
// starting the shell will create an implicit session and analyst
import oracle.pgx.api.*;
import oracle.pgx.api.beta.mllib.Pg2vecModel;
import oracle.pgx.api.beta.frames.*;

...

PgxSession session = Pgx.createSession("my-session");

Analyst analyst = session.createAnalyst();

Our Pg2vec algorithm can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

pgx> var graph = session.readGraphWithProperties("<path>/NCI109.json")
PgxGraph graph = session.readGraphWithProperties("<path>/NCI109.json");

Building a Pg2vec Model (minimal)

We can build a Pg2vec model with a minimal configuration, setting only a few parameters and leaving the remaining hyper-parameters at their default values, as follows:

pgx> var model = analyst.pg2vecModelBuilder().
         setGraphLetIdPropertyName("graph_id").
         setVertexPropertyNames(Arrays.asList("category")).
         setWindowSize(4).
         setWalksPerVertex(5).
         setWalkLength(8).
         build()
Pg2vecModel model = analyst.pg2vecModelBuilder()
    .setGraphLetIdPropertyName("graph_id")
    .setVertexPropertyNames(Arrays.asList("category"))
    .setWindowSize(4)
    .setWalksPerVertex(5)
    .setWalkLength(8)
    .build();

The Pg2vecModelBuilder#setGraphLetIdPropertyName operation specifies the property that determines each graphlet, and the Pg2vecModelBuilder#setVertexPropertyNames operation specifies the vertex properties that Pg2vec uses. Alternatively, we can use the weakly connected component (WCC) functionality in PGX to determine the graphlets in a given graph.
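
To illustrate what determining graphlets by weakly connected components means, here is a small self-contained union-find sketch. It does not use the PGX WCC functionality. The class `WccGraphletIds` and its methods are hypothetical, and using the smallest vertex index of a component as its id is a choice made for this example only.

```java
import java.util.Arrays;

// Illustrative sketch only: labels each vertex with a weakly connected component id.
final class WccGraphletIds {

    // Finds the representative of v's component, shortening paths along the way.
    static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]]; // path halving
            v = parent[v];
        }
        return v;
    }

    // Returns, for each vertex, the id of its component. Edge direction is ignored,
    // which is what makes the components "weakly" connected.
    static int[] componentIds(int numVertices, int[][] edges) {
        int[] parent = new int[numVertices];
        for (int v = 0; v < numVertices; v++) {
            parent[v] = v;
        }
        for (int[] edge : edges) {
            int a = find(parent, edge[0]);
            int b = find(parent, edge[1]);
            if (a != b) {
                parent[Math.max(a, b)] = Math.min(a, b); // the smaller root becomes the id
            }
        }
        int[] ids = new int[numVertices];
        for (int v = 0; v < numVertices; v++) {
            ids[v] = find(parent, v);
        }
        return ids;
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 1}, {1, 2}, {3, 4} };
        // Vertices 0, 1, 2 form one graphlet and vertices 3, 4 another.
        System.out.println(Arrays.toString(componentIds(5, edges))); // prints [0, 0, 0, 3, 3]
    }
}
```

A per-vertex component id of this kind could then serve as the graphlet-id property.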

Building a Pg2vec Model (customized)

We can build a Pg2vec model using customized hyper-parameters as follows:

pgx> var model = analyst.pg2vecModelBuilder().
         setGraphLetIdPropertyName("graph_id").
         setVertexPropertyNames(Arrays.asList("category")).
         setMinWordFrequency(1).
         setBatchSize(128).
         setNumEpochs(5).
         setLayerSize(200).
         setLearningRate(0.04).
         setMinLearningRate(0.0001).
         setWindowSize(4).
         setWalksPerVertex(5).
         setWalkLength(8).
         setUseGraphletSize(true).
         setValidationFraction(0.05).
         setGraphletSizePropertyName("<propertyName>").
         build()
Pg2vecModel model = analyst.pg2vecModelBuilder()
    .setGraphLetIdPropertyName("graph_id")
    .setVertexPropertyNames(Arrays.asList("category"))
    .setMinWordFrequency(1)
    .setBatchSize(128)
    .setNumEpochs(5)
    .setLayerSize(200)
    .setLearningRate(0.04)
    .setMinLearningRate(0.0001)
    .setWindowSize(4)
    .setWalksPerVertex(5)
    .setWalkLength(8)
    .setUseGraphletSize(true)
    .setGraphletSizePropertyName("<propertyName>")
    .setValidationFraction(0.05)
    .build();

We provide a complete explanation of each builder operation (along with the default values) in our Pg2vecModelBuilder javadocs.

Training the Pg2vec Model

We can train a Pg2vec model with the specified (default or customized) settings as follows:

pgx> model.fit(graph)
model.fit(graph);

Getting the Loss Value

We can fetch the loss value computed on a specified fraction of the training data (set in the builder using setValidationFraction) as follows:

pgx> var loss = model.getLoss()
double loss = model.getLoss();

Computing the Similar Graphlets

We can fetch the k most similar graphlets for a given graphlet with the following code:

pgx> var similars = model.computeSimilars(52, 10)
PgxFrame similars = model.computeSimilars(52, 10);

The output has the following format. For example, searching for the graphlets most similar to the graphlet with ID = 52 using the trained model and printing the result with similars.print() gives:

+----------------------------------+
| dstGraphlet | similarity         |
+----------------------------------+
| 52          | 1.0                |
| 10          | 0.8748674392700195 |
| 23          | 0.8551455140113831 |
| 26          | 0.8493421673774719 |
| 47          | 0.8411962985992432 |
| 25          | 0.8281504511833191 |
| 43          | 0.8202780485153198 |
| 24          | 0.8179885745048523 |
| 8           | 0.796689510345459  |
| 9           | 0.7947834134101868 |
+----------------------------------+

The visualization of two similar graphlets (top: ID = 52 and bottom: ID = 10).

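
The following sketch shows one way such a ranking can be computed from embedding vectors. It assumes cosine similarity, which this page does not confirm as the measure used by computeSimilars. The class `SimilarGraphletsSketch`, its methods, and the two-dimensional vectors are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: ranks graphlets by cosine similarity of their embeddings.
final class SimilarGraphletsSketch {

    // Cosine similarity of two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the ids of the k graphlets most similar to `query`, best first.
    static List<Integer> topK(Map<Integer, double[]> embeddings, int query, int k) {
        double[] q = embeddings.get(query);
        List<Integer> ids = new ArrayList<>(embeddings.keySet());
        // Sort by descending similarity to the query vector.
        ids.sort((x, y) -> Double.compare(cosine(q, embeddings.get(y)),
                                          cosine(q, embeddings.get(x))));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        // Hypothetical two-dimensional embeddings keyed by graphlet id.
        Map<Integer, double[]> embeddings = Map.of(
            52, new double[] {1.0, 0.0},
            10, new double[] {0.9, 0.1},
            8, new double[] {0.0, 1.0});
        System.out.println(topK(embeddings, 52, 2)); // prints [52, 10]
    }
}
```

As in the table above, the query graphlet itself ranks first with similarity 1.0.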

Computing the Similars (for a Graphlet Batch)

We can fetch the k most similar graphlets for a batch of input graphlets with the following code:

pgx> var graphlets = new ArrayList()
pgx> graphlets.add(52)
pgx> graphlets.add(41)
pgx> var batchedSimilars = model.computeSimilars(graphlets, 10)
List graphlets = Arrays.asList(52,41);
PgxFrame batchedSimilars = model.computeSimilars(graphlets, 10);

The output has the following format. For example, searching for the graphlets most similar to the graphlets with ID = 52 and ID = 41 using the trained model and printing the result with batchedSimilars.print() gives:

+------------------------------------------------+
| srcGraphlet | dstGraphlet | similarity         |
+------------------------------------------------+
| 52          | 52          | 1.0                |
| 52          | 10          | 0.8748674392700195 |
| 52          | 23          | 0.8551455140113831 |
| 52          | 26          | 0.8493421673774719 |
| 52          | 47          | 0.8411962985992432 |
| 52          | 25          | 0.8281504511833191 |
| 52          | 43          | 0.8202780485153198 |
| 52          | 24          | 0.8179885745048523 |
| 52          | 8           | 0.796689510345459  |
| 52          | 9           | 0.7947834134101868 |
| 41          | 41          | 1.0                |
| 41          | 197         | 0.9653506875038147 |
| 41          | 84          | 0.9552277326583862 |
| 41          | 157         | 0.9465565085411072 |
| 41          | 65          | 0.9287481307983398 |
| 41          | 248         | 0.9177336096763611 |
| 41          | 315         | 0.9043129086494446 |
| 41          | 92          | 0.8998928070068359 |
| 41          | 297         | 0.8897411227226257 |
| 41          | 50          | 0.8810243010520935 |
+------------------------------------------------+

Inferring a Graphlet Vector

We can infer the vector representation for a given new graphlet with the following code:

pgx> var graphlet = session.
         readGraphWithProperties("<path>/<graphletConfig.json>")
pgx> var inferredVector = model.inferGraphletVector(graphlet)
pgx> inferredVector.print()
PgxGraph graphlet = session
    .readGraphWithProperties("<path>/<graphletConfig.json>");
PgxFrame inferredVector = model.inferGraphletVector(graphlet);
inferredVector.print();

The schema for the inferredVector would be as follows:

+---------------------------------------------------------------+
| graphlet                                | embedding           |
+---------------------------------------------------------------+

Inferring Vectors (for a Graphlet Batch)

We can infer the vector representations for multiple graphlets (specified with different graph-ids in a graph) with the following code:

pgx> var graphlets = session.
         readGraphWithProperties("<path>/<graphletConfig.json>")
pgx> var inferredVectorBatched = model.inferGraphletVectorBatched(graphlets)
pgx> inferredVectorBatched.print()
PgxGraph graphlets = session
    .readGraphWithProperties("<path>/<graphletConfig.json>");
PgxFrame inferredVectorBatched = model.inferGraphletVectorBatched(graphlets);
inferredVectorBatched.print();

The schema is the same as for inferGraphletVector but with more rows corresponding to the input graphlets.

Getting All Trained Graphlet Vectors

We can retrieve the trained graphlet vectors for the current Pg2vec model as follows:

pgx> var graphletVectors = model.getTrainedGraphletVectors().flattenAll()
pgx> graphletVectors.write().
    overwrite(true).
    csv().
    separator('\t').
    store("<path>/graphlet_vectors.tsv")
PgxFrame graphletVectors = model.getTrainedGraphletVectors().flattenAll();
graphletVectors.write()
    .overwrite(true)
    .csv()
    .separator('\t')
    .store("<path>/graphlet_vectors.tsv");

The schema is the same as for inferGraphletVector but with more rows corresponding to all the graphlets in the input graph.
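
As a sketch of how the stored file could be consumed outside PGX, the code below parses one tab-separated row into a graphlet id and its embedding. The row layout (the id first, then one numeric field per embedding component, no quoting) is an assumption about the flattened output and is not confirmed here. The class `GraphletVectorRow` is hypothetical, and any header line would need to be skipped by the caller.

```java
import java.util.Arrays;

// Illustrative sketch only: one parsed row of a tab-separated graphlet vector file.
final class GraphletVectorRow {
    final String graphletId;
    final double[] embedding;

    GraphletVectorRow(String graphletId, double[] embedding) {
        this.graphletId = graphletId;
        this.embedding = embedding;
    }

    // Assumed layout: <graphlet id> TAB <component 1> TAB <component 2> ...
    static GraphletVectorRow parse(String line) {
        String[] fields = line.split("\t");
        double[] embedding = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            embedding[i - 1] = Double.parseDouble(fields[i]);
        }
        return new GraphletVectorRow(fields[0], embedding);
    }

    public static void main(String[] args) {
        GraphletVectorRow row = parse("52\t0.5\t-0.25");
        System.out.println(row.graphletId + " -> " + Arrays.toString(row.embedding));
    }
}
```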

Storing a Trained Model

A trained Pg2vec model can be stored in encrypted form at a specified path as follows:

pgx> model.store("<path>/<modelName>","encryption_key")
model.store("<path>/<modelName>","encryption_key");

Loading a Pre-trained Model

It is also possible to load a pre-trained, encrypted Pg2vec model from a specified path as follows:

pgx> var model = analyst.loadPg2vecModel("<path>/<modelName>","encryption_key")
Pg2vecModel model = analyst.loadPg2vecModel("<path>/<modelName>","encryption_key");

Destroying a Model

We can destroy a model with the following operation:

pgx> model.destroy()
model.destroy();