8.4 Using the Pg2vec Algorithm

Pg2vec learns representations of graphlets (partitions inside a graph) by employing edges, rather than vertices, as the principal learning units. Because each edge combines the properties of two vertices, this packs more information into each learning unit for the representation learning task.

It consists of three main steps (a toy sketch of the first two steps follows the list):

  1. Random walks are generated for each vertex (with a pre-defined length per walk and a pre-defined number of walks per vertex).
  2. Each edge in these random walks is mapped to a property.edge-word in the created document (with the graph-id as the document label), where the property.edge-word is the concatenation of the properties of the source and destination vertices.
  3. The generated documents (with their attached document labels) are fed to a doc2vec algorithm, which generates a vector representation for each document; each document here corresponds to one graph.
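
To make the pipeline concrete, the following is a minimal, self-contained Java sketch of steps 1 and 2 on a toy graphlet. Everything in it (the adjacency lists, the single vertex property, the "_" separator, and the parameter values) is assumed for illustration only and is not taken from the PGX implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class Pg2vecDocumentSketch {
        public static void main(String[] args) {
            // Toy graphlet: adjacency lists plus one vertex property.
            Map<Integer, List<Integer>> adjacency = Map.of(
                    0, List.of(1, 2),
                    1, List.of(0, 2),
                    2, List.of(0, 1));
            Map<Integer, String> vertexProperty = Map.of(0, "A", 1, "B", 2, "A");

            int walksPerVertex = 2;   // pre-defined number of walks per vertex
            int walkLength = 4;       // pre-defined length per walk
            Random random = new Random(0);

            // Steps 1 and 2: generate random walks and map each traversed
            // edge to a property.edge-word (concatenated endpoint properties).
            List<String> document = new ArrayList<>();
            for (int start : adjacency.keySet()) {
                for (int walk = 0; walk < walksPerVertex; walk++) {
                    int current = start;
                    for (int step = 0; step < walkLength; step++) {
                        List<Integer> neighbors = adjacency.get(current);
                        if (neighbors == null || neighbors.isEmpty()) {
                            break;
                        }
                        int next = neighbors.get(random.nextInt(neighbors.size()));
                        document.add(vertexProperty.get(current) + "_" + vertexProperty.get(next));
                        current = next;
                    }
                }
            }
            // Step 3 (not shown): documents labeled with their graph-id are
            // fed to doc2vec, yielding one vector per document.
            System.out.println("graph-id 42 -> " + document);
        }
    }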

Pg2vec creates graphlet embeddings for a specific set of graphlets, and the resulting model cannot be updated to incorporate modifications to these graphlets. Instead, a new Pg2vec model should be trained on the modified graphlets.

The following represents the memory consumption of the Pg2vec model:
O(2(n+m)*d)
where:
  • n: the number of vertices in the graph
  • m: the number of graphlets in the graph
  • d: the embedding length
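
As a purely illustrative example (all figures assumed, not taken from the NCI109 dataset below): with n = 100,000 vertices, m = 4,000 graphlets, and embedding length d = 200, the model holds about 2 × (100,000 + 4,000) × 200 = 41,600,000 values, which is roughly 166 MB when stored as 4-byte floats.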

The following describes the usage of the main functionalities of the Pg2vec implementation in PGX, using the NCI109 dataset (which contains 4127 graphs) as an example:
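
Before the detailed walkthrough, here is a hedged end-to-end sketch using the PGX embedded Java API. The builder methods (pg2vecModelBuilder, setGraphLetIdPropertyName, and so on) follow the Pg2vec API in PGX, but the dataset path, the property names, and all hyperparameter values below are placeholders; verify the exact method names and sensible values against the installed PGX version.

    import java.util.Arrays;

    import oracle.pgx.api.Analyst;
    import oracle.pgx.api.Pgx;
    import oracle.pgx.api.PgxGraph;
    import oracle.pgx.api.PgxSession;
    import oracle.pgx.api.frames.PgxFrame;
    import oracle.pgx.api.mllib.Pg2vecModel;

    public class Pg2vecNci109Example {
        public static void main(String[] args) throws Exception {
            try (PgxSession session = Pgx.createSession("pg2vec-example")) {
                Analyst analyst = session.createAnalyst();

                // Load the NCI109 graphs (path and config name are placeholders).
                PgxGraph graph = session.readGraphWithProperties("<path>/NCI109.json");

                // Build the model; hyperparameter values are illustrative only.
                Pg2vecModel model = analyst.pg2vecModelBuilder()
                        .setGraphLetIdPropertyName("graph_id")
                        .setVertexPropertyNames(Arrays.asList("category"))
                        .setWalksPerVertex(5)
                        .setWalkLength(8)
                        .setWindowSize(4)
                        .setLayerSize(200)
                        .setLearningRate(0.04)
                        .setNumEpochs(5)
                        .build();

                // Train on the loaded graphs.
                model.fit(graph);

                // Retrieve the per-graphlet embeddings, then query, for
                // example, the 10 graphlets most similar to graphlet 52.
                PgxFrame vectors = model.getTrainedGraphletVectors();
                PgxFrame similars = model.computeSimilars(52, 10);
                similars.print();
            }
        }
    }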