Pg2vec

Overview of the algorithm

Pg2vec learns representations of graphlets (partitions inside a graph) by employing edges as the principal learning units and thereby packing more information in each learning unit (as compared to previous approaches employing vertices as learning units) for the representation learning task. It consists of three main steps:

  1. We generate random walks for each vertex (with pre-defined length per walk and pre-defined number of walks per vertex).

  2. Each edge in this random walk is mapped as a property edge-word in the created document (with the document label as the graph-id) where the property edge-word is defined as the concatenation of the properties of the source and destination vertices.

  3. We feed the generated documents (with their attached document labels) to a doc2vec algorithm which generates the vector representation for each document (which is a graph in this case).
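Steps 1 and 2 can be sketched in plain Python on a toy graphlet (the doc2vec training in step 3 is performed by PGX and is omitted here). The adjacency list, the "category" property, and the walk parameters below are illustrative assumptions, not Pg2vec internals:

```python
import random

# Toy graphlet: adjacency list and a "category" property per vertex
# (all names and values here are illustrative assumptions).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
category = {0: "A", 1: "B", 2: "A", 3: "C"}

def random_walk(start, walk_length, rng):
    """Step 1: a random walk of pre-defined length from a start vertex."""
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(rng.choice(adjacency[walk[-1]]))
    return walk

def edge_words(walk):
    """Step 2: map each edge of the walk to a property edge-word,
    i.e. the concatenation of the source and destination properties."""
    return [category[s] + category[d] for s, d in zip(walk, walk[1:])]

rng = random.Random(42)
document = []  # the "document" for this graphlet; its graph-id would be the label
for vertex in adjacency:  # pre-defined number of walks per vertex (here: 1)
    document.extend(edge_words(random_walk(vertex, walk_length=4, rng=rng)))

print(document)  # a list of two-character edge-words, fed to doc2vec in step 3
```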

Pg2vec creates graphlet embeddings for a specific set of graphlets and cannot be updated to incorporate modifications to these graphlets. Instead, a new Pg2vec model should be trained on the modified graphlets. Lastly, it is important to note that the memory consumption of a Pg2vec model is O(2(n+m)*d), where n is the number of vertices in the graph, m is the number of graphlets in the graph, and d is the embedding length.
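The O(2(n+m)*d) bound can be turned into a rough byte estimate. The 4-byte-per-value (float32) assumption below is ours, not from the PGX documentation:

```python
def pg2vec_memory_bytes(n, m, d, bytes_per_value=4):
    """Rough memory estimate for a Pg2vec model: O(2(n+m)*d) values,
    assuming each embedding value is a 4-byte float (an assumption)."""
    return 2 * (n + m) * d * bytes_per_value

# Illustrative figures: 100k vertices, 4127 graphlets (as in NCI109),
# 200-dimensional embeddings.
estimate = pg2vec_memory_bytes(100_000, 4_127, 200)
print(estimate / 1e6, "MB")
```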

Functionalities

We describe here the main functionalities of our implementation of Pg2vec in PGX, using the NCI109 dataset (which contains 4127 graphs) as an example.

Loading a graph

First, we create a session and an analyst:

import pypgx

session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()

The Pg2vec algorithm can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

graph = session.read_graph_with_properties(self.small_graph)

Building a Pg2vec Model (minimal)

We can build a Pg2vec model using the minimal configuration and default hyper-parameters as follows:

model = analyst.pg2vec_builder(
    graphlet_id_property_name="graph_id",
    vertex_property_names=["category"],
    window_size=4,
    walks_per_vertex=5,
    walk_length=8
)

We specify the property that identifies each graphlet with the graphlet_id_property_name parameter, and the vertex properties employed by Pg2vec with the vertex_property_names parameter. We can also use the weakly connected components (WCC) functionality in PGX to determine the graphlets in a given graph.
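The idea behind using WCC for graphlet identification can be sketched in plain Python with a union-find pass (in practice you would use PGX's own WCC implementation; this toy version only illustrates the labeling):

```python
def weakly_connected_components(num_vertices, edges):
    """Label each vertex with a component id via union-find; in Pg2vec
    these component ids can serve as graphlet ids."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:  # edge direction is ignored ("weakly" connected)
        parent[find(a)] = find(b)

    roots = {find(v) for v in range(num_vertices)}
    label = {root: i for i, root in enumerate(sorted(roots))}
    return [label[find(v)] for v in range(num_vertices)]

# Two graphlets: vertices {0, 1, 2} and {3, 4}.
print(weakly_connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```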

Building a Pg2vec Model (customized)

We can build a Pg2vec model using customized hyper-parameters as follows:

model = analyst.pg2vec_builder(
    graphlet_id_property_name="graph_id",
    vertex_property_names=["category"],
    min_word_frequency=1,
    batch_size=128,
    num_epochs=5,
    layer_size=200,
    learning_rate=0.04,
    min_learning_rate=0.0001,
    window_size=4,
    walks_per_vertex=5,
    walk_length=8,
    use_graphlet_size=True,
    graphlet_size_property_name="graphletSize-Pg2vec",
)

A complete explanation of each builder operation (along with its default value) is available in our Pg2vecModelBuilder docs.

Training the Pg2vec model

We can train a Pg2vec model with the specified (default or customized) settings as follows:

model.fit(graph)

Getting the loss value

We can fetch the loss value on a specified fraction of training data as follows:

loss = model.loss

Computing the similar graphlets

We can fetch the k most similar graphlets for a given graphlet with the following code:

similars = model.compute_similars(52, 10)

The output has the following format; this example searches for the graphlets most similar to the graphlet with ID = 52 using the trained model:

| dstGraphlet | similarity |
|---|---|
| 52 | 1.0 |
| 10 | 0.8748674392700195 |
| 23 | 0.8551455140113831 |
| 26 | 0.8493421673774719 |
| 47 | 0.8411962985992432 |
| 25 | 0.8281504511833191 |
| 43 | 0.8202780485153198 |
| 24 | 0.8179885745048523 |
| 8 | 0.796689510345459 |
| 9 | 0.7947834134101868 |
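A table like the one above is essentially a top-k query over the learned embedding vectors. A plausible pure-Python sketch of such a query, ranking by cosine similarity (the embedding values below are made up for illustration, not real Pg2vec output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy graphlet embeddings (illustrative values only).
embeddings = {52: [1.0, 0.0], 10: [0.9, 0.2], 23: [0.5, 0.5], 7: [-1.0, 0.1]}

def top_k_similar(query_id, k):
    """Return the k graphlets most similar to the query graphlet."""
    query = embeddings[query_id]
    scores = {gid: cosine(query, vec) for gid, vec in embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

result = top_k_similar(52, 3)
print(result)  # the query graphlet itself ranks first with similarity 1.0
```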

The visualization of two similar graphlets (top: ID = 52 and bottom: ID = 10).


Computing the similars (for a graphlet batch)

We can fetch the k most similar graphlets for a batch of input graphlets with the following code:

batched_similars = model.compute_similars([52, 41], 10)

The output has the following format; this example searches for the graphlets most similar to the graphlets with ID = 52 and ID = 41 using the trained model:

| srcGraphlet | dstGraphlet | similarity |
|---|---|---|
| 52 | 52 | 1.0 |
| 52 | 10 | 0.8748674392700195 |
| 52 | 23 | 0.8551455140113831 |
| 52 | 26 | 0.8493421673774719 |
| 52 | 47 | 0.8411962985992432 |
| 52 | 25 | 0.8281504511833191 |
| 52 | 43 | 0.8202780485153198 |
| 52 | 24 | 0.8179885745048523 |
| 52 | 8 | 0.796689510345459 |
| 52 | 9 | 0.7947834134101868 |
| 41 | 41 | 1.0 |
| 41 | 197 | 0.9653506875038147 |
| 41 | 84 | 0.9552277326583862 |
| 41 | 157 | 0.9465565085411072 |
| 41 | 65 | 0.9287481307983398 |
| 41 | 248 | 0.9177336096763611 |
| 41 | 315 | 0.9043129086494446 |
| 41 | 92 | 0.8998928070068359 |
| 41 | 297 | 0.8897411227226257 |
| 41 | 50 | 0.8810243010520935 |

Inferring a graphlet vector

We can infer the vector representation for a given new graphlet with the following code:

from pypgx.api.filters import VertexFilter

graphlet = graph.filter(VertexFilter("vertex.graph_id = 1"))
inferred_vector = model.infer_graphlet_vector(graphlet)
inferred_vector.print()

The schema for the inferred_vector would be as follows:

| graphlet | embedding |
|---|---|

Inferring vectors (for a graphlet batch)

We can infer the vector representations for multiple graphlets (specified with different graph-ids in a graph) with the following code:

graphlets = session.read_graph_with_properties(self.small_graph)
inferred_vector_batched = model.infer_graphlet_vector_batched(graphlets)
inferred_vector_batched.print()

The schema is the same as for infer_graphlet_vector, but with one row per input graphlet.

Getting all trained graphlet vectors

We can retrieve the trained graphlet vectors for the current Pg2vec model as follows:

graphlet_vectors = model.trained_graphlet_vectors.flatten_all()
graphlet_vectors.store(
    path=tmp + "/graphlet_vectors.tsv",
    overwrite=True,
    file_format="csv"
)

The schema is the same as for infer_graphlet_vector, but with one row for each graphlet in the input graph.

Storing a trained model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained Pg2vec model to a specified file path:

model.export().file(path=tmp + "/model.model", key="test", overwrite=True)

When stored in a database, a model is saved as a row inside a model store table. The following shows how to store a trained Pg2vec model in a specific model store table:

model.export().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

It is possible to load a pre-trained Pg2vec model from a specified file path as follows:

model = analyst.get_pg2vec_model_loader().file(
    path=tmp + "/model.model",
    key="test"
)

We can load a pre-trained Pg2vec model from a model store table in database as follows:

model = analyst.get_pg2vec_model_loader().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Destroying a model

We can destroy a model with the following operation:

model.destroy()