DeepWalk

Overview of the algorithm

DeepWalk is a vertex representation learning algorithm that is widely used in industry (for example, at Alibaba's Taobao). It consists of two main steps:

  • First, the random walk generation step computes random walks for each vertex (with a pre-defined walk length and a pre-defined number of walks per vertex).

  • Second, the generated walks are fed to a word2vec algorithm, which produces a vector representation for each vertex (each vertex plays the role of a word in the input given to word2vec). Further details regarding the DeepWalk algorithm are available in the KDD paper.
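The first step above can be sketched in plain Python. This is a minimal illustration, not the PGX implementation: the function name `generate_walks`, the toy graph, uniform neighbor sampling, and the reading of the walk length as a number of vertices are all assumptions made for the example.

```python
import random


def generate_walks(adjacency, walks_per_vertex, walk_length, seed=0):
    """Return walks_per_vertex uniform random walks starting at every vertex.

    adjacency maps each vertex to a list of its neighbors (undirected).
    walk_length is taken here as the number of vertices in a walk (assumption).
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_vertex):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated vertex: the walk cannot continue
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph: the path a - b - c - d, stored as neighbor lists.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
walks = generate_walks(toy_graph, walks_per_vertex=6, walk_length=4)
print(len(walks))  # 4 vertices * 6 walks per vertex = 24
```

Each walk is then treated like a sentence whose words are vertex identifiers, which is the form of input the second step expects.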

DeepWalk creates vertex embeddings for a specific graph, and a trained model cannot be updated to incorporate modifications to that graph. Instead, a new DeepWalk model should be trained on the modified graph. Lastly, it is important to note that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length.
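To get a feel for the 2n*d estimate, the following back-of-the-envelope sketch turns it into bytes. It assumes 4-byte floating-point values and an embedding length of 100; both are illustrative assumptions, and the real footprint of a PGX model may include further overhead. The vertex count is that of the DBpedia graph used as the example later in this document.

```python
def embedding_memory_bytes(num_vertices, embedding_length, bytes_per_value=4):
    """Estimate 2 * n * d stored values times the size of one value.

    bytes_per_value=4 assumes single-precision floats (an assumption).
    """
    return 2 * num_vertices * embedding_length * bytes_per_value


n = 8_637_721  # vertices in the DBpedia example graph
d = 100        # illustrative embedding length
estimate = embedding_memory_bytes(n, d)
print(f"{estimate / 2**30:.2f} GiB")  # about 6.44 GiB
```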

Functionalities

Here we describe how to use the main functionalities of our DeepWalk implementation in PGX, using the DBpedia graph as an example (8,637,721 vertices and 165,049,964 edges).

Loading a graph

First, we create a session and an analyst:

session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()

Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

# graph_config is a placeholder for the configuration of the graph to load
graph = session.read_graph_with_properties(graph_config)
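The remark above about undirected random walks means that a walk may traverse an edge in either direction. A minimal sketch of that idea, independent of PGX, is to build neighbor lists that ignore edge direction. The helper name `undirected_adjacency` and the edge list are hypothetical, and how PGX treats multi-edges or self-loops is not covered here.

```python
from collections import defaultdict


def undirected_adjacency(directed_edges):
    """Build neighbor lists that ignore edge direction."""
    neighbors = defaultdict(set)
    for src, dst in directed_edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # also allow stepping against the edge direction
    return {vertex: sorted(adjacent) for vertex, adjacent in neighbors.items()}


# Hypothetical directed edges: a -> b, b -> c, d -> c.
edges = [("a", "b"), ("b", "c"), ("d", "c")]
adjacency = undirected_adjacency(edges)
print(adjacency["c"])  # ['b', 'd']
```

With these neighbor lists, a walk that reaches c can continue to d even though the only edge between them points from d to c.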

Building a DeepWalk Model (minimal)

We build a DeepWalk model using the minimal configuration and default hyper-parameters:

model = analyst.deepwalk_builder(
    window_size=3,
    walks_per_vertex=6,
    walk_length=4
)
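To illustrate how the window size interacts with the walk length, the sketch below lists the (center, context) pairs that a skip-gram style word2vec would draw from a single walk. It assumes a fixed symmetric window; actual word2vec implementations may sample a smaller effective window per position, and the helper name `context_pairs` is hypothetical.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs within window_size positions of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "b", "c", "d"]  # one walk of 4 vertices
pairs = context_pairs(walk, window_size=3)
print(len(pairs))  # 12: every vertex is paired with the 3 others
```

With a walk of 4 vertices and a window of 3, every vertex in a walk sees every other vertex in that walk, so a larger window would not add any further pairs.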

Building a DeepWalk Model (customized)

We build a DeepWalk model using customized hyper-parameters:

model = analyst.deepwalk_builder(
    min_word_frequency=1,
    batch_size=512,
    num_epochs=1,
    layer_size=100,
    learning_rate=0.05,
    min_learning_rate=0.0001,
    window_size=3,
    walks_per_vertex=6,
    walk_length=4,
    sample_rate=1.0,
    negative_sample=2
)
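The pair of learning_rate and min_learning_rate suggests a learning rate that decays during training and is clamped at a floor. A linear decay is one schedule commonly used by word2vec implementations. Whether PGX uses exactly this schedule is an assumption, and the function below is only a sketch with a hypothetical name.

```python
def linear_learning_rate(step, total_steps, learning_rate=0.05, min_learning_rate=0.0001):
    """Decay linearly from learning_rate, never dropping below min_learning_rate.

    This schedule is an assumption for illustration, not the documented PGX behavior.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max(min_learning_rate, learning_rate * (1.0 - progress))


for step in (0, 500, 1000):
    print(step, linear_learning_rate(step, total_steps=1000))
```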

We provide a complete explanation of each builder option (along with its default value) in our pypgx.api.mllib.Analyst.deepwalk_builder() docs.

Training the DeepWalk model

We can train a DeepWalk model with the specified (default or customized) settings:

model.fit(graph)

Getting the loss value

We can fetch the loss value on a specified fraction of training data:

loss = model.loss

Computing the similar vertices

We can fetch the k most similar vertices for a given vertex:

similars = model.compute_similars(9, 2)
similars.print()

The output has the following format. The example shows the result of searching for the vertices most similar to Albert_Einstein using the trained model:

dstVertex             similarity
--------------------  ------------------
Albert_Einstein       1.0000001192092896
Physics               0.8664291501045227
Werner_Heisenberg     0.8625140190124512
Richard_Feynman       0.8496938943862915
List_of_physicists    0.8415523767471313
Physicist             0.8384397625923157
Max_Planck            0.8370327353477478
Niels_Bohr            0.8340970873832703
Quantum_mechanics     0.8331197500228882
Special_relativity    0.8280861973762512
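As an illustration of what a "most similar vertices" query computes, the sketch below ranks vertices by cosine similarity to a query vertex. It assumes the similarity score is cosine similarity, which is the usual choice for word2vec-style embeddings but is not confirmed by this page. The helper names and the 2-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query, k):
    """Return the k (vertex, similarity) pairs closest to query, best first."""
    query_vector = embeddings[query]
    scored = [
        (vertex, cosine_similarity(query_vector, vector))
        for vertex, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "Albert_Einstein": [1.0, 0.1],
    "Physics": [0.9, 0.2],
    "Machine_learning": [0.1, 1.0],
}
for vertex, similarity in most_similar(toy_embeddings, "Albert_Einstein", 2):
    print(vertex, round(similarity, 4))
```

As in the table above, the query vertex itself comes first with a similarity of about 1. The value 1.0000001192092896 is 1 + 2^-23, which is consistent with rounding in single-precision arithmetic.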

Computing the similar vertices (for a vertex batch)

We can fetch the k most similar vertices for a list of input vertices:

vertices = [5, 9]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()

The output results will be in the following format:

srcVertex         dstVertex                  similarity
----------------  -------------------------  ------------------
Machine_learning  Machine_learning           1.0000001192092896
Machine_learning  Data_mining                0.9070799350738525
Machine_learning  Computer_science           0.8963605165481567
Machine_learning  Unsupervised_learning      0.8828719854354858
Machine_learning  R_(programming_language)   0.8821185827255249
Machine_learning  Algorithm                  0.8819515705108643
Machine_learning  Artificial_neural_network  0.8773092031478882
Machine_learning  Data_analysis              0.8758628368377686
Machine_learning  List_of_algorithms         0.8737979531288147
Machine_learning  K-means_clustering         0.8715602159500122
Albert_Einstein   Albert_Einstein            1.0000001192092896
Albert_Einstein   Physics                    0.8664291501045227
Albert_Einstein   Werner_Heisenberg          0.8625140190124512
Albert_Einstein   Richard_Feynman            0.8496938943862915
Albert_Einstein   List_of_physicists         0.8415523767471313
Albert_Einstein   Physicist                  0.8384397625923157
Albert_Einstein   Max_Planck                 0.8370327353477478
Albert_Einstein   Niels_Bohr                 0.8340970873832703
Albert_Einstein   Quantum_mechanics          0.8331197500228882
Albert_Einstein   Special_relativity         0.8280861973762512
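Because the batched result is a single flat table, client code often needs to regroup its rows per source vertex. The sketch below assumes the rows have already been fetched into Python tuples of (srcVertex, dstVertex, similarity); how to iterate over the PGX result frame is not shown here, and the helper name `group_by_source` is hypothetical.

```python
from collections import defaultdict


def group_by_source(rows):
    """Map each srcVertex to its list of (dstVertex, similarity) pairs."""
    grouped = defaultdict(list)
    for src, dst, similarity in rows:
        grouped[src].append((dst, similarity))
    return dict(grouped)


# A few rows copied from the table above.
rows = [
    ("Machine_learning", "Machine_learning", 1.0000001192092896),
    ("Machine_learning", "Data_mining", 0.9070799350738525),
    ("Albert_Einstein", "Albert_Einstein", 1.0000001192092896),
    ("Albert_Einstein", "Physics", 0.8664291501045227),
]
grouped = group_by_source(rows)
print(grouped["Albert_Einstein"][1][0])  # Physics
```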

Getting all trained vertex vectors

We can retrieve the trained vertex vectors of the current DeepWalk model and store them in a TSV file (a CSV file with a tab separator):

vertex_vectors = model.trained_vectors.flatten_all()
vertex_vectors.store(
    tmp + "/vertex_vectors.tsv",
    overwrite=True,
    file_format="csv"
)

Without flattening, the trained vectors have the following schema (flatten_all() splits the vector column into separate double-valued columns):

vertexId    embedding
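The effect of flattening can be pictured with plain Python: each (vertexId, embedding) row becomes one row with the vertex ID followed by one column per vector component, which can then be written with a tab separator. This is only a sketch of the resulting layout with made-up values and hypothetical helper names; it does not use or reproduce the PGX frame API.

```python
import csv
import io


def flatten_rows(rows):
    """Turn (vertexId, [e0, e1, ...]) rows into flat [vertexId, e0, e1, ...] rows."""
    return [[vertex_id, *embedding] for vertex_id, embedding in rows]


def write_tsv(rows, handle):
    """Write rows as tab-separated values, one row per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Made-up 2-dimensional embeddings, for illustration only.
rows = [("Albert_Einstein", [0.12, -0.5]), ("Physics", [0.1, -0.4])]
buffer = io.StringIO()
write_tsv(flatten_rows(rows), buffer)
print(buffer.getvalue())
```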

Storing a trained model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained DeepWalk model to a specified file path:

model.export().file(path=tmp + "/model.model", key="test", overwrite=True)

When models are stored in a database, each model is stored as a row inside a model store table. The following shows how to store a trained DeepWalk model in a specific model store table in a database:

model.export().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

We can load a pre-trained DeepWalk model from a specified file path as follows:

model = analyst.get_deepwalk_model_loader().file(path=tmp + "/model.model", key="test")

We can load a pre-trained DeepWalk model from a model store table in a database as follows:

model = analyst.get_deepwalk_model_loader().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Destroying a model

We can destroy a model as follows:

model.destroy()
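Since a model holds server-side resources until it is destroyed, it can be convenient to make the cleanup automatic. The following sketch wraps any object that has a destroy() method in a context manager. The `destroying` helper and the `FakeModel` stand-in are hypothetical and not part of pypgx; they only demonstrate the pattern.

```python
from contextlib import contextmanager


@contextmanager
def destroying(model):
    """Yield model and call model.destroy() on exit, even after an exception."""
    try:
        yield model
    finally:
        model.destroy()


class FakeModel:
    """Stand-in used only to demonstrate the helper without a PGX server."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


fake = FakeModel()
with destroying(fake) as m:
    pass  # real code would train and query the model here
print(fake.destroyed)  # True
```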