DeepWalk

Overview of the algorithm

DeepWalk is a widely employed vertex representation learning algorithm used in industry (e.g., in Taobao from Alibaba). It consists of two main steps:

First, the random walk generation step computes random walks for each vertex (with a pre-defined walk length and a pre-defined number of walks per vertex).
Second, these generated walks are fed to a word2vec algorithm to generate the vector representation for each vertex (each vertex plays the role of a word in the input provided to the word2vec algorithm). Further details regarding the DeepWalk algorithm are available in the KDD paper.
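To make the first step concrete, the following is a minimal sketch of random walk generation in plain Python. It is not the PGX implementation: the function name generate_walks, the toy graph, and the choice of uniform neighbor sampling are assumptions made for illustration, and the walk length is counted in vertices here.

```python
import random


def generate_walks(adjacency, walks_per_vertex, walk_length, seed=0):
    """Generate uniform random walks starting from every vertex.

    adjacency maps each vertex to the list of its neighbors
    (an undirected view of the graph).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated vertex: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph; every undirected edge is listed in both directions.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = generate_walks(toy_graph, walks_per_vertex=6, walk_length=4)
print(len(walks))  # 24 walks: 4 vertices x 6 walks per vertex
print(walks[0])    # one walk, e.g. a list of 4 vertex names
```

Each walk then serves as one "sentence" for the word2vec step, with vertices in place of words.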

DeepWalk creates vertex embeddings for a specific graph, and these embeddings cannot be updated to incorporate modifications to the graph. Instead, a new DeepWalk model should be trained on the modified graph. Note also that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length.
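As a rough illustration of this memory bound, the sketch below evaluates 2 * n * d for the DBpedia graph used as the running example in this document (8,637,721 vertices). The helper name deepwalk_memory_bytes is hypothetical, the embedding length of 100 is taken from the layer_size value in the customized builder example on the assumption that layer_size is the embedding length, and 4 bytes per value (32-bit floats) is an assumption, not a documented property of PGX.

```python
def deepwalk_memory_bytes(num_vertices, embedding_length, bytes_per_value=4):
    # 2 * n * d stored values; bytes_per_value=4 assumes 32-bit floats.
    return 2 * num_vertices * embedding_length * bytes_per_value


n = 8_637_721  # vertices in the DBpedia example graph
d = 100        # assumed embedding length (layer_size=100)

estimate = deepwalk_memory_bytes(n, d)
print(f"{estimate:,} bytes")             # 6,910,176,800 bytes
print(f"{estimate / 2**30:.2f} GiB")     # roughly 6.4 GiB
```

The estimate covers only the embedding values themselves and ignores any overhead of the walks or of the graph.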

Functionalities

Here we describe how to use the main functionalities of our DeepWalk implementation in PGX, using the DBpedia graph as an example (with 8,637,721 vertices and 165,049,964 edges).

Loading a graph

First, we create a session and an analyst:

session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()

Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

graph = session.read_graph_with_properties(graph_config)  # graph_config: placeholder for your graph configuration

Building a DeepWalk Model (minimal)

We build a DeepWalk model using the minimal configuration and default hyper-parameters:

model = analyst.deepwalk_builder(
    window_size=3,
    walks_per_vertex=6,
    walk_length=4
)

Building a DeepWalk Model (customized)

We build a DeepWalk model using customized hyper-parameters:

model = analyst.deepwalk_builder(
    min_word_frequency=1,
    batch_size=512,
    num_epochs=1,
    layer_size=100,
    learning_rate=0.05,
    min_learning_rate=0.0001,
    window_size=3,
    walks_per_vertex=6,
    walk_length=4,
    sample_rate=1.0,
    negative_sample=2
)

We provide a complete explanation of each builder option (along with the default values) in our pypgx.api.mllib.Analyst.deepwalk_builder() docs.
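To give some intuition for the window_size option, the sketch below shows how a word2vec-style model typically turns one walk into (center, context) training pairs. This illustrates the general skip-gram convention and is an assumption about how the window is applied, not a description of the PGX internals. The function name context_pairs is hypothetical.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs for vertices at most
    window_size positions apart in the walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a vertex is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "b", "c", "d"]
pairs = context_pairs(walk, window_size=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

Under this convention, a larger window_size pairs each vertex with vertices that are further away in the walk, so the embeddings reflect a wider neighborhood.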

Training the DeepWalk model

We can train a DeepWalk model with the specified (default or customized) settings:

model.fit(graph)

Getting the loss value

We can fetch the loss value on a specified fraction of training data:

loss = model.loss

Computing the similar vertices

We can fetch the k most similar vertices for a given vertex:

similars = model.compute_similars(9, 2)
similars.print()

The output has the following format. For example, searching for the 10 most similar vertices to Albert_Einstein with the trained model returns:

dstVertex	similarity
Albert_Einstein	1.0000001192092896
Physics	0.8664291501045227
Werner_Heisenberg	0.8625140190124512
Richard_Feynman	0.8496938943862915
List_of_physicists	0.8415523767471313
Physicist	0.8384397625923157
Max_Planck	0.8370327353477478
Niels_Bohr	0.8340970873832703
Quantum_mechanics	0.8331197500228882
Special_relativity	0.8280861973762512
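The similarity scores above lie close to 1 for the query vertex itself, which is what a cosine similarity between embedding vectors would produce. The sketch below shows a top-k lookup under that assumption. The metric is not stated in this document, and the helper names and toy embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, query, k):
    """Return the k (vertex, similarity) pairs closest to the query vertex.

    The query vertex itself is included, as in the output shown above.
    """
    query_vector = embeddings[query]
    scored = [
        (vertex, cosine_similarity(query_vector, vector))
        for vertex, vector in embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "x": [1.0, 0.0],
    "y": [0.9, 0.1],
    "z": [0.0, 1.0],
}

result = top_k_similar(toy_embeddings, "x", k=2)
for vertex, score in result:
    print(vertex, round(score, 4))
```

A brute-force scan like this is linear in the number of vertices per query. It is meant only to show what the scores represent.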

Computing the similars (for a vertex batch)

We can fetch the k most similar vertices for a list of input vertices:

vertices = [5, 9]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()

The output results will be in the following format:

srcVertex	dstVertex	similarity
Machine_learning	Machine_learning	1.0000001192092896
Machine_learning	Data_mining	0.9070799350738525
Machine_learning	Computer_science	0.8963605165481567
Machine_learning	Unsupervised_learning	0.8828719854354858
Machine_learning	R_(programming_language)	0.8821185827255249
Machine_learning	Algorithm	0.8819515705108643
Machine_learning	Artificial_neural_network	0.8773092031478882
Machine_learning	Data_analysis	0.8758628368377686
Machine_learning	List_of_algorithms	0.8737979531288147
Machine_learning	K-means_clustering	0.8715602159500122
Albert_Einstein	Albert_Einstein	1.0000001192092896
Albert_Einstein	Physics	0.8664291501045227
Albert_Einstein	Werner_Heisenberg	0.8625140190124512
Albert_Einstein	Richard_Feynman	0.8496938943862915
Albert_Einstein	List_of_physicists	0.8415523767471313
Albert_Einstein	Physicist	0.8384397625923157
Albert_Einstein	Max_Planck	0.8370327353477478
Albert_Einstein	Niels_Bohr	0.8340970873832703
Albert_Einstein	Quantum_mechanics	0.8331197500228882
Albert_Einstein	Special_relativity	0.8280861973762512

Getting all trained vertex vectors

We can retrieve the trained vertex vectors for the current DeepWalk model and store them in a TSV file (CSV with tab separator):

vertex_vectors = model.trained_vectors.flatten_all()
vertex_vectors.store(
    tmp + "/vertex_vectors.tsv",
    overwrite=True,
    file_format="csv"
)

Without flattening, the schema of the trained vectors is as follows (flatten_all() splits the vector column into separate double-valued columns):

vertexId	embedding
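Once the flattened vectors have been stored, they can be read back with standard tooling. The sketch below parses a tab-separated file into a dictionary from vertex ID to vector. The exact file layout is an assumption: we assume a header row, the vertex ID in the first column, and one column per embedding dimension. The column names and sample values are made up, and an in-memory string stands in for the stored file.

```python
import csv
import io

# Hypothetical file content standing in for the stored vertex_vectors.tsv.
sample_tsv = (
    "vertexId\temb_0\temb_1\temb_2\n"
    "Albert_Einstein\t0.12\t-0.40\t0.33\n"
    "Physics\t0.10\t-0.35\t0.30\n"
)


def read_vectors(handle):
    """Parse tab-separated rows into {vertex_id: [float, ...]}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the (assumed) header row
    return {row[0]: [float(value) for value in row[1:]] for row in reader}


vectors = read_vectors(io.StringIO(sample_tsv))
print(vectors["Physics"])  # [0.1, -0.35, 0.3]
```

To read a real file, open it with open(path, newline="") and pass the handle to read_vectors. If the stored file has no header row, remove the next(reader) call.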

Storing a trained model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained DeepWalk model to a specified file path:

model.export().file(path=tmp + "/model.model", key="test", overwrite=True)

When models are stored in a database, each model is stored as a row in a model store table. The following shows how to store a trained DeepWalk model in a specific model store table in a database:

model.export().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

We can load a pre-trained DeepWalk model from a specified file path as follows:

model = analyst.get_deepwalk_model_loader().file(path=tmp + "/model.model", key="test")

We can load a pre-trained DeepWalk model from a model store table in a database as follows:

model = analyst.get_deepwalk_model_loader().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)

Destroying a model

We can destroy a model as follows:

model.destroy()