DeepWalk
Overview of the algorithm
DeepWalk is a widely used vertex representation learning algorithm that is deployed in industry (e.g., at Taobao, part of Alibaba).
It consists of two main steps:
First, the random walk generation step computes random walks for each vertex (with a pre-defined walk length and a pre-defined number of walks per vertex).
Second, these generated walks are fed to a word2vec algorithm to generate the vector representation for each vertex (each vertex plays the role of a word in the input provided to the word2vec algorithm).
Further details regarding the DeepWalk algorithm are available in the KDD paper.
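The first step can be illustrated with a minimal pure-Python sketch. It is not the PGX implementation: the function name generate_walks, the adjacency-list representation, and the interpretation of walk_length as the number of vertices in a walk are assumptions made for illustration only.

```python
import random


def generate_walks(adjacency, walks_per_vertex, walk_length, seed=None):
    """Generate uniform random walks; walk_length counts vertices (assumption)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated vertex: the walk cannot continue
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy undirected graph (hypothetical data): the path 0 - 1 - 2 - 3
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walks = generate_walks(adjacency, walks_per_vertex=6, walk_length=4, seed=42)
print(len(walks))  # 4 vertices * 6 walks each = 24
print(walks[0])
```

Each walk is then treated like a sentence whose words are vertex identifiers, which is the input that the second (word2vec) step consumes.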
DeepWalk creates vertex embeddings for a specific graph and cannot be updated to incorporate modifications to the graph.
Instead, a new DeepWalk model should be trained on the modified graph. Lastly, it is important to note that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length.
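As a rough back-of-the-envelope check, the bound can be turned into a byte estimate. The sketch below assumes that the model keeps two n-by-d matrices (as word2vec-style models typically do) and that each value takes 4 bytes; both are assumptions, and the helper name embedding_memory_bytes is hypothetical.

```python
def embedding_memory_bytes(num_vertices, embedding_length, bytes_per_value=4):
    """Estimate the size of two (n x d) matrices; 4-byte values are an assumption."""
    return 2 * num_vertices * embedding_length * bytes_per_value


# Example: one million vertices with an embedding length of 100
estimate = embedding_memory_bytes(1_000_000, 100)
print(f"{estimate / 1e9:.2f} GB")  # 0.80 GB
```

The estimate grows linearly in both n and d, so doubling the embedding length doubles the footprint.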
Functionalities
We describe here the main functionalities of our implementation of DeepWalk in PGX, using the DBpedia graph as an example (8,637,721 vertices and 165,049,964 edges).
Loading a graph
First, we create a session and an analyst:
session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()
Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:
graph = session.read_graph_with_properties(self.small_graph)
Building a DeepWalk Model (minimal)
We build a DeepWalk model using the minimal configuration and default hyper-parameters:
model = analyst.deepwalk_builder(
    window_size=3,
    walks_per_vertex=6,
    walk_length=4
)
Building a DeepWalk Model (customized)
We build a DeepWalk model using customized hyper-parameters:
model = analyst.deepwalk_builder(
    min_word_frequency=1,
    batch_size=512,
    num_epochs=1,
    layer_size=100,
    learning_rate=0.05,
    min_learning_rate=0.0001,
    window_size=3,
    walks_per_vertex=6,
    walk_length=4,
    sample_rate=1.0,
    negative_sample=2
)
We provide a complete explanation of each builder option (along with its default value) in our pypgx.api.mllib.Analyst.deepwalk_builder() docs.
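To build intuition for how walk_length and window_size interact, the following sketch enumerates the (center, context) pairs that a plain skip-gram model would draw from a single walk. It assumes that window_size is the maximum distance between a center vertex and its context vertices on either side, which is the usual word2vec meaning but is not confirmed here for PGX; the helper context_pairs is hypothetical.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs within window_size positions of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "b", "c", "d"]  # one walk with walk_length=4 (toy vertex names)
pairs = context_pairs(walk, window_size=3)
print(len(pairs))  # 12: every vertex is paired with the other three
```

Under this reading, a window of 3 already covers an entire walk of length 4, so a larger window would only make a difference with longer walks.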
Training the DeepWalk model
We can train a DeepWalk model with the specified (default or customized) settings:
model.fit(graph)
Getting Loss value
We can fetch the loss value on a specified fraction of training data:
loss = model.loss
Computing the similar vertices
We can fetch the k most similar vertices for a given vertex:
similars = model.compute_similars(9, 2)
similars.print()
The output will be in the following format, e.g., when searching for vertices similar to Albert_Einstein using the trained model:
dstVertex | similarity |
---|---|
Albert_Einstein | 1.0000001192092896 |
Physics | 0.8664291501045227 |
Werner_Heisenberg | 0.8625140190124512 |
Richard_Feynman | 0.8496938943862915 |
List_of_physicists | 0.8415523767471313 |
Physicist | 0.8384397625923157 |
Max_Planck | 0.8370327353477478 |
Niels_Bohr | 0.8340970873832703 |
Quantum_mechanics | 0.8331197500228882 |
Special_relativity | 0.8280861973762512 |
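The similarity scores above can be reproduced in spirit with a small sketch. It assumes that the score is the cosine similarity between embedding vectors, which is the usual choice for word2vec-style embeddings but is an assumption here; the toy vectors and the helper names are hypothetical. The query vertex ranks first with a score of about 1, and a value such as 1.0000001192092896 (1 + 2^-23) is consistent with single-precision rounding.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, query, k):
    """Rank all vertices by similarity to the query vertex and keep the top k."""
    scores = [
        (vertex, cosine_similarity(embeddings[query], vector))
        for vertex, vector in embeddings.items()
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Made-up 3-dimensional embeddings, for illustration only
embeddings = {
    "Albert_Einstein": [0.9, 0.1, 0.3],
    "Physics": [0.8, 0.2, 0.4],
    "Machine_learning": [0.1, 0.9, 0.2],
}
for vertex, score in top_k_similar(embeddings, "Albert_Einstein", k=2):
    print(f"{vertex}\t{score:.4f}")
```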
Computing the similars (for a vertex batch)
We can fetch the k most similar vertices for a list of input vertices:
vertices = [5, 9]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()
The output results will be in the following format:
srcVertex | dstVertex | similarity |
---|---|---|
Machine_learning | Machine_learning | 1.0000001192092896 |
Machine_learning | Data_mining | 0.9070799350738525 |
Machine_learning | Computer_science | 0.8963605165481567 |
Machine_learning | Unsupervised_learning | 0.8828719854354858 |
Machine_learning | R_(programming_language) | 0.8821185827255249 |
Machine_learning | Algorithm | 0.8819515705108643 |
Machine_learning | Artificial_neural_network | 0.8773092031478882 |
Machine_learning | Data_analysis | 0.8758628368377686 |
Machine_learning | List_of_algorithms | 0.8737979531288147 |
Machine_learning | K-means_clustering | 0.8715602159500122 |
Albert_Einstein | Albert_Einstein | 1.0000001192092896 |
Albert_Einstein | Physics | 0.8664291501045227 |
Albert_Einstein | Werner_Heisenberg | 0.8625140190124512 |
Albert_Einstein | Richard_Feynman | 0.8496938943862915 |
Albert_Einstein | List_of_physicists | 0.8415523767471313 |
Albert_Einstein | Physicist | 0.8384397625923157 |
Albert_Einstein | Max_Planck | 0.8370327353477478 |
Albert_Einstein | Niels_Bohr | 0.8340970873832703 |
Albert_Einstein | Quantum_mechanics | 0.8331197500228882 |
Albert_Einstein | Special_relativity | 0.8280861973762512 |
Getting all trained vertex vectors
We can retrieve the trained vertex vectors for the current DeepWalk model and store them in a TSV file (CSV with tab separator):
vertex_vectors = model.trained_vectors.flatten_all()
vertex_vectors.store(
    tmp + "/vertex_vectors.tsv",
    overwrite=True,
    file_format="csv"
)
The schema of vertex_vectors would be as follows without flattening (flatten_all() splits the vector column into separate double-valued columns):
vertexId | embedding |
---|---|
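Once stored, the flattened file can be read back with standard tooling. The sketch below assumes a layout in which each row holds the vertex identifier followed by one column per embedding component, tab-separated and without a header row; the actual file written by PGX may differ (for example, it may include a header line that needs to be skipped). The sample rows and the helper load_vectors are hypothetical.

```python
import csv
import os
import tempfile

# Write a tiny stand-in file so the example is self-contained (made-up values)
sample_rows = [["0", "0.12", "-0.40", "0.33"], ["1", "0.05", "0.27", "-0.91"]]
path = os.path.join(tempfile.mkdtemp(), "vertex_vectors.tsv")
with open(path, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(sample_rows)


def load_vectors(tsv_path):
    """Map each vertex id (first column) to its embedding (remaining columns)."""
    vectors = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            vectors[row[0]] = [float(value) for value in row[1:]]
    return vectors


vectors = load_vectors(path)
print(len(vectors), len(vectors["0"]))  # 2 3
```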
Storing a trained model
Models can be stored either to the server file system, or to a database.
The following shows how to store a trained DeepWalk model to a specified file path:
model.export().file(path=tmp + "/model.model", key="test", overwrite=True)
When models are stored in a database, each model is stored as a row inside a model store table.
The following shows how to store a trained DeepWalk model in a specific model store table of a database:
model.export().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)
Loading a pre-trained model
Similarly to storing, models can be loaded from a file in the server file system, or from a database.
We can load a pre-trained DeepWalk model from a specified file path as follows:
analyst.get_deepwalk_model_loader().file(path=tmp + "/model.model", key="test")
We can load a pre-trained DeepWalk model from a model store table in a database as follows:
analyst.get_deepwalk_model_loader().db(
    username="user",
    password="password",
    model_store="modelstoretablename",
    model_name="model",
    jdbc_url="jdbc_url"
)
Destroying a model
We can destroy a model as follows:
model.destroy()