PGX 20.2.2
Documentation

DeepWalk

Overview of the Algorithm

DeepWalk is a widely employed vertex representation learning algorithm used in industry (e.g., in Taobao from Alibaba). It consists of two main steps:

  • First, the random walk generation step computes random walks for each vertex (with a pre-defined walk length and a pre-defined number of walks per vertex).
  • Second, the generated walks are fed to a word2vec algorithm to generate the vector representation for each vertex (each vertex plays the role of a word in the input to the word2vec algorithm). Further details regarding the DeepWalk algorithm are available in the KDD paper.

DeepWalk creates vertex embeddings for a specific graph and cannot be updated to incorporate modifications of the graph. Instead, a new DeepWalk model should be trained on the modified graph. Lastly, it is important to note that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length (for example, for the DBpedia graph used below, with roughly 8.6 million vertices and a layer size of 100, this is on the order of 1.7 billion stored values).
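To make the two steps concrete, the sketch below is a minimal, self-contained illustration of the first step (plain Java, not the PGX implementation; the graph, class, and method names are invented here): it generates fixed-length uniform random walks for every vertex, and these walks would then be fed as "sentences" to a word2vec trainer in the second step.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomWalkSketch {

    // One walk of length walkLength starting at start, following uniformly random neighbors.
    static List<String> walk(Map<String, List<String>> adj, String start, int walkLength, Random rnd) {
        List<String> path = new ArrayList<>();
        path.add(start);
        String current = start;
        for (int i = 1; i < walkLength; i++) {
            List<String> neighbors = adj.get(current);
            if (neighbors == null || neighbors.isEmpty()) {
                break; // dead end: stop this walk early
            }
            current = neighbors.get(rnd.nextInt(neighbors.size()));
            path.add(current);
        }
        return path;
    }

    // walksPerVertex walks for every vertex, mirroring DeepWalk's walk-generation step.
    static List<List<String>> generateWalks(Map<String, List<String>> adj, int walksPerVertex, int walkLength) {
        Random rnd = new Random(42);
        List<List<String>> corpus = new ArrayList<>();
        for (String vertex : adj.keySet()) {
            for (int i = 0; i < walksPerVertex; i++) {
                corpus.add(walk(adj, vertex, walkLength, rnd));
            }
        }
        return corpus; // each walk plays the role of a sentence in the word2vec input
    }

    public static void main(String[] args) {
        // A toy undirected graph as an adjacency list.
        Map<String, List<String>> adj = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("A", "C"),
            "C", List.of("A", "B"));
        generateWalks(adj, 6, 4).forEach(System.out::println);
    }
}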

Functionalities

We describe here the usage of the main functionalities of our implementation of DeepWalk in PGX, using the DBpedia graph (with 8,637,721 vertices and 165,049,964 edges) as an example.

Loading a Graph

First, we create a session and an analyst:

cd $PGX_HOME
./bin/pgx-jshell
// starting the shell will create an implicit session and analyst
import oracle.pgx.api.*;
import oracle.pgx.api.beta.mllib.DeepWalkModel;
import oracle.pgx.api.beta.frames.*;

...

PgxSession session = Pgx.createSession("my-session");

Analyst analyst = session.createAnalyst();
import pypgx

session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()

Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

pgx> var graph = session.readGraphWithProperties("<path>/dbpedia.json")
PgxGraph graph = session.readGraphWithProperties("<path>/dbpedia.json");
graph = session.read_graph_with_properties("<path>/dbpedia.json")

Building a DeepWalk Model (minimal)

We build a DeepWalk model using the minimal configuration and default hyper-parameters:

pgx> var model = analyst.deepWalkModelBuilder().
         setWindowSize(3).
         setWalksPerVertex(6).
         setWalkLength(4).
         build()
DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .build();
model = analyst.deepwalk_builder(window_size=3, walks_per_vertex=6, walk_length=4)

Building a DeepWalk Model (customized)

We build a DeepWalk model using customized hyper-parameters:

pgx> var model = analyst.deepWalkModelBuilder().
         setMinWordFrequency(1).
         setBatchSize(512).
         setNumEpochs(1).
         setLayerSize(100).
         setLearningRate(0.05).
         setMinLearningRate(0.0001).
         setWindowSize(3).
         setWalksPerVertex(6).
         setWalkLength(4).
         setSampleRate(0.00001).
         setNegativeSample(2).
         setValidationFraction(0.01).
         build()
DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setMinWordFrequency(1)
    .setBatchSize(512)
    .setNumEpochs(1)
    .setLayerSize(100)
    .setLearningRate(0.05)
    .setMinLearningRate(0.0001)
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .setSampleRate(0.00001)
    .setNegativeSample(2)
    .setValidationFraction(0.01)
    .build();
model = analyst.deepwalk_builder(min_word_frequency=1,
                                 batch_size=512,
                                 num_epochs=1,
                                 layer_size=100,
                                 learning_rate=0.05,
                                 min_learning_rate=0.0001,
                                 window_size=3,
                                 walks_per_vertex=6,
                                 walk_length=4,
                                 sample_rate=0.00001,
                                 negative_sample=2,
                                 validation_fraction=0.01)

We provide a complete explanation of each builder operation (along with the default values) in the DeepWalkModelBuilder javadocs.

Training the DeepWalk Model

We can train a DeepWalk model with the specified (default or customized) settings:

pgx> model.fit(graph)
model.fit(graph);
model.fit(graph)

Getting Loss Value

We can fetch the loss value computed on a specified fraction of the training data (set in the builder using setValidationFraction):

pgx> var loss = model.getLoss()
double loss = model.getLoss();
loss = model.loss

Computing the Similar Vertices

We can fetch the k most similar vertices for a given vertex:

pgx> var similars = model.computeSimilars("Albert_Einstein", 10)
pgx> similars.print()
PgxFrame similars = model.computeSimilars("Albert_Einstein", 10);
similars.print();
similars = model.compute_similars("Albert_Einstein", 10)
similars.print()

The output results will be in the following format, for example when searching for vertices similar to Albert_Einstein using the trained model:

+-----------------------------------------+
| dstVertex          | similarity         |
+-----------------------------------------+
| Albert_Einstein    | 1.0000001192092896 |
| Physics            | 0.8664291501045227 |
| Werner_Heisenberg  | 0.8625140190124512 |
| Richard_Feynman    | 0.8496938943862915 |
| List_of_physicists | 0.8415523767471313 |
| Physicist          | 0.8384397625923157 |
| Max_Planck         | 0.8370327353477478 |
| Niels_Bohr         | 0.8340970873832703 |
| Quantum_mechanics  | 0.8331197500228882 |
| Special_relativity | 0.8280861973762512 |
+-----------------------------------------+

Computing the Similars (for a Vertex Batch)

We can fetch the k most similar vertices for a list of input vertices:

pgx> var vertices = new ArrayList<String>()
pgx> vertices.add("Machine_learning")
pgx> vertices.add("Albert_Einstein")
pgx> var batchedSimilars = model.computeSimilars(vertices, 10)
pgx> batchedSimilars.print()
List<String> vertices = Arrays.asList("Machine_learning", "Albert_Einstein");
PgxFrame batchedSimilars = model.computeSimilars(vertices, 10);
batchedSimilars.print();
vertices = ["Machine_learning", "Albert_Einstein"]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()

The output results will be in the following format:

+-------------------------------------------------------------------+
| srcVertex        | dstVertex                 | similarity         |
+-------------------------------------------------------------------+
| Machine_learning | Machine_learning          | 1.0000001192092896 |
| Machine_learning | Data_mining               | 0.9070799350738525 |
| Machine_learning | Computer_science          | 0.8963605165481567 |
| Machine_learning | Unsupervised_learning     | 0.8828719854354858 |
| Machine_learning | R_(programming_language)  | 0.8821185827255249 |
| Machine_learning | Algorithm                 | 0.8819515705108643 |
| Machine_learning | Artificial_neural_network | 0.8773092031478882 |
| Machine_learning | Data_analysis             | 0.8758628368377686 |
| Machine_learning | List_of_algorithms        | 0.8737979531288147 |
| Machine_learning | K-means_clustering        | 0.8715602159500122 |
| Albert_Einstein  | Albert_Einstein           | 1.0000001192092896 |
| Albert_Einstein  | Physics                   | 0.8664291501045227 |
| Albert_Einstein  | Werner_Heisenberg         | 0.8625140190124512 |
| Albert_Einstein  | Richard_Feynman           | 0.8496938943862915 |
| Albert_Einstein  | List_of_physicists        | 0.8415523767471313 |
| Albert_Einstein  | Physicist                 | 0.8384397625923157 |
| Albert_Einstein  | Max_Planck                | 0.8370327353477478 |
| Albert_Einstein  | Niels_Bohr                | 0.8340970873832703 |
| Albert_Einstein  | Quantum_mechanics         | 0.8331197500228882 |
| Albert_Einstein  | Special_relativity        | 0.8280861973762512 |
+-------------------------------------------------------------------+

Getting All Trained Vertex Vectors

We can retrieve the trained vertex vectors for the current DeepWalk model and store them in a TSV file (CSV with tab separator):

pgx> var vertexVectors = model.getTrainedVertexVectors().flattenAll()
pgx> vertexVectors.write().
    overwrite(true).
    csv().
    separator('\t').
    store("<path>/vertex_vectors.tsv")
PgxFrame vertexVectors = model.getTrainedVertexVectors().flattenAll();
vertexVectors.write()
    .overwrite(true)
    .csv()
    .separator('\t')
    .store("<path>/vertex_vectors.tsv");
vertex_vectors = model.get_trained_vertex_vectors().flatten_all()
vertex_vectors.store("<path>/vertex_vectors.tsv", overwrite=True, file_format="csv")

Without flattening, the schema for vertexVectors would be as follows (flattenAll splits the vector column into separate double-valued columns):

+---------------------------------------------------------------+
| vertexId                                | embedding           |
+---------------------------------------------------------------+
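
Once exported, the vectors can be consumed outside PGX, for example to compare embeddings in a downstream application. The sketch below is plain Java (not a PGX API) and assumes each TSV row is a vertex identifier followed by its tab-separated embedding values with no header line; adjust the parsing if the actual export differs.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class LoadVertexVectors {

    // Read the TSV into a map from vertex id to embedding vector.
    static Map<String, double[]> load(Path tsv) throws IOException {
        Map<String, double[]> vectors = new HashMap<>();
        for (String line : Files.readAllLines(tsv)) {
            String[] fields = line.split("\t");
            double[] vec = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                vec[i - 1] = Double.parseDouble(fields[i]);
            }
            vectors.put(fields[0], vec); // first column: vertex id
        }
        return vectors;
    }

    // Cosine similarity between two vectors (a common way to compare embeddings).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) throws IOException {
        Map<String, double[]> vectors = load(Path.of("<path>/vertex_vectors.tsv"));
        System.out.println(cosine(vectors.get("Albert_Einstein"), vectors.get("Physics")));
    }
}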

Storing a Trained Model

We can store a trained DeepWalk model, encrypted with a key, to a specified path:

pgx> model.store("<path>/<modelName>","encryption_key")
model.store("<path>/<modelName>","encryption_key");
model.store("<path>/<modelName>","encryption_key")

Loading a Pre-trained Model

We can load a pre-trained, encrypted DeepWalk model from a specified path:

pgx> var model = analyst.loadDeepWalkModel("<path>/<modelName>","encryption_key")
DeepWalkModel model = analyst.loadDeepWalkModel("<path>/<modelName>","encryption_key");
model = analyst.load_deepwalk_model("<path>/<modelName>","encryption_key")
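
A loaded model can be used for inference right away, for example by querying similar vertices as shown earlier. A short usage sketch, reusing the model variable loaded above:

// The loaded model can be queried directly; no new call to fit is needed.
PgxFrame similars = model.computeSimilars("Albert_Einstein", 10);
similars.print();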

Destroying a Model

We can destroy a model as follows:

pgx> model.destroy()
model.destroy();
model.destroy()