PGX 21.1.1
Documentation

DeepWalk

Overview of the Algorithm

DeepWalk is a widely employed vertex representation learning algorithm used in industry (e.g., in Taobao from Alibaba). It consists of two main steps:

  • First, the random walk generation step computes random walks for each vertex (with a pre-defined walk length and a pre-defined number of walks per vertex).
  • Second, these generated walks are fed to a word2vec algorithm to generate the vector representation for each vertex (each vertex plays the role of a word in the input provided to the word2vec algorithm).

Further details regarding the DeepWalk algorithm are available in the KDD paper.
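
The random walk generation step can be illustrated with a minimal, self-contained sketch. It is not the PGX implementation: we assume uniform neighbor selection on an undirected graph stored as symmetric adjacency lists, and the function name generate_walks and the toy graph are hypothetical.

```python
import random


def generate_walks(adjacency, walk_length, walks_per_vertex, seed=0):
    """Return uniform random walks, each a list of vertex ids.

    Assumption: walk_length counts the vertices in a walk, and the next
    vertex is picked uniformly among the neighbors of the current one.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_vertex):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated vertex: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy undirected graph (hypothetical data), stored as symmetric adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = generate_walks(toy_graph, walk_length=4, walks_per_vertex=6)
print(len(walks))  # 24 walks: 4 vertices x 6 walks per vertex
print(walks[0])    # one walk starting at vertex "a"
```

Each walk then plays the role of a sentence for the word2vec step, with vertex ids as the words.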

DeepWalk creates vertex embeddings for a specific graph, and a trained model cannot be updated to incorporate modifications to that graph. Instead, a new DeepWalk model should be trained on the modified graph. Note also that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length.
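
To get a feel for the 2n*d figure, the following sketch turns it into a rough byte count. It uses the vertex count of the DBpedia example graph from this document and an embedding length of 100. The 4 bytes per value (single precision) is our assumption, not a documented property of PGX, and the helper name is hypothetical.

```python
def deepwalk_memory_bytes(num_vertices, embedding_length, bytes_per_value=4):
    """Rough model size following the 2n*d figure.

    Assumption: every stored value takes bytes_per_value bytes
    (4 = single precision); overheads are ignored.
    """
    return 2 * num_vertices * embedding_length * bytes_per_value


n = 8_637_721  # vertices in the DBpedia example graph
d = 100        # embedding length (layer size)

estimate = deepwalk_memory_bytes(n, d)
print(estimate)                    # 6910176800
print(round(estimate / 2**30, 2))  # 6.44 (GiB)
```

Under these assumptions, doubling the embedding length doubles the estimate, so the layer size is the main lever when memory is tight.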

Functionalities

We describe here how to use the main functionalities of our DeepWalk implementation in PGX, using the DBpedia graph as an example (8,637,721 vertices and 165,049,964 edges).

Loading a Graph

First, we create a session and an analyst:

JShell:
cd $PGX_HOME
./bin/pgx-jshell
// starting the shell will create an implicit session and analyst

Java:
import oracle.pgx.api.*;
import oracle.pgx.api.mllib.DeepWalkModel;
import oracle.pgx.api.frames.*;

...

PgxSession session = Pgx.createSession("my-session");

Analyst analyst = session.createAnalyst();

Python:
session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()

Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:

JShell:
pgx> var graph = session.readGraphWithProperties("<path>/dbpedia.json")

Java:
PgxGraph graph = session.readGraphWithProperties("<path>/dbpedia.json");

Python:
graph = session.read_graph_with_properties("<path>/dbpedia.json")

Building a DeepWalk Model (minimal)

We build a DeepWalk model using the minimal configuration and default hyper-parameters:

JShell:
pgx> var model = analyst.deepWalkModelBuilder().
         setWindowSize(3).
         setWalksPerVertex(6).
         setWalkLength(4).
         build()

Java:
DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .build();

Python:
model = analyst.deepwalk_builder(window_size=3, walks_per_vertex=6, walk_length=4)
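
To illustrate what the window size controls, the sketch below enumerates the (center, context) pairs that a standard skip-gram word2vec would extract from a single walk. This is the textbook symmetric-window scheme. Whether PGX forms pairs in exactly this way is an assumption on our part, and context_pairs is a hypothetical helper.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs whose positions differ by at most window_size."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a vertex is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "b", "c", "d"]  # one walk with walk length 4

# With window size 3, every vertex of a length-4 walk sees all the others.
print(len(context_pairs(walk, window_size=3)))  # 12

# With window size 1, only direct neighbors in the walk are paired.
print(context_pairs(walk, window_size=1))
```

With the values used above (walk length 4, window size 3), the window already spans the whole walk, so a larger window would not add any pairs.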

Building a DeepWalk Model (customized)

We build a DeepWalk model using customized hyper-parameters:

JShell:
pgx> var model = analyst.deepWalkModelBuilder().
         setMinWordFrequency(1).
         setBatchSize(512).
         setNumEpochs(1).
         setLayerSize(100).
         setLearningRate(0.05).
         setMinLearningRate(0.0001).
         setWindowSize(3).
         setWalksPerVertex(6).
         setWalkLength(4).
         setSampleRate(0.00001).
         setNegativeSample(2).
         setValidationFraction(0.01).
         build()

Java:
DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setMinWordFrequency(1)
    .setBatchSize(512)
    .setNumEpochs(1)
    .setLayerSize(100)
    .setLearningRate(0.05)
    .setMinLearningRate(0.0001)
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .setSampleRate(0.00001)
    .setNegativeSample(2)
    .setValidationFraction(0.01)
    .build();

Python:
model = analyst.deepwalk_builder(min_word_frequency=1,
                                 batch_size=512,
                                 num_epochs=1,
                                 layer_size=100,
                                 learning_rate=0.05,
                                 min_learning_rate=0.0001,
                                 window_size=3,
                                 walks_per_vertex=6,
                                 walk_length=4,
                                 sample_rate=0.00001,
                                 negative_sample=2,
                                 validation_fraction=0.01)

We provide a complete explanation of each builder operation (along with the default values) in our DeepWalkModelBuilder javadocs.
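
The walk-related hyper-parameters determine how much training input the word2vec step receives. As a rough, hedged estimate for the DBpedia example graph, we assume that the walk length counts the vertices in a walk and that no walk ends early. The helper corpus_size is hypothetical.

```python
def corpus_size(num_vertices, walks_per_vertex, walk_length):
    """Return (number of walks, number of vertex occurrences) in the generated corpus.

    Assumptions: walk_length counts vertices and every walk reaches full length.
    """
    num_walks = num_vertices * walks_per_vertex
    num_tokens = num_walks * walk_length
    return num_walks, num_tokens


num_walks, num_tokens = corpus_size(8_637_721, walks_per_vertex=6, walk_length=4)
print(num_walks)   # 51826326
print(num_tokens)  # 207305304
```

Both numbers grow linearly with the number of vertices, so increasing the walks per vertex or the walk length has a direct effect on training time.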

Training the DeepWalk Model

We can train a DeepWalk model with the specified (default or customized) settings:

JShell:
pgx> model.fit(graph)

Java:
model.fit(graph);

Python:
model.fit(graph)

Getting Loss Value

We can fetch the loss value computed on a specified fraction of the training data (set in the builder using setValidationFraction):

JShell:
pgx> var loss = model.getLoss()

Java:
double loss = model.getLoss();

Python:
loss = model.loss

Computing the Similar Vertices

We can fetch the k most similar vertices for a given vertex:

JShell:
pgx> var similars = model.computeSimilars("Albert_Einstein", 10)
pgx> similars.print()

Java:
PgxFrame similars = model.computeSimilars("Albert_Einstein", 10);
similars.print();

Python:
similars = model.compute_similars("Albert_Einstein", 10)
similars.print()

The output has the following format. This example shows the vertices most similar to Albert_Einstein according to the trained model:

+-----------------------------------------+
| dstVertex          | similarity         |
+-----------------------------------------+
| Albert_Einstein    | 1.0000001192092896 |
| Physics            | 0.8664291501045227 |
| Werner_Heisenberg  | 0.8625140190124512 |
| Richard_Feynman    | 0.8496938943862915 |
| List_of_physicists | 0.8415523767471313 |
| Physicist          | 0.8384397625923157 |
| Max_Planck         | 0.8370327353477478 |
| Niels_Bohr         | 0.8340970873832703 |
| Quantum_mechanics  | 0.8331197500228882 |
| Special_relativity | 0.8280861973762512 |
+-----------------------------------------+
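
The similarity column can be read as a score between two embedding vectors. The sketch below assumes cosine similarity, which is the usual choice for word2vec-style embeddings but is not stated in this document, and uses made-up vectors. It also shows why a vertex can score slightly above 1 against itself: the value 1.0000001192092896 in the table above is exactly the next single-precision number after 1.0, which is consistent with a rounding effect in 32-bit arithmetic.

```python
import math
import struct


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical 3-dimensional embeddings, for illustration only.
einstein = [0.2, -0.5, 0.1]
physics = [0.25, -0.4, 0.3]
print(cosine_similarity(einstein, physics))

# The self-similarity from the table is 1 + 2^-23, i.e. one single-precision
# step above 1.0, and it is exactly representable as a 32-bit float.
self_similarity = 1.0000001192092896
print(self_similarity == 1.0 + 2.0 ** -23)             # True
print(to_float32(self_similarity) == self_similarity)  # True
```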

Computing the Similars (for a Vertex Batch)

We can fetch the k most similar vertices for a list of input vertices:

JShell:
pgx> var vertices = new ArrayList<String>()
pgx> vertices.add("Machine_learning")
pgx> vertices.add("Albert_Einstein")
pgx> var batchedSimilars = model.computeSimilars(vertices, 10)
pgx> batchedSimilars.print()

Java:
List<String> vertices = Arrays.asList("Machine_learning", "Albert_Einstein");
PgxFrame batchedSimilars = model.computeSimilars(vertices, 10);
batchedSimilars.print();

Python:
vertices = ["Machine_learning", "Albert_Einstein"]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()

The output results will be in the following format:

+-------------------------------------------------------------------+
| srcVertex        | dstVertex                 | similarity         |
+-------------------------------------------------------------------+
| Machine_learning | Machine_learning          | 1.0000001192092896 |
| Machine_learning | Data_mining               | 0.9070799350738525 |
| Machine_learning | Computer_science          | 0.8963605165481567 |
| Machine_learning | Unsupervised_learning     | 0.8828719854354858 |
| Machine_learning | R_(programming_language)  | 0.8821185827255249 |
| Machine_learning | Algorithm                 | 0.8819515705108643 |
| Machine_learning | Artificial_neural_network | 0.8773092031478882 |
| Machine_learning | Data_analysis             | 0.8758628368377686 |
| Machine_learning | List_of_algorithms        | 0.8737979531288147 |
| Machine_learning | K-means_clustering        | 0.8715602159500122 |
| Albert_Einstein  | Albert_Einstein           | 1.0000001192092896 |
| Albert_Einstein  | Physics                   | 0.8664291501045227 |
| Albert_Einstein  | Werner_Heisenberg         | 0.8625140190124512 |
| Albert_Einstein  | Richard_Feynman           | 0.8496938943862915 |
| Albert_Einstein  | List_of_physicists        | 0.8415523767471313 |
| Albert_Einstein  | Physicist                 | 0.8384397625923157 |
| Albert_Einstein  | Max_Planck                | 0.8370327353477478 |
| Albert_Einstein  | Niels_Bohr                | 0.8340970873832703 |
| Albert_Einstein  | Quantum_mechanics         | 0.8331197500228882 |
| Albert_Einstein  | Special_relativity        | 0.8280861973762512 |
+-------------------------------------------------------------------+
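
The shape of the batched result (k rows per source vertex, ordered by decreasing similarity) can be reproduced with a small stand-alone sketch. It assumes cosine similarity and a brute-force scan over all vertices. The toy embeddings and the helper top_k_similar are hypothetical and unrelated to the PGX internals.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_similar(embeddings, sources, k):
    """Return (srcVertex, dstVertex, similarity) rows, k per source, best first."""
    rows = []
    for src in sources:
        scored = (
            (cosine_similarity(embeddings[src], vec), dst)
            for dst, vec in embeddings.items()
        )
        for similarity, dst in heapq.nlargest(k, scored):
            rows.append((src, dst, similarity))
    return rows


# Hypothetical 2-dimensional embeddings, for illustration only.
embeddings = {
    "Machine_learning": [1.0, 0.0],
    "Data_mining": [0.9, 0.1],
    "Albert_Einstein": [0.0, 1.0],
    "Physics": [0.1, 0.9],
}

rows = top_k_similar(embeddings, ["Machine_learning", "Albert_Einstein"], k=2)
for row in rows:
    print(row)
```

As in the table above, each source vertex appears first in its own group, because no other vertex can be more similar to it than itself.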

Getting All Trained Vertex Vectors

We can retrieve the trained vertex vectors for the current DeepWalk model and store them in a TSV file (CSV with tab separator):

JShell:
pgx> var vertexVectors = model.getTrainedVertexVectors().flattenAll()
pgx> vertexVectors.write().
    overwrite(true).
    csv().
    separator('\t').
    store("<path>/vertex_vectors.tsv")

Java:
PgxFrame vertexVectors = model.getTrainedVertexVectors().flattenAll();
vertexVectors.write()
    .overwrite(true)
    .csv()
    .separator('\t')
    .store("<path>/vertex_vectors.tsv");

Python:
vertex_vectors = model.get_trained_vertex_vectors().flatten_all()
vertex_vectors.store("<path>/vertex_vectors.tsv", overwrite=True, file_format="csv")

Without flattening, vertexVectors has the following schema (flattenAll splits the vector column into separate double-valued columns):

+---------------------------------------------------------------+
| vertexId                                | embedding           |
+---------------------------------------------------------------+
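
The effect of flattening can be sketched without PGX: each (vertexId, embedding) row becomes one row with a separate column per vector component, which is the shape that ends up in the TSV file. The column names embedding_0, embedding_1, ... and the sample values are hypothetical, since the actual names produced by flattenAll are not shown here.

```python
import csv
import io


def flatten_rows(rows):
    """Turn (vertexId, embedding) rows into [vertexId, e0, e1, ...] rows."""
    return [[vertex_id, *embedding] for vertex_id, embedding in rows]


# Hypothetical 3-dimensional embeddings, for illustration only.
rows = [
    ("Albert_Einstein", [0.12, -0.53, 0.07]),
    ("Physics", [0.10, -0.48, 0.11]),
]
flat = flatten_rows(rows)

# Write tab-separated output to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["vertexId"] + [f"embedding_{i}" for i in range(3)])  # hypothetical header
writer.writerows(flat)
print(buffer.getvalue())
```

A file written this way can be read back with csv.reader using the same tab delimiter.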

Storing a Trained Model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained DeepWalk model to a specified file path:

JShell:
pgx> model.export().file().path("<path>/<modelName>").store()

Java:
model.export().file().path("<path>/<modelName>").store();

Python:
model.export().file(path="<path>/<modelName>")

When models are stored in a database, each model is stored as a row inside a model store table. The following shows how to store a trained DeepWalk model in a specific model store table in a database:

JShell:
pgx> model.export().db().
       username("user"). // DB user to use for storing the model
       password("password"). // password of the DB user
       jdbcUrl("jdbcUrl"). // jdbc url to the DB
       modelstore("modelstoretablename"). // name of the model store table
       modelname("model"). // name to give to the model (primary key of model store table)
       description("a model description"). // description to store alongside the model
       store();

Java:
model.export().db()
       .username("user") // DB user to use for storing the model
       .password("password") // password of the DB user
       .jdbcUrl("jdbcUrl") // jdbc url to the DB
       .modelstore("modelstoretablename") // name of the model store table
       .modelname("model") // name to give to the model (primary key of model store table)
       .description("a model description") // description to store alongside the model
       .store();

Python:
model.export().db(username="user", password="password",
                  model_store="modelstoretablename", model_name="model",
                  jdbc_url="jdbc_url")

Loading a Pre-trained Model

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

We can load a pre-trained DeepWalk model from a specified file path as follows:

JShell:
pgx> var model = analyst.loadDeepWalkModel().file().path("<path>/<modelName>").load()

Java:
DeepWalkModel model = analyst.loadDeepWalkModel().file().path("<path>/<modelName>").load();

Python:
model = analyst.get_deepwalk_model_loader().file(path="<path>/<modelName>")

We can load a pre-trained DeepWalk model from a model store table in a database as follows:

JShell:
pgx> var model = analyst.loadDeepWalkModel().db().
     username("user").
     password("password").
     jdbcUrl("jdbcUrl").
     modelstore("modeltablename").
     modelname("model").
     load();

Java:
DeepWalkModel model = analyst.loadDeepWalkModel().db()
    .username("user")
    .password("password")
    .jdbcUrl("jdbcUrl")
    .modelstore("modeltablename")
    .modelname("model")
    .load();

Python:
model = analyst.get_deepwalk_model_loader().db(username="user", password="password",
                                               model_store="modelstoretablename", model_name="model",
                                               jdbc_url="jdbc_url")

Destroying a Model

We can destroy a model as follows:

JShell:
pgx> model.destroy()

Java:
model.destroy();

Python:
model.destroy()