DeepWalk is a widely employed vertex representation learning algorithm used in industry (e.g., in Taobao from Alibaba). It consists of two main steps: first, random walks are generated starting from each vertex of the graph; second, the generated walks are fed as sentences to a word2vec-style model that learns one embedding vector per vertex. Further details on the DeepWalk algorithm are available in the KDD paper.

DeepWalk creates vertex embeddings for a specific graph and cannot be updated to incorporate modifications to the graph. Instead, a new DeepWalk model should be trained on the modified graph. Lastly, note that the memory consumption of the DeepWalk model is O(2n*d), where n is the number of vertices in the graph and d is the embedding length.
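To make the walk-generation step more concrete, here is a minimal pure-Python sketch of uniform random walks over an adjacency list. It is an illustration only, not the PGX implementation: the function name `random_walks`, the toy graph, and the choice to count walk length in vertices are all assumptions made for this example.

```python
import random


def random_walks(adjacency, walks_per_vertex, walk_length, seed=0):
    """Generate uniform random walks.

    `adjacency` maps each vertex to a list of its neighbors.
    `walk_length` is counted in vertices here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated vertex: the walk stops early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph; each undirected edge is listed in both directions.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = random_walks(toy_graph, walks_per_vertex=6, walk_length=4)
print(len(walks), walks[0])
```

Each walk then plays the role of a sentence, and each vertex the role of a word, in the embedding step.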
We describe here the usage of the main functionalities of our implementation of DeepWalk in PGX, using the DBpedia graph as an example (8,637,721 vertices and 165,049,964 edges).
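As a rough worked example of the O(2n*d) memory bound, the arithmetic below plugs in the DBpedia vertex count. Two inputs are assumptions rather than documented facts: d = 100 (the layer size used in the customized builder example in this section) and 4 bytes per stored value (single-precision floats).

```python
n = 8_637_721          # number of vertices in the DBpedia graph
d = 100                # embedding length (assumption: layer size of 100)
bytes_per_value = 4    # assumption: values stored as 32-bit floats

values = 2 * n * d                     # the 2n*d term from the bound
approx_bytes = values * bytes_per_value
approx_gib = approx_bytes / 2**30

print(f"{values:,} values, {approx_bytes:,} bytes, {approx_gib:.2f} GiB")
```

Under these assumptions the estimate is about 6.9 GB (roughly 6.4 GiB) for the embedding values alone; actual usage depends on the real value width and on any overhead not captured by the bound.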
First, we create a session and an analyst:
cd $PGX_HOME
./bin/pgx-jshell
// starting the shell will create an implicit session and analyst

import oracle.pgx.api.*;
import oracle.pgx.api.mllib.DeepWalkModel;
import oracle.pgx.api.frames.*;
...
PgxSession session = Pgx.createSession("my-session");
Analyst analyst = session.createAnalyst();

import pypgx
session = pypgx.get_session(session_name="my-session")
analyst = session.create_analyst()
Our DeepWalk algorithm implementation can be applied to directed or undirected graphs (even though we only consider undirected random walks). To begin with, we can load a graph as follows:
pgx> var graph = session.readGraphWithProperties("<path>/dbpedia.json")
PgxGraph graph = session.readGraphWithProperties("<path>/dbpedia.json");
graph = session.read_graph_with_properties("<path>/dbpedia.json")
We build a DeepWalk model using the minimal configuration and default hyper-parameters:

pgx> var model = analyst.deepWalkModelBuilder().
    setWindowSize(3).
    setWalksPerVertex(6).
    setWalkLength(4).
    build()

DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .build();

model = analyst.deepwalk_builder(window_size=3, walks_per_vertex=6, walk_length=4)
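To give a feel for how the window size and walk length interact, the sketch below enumerates (center, context) training pairs from a single walk, in the way word2vec-style models commonly do. It assumes a symmetric window of `window_size` positions on each side of the center vertex; whether PGX interprets the parameter exactly this way is not stated here, and `context_pairs` is a hypothetical helper.

```python
def context_pairs(walk, window_size):
    """Return (center, context) pairs for one walk.

    Assumes a symmetric window: every vertex within `window_size`
    positions of the center counts as context.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


walk = ["a", "c", "d", "c"]  # a hypothetical walk of length 4
wide = context_pairs(walk, window_size=3)
narrow = context_pairs(walk, window_size=1)
print(len(wide), len(narrow))
```

With a walk of length 4 and a window of 3, every vertex in the walk is context for every other vertex, so a larger window would add nothing under this interpretation.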
We build a DeepWalk model using customized hyper-parameters:

pgx> var model = analyst.deepWalkModelBuilder().
    setMinWordFrequency(1).
    setBatchSize(512).
    setNumEpochs(1).
    setLayerSize(100).
    setLearningRate(0.05).
    setMinLearningRate(0.0001).
    setWindowSize(3).
    setWalksPerVertex(6).
    setWalkLength(4).
    setSampleRate(0.00001).
    setNegativeSample(2).
    setValidationFraction(0.01).
    build()

DeepWalkModel model = analyst.deepWalkModelBuilder()
    .setMinWordFrequency(1)
    .setBatchSize(512)
    .setNumEpochs(1)
    .setLayerSize(100)
    .setLearningRate(0.05)
    .setMinLearningRate(0.0001)
    .setWindowSize(3)
    .setWalksPerVertex(6)
    .setWalkLength(4)
    .setSampleRate(0.00001)
    .setNegativeSample(2)
    .setValidationFraction(0.01)
    .build();

model = analyst.deepwalk_builder(min_word_frequency=1,
                                 batch_size=512,
                                 num_epochs=1,
                                 layer_size=100,
                                 learning_rate=0.05,
                                 min_learning_rate=0.0001,
                                 window_size=3,
                                 walks_per_vertex=6,
                                 walk_length=4,
                                 sample_rate=0.00001,
                                 negative_sample=2,
                                 validation_fraction=0.01)

A complete explanation of each builder operation (along with the default values) is available in our DeepWalkModelBuilder javadocs.
We can train a DeepWalk model with the specified (default or customized) settings:
pgx> model.fit(graph)
model.fit(graph);
model.fit(graph)
We can fetch the loss value on a specified fraction of the training data (set in the builder using setValidationFraction):
pgx> var loss = model.getLoss()
double loss = model.getLoss();
loss = model.loss
We can fetch the k most similar vertices for a given vertex:

pgx> var similars = model.computeSimilars("Albert_Einstein", 10)
pgx> similars.print()

PgxFrame similars = model.computeSimilars("Albert_Einstein", 10);
similars.print();

similars = model.compute_similars("Albert_Einstein", 10)
similars.print()

The output has the following format. For example, searching for the vertices most similar to Albert_Einstein with the trained model gives:

+-----------------------------------------+
| dstVertex          | similarity         |
+-----------------------------------------+
| Albert_Einstein    | 1.0000001192092896 |
| Physics            | 0.8664291501045227 |
| Werner_Heisenberg  | 0.8625140190124512 |
| Richard_Feynman    | 0.8496938943862915 |
| List_of_physicists | 0.8415523767471313 |
| Physicist          | 0.8384397625923157 |
| Max_Planck         | 0.8370327353477478 |
| Niels_Bohr         | 0.8340970873832703 |
| Quantum_mechanics  | 0.8331197500228882 |
| Special_relativity | 0.8280861973762512 |
+-----------------------------------------+
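As a rough illustration of how such a ranking can be derived from embeddings, the sketch below ranks vertices by cosine similarity to a query vertex. That the similarity score is a cosine similarity is an assumption (it is not stated here), and `top_k_similar` together with the toy vectors is hypothetical and unrelated to the trained DBpedia model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, query, k):
    """Return the k (vertex, similarity) pairs most similar to `query`."""
    q = embeddings[query]
    scored = [(name, cosine(q, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "Albert_Einstein": [0.9, 0.1, 0.3],
    "Physics": [0.8, 0.2, 0.4],
    "Max_Planck": [0.7, 0.0, 0.5],
    "Machine_learning": [0.1, 0.9, 0.2],
}

result = top_k_similar(toy_embeddings, "Albert_Einstein", 3)
for name, score in result:
    print(name, score)
```

As in the table, the query vertex itself ranks first with a similarity of about 1. The value 1.0000001192092896 in the table equals 1 + 2^-23, the smallest single-precision number above 1, which would be consistent with rounding in 32-bit floating-point arithmetic.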
We can fetch the k most similar vertices for a list of input vertices:

pgx> var vertices = new ArrayList()
pgx> vertices.add("Machine_learning")
pgx> vertices.add("Albert_Einstein")
pgx> var batchedSimilars = model.computeSimilars(vertices, 10)
pgx> batchedSimilars.print()

List vertices = Arrays.asList("Machine_learning", "Albert_Einstein");
PgxFrame batchedSimilars = model.computeSimilars(vertices, 10);
batchedSimilars.print();

vertices = ["Machine_learning", "Albert_Einstein"]
batched_similars = model.compute_similars(vertices, 10)
batched_similars.print()

The output results will be in the following format:

+-------------------------------------------------------------------+
| srcVertex        | dstVertex                 | similarity         |
+-------------------------------------------------------------------+
| Machine_learning | Machine_learning          | 1.0000001192092896 |
| Machine_learning | Data_mining               | 0.9070799350738525 |
| Machine_learning | Computer_science          | 0.8963605165481567 |
| Machine_learning | Unsupervised_learning     | 0.8828719854354858 |
| Machine_learning | R_(programming_language)  | 0.8821185827255249 |
| Machine_learning | Algorithm                 | 0.8819515705108643 |
| Machine_learning | Artificial_neural_network | 0.8773092031478882 |
| Machine_learning | Data_analysis             | 0.8758628368377686 |
| Machine_learning | List_of_algorithms        | 0.8737979531288147 |
| Machine_learning | K-means_clustering        | 0.8715602159500122 |
| Albert_Einstein  | Albert_Einstein           | 1.0000001192092896 |
| Albert_Einstein  | Physics                   | 0.8664291501045227 |
| Albert_Einstein  | Werner_Heisenberg         | 0.8625140190124512 |
| Albert_Einstein  | Richard_Feynman           | 0.8496938943862915 |
| Albert_Einstein  | List_of_physicists        | 0.8415523767471313 |
| Albert_Einstein  | Physicist                 | 0.8384397625923157 |
| Albert_Einstein  | Max_Planck                | 0.8370327353477478 |
| Albert_Einstein  | Niels_Bohr                | 0.8340970873832703 |
| Albert_Einstein  | Quantum_mechanics         | 0.8331197500228882 |
| Albert_Einstein  | Special_relativity        | 0.8280861973762512 |
+-------------------------------------------------------------------+
We can retrieve the trained vertex vectors for the current DeepWalk model and store them in a TSV file (CSV with tab separator):

pgx> var vertexVectors = model.getTrainedVertexVectors().flattenAll()
pgx> vertexVectors.write().
    overwrite(true).
    csv().
    separator('\t').
    store("<path>/vertex_vectors.tsv")

PgxFrame vertexVectors = model.getTrainedVertexVectors().flattenAll();
vertexVectors.write()
    .overwrite(true)
    .csv()
    .separator('\t')
    .store("<path>/vertex_vectors.tsv");

vertex_vectors = model.get_trained_vertex_vectors().flatten_all()
vertex_vectors.store("<path>/vertex_vectors.tsv", overwrite=True, file_format="csv")

Without flattening, the schema of vertexVectors would be as follows (flattenAll splits the vector column into separate double-valued columns):

+---------------------------------------------------------------+
| vertexId                                | embedding           |
+---------------------------------------------------------------+
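To illustrate what flattening means for the stored file, the sketch below expands a vector column into one column per dimension and writes the result as tab-separated text using only the Python standard library. It does not use the PGX frame API; the helper `flatten_rows`, the column names `embedding_0`, `embedding_1`, and the toy vectors are assumptions made for this example.

```python
import csv
import io


def flatten_rows(vectors):
    """Turn {vertexId: [v0, v1, ...]} into a header plus one flat row per vertex."""
    dim = len(next(iter(vectors.values())))
    header = ["vertexId"] + [f"embedding_{i}" for i in range(dim)]
    rows = [[vertex_id, *vec] for vertex_id, vec in vectors.items()]
    return header, rows


# Hypothetical 2-dimensional embeddings, for illustration only.
vectors = {"a": [0.1, 0.2], "b": [0.3, 0.4]}
header, rows = flatten_rows(vectors)

# Write tab-separated output into an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

The unflattened form keeps a single embedding column holding the whole vector, whereas the flattened form has d separate numeric columns, which is what most TSV consumers expect.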
Models can be stored either to the server file system, or to a database.
The following shows how to store a trained DeepWalk model to a specified file path:
pgx> model.export().file().path("<path>/<modelName>").store()
model.export().file().path("<path>/<modelName>").store();
model.export().file(path="<path>/<modelName>")
When models are stored in a database, each model is stored as a row inside a model store table.
The following shows how to store a trained DeepWalk model in a database, in a specific model store table:

pgx> model.export().db().
    username("user").                     // DB user to use for storing the model
    password("password").                 // password of the DB user
    jdbcUrl("jdbcUrl").                   // jdbc url to the DB
    modelstore("modelstoretablename").    // name of the model store table
    modelname("model").                   // name to give to the model (primary key of model store table)
    description("a model description").   // description to store alongside the model
    store();

model.export().db()
    .username("user")                     // DB user to use for storing the model
    .password("password")                 // password of the DB user
    .jdbcUrl("jdbcUrl")                   // jdbc url to the DB
    .modelstore("modelstoretablename")    // name of the model store table
    .modelname("model")                   // name to give to the model (primary key of model store table)
    .description("a model description")   // description to store alongside the model
    .store();

model.export().db(username="user",
                  password="password",
                  model_store="modelstoretablename",
                  model_name="model",
                  jdbc_url="jdbc_url")
Similarly to storing, models can be loaded from a file in the server file system, or from a database.
We can load a pre-trained DeepWalk model from a specified file path as follows:
pgx> var model = analyst.loadDeepWalkModel().file().path("<path>/<modelName>").load()
DeepWalkModel model = analyst.loadDeepWalkModel().file().path("<path>/<modelName>").load();
model = analyst.get_deepwalk_model_loader().file(path="<path>/<modelName>")
We can load a pre-trained DeepWalk model from a model store table in a database as follows:

pgx> var model = analyst.loadDeepWalkModel().db().
    username("user").
    password("password").
    jdbcUrl("jdbcUrl").
    modelstore("modeltablename").
    modelname("model").
    load();

DeepWalkModel model = analyst.loadDeepWalkModel().db()
    .username("user")
    .password("password")
    .jdbcUrl("jdbcUrl")
    .modelstore("modeltablename")
    .modelname("model")
    .load();

model = analyst.get_deepwalk_model_loader().db(username="user",
                                               password="password",
                                               model_store="modelstoretablename",
                                               model_name="model",
                                               jdbc_url="jdbc_url")
We can destroy a model as follows:
pgx> model.destroy()
model.destroy();
model.destroy()