UnsupervisedEdgeWise

Overview

UnsupervisedEdgeWise is an inductive edge representation learning algorithm that can leverage vertex and edge feature information. It can be applied to a wide variety of tasks, including edge classification and link prediction.

UnsupervisedEdgeWise is built on top of the GraphWise model. It uses the source vertex embedding and the destination vertex embedding generated by the GraphWise model to produce inductive edge embeddings. The training is based on [Deep Graph Infomax (DGI) by Velickovic et al.](https://arxiv.org/pdf/1809.10341.pdf).

Model Structure

An UnsupervisedEdgeWise model consists of graph convolutional layers followed by an embedding layer, which defaults to a DGI layer.

First, the source and destination vertices of the target edge are processed through the convolutional layers. The forward pass through a convolutional layer for a vertex proceeds as follows:

  1. A set of neighbors of the vertex is sampled.

  2. The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.

  3. This concatenated vector is multiplied with weights, and a bias vector is added.

  4. The result is normalized such that the layer output has unit norm.
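The four steps above can be sketched in plain Python. This is only an illustrative sketch, not the PGX implementation: the function name, the toy graph and the parameter values are hypothetical, neighbors are sampled with replacement for simplicity, and no activation function is applied.

```python
import math
import random


def conv_layer_forward(vertex, neighbors, prev, weights, bias, num_samples, rng):
    """One forward pass for `vertex`, following the four steps above.

    `prev` maps each vertex to its previous-layer representation (list of floats).
    """
    # 1. Sample a set of neighbors (with replacement, an assumption of this sketch).
    sampled = [rng.choice(neighbors[vertex]) for _ in range(num_samples)]

    # 2. Mean-aggregate the sampled neighbors and concatenate the result
    #    with the vertex's own previous-layer representation.
    dim = len(prev[vertex])
    mean = [sum(prev[n][i] for n in sampled) / len(sampled) for i in range(dim)]
    concat = prev[vertex] + mean

    # 3. Multiply with the weights and add the bias vector.
    out = [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weights, bias)]

    # 4. Normalize so that the layer output has unit norm.
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


rng = random.Random(0)
neighbors = {"u": ["a", "b", "c"]}
prev = {"u": [1.0, 0.0], "a": [0.0, 1.0], "b": [1.0, 1.0], "c": [0.5, 0.5]}
# Toy parameters: 3 output units, input size 4 (vertex + aggregated neighbors).
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
bias = [0.1, 0.1, 0.1]

h_u = conv_layer_forward("u", neighbors, prev, weights, bias, num_samples=2, rng=rng)
```

Because only the sampled neighborhood and the vertex's own features are needed, the same function can be applied to vertices that were not seen during training, which is what makes the approach inductive.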

The edge combination layer concatenates the source vertex embedding, the edge features and the destination vertex embedding, and forwards the result through a linear layer to obtain the edge embedding.
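As a rough illustration of this combination step, the following sketch concatenates the three parts and applies a linear layer. The helper names, the concatenation order and the toy weights are assumptions made for the example, not the PGX internals.

```python
def linear(weights, bias, x):
    """Apply a linear layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def edge_embedding(src_emb, edge_feats, dst_emb, weights, bias):
    # Concatenate source embedding, edge features and destination embedding ...
    combined = src_emb + edge_feats + dst_emb
    # ... and forward the result through a linear layer.
    return linear(weights, bias, combined)


src_emb = [0.6, 0.8]   # source vertex embedding
dst_emb = [1.0, 0.0]   # destination vertex embedding
edge_feats = [4.0]     # e.g. a single rating feature

# Toy linear layer mapping the 5 concatenated inputs to a 2-dimensional edge embedding.
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.0, 0.0]]
bias = [0.0, 1.0]

emb = edge_embedding(src_emb, edge_feats, dst_emb, weights, bias)
print(emb)  # [0.6, 3.0]
```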

The DGI Layer consists of three parts enabling unsupervised learning using embeddings produced by the convolution layers.

  1. Corruption function: Shuffles the node features while preserving the graph structure to produce negative embedding samples using the convolution layers.

  2. Readout function: Sigmoid-activated mean of the embeddings, used as a summary of the graph.

  3. Discriminator: Measures the similarity of positive (unshuffled) embeddings with the summary, as well as the similarity of negative samples with the summary, from which the loss function is computed.

Since none of these contains mutable hyperparameters, the default DGI layer is always used and cannot be adjusted.
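To make the three DGI parts more concrete, here is a simplified, vertex-level sketch in plain Python. It is not the PGX implementation: the stand-in encoder, the bilinear form of the discriminator (as in the DGI paper) and the binary cross-entropy loss are assumptions chosen for illustration, and all names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(features, neighbors):
    """Stand-in encoder: average each vertex's features with its neighbors'."""
    out = []
    for v, feats in enumerate(features):
        group = [feats] + [features[n] for n in neighbors[v]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def corrupt(features, rng):
    """Corruption: shuffle feature rows across vertices; the graph is unchanged."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def readout(embeddings):
    """Readout: sigmoid-activated mean of the embeddings (graph summary)."""
    n = len(embeddings)
    return [sigmoid(sum(col) / n) for col in zip(*embeddings)]


def discriminate(embedding, summary, w):
    """Discriminator: bilinear similarity score between an embedding and the summary."""
    ws = [sum(wij * sj for wij, sj in zip(row, summary)) for row in w]
    return sigmoid(sum(h * x for h, x in zip(embedding, ws)))


def dgi_loss(features, neighbors, w, rng):
    positive = encode(features, neighbors)
    negative = encode(corrupt(features, rng), neighbors)
    summary = readout(positive)
    # Binary cross-entropy: positives should score close to 1, negatives close to 0.
    loss = 0.0
    for h in positive:
        loss -= math.log(discriminate(h, summary, w))
    for h in negative:
        loss -= math.log(1.0 - discriminate(h, summary, w))
    return loss / (len(positive) + len(negative))


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
w = [[1.0, 0.0], [0.0, 1.0]]  # toy discriminator weights
loss = dgi_loss(features, neighbors, w, rng)
```

Minimizing this loss pushes embeddings of the real graph towards the summary and embeddings of the corrupted graph away from it, which is why no labels are required.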

Functionalities

We describe here the usage of the main functionalities of UnsupervisedEdgeWise in PGX, using the [Movielens](https://movielens.org) graph as an example.

Loading a graph

First, we create a session and an analyst:

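import pypgx

session = pypgx.get_session()
analyst = session.analyst

Next, we load the graph and split its edges: edges whose ID is not a multiple of 4 form the training graph, and the remaining edges are held out as test edges:

from pypgx.api.filters import EdgeFilter

# cpath is the path to the graph configuration
full_graph = session.read_graph_with_properties(cpath)

# keep the edges with ID(e) % 4 > 0 for training
edge_filter = EdgeFilter.from_pgql_result_set(
    session.query_pgql("SELECT e FROM movielens MATCH (v1) -[e]-> (v2) WHERE ID(e) % 4 > 0"), "e"
)
train_graph = full_graph.filter(edge_filter)

# every edge that is not in the training graph becomes a test edge
test_edges = []
train_edges = train_graph.get_edges()
for e in full_graph.get_edges():
    if not train_edges.contains(e):
        test_edges.append(e)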
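The effect of the `ID(e) % 4 > 0` predicate can be illustrated without PGX. The sketch below assumes, purely for illustration, that edge IDs are the contiguous integers 0 to 11; in that case three quarters of the edges end up in the training set and one quarter in the test set.

```python
edge_ids = list(range(12))  # hypothetical contiguous edge IDs

# Mirrors the PGQL predicate ID(e) % 4 > 0 used for the training graph.
train_ids = [e for e in edge_ids if e % 4 > 0]
train_set = set(train_ids)
test_ids = [e for e in edge_ids if e not in train_set]

print(train_ids)  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
print(test_ids)   # [0, 4, 8]
```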

Example: computing edge embeddings on the Movielens Dataset

We describe here the usage of UnsupervisedEdgeWise in PGX using the [Movielens](https://movielens.org) graph as an example.

This data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies, with simple demographic info for the users (age, gender, occupation) and movies (year, avg_rating, genre).

Users and movies are vertices, while ratings of users to movies are edges with a rating feature. We will use EdgeWise to compute the edge embeddings.

We first build the model and fit it on the train_graph:

# configure a convolutional layer that samples 10 neighbors per vertex
conv_layer_config = dict(num_sampled_neighbors=10)

conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)

params = dict(conv_layer_config=[conv_layer],
            vertex_input_property_names=["movie_year", "avg_rating", "movie_genres",
                "user_occupation_label", "user_gender", "raw_user_age"],
            edge_input_property_names=["user_rating"],
            num_epochs=10,
            embedding_dim=32,
            learning_rate=0.003,
            normalize=False,  # recommended
            seed=0)

model = analyst.unsupervised_edgewise_builder(**params)

model.fit(train_graph)

Since EdgeWise is inductive, we can compute the embeddings for unseen edges:

embeddings = model.infer_embeddings(full_graph, test_edges)
embeddings.print()

This returns the embeddings for any edge as:

+-----------------------------------------+---------------------+
| edgeId                                  | embedding           |
+-----------------------------------------+---------------------+

Building an EdgeWise Model (minimal)

We build an EdgeWise model using the minimal configuration and default hyperparameters. Note that even though only one feature property is needed for the model to work (either on vertices with vertex_input_property_names or on edges with edge_input_property_names), you can specify arbitrarily many.

params = dict(
    vertex_input_property_names=["features"],
    edge_input_property_names=["edge_features"]
)

model = analyst.unsupervised_edgewise_builder(**params)

Advanced hyperparameter customization

The implementation allows for very rich hyperparameter customization. Internally, for each node, GraphWise aggregates the representations of its neighbors. This operation can be configured through a sub-config class: either GraphWiseConvLayerConfig or GraphWiseAttentionLayerConfig.

In the following, we build such a configuration and use it in a model. We specify a weight decay of 0.001 and a dropout probability of 0.5 to counteract overfitting. We also recommend disabling the normalization of embeddings when they are intended for use in downstream classification tasks.

To enable or disable the GPU, we can use the parameter enable_accelerator. This feature is enabled by default; however, if there is no GPU device and the CUDA toolkit is not installed, the feature is disabled and the CPU is used for all mllib operations.

# use PageRank values as neighbor sampling weights
weight_property = analyst.pagerank(train_graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5
)

conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)

params = dict(
    conv_layer_config=[conv_layer],
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    seed=17,
    weight_decay=0.001,
    normalize=False,  # recommended
    enable_accelerator=True  # enable or disable GPU
)

model = analyst.unsupervised_edgewise_builder(**params)

The above code uses GraphWiseConvLayerConfig for the convolutional layer configuration. It can be replaced with GraphWiseAttentionLayerConfig if a graph attention network model is desired. If num_sampled_neighbors is set to -1, all neighboring nodes are sampled.

conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='leaky_relu',
    weight_init_scheme='xavier_uniform',
    num_heads=4,
    dropout_rate=0.5
)

conv_layer = analyst.graphwise_attention_layer_config(**conv_layer_config)

For a full description of all available hyperparameters and their default values, see the pypgx.api.mllib.UnsupervisedEdgeWiseModel, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseAttentionLayerConfig, pypgx.api.mllib.GraphWiseDgiLayerConfig and pypgx.api.mllib.GraphWiseDominantLayerConfig docs.

Property types supported

The model supports two types of properties for both vertices and edges:

  • continuous properties (boolean, double, float, integer, long)

  • categorical properties (string)

For categorical properties, two categorical configurations are possible:

  • one-hot-encoding: each category is mapped to a vector, that is concatenated to other features (default)

  • embedding table: each category is mapped to an embedding that is concatenated to other features and is trained along with the model

One-hot-encoding converts each category into an independent vector. Therefore, it is suitable if we want each category to be interpreted as an equally independent group. For instance, if the categories range from A to E and the letters carry no inherent meaning, one-hot-encoding can be a good fit.

The embedding table is recommended if the semantics of the properties matter and we want certain categories to be closer to each other than to others. For example, assume there is a “day” property with values ranging from Monday to Sunday, and we want to preserve our intuition that “Tuesday” is closer to “Wednesday” than to “Saturday”. By choosing the embedding table configuration, we let the vectors that represent the categories be learned during training, so that the vector mapped to “Tuesday” can end up close to that of “Wednesday”.

Although the embedding table approach has the advantage over one-hot-encoding that more suitable vectors can be learned for each category, this also means that a good amount of data is required to train the embedding table properly. The one-hot-encoding approach might be better for use cases with limited training data.

When using the embedding table, we let users set the out-of-vocabulary probability. With the given probability, the embedding will be set to the out-of-vocabulary embedding randomly during training, in order to make the model more robust to unseen categories during inference.

vertex_input_property_configs = [
    analyst.one_hot_encoding_categorical_property_config(
        property_name="vertex_str_feature_1",
        max_vocabulary_size=100,
    ),
    analyst.learned_embedding_categorical_property_config(
        property_name="vertex_str_feature_2",
        embedding_dim=4,
        shared=False,  # set whether to share the vocabulary or not when several types have a property with the same name
        oov_probability=0.001  # probability to set the word embedding to the out-of-vocabulary embedding
    ),
]

model_params = dict(
    vertex_input_property_names=[
        "vertex_int_feature_1",  # continuous feature
        "vertex_str_feature_1",  # string feature using one-hot-encoding
        "vertex_str_feature_2",  # string feature using embedding table
        "vertex_str_feature_3",  # string feature using one-hot-encoding (default)
    ],
    vertex_input_property_configs=vertex_input_property_configs,
)

model = analyst.unsupervised_edgewise_builder(**model_params)
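Conceptually, the two categorical configurations differ as in the following plain-Python sketch. It is only an illustration under stated assumptions, not the PGX implementation: the helper names are hypothetical, the embedding values are random stand-ins for trained weights, and mapping unseen categories to an all-zero one-hot vector is a choice made for this example.

```python
import random

vocabulary = ["Monday", "Tuesday", "Wednesday"]
index = {category: i for i, category in enumerate(vocabulary)}


def one_hot(category):
    """Fixed encoding: one independent axis per category."""
    vec = [0.0] * len(vocabulary)
    if category in index:  # unseen categories stay all-zero (assumption)
        vec[index[category]] = 1.0
    return vec


# Embedding table: one trainable row per category, plus an extra final row
# for the out-of-vocabulary (OOV) embedding. Values are random stand-ins.
rng = random.Random(0)
embedding_dim = 4
table = [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(len(vocabulary) + 1)]
OOV_ROW = len(vocabulary)


def lookup(category, oov_probability=0.0, training=False):
    """Return the embedding row for `category`.

    During training, the OOV row is substituted with the given probability so
    that it also gets trained; unseen categories always map to the OOV row.
    """
    row = index.get(category, OOV_ROW)
    if training and rng.random() < oov_probability:
        row = OOV_ROW
    return table[row]


print(one_hot("Tuesday"))  # [0.0, 1.0, 0.0]
```

With one-hot-encoding, every pair of categories is equally far apart, whereas the rows of the embedding table can move closer together or further apart as the model trains.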

Setting the edge embedding production method

The edge embedding is computed by default by combining the source vertex embedding, the destination vertex embedding and the edge features. You can manually set which of them are used by setting the EdgeCombinationMethod:

from pypgx.api.mllib import ConcatEdgeCombinationMethod

method_config = dict(
    use_source_vertex=True,
    use_destination_vertex=False,
    use_edge=True
)

method = ConcatEdgeCombinationMethod(**method_config)

params = dict(
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    edge_combination_method=method,
    seed=17
)

model = analyst.unsupervised_edgewise_builder(**params)
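The effect of the three flags can be pictured with a small plain-Python sketch. The `combine` helper and the concatenation order are hypothetical and only illustrate which parts reach the linear layer; they are not part of the pypgx API.

```python
def combine(src_emb, dst_emb, edge_feats,
            use_source_vertex=True, use_destination_vertex=True, use_edge=True):
    """Concatenate only the enabled parts, in source / edge / destination order."""
    parts = []
    if use_source_vertex:
        parts += src_emb
    if use_edge:
        parts += edge_feats
    if use_destination_vertex:
        parts += dst_emb
    return parts


src_emb, dst_emb, edge_feats = [0.6, 0.8], [1.0, 0.0], [4.0]

# Same flags as in the configuration above: the destination embedding is dropped.
combined = combine(src_emb, dst_emb, edge_feats,
                   use_source_vertex=True, use_destination_vertex=False, use_edge=True)
print(combined)  # [0.6, 0.8, 4.0]
```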

Training the UnsupervisedEdgeWiseModel

We can train an UnsupervisedEdgeWiseModel on a graph:

model.fit(train_graph)

Getting Loss value

We can fetch the training loss value:

loss = model.get_training_loss()

Inferring embeddings

We can use a trained model to infer embeddings for unseen edges and store them in a CSV file:

edge_vectors = model.infer_embeddings(full_graph, test_edges).flatten_all()
edge_vectors.store(file_format="csv", path="<path>/edge_vectors.csv", overwrite=True)

The schema for the edge_vectors would be as follows without flattening (flatten_all splits the vector column into separate double-valued columns):

+--------+-----------+
| edgeId | embedding |
+--------+-----------+
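What flattening does to this schema can be sketched as follows. The rows are made-up values, and the `embedding_<i>` column naming is an assumption for illustration; the actual column names produced by flatten_all may differ.

```python
rows = [
    {"edgeId": 0, "embedding": [0.1, 0.2, 0.3]},
    {"edgeId": 4, "embedding": [0.4, 0.5, 0.6]},
]


def flatten(row, vector_column="embedding"):
    """Split a vector column into one scalar column per component."""
    flat = {k: v for k, v in row.items() if k != vector_column}
    for i, value in enumerate(row[vector_column]):
        flat[f"{vector_column}_{i}"] = value  # column naming is an assumption
    return flat


flat_rows = [flatten(r) for r in rows]
print(flat_rows[0])
# {'edgeId': 0, 'embedding_0': 0.1, 'embedding_1': 0.2, 'embedding_2': 0.3}
```

A flat layout of this kind is convenient for CSV export and for tools that expect one numeric feature per column.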

Classifying the edges using the obtained embeddings

We can use the obtained embeddings in downstream edge classification tasks. The following shows how we can train an MLP classifier which takes the embeddings as input. We assume that the edge label information is stored under the edge property “labels”.

import pandas as pd
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


# prepare input data
edge_vectors_df = edge_vectors.to_pandas().astype({"edgeId": int})
edge_labels_df = pd.DataFrame([
    {"edgeId": e.id, "labels": properties}
    for e, properties in full_graph.get_edge_property("labels").get_values()
]).astype(int)

# join embeddings and labels on the edge ID
edge_vectors_with_labels_df = edge_vectors_df.merge(edge_labels_df, on="edgeId")

feature_columns = [c for c in edge_vectors_df.columns if c.startswith("embedding")]
x = edge_vectors_with_labels_df[feature_columns].to_numpy()
y = edge_vectors_with_labels_df["labels"].to_numpy()

scaler = StandardScaler()
x = scaler.fit_transform(x)

# define an MLP classifier (named so that it does not shadow the EdgeWise model)
classifier = MLPClassifier(
    hidden_layer_sizes=(6,),
    learning_rate_init=0.05,
    max_iter=2000,
    random_state=42,
)

# define a metric and evaluate with cross-validation
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scorer = make_scorer(accuracy_score, greater_is_better=True)
scores = cross_val_score(classifier, x, y, scoring=scorer, cv=cv, n_jobs=-1)

Storing a trained model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained UnsupervisedEdgeWise model to a specified file path:

model.export().file("<path>/<model_name>", key)

When storing models in a database, they are stored as a row inside a model store table. The following shows how to store a trained UnsupervisedEdgeWise model in a database, in a specific model store table:

model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database. We can load a pre-trained UnsupervisedEdgeWise model from a specified file path as follows:

model = analyst.load_unsupervised_edgewise_model("<path>/<model>", "key")

We can load a pre-trained UnsupervisedEdgeWise model from a model store table in a database as follows:

model = analyst.get_unsupervised_edgewise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Destroying a model

We can destroy a model as follows:

model.destroy()