SupervisedEdgeWise

Overview

SupervisedEdgeWise is an inductive edge representation learning algorithm that can leverage vertex and edge feature information. It can be applied to a wide variety of tasks, including edge classification and link prediction.

SupervisedEdgeWise builds on the GraphWise model, combining the source and destination vertex embeddings generated by GraphWise to produce inductive edge embeddings.

Model Structure

First, the source and destination vertices of the target edge are processed through the convolutional layers. The forward pass through a convolutional layer for a vertex proceeds as follows (a minimal sketch follows the list):

  1. A set of neighbors of the vertex is sampled.

  2. The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.

  3. This concatenated vector is multiplied with weights, and a bias vector is added.

  4. The result is normalized such that the layer output has unit norm.
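
To make the four steps concrete, here is a minimal NumPy sketch for a single vertex; the names conv_layer_forward, h_v, neighbor_hs, W and b are hypothetical and not part of the PGX API.

import numpy as np

def conv_layer_forward(h_v, neighbor_hs, W, b):
    # steps 1-2: mean-aggregate the sampled neighbors' previous-layer
    # representations and concatenate with the vertex's own representation
    aggregated = neighbor_hs.mean(axis=0)
    combined = np.concatenate([h_v, aggregated])
    # step 3: multiply with the layer weights and add the bias vector
    out = W @ combined + b
    # step 4: normalize so that the layer output has unit norm
    return out / np.linalg.norm(out)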

The edge combination layer concatenates the source vertex embedding, the edge features, and the destination vertex embedding, and forwards the result through a linear layer to produce the edge embedding.
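
Continuing the same hypothetical NumPy sketch, the default concatenation-based combination could look as follows; W_edge and b_edge are the linear layer's parameters (names are illustrative only).

def edge_combination_forward(h_src, x_e, h_dst, W_edge, b_edge):
    # concatenate source embedding, edge features and destination embedding,
    # then apply a linear layer to obtain the edge embedding
    z = np.concatenate([h_src, x_e, h_dst])
    return W_edge @ z + b_edge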

The prediction layers are standard neural network layers.

Functionalities

We describe here the usage of the main functionalities of SupervisedEdgeWise in PGX, using the Movielens graph as an example.

Loading a graph

First, we create a session and an analyst:

import pypgx

session = pypgx.get_session()
analyst = session.analyst
from pypgx.api.filters import EdgeFilter

full_graph = session.read_graph_with_properties(cpath)
edge_filter = EdgeFilter.from_pgql_result_set(
    session.query_pgql("SELECT e FROM movielens MATCH (v1) -[e]-> (v2) WHERE ID(e) % 4 > 0"), "e"
)
train_graph = full_graph.filter(edge_filter)

# collect the held-out edges: those in the full graph but not in the training graph
test_edges = []
train_edges = train_graph.get_edges()
for e in full_graph.get_edges():
    if not train_edges.contains(e):
        test_edges.append(e)

Example: predicting ratings on the Movielens Dataset

We describe here the usage of SupervisedEdgeWise in PGX using the Movielens graph as an example.

This data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies, with simple demographic info for the users (age, gender, occupation) and movies (year, avg_rating, genre).

Users and movies are vertices, while users' ratings of movies are edges with a rating feature. We will use EdgeWise to predict the ratings.

We first build the model and fit it on the train_graph:

from pypgx.api.mllib import MSELoss

conv_layer_config = dict(num_sampled_neighbors=10)
conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)

pred_layer_config = dict(hidden_dim=16)
pred_layer = analyst.graphwise_pred_layer_config(**pred_layer_config)

params = dict(edge_target_property_name="labels",
            conv_layer_config=[conv_layer],
            pred_layer_config=[pred_layer],
            vertex_input_property_names=["movie_year", "avg_rating", "movie_genres",
                "user_occupation_label", "user_gender", "raw_user_age"],
            edge_input_property_names=["user_rating"],
            num_epochs=10,
            layer_size=32,
            learning_rate=0.003,
            normalize=True,
            loss_fn=MSELoss(),
            seed=0)

model = analyst.supervised_edgewise_builder(**params)

model.fit(train_graph)

Since EdgeWise is inductive, we can infer the ratings for unseen edges:

labels = model.infer_labels(full_graph, test_edges)
labels.print()

This returns the predicted rating for each edge, for example:

edgeId    value
68472     3.844510078430176
53436     3.5453758239746094
73364     3.688265085220337
12096     3.8873679637908936
78740     3.3845553398132324
27664     2.6601722240448
34844     4.108948230743408
74224     3.7714107036590576
33744     3.2331383228302
32812     3.8763082027435303

We can also evaluate the performance of the model:

model.evaluate(full_graph, test_edges).print()

This returns:

MSE
0.9573243436116953

Building an EdgeWise Model (minimal)

We build an EdgeWise model using the minimal configuration and default hyperparameters. Note that even though only one feature property is needed (either on vertices via vertex_input_property_names or on edges via edge_input_property_names) for the model to work, you can specify arbitrarily many.

params = dict(
    edge_target_property_name="label",
    vertex_input_property_names=["features"],
    edge_input_property_names=["edge_features"]
)

model = analyst.supervised_edgewise_builder(**params)

Advanced hyperparameter customization

The implementation allows for very rich hyperparameter customization. Internally, GraphWise aggregates the representations of each vertex's neighbors; this operation can be configured through a sub-config class, either GraphWiseConvLayerConfig or GraphWiseAttentionLayerConfig.

The prediction layer config is implemented through the pypgx.api.mllib.GraphWisePredictionLayerConfig class. In the following, we build such configurations and use them in a model. We specify a weight decay of 0.001 and dropout with dropping probability 0.5 to counteract overfitting.

To enable or disable the GPU, we can use the enable_accelerator parameter. By default this feature is enabled; however, if there is no GPU device or the CUDA toolkit is not installed, the feature is disabled and the CPU is used for all mllib operations.

weight_property = analyst.pagerank(train_graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5
)

conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)
pred_layer_config = dict(
    hidden_dim=32,
    activation_fn='relu',
    weight_init_scheme='he',
    dropout_rate=0.5
)

pred_layer = analyst.graphwise_pred_layer_config(**pred_layer_config)
params = dict(
    edge_target_property_name="labels",
    conv_layer_config=[conv_layer],
    pred_layer_config=[pred_layer],
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    seed=17,
    weight_decay=0.001,
    enable_accelerator=True  # enable or disable GPU
)

model = analyst.supervised_edgewise_builder(**params)

The above code uses GraphWiseConvLayerConfig for the convolutional layer configuration. It can be replaced with GraphWiseAttentionLayerConfig if a graph attention network model is desired. If the number of sampled neighbors is set to -1 via num_sampled_neighbors, all neighboring vertices will be sampled.

conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='leaky_relu',
    weight_init_scheme='xavier_uniform',
    num_heads=4,
    dropout_rate=0.5
)

conv_layer = analyst.graphwise_attention_layer_config(**conv_layer_config)

For a full description of all available hyperparameters and their default values, see the pypgx.api.mllib.SupervisedEdgeWiseModelBuilder, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseAttentionLayerConfig and pypgx.api.mllib.GraphWisePredictionLayerConfig docs.

Property types supported

The model supports two types of properties for both vertices and edges:

  • continuous properties (boolean, double, float, integer, long)

  • categorical properties (string)

For categorical properties, two categorical configurations are possible:

  • one-hot-encoding: each category is mapped to a vector that is concatenated to the other features (default)

  • embedding table: each category is mapped to an embedding that is concatenated to the other features and trained along with the model

One-hot-encoding converts each category into an independent vector, so it is suitable when each category should be treated as an equally independent group. For instance, if the categories range from A to E and the letters carry no intrinsic meaning, one-hot-encoding can be a good fit.

An embedding table is recommended when the semantics of the property matter and certain categories should be closer to each other than to others. For example, assume there is a "day" property with values ranging from Monday to Sunday, and we want to preserve the intuition that "Tuesday" is closer to "Wednesday" than to "Saturday". By choosing the embedding table configuration, we let the vectors that represent the categories be learned during training, so that the vector for "Tuesday" ends up close to that of "Wednesday".

Although the embedding table approach has the advantage over one-hot-encoding that it can learn more suitable vectors to represent each category, it also requires a good amount of data to train the embedding table properly. The one-hot-encoding approach may be the better choice for use cases with limited training data.

When using the embedding table, users can set the out-of-vocabulary probability. With this probability, the embedding is randomly replaced by the out-of-vocabulary embedding during training, making the model more robust to categories unseen during inference.

vertex_input_property_configs = [
    analyst.one_hot_encoding_categorical_property_config(
        property_name="vertex_str_feature_1",
        max_vocabulary_size=100,
    ),
    analyst.learned_embedding_categorical_property_config(
        property_name="vertex_str_feature_2",
        embedding_dim=4,
        shared=False,  # whether to share the vocabulary when several vertex types have a property with the same name
        oov_probability=0.001  # probability of replacing the embedding with the out-of-vocabulary embedding
    ),
]

model_params = dict(
    vertex_input_property_names=[
        "vertex_int_feature_1",  # continuous feature
        "vertex_str_feature_1",  # string feature using one-hot-encoding
        "vertex_str_feature_2",  # string feature using embedding table
        "vertex_str_feature_3",  # string feature using one-hot-encoding (default)
    ],
    vertex_input_property_configs=vertex_input_property_configs,
    edge_target_property_name="labels",
)

model = analyst.supervised_edgewise_builder(**model_params)

Classification vs Regression models

Whatever the type of the property you are trying to predict, the default task the model addresses is classification: even if the property is a number, the model assigns one label to each distinct value and classifies on it.

In some cases, you may prefer to infer continuous values for your property when it is an integer or a float. This is called regression mode; to enable it, you need to set the MSE loss function.

It is possible to select different loss functions for the supervised model by providing a LossFunction object.

from pypgx.api.mllib import MSELoss

params = dict(edge_target_property_name="labels",
            vertex_input_property_names=["vertex_features"],
            edge_input_property_names=["edge_features"],
            loss_fn=MSELoss())

model = analyst.supervised_edgewise_builder(**params)

Setting a custom Loss Function and Batch Generator (for Anomaly Detection)

In addition to different loss functions, it is also possible to select different batch generators by providing a batch generator type. This is useful for applications such as Anomaly Detection, which can be cast into the standard supervised framework but require different loss functions and batch generators.

The SupervisedEdgeWise model can use the DevNetLoss and the StratifiedOversamplingBatchGenerator. The DevNetLoss takes two parameters: the confidence margin and the value that an anomaly takes in the target property. In the following example, we assume conv_layer has already been defined:

from pypgx.api.mllib import DevNetLoss

pred_layer_config = dict(
    hidden_dim=32,
    activation_fn='linear'
)

pred_layer = analyst.graphwise_pred_layer_config(**pred_layer_config)
params = dict(
    edge_target_property_name="labels",
    conv_layer_config=[conv_layer],
    pred_layer_config=[pred_layer],
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    loss_fn=DevNetLoss(5.0, True),
    batch_gen='stratified_oversampling',
    seed=17
)

model = analyst.supervised_edgewise_builder(**params)

Setting the edge embedding production method

The edge embedding is computed by default by combining the source vertex embedding, the destination vertex embedding and the edge features. You can manually set which of them are used by setting the EdgeCombinationMethod:

from pypgx.api.mllib import ConcatEdgeCombinationMethod

method_config = dict(
    use_source_vertex=True,
    use_destination_vertex=False,
    use_edge=True
)

method = ConcatEdgeCombinationMethod(**method_config)

params = dict(
    edge_target_property_name="labels",
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    edge_combination_method=method,
    seed=17
)

model = analyst.supervised_edgewise_builder(**params)

The supported methods are concatenation (ConcatEdgeCombinationMethod) and point-wise product (ProductEdgeCombinationMethod).
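
For illustration, switching to the point-wise product method might look as follows; this sketch assumes ProductEdgeCombinationMethod accepts the same keyword arguments as ConcatEdgeCombinationMethod above.

from pypgx.api.mllib import ProductEdgeCombinationMethod

# point-wise product of the selected embeddings (assumed to take the same
# keyword arguments as ConcatEdgeCombinationMethod)
method = ProductEdgeCombinationMethod(
    use_source_vertex=True,
    use_destination_vertex=True,
    use_edge=True
)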

Training the SupervisedEdgeWiseModel

We can train a SupervisedEdgeWiseModel on a graph:

model.fit(train_graph)

Getting Loss value

We can fetch the training loss value:

loss = model.get_training_loss()

Inferring edge labels

We can infer the labels for edges on any graph (including edges or graphs that were not seen during training):

labels = model.infer(full_graph, test_edges)
labels.print()

If the model is a classification model, it is also possible to set the decision threshold applied to the logits by adding it as an extra parameter (the default is 0):

labels = model.infer(
    full_graph,
    full_graph.get_edges(),
    6
)
labels.print()

The output will be similar to the following example output:

edgeId    value
68472     2.2346956729888916
53436     2.1515913009643555
73364     1.9499346017837524
12096     2.1704165935516357
78740     2.1174447536468506
27664     2.1041007041931152
34844     2.148571491241455
74224     2.089123010635376
33744     2.0866644382476807
32812     2.0604987144470215

In a similar fashion, if the task is a classification task, you can get the model confidence for each class by inferring the prediction logits:

logits = model.infer_logits(full_graph, test_edges)
logits.print()

If the model is a classification model, the infer_labels method is also available and equivalent to infer.
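
For example, the earlier inference call could equivalently be written as:

labels = model.infer_labels(full_graph, test_edges)
labels.print()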

Evaluating model performance

evaluate() is a convenience method to evaluate various metrics for the model:

model.evaluate(full_graph, test_edges).print()

Similar to inferring labels, if the task is a classification task, we can add the decision threshold as an extra parameter:

model.evaluate(full_graph, test_edges, 6).print()

The output will be similar to the following examples. For a classification model:

Accuracy    Precision    Recall    F1-Score
0.8488      0.8523       0.831     0.8367

For a regression model:

MSE
0.9573243436116953

If the model is a classification model, the evaluate_labels method is also available and equivalent to evaluate.
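
For example:

model.evaluate_labels(full_graph, test_edges).print()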

Inferring embeddings

We can use a trained model to infer embeddings for unseen edges and store in a CSV file:

edge_vectors = model.infer_embeddings(full_graph, test_edges).flatten_all()
edge_vectors.store(file_format="csv", path="<path>/edge_vectors.csv", overwrite=True)

Without flattening, the schema of edge_vectors would be as follows (flatten_all splits the vector column into separate double-valued columns):

edgeId    embedding

Storing a trained model

Models can be stored either to the server file system or to a database.

The following shows how to store a trained SupervisedEdgeWise model to a specified file path:

model.export().file("<path>/<model_name>", "key")

When stored in a database, models are saved as a row inside a model store table. The following shows how to store a trained SupervisedEdgeWise model in a specific model store table in the database:

model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database. We can load a pre-trained SupervisedEdgeWise model from a specified file path as follows:

model = analyst.load_supervised_edgewise_model("<path>/<model>", "key")

We can load a pre-trained SupervisedEdgeWise model from a model store table in database as follows:

model = analyst.get_supervised_edgewise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Destroying a model

We can destroy a model as follows:

model.destroy()