UnsupervisedAnomalyDetectionGraphWise

Overview

UnsupervisedAnomalyDetectionGraphWise is an inductive vertex representation learning algorithm that can leverage vertex feature information. It can be applied to a wide variety of tasks, including unsupervised learning of vertex embeddings for vertex classification. UnsupervisedAnomalyDetectionGraphWise is based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.

Model Structure

An UnsupervisedAnomalyDetectionGraphWise model consists of graph convolutional layers followed by an embedding layer, which defaults to a DGI layer. The forward pass through a convolutional layer for a vertex proceeds as follows:

  1. A set of neighbors of the vertex is sampled.

  2. The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.

  3. This concatenated vector is multiplied with weights, and a bias vector is added.

  4. The result is normalized such that the layer output has unit norm.
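The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PGX implementation: the function name conv_layer_forward, the dictionary-based graph representation, the concatenation order, and the omission of an activation function are all choices made for this example.

```python
import math
import random


def conv_layer_forward(vertex, features, neighbors, weights, bias, num_samples, rng):
    """One convolution step for a single vertex (illustrative sketch only)."""
    # 1. Sample a set of neighbors (all of them if fewer than num_samples exist).
    candidates = neighbors[vertex]
    sampled = rng.sample(candidates, min(num_samples, len(candidates)))

    # 2. Mean-aggregate the neighbor representations, then concatenate the
    #    result with the vertex's own previous-layer representation.
    dim = len(features[vertex])
    if sampled:
        aggregated = [sum(features[n][d] for n in sampled) / len(sampled) for d in range(dim)]
    else:
        aggregated = [0.0] * dim
    concatenated = list(features[vertex]) + aggregated

    # 3. Multiply with the weight matrix (one row per output unit) and add the bias.
    output = [sum(w * x for w, x in zip(row, concatenated)) + b for row, b in zip(weights, bias)]

    # 4. Normalize so that the layer output has unit L2 norm.
    norm = math.sqrt(sum(v * v for v in output))
    return [v / norm for v in output] if norm > 0 else output


# Toy graph: vertex 0 has neighbors 1 and 2; all values are made up.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
bias = [0.1, 0.1]
out = conv_layer_forward(0, features, neighbors, weights, bias, num_samples=2, rng=random.Random(0))
```

The output has as many components as the weight matrix has rows, and its L2 norm is 1 regardless of the scale of the inputs.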

The DGI layer consists of three parts that enable unsupervised learning using the embeddings produced by the convolution layers.

  1. Corruption function: Shuffles the node features while preserving the graph structure to produce negative embedding samples using the convolution layers.

  2. Readout function: Sigmoid-activated mean of the embeddings, used as a summary of the graph.

  3. Discriminator: Measures the similarity of the positive (unshuffled) embeddings with the summary, as well as the similarity of the negative samples with the summary. The loss function is computed from these similarities.
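The three parts can be illustrated with a small pure-Python sketch. It is not the PGX implementation: the bilinear discriminator and the binary cross-entropy loss follow the original Deep Graph Infomax formulation and are assumed here, and encode is a hypothetical stand-in for the convolution layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def corrupt(features, rng):
    """Corruption: shuffle the feature rows across vertices; edges are untouched."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def readout(embeddings):
    """Readout: sigmoid-activated mean of all embeddings (the graph summary)."""
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sigmoid(sum(e[d] for e in embeddings) / count) for d in range(dim)]


def discriminate(embedding, summary, bilinear):
    """Discriminator: bilinear similarity of an embedding and the summary, in (0, 1)."""
    projected = [sum(w * s for w, s in zip(row, summary)) for row in bilinear]
    return sigmoid(sum(e * p for e, p in zip(embedding, projected)))


def dgi_loss(positive, negative, bilinear):
    """Binary cross-entropy: positives should score high, negatives low."""
    summary = readout(positive)
    pos = sum(math.log(discriminate(e, summary, bilinear)) for e in positive)
    neg = sum(math.log(1.0 - discriminate(e, summary, bilinear)) for e in negative)
    return -(pos + neg) / (len(positive) + len(negative))


def encode(features):
    # Hypothetical placeholder for the convolution layers. A real encoder mixes
    # in neighbor features, so shuffled inputs would yield different embeddings.
    return [list(row) for row in features]


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
positive = encode(features)
negative = encode(corrupt(features, rng))
bilinear = [[1.0, 0.0], [0.0, 1.0]]  # made-up discriminator weights
summary = readout(positive)
loss = dgi_loss(positive, negative, bilinear)
```

Because corruption only permutes feature rows, the negative samples contain exactly the same feature vectors as the input, just attached to different vertices.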

Since none of these contains mutable hyperparameters, the default DGI layer is always used and cannot be adjusted.

The second embedding layer available is the Dominant Layer, based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.

Dominant is a model that detects anomalies based on the vertex features and the neighborhood structure. It uses GCNs in an autoencoder setting to reconstruct the features, and reconstructs the structure mask from the dot products of the embeddings.

The loss function is computed from the feature reconstruction loss and the structure reconstruction loss. The importance given to features or to the structure can be tuned with the alpha hyperparameter.
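To make the role of alpha concrete, the following sketch combines the two reconstruction errors into a per-vertex score. It follows the weighting used in the Dominant paper (alpha on the feature term, 1 - alpha on the structure term); whether PGX uses exactly this form is an assumption, and the function name dominant_scores and all input values are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dominant_scores(features, adjacency, reconstructed_features, embeddings, alpha):
    """Per-vertex score: alpha * feature error + (1 - alpha) * structure error."""
    n = len(features)
    scores = []
    for i in range(n):
        # Feature reconstruction error: L2 distance between input and decoded features.
        feature_error = math.sqrt(
            sum((x - r) ** 2 for x, r in zip(features[i], reconstructed_features[i]))
        )
        # Structure reconstruction: sigmoid of the embedding dot products.
        reconstructed_row = [
            sigmoid(sum(a * b for a, b in zip(embeddings[i], embeddings[j])))
            for j in range(n)
        ]
        structure_error = math.sqrt(
            sum((a - r) ** 2 for a, r in zip(adjacency[i], reconstructed_row))
        )
        scores.append(alpha * feature_error + (1.0 - alpha) * structure_error)
    return scores


# Toy data: vertex 2 is isolated and its features are reconstructed badly.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reconstructed_features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
adjacency = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # self-loops included
embeddings = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]

scores = dominant_scores(features, adjacency, reconstructed_features, embeddings, alpha=0.6)
feature_only = dominant_scores(features, adjacency, reconstructed_features, embeddings, alpha=1.0)
```

With alpha=1.0 the structure term vanishes and the score reduces to the feature reconstruction error alone; values closer to 0 shift the emphasis to the structure.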

Functionalities

We describe here the usage of the main functionalities of our implementation of Dominant in PGX.

Loading a graph

First, we create a session and an analyst:

import pypgx

session = pypgx.get_session()
analyst = session.analyst

Since we train the model in an unsupervised manner, we do not need a test graph or test vertices:

graph = session.read_graph_with_properties(cpath)

Building an UnsupervisedAnomalyDetectionGraphWise Model (minimal)

We build an UnsupervisedAnomalyDetectionGraphWise model using the minimal configuration and default hyperparameters. Note that even though only one feature property is specified in this example, you can specify arbitrarily many.

model = analyst.unsupervised_anomaly_detection_graphwise_builder(
    vertex_input_property_names=["features"]
)

Advanced hyperparameter customization

The implementation allows for rich hyperparameter customization. This is done through two sub-config classes: GraphWiseConvLayerConfig and GraphWiseEmbeddingConfig. In the following, we build such a configuration and use it in a model. We specify a weight decay of 0.001 and dropout with a dropping probability of 0.5 to counteract overfitting. We also set the Dominant embedding layer's alpha value to 0.6 to slightly increase the importance of the feature reconstruction.

# Use PageRank values as neighbor weights (computed on the graph loaded above)
weight_property = analyst.pagerank(graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5
)
conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)
# Dominant embedding layer; alpha > 0.5 favors the feature reconstruction
dominant_config = dict(alpha=0.6)
dominant_layer = analyst.graphwise_dominant_layer_config(**dominant_config)
params = dict(
    conv_layer_config=[conv_layer],
    embedding_config=dominant_layer,
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    weight_decay=0.001
)

model = analyst.unsupervised_anomaly_detection_graphwise_builder(**params)

For a full description of all available hyperparameters and their default values, see the pypgx.api.mllib.UnsupervisedAnomalyDetectionGraphWiseModel, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseDgiLayerConfig and pypgx.api.mllib.GraphWiseDominantLayerConfig docs.

Training the UnsupervisedAnomalyDetectionGraphWiseModel

We can train an UnsupervisedAnomalyDetectionGraphWiseModel on a graph:

model.fit(graph)

Getting the loss value

We can fetch the training loss value:

loss = model.get_training_loss()

Inferring embeddings

We can use a trained model to infer embeddings for unseen nodes and store them in a CSV file:

vertex_vectors = model.infer_embeddings(full_graph, full_graph.get_vertices()).flatten_all()
vertex_vectors.store("<path>/vertex_vectors.csv", file_format="csv", overwrite=True)

The schema of vertex_vectors without flattening is as follows (flatten_all() splits the vector column into separate double-valued columns):

vertexId | embedding
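The effect of flattening can be pictured with a small stand-alone sketch that works on plain Python dictionaries rather than on a PGX frame. The helper flatten_vector_columns and the generated column names (embedding_0, embedding_1, ...) are hypothetical; the names PGX actually produces may differ.

```python
def flatten_vector_columns(rows):
    """Split every list-valued column into one float column per component."""
    flat_rows = []
    for row in rows:
        flat = {}
        for name, value in row.items():
            if isinstance(value, (list, tuple)):
                # One output column per vector component (naming is an assumption).
                for i, component in enumerate(value):
                    flat[f"{name}_{i}"] = float(component)
            else:
                flat[name] = value
        flat_rows.append(flat)
    return flat_rows


# Made-up rows following the unflattened schema: vertexId | embedding
rows = [
    {"vertexId": 1, "embedding": [0.25, -0.5]},
    {"vertexId": 2, "embedding": [0.75, 0.5]},
]
flat = flatten_vector_columns(rows)
```

Each flattened row keeps the vertexId column and gains one scalar column per embedding dimension, which is the shape a CSV file needs.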

Inferring anomalies

We can use a trained model to infer anomaly scores or labels for unseen nodes and store them in a CSV file:

vertex_scores = model.infer_anomaly_scores(full_graph, full_graph.get_vertices()).flatten_all()
vertex_scores.store("<path>/vertex_scores.csv", file_format="csv", overwrite=True)

If we know the contamination factor of our data (the expected fraction of anomalous vertices), we can use it to find a good threshold:

contamination_factor = 0.2
threshold = model.find_anomaly_threshold(full_graph, full_graph.get_vertices(), contamination_factor)
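One common way to turn a contamination factor into a threshold is to pick the score above which roughly that fraction of vertices falls. The sketch below shows this idea on a plain list of scores; it is an assumption about the general technique, not a description of how find_anomaly_threshold is implemented in PGX, and the helper name and score values are invented.

```python
def threshold_from_contamination(scores, contamination_factor):
    """Return the lowest score among the top `contamination_factor` fraction."""
    if not 0.0 < contamination_factor < 1.0:
        raise ValueError("contamination_factor must be strictly between 0 and 1")
    ordered = sorted(scores)
    # Number of vertices expected to be anomalous (at least one).
    num_anomalies = max(1, round(contamination_factor * len(ordered)))
    return ordered[len(ordered) - num_anomalies]


# Made-up anomaly scores for ten vertices.
scores = [0.1, 0.2, 0.15, 0.9, 0.3, 0.25, 0.05, 0.8, 0.12, 0.22]
threshold = threshold_from_contamination(scores, contamination_factor=0.2)

# Vertices scoring at or above the threshold are flagged as anomalous.
is_anomaly = [score >= threshold for score in scores]
```

With ten scores and a contamination factor of 0.2, the two highest-scoring vertices are flagged.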

If we have a threshold value, we can directly infer labels:

threshold = 0.9
vertex_labels = model.infer_anomaly_labels(full_graph, full_graph.get_vertices(), threshold).flatten_all()
vertex_labels.store("<path>/vertex_labels.csv", file_format="csv", overwrite=True)

Storing a trained model

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model to a specified file path:

model.export().file("<path>/<model_name>", key)

When models are stored in a database, each model is stored as a row inside a model store table. The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model in a specific model store table in a database:

model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Loading a pre-trained model

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a specified file path as follows:

model = analyst.load_unsupervised_anomaly_detection_graphwise_model(
    "<path>/<model>",
    key
)

We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a model store table in a database as follows:

model = analyst.get_unsupervised_anomaly_detection_graphwise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Destroying a model

We can destroy a model as follows:

model.destroy()