UnsupervisedAnomalyDetectionGraphWise

Overview

UnsupervisedAnomalyDetectionGraphWise is an inductive vertex representation learning and anomaly detection algorithm that can leverage vertex and edge feature information. While it can be applied to a wide variety of tasks, it is particularly suitable for unsupervised learning of vertex embeddings for anomaly detection. Once the model is trained, it can infer anomaly scores or labels for unseen nodes. UnsupervisedAnomalyDetectionGraphWise is based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.

Model Structure

An UnsupervisedAnomalyDetectionGraphWise model consists of graph convolutional layers followed by an embedding layer. Two types of embedding layer are available: the DGI layer and the Dominant layer. Both layers perform inductive vertex representation learning but use different loss functions. The embedding layer defaults to the DGI layer.

The forward pass through a convolutional layer for a vertex proceeds as follows:

A set of neighbors of the vertex is sampled.
The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.
This concatenated vector is multiplied with weights, and a bias vector is added.
The result is normalized such that the layer output has unit norm.
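The four steps above can be sketched in plain Python for a single vertex. This is only an illustration of the described forward pass, not the PGX implementation: the function name `conv_layer_forward`, the dictionary-based graph, the order of concatenation, and the toy weights are all assumptions made for this example, and the activation function is omitted because the steps above do not mention it.

```python
import math
import random


def conv_layer_forward(vertex, neighbors, prev, weights, bias, num_samples, rng):
    """Illustrative convolution step for one vertex (hypothetical helper)."""
    # 1. Sample a set of neighbors (all of them if the vertex has only a few).
    nbrs = neighbors[vertex]
    sampled = rng.sample(nbrs, min(num_samples, len(nbrs)))

    # 2. Mean-aggregate the neighbors' previous-layer representations ...
    dim = len(prev[vertex])
    if sampled:
        agg = [sum(prev[n][i] for n in sampled) / len(sampled) for i in range(dim)]
    else:
        agg = [0.0] * dim
    # ... and concatenate them with the vertex's own previous-layer representation.
    concat = prev[vertex] + agg

    # 3. Multiply by the weight matrix and add the bias vector.
    out = [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weights, bias)]

    # 4. Normalize so that the layer output has unit L2 norm.
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out] if norm > 0 else out


# Toy graph: vertex 0 is connected to vertices 1 and 2.
rng = random.Random(0)
neighbors = {0: [1, 2], 1: [0], 2: [0]}
prev = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 3.0]}
weights = [[0.5, 0.1, -0.2, 0.3], [0.0, 0.4, 0.2, -0.1]]  # output_dim x (2 * input_dim)
bias = [0.1, -0.1]

h = conv_layer_forward(0, neighbors, prev, weights, bias, num_samples=2, rng=rng)
```

The weight matrix has twice as many columns as the input dimension because the vertex representation and the aggregated neighbor representation are concatenated before the linear transformation.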

The DGI layer, which is based on [Deep Graph Infomax (DGI) by Velickovic et al.](https://arxiv.org/pdf/1809.10341.pdf), consists of three parts that enable unsupervised learning using the embeddings produced by the convolutional layers.

Corruption function: Shuffles the node features while preserving the graph structure to produce negative embedding samples using the convolution layers.
Readout function: Computes the sigmoid-activated mean of the embeddings, which is used as a summary of the graph.
Discriminator: Measures the similarity of the positive (unshuffled) embeddings with the summary, as well as the similarity of the negative samples with the summary. The loss function is computed from these similarities.

Since none of these contains mutable hyperparameters, the default DGI layer is always used and cannot be adjusted.
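As a rough illustration of how the three parts fit together, the following plain-Python sketch follows the formulation in the DGI paper (a bilinear discriminator with a binary cross-entropy loss). It is not the PGX implementation. In particular, `encode` is a hypothetical element-wise stand-in for the convolutional layers, and the discriminator weights are fixed to the identity instead of being learned.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(features):
    """Hypothetical stand-in for the convolutional layers (NOT the PGX encoder)."""
    return [[math.tanh(x) for x in row] for row in features]


def corrupt(features, rng):
    """Corruption: shuffle feature rows across vertices; the graph is untouched."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def readout(embeddings):
    """Readout: sigmoid-activated mean of the embeddings (graph summary)."""
    n, dim = len(embeddings), len(embeddings[0])
    return [sigmoid(sum(e[i] for e in embeddings) / n) for i in range(dim)]


def discriminate(embedding, summary, w):
    """Discriminator: bilinear similarity sigmoid(h^T W s), as in the DGI paper."""
    ws = [sum(w_ij * s_j for w_ij, s_j in zip(row, summary)) for row in w]
    return sigmoid(sum(h_i * v_i for h_i, v_i in zip(embedding, ws)))


def dgi_loss(positive, negative, summary, w):
    """Binary cross-entropy: positives should score high, negatives low."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for h in positive:
        total -= math.log(discriminate(h, summary, w) + eps)
    for h in negative:
        total -= math.log(1.0 - discriminate(h, summary, w) + eps)
    return total / (len(positive) + len(negative))


rng = random.Random(42)
features = [[0.2, 1.0], [0.9, -0.3], [-0.5, 0.4], [1.5, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # discriminator weights, normally learned

positive = encode(features)
negative = encode(corrupt(features, rng))
summary = readout(positive)
loss = dgi_loss(positive, negative, summary, identity)
```

With real convolutional layers, shuffling the features changes what each vertex aggregates from its neighbors, so the negative embeddings differ from the positive ones. The element-wise stand-in used here only permutes them, which is enough to show the data flow but not to train anything meaningful.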

The Dominant layer enables unsupervised learning using a deep autoencoder. It uses graph convolutional networks (GCNs) to reconstruct the vertex features in an autoencoder setting, and it estimates the reconstructed graph structure from the dot products of the embeddings.

The loss function is computed from the feature reconstruction loss and the structure reconstruction loss. The importance given to features or to the structure can be tuned with the alpha hyperparameter.

The Dominant layer is based on [Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.](https://www.public.asu.edu/~kding9/pdf/SDM2019_Deep.pdf)
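The combined loss can be sketched as follows. This minimal example assumes the convex combination used in the Dominant paper, with alpha weighting the feature reconstruction error and (1 - alpha) weighting the structure reconstruction error; whether PGX uses exactly this form is not confirmed here. The function names and the toy matrices are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reconstruct_structure(z):
    """Estimate the adjacency matrix from dot products of the embeddings."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in z] for zi in z]


def squared_error(m, m_hat):
    """Squared Frobenius norm of the difference between two matrices."""
    return sum(
        (a - b) ** 2
        for row, row_hat in zip(m, m_hat)
        for a, b in zip(row, row_hat)
    )


def dominant_loss(x, x_hat, adj, z, alpha):
    """Assumed form: alpha * feature error + (1 - alpha) * structure error."""
    feature_loss = squared_error(x, x_hat)
    structure_loss = squared_error(adj, reconstruct_structure(z))
    return alpha * feature_loss + (1.0 - alpha) * structure_loss


# Toy example with two connected vertices.
x = [[1.0, 0.0], [0.0, 1.0]]      # original vertex features
x_hat = [[0.5, 0.0], [0.0, 1.0]]  # features reconstructed by the decoder
adj = [[0.0, 1.0], [1.0, 0.0]]    # original adjacency matrix
z = [[0.0, 0.0], [0.0, 0.0]]      # embeddings (all zeros for simplicity)

loss = dominant_loss(x, x_hat, adj, z, alpha=0.6)
```

In this toy setting the feature error is 0.25 and the structure error is 1.0, so raising alpha toward 1 makes the loss depend mostly on how well the features are reconstructed, while lowering it toward 0 emphasizes the structure.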

Functionalities

We describe here the usage of the main functionalities of our implementation of Dominant in PGX. The following example demonstrates a scenario where we want to detect fraudulent vertices based on their features.

Loading a graph

First, we create a session and an analyst:

session = pypgx.get_session()
analyst = session.analyst

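Because the model is trained without supervision, we do not need a separate test graph or test vertices. We load the graph, where cpath is assumed to hold the path to the graph configuration file: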

graph = session.read_graph_with_properties(cpath)

Building an UnsupervisedAnomalyDetectionGraphWise Model (minimal)

We build an UnsupervisedAnomalyDetectionGraphWise model using the minimal configuration and default hyperparameters. Note that even though only one feature property is specified in this example, you can specify arbitrarily many.

model = analyst.unsupervised_anomaly_detection_graphwise_builder(
    vertex_input_property_names=["features"]
)

Advanced hyperparameter customization

The implementation allows rich hyperparameter customization through two sub-config classes: GraphWiseConvLayerConfig and GraphWiseEmbeddingConfig. In the following, we build such a configuration and use it in a model. We specify a weight decay of 0.001 and a dropout probability of 0.5 to counteract overfitting. We also set the Dominant embedding layer's alpha value to 0.6 to slightly increase the importance of the feature reconstruction.

To enable or disable GPU usage, we can use the enable_accelerator parameter. This feature is enabled by default. However, if there is no GPU device and the CUDA toolkit is not installed, the feature is disabled and the CPU is used for all mllib operations.

# customize convolutional layer config
weight_property = analyst.pagerank(graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5,  # set dropout rate to prevent overfitting
)
conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)

# customize embedding layer config
dominant_config = dict(alpha=0.6)
dominant_layer = analyst.graphwise_dominant_layer_config(**dominant_config)
params = dict(
    conv_layer_config=[conv_layer],
    embedding_config=dominant_layer,
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    weight_decay=0.001,  # set weight decay to prevent overfitting
    enable_accelerator=True,  # enable or disable the GPU
)

model = analyst.unsupervised_anomaly_detection_graphwise_builder(**params)

For a full description of all available hyperparameters and their default values, see the pypgx.api.mllib.UnsupervisedAnomalyDetectionGraphWiseModel, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseDgiLayerConfig and pypgx.api.mllib.GraphWiseDominantLayerConfig docs.

Training the UnsupervisedAnomalyDetectionGraphWiseModel

We can train an UnsupervisedAnomalyDetectionGraphWiseModel on a graph:

model.fit(graph)

Getting the loss value

We can fetch the training loss value:

loss = model.get_training_loss()

Inferring embeddings

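We can use a trained model to infer embeddings for unseen nodes and store them in a CSV file. Here full_graph is assumed to be a loaded graph that contains the vertices to run inference on:

vertex_vectors = model.infer_embeddings(full_graph, full_graph.get_vertices()).flatten_all()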
vertex_vectors.store("<path>/vertex_vectors.csv", file_format="csv", overwrite=True)

Without flattening, the schema of vertex_vectors would be as follows (flatten_all() splits the vector column into separate double-valued columns):

vertexId | embedding

Inferring anomalies

We can use a trained model to infer anomaly scores or labels for unseen nodes and store them in a CSV file:

vertex_scores = model.infer_anomaly_scores(full_graph, full_graph.get_vertices()).flatten_all()
vertex_scores.store("<path>/vertex_scores.csv", file_format="csv", overwrite=True)

If we know the contamination factor of our data (the expected fraction of anomalous vertices), we can use it to find a good threshold:

contamination_factor = 0.2
threshold = model.find_anomaly_threshold(full_graph, full_graph.get_vertices(), contamination_factor)
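To give an intuition for how a contamination factor can translate into a threshold, the sketch below picks the score of the last vertex that falls inside the top fraction of scores. This is only one plausible reading: it assumes that higher scores mean "more anomalous", and it does not claim to reproduce what find_anomaly_threshold does internally. The helper names and the score values are made up for this example.

```python
import math


def threshold_from_contamination(scores, contamination_factor):
    """Return the score such that roughly the given fraction of scores is at or above it."""
    if not 0.0 < contamination_factor < 1.0:
        raise ValueError("contamination_factor must be strictly between 0 and 1")
    ranked = sorted(scores, reverse=True)
    # Number of vertices we expect to be anomalous (at least one).
    num_anomalies = max(1, math.ceil(contamination_factor * len(ranked)))
    return ranked[num_anomalies - 1]


def label_anomalies(scores, threshold):
    """Label a vertex as anomalous when its score reaches the threshold."""
    return [score >= threshold for score in scores]


# Hypothetical anomaly scores for ten vertices.
scores = [0.1, 0.2, 0.15, 0.95, 0.3, 0.25, 0.05, 0.9, 0.12, 0.4]

example_threshold = threshold_from_contamination(scores, contamination_factor=0.2)
labels = label_anomalies(scores, example_threshold)
```

With a contamination factor of 0.2 and ten scores, two vertices end up at or above the threshold, which matches the expected share of anomalies.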

If we already have a threshold value, we can infer labels directly:

threshold = 0.9
vertex_labels = model.infer_anomaly_labels(full_graph, full_graph.get_vertices(), threshold).flatten_all()
vertex_labels.store("<path>/vertex_labels.csv", file_format="csv", overwrite=True)

Storing a trained model

Models can be stored either to the server file system or to a database.

The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model to a specified file path:

model.export().file("<path>/<model_name>", key)

When models are stored in a database, each model is stored as a row inside a model store table. The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model in a specific model store table of a database:

model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Loading a pre-trained model

As with storing, models can be loaded from a file in the server file system or from a database.

We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a specified file path as follows:

model = analyst.load_unsupervised_anomaly_detection_graphwise_model(
    "<path>/<model>",
    key
)

We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a model store table in a database as follows:

model = analyst.get_unsupervised_anomaly_detection_graphwise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)

Destroying a model

We can destroy a model as follows:

model.destroy()