UnsupervisedAnomalyDetectionGraphWise
Overview
UnsupervisedAnomalyDetectionGraphWise is an inductive vertex representation learning and anomaly detection algorithm
that can leverage vertex and edge feature information. While it can be applied to a wide variety of tasks, it is particularly
suitable for unsupervised learning of vertex embeddings for anomaly detection. After training the model, it is possible to infer
anomaly scores or labels for unseen nodes.
UnsupervisedAnomalyDetectionGraphWise is based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.
Model Structure
An UnsupervisedAnomalyDetectionGraphWise model consists of graph convolutional layers followed by an embedding layer.
Two types of embedding layer are available: the DGI layer and the Dominant layer. Both perform inductive vertex representation learning,
but they use different loss functions. The embedding layer defaults to the DGI layer.
The forward pass through a convolutional layer for a vertex proceeds as follows:
A set of neighbors of the vertex is sampled.
The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.
This concatenated vector is multiplied with weights, and a bias vector is added.
The result is normalized such that the layer output has unit norm.
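The four steps above can be sketched in plain Python for a single vertex. This is a minimal illustration, not the PGX implementation: the function name, the data layout, the concatenation order, and the zero vector used for vertices without neighbors are all assumptions, and details such as the activation function and dropout are left out.

```python
import math
import random


def conv_layer_forward(vertex, prev_repr, neighbors, weights, bias,
                       num_sampled_neighbors, rng):
    """Forward pass of one convolutional layer for one vertex (illustrative).

    prev_repr: dict vertex -> list[float], previous layer representations
    neighbors: dict vertex -> list of neighbor vertices
    weights:   out_dim x (2 * in_dim) matrix as a list of rows
    bias:      list[float] of length out_dim
    """
    in_dim = len(prev_repr[vertex])

    # 1. Sample a set of neighbors (all of them if there are too few).
    candidates = neighbors.get(vertex, [])
    k = min(num_sampled_neighbors, len(candidates))
    sampled = rng.sample(candidates, k)

    # 2. Mean-aggregate the neighbor representations and concatenate the
    #    result with the vertex's own previous representation.
    if sampled:
        aggregated = [sum(prev_repr[n][i] for n in sampled) / len(sampled)
                      for i in range(in_dim)]
    else:
        aggregated = [0.0] * in_dim  # assumption: no neighbors -> zero vector
    concatenated = prev_repr[vertex] + aggregated

    # 3. Multiply by the weight matrix and add the bias vector.
    output = [sum(w * x for w, x in zip(row, concatenated)) + b
              for row, b in zip(weights, bias)]

    # 4. Normalize so that the layer output has unit L2 norm.
    norm = math.sqrt(sum(v * v for v in output))
    return [v / norm for v in output] if norm > 0 else output


# Toy data (made up for illustration).
prev_repr = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [2.0, 2.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
weights = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2], [0.3, 0.3, 0.3, 0.3]]
bias = [0.1, 0.0, -0.1]

h_a = conv_layer_forward("a", prev_repr, neighbors, weights, bias,
                         num_sampled_neighbors=2, rng=random.Random(0))
print(h_a)
```

Because of the final normalization step, the returned vector always has length one regardless of the scale of the input features.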
The DGI layer, which is based on [Deep Graph Infomax (DGI) by Velickovic et al.](https://arxiv.org/pdf/1809.10341.pdf), consists of three parts that enable unsupervised learning using the embeddings produced by the convolutional layers:
Corruption function: Shuffles the node features while preserving the graph structure, so that the convolutional layers produce negative embedding samples.
Readout function: Sigmoid-activated mean of the embeddings, used as a summary of the graph.
Discriminator: Measures the similarity of the positive (unshuffled) embeddings with the summary, as well as the similarity of the negative samples with the summary. The loss function is computed from these similarities.
Since none of these parts has tunable hyperparameters, the DGI layer is always used in its default form and cannot be adjusted.
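To make the three DGI parts concrete, here is a small plain-Python sketch. It is only an illustration under assumptions: the `encode` function is a hypothetical stand-in for the convolutional layers, the bilinear discriminator and the binary cross-entropy loss follow the DGI paper, and none of the names correspond to PGX APIs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(features, neighbors):
    """Hypothetical stand-in encoder: average each vertex with its neighbors."""
    embeddings = []
    for v, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[v]]
        embeddings.append([sum(col) / len(group) for col in zip(*group)])
    return embeddings


def corrupt(features, rng):
    """Corruption function: shuffle feature rows, keep the graph structure."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def readout(embeddings):
    """Readout function: sigmoid-activated mean of all embeddings."""
    n = len(embeddings)
    return [sigmoid(sum(col) / n) for col in zip(*embeddings)]


def discriminate(embedding, summary, w):
    """Discriminator: bilinear similarity sigmoid(h^T W s), as in the DGI paper."""
    ws = [sum(w_ij * s_j for w_ij, s_j in zip(row, summary)) for row in w]
    return sigmoid(sum(h_i * ws_i for h_i, ws_i in zip(embedding, ws)))


def dgi_loss(features, neighbors, w, rng):
    positive = encode(features, neighbors)
    negative = encode(corrupt(features, rng), neighbors)
    summary = readout(positive)
    # Binary cross-entropy: positives should score 1, negatives 0.
    loss = 0.0
    for h in positive:
        loss -= math.log(discriminate(h, summary, w))
    for h in negative:
        loss -= math.log(1.0 - discriminate(h, summary, w))
    return loss / (len(positive) + len(negative))


# Toy data (made up for illustration): 4 vertices, 2 features each.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
neighbors = [[1], [0], [3], [2]]
w = [[1.0, 0.0], [0.0, 1.0]]  # discriminator weights, fixed here instead of learned

summary = readout(encode(features, neighbors))
loss = dgi_loss(features, neighbors, w, random.Random(7))
print(summary, loss)
```

In training, the encoder and discriminator weights would be updated to lower this loss, which pushes the real embeddings towards the graph summary and the corrupted ones away from it.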
The Dominant layer enables unsupervised learning using a deep autoencoder. It uses GCNs to reconstruct the vertex features, and it reconstructs the graph structure from the dot products of the embeddings.
The loss function combines the feature reconstruction loss and the structure reconstruction loss. The relative importance given to the features or to the structure can be tuned with the alpha hyperparameter.
The Dominant layer is based on [Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.](https://www.public.asu.edu/~kding9/pdf/SDM2019_Deep.pdf).
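As a rough sketch of how alpha trades off the two terms, the snippet below computes a weighted reconstruction loss in plain Python. It assumes the formulation from the Dominant paper (squared errors, structure estimated as the sigmoid of embedding dot products, alpha weighting the feature term); the exact form used inside PGX is not stated here, and all names and numbers are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reconstruct_structure(embeddings):
    """Estimate the adjacency matrix from pairwise dot products of embeddings."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in embeddings]
            for zi in embeddings]


def squared_error(actual, reconstructed):
    """Squared Frobenius norm of the difference between two matrices."""
    return sum((a - r) ** 2
               for row_a, row_r in zip(actual, reconstructed)
               for a, r in zip(row_a, row_r))


def dominant_loss(adjacency, features, embeddings, reconstructed_features, alpha):
    structure_loss = squared_error(adjacency, reconstruct_structure(embeddings))
    feature_loss = squared_error(features, reconstructed_features)
    # alpha weights the feature term, (1 - alpha) the structure term.
    return alpha * feature_loss + (1.0 - alpha) * structure_loss


# Toy data (made up): a 3-vertex path graph with 2 features per vertex.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# In a real model these would come from the decoder; here they are hard-coded.
reconstructed_features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.7]]

loss = dominant_loss(adjacency, features, embeddings, reconstructed_features, alpha=0.6)
print(loss)
```

With alpha above 0.5 the feature reconstruction error dominates the loss, and with alpha below 0.5 the structure reconstruction error does.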
Functionalities
This section describes how to use the main functionalities of our implementation of Dominant in PGX.
The following example demonstrates a scenario where we want to detect fraudulent vertices based on their features.
Loading a graph
First, we create a session and an analyst:
import pypgx

session = pypgx.get_session()
analyst = session.analyst
Since we train the model in an unsupervised manner, we do not need a test graph or test vertices:
# cpath is assumed to hold the path to the graph configuration
graph = session.read_graph_with_properties(cpath)
Building an UnsupervisedAnomalyDetectionGraphWise Model (minimal)
We build an UnsupervisedAnomalyDetectionGraphWise model using the minimal configuration and default hyperparameters. Note that even though only
one feature property is specified in this example, you can specify arbitrarily many.
model = analyst.unsupervised_anomaly_detection_graphwise_builder(
    vertex_input_property_names=["features"]
)
Advanced hyperparameter customization
The implementation allows for rich hyperparameter customization. This is done through the sub-config classes GraphWiseConvLayerConfig and GraphWiseEmbeddingConfig. In the following, we build such a configuration and use it in a model.
We specify a weight decay of 0.001 and dropout with dropping probability 0.5 to counteract overfitting.
We also set the Dominant embedding layer's alpha value to 0.6 to slightly increase the importance of the
feature reconstruction.
To enable or disable the GPU, we can use the parameter enable_accelerator. By default this feature is enabled. However, if there is no GPU device and the CUDA toolkit is not installed, the feature is disabled and the CPU is used for all mllib operations.
# customize convolutional layer config
# (uses the graph loaded above; PageRank values serve as neighbor weights)
weight_property = analyst.pagerank(graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5,  # set dropout rate to prevent overfitting
)
conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)

# customize embedding layer config
dominant_config = dict(alpha=0.6)
dominant_layer = analyst.graphwise_dominant_layer_config(**dominant_config)
params = dict(
    conv_layer_config=[conv_layer],
    embedding_config=dominant_layer,
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    weight_decay=0.001,  # set weight decay to prevent overfitting
    enable_accelerator=True,  # enable or disable GPU
)

model = analyst.unsupervised_anomaly_detection_graphwise_builder(**params)
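For readers unfamiliar with the two regularization settings used in the configuration above, the following plain-Python sketch shows what a dropout rate and a weight decay typically do. It is a generic illustration (inverted dropout and an L2 penalty in a plain SGD step), and it is an assumption that PGX applies them in this exact form; the function names are hypothetical.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """Plain SGD step with an L2 penalty: the gradient becomes g + weight_decay * w."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


rng = random.Random(42)
activations = [1.0] * 10

# With rate=0.5, every surviving activation is scaled from 1.0 to 2.0.
dropped = dropout(activations, rate=0.5, rng=rng)

# With zero gradients, weight decay alone shrinks each weight slightly towards 0.
updated = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.001)

print(dropped)
print(updated)
```

Both mechanisms limit how strongly the model can rely on individual weights or activations, which is why they help against overfitting.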
For a full description of all available hyperparameters and their default values, see the
pypgx.api.mllib.UnsupervisedAnomalyDetectionGraphWiseModel, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseDgiLayerConfig and pypgx.api.mllib.GraphWiseDominantLayerConfig docs.
Training the UnsupervisedAnomalyDetectionGraphWiseModel
We can train an UnsupervisedAnomalyDetectionGraphWiseModel model on a graph:
model.fit(graph)
Getting the loss value
We can fetch the training loss value:
loss = model.get_training_loss()
Inferring embeddings
We can use a trained model to infer embeddings for unseen nodes and store them in a CSV file:
# full_graph is assumed to be an already loaded graph that contains the unseen vertices
vertex_vectors = model.infer_embeddings(full_graph, full_graph.get_vertices()).flatten_all()
vertex_vectors.store("<path>/vertex_vectors.csv", file_format="csv", overwrite=True)
Without flattening, the schema of vertex_vectors would be as follows (flatten_all() splits the vector column into separate double-valued columns):
| vertexId | embedding |
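To illustrate the effect of flattening on this schema, here is a small plain-Python sketch that splits a vector-valued column into one scalar column per component. It does not use PGX; the function name and the generated column names such as `embedding_0` are hypothetical and may differ from what flatten_all() actually produces.

```python
def flatten_rows(rows, vector_column):
    """Split a list-valued column into one scalar column per component."""
    flattened = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != vector_column}
        for i, component in enumerate(row[vector_column]):
            flat[f"{vector_column}_{i}"] = component  # hypothetical column naming
        flattened.append(flat)
    return flattened


# Made-up rows in the unflattened schema: vertexId | embedding
rows = [
    {"vertexId": 1, "embedding": [0.1, 0.2]},
    {"vertexId": 2, "embedding": [0.3, 0.4]},
]
flat_rows = flatten_rows(rows, "embedding")
print(flat_rows)
```

The flattened form is convenient for CSV storage, where every cell holds a single value.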
Inferring anomalies
We can use a trained model to infer anomaly scores or labels for unseen nodes and store them in a CSV file:
vertex_scores = model.infer_anomaly_scores(full_graph, full_graph.get_vertices()).flatten_all()
vertex_scores.store("<path>/vertex_scores.csv", file_format="csv", overwrite=True)
If we know the contamination factor of our data we can use it to find a good threshold:
contamination_factor = 0.2
threshold = model.find_anomaly_threshold(full_graph, full_graph.get_vertices(), contamination_factor)
If we have a threshold value we can directly infer labels:
threshold = 0.9
vertex_labels = model.infer_anomaly_labels(full_graph, full_graph.get_vertices(), threshold).flatten_all()
vertex_labels.store("<path>/vertex_labels.csv", file_format="csv", overwrite=True)
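The relationship between a contamination factor, a threshold, and the resulting labels can be sketched in plain Python as follows. This is one common way to derive a threshold (taking the score that separates the top fraction of vertices), and it is an assumption that find_anomaly_threshold works along these lines; the function names, the scores, and the "greater or equal" comparison are illustrative only.

```python
def threshold_from_contamination(scores, contamination_factor):
    """Pick a threshold so that roughly `contamination_factor` of the scores
    are at or above it."""
    if not 0.0 < contamination_factor < 1.0:
        raise ValueError("contamination_factor must be in (0, 1)")
    ranked = sorted(scores, reverse=True)
    num_anomalies = max(1, round(contamination_factor * len(ranked)))
    # The threshold is the lowest score among the top `num_anomalies` scores.
    return ranked[num_anomalies - 1]


def label_anomalies(scores, threshold):
    """Label a vertex as anomalous when its score reaches the threshold."""
    return [score >= threshold for score in scores]


# Made-up anomaly scores for 10 vertices.
scores = [0.1, 0.3, 0.2, 0.95, 0.4, 0.15, 0.8, 0.25, 0.35, 0.05]

threshold = threshold_from_contamination(scores, contamination_factor=0.2)
labels = label_anomalies(scores, threshold)
print(threshold, labels)
```

A higher contamination factor lowers the threshold and marks more vertices as anomalous, so the factor should reflect the fraction of anomalies we actually expect in the data.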
Storing a trained model
Models can be stored either to the server file system or to a database.
The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model to a specified file path:
model.export().file("<path>/<model_name>", key)
When models are stored in a database, each model is stored as a row inside a model store table.
The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model in a database in a specific model store table:
model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)
Loading a pre-trained model
Similarly to storing, models can be loaded from a file in the server file system or from a database.
We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a specified file path as follows:
model = analyst.load_unsupervised_anomaly_detection_graphwise_model(
    "<path>/<model>",
    key
)
We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a model store table in a database as follows:
model = analyst.get_unsupervised_anomaly_detection_graphwise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)
Destroying a model
We can destroy a model as follows:
model.destroy()