UnsupervisedAnomalyDetectionGraphWise
Overview
UnsupervisedAnomalyDetectionGraphWise is an inductive vertex representation learning algorithm that can leverage vertex feature information. It can be applied to a wide variety of tasks, including unsupervised learning of vertex embeddings for vertex classification.
UnsupervisedAnomalyDetectionGraphWise is based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.
Model Structure
An UnsupervisedAnomalyDetectionGraphWise model consists of graph convolutional layers followed by an embedding layer, which defaults to a DGI layer.
The forward pass through a convolutional layer for a vertex proceeds as follows:
A set of neighbors of the vertex is sampled.
The previous layer representations of the neighbors are mean-aggregated, and the aggregated features are concatenated with the previous layer representation of the vertex.
This concatenated vector is multiplied with weights, and a bias vector is added.
The result is normalized so that the layer output has unit norm.
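The four steps above can be sketched in plain Python for a single vertex. This is a minimal illustration, not the PGX implementation: the function name, the dictionary-based graph, the fallback to the vertex itself when it has no neighbors, and sampling with replacement when there are too few neighbors are all assumptions made for the example.

```python
import math
import random


def conv_layer_forward(vertex, neighbors, prev, weights, bias, num_samples, rng):
    """One convolution step for a single vertex (illustrative only)."""
    # 1. Sample a fixed-size set of neighbors. Falling back to the vertex itself
    #    and sampling with replacement are assumptions of this sketch.
    pool = neighbors[vertex] or [vertex]
    if len(pool) >= num_samples:
        sampled = rng.sample(pool, num_samples)
    else:
        sampled = [rng.choice(pool) for _ in range(num_samples)]

    # 2. Mean-aggregate the neighbors' previous-layer representations and
    #    concatenate the result with the vertex's own previous representation.
    dim = len(prev[vertex])
    mean = [sum(prev[n][i] for n in sampled) / len(sampled) for i in range(dim)]
    concat = mean + prev[vertex]

    # 3. Multiply with the weights (one row per output unit) and add the bias.
    out = [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weights, bias)]

    # 4. Normalize so that the layer output has unit L2 norm.
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


rng = random.Random(0)
prev = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weights = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]]
bias = [0.1, -0.1]
h0 = conv_layer_forward(0, neighbors, prev, weights, bias, 2, rng)
```

Here `h0` is the two-dimensional output for vertex 0. Its length equals the number of weight rows and its L2 norm is 1.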
The DGI layer consists of three parts that enable unsupervised learning using the embeddings produced by the convolutional layers.
Corruption function: Shuffles the node features while preserving the graph structure to produce negative embedding samples using the convolution layers.
Readout function: The sigmoid-activated mean of the embeddings, used as a summary of the graph.
Discriminator: Measures the similarity of the positive (unshuffled) embeddings to the summary, as well as the similarity of the negative samples to the summary. The loss function is computed from these similarities.
Since none of these parts has tunable hyperparameters, the DGI layer is always used in its default form and cannot be adjusted.
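The three DGI parts described above can be sketched as follows. This is a simplified illustration under stated assumptions, not PGX's internals. The bilinear discriminator form (sigmoid of h^T W s) and the binary cross-entropy loss follow the Deep Graph Infomax paper. The features stand in for the embeddings because the sketch has no encoder, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def corrupt(features, rng):
    """Corruption: shuffle feature rows across vertices; the graph structure is untouched."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def readout(embeddings):
    """Readout: sigmoid of the element-wise mean of all vertex embeddings."""
    dim = len(embeddings[0])
    return [sigmoid(sum(e[i] for e in embeddings) / len(embeddings)) for i in range(dim)]


def discriminator(embedding, summary, w):
    """Discriminator: bilinear score sigmoid(h^T W s) (assumed form, from the DGI paper)."""
    ws = [sum(w_ij * s_j for w_ij, s_j in zip(row, summary)) for row in w]
    return sigmoid(sum(h_i * ws_i for h_i, ws_i in zip(embedding, ws)))


def dgi_loss(pos_embeddings, neg_embeddings, w):
    """Binary cross-entropy: positives should score near 1, negatives near 0."""
    summary = readout(pos_embeddings)
    eps = 1e-12
    pos = [-math.log(discriminator(h, summary, w) + eps) for h in pos_embeddings]
    neg = [-math.log(1.0 - discriminator(h, summary, w) + eps) for h in neg_embeddings]
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Stand-in encoder: the features themselves play the role of the embeddings.
negatives = corrupt(features, rng)
loss = dgi_loss(features, negatives, identity)
```

In a real model, both the original and the shuffled features would first pass through the convolutional layers, and the discriminator matrix would be learned together with the layer weights.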
The second embedding layer available is the Dominant Layer, based on Deep Anomaly Detection on Attributed Networks (Dominant) by Ding, Kaize, et al.
Dominant is a model that detects anomalies based on the vertex features and the neighborhood structure. It uses GCNs to reconstruct the features in an autoencoder setting, and reconstructs the structure mask from the dot products of the embeddings.
The loss function is computed from the feature reconstruction loss and the structure reconstruction loss. The importance given to features or to the structure can be tuned with the alpha hyperparameter.
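To make the role of alpha concrete, here is a minimal sketch of a per-vertex weighted reconstruction error. It assumes the convention of the Dominant paper, where alpha weights the feature error and (1 - alpha) weights the structure error. The exact formula PGX uses is not given here, and the function names and toy inputs are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dominant_errors(features, reconstructed, adjacency, embeddings, alpha):
    """Per-vertex weighted reconstruction error (illustrative, not PGX's internals)."""
    errors = []
    for i, z_i in enumerate(embeddings):
        # Structure reconstruction: predicted link strength from embedding dot products.
        predicted_row = [sigmoid(sum(a * b for a, b in zip(z_i, z_j))) for z_j in embeddings]
        structure_err = l2(adjacency[i], predicted_row)
        # Feature reconstruction: distance between original and decoded features.
        feature_err = l2(features[i], reconstructed[i])
        errors.append(alpha * feature_err + (1.0 - alpha) * structure_err)
    return errors


features = [[1.0, 0.0], [0.0, 1.0]]
reconstructed = [[1.0, 0.0], [0.0, 0.0]]  # vertex 1 is reconstructed poorly
adjacency = [[0.0, 1.0], [1.0, 0.0]]
embeddings = [[0.5, 0.5], [0.5, -0.5]]
errors = dominant_errors(features, reconstructed, adjacency, embeddings, alpha=0.6)
loss = sum(errors) / len(errors)
```

With alpha closer to 1, a vertex whose features are reconstructed poorly contributes more to the error. With alpha closer to 0, a mismatch between predicted and actual links dominates.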
Functionalities
We describe here how to use the main functionalities of our implementation of Dominant in PGX.
Loading a graph
First, we create a session and an analyst:
import pypgx

session = pypgx.get_session()
analyst = session.analyst
Since we train the model unsupervised, we do not need a test graph or test vertices:
# cpath: path to the graph configuration (defined elsewhere)
graph = session.read_graph_with_properties(cpath)
Building an UnsupervisedAnomalyDetectionGraphWise Model (minimal)
We build an UnsupervisedAnomalyDetectionGraphWise model using the minimal configuration and default hyperparameters. Note that even though only one feature property is specified in this example, you can specify arbitrarily many.
model = analyst.unsupervised_anomaly_detection_graphwise_builder(
    vertex_input_property_names=["features"]
)
Advanced hyperparameter customization
The implementation allows for rich hyperparameter customization. This is done through the sub-config classes GraphWiseConvLayerConfig and GraphWiseEmbeddingConfig. In the following, we build such a configuration and use it in a model.
We specify a weight decay of 0.001 and dropout with a dropping probability of 0.5 to counteract overfitting. We also set the Dominant embedding layer’s alpha value to 0.6 to slightly increase the importance of the feature reconstruction.
# Name of the property holding PageRank values, used below as the neighbor weight property
weight_property = analyst.pagerank(graph).name
conv_layer_config = dict(
    num_sampled_neighbors=25,
    activation_fn='tanh',
    weight_init_scheme='xavier',
    neighbor_weight_property_name=weight_property,
    dropout_rate=0.5
)
conv_layer = analyst.graphwise_conv_layer_config(**conv_layer_config)
# alpha=0.6 slightly favors the feature reconstruction
dominant_config = dict(alpha=0.6)
dominant_layer = analyst.graphwise_dominant_layer_config(**dominant_config)
params = dict(
    conv_layer_config=[conv_layer],
    embedding_config=dominant_layer,
    vertex_input_property_names=["vertex_features"],
    edge_input_property_names=["edge_features"],
    weight_decay=0.001
)

model = analyst.unsupervised_anomaly_detection_graphwise_builder(**params)
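For readers unfamiliar with the two regularizers configured above, the following sketch shows what dropout and weight decay typically do. It is a generic illustration. It assumes inverted dropout and a plain SGD update, and it does not describe where or how PGX applies them internally. The function names are hypothetical.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    keep = 1.0 - rate
    if keep <= 0.0:
        return [0.0 for _ in values]
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


def sgd_step_with_weight_decay(weights, grads, learning_rate, weight_decay):
    """Plain SGD step where weight decay adds `weight_decay * w` to each gradient."""
    return [w - learning_rate * (g + weight_decay * w) for w, g in zip(weights, grads)]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, 0.5, rng)

# With zero gradients, weight decay alone shrinks the weights toward zero.
updated = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], 0.1, 0.001)
```

Each entry of `dropped` is either zero or twice the original value, so the expected activation is unchanged. The entries of `updated` are slightly smaller in magnitude than the starting weights.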
For a full description of all available hyperparameters and their default values, see the pypgx.api.mllib.UnsupervisedAnomalyDetectionGraphWiseModel, pypgx.api.mllib.GraphWiseConvLayerConfig, pypgx.api.mllib.GraphWiseDgiLayerConfig and pypgx.api.mllib.GraphWiseDominantLayerConfig docs.
Training the UnsupervisedAnomalyDetectionGraphWiseModel
We can train an UnsupervisedAnomalyDetectionGraphWiseModel on a graph:
model.fit(graph)
Getting the loss value
We can fetch the training loss value:
loss = model.get_training_loss()
Inferring embeddings
We can use a trained model to infer embeddings for unseen nodes and store them in a CSV file:
# full_graph: a graph containing the vertices to infer on (defined elsewhere)
vertex_vectors = model.infer_embeddings(full_graph, full_graph.get_vertices()).flatten_all()
vertex_vectors.store("<path>/vertex_vectors.csv", file_format="csv", overwrite=True)
The schema for vertex_vectors would be as follows without flattening (flatten_all() splits the vector column into separate double-valued columns):
vertexId | embedding
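The effect of flattening can be pictured with plain Python lists. This is only an illustration of the reshaping described above: `flatten_rows` is a hypothetical helper, not a PGX call, and the column names produced by flatten_all() are not shown here.

```python
def flatten_rows(rows):
    """Expand (vertex_id, embedding) rows into one column per embedding component."""
    return [[vertex_id, *embedding] for vertex_id, embedding in rows]


# Two rows in the unflattened schema: a vertex ID and a vector-valued embedding.
rows = [(0, [0.1, 0.2, 0.3]), (1, [0.4, 0.5, 0.6])]
flat = flatten_rows(rows)
```

Each flattened row holds the vertex ID followed by one double value per embedding dimension, which is the shape that ends up in the CSV file.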
Inferring anomalies
We can use a trained model to infer anomaly scores or labels for unseen nodes and store them in a CSV file:
vertex_scores = model.infer_anomaly_scores(full_graph, full_graph.get_vertices()).flatten_all()
vertex_scores.store("<path>/vertex_scores.csv", file_format="csv", overwrite=True)
If we know the contamination factor of our data, we can use it to find a good threshold:
contamination_factor = 0.2
threshold = model.find_anomaly_threshold(full_graph, full_graph.get_vertices(), contamination_factor)
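The idea behind a contamination-based threshold can be sketched independently of PGX. The example assumes that higher scores mean "more anomalous" and that the threshold is chosen so that roughly the given fraction of vertices falls at or above it. How find_anomaly_threshold computes its result internally is not stated here, and the helper name is hypothetical.

```python
import math


def threshold_from_contamination(scores, contamination_factor):
    """Pick a cut-off so that about `contamination_factor` of the scores reach it."""
    if not 0.0 < contamination_factor < 1.0:
        raise ValueError("contamination_factor must be strictly between 0 and 1")
    ordered = sorted(scores, reverse=True)
    # Number of vertices expected to be anomalous (at least one).
    num_anomalies = max(1, math.floor(contamination_factor * len(ordered)))
    return ordered[num_anomalies - 1]


scores = [0.1, 0.2, 0.15, 0.9, 0.3, 0.25, 0.05, 0.8, 0.12, 0.22]
example_threshold = threshold_from_contamination(scores, 0.2)
num_flagged = sum(1 for s in scores if s >= example_threshold)
```

With ten scores and a contamination factor of 0.2, the two highest-scoring vertices reach the threshold.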
If we have a threshold value, we can directly infer labels:
threshold = 0.9
vertex_labels = model.infer_anomaly_labels(full_graph, full_graph.get_vertices(), threshold).flatten_all()
vertex_labels.store("<path>/vertex_labels.csv", file_format="csv", overwrite=True)
Storing a trained model
Models can be stored either to the server file system, or to a database.
The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model to a specified file path:
model.export().file("<path>/<model_name>", key)
When models are stored in a database, each model is stored as a row inside a model store table.
The following shows how to store a trained UnsupervisedAnomalyDetectionGraphWise model in a database, in a specific model store table:
model.export().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)
Loading a pre-trained model
Similarly to storing, models can be loaded from a file in the server file system, or from a database.
We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a specified file path as follows:
model = analyst.load_unsupervised_anomaly_detection_graphwise_model(
    "<path>/<model>",
    key
)
We can load a pre-trained UnsupervisedAnomalyDetectionGraphWise model from a model store table in a database as follows:
model = analyst.get_unsupervised_anomaly_detection_graphwise_model_loader().db(
    "modeltablename",
    "model_name",
    username="user",
    password="password",
    jdbc_url="jdbcUrl"
)
Destroying a model
We can destroy a model as follows:
model.destroy()