******
Pg2vec
******

Overview of the algorithm
-------------------------

:class:`Pg2vec` learns representations of graphlets (partitions inside a graph) by using edges as the principal learning units.
Compared to previous approaches that use vertices as learning units, this packs more information into each learning unit
for the representation learning task. The algorithm consists of three main steps:

1. We generate random walks for each vertex, with a pre-defined length per walk and a pre-defined number of walks per vertex.

2. Each edge in a random walk is mapped to a ``property edge-word`` in the created document, whose label is the graph-id. A property edge-word is the concatenation of the properties of the edge's source and destination vertices.

3. We feed the generated documents (with their attached document labels) to a `doc2vec <https://dl.acm.org/citation.cfm?id=3044805.3045025>`_ algorithm, which generates a vector representation for each document (which is a graph in this case).
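
The first two steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the PGX implementation: the function ``build_documents``, the ``_`` separator inside an edge-word, and the reading of the walk length as a number of steps are all hypothetical choices made for this example.

.. code-block:: python
    :linenos:

    import random
    from collections import defaultdict


    def build_documents(edges, graph_id, category,
                        walks_per_vertex=5, walk_length=8, seed=0):
        """Return {graph-id: [edge-word, ...]} built from undirected random walks."""
        rng = random.Random(seed)
        # Undirected adjacency lists: the walks ignore edge direction.
        neighbors = defaultdict(list)
        for src, dst in edges:
            neighbors[src].append(dst)
            neighbors[dst].append(src)

        documents = defaultdict(list)
        for start in sorted(graph_id):
            for _ in range(walks_per_vertex):
                current = start
                for _ in range(walk_length):
                    if not neighbors[current]:
                        break  # isolated vertex: nothing to walk
                    nxt = rng.choice(neighbors[current])
                    # Edge-word: properties of source and destination, concatenated
                    # (the "_" separator is an assumption of this sketch).
                    word = f"{category[current]}_{category[nxt]}"
                    documents[graph_id[start]].append(word)
                    current = nxt
        return dict(documents)


    # Toy input: two graphlets, "g1" (vertices 0-2) and "g2" (vertices 3-4).
    edges = [(0, 1), (1, 2), (3, 4)]
    graph_id = {0: "g1", 1: "g1", 2: "g1", 3: "g2", 4: "g2"}
    category = {0: "C", 1: "N", 2: "O", 3: "C", 4: "C"}
    docs = build_documents(edges, graph_id, category,
                           walks_per_vertex=2, walk_length=3)

Each entry of ``docs`` plays the role of one labeled document that would then be passed to a doc2vec-style trainer in step 3.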

:class:`Pg2vec` creates graphlet embeddings for a specific set of graphlets and cannot be updated to incorporate modifications to these graphlets.
Instead, a new :class:`Pg2vec` model should be trained on the modified graphlets.
Lastly, it is important to note that the memory consumption of a :class:`Pg2vec` model is ``O(2(n+m)*d)``, where ``n`` is
the number of vertices in the graph, ``m`` is the number of graphlets in the graph, and ``d`` is the embedding length.
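
To get a feel for this bound, the expression can be evaluated for concrete sizes. The sketch below treats ``2(n+m)*d`` as a rough count of stored values and assumes 4-byte floats; both are assumptions made for illustration, and the sizes used are hypothetical rather than measurements.

.. code-block:: python
    :linenos:

    def pg2vec_embedding_values(num_vertices, num_graphlets, embedding_length):
        """Number of stored values implied by the 2 * (n + m) * d expression."""
        return 2 * (num_vertices + num_graphlets) * embedding_length


    # Hypothetical sizes: 100,000 vertices, 4,000 graphlets, embedding length 200.
    values = pg2vec_embedding_values(100_000, 4_000, 200)
    approx_mib = values * 4 / 2**20  # assuming 4 bytes per stored value
    print(f"{values} values, roughly {approx_mib:.0f} MiB")

Because the bound is linear in ``d``, halving the embedding length halves this estimate.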

Functionalities
---------------

We describe here the main functionalities of our implementation of :class:`Pg2vec` in PGX,
using the `NCI109 <https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets>`_ dataset (which contains 4127 graphs) as an example.

Loading a graph
~~~~~~~~~~~~~~~

First, we create a session and an analyst:

.. code-block:: python
    :linenos:

    session = pypgx.get_session(session_name="my-session")
    analyst = session.create_analyst()

Our :class:`Pg2vec` algorithm can be applied to directed or undirected graphs
(even though we only consider undirected random walks). To begin with, we can load a graph as follows:

.. code-block:: python
    :linenos:

    graph = session.read_graph_with_properties(self.small_graph)

Building a Pg2vec Model (minimal)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can build a :class:`Pg2vec` model using the minimal configuration and default hyper-parameters as follows:

.. code-block:: python
    :linenos:

    model = analyst.pg2vec_builder(
        graphlet_id_property_name="graph_id",
        vertex_property_names=["category"],
        window_size=4,
        walks_per_vertex=5,
        walk_length=8
    )

We specify the vertex property that identifies each graphlet using the ``graphlet_id_property_name`` parameter,
and the vertex properties that :class:`Pg2vec` employs using the ``vertex_property_names`` parameter.
We can also use the weakly connected component (WCC) functionality in PGX to determine the graphlets in a given graph.
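
To illustrate what a WCC-based graphlet assignment produces, the following sketch labels vertices by weakly connected component using a small union-find structure. It is a plain-Python illustration and not the PGX WCC API; the function name ``weakly_connected_components`` is hypothetical.

.. code-block:: python
    :linenos:

    def weakly_connected_components(num_vertices, edges):
        """Label each vertex with a component id, ignoring edge direction."""
        parent = list(range(num_vertices))

        def find(v):
            # Path halving keeps the trees shallow.
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        for src, dst in edges:
            root_src, root_dst = find(src), find(dst)
            if root_src != root_dst:
                parent[root_dst] = root_src  # merge the two components

        return [find(v) for v in range(num_vertices)]


    # Vertices 0-2 form one component and vertices 3-4 another,
    # even though the edge (2, 1) points "backwards".
    edges = [(0, 1), (2, 1), (3, 4)]
    graph_id = weakly_connected_components(5, edges)

The resulting labels could then be stored as a vertex property and used as the graphlet id.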

Building a Pg2vec Model (customized)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can build a :class:`Pg2vec` model using customized hyper-parameters as follows:

.. code-block:: python
    :linenos:

    model = analyst.pg2vec_builder(
        graphlet_id_property_name="graph_id",
        vertex_property_names=["category"],
        min_word_frequency=1,
        batch_size=128,
        num_epochs=5,
        layer_size=200,
        learning_rate=0.04,
        min_learning_rate=0.0001,
        window_size=4,
        walks_per_vertex=5,
        walk_length=8,
        use_graphlet_size=True,
        graphlet_size_property_name="graphletSize-Pg2vec",
    )

We provide a complete explanation of each builder operation (along with the default values) in our :class:`Pg2vecModelBuilder` docs.

Training the Pg2vec model
~~~~~~~~~~~~~~~~~~~~~~~~~

We can train a :class:`Pg2vec` model with the specified (default or customized) settings as follows:

.. code-block:: python
    :linenos:
    
    model.fit(graph)

Getting the loss value
~~~~~~~~~~~~~~~~~~~~~~

We can fetch the loss value on a specified fraction of training data as follows:

.. code-block:: python
    :linenos:

    loss = model.loss

Computing the similar graphlets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can fetch the ``k`` most similar graphlets for a given graphlet with the following code:

.. code-block:: python
    :linenos:

    similars = model.compute_similars(1, 10)

The output will be in the following format, for example when searching for graphlets similar to the graphlet with ``ID = 52`` using
the trained model:

+-------------+--------------------+
| dstGraphlet | similarity         |
+=============+====================+
| 52          | 1.0                |
+-------------+--------------------+
| 10          | 0.8748674392700195 |
+-------------+--------------------+
| 23          | 0.8551455140113831 |
+-------------+--------------------+
| 26          | 0.8493421673774719 |
+-------------+--------------------+
| 47          | 0.8411962985992432 |
+-------------+--------------------+
| 25          | 0.8281504511833191 |
+-------------+--------------------+
| 43          | 0.8202780485153198 |
+-------------+--------------------+
| 24          | 0.8179885745048523 |
+-------------+--------------------+
| 8           | 0.796689510345459  |
+-------------+--------------------+
| 9           | 0.7947834134101868 |
+-------------+--------------------+

The following images visualize two similar graphlets (top: ``ID = 52``, bottom: ``ID = 10``):

.. image:: /_static/images/mllib_Pg2vec_NCI109_Demo_g52.png
   :alt: similar graphlets pg2vec

.. image:: /_static/images/mllib_Pg2vec_NCI109_Demo_g10.png
   :alt: similar graphlets pg2vec
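
A ranking like the one in the table above could be reproduced outside PGX roughly as follows. This is a minimal sketch that assumes the similarity is the cosine similarity between graphlet vectors, which this page does not state; the vectors and the helper names ``cosine`` and ``top_k_similar`` are hypothetical.

.. code-block:: python
    :linenos:

    import math


    def cosine(u, v):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)


    def top_k_similar(vectors, query_id, k):
        """Return the k (graphlet id, similarity) pairs closest to the query."""
        query = vectors[query_id]
        scores = [(other, cosine(query, vec)) for other, vec in vectors.items()]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:k]


    # Hypothetical 2-dimensional graphlet vectors keyed by graphlet id.
    vectors = {52: [1.0, 0.0], 10: [0.9, 0.1], 23: [0.0, 1.0]}
    ranked = top_k_similar(vectors, 52, 2)

As in the table, the query graphlet itself ranks first with a similarity of 1.0, because a vector is always maximally similar to itself.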

Computing the similars (for a graphlet batch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can fetch the ``k`` most similar graphlets for a batch of input graphlets with the following code:

.. code-block:: python
    :linenos:

    batched_similars = model.compute_similars([1, 2], 10)

The output will be in the following format, for example when searching for graphlets similar to the graphlets
with ``ID = 52`` and ``ID = 41`` using the trained model:

+-------------+-------------+--------------------+
| srcGraphlet | dstGraphlet | similarity         |
+=============+=============+====================+
| 52          | 52          | 1.0                |
+-------------+-------------+--------------------+
| 52          | 10          | 0.8748674392700195 |
+-------------+-------------+--------------------+
| 52          | 23          | 0.8551455140113831 |
+-------------+-------------+--------------------+
| 52          | 26          | 0.8493421673774719 |
+-------------+-------------+--------------------+
| 52          | 47          | 0.8411962985992432 |
+-------------+-------------+--------------------+
| 52          | 25          | 0.8281504511833191 |
+-------------+-------------+--------------------+
| 52          | 43          | 0.8202780485153198 |
+-------------+-------------+--------------------+
| 52          | 24          | 0.8179885745048523 |
+-------------+-------------+--------------------+
| 52          | 8           | 0.796689510345459  |
+-------------+-------------+--------------------+
| 52          | 9           | 0.7947834134101868 |
+-------------+-------------+--------------------+
| 41          | 41          | 1.0                |
+-------------+-------------+--------------------+
| 41          | 197         | 0.9653506875038147 |
+-------------+-------------+--------------------+
| 41          | 84          | 0.9552277326583862 |
+-------------+-------------+--------------------+
| 41          | 157         | 0.9465565085411072 |
+-------------+-------------+--------------------+
| 41          | 65          | 0.9287481307983398 |
+-------------+-------------+--------------------+
| 41          | 248         | 0.9177336096763611 |
+-------------+-------------+--------------------+
| 41          | 315         | 0.9043129086494446 |
+-------------+-------------+--------------------+
| 41          | 92          | 0.8998928070068359 |
+-------------+-------------+--------------------+
| 41          | 297         | 0.8897411227226257 |
+-------------+-------------+--------------------+
| 41          | 50          | 0.8810243010520935 |
+-------------+-------------+--------------------+

Inferring a graphlet vector
~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can infer the vector representation for a given new graphlet with the following code:

.. code-block:: python
    :linenos:

    from pypgx.api.filters import VertexFilter
    graphlet = graph.filter(VertexFilter("vertex.graph_id = 1"))
    inferred_vector = model.infer_graphlet_vector(graphlet)
    inferred_vector.print()

The schema for the ``inferred_vector`` would be as follows:

+-----------------------------------------+---------------------+
| graphlet                                | embedding           |
+-----------------------------------------+---------------------+

Inferring vectors (for a graphlet batch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can infer the vector representations for multiple graphlets (specified with different graph-ids in a graph) with the following code:

.. code-block:: python
    :linenos:

    graphlets = session.read_graph_with_properties(
        self.small_graph
    )
    inferred_vector_batched = model.infer_graphlet_vector_batched(
        graphlets
    )
    inferred_vector_batched.print()

The schema is the same as for ``infer_graphlet_vector``, but with more rows, one for each input graphlet.

Getting all trained graphlet vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can retrieve the trained graphlet vectors for the current :class:`Pg2vec` model as follows:

.. code-block:: python
    :linenos:

    graphlet_vectors = model.trained_graphlet_vectors.flatten_all()
    graphlet_vectors.store(
        path=tmp + "/graphlet_vectors.csv",
        overwrite=True,
        file_format="csv"
    )

The schema is the same as for ``infer_graphlet_vector``, but with more rows, one for each graphlet in the input graph.

Storing a trained model
~~~~~~~~~~~~~~~~~~~~~~~

Models can be stored either to the server file system, or to a database.

The following shows how to store a trained :class:`Pg2vec` model to a specified file path:

.. code-block:: python
    :linenos:

    model.export().file(path=tmp + "/model.model", key="test", overwrite=True)

When models are stored in a database, each model is stored as a row inside a model store table.
The following shows how to store a trained :class:`Pg2vec` model in a specific model store table in a database:

.. code-block:: python
    :linenos:

    model.export().db(
        username="user",
        password="password",
        model_store="modelstoretablename",
        model_name="model",
        jdbc_url="jdbc_url"
    )

Loading a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Similarly to storing, models can be loaded from a file in the server file system, or from a database.

It is possible to load a pre-trained :class:`Pg2vec` model from a specified file path as follows:

.. code-block:: python
    :linenos:

    model = analyst.get_pg2vec_model_loader().file(
        path=tmp + "/model.model",
        key="test"
    )

We can load a pre-trained :class:`Pg2vec` model from a model store table in a database as follows:

.. code-block:: python
    :linenos:

    model = analyst.get_pg2vec_model_loader().db(
        username="user",
        password="password",
        model_store="modelstoretablename",
        model_name="model",
        jdbc_url="jdbc_url"
    )

Destroying a model
~~~~~~~~~~~~~~~~~~

We can destroy a model with the following operation:

.. code-block:: python
    :linenos:

    model.destroy()