PGX 1.2.0
Documentation

# Analyzing a Superhero Network With Patterns and Computations

This document describes how you can combine computational graph analysis and graph pattern matching with PGX.

## Data Preparation

In this example, we use a data set which represents a network of fictional characters from Marvel Comics. The data is publicly available from the following web site: http://exposedata.com/marvel. According to the explanation on the web site, the vertices in the graph are fictional characters. The edges between them represent co-appearance relationship. That is, an edge indicates that two characters had appeared in the same issue of a comic.

We will use the dataset in CSV format which is available from the above web site. The data file is a comma-separated text file where each line describes an edge of the graph data. The following are a few lines excerpted from the data file.

```"IRON MAN IV/JAMES R.","FORTUNE, DOMINIC"
"IRON MAN IV/JAMES R.","ERWIN, CLYTEMNESTRA"
"IRON MAN IV/JAMES R.","IRON MAN/TONY STARK "
"IRON MAN/TONY STARK ","FORTUNE, DOMINIC"
"IRON MAN/TONY STARK ","ERWIN, CLYTEMNESTRA"
"ERWIN, CLYTEMNESTRA","FORTUNE, DOMINIC"
"PRINCESS ZANDA","BLACK PANTHER/T'CHAL"
"PRINCESS ZANDA","LITTLE, ABNER"
"LITTLE, ABNER","BLACK PANTHER/T'CHAL"
```

As you can see, the file is indeed a list of edges in the data graph, while each vertex is represented with a (unique) string ID. The dataset is a multi-graph - meaning that there can be multiple edges between one source vertex and one destination vertex.

First download the above data file and put it into a convenient location. In order to load the data file into PGX, you need a simple Json file that describes the information about the data file. The following is the content of the Json file:

```{
"uri": "hero-network.csv",
"format": "edge_list",
"node_id_type": "string",
"separator": ","
}
```

The above file describes the location of datafile, format of it, as well as the value type of vertex id and separator. See the graph loading documentation for detailed information about how to load files in various formats. Save this file in the same folder as the datafile, with the name of `hero-network.csv.json`.

Now start PGX shell and load the graph with the following commands. Note that although the original data set is directed, we are undirecting the graph for the sake of robust analysis.

```// load the graph and undirect it
```

## Computational Graph Analysis

### Computing Centralities with Built-in Application

Now we will perform some computational analysis on the loaded graph instance. Specifically, we will run two different centrality algorithms on the graph instance and compare these values.

In graph theory, Centrality is a numeric property associated with every vertex that indicates relative importance of the vertex in the graph. There exist several different definitions of centralities, however, because there are many different ways to define relative importance of a vertex in a graph.

For instance, the popular Pagerank is a kind of centrality. Here the relative importance is defined recursively -- a vertex is important if it is followed by many important vertices. Moreover, it has been shown that the high Pagerank value also indicates that the vertex has high probability of being visited from a random walk. Intuitively, a vertex must be central if a high percentage of random walks are likely to go through that vertex.

Betweenness Centrality is another centrality definition which originated from Social Science. In this definition, a vertex has high centrality value if it frequently appears in the shortest paths between any pair of other vertices. Removing such an vertex would decrease the number of shortest paths between those pairs. In other words, a vertex is considered central, if it is essential to bridge other vertices in between.

PGX provides built-in algorithms for computing both of these centrality definitions. The following code snippet shows how to run them. The first line invokes `pagerank` algorithm with termination threshold of 0.0001, damping factor 0.85, and maximum iteration of 100. The second line simply invokes `vertexBetweenessCentrality` algorithm. Note that the `pagerank` method creates a new vertex property of name `pagerank` and the `vertexBetweenessCentrality` methods create another of name `betweenness`.

```pgx> analyst.pagerank(G, 0.0001, 0.85, 100)
=> Vertex Property named 'pagerank' of type double belong to ...
pgx> analyst.vertexBetweennessCentrality(G)
=> Vertex Property named 'betweenness' of type double belong to ...
```

We would like to mention that Betweenness Centrality is very expensive to compute. The algorithm takes O(NM) asymptotic execution time where N is number of the vertices in the graph and M is the number of edges. PGX, however, provides an extremely efficient implementation of this algorithm which makes it possible to run this algorithm on fairly sizeable graphs.

### Checking the results

Now, we will check the result of the above analyses. We can see vertices with top 10 pagerank values and top 10 betweenness centrality with following PGQL queries:

```pgx> G.queryPgql("SELECT n, n.pagerank, n.betweenness WHERE (n) ORDER BY n.pagerank DESC").print(10)
pgx> G.queryPgql("SELECT n, n.pagerank, n.betweenness WHERE (n) ORDER BY n.betweenness DESC").print(10)
```

The above queries are basically asking to retrieve all the vertices and their Pagerank and Betweenness Centrality values side by side, but ordered by either Pagerank or Betweenness Centrality. The results of these two queries are as below:

```                   n            n.pagerank         n.betweenness
==================================================================
CAPTAIN AMERICA  0.009909165007070385    3720705.9463404175
SPIDER-MAN/PETER PAR  0.009642528052014274     5033505.395979866
IRON MAN/TONY STARK   0.007294247272283732    2061886.2266715935
WOLVERINE/LOGAN   0.006565491888921434    2055461.3906177816
THING/BENJAMIN J. GR  0.006323530672933508     1355105.996356541
THOR/DR. DONALD BLAK  0.006288523005967819    1464283.2128501371
HUMAN TORCH/JOHNNY S   0.00603325450804593     1094332.882618513
MR. FANTASTIC/REED R  0.005832614004012102    1133204.6525742242
SCARLET WITCH/WANDA   0.0056226525407795415     829059.8688294411
INVISIBLE WOMAN/SUE    0.00539237549494341     723990.9267695197

n            n.pagerank         n.betweenness
==================================================================
SPIDER-MAN/PETER PAR  0.009642528052014274     5033505.395979866
CAPTAIN AMERICA  0.009909165007070385    3720705.9463404175
IRON MAN/TONY STARK   0.007294247272283732    2061886.2266715935
WOLVERINE/LOGAN   0.006565491888921434    2055461.3906177816
DR. STRANGE/STEPHEN   0.004112867084489208    1628817.0459632506
HAVOK/ALEX SUMMERS   0.0029354800466873687    1612813.8824396853
THOR/DR. DONALD BLAK  0.006288523005967819    1464283.2128501371
HULK/DR. ROBERT BRUC  0.0049732193380299244     1416870.891465681
THING/BENJAMIN J. GR  0.006323530672933508     1355105.996356541
DAREDEVIL/MATT MURDO  0.0038373297472661817    1301923.2838150489
```

We can see that the top 10 results from two centrality analyses are similar but not exactly the same. The most popular characters (Spider-Man, Captain America, Iron Man and Wolverine) are ranked very high in both centralities. For Pagerank, characters that belong to popular hero teams (Avengers and Fantastic Four) get higher ranks due to close inter-relationship between them. For Betweenness Centrality, on the other hand, characters that have their own titles get higher ranks because they act as a bridge between different groups of characters.

## Graph Pattern Matching

In above examples, we used PGQL for ordering vertices and selecting their properties, as in classic SQL. However, PGQL is much more powerful since it can query complex subgraph patterns.

### Finding Common Neighbours

As an example, let us consider the following question: "Shang-chi, White Tiger and Iron Fists are three characters in Marvel comic books whose power is based on martial arts. Are there any characters who are linked to all of the three characters? If so, who are they?"

The above question can be expressed with the following PGQL query:

```pgx> Results = G.queryPgql("SELECT x WHERE\
(a @ 'SHANG-CHI')->x,\
(b @ 'WHITE TIGER/HECTOR A')->x,\
(c @ 'IRON FIST/DANIEL RAN')->x")
pgx> Results.print(10);
```

Note that the query now describes a subgraph pattern of four vertices (`a,b,c,x`) in the WHERE clause. Here, `a, b, c` are vertices that are uniquely identified by their string ID while `x` is another vertex that has links from all of the three vertices.

The result of this query is, however, rather disappointing because the same character appears repeatedly in the result set. This is because the graph is a multi-graph. There can be multiple edges between the same pair of nodes -- each edge introduces a separate answer.

```                   x
======================
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
D'ANGELO, LIEUTENANT
BOOMERANG/FRED MYERS
CAPTAIN MARVEL/CAPTA
CAPTAIN MARVEL/CAPTA
```

To avoid this situation, we can easily simplify the graph with following command:

```G = G.simplify(MultiEdges.REMOVE_MULTI_EDGES, SelfEdges.REMOVE_SELF_EDGES, TrivialVertices.REMOVE_TRIVIAL_VERTICES, Mode.CREATE_COPY,null)
```

The above command removes all the multi-edges (i.e. repeated edges between same set of nodes), self edges (i.e. an edge whose source and destination vertex is same), as well as trivial vertices (i.e. a vertex that does not have any incoming or outgoing edge). Also note that the command in fact creates a simplified copy of the original graph; all the vertex properties of the original graph are copied into the simplified graph.

Now let us repeat the original query; this time, the result is more informative.

```pgx> Results = G.queryPgql("SELECT x WHERE\
(a @ 'SHANG-CHI')->x,\
(b @ 'WHITE TIGER/HECTOR A')->x,\
(c @ 'IRON FIST/DANIEL RAN')->x")
pgx> Results.getNumResults();
21
pgx> Results.print(21)
```
```                   x
======================
IRON MAN/TONY STARK
FU MANCHU
NOVA/RICHARD RIDER
HULK/DR. ROBERT BRUC
SPIDER-MAN/PETER PAR
HERCULES [GREEK GOD]
BLACK WIDOW/NATASHA
MOON KNIGHT/MARC SPE
DAREDEVIL/MATT MURDO
JACK OF HEARTS/JACK
D'ANGELO, LIEUTENANT
BOOMERANG/FRED MYERS
...
```

Note that the answer contains not only generally popular characters who teamed up with these martial-art heroes (e.g. Iron Man or Spider-Man), but also minor villains whom the three heroes commonly confronted. For instance, you can see the name of Fu Manchu, an archaic Chinese-themed villain.

### More Patterns Using Computational Analysis

You can query more complicated patterns with PGQL. Moreover, you can even make use of the results from computational analysis in your patterns.

Let us consider another question: "Among the characters who are linked to Dr. Octopus, find me every minor character who is also linked to a major character other than Spider-Man. Give me pair of these two characters."

Now we can assume the 'majority' of a character can be indicated by Pagerank values that we computed early in this document. Now the above question can be translated into the following PGQL query

```  RS = G.queryPgql("SELECT b, x  WHERE \
(a@'DR. OCTOPUS/OTTO OCT')->x<-(b WITH id()!='SPIDER-MAN/PETER PAR'),\
x.pagerank < 0.0001 ,\
b.pagerank > 0.005 ")
```

The result of above query returns 59 rows. The result reveals that some minor Spider-Man related characters (e.g. Mary and Richard Parker, the parents of Peter Paker) have connections to other popular superheroes like Fantastic Four. The query also found some minor villains (e.g. GOG) that are linked to Dr. Octopus but also confronted other major superheroes (e.g. Mr. Fantastic).

```                   b                     x
============================================
MR. FANTASTIC/REED R  BYRNES, GAYLE WATSON
HUMAN TORCH/JOHNNY S  BYRNES, GAYLE WATSON
WOLVERINE/LOGAN   HOWARD, PROFESSOR MA
BEAST/HENRY &HANK& P  HOWARD, PROFESSOR MA
IRON MAN/TONY STARK      BAKER, ANNE-MARIE
MR. FANTASTIC/REED R       PARKER, RICHARD
THING/BENJAMIN J. GR       PARKER, RICHARD
HUMAN TORCH/JOHNNY S       PARKER, RICHARD
MR. FANTASTIC/REED R          PARKER, MARY
...
MR. FANTASTIC/REED R                   GOG
THING/BENJAMIN J. GR                   GOG
INVISIBLE WOMAN/SUE                    GOG
HUMAN TORCH/JOHNNY S                   GOG
...
```

## Summary

In this document, we have shown how to use PGX to analyze a public data set of a network of Marvel Comics superheroes. We showed that PGX supports both computational analysis as well as pattern matching. By using both kinds of analysis together, PGX enables the user to get interesting information even from this simple data set.