What is Graph Analysis?
Graph analysis is a methodology for data analysis. In graph analysis, the
original data set is represented as a graph in which the vertices correspond to the
data entities and the edges to the relationships between them. Analyzing such a
graph naturally takes these fine-grained, arbitrary relationships into account,
which enables the discovery of valuable information about the original data set that is not immediately apparent.
For instance, PageRank is a popular
graph algorithm that measures the relative importance of data entities
based on the structure of the relationships between them (e.g., links between webpages).
Other examples of graph analysis include influencer identification, community structure
detection and path finding.
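To make the PageRank idea concrete, the following is a minimal, self-contained Java sketch of the standard iterative formulation. It does not use PGX and is not how PGX implements the algorithm; the class name `PageRankSketch`, the edge-array representation, and the choice to spread the rank of vertices without outgoing edges evenly over all vertices are assumptions made for illustration only.

```java
import java.util.Arrays;

// Illustrative sketch only: a plain-Java PageRank, not the PGX implementation.
final class PageRankSketch {

    /**
     * Computes PageRank on a directed graph.
     *
     * @param numVertices number of vertices, identified as 0..numVertices-1
     * @param edges       each entry is {source, target}
     * @param damping     damping factor, commonly 0.85
     * @param iterations  number of update rounds to run
     */
    static double[] pageRank(int numVertices, int[][] edges, double damping, int iterations) {
        int[] outDegree = new int[numVertices];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        // Start from a uniform distribution.
        double[] rank = new double[numVertices];
        Arrays.fill(rank, 1.0 / numVertices);

        for (int it = 0; it < iterations; it++) {
            // Rank held by vertices with no outgoing edges is spread evenly
            // (an assumption of this sketch) so that the ranks keep summing to 1.
            double dangling = 0.0;
            for (int v = 0; v < numVertices; v++) {
                if (outDegree[v] == 0) {
                    dangling += rank[v];
                }
            }

            double[] next = new double[numVertices];
            Arrays.fill(next, (1.0 - damping) / numVertices + damping * dangling / numVertices);

            // Each vertex passes its rank on in equal shares along its outgoing edges.
            for (int[] e : edges) {
                next[e[1]] += damping * rank[e[0]] / outDegree[e[0]];
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Two pages link to page 2, and page 2 links back to page 0.
        int[][] edges = {{0, 2}, {1, 2}, {2, 0}};
        System.out.println(Arrays.toString(pageRank(3, edges, 0.85, 100)));
    }
}
```

In this small example, vertex 2 receives links from both other vertices and therefore ends up with the highest rank, while vertex 1, which nothing links to, ends up with the lowest.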
PGX and Graph Analysis
PGX is a fast, parallel, in-memory graph analytic framework. PGX allows the
user to perform the following tasks easily and efficiently:
Loading graphs into memory: PGX is an in-memory graph analytic framework
that needs to load a graph instance into main memory before running
analytic algorithms on the graph. PGX supports a few popular graph file formats
for convenient data loading.
Running built-in graph algorithms: PGX provides built-in implementations of
many popular graph algorithms. The user can easily apply these algorithms on
their graph data sets by simply invoking the appropriate methods.
Running custom graph algorithms: PGX is also able to execute custom (i.e.
user-provided) graph algorithms. Users can write their own graph algorithms in the Green-Marl DSL
and feed them to PGX. The provided
Green-Marl program is transformed for PGX using a parallelizing compiler.
Running graph pattern matching queries: Along with running graph
algorithms, finding sub-graphs of interest to users is a crucial
task in graph analysis. PGX provides a graph pattern matching feature in which a
query is written in an SQL-like declarative language, PGQL. PGQL queries are processed
in a highly efficient manner on top of PGX.
Mutating graphs: Complicated graph analysis often consists of multiple steps,
where some of the steps require graph mutating operations. For example, one
may want to create an undirected version of the graph, renumber the vertices
in the graph, or remove repeated edges between vertices. PGX provides fast,
parallel built-in implementations of such operations.
Browsing and exporting results: Once the analysis is finished, users can
browse the results of their analysis and export them to the file system.
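As an illustration of the graph mutating operations mentioned in the list above, the following plain-Java sketch removes repeated edges and derives an undirected version of an edge list. It is sequential and does not call PGX; the class `GraphMutationSketch`, its method names, and the `{source, target}` edge representation are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: sequential versions of two mutation operations,
// not the parallel built-in operations that PGX provides.
final class GraphMutationSketch {

    /** Drops edges whose (source, target) pair has already been seen, keeping first-seen order. */
    static List<int[]> removeRepeatedEdges(List<int[]> edges) {
        Set<Long> seen = new HashSet<>();
        List<int[]> result = new ArrayList<>();
        for (int[] e : edges) {
            // Pack both 32-bit vertex ids into one 64-bit key.
            long key = ((long) e[0] << 32) | (e[1] & 0xFFFFFFFFL);
            if (seen.add(key)) {
                result.add(new int[] {e[0], e[1]});
            }
        }
        return result;
    }

    /**
     * Builds an undirected version of the graph: each edge is stored once as
     * {smaller id, larger id}, so a->b and b->a collapse into a single edge.
     */
    static List<int[]> undirect(List<int[]> edges) {
        List<int[]> normalized = new ArrayList<>();
        for (int[] e : edges) {
            normalized.add(new int[] {Math.min(e[0], e[1]), Math.max(e[0], e[1])});
        }
        return removeRepeatedEdges(normalized);
    }

    public static void main(String[] args) {
        List<int[]> edges = new ArrayList<>();
        edges.add(new int[] {0, 1});
        edges.add(new int[] {1, 0});
        edges.add(new int[] {0, 1});
        edges.add(new int[] {1, 2});
        System.out.println("after removing repeats: " + removeRepeatedEdges(edges).size() + " edges");
        System.out.println("undirected version: " + undirect(edges).size() + " edges");
    }
}
```

Chaining the two methods mirrors the multi-step workflow described above, where the output of one mutation becomes the input of the next step.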
The following figure depicts an overview of using PGX for graph analysis.
Figure: PGX Overview
Benefits of PGX
The benefits that PGX provides can be summarized as follows:
Fast, parallel, in-memory execution:
PGX is a fast, parallel, in-memory graph analytic framework. PGX adopts
light-weight in-memory data structures which allow fast execution of graph
algorithms. Moreover, PGX exploits multiple CPUs of modern computer systems
by running parallelized graph algorithms. Note that not only are the built-in
algorithms parallelized, but custom graph algorithms are also
automatically parallelized with the help of a DSL compiler.
Rich built-in algorithms:
PGX provides built-in implementations of
many popular graph algorithms, including computing various centrality
measures, finding shortest paths, finding/evaluating clusters and
components, and predicting future edges.
Support for custom algorithms:
PGX adopts the Green-Marl DSL to make custom algorithms both easy to
implement and efficient to execute. Users can program
their own graph algorithms intuitively by using the high-level
graph-specific data types and operators in Green-Marl. PGX can execute the
given Green-Marl program efficiently by parallelizing it
and mapping it to the PGX-internal API.
Interleaved usage of graph algorithms and graph pattern matching:
PGX supports two different kinds of high-level tasks: running
graph analysis algorithms (built-in or custom), and pattern matching, that is,
finding sub-graphs that match a pattern specified in a query.
The output of each of these tasks can be used as input to the other. The
results of analytics algorithms are stored as transient properties of vertices
and edges in the graph, and pattern matching can then be applied to those
properties. Similarly, the graphs that analytics are run against may be
sub-graphs derived from the original data using pattern matching. The
combination of pattern matching and analytics results in a highly expressive
and flexible interface for graph analytics.
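The interplay described above can be sketched in plain Java as follows. An analysis step computes a per-vertex value, which stands in for a transient property, and a matching step then selects the edges whose endpoints both satisfy a condition on that value. This is only an analogy: it uses neither PGX nor PGQL, and the class `InterleaveSketch`, its methods, and the use of in-degree as the analysis result are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: an analysis result used as input to pattern matching.
final class InterleaveSketch {

    /** Analysis step: counts incoming edges per vertex (stands in for a transient property). */
    static int[] inDegree(int numVertices, int[][] edges) {
        int[] degree = new int[numVertices];
        for (int[] e : edges) {
            degree[e[1]]++;
        }
        return degree;
    }

    /**
     * Matching step: finds every edge a->b where the property value of both
     * a and b is at least the given threshold.
     */
    static List<int[]> matchEdges(int[][] edges, int[] property, int threshold) {
        List<int[]> matches = new ArrayList<>();
        for (int[] e : edges) {
            if (property[e[0]] >= threshold && property[e[1]] >= threshold) {
                matches.add(new int[] {e[0], e[1]});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {2, 1}, {1, 2}, {0, 2}, {3, 0}};
        int[] property = inDegree(4, edges);
        for (int[] match : matchEdges(edges, property, 2)) {
            System.out.println(Arrays.toString(match));
        }
    }
}
```

The matched edges form a sub-graph that could in turn be handed to another analysis step, which is the other direction of the interplay.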
PGX provides a shell application with which the user can exercise the PGX
features interactively. That is, the user can simply start the
shell and type commands at the shell command line, instead of creating a
whole Java application for their analysis.