Graph analysis is a methodology for data analysis in which the original data set is represented as a graph: vertices correspond to data entities and edges to the relationships between them. Analyzing such a graph naturally takes these fine-grained, arbitrary relationships into account, which enables the discovery of valuable information about the original data set that is not immediately apparent.
For instance, PageRank is a popular graph algorithm that measures the relative importance of data entities based on the structure of the relationships between them (e.g., links between webpages). Other examples of graph analysis include influencer identification, community structure detection and path finding.
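To make the PageRank idea concrete, here is a minimal sketch in plain Python. It is not the PGX implementation or API; the function name `pagerank`, its parameters, and the tiny example graph are assumptions chosen for illustration. It uses simple power iteration with a damping factor of 0.85.

```python
def pagerank(edges, num_vertices, damping=0.85, max_iter=100, tol=1e-9):
    """Illustrative power-iteration PageRank over a directed edge list.

    edges: list of (source, destination) pairs with vertex IDs 0..num_vertices-1.
    Returns a list of rank values that sums to 1.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(max_iter):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        new_rank = [base] * num_vertices

        # Each vertex passes its rank in equal shares along its outgoing edges.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]

        change = sum(abs(a - b) for a, b in zip(rank, new_rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical 4-page web: pages 0, 1 and 3 all link to page 2; page 2 links to page 0.
edges = [(0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges, 4)
most_important = max(range(4), key=lambda v: ranks[v])
```

In this example page 2 receives the most links and ends up with the highest rank, while pages 1 and 3, which nobody links to, share the lowest rank.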
PGX is a fast, parallel, in-memory graph analytic framework. PGX allows the user to do the following things in an easy and efficient manner:
Loading graphs into memory: PGX is an in-memory graph analytic framework that needs to load a graph instance into main memory before running analytic algorithms on the graph. PGX supports a few popular graph file formats for convenient data loading.
Running built-in graph algorithms: PGX provides built-in implementations of many popular graph algorithms. The user can apply these algorithms to their graph data sets by simply invoking the appropriate methods.
Running custom graph algorithms: PGX is also able to execute custom (i.e., user-provided) graph algorithms. Users can write their own graph algorithms in the Green-Marl DSL and feed them to PGX. The provided Green-Marl program is transformed for PGX using a parallelizing compiler.
Running graph pattern matching queries: Along with running graph algorithms, finding sub-graphs of interest to users is a crucial task in graph analysis. PGX provides a graph pattern matching feature in which a query is written in PGQL, an SQL-like declarative language. PGQL queries are processed in a highly efficient manner on top of PGX.
Mutating graphs: Complicated graph analysis often consists of multiple steps, some of which require graph-mutating operations. For example, one may want to create an undirected version of the graph, renumber the vertices in the graph, or remove repeated edges between vertices. PGX provides fast, parallel built-in implementations of such operations.
Browsing and exporting results: Once the analysis is finished, the users can browse the results of their analysis and export them into the file system.
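The graph-mutating operations mentioned in the list above (removing repeated edges, creating an undirected version, renumbering vertices) can be sketched on a plain edge list as follows. This is a sequential illustration in plain Python, not the PGX API; all function names and the sample data are hypothetical.

```python
def remove_repeated_edges(edges):
    """Keep only the first occurrence of each directed (source, destination) pair."""
    seen = set()
    result = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


def undirect(edges):
    """Return an undirected version: each edge is stored once as (smaller, larger)."""
    return sorted({(min(u, v), max(u, v)) for u, v in edges})


def renumber(edges):
    """Map arbitrary vertex IDs to dense IDs 0..n-1, in ascending order of the old IDs."""
    ids = sorted({v for edge in edges for v in edge})
    mapping = {old: new for new, old in enumerate(ids)}
    return [(mapping[u], mapping[v]) for u, v in edges], mapping


# Hypothetical input with one repeated edge and one pair of opposite edges.
edges = [(10, 20), (20, 10), (10, 20), (20, 40)]
simple = remove_repeated_edges(edges)
undirected = undirect(simple)
dense, mapping = renumber(undirected)
```

Here `simple` drops the second `(10, 20)`, `undirected` merges `(10, 20)` and `(20, 10)` into one edge, and `dense` rewrites the sparse IDs 10, 20, 40 as 0, 1, 2.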
The following figure depicts an overview of using PGX for graph analysis.
Figure: PGX Overview
The benefits that PGX provides can be summarized as follows:
Fast, parallel, in-memory execution: PGX adopts lightweight in-memory data structures that allow fast execution of graph algorithms. Moreover, PGX exploits the multiple CPUs of modern computer systems by running parallelized graph algorithms. Not only are the built-in algorithms parallelized; custom graph algorithms are also parallelized automatically with the help of a DSL compiler.
Rich built-in algorithms: PGX provides built-in implementations of many popular graph algorithms, including computing various centrality measures, finding shortest paths, finding and evaluating clusters and components, and predicting future edges.
Support for custom algorithms: PGX adopts the Green-Marl DSL so that custom algorithms are both easy to implement and efficient to execute. Users can program their own graph algorithms intuitively by using the high-level, graph-specific data types and operators in Green-Marl. PGX executes a given Green-Marl program efficiently by parallelizing it and mapping it onto the PGX-internal API.
Interleaved usage of graph algorithms and graph pattern matching: PGX supports two different kinds of high-level tasks: running graph analysis algorithms (built-in or custom), and pattern matching, that is, finding sub-graphs that match a pattern specified in a query. The output of each of these tasks can be used as input to the other. The results of analytics algorithms are stored as transient properties of nodes and edges in the graph, and pattern matching can then be used against those properties. Similarly, the graphs that analytics are run against may be sub-graphs derived from the original data using pattern matching. The combination of pattern matching and analytics results in a highly expressive and flexible interface for graph analytics.
Interactive shell: PGX provides a shell application with which the user can exercise the PGX features in an interactive manner. That is, the user can simply start the shell and type commands at the shell command line, instead of creating a whole Java application for their analysis.
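The interleaving of analytics and pattern matching described in the list above can be illustrated with a small sketch. It is written in plain Python and does not use PGX or PGQL; the in-degree "analytic", the `match_edges` helper, and the sample graph are all hypothetical stand-ins. An analytic result is stored as a vertex property, a pattern is matched against that property, and an analytic is then run again on the matched sub-graph.

```python
def in_degree(edges, vertices):
    """A stand-in 'analytic': count incoming edges per vertex."""
    score = {v: 0 for v in vertices}
    for _, dst in edges:
        score[dst] += 1
    return score


def match_edges(edges, props, predicate):
    """A stand-in 'pattern match': edges (a, b) whose endpoint properties satisfy predicate."""
    return [(a, b) for a, b in edges if predicate(props[a], props[b])]


vertices = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "a")]

# Step 1: run the analytic and store its result as a transient vertex property.
props = {v: {"in_degree": d} for v, d in in_degree(edges, vertices).items()}

# Step 2: match a pattern against that property
# (edges leaving a vertex that has at least two incoming edges).
hub_out = match_edges(edges, props, lambda src, dst: src["in_degree"] >= 2)

# Step 3: the reverse direction. Derive a sub-graph by pattern matching,
# then run the analytic on the sub-graph only.
sub_edges = match_edges(edges, props, lambda src, dst: dst["in_degree"] == 1)
sub_vertices = sorted({v for edge in sub_edges for v in edge})
sub_scores = in_degree(sub_edges, sub_vertices)
```

In a real PGX workflow the analytic would be a built-in or Green-Marl algorithm and the pattern would be a PGQL query; the sketch only shows how the output of one step feeds the other.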