PGX 1.2.0

Detect Anomalies in a Healthcare System

In this example, we detect anomalies in medical transactions through a graph analysis procedure implemented with PGX. The result can be further investigated to identify potential fraud in the healthcare system. The example was inspired by this article from Hortonworks.

Anomaly Detection

We use a public data set of medical transactions from CMS(the United States Center for Medicare and Medicaid Services) for the year 2012. The transactions are between over 880,000 medical providers and CMS where the total amounts to more than $77B for the year. The data set can be downloaded from the CMS Website.

Here, we first show how to convert the data into a graph representation. Then we discuss how to use PGX to analyze the graph data.

Download the data set

Download the "Medicare Physician and Other Supplier PUF, CY2012 dataset" from here. Extract the file Medicare_Provider_Util_Payment_PUF_CY2012.txt into a location of your choice.

Convert the data into EDGE_LIST graph format

Copy the following script into a file named generateGraph.groovy. It will take the downloaded text file as an inputFile and generate two files as an output:

  1. outputFile is the file where the graph is going to be written in EDGE_LIST format.
  2. specialtiesOutFile is the file where the distinct specialties are going to be written. We need them later as an additional input for our analysis.
 * Copyright (C) 2013 - 2015 Oracle and/or its affiliates. All rights reserved.
"Creating Dictionaries"
readyToProcess = false
doctors = new HashMap<Integer,List>()
hcpcs = new HashSet<Integer>()
specialties = new HashSet<String>()
inputFile = "/path/to/file/Medicare_Provider_Util_Payment_PUF_CY2012.txt"
outputFile = "/path/to/file/bipartiteMedicGraph.txt"
specialtiesOutFile = "/path/to/file/specialties.txt"
npiPosition = -1
specPosition = -1
hcpcsPosition = -1

new File(inputFile).splitEachLine("\t") { fields ->
    if (readyToProcess) {
        if (!doctors.containsKey(fields[npiPosition])) {
            doctors.put(fields[npiPosition],[new HashSet<Integer>(),fields[specPosition]])
        if (!fields[hcpcsPosition].isInteger()) {
            newHcpcs = ""
            for (int i = 0; i < fields[hcpcsPosition].length(); i++) {
                if (String.valueOf(fields[hcpcsPosition].charAt(i)).isInteger()) {
                    newHcpcs = newHcpcs + fields[hcpcsPosition].charAt(i)
                    newHcpcs = newHcpcs + ((int)(fields[hcpcsPosition].charAt(i)))
            fields[hcpcsPosition] = newHcpcs
        hcpcsValue = Integer.parseInt(fields[hcpcsPosition])
    } else {
        * We are using this line because the first row is the header and in the 
        * second we see the copyright, therefore we are not going to process that
        readyToProcess = (fields.size() == 2)
        if (npiPosition < 0 || specPosition < 0 || hcpcsPosition < 0) {
            npiPosition = fields.indexOf("NPI")
            specPosition = fields.indexOf("PROVIDER_TYPE")
            hcpcsPosition = fields.indexOf("HCPCS_CODE")

"Creating files"
ln = System.getProperty("line.separator")
graphFile = new File(outputFile)

if (graphFile.exists()) {

append = true
fileWriter = new FileWriter(graphFile, append)
buffWriter = new BufferedWriter(fileWriter)

for (element in hcpcs.iterator()) {
    buffWriter.write(element + " * \"HCPCS\"" + ln)

for (key in doctors.keySet()) {
    buffWriter.write(key + " * \"" + doctors.get(key)[1] + "\"" + ln)

for (key in doctors.keySet()) {
    for (key2 in doctors.get(key)[0].iterator()) {
        buffWriter.write(key + " " + key2 + ln)


specialtiesFile = new File(specialtiesOutFile)

if (specialtiesFile.exists()) {

fileWriter = new FileWriter(specialtiesFile, append)
buffWriter = new BufferedWriter(fileWriter)

"Found " + specialties.size().toString() + " specialties, writing them to file"
for (spec in specialties.iterator()) {
    buffWriter.write(spec + ln)


Generate the graph

To run above script, change the variables inputFile, outputFile and specialtiesOutFile accordingly and then pass the path to the script as an argument to the PGX shell:

./bin/pgx generateGraph.groovy

JVM Heap size

You may need to increase the JVM default heap memory size. To do so you will need to run the following command

export JAVA_OPTS="-Xmx2048M"

We recommend to increase it up to 2 GB since the data set requires around 1.2 GB

Note that the above script converts the data set into a bipartite graph, with the left-hand side as health care providers and the right-hand side as health services. The resulting graph would look like this:


Detect Anomalies

The idea behind the anomaly detection is as follows.

  • An edge in the above graph corresponds to a specific medical provider (the left-hand side vertex) who provided a specific health service (the right-hand side vertex). There exists a two-hop (undirected) path between two medical providers, for each of the common health service that they both provide.
  • Therefore the more common services they provide, the closer the medical providers become to each other in the graph, in a sense that they are connected through many short (two-hop) paths.
  • On the other hand, the data set also specifies the "specialties" of medical providers. Since medical providers of the same specialty would likely perform many common services, these vertices are expected to be closer to one another than the others.
  • If there is a medical provider who is exceptionally close to other providers with different speciality, it is considered anomalous -- the provider is performing health services typically done by providers with a different speciality.

In order to compute this closeness among vertices, we use Personalized Pagerank(PPR). Specifically, for each specialty, we compute the PPR score of all the providers by setting only vertices of the current specialty as starting vertices. The PPR score of a given vertex represents the probability that random walks from the starting vertices end at this vertex. Consequently, a high PPR score for a vertex indicates that the vertex is close to the starting vertices.

After computing PPR score per specialty, we cross-check the values to identify anomalies. That is, we look up highest ranked vertices and see if they do belong to the current specialty. If not, the provider is considered anomalous.

We present two implementations of this analysis in the following subsections. The first approach makes use of the built-in Personalized PageRank algorithm. The second one improves memory consumption and performance of the analysis by writing and compiling a custom graph algorithm with Green-Marl.


  • The anomaly detection analysis can easily be done with PGX. Especially the PGX shell allows users to perform the analysis in an interactive manner.
  • When executed on a single server-class machine, the entire analysis was finished under 5 minutes.