PGX 1.2.0
Documentation

Using PGX as a Recommendation Engine


Graph analytics is becoming popular, but you can do things with it that go far beyond what most people intuit when they think of graphs.

Obvious applications of graph analytics are things like taking your Facebook friends graph and finding people you aren't directly connected to but might know, because they're connected to a lot of your friends. In that sort of graph, vertices are homogeneous - there is only one kind: a person. And in fact, graph analytics is easily applicable to any data set where entities reference other entities of the same type.

But some of the most interesting applications of graph analytics involve data sets where vertices are heterogeneous - vertices of type A are connected to vertices of type B (with as many additional vertex types as you like). That data is just as easily represented as a graph.

The thing that graph analytics brings to the table is the ability to discover latent information implied by the relationships expressed in data. It's hard to overstate the game-changing implications of that. What it provides is an empirical way to do things that used to be solely in the realm of human intuition - whether it's judging how risky a source code commit is, or picking a movie someone would like.

One such application is exactly that - recommendation engines, using matrix factorization. For the examples here, we will use the publicly available MovieLens data set.

In this case study, we'll use Oracle Labs' PGX graph analytics engine to create a recommendation engine with very little code.

Matrix Factorization for Recommendations

Matrix factorization is a relatively simple idea - say you have a matrix of numbers,

|---|---|
| 6 | 8 |
| 4 | 2 |

you could factor that into two matrices

|---|---|    |---|---|
| 2 | 2 |    | 2 | 1 |
| 2 | 0 |    | 1 | 3 |

such that if you multiplied the two matrices together, you'd get the original matrix back (check the top-left cell: 2 × 2 + 2 × 1 = 6, and likewise for the rest).

So, say we have a matrix of movies and users, with the values being ratings, and with blanks for movies a user hasn't rated.

|---|---|---|---|---|
|   | Star Wars | Dog Day Afternoon | Blue Velvet | The Toxic Avenger |
| Alice | 1 | 1 | 5 | 4 |
| Bob | 1 | - | - | 4 |
| Jack | 5 | 1 | 3 | - |

We might actually have many thousands of users and ratings here. How would factoring a matrix like this help us?

The idea here is that every user has a few other pieces of information tied to it - say, age, and occupation. And every movie has information tied to it as well - genre, actors, year it came out. Now, one could apply some intuition and posit that, say, someone whose occupation is "engineer" has a higher probability of liking science fiction. And people have tried to write systems that hard-coded a whole lot of suppositions - but the results of that approach have never been stellar. And of course, just recommending the most popular movies is a cheap way to solve the problem, but it means everybody gets the same recommendations, even though not everybody likes the same things.

The insight is sound: users who share some characteristics, movies that share some characteristics, and a set of ratings together encode latent information that is predictive of what a user will like. Now, what if you could generically make recommendations without having to hard-code any assumptions about those attributes, or even have to know what they are? That's what this approach to recommendations offers. The analysis involved is a two-step process:

  • Discovering (really, synthesizing) an array of features - the training step. A feature might take into account anything we know about a user, the movies they rated, and their ratings - a feature might be "rates action movies highly". In practice, a feature is just a floating-point number, and training synthesizes a new vector property on both movie and user vertices. The non-intuitive thing is that a feature is neither a feature of a movie nor of a user; rather, it describes the kind of relationship users have to movies, or vice-versa. Essentially, we synthesize some shared, comparable data by looking at a large number of connections (ratings) between users and movies.
    • Each user winds up with an array of floating-point values, one per feature, representing how strongly they are associated with each feature
  • When we want to generate recommendations for a particular user, we iterate the movies (or some subset of them), multiply the user's feature array by each movie's feature array, descending-sort the movies by that number, and pick a handful of items off the top - those are the recommendations (this scoring step is sketched below). The process also works in reverse - if you had a movie and wanted to send out emails to people likely to watch it, you do the same operation starting from the movie instead of a user.

So essentially the entire process of generating recommendations for anything boils down to a bit of straightforward, if somewhat fiddly, math.
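
To make the scoring step concrete, here is a minimal sketch of it in plain Java, independent of any PGX API - the Movie class and everything else here is illustrative only: score each candidate movie by the inner product of its feature vector with the user's, descending-sort, and keep the top k.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ScoringSketch {

    // Illustrative stand-in for a movie vertex and its feature property.
    static class Movie {
        final String title;
        final double[] features;

        Movie( String title, double[] features ) {
            this.title = title;
            this.features = features;
        }
    }

    // Inner product of two feature vectors.
    static double dot( double[] a, double[] b ) {
        double sum = 0.0;
        for ( int i = 0; i < a.length; i++ ) {
            sum += a[ i ] * b[ i ];
        }
        return sum;
    }

    // Score every candidate, descending-sort by score, keep the top k.
    static List<Movie> recommend( final double[] userFeatures, List<Movie> candidates, int k ) {
        List<Movie> sorted = new ArrayList<>( candidates );
        Collections.sort( sorted, new Comparator<Movie>() {
            @Override
            public int compare( Movie a, Movie b ) {
                return Double.compare( dot( userFeatures, b.features ), dot( userFeatures, a.features ) );
            }
        } );
        return sorted.subList( 0, Math.min( k, sorted.size() ) );
    }
}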

Where matrix factorization actually enters the picture is the training step. We search - guided by some initial parameters, such as the number of features to discover and a maximum number of iterations - for two matrices which, when multiplied, approximate the original ratings matrix. One matrix has a row for each movie and a column for the strength with which that movie exhibits each feature; the other has a row for each user and the strength with which they exhibit each feature. The idea is that if you multiplied these two matrices, you'd get back something close to the original matrix.
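
In symbols (the notation here is mine, not PGX's): with R the ratings matrix, U the user-feature matrix, M the movie-feature matrix, and f the number of features, training searches for

$$ R \approx U M^{\top}, \qquad \hat{r}_{um} = \sum_{k=1}^{f} U_{uk} M_{mk} $$

where $\hat{r}_{um}$ is the estimated rating of movie m by user u - exactly the inner product the estimation step later in this article computes.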

|---|---|---|---|---|
|   | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
| Star Wars | 0.25 | 0.8 | 0.01 | 0.053 |
| Blue Velvet | 0.96 | 0.325 | 0.13 | 0.46 |
| Dog Day Afternoon | 0.21 | 0.1 | 0.87 | 0.83 |
| The Toxic Avenger | 0.64 | 0.23 | 0.1 | 0.74 |

|---|---|---|---|---|
|   | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
| Alice | 0.25 | 0.8 | 0.01 | 0.053 |
| Bob | 0.96 | 0.325 | 0.13 | 0.46 |
| Jack | 0.21 | 0.1 | 0.87 | 0.83 |

The interesting thing about this is that we don't have to care what the features are. It can certainly be useful to look at them and see what goes into them - when you read about Orwellian predictions made by insurance and credit card companies, such as determining with surprising accuracy who is likely to get divorced based on their purchases, it's likely this sort of analysis is behind it: build a system that makes accurate predictions, then look at how the features were computed. The point is to let the machine discover those relationships, rather than making guesses and encoding those guesses into the system. That's what allows this to be a generic approach to recommendation generation.

The features are generated by initializing each matrix with some values, then iterating some number of times, adjusting the values until the product of the matrices matches the original matrix to some degree of accuracy (which you can set). This is not a cheap process (although it's surprisingly fast in PGX on a decent laptop), but you do it once and then use the result for many recommendations, repeating the process periodically after new users or movies have been added to the system.
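
For intuition, here is the core adjustment for a single observed rating, sketched in plain Java - a simplified, per-rating variant of the gradient step the Green-Marl program below applies in parallel across the whole graph (the names are mine, and the real program aggregates over all of a vertex's edges before updating it):

// One gradient-descent step for one observed rating. user and movie are the
// two feature vectors, rating is the observed value, rate is the learning
// rate, and lambda is the regularization strength.
static void updateStep( double[] user, double[] movie, double rating,
                        double rate, double lambda ) {
    // Predicted rating: the inner product of the two feature vectors.
    double predicted = 0.0;
    for ( int k = 0; k < user.length; k++ ) {
        predicted += user[ k ] * movie[ k ];
    }
    double error = rating - predicted;
    for ( int k = 0; k < user.length; k++ ) {
        double u = user[ k ];
        double m = movie[ k ];
        // Nudge each vector to reduce the error, shrinking slightly toward
        // zero (lambda) so the model doesn't over-fit.
        user[ k ] = u + rate * ( error * m - lambda * u );
        movie[ k ] = m + rate * ( error * u - lambda * m );
    }
}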

Implementing A Recommendation Engine with PGX

Getting the Movie Data

The movie data we will use can be downloaded from MovieLens. It comes as a zip file that contains a set of plain-text data files. You can download a small NodeJS script from here; aim it at the folder you unpacked the MovieLens data into, and it reads the assorted files and outputs a graph in either GraphML XML or plain-text edge-list format. For brevity, I'll assume you've installed PGX, unpacked the data, and run the script, e.g.

node movie-data-parse.js /path/to/movielens/folder > movies.edges

Loading the Movie Data Into PGX

PGX has an interactive Groovy-based shell with code completion and other features, which we will use to run our recommendation engine (it also has a Java API, and Java examples are included with this article).

Assuming you have installed PGX and it is on the path, you run it in a terminal simply by typing pgx[ENTER].

But first, to load the data, we need a small JSON file that describes the schema of the data and the location of the file. Using your favorite text editor, create the following file next to the movies.edges file we just created, and name this file edges.json:

{
    "uri": "movies.edges"
    ,"format": "edge_list"
    ,"vertex_props": [
        {
            "name": "age",
            "type": "integer"
        }
        ,{
            "name": "occupation",
            "type": "string"
        }
        ,{
            "name": "title",
            "type": "string"
        }
        ,{
            "name": "user",
            "type": "boolean"
        }
        ,{
            "name": "year",
            "type": "integer"
        }
    ]
    ,"edge_props": [{
            "name": "rating"
            ,"type": "double"
        }]
    ,"separator": " "
}

This allows the vertices and their properties to have human-friendly names, and describes to PGX which properties the columns in our edge list correspond to.

Now we can run PGX and load the graph:

tim@tim ~/work/oracle/recommendations/tmp $ pgx
[WARNING] neither HADOOP_HOME nor HADOOP_CONF_DIR is set. If loading/storing from/to HDFS, Hadoop default configuration values will be used.
PGX Shell 1.2.0-SNAPSHOT
type :help for available commands
11:39:00,073 [main] INFO Ctrl$2 - >>> PGX engine running.
variables instance, session and analyst ready to use
pgx> graph = session.readGraphWithProperties("edges.json")
==> PGX Graph named 'movies' bound to PGX session '7e95713b-c01c-4932-86ef-14251e5a056f' registered at PGX Server Instance running in embedded mode
pgx> 

Implementing Training in Green-Marl

PGX includes a language, Green-Marl, which is specifically designed to make graph analytics simple to write. Under the hood, it is compiled to high-performance Java bytecode (or C code in distributed PGX). It contains primitives for common entities and operations, and lets you focus on the analytics you want to write rather than the plumbing code most languages would need. For example, initializing a vector - an array-like structure that most languages would make you loop over - looks like:

vect<double>[vector_length] Z = 0.0;
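
For comparison, a rough equivalent in plain Java needs an explicit array, plus a fill or loop for anything non-zero:

double[] z = new double[ vectorLength ]; // numeric arrays are zero-filled in Java
java.util.Arrays.fill( z, 1.0 );         // non-zero initial values need a fill or a loop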

It also has primitives for breadth- and depth-first searches and other idioms that would have to be implemented by hand in other languages.
In particular, you can pass in vertex property definitions and then compactly reference their values on individual vertices using dot notation. For example, the following procedure takes a graph and a property definition, iterates the vertices, and tests the value of that property on each:

procedure foo(G: graph, 
              is_left: nodeProp<bool>) {
   foreach (curr_node: G.nodes) {
      if (curr_node.is_left) {
         // ... do something with curr_node
      }
   }
}

With PGX, our training algorithm can be quite compact and clear. Create a file next to movies.edges and edges.json named matrix_factorization_gradient_descent.gm, with the following contents:

procedure matrix_factorization_gradient_descent(G: graph, 
              is_left: nodeProp<bool>,
              weight: edgeProp<double>,
              learning_rate: double,
              change_per_step: double,
              lambda: double,
              max_step: int,
              vector_length: int;
              dest_property: nodeProp<vect<double>[vector_length]>) {
    nodeProp<vect<double>[vector_length]> dest_property_next;
    G.dest_property = uniform();
    G.dest_property_next = _.dest_property;
    double VALUE_MAX = 5.0;
    double VALUE_MIN = 1.0;
    double rate = learning_rate;
    int counter = 0;
    while (counter < max_step) {
        double rmse = 0.0;
        foreach (curr_node: G.nodes) {
            if (curr_node.is_left) {
                vect<double>[vector_length] Z = 0.0;
                for(curr_edge: curr_node.outEdges) {
                    node opposite_node = curr_edge.toNode();
                    double weight_1 = curr_edge.weight;
                    double weight_2 = curr_node.dest_property * opposite_node.dest_property;
                    if (weight_2 > VALUE_MAX) {
                        weight_2 = VALUE_MAX;
                    } else if (weight_2 < VALUE_MIN) {
                        weight_2 = VALUE_MIN;
                    }
                    Z += ((weight_1 - weight_2) * opposite_node.dest_property - lambda * curr_node.dest_property);
                    rmse += (weight_1 - weight_2) * (weight_1 - weight_2);
                }
                curr_node.dest_property_next = curr_node.dest_property + rate * Z;
            } else {
                vect<double>[vector_length] Z = 0.0;
                for(curr_edge: curr_node.inEdges) {
                    node opposite_node = curr_edge.fromNode();
                    double weight_1 = curr_edge.weight;
                    double weight_2 = curr_node.dest_property * opposite_node.dest_property;
                    if (weight_2 > VALUE_MAX) {
                        weight_2 = VALUE_MAX;
                    } else if (weight_2 < VALUE_MIN) {
                        weight_2 = VALUE_MIN;
                    }
                    Z += (weight_1 - weight_2) * opposite_node.dest_property - lambda * curr_node.dest_property;
                    rmse += (weight_1 - weight_2) * (weight_1 - weight_2);
                }               
                curr_node.dest_property_next = curr_node.dest_property + rate * Z;
            } 
        }
        G.dest_property = _.dest_property_next;
        rmse = sqrt(rmse / (G.numEdges() * 2));
        rate *= change_per_step;
        counter++;
    }
}

What this does is:

  • Define some variables
  • Define an outer loop that will bound how many times we will iterate to do the training - this is settable by the argument max_step
  • Loop over all vertices in the graph (note that, when compiled, this loop may be parallelized), and for each vertex
  • If it is a left-side (user) vertex, iterate the edges leading from it to movies; otherwise, iterate the edges leading to it from users - that is, get hold of the vertices it is connected to, and for each one:
    • Compute or update a vector (Z) of values, with each predicted rating clamped to the possible range (in our data, ratings run from 1 to 5)
  • Set the destination property's value on each vertex
  • Update the local variables for the next iteration

Going back to our PGX session with the graph loaded, we can load this as follows:

training = session.compileProgram("matrix_factorization_gradient_descent.gm")

Implementing Recommendations in Green-Marl

The training piece is the complex part - once it has run, we can generate recommendations for any vertex in the graph fairly simply, using the features computed in the training phase. We can keep our session open (the training data we generated is not persistent) and quickly generate recommendations for different users as we like.

The estimation program is another Green-Marl program, much shorter. Put this in a file next to the others named estimate_rating.gm:

procedure compute_estimated_rating (
    G: graph, 
    user: node,
    is_user: nodeProp<bool>,
    vector_length: int,
    feature: nodeProp<vect<double>[vector_length]>;
    estimated_rating: nodeProp<double> ) {
    foreach(n: G.nodes) {
        if ( n.is_user ) {
            n.estimated_rating = 0;
        } else if ( n.hasEdgeFrom (user) ) {
             // already seen the movie
            n.estimated_rating = 0;
        } else {
            // inner product of two feature vectors
            n.estimated_rating = user.feature * n.feature;
        }
    }
}

This uses the features we synthesized in the training step to estimate a user's rating of each movie they haven't seen, based on the ratings of users like them.

Load that into our PGX session as follows:

estimation = session.compileProgram("estimate_rating.gm")

Generating Recommendations

The next step is to define a synthetic property on the graph - a property which is not part of the original graph we loaded, and which is session-scoped, to hold our feature values for each vertex:

featureProperty = graph.createVertexVectorProperty( oracle.pgx.common.types.PropertyType.DOUBLE, 20, "feature" )

Now we are ready to run the training phase, passing (in the order of the procedure's parameters) the graph, the user flag and rating properties, the learning rate (0.001), the change in learning rate per step (0.9), lambda (0.001), the maximum number of steps (100), the feature vector length (20), and the property to store the results in:

training.run(graph, graph.getVertexProperty( "user" ), graph.getEdgeProperty( "rating" ), 0.001D, 0.9D, 0.001D, 100, 20, featureProperty)

which should get you output that looks like:

==> {
  "success" : true,
  "canceled" : false,
  "executionTimeMs" : 2226,
  "returnValue" : null,
  "exception" : null
}

Next we need to create a property to store our recommendations in:

estRatingProperty = graph.createVertexProperty( oracle.pgx.common.types.PropertyType.DOUBLE, "estRating" );

Now we are ready to get some recommendations. Let's pick a random user vertex and assign it to the shell variable someUser:

someUser = graph.getVertex( 2315 )

And we can run our estimation Green-Marl procedure to compute some estimations:

estimation.run (graph, someUser, graph.getVertexProperty("user"), 20, featureProperty, estRatingProperty)

and then look at the top 10 results:

pgx> estRatingProperty.getTopKValues(10)
==> PgxVertex with ID 1567=5.920148497542082
==> PgxVertex with ID 1242=5.691110431206573
==> PgxVertex with ID 798=5.682416396766282
==> PgxVertex with ID 816=5.654641389008914
==> PgxVertex with ID 1471=5.579409110367704
==> PgxVertex with ID 1483=5.546784101675797
==> PgxVertex with ID 1569=5.531018328773964
==> PgxVertex with ID 1383=5.516017286422173
==> PgxVertex with ID 1369=5.5096872333124285
==> PgxVertex with ID 1063=5.493463480954369

Of course, it is more useful to see movie names, like so:

top = estRatingProperty.getTopKValues(10)
0.upto(9, { println top[it].getKey().getProperty("title") + " - " + top[it].getValue() })

(This uses Groovy's integer iteration syntax - 0.upto(9) - with a closure that looks up the title property on each vertex it is passed.) That gets us more useful output:

Vermont Is For Lovers - 5.920148497542082
Night Flier - 5.691110431206573
Boys Life - 5.682416396766282
Frisk - 5.654641389008914
Visitors, The (Visiteurs, Les) - 5.579409110367704
Jerky Boys, The - 5.546784101675797
Quartier Mozart - 5.531018328773964
Squeeze - 5.516017286422173
I Can't Sleep (J'ai pas sommeil) - 5.5096872333124285
Crossfire - 5.493463480954369

Exercises for the Reader

You'll notice that while we have data about movie genres and user ages and occupations, the code above does not use any of them - it simply looks for users whose ratings resemble those of the one we're generating recommendations for.

As a way to get to know Green-Marl, a useful exercise is to incorporate those properties into the training algorithm; a truly ambitious implementation would let the caller simply pass a set of property names to incorporate (hint: strings are hard unless they are enum-like).

Java Implementation

PGX also has a Java API, so below is Java code that makes the same calls we made above in the Groovy shell. To run it, you will need the PGX JAR files on the classpath.

package com.oracle.pgx.recommendations.java;

import java.io.File;
import java.util.Map;
import oracle.pgx.api.CompiledProgram;
import oracle.pgx.api.Pgx;
import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;
import oracle.pgx.api.PgxVect;
import oracle.pgx.api.PgxVertex;
import oracle.pgx.api.VertexProperty;
import oracle.pgx.common.types.IdType;
import oracle.pgx.common.types.PropertyType;
import oracle.pgx.config.Format;
import oracle.pgx.config.GraphConfig;
import oracle.pgx.config.GraphConfigBuilder;

public class Recommendations {

    public static void main( String[] args ) throws Throwable {
        String file = checkFile( args.length > 0 ? args[ 0 ]
                : "./movies.edges" );

        PgxSession session = Pgx.createSession( "recommendations" );
        GraphConfig config = new GraphConfigBuilder().forFileFormat( Format.EDGE_LIST )
                .addEdgeProperty( "rating", PropertyType.DOUBLE )
                .addVertexProperty( "age", PropertyType.INTEGER, 0 )
                .addVertexProperty( "occupation", PropertyType.STRING, "" )
                .addVertexProperty( "title", PropertyType.STRING, "" )
                .addVertexProperty( "user", PropertyType.BOOLEAN, false )
                .addVertexProperty( "year", PropertyType.INTEGER, 0 )
                .setNodeIdType( IdType.INTEGER )
                .setUri( file )
                .build();

        PgxGraph graph = session.readGraphWithProperties( config );

        // read from classpath
        CompiledProgram training = session.compileProgram( "res:/matrix_factorization_gradient_descent.gm" );

        // Create a synthetic vertex property for training values
        VertexProperty<Integer, PgxVect<Double>> featureProperty = graph
                .createVertexVectorProperty( PropertyType.DOUBLE, 20, "feature" );

        training.run( graph, graph.getVertexProperty( "user" ),
                graph.getEdgeProperty( "rating" ), 0.001D, 0.9D, 0.001D,
                100, 20, featureProperty );

        // read from classpath
        CompiledProgram estimateRating = session.compileProgram( "res:/estimate_rating.gm" );

        PgxVertex randomUser = graph.getVertex( 2315 );
        System.out.println( "Get recommendations for " + randomUser.getProperty( "age" ) + " " + randomUser.getProperty( "occupation" ) + " " + randomUser.getId() );

        VertexProperty<Integer, Double> estRatingProperty = graph
                .createVertexProperty( PropertyType.DOUBLE, "estRating" );

        estimateRating.run( graph, randomUser, graph.getVertexProperty( "user" ),
                20, featureProperty, estRatingProperty );

        for ( Map.Entry<PgxVertex<Integer>, Double> e : estRatingProperty.getTopKValues( 10 ) ) {
            System.out.println( e.getKey().getProperty( "title" ) + " - " + e.getValue() );
        }
    }

    private static String checkFile( String graphFile ) {
        File f = new File( graphFile );
        if ( !f.exists() ) {
            System.err.println( "File does not exist: " + f );
            System.exit( 1 );
        }
        if ( !f.isFile() ) {
            System.err.println( "File is not a regular file: " + f );
            System.exit( 2 );
        }
        if ( !f.canRead() ) {
            System.err.println( "File is not readable: " + f );
            System.exit( 2 );
        }
        return f.getAbsolutePath();
    }
}

Built-in recommendation support in PGX

Since version 1.2.0, PGX has the above matrix factorization algorithm built in. The Analyst class provides the following methods to perform the factorization step:

PgxFuture<MatrixFactorizationModel<ID>> matrixFactorizationGradientDescentAsync(
    BipartiteGraph graph, 
    EdgeProperty<Double> weightProperty, 
    double learningRate, 
    double changePerStep, 
    double lambda, 
    int maxStep, 
    int vectorLength)

// blocking version:
MatrixFactorizationModel<ID> matrixFactorizationGradientDescent(
    BipartiteGraph graph, 
    EdgeProperty<Double> weightProperty, 
    double learningRate, 
    double changePerStep, 
    double lambda, 
    int maxStep, 
    int vectorLength)

The returned MatrixFactorizationModel object provides methods to compute estimated ratings for a specific vertex:

PgxFuture<VertexProperty<ID, Double>> getEstimatedRatingsAsync(PgxVertex<ID> estimateRatingsFor)

// blocking version:
VertexProperty<ID, Double> getEstimatedRatings(PgxVertex<ID> estimateRatingsFor)
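
Here is a hedged sketch of how these calls might fit together in Java, using the same hyper-parameters as the hand-written version above. Note that the way I obtain a BipartiteGraph (bipartiteSubGraphFromLeftProperty, using the boolean user property to mark the left-hand vertex set) and the exact import locations are my assumptions - consult the PGX Javadoc for your version:

import java.util.Map;
import oracle.pgx.api.Analyst;
import oracle.pgx.api.BipartiteGraph;
import oracle.pgx.api.MatrixFactorizationModel;
import oracle.pgx.api.PgxGraph;
import oracle.pgx.api.PgxSession;
import oracle.pgx.api.PgxVertex;
import oracle.pgx.api.VertexProperty;

public class BuiltInRecommendations {

    static void recommend( PgxSession session, PgxGraph graph ) throws Exception {
        // Assumption: derive a bipartite view of the graph, with users on the
        // left-hand side, from the boolean 'user' vertex property.
        BipartiteGraph bipartite = graph
                .bipartiteSubGraphFromLeftProperty( graph.getVertexProperty( "user" ) );

        Analyst analyst = session.createAnalyst();
        // Same hyper-parameters as before: learning rate 0.001, change per
        // step 0.9, lambda 0.001, 100 steps, 20 features.
        MatrixFactorizationModel<Integer> model = analyst.matrixFactorizationGradientDescent(
                bipartite, bipartite.getEdgeProperty( "rating" ),
                0.001D, 0.9D, 0.001D, 100, 20 );

        // Estimate ratings for one user and print the top ten titles.
        PgxVertex<Integer> someUser = bipartite.getVertex( 2315 );
        VertexProperty<Integer, Double> estimated = model.getEstimatedRatings( someUser );
        for ( Map.Entry<PgxVertex<Integer>, Double> e : estimated.getTopKValues( 10 ) ) {
            System.out.println( e.getKey().getProperty( "title" ) + " - " + e.getValue() );
        }
    }
}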