5.7.6.2 Collaborative Filtering Overview and Examples

Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. Collaborative filtering is widely used in systems that recommend purchases based on purchases by others with similar preferences.

The following examples demonstrate SQL-based collaborative filtering analytics.

Example 5-21 Collaborative Filtering Setup and Computation

This example shows how to use SQL-based collaborative filtering, specifically using matrix factorization to recommend telephone brands to customers. This example assumes there exists a graph called "PHONES" in the database. This example graph contains customer and item vertices, and edges with a 'rating' label linking some customer vertices to other some item vertices. The rating labels have a numeric value corresponding to the rating that a specific customer (edge OUT vertex) assigned to the specified product (edge IN vertex).

The following figure shows this graph.

Figure 5-1 Phones Graph for Collaborative Filtering

Description of Figure 5-1 follows
Description of "Figure 5-1 Phones Graph for Collaborative Filtering"
The following code shows an end-to-end flow to run the SQL-based collaborative filtering algorithm, which internally uses the matrix factorization algorithm.
set serveroutput on

DECLARE
  wt_l varchar2(32);  -- working tables
  wt_r varchar2(32);
  wt_l1 varchar2(32);
  wt_r1 varchar2(32);
  wt_i varchar2(32);
  wt_ld varchar2(32);
  wt_rd varchar2(32);
  edge_tab_name    varchar2(32) := 'phonesge$';
  edge_label       varchar2(32) := 'rating';
  rating_property  varchar2(32) := '';
  iterations       integer      := 100;
  min_error        number       := 0.001;
  k                integer      := 5;
  learning_rate    number       := 0.001;
  decrease_rate    number       := 0.95;
  regularization   number       := 0.02;
  dop              number       := 2;
  tablespace       varchar2(32) := null;
  options          varchar2(32) := null; 
BEGIN

  -- prepare
  opg_apis.cf_prep(edge_tab_name,wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);
  dbms_output.put_line('working table wt_l ' || wt_l);
  dbms_output.put_line('working table wt_r ' || wt_r);
  dbms_output.put_line('working table wt_l1 ' || wt_l1);
  dbms_output.put_line('working table wt_r1 ' || wt_r1);
  dbms_output.put_line('working table wt_i ' || wt_i);
  dbms_output.put_line('working table wt_ld ' || wt_ld);
  dbms_output.put_line('working table wt_rd ' || wt_rd);

  -- compute
  opg_apis.cf(edge_tab_name,edge_label,rating_property,iterations,
              min_error,k,learning_rate,decrease_rate,regularization,dop,
              wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd,tablespace,options);
END;
/
The flow starts by creating the temporary working tables that are later on passed to the computation. At the end of computation, the example may produce the following output. Note that if no working table name is provided, the preparation step will automatically generate a temporary table name and create it. Because the temporary working table name uses the session ID, your output will probably be different.
working table wt_l    "PHONESGE$$CFL57"
working table wt_r    "PHONESGE$$CFR57"
working table wt_l1    "PHONESGE$$CFL157"
working table wt_r1    "PHONESGE$$CFR157"
working table wt_i    "PHONESGE$$CFI57"
working table wt_ld    "PHONESGE$$CFLD57"
working table wt_rd    "PHONESGE$$CFRD57"

PL/SQL procedure successfully completed.

Example 5-22 Collaborative Filtering: Validating the Intermediate Error

At the end of every computation, you can check the current error of the algorithm with the following query as long as the data in the working tables has not been already deleted. The following SQL query illustrates how to get the intermediate error of a current run of the collaborative filtering algorithm.

SELECT /*+ parallel(48) */ SQRT(SUM((w1-w2)*(w1-w2) + 
              <regularization>/2 * (err_reg_l+err_reg_r))) AS err 
  FROM <wt_i>;

Note that the regularization parameter and the working table name (parameter wt_i) should be replaced according to the values used when running the OPG_APIS.CF algorithm. In the preceding previous example, replace <regularization> with 0.02 and <wt_i> with "PHONESGE$$CFI149" as follows:

SELECT /*+ parallel(48) */ SQRT(SUM((w1-w2)*(w1-w2) + 0.02/2 * (err_reg_l+err_reg_r))) AS err 
  FROM "PHONESGE$$CFI149";

This query may produce the following output.

       ERR
----------
4.82163662

f the value of the current error is too high or if the predictions obtained from the matrix factorization results of the collaborative filtering are not yet useful, you can run more iterations of the algorithm, by reusing the working tables and the progress made so far. The following example shows how to make predictions using the SQL-based collaborative filtering.

Example 5-23 Collaborative Filtering: Making Predictions

The result of the collaborative filtering algorithm is stored in the tables wt_l and wt_r, which are the two factors of a matrix product. These matrix factors should be used when making the predictions of the collaborative filtering.

In a typical flow of the algorithm, the two matrix factors can be used to make the predictions before calling the OPG_APIS.CF_CLEANUP procedure, or they can be copied and persisted into other tables for later use. The following example demonstrates the latter case:

DECLARE
  wt_l varchar2(32);  -- working tables
  wt_r varchar2(32);
  wt_l1 varchar2(32);
  wt_r1 varchar2(32);
  wt_i varchar2(32);
  wt_ld varchar2(32);
  wt_rd varchar2(32);
  edge_tab_name    varchar2(32) := 'phonesge$';
  edge_label       varchar2(32) := 'rating';
  rating_property  varchar2(32) := '';
  iterations       integer      := 100;
  min_error        number       := 0.001;
  k                integer      := 5;
  learning_rate    number       := 0.001;
  decrease_rate    number       := 0.95;
  regularization   number       := 0.02;
  dop              number       := 2;
  tablespace       varchar2(32) := null;
  options          varchar2(32) := null; 
BEGIN

  -- prepare
  opg_apis.cf_prep(edge_tab_name,wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);

  -- compute
  opg_apis.cf(edge_tab_name,edge_label,rating_property,iterations,
              min_error,k,learning_rate,decrease_rate,regularization,dop,
              wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd,tablespace,options);
  
  -- save only these two tables for later predictions
  EXECUTE IMMEDIATE 'CREATE TABLE customer_mat AS SELECT * FROM ' || wt_l;
  EXECUTE IMMEDIATE 'CREATE TABLE item_mat AS SELECT * FROM ' || wt_r;

  -- cleanup
  opg_apis.cf_cleanup('phonesge$',wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);
END;
/

This example will produce the only the following output.

PL/SQL procedure successfully completed.

Now that the matrix factors are saved in the tables customer_mat and item_mat, you can use the following query to check the "error" (difference) between the real values (those values that previously existed in the graph as 'ratings') and the estimated predictions (the result of the matrix multiplication in a certain customer row and item column).

Note that the following query is customized with a join on the vertex table in order return an NVARCHAR property of the vertices (for example, the name property) instead of a numeric ID. This query will return all the predictions for every single customer vertex to every item vertex in the graph.

SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer, 
                           MIN(vertex2.v) AS item, 
                           MIN(edges.vn) AS real, 
                           SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges, 
      CUSTOMER_MAT l, 
      ITEM_MAT r, 
      PHONESVT$ vertex1,   
      PHONESVT$ vertex2
WHERE l.k = r.k
  AND l.c = edges.svid(+)
  AND r.p = edges.dvid(+)
  AND l.c = vertex1.vid
  AND r.p = vertex2.vid
GROUP BY l.c, r.p
ORDER BY l.c, r.p  -- This order by clause is optional
;

This query may produce an output similar to the following (some rows are omitted for brevity).

CUSTOMER	ITEM		REAL	PREDICTED
------------------------------------------------
Adam		Apple 	 	5	3.67375703
Adam		Blackberry		3.66079652
Adam		Danger			2.77049596
Adam		Ericsson		4.21764858
Adam		Figo			3.10631337
Adam		Google		4 	4.42429022
Adam		Huawei		3	3.4289115
Ben		Apple	  	   	2.82127589
Ben		Blackberry	2	2.81132282
Ben		Danger		3	2.12761307
Ben		Ericsson	3   	3.2389595
Ben		Figo	  	    	2.38550534
Ben		Google		   	3.39765075
Ben		Huawei		    	2.63324582
...
Don		Apple		    	1.3777496
Don		Blackberry	1 	1.37288909
Don		Danger		1 	1.03900439
Don		Ericsson	   	1.58172236
Don		Figo		1	1.16494421
Don		Google			1.65921807
Don		Huawei		1	1.28592648
Erik		Apple		3	2.80809351
Erik		Blackberry	3	2.79818695
Erik		Danger			2.11767182
Erik		Ericsson	3 	3.2238255
Erik		Figo			2.3743591
Erik		Google		3	3.38177526
Erik		Huawei		3	2.62094201

If you want to check only some rows to decide whether the prediction results are ready or more iterations of the algorithm should be run, the previous query can be wrapped in an outer query. The following example will select only the first 11 results.

SELECT /*+ parallel(48) */ * FROM (
SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer, 
                           MIN(vertex2.v) AS item, 
                           MIN(edges.vn) AS real, 
                           SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges, 
     CUSTOMER_MAT l, 
     ITEM_MAT r, 
     PHONESVT$ vertex1,   
     PHONESVT$ vertex2
WHERE l.k = r.k
  AND l.c = edges.svid(+)
  AND r.p = edges.dvid(+)
  AND l.c = vertex1.vid
  AND r.p = vertex2.vid
GROUP BY l.c, r.p
ORDER BY l.c, r.p
) WHERE rownum <= 11;

This query may produce an output similar to the following.

CUSTOMER	ITEM		REAL	PREDICTED
------------------------------------------------
Adam		Apple 	 	5	3.67375703
Adam		Blackberry		3.66079652
Adam		Danger			2.77049596
Adam		Ericsson		4.21764858
Adam		Figo			3.10631337
Adam		Google		4 	4.42429022
Adam		Huawei		3	3.4289115
Ben		Apple	  	   	2.82127589
Ben		Blackberry	2	2.81132282
Ben		Danger		3	2.12761307
Ben		Ericsson	3   	3.2389595

To get a prediction for a specific vertex (customer, item, or both) the query can be restricted with the desired ID values. For example, to get the predicted value of vertex 1 (customer) and vertex 105 (item), you can use the following query.

SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer, 
                           MIN(vertex2.v) AS item, 
                           MIN(edges.vn) AS real, 
                           SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges, 
     CUSTOMER_MAT l, 
     ITEM_MAT r, 
     PHONESVT$ vertex1,   
     PHONESVT$ vertex2
WHERE l.k = r.k
  AND l.c = edges.svid(+)
  AND r.p = edges.dvid(+)
  AND l.c = vertex1.vid 
  AND vertex1.vid = 1 /* Remove to get all predictions for item 105 */
  AND r.p = vertex2.vid 
  AND vertex2.vid = 105 /* Remove to get all predictions for customer 1 */
                        /* Remove both lines to get all predictions */
GROUP BY l.c, r.p
ORDER BY l.c, r.p;

This query may produce an output similar to the following.

CUSTOMER	ITEM		REAL	PREDICTED
------------------------------------------------
Adam		Ericsson		4.21764858