5.7.6.2 Collaborative Filtering Overview and Examples
Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. Collaborative filtering is widely used in systems that recommend purchases based on purchases by others with similar preferences.
The following examples demonstrate SQL-based collaborative filtering analytics.
Example 5-21 Collaborative Filtering Setup and Computation
This example shows how to use SQL-based collaborative filtering, specifically using matrix factorization to recommend telephone brands to customers. This example assumes there exists a graph called "PHONES" in the database. This example graph contains customer and item vertices, and edges with a 'rating' label linking some customer vertices to other some item vertices. The rating labels have a numeric value corresponding to the rating that a specific customer (edge OUT vertex) assigned to the specified product (edge IN vertex).
The following figure shows this graph.
Figure 5-1 Phones Graph for Collaborative Filtering
Description of "Figure 5-1 Phones Graph for Collaborative Filtering"
set serveroutput on
DECLARE
wt_l varchar2(32); -- working tables
wt_r varchar2(32);
wt_l1 varchar2(32);
wt_r1 varchar2(32);
wt_i varchar2(32);
wt_ld varchar2(32);
wt_rd varchar2(32);
edge_tab_name varchar2(32) := 'phonesge$';
edge_label varchar2(32) := 'rating';
rating_property varchar2(32) := '';
iterations integer := 100;
min_error number := 0.001;
k integer := 5;
learning_rate number := 0.001;
decrease_rate number := 0.95;
regularization number := 0.02;
dop number := 2;
tablespace varchar2(32) := null;
options varchar2(32) := null;
BEGIN
-- prepare
opg_apis.cf_prep(edge_tab_name,wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);
dbms_output.put_line('working table wt_l ' || wt_l);
dbms_output.put_line('working table wt_r ' || wt_r);
dbms_output.put_line('working table wt_l1 ' || wt_l1);
dbms_output.put_line('working table wt_r1 ' || wt_r1);
dbms_output.put_line('working table wt_i ' || wt_i);
dbms_output.put_line('working table wt_ld ' || wt_ld);
dbms_output.put_line('working table wt_rd ' || wt_rd);
-- compute
opg_apis.cf(edge_tab_name,edge_label,rating_property,iterations,
min_error,k,learning_rate,decrease_rate,regularization,dop,
wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd,tablespace,options);
END;
/
The flow starts by creating the temporary working tables that are later on passed to the computation. At the end of computation, the example may produce the following output. Note that if no working table name is provided, the preparation step will automatically generate a temporary table name and create it. Because the temporary working table name uses the session ID, your output will probably be different.working table wt_l "PHONESGE$$CFL57" working table wt_r "PHONESGE$$CFR57" working table wt_l1 "PHONESGE$$CFL157" working table wt_r1 "PHONESGE$$CFR157" working table wt_i "PHONESGE$$CFI57" working table wt_ld "PHONESGE$$CFLD57" working table wt_rd "PHONESGE$$CFRD57" PL/SQL procedure successfully completed.
Example 5-22 Collaborative Filtering: Validating the Intermediate Error
At the end of every computation, you can check the current error of the algorithm with the following query as long as the data in the working tables has not been already deleted. The following SQL query illustrates how to get the intermediate error of a current run of the collaborative filtering algorithm.
SELECT /*+ parallel(48) */ SQRT(SUM((w1-w2)*(w1-w2) +
<regularization>/2 * (err_reg_l+err_reg_r))) AS err
FROM <wt_i>;
Note that the regularization parameter and the working table name (parameter wt_i
) should be replaced according to the values used when running the OPG_APIS.CF algorithm. In the preceding previous example, replace <regularization>
with 0.02 and <wt_i>
with "PHONESGE$$CFI149" as follows:
SELECT /*+ parallel(48) */ SQRT(SUM((w1-w2)*(w1-w2) + 0.02/2 * (err_reg_l+err_reg_r))) AS err
FROM "PHONESGE$$CFI149";
This query may produce the following output.
ERR ---------- 4.82163662
f the value of the current error is too high or if the predictions obtained from the matrix factorization results of the collaborative filtering are not yet useful, you can run more iterations of the algorithm, by reusing the working tables and the progress made so far. The following example shows how to make predictions using the SQL-based collaborative filtering.
Example 5-23 Collaborative Filtering: Making Predictions
The result of the collaborative filtering algorithm is stored in the tables wt_l
and wt_r
, which are the two factors of a matrix product. These matrix factors should be used when making the predictions of the collaborative filtering.
In a typical flow of the algorithm, the two matrix factors can be used to make the predictions before calling the OPG_APIS.CF_CLEANUP procedure, or they can be copied and persisted into other tables for later use. The following example demonstrates the latter case:
DECLARE
wt_l varchar2(32); -- working tables
wt_r varchar2(32);
wt_l1 varchar2(32);
wt_r1 varchar2(32);
wt_i varchar2(32);
wt_ld varchar2(32);
wt_rd varchar2(32);
edge_tab_name varchar2(32) := 'phonesge$';
edge_label varchar2(32) := 'rating';
rating_property varchar2(32) := '';
iterations integer := 100;
min_error number := 0.001;
k integer := 5;
learning_rate number := 0.001;
decrease_rate number := 0.95;
regularization number := 0.02;
dop number := 2;
tablespace varchar2(32) := null;
options varchar2(32) := null;
BEGIN
-- prepare
opg_apis.cf_prep(edge_tab_name,wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);
-- compute
opg_apis.cf(edge_tab_name,edge_label,rating_property,iterations,
min_error,k,learning_rate,decrease_rate,regularization,dop,
wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd,tablespace,options);
-- save only these two tables for later predictions
EXECUTE IMMEDIATE 'CREATE TABLE customer_mat AS SELECT * FROM ' || wt_l;
EXECUTE IMMEDIATE 'CREATE TABLE item_mat AS SELECT * FROM ' || wt_r;
-- cleanup
opg_apis.cf_cleanup('phonesge$',wt_l,wt_r,wt_l1,wt_r1,wt_i,wt_ld,wt_rd);
END;
/
This example will produce the only the following output.
PL/SQL procedure successfully completed.
Now that the matrix factors are saved in the tables customer_mat and item_mat, you can use the following query to check the "error" (difference) between the real values (those values that previously existed in the graph as 'ratings') and the estimated predictions (the result of the matrix multiplication in a certain customer row and item column).
Note that the following query is customized with a join on the vertex table in order return an NVARCHAR property of the vertices (for example, the name property) instead of a numeric ID. This query will return all the predictions for every single customer vertex to every item vertex in the graph.
SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer,
MIN(vertex2.v) AS item,
MIN(edges.vn) AS real,
SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges,
CUSTOMER_MAT l,
ITEM_MAT r,
PHONESVT$ vertex1,
PHONESVT$ vertex2
WHERE l.k = r.k
AND l.c = edges.svid(+)
AND r.p = edges.dvid(+)
AND l.c = vertex1.vid
AND r.p = vertex2.vid
GROUP BY l.c, r.p
ORDER BY l.c, r.p -- This order by clause is optional
;
This query may produce an output similar to the following (some rows are omitted for brevity).
CUSTOMER ITEM REAL PREDICTED ------------------------------------------------ Adam Apple 5 3.67375703 Adam Blackberry 3.66079652 Adam Danger 2.77049596 Adam Ericsson 4.21764858 Adam Figo 3.10631337 Adam Google 4 4.42429022 Adam Huawei 3 3.4289115 Ben Apple 2.82127589 Ben Blackberry 2 2.81132282 Ben Danger 3 2.12761307 Ben Ericsson 3 3.2389595 Ben Figo 2.38550534 Ben Google 3.39765075 Ben Huawei 2.63324582 ... Don Apple 1.3777496 Don Blackberry 1 1.37288909 Don Danger 1 1.03900439 Don Ericsson 1.58172236 Don Figo 1 1.16494421 Don Google 1.65921807 Don Huawei 1 1.28592648 Erik Apple 3 2.80809351 Erik Blackberry 3 2.79818695 Erik Danger 2.11767182 Erik Ericsson 3 3.2238255 Erik Figo 2.3743591 Erik Google 3 3.38177526 Erik Huawei 3 2.62094201
If you want to check only some rows to decide whether the prediction results are ready or more iterations of the algorithm should be run, the previous query can be wrapped in an outer query. The following example will select only the first 11 results.
SELECT /*+ parallel(48) */ * FROM (
SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer,
MIN(vertex2.v) AS item,
MIN(edges.vn) AS real,
SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges,
CUSTOMER_MAT l,
ITEM_MAT r,
PHONESVT$ vertex1,
PHONESVT$ vertex2
WHERE l.k = r.k
AND l.c = edges.svid(+)
AND r.p = edges.dvid(+)
AND l.c = vertex1.vid
AND r.p = vertex2.vid
GROUP BY l.c, r.p
ORDER BY l.c, r.p
) WHERE rownum <= 11;
This query may produce an output similar to the following.
CUSTOMER ITEM REAL PREDICTED ------------------------------------------------ Adam Apple 5 3.67375703 Adam Blackberry 3.66079652 Adam Danger 2.77049596 Adam Ericsson 4.21764858 Adam Figo 3.10631337 Adam Google 4 4.42429022 Adam Huawei 3 3.4289115 Ben Apple 2.82127589 Ben Blackberry 2 2.81132282 Ben Danger 3 2.12761307 Ben Ericsson 3 3.2389595
To get a prediction for a specific vertex (customer, item, or both) the query can be restricted with the desired ID values. For example, to get the predicted value of vertex 1 (customer) and vertex 105 (item), you can use the following query.
SELECT /*+ parallel(48) */ MIN(vertex1.v) AS customer,
MIN(vertex2.v) AS item,
MIN(edges.vn) AS real,
SUM(l.v * r.v) AS predicted
FROM PHONESGE$ edges,
CUSTOMER_MAT l,
ITEM_MAT r,
PHONESVT$ vertex1,
PHONESVT$ vertex2
WHERE l.k = r.k
AND l.c = edges.svid(+)
AND r.p = edges.dvid(+)
AND l.c = vertex1.vid
AND vertex1.vid = 1 /* Remove to get all predictions for item 105 */
AND r.p = vertex2.vid
AND vertex2.vid = 105 /* Remove to get all predictions for customer 1 */
/* Remove both lines to get all predictions */
GROUP BY l.c, r.p
ORDER BY l.c, r.p;
This query may produce an output similar to the following.
CUSTOMER ITEM REAL PREDICTED ------------------------------------------------ Adam Ericsson 4.21764858
Parent topic: SQL-Based Property Graph Analytics