Before You Begin

This tutorial shows you how to register a vector embedding model in Oracle Analytics, create a data flow to compare data using the similarity analysis step, and visualize the data flow's output dataset.

Background

The similarity analysis process determines how similar two or more items are to each other. You can perform similarity analysis on a dataset containing documents, products, user profiles, data points, events, and other items. For example, you might have data that shows details about auto engine failure, by using similarity analysis you can find out if these failures are alike. Did the failures occur on the same make and model, the same manufacturing year, the same engine size and other similar or different details?

Similarity analysis uses vector embeddings which are numerical representations that capture the semantic meaning of your data. A vector embedding model assigns numerical values to each element of your data. The closer two data points are to each other, the more similar the underlying content. In Oracle Analytics, you use a vector search to identify and rank records based on the proximity to a source record.

This tutorial uses a dataset of loan borrowers on an Oracle Autonomous Data Warehouse 23ai instance to demonstrate the steps required to perform similarity analysis. If you are performing similarity analysis, your dataset must reside on the Oracle Database 23ai or Oracle Autonomous Data Warehouse 23ai database where your vector embedding model is installed.

You can create a workbook with the original dataset, for example, the Simanalysis_loan_demo used for similarity analysis and add the dataset output, for example, the tutorial_simana_loan_ds to create additional visualizations and insights.

What Do You Need?

  • Access to Oracle Analytics
  • Create a connection to Oracle Database 23ai or Oracle Autonomous Data Warehouse 23ai with an installed vector embedding model and the dataset to analyze

Register the Vector Embedding Model

In this section, you register the vector embedding model installed on Oracle Database 23ai or Oracle Autonomous Data Warehouse 23ai in Oracle Analytics.

  1. On your home page, click Page Menu, select Register Model/Function, and then select Vector Embedding Models.
  2. In Register a Vector Embedding Model, click the 23ai connection containing your vector embedding model.
  3. In Select Vector Embedding Model, click a model. Keep the default name or enter a Name.


    Description of select_vector_model.png follows
    Description of the illustration select_vector_model.png
  4. From Vector Distance, select a function or use the default Cosine function.


    Description of vector_model_info.png follows
    Description of the illustration vector_model_info.png
  5. Click Register.

Create a Data Flow

In this section, you create a data flow to perform the similarity analysis. Select ten to fifteen columns and include a unique identifier column from the dataset. Using more than 15 columns increases the time required for the dataflow to complete.

  1. On your home page, click Create and click Data Flow.
  2. In Add Data, click the dataset to convert to numerical vectors and click Add.


    This tutorial uses the Simanalysis_loan_demo dataset. A green check selected column icon next to the column name indicates that these columns are selected for analysis.

    Description of simana_loan_ds.png follows
    Description of the illustration simana_loan_ds.png
  3. In the Data Flow, review columns in the dataset. Under Name, click the checkbox next to columns that you don't want to include in the similarity analysis and click Remove.


    Description of remove_columns.png follows
    Description of the illustration remove_columns.png
  4. Click Add step Add step icon on the dataset node and select Similarity Analysis Similarity Analysis step icon. In Select Vector Embedding Model, click the model to use and click OK.

Specify the Parameters

In this section, you select the values to use as the Similarity Analysis parameters. This tutorial uses Top to find the 100 records that are the closest matches to the source Member ID and uses Member ID as the unique identifier value in Reference Column1.

  1. Collapse the Outputs section. .
  2. In the Parameters section, click Select a value for the Source. From Available Data, click Member_ID and click a value.
  3. In Top (closest) or Bottom (furthest), select a value. In Number of Results, enter a value.
  4. In Reference Column1, click Select a column. From Available Data, click the unique identifier. In this dataset Member_ID is the unique identifier.


    Description of parameters.png follows
    Description of the illustration parameters.png

Specify the Outputs

In this section, you specify the columns to include in the data flow's output dataset. In this tutorial, Reference Column2 and Reference Column3 aren't used as Parameters and aren't selected as Outputs. In the output dataset, distance represents the value between source record and similar records.

When selecting outputs, include Profile Expression, which is a concatenated string of all the columns and their values used in similarity analysis.

  1. Collapse Parameters and expand Outputs.
  2. Click the checkbox next to these columns:
    • source_value
    • source_reference_col1
    • result_reference_col1
    • distance
    • profile_expression


    Description of outputs.png follows
    Description of the illustration outputs.png
  3. Click Add Step Add step icon on the Similarity Analytics node and select Save Data.
  4. In Save Data, enter a Name in Dataset.
  5. In Table, enter a name. You can use the same name as the dataset.
  6. In the distance row under Columns, click Sum from Default Aggregate and click Average.


    Description of save_data_step.png follows
    Description of the illustration save_data_step.png
  7. Click Save Save icon. In Save Data Flow as, enter a name, and click OK.
  8. Click Run Data Flow Run Data Flow icon.
  9. Click Go back Go back icon.

Visualize the Similarity Analysis Dataset

In this section, you create a table visualization to show the dataset resulting from running the data flow.

Use the History tab in the data flow to check the run's status. When the data flow run completes, you can visualize the dataset.

  1. On your home page, click Create and click Create Workbook.
  2. In Add Data, click the similarity analysis dataset and click Add to Workbook.
  3. In the Data pane, hold down the Ctrl key and select source_value, result_reference_col1, and distance. Right-click, select Pick Visualization and select Table.
  4. Right-click the distance column, select Sort By, select Custom. In Sort By, select distance, select from Low to High, and then click OK.


    Description of table_closest_records.png follows
    Description of the illustration table_closest_records.png
  5. Click Save Save icon.

Explore the Results

In this section, you create additional visualizations and delve into the details of the similarity analysis results.

  1. In the Data pane, select distance and drag it to the workbook's filter bar.
  2. In the distance filter dialog, click Top Bottom. In the Method row, click Top to select Bottom. In the Count row, change the value to 20 and click outside of the dialog.
  3. In the table visualization's toolbar, click Menu visualization menu icon, select Edit, and then select Duplicate Visualization. In the second table visualization, click Change Visualization Type Change Vis type icon in the visualization toolbar and click Radar Line Radar line vis icon.


    Description of radar_line_vis.png follows
    Description of the illustration radar_line_vis.png
  4. In the Data pane, select distance and drag it to Color in the Grammar panel. In Color, click Drop Target Options drop target options icon and select Manage Assignments.
  5. In Manage Color Assignments, click PresetsPresets icon next to distance, select Magma, and then click Done.
  6. Click Properties Properties icon. In Properties, click Axis. In the Values Axis, click On in Labels to change to Off disabling axis labels.


    Description of updated_radar.png follows
    Description of the illustration updated_radar.png

Visualize the Details

In this section, you create a calculation to work with visualizing the profile_expression column.

  1. In the Data pane, right-click My Calculations and select Create Calculation.
  2. In Create Calculation, enter Profile Exp in Name.
  3. In the expression field, enter Cas and select CAST. In the expression variable, enter pro and select profile_expression. In type, enter varchar(4000).

    Your expression should look like CAST(profile_expression AS varchar(4000)).

  4. Click Validate. Click Save.
  5. In the Data pane, hold down the Ctrl key, select source_value, result_reference_col1, distance, and Profile Exp. Right-click, select Pick Visualization and select Table.
  6. Right-click the distance column, select Sort By, select Custom. In Sort By, select distance, select from Low to High, and then click OK.


    Description of pro_expression_details.png follows
    Description of the illustration pro_expression_details.png

Learn More