Integrate OCI with OpenSearch and LangChain within OCI Data Science Notebook

Introduction

OpenSearch has seen rapid adoption in recent years, particularly as companies increasingly embrace large language models (LLMs) to drive data insights and leverage intelligent search capabilities for their custom business use cases. OpenSearch’s continuous commitment to delivering seamless integration with the latest AI/ML features has allowed organizations to build powerful observability, search, and analytics engines, which are critical as LLM usage grows. With Oracle Cloud Infrastructure Search with OpenSearch (OCI with OpenSearch), there is already noticeable enthusiasm to leverage the latest AI/ML features, including the Generative AI (GenAI) assistant and Learning to Rank, to solve strategic problems.

The OCI with OpenSearch 2.11 release provided console and Command Line Interface (CLI) functionality for performing hybrid search, semantic search, and conversational search with retrieval-augmented generation (RAG). However, some of the critical pain points we gathered from customers revolved around configuring conversational search workflows and RAG pipelines, and, more importantly, the complexity of pre-processing and ingesting large volumes of structured or unstructured data from numerous formats such as PDF, CSV, Excel, Word, data lakes, and so on, and making it accessible to LLM-driven applications that require both speed and precision in retrieving information. Integrating LangChain with OCI with OpenSearch has therefore become crucial, as it facilitates efficient query handling and RAG, empowering LLMs with more contextual and accurate responses.

LangChain provides an easy-to-use framework for creating and managing complex LLM workflows, allowing developers to streamline the integration between LLMs and OpenSearch. This combination enhances search relevancy by allowing language models to retrieve the most pertinent information from vast data stores, improving response quality for end-users.

Additionally, our integration with OCI Data Science and OCI Generative AI can significantly speed up the development and production deployment of enterprise AI use cases with minimal effort and code. OCI Data Science provides a comprehensive suite of tools for data scientists to build, train, and deploy models with minimal friction, automating much of the tedious work associated with data preparation and model lifecycle management. By integrating OCI Data Science with OCI with OpenSearch and LangChain, we can efficiently operationalize AI and leverage advanced data insights with fewer resources.

For readability, this first tutorial focuses on setting up OCI with OpenSearch and configuring communication with the OCI Data Science service, then configuring LangChain and leveraging its integration with OCI with OpenSearch to perform semantic search. The next tutorial will focus on how to leverage the seamless integration between LangChain, OCI with OpenSearch, Oracle Cloud Infrastructure Data Science AI Quick Actions (AI Quick Actions), and the OCI Generative AI service to speed up LLM application development.

Objectives

Task 1: Create an OpenSearch Cluster and Set up OCI Data Science Notebook for your Tenancy

Setting up OCI Data Science Notebook for your tenancy requires a few steps, which include creating and configuring a VCN with internet access, setting up groups and dynamic groups, and granting the necessary Oracle Cloud Infrastructure Identity and Access Management (OCI IAM) policies to allow OCI Data Science to manage specific resources in your tenancy. For simplicity, we have compiled all the steps here: 1_Setup_Data_Science_Notebook_In_Your_Tenancy.ipynb.

The OCI Data Science team also has detailed documentation on all the features they offer. For more information, see Overview of Data Science.

Note: If you already have an OpenSearch cluster and have already configured a notebook session for your tenancy, you can skip to the LangChain integration part of this tutorial and try out the semantic search use case.

It is important to note that creating a new OCI Data Science project itself does not spin up additional compute resources; each notebook session you create under a data science project spins up its own compute resources.

Task 1.1: Configure a Virtual Cloud Network (VCN)

Create a VCN within your tenancy, ensuring proper subnets and security lists are set up to allow secure communication for data science operations.

Preferably, use the wizard and click Create a VCN with Internet Connectivity.

Note: One very crucial step is to make sure to add the appropriate ingress and egress rules to the private subnet of your VCN. The absence of these rules might prevent the notebook from having internet access, which in turn will prevent you from installing critical libraries and dependencies. If you already have an existing VCN, you can simply edit the security list of the private subnet to add the following ingress rules.

Figure 1.a: VCN configuration wizard

Figure 1.b: Configure Ingress Rules for Private Subnet
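For reference, security list rules can also be added with the OCI CLI or API, where they are expressed as JSON. The fragment below is purely illustrative: the source CIDR and port 9200 (the OpenSearch API port used later in this tutorial) are assumptions you should adapt to your own VCN layout and the ports your cluster actually uses.

```json
[
  {
    "protocol": "6",
    "source": "10.0.0.0/16",
    "isStateless": false,
    "description": "Illustrative rule: allow OpenSearch API traffic within the VCN",
    "tcpOptions": { "destinationPortRange": { "min": 9200, "max": 9200 } }
  }
]
```

A rule file in this shape could, for example, be supplied to the `oci network security-list update` command through its `--ingress-security-rules` parameter; check the OCI CLI reference for the exact invocation before using it.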

Task 1.2: Create an OpenSearch Cluster

You will need an OpenSearch cluster running image 2.11 or later.

If you do not already have an existing cluster, see Search and Visualize Data using Oracle Cloud Infrastructure Search with OpenSearch. Use the same VCN created in Task 1.1 to create your cluster.

Or

If you already have an existing cluster that you want to use, navigate to the private subnet of your cluster and make sure to add the ingress and egress rules described in Task 1.1 to its security list.

Task 1.3: Create Required Groups and Dynamic Groups

Define a group for your data scientists and a dynamic group for the OCI Data Science resources. This enables appropriate permissions for performing data science activities within OCI. For more information about manually configuring the required groups and dynamic groups, see Creating a Data Scientists User Group.

You can also opt to create these resources dynamically using OCI Resource Manager stacks.

  1. Go to the OCI Console, navigate to Developer Services, Resource Manager and click Stacks.

  2. Click Create stack, and follow the wizard to create a stack from the OCI Data Science template.

    Note: You will find the data science template under the Service tab when you click Select Template.

Task 1.4: Configure OCI IAM Policies for Notebook

You need to configure some OCI IAM policies to grant permissions to your groups and dynamic groups and allow OCI Data Science service to manage certain resources in your tenancy. Ensure policies cover network communication and data access permissions. For more information, see Model Deployment Policies.
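For illustration only, OCI IAM policy statements follow the shape below. The group name, dynamic group name, and compartment are placeholders, and this is not the complete required set; treat the linked documentation as authoritative.

```
Allow group <data-science-group> to manage data-science-family in compartment <compartment-name>
Allow group <data-science-group> to use virtual-network-family in compartment <compartment-name>
Allow dynamic-group <data-science-dynamic-group> to manage data-science-family in compartment <compartment-name>
```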

Task 1.5: Configure OCI IAM Policies for AI Quick Actions

If you intend to leverage the power of AI Quick Actions in your tenancy to automate and accelerate your model training, evaluation, and deployment, you need to configure a set of OCI IAM policies and dynamic groups to grant access to the right resources in your tenancy. AI Quick Actions will not work in your notebook session if you do not configure these required policies. For more information, see AI Quick Actions Policies.

Define tenancy datascience as ocid1.tenancy.oc1..aaaaaaaax5hdic7xxxxxxxxxxxxxxxxxnrncdzp3m75ubbvzqqzn3q
Endorse any-user to read data-science-models in tenancy datascience where ALL {target.compartment.name='service-managed-models'}
Endorse any-user to inspect data-science-models in tenancy datascience where ALL {target.compartment.name='service-managed-models'}
Endorse any-user to read object in tenancy datascience where ALL {target.compartment.name='service-managed-models', target.bucket.name='service-managed-models'}

It is important that you create your OCI IAM policy document in the root compartment. You can add all the required rules to the same policy document. For simplicity, you can find the set of all the OCI IAM policies for both the notebook and AI Quick Actions used in this tutorial. For more information, see data-science-iam-policies.

Note: Be sure to update the group and dynamic group names, as well as the compartment name. You can make these policies more restrictive or more open based on your design.

Task 2: Launch a Jupyter Notebook in OCI Data Science

To launch a notebook session, you will need to create a data science project; within this project you can create multiple notebook sessions. Each notebook session has its own virtual machine (VM) and can be launched independently. In each notebook session, you can deploy and manage multiple models, link an existing GitHub or Bitbucket repository or create and manage a new one, develop multiple Jupyter notebooks, build a complete application, or orchestrate your entire machine learning workflow.

  1. Go to the OCI Console, navigate to Data Science, Projects and click Create project.

    Note: For simplicity, we will be creating all resources in the same compartment. Be sure to select the same compartment where you created your VCN and OpenSearch cluster.


    Figure 2: Create a new Data Science Project

  2. Create a new notebook session.

    1. On the Project Details page, click Notebook sessions and Create notebook session.


    2. Select the desired compute shape and attach any necessary data sources. While creating the notebook session, make sure to select the correct VCN and the same private subnet you used for the OpenSearch cluster.


      Figure 3: Create a new Notebook session

  3. Click Open to launch a notebook session. You can now install required libraries, including LangChain and OpenSearch Python clients, directly from the notebook interface using pip.


    Figure 4: Launch Notebook session

Task 3: Configure LangChain with Jupyter Notebook

Install LangChain along with other libraries in your Jupyter Notebook environment or script, and leverage its seamless integration with OCI with OpenSearch, HuggingFace, the OCI Generative AI service, and more, to perform semantic search or conversational search.

Add the following code at the top of your Jupyter Notebook.

!pip install -U langchain langchain-community opensearch-py pypdf sentence-transformers oci langchain-huggingface oracle_ads

To avoid having to re-install these packages every time in each Jupyter Notebook that you create within a notebook session, it is advisable to create and activate a conda environment and install all your dependencies in it; that way, you can reuse the environment across multiple Jupyter notebooks. There are several pre-configured conda environments under Launcher. Feel free to select and install any of these preconfigured environments in a terminal, then install any additional libraries on top. Once installation is complete, set the kernel in your notebook to this active conda environment.

  1. Launch a new terminal. Go to File, New and click Terminal.


    Figure 5: Launch a new terminal

  2. Run the following command to create a conda environment.

    odsc conda install -s python_p310_any_x86_64_v1
    conda activate <environment name>
    
  3. Run the following command to install LangChain using pip in terminal.

    pip install -U oracle_ads oci langchain langchain-community opensearch-py pypdf sentence-transformers langchain-huggingface
    

Task 4: Process Documents with LangChain

One of the notable strengths of LangChain is that it offers capabilities to process large volumes of documents efficiently and with minimal coding, be it structured or unstructured data. You simply import the document processing classes most suitable for your use case and invoke the load method to process the documents. Run the following command.

import os
from langchain.document_loaders import PyPDFLoader, CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Step 1: Load PDF and CSV Documents
def load_documents():
    # Load PDF Document
    pdf_loader = PyPDFLoader("data/sample_document.pdf")
    pdf_documents = pdf_loader.load()

    # Load CSV Document
    csv_loader = CSVLoader(file_path="data/sample_data.csv")
    csv_documents = csv_loader.load()

    # Combine both sets of documents
    all_documents = pdf_documents + csv_documents
    return all_documents

# Step 2: Split the documents into smaller chunks for better processing
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )
    split_docs = text_splitter.split_documents(documents)
    return split_docs


# Load and process documents
documents = load_documents()
split_docs = split_documents(documents)
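To make the chunk_size and chunk_overlap parameters concrete, here is a minimal, library-free sketch of fixed-size splitting with overlap. RecursiveCharacterTextSplitter is smarter than this (it prefers to split on paragraph and sentence boundaries), but the underlying sliding-window idea is the same:

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    # Each chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2,500 characters of varied sample text
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print(len(chunks))                          # 3 chunks
print(chunks[0][-100:] == chunks[1][:100])  # True: the 100-character overlap
```

The overlap gives each chunk a bit of context from its neighbor, which helps retrieval when a relevant passage straddles a chunk boundary.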

Once you have installed LangChain in your notebook environment, you can leverage the OCI with OpenSearch integration with LangChain to perform semantic search far more simply than writing code from scratch against the lengthy step-by-step OpenSearch guide with multiple API calls.

  1. Use the LangChain document library to process and chunk your unstructured data as shown at the start of this task.

  2. Define an embedding model that you would like to use to automatically generate embeddings for your data during ingestion. Once again, leveraging the LangChain integration with HuggingFace, you can deploy any of the pre-trained HuggingFace models with a single line of code. All you need to do is specify the name of the embedding model you want to use. You can also use a custom fine-tuned model for this purpose. Run the following command.

    from langchain.embeddings import HuggingFaceEmbeddings
    embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
    
  3. Create a connection to OCI with OpenSearch using LangChain, specifying the index name, authentication method, and the embedding model you want to use. With this method, an index with the specified name will be created during data ingestion, or updated with new data if the index already exists. Run the following command.

    from langchain.vectorstores import OpenSearchVectorSearch
    import oci
    
    # Set up your script to use resource principal for authentication
    auth_provider = oci.auth.signers.get_resource_principals_signer()
    auth = ("username", "password")  # only needed if your cluster uses basic authentication
    AUTH_TYPE = "RESOURCE_PRINCIPAL"
    # Replace with your own OpenSearch endpoint; be sure to keep the :9200 port in the URL
    opensearch_url = "https://amaaaaaallb3......................opensearch.us-ashburn-1.oci.oraclecloud.com:9200"
    
    # Initialize OpenSearch as the vector database
    vector_db = OpenSearchVectorSearch(opensearch_url=opensearch_url,
                               index_name="<YOUR-INDEX-NAME>",
                               embedding_function=embedding_model,
                               signer=auth_provider,
                               auth_type="RESOURCE_PRINCIPAL",
                               http_auth=auth)
    
  4. You can also directly ingest your processed data chunks in bulk into your OpenSearch cluster using LangChain. The following example shows how to perform bulk ingestion in batches on a list of processed document chunks, using the tqdm library to track the progress of the data ingestion. Run the following command.

    from langchain.vectorstores import OpenSearchVectorSearch
    import oci
    from tqdm import tqdm
    
    batch_size = 100
    documents = load_documents()  # function defined in Task 4; feel free to edit or define a new one
    document_chunks = split_documents(documents)  # function defined in Task 4; feel free to edit
    index_name = "<YOUR-INDEX-NAME>"
    
    # Ingest documents in batches
    for i in tqdm(range(0, len(document_chunks), batch_size), desc="Ingesting batches"):
        batch = document_chunks[i:i + batch_size]
        # add_texts expects strings, so pass each chunk's text content
        vector_db.add_texts(texts=[chunk.page_content for chunk in batch],
                            bulk_size=batch_size,
                            embedding=embedding_model,
                            opensearch_url=opensearch_url,
                            index_name=index_name,
                            signer=auth_provider,
                            auth_type=AUTH_TYPE,
                            http_auth=("username", "password"))

    # Refresh the index so the newly ingested documents are searchable
    vector_db.client.indices.refresh(index=index_name)
    
  5. Once the data is ingested, run the following command to perform semantic search on your index.

    # Generate topK documents with scores
    query = "what can you tell me about picasso?"
    search_results = vector_db.similarity_search_with_score_by_vector(embedding_model.embed_query(query), k=5)
    
    # Iterate over the search results and print the text along with the relevance scores
    for document, score in search_results:
       print(f"Score: {score}")
       print(f"Document: {document.page_content}\n")
    

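The scores returned in step 5 measure how close each stored chunk's embedding is to the query embedding. The exact metric depends on how the index was configured, but cosine similarity is a common choice; here is a library-free sketch of the idea:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions
query_vec = [0.1, 0.3, 0.6]
doc_vecs = {"doc-a": [0.1, 0.29, 0.61], "doc-b": [0.9, 0.1, 0.0]}

# Rank documents by similarity to the query, highest first
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print(ranked[0][0])  # the vector closest in direction to the query ranks first
```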
Next Steps

The OCI with OpenSearch integration with LangChain and OCI Data Science is a game changer that will significantly speed up enterprise application development for business use cases around semantic search and LLMs. This tutorial provides a complete guide with examples for setting up OCI with OpenSearch and OCI Data Science in your tenancy, and taking advantage of LangChain to perform semantic search.

In the next tutorial: Tutorial 2: Integrate LangChain, OCI Data Science Notebook, OCI with OpenSearch and OCI Generative AI to Accelerate LLM Development for RAG and Conversational Search, we will discuss how to leverage the seamless integration between LangChain, OCI Data Science, AI Quick Actions, and the OCI Generative AI service to develop your own custom LLM application. We invite you to try OCI with OpenSearch for your enterprise business AI/ML use cases.

You can find the code in the following GitHub repo.

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.