3.2.5 GenAI Ingestion Task

Large Language Models (LLMs) can produce inaccurate responses when they do not have access to relevant and current information. Use a GenAI Ingestion task to prepare your business data for Retrieval Augmented Generation (RAG), so that later tasks can retrieve relevant context from a vector store to generate more accurate AI responses.

The GenAI Ingestion task reads content from a configured data source, splits the content into smaller chunks, generates vector embeddings by using the selected embedding model, and stores those embeddings in an Oracle Database vector table. The ingested data can then be queried by a GenAI Retrieve task or other RAG-based components.
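The read, split, embed, and store sequence described above can be pictured with a minimal Python sketch. It is not the MicroTx Workflows implementation: the function names, the character-based interpretation of chunk size, and the hash-based `toy_embedding` (which stands in for a real embedding model and carries no semantic meaning) are all assumptions made for illustration. The row layout mirrors the columns of the example vector table in the prerequisites.

```python
import hashlib
import uuid


def split_into_chunks(text, chunk_size=512, min_chunk_length_to_embed=5,
                      max_num_chunks=10000, keep_separator=True):
    """Greedily pack lines into chunks of at most chunk_size characters (assumed unit)."""
    joiner = "\n" if keep_separator else " "
    chunks, current = [], ""
    for line in text.split("\n"):
        # Hard-split any single line that is longer than chunk_size.
        parts = [line[i:i + chunk_size] for i in range(0, len(line), chunk_size)] or [""]
        for part in parts:
            candidate = part if not current else current + joiner + part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = part
    if current:
        chunks.append(current)
    # Drop chunks that are too short to be worth embedding, then cap the count.
    kept = [c for c in chunks if len(c.strip()) >= min_chunk_length_to_embed]
    return kept[:max_num_chunks]


def toy_embedding(text, dimensions=8):
    """Deterministic placeholder for an embedding model; NOT semantically meaningful."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dimensions)]


def ingest(text, dimensions=8, **chunk_options):
    """Return one row per chunk, shaped like the id/content/metadata/embedding columns."""
    rows = []
    for chunk in split_into_chunks(text, **chunk_options):
        rows.append({
            "id": str(uuid.uuid4()),
            "content": chunk,
            "metadata": {"length": len(chunk)},
            "embedding": toy_embedding(chunk, dimensions),
        })
    return rows


sample = "Refunds are issued within 14 days.\nContact support to start a return.\nOK"
for row in ingest(sample, dimensions=8, chunk_size=40):
    print(len(row["embedding"]), repr(row["content"]))
```

In this sketch a smaller `chunk_size` yields more rows and more embedding calls, which is the trade-off that the Chunk Size parameter of the task controls.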

Use this task when you want to ingest documents, text, or file content into a vector database as part of a workflow. For example, you can ingest policy documents, support articles, product manuals, loan documents, or customer records, and then use that data later to answer natural language questions.

Prerequisites

Oracle AI Vector Search, which is available in Oracle Database 26ai, is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics, rather than keywords.
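To make querying by semantics more concrete, the following Python sketch computes the standard mathematical forms of the distance metrics that the task's Distance Type parameter names (COSINE, DOT, EUCLIDEAN, MANHATTAN, and EUCLIDEAN_SQUARED) and ranks two documents against a query. The two-dimensional vectors and document names are made up for illustration. A database may report some metrics differently, for example by negating the dot product so that smaller values mean closer vectors, so treat this as a conceptual sketch and see the Oracle AI Vector Search documentation for exact definitions.

```python
import math


def dot(a, b):
    """Dot product: larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 means the vectors point the same way."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_squared(a, b):
    """Squared straight-line distance; same ordering as euclidean, no square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(euclidean_squared(a, b))


def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical embeddings: the query is close in direction to "refund policy".
documents = {"refund policy": [0.9, 0.1], "holiday schedule": [0.1, 0.9]}
query = [1.0, 0.0]

ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
print(ranked)
```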

Before you begin, ensure that you have completed the following tasks.

Create a table that uses the VECTOR data type in Oracle Database 26ai, as shown in the following example. The table must be compatible with the vector store schema used by MicroTx Workflows. Note the name of the table, because you provide it when you create the GenAI Ingestion task. The GenAI Ingestion task stores the vector embeddings in this table.
```
CREATE TABLE TEST_VECTORS (
    id VARCHAR2(36) PRIMARY KEY,
    content CLOB,
    metadata JSON,
    embedding VECTOR
);
```
For more information about creating a table, see Create Tables Using the VECTOR Data Type in Oracle AI Vector Search User's Guide.
Create a database profile for the Oracle Database instance that contains the vector table. Note the name of the profile, because you provide it when you configure the task. See Create a Database Profile.
Create an LLM definition that uses an embedding model. See Create an LLM Definition.
Identify the data source that you want to ingest. If you want to ingest a local file, upload the file to storage before using it in the task. See Upload Files to Storage.
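Because the number of dimensions that the task generates must match the vector column definition, it can help to derive the table definition from a single dimension value. The following Python sketch rebuilds the CREATE TABLE statement from the example above, optionally with an explicit dimension count. The helper name `vector_table_ddl`, the conservative identifier check, and the FLOAT32 element format are assumptions for this sketch. This topic does not state whether MicroTx Workflows expects a dimensioned or an undimensioned VECTOR column, so verify the result against your environment.

```python
import re


def vector_table_ddl(table_name, dimensions=None):
    """Return a CREATE TABLE statement for a vector table (hypothetical helper)."""
    # Conservative check (an assumption): letters, digits, and underscores only.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]{0,127}", table_name):
        raise ValueError(f"unsupported table name: {table_name!r}")
    if dimensions is None:
        vector_type = "VECTOR"  # same undimensioned column as the example above
    else:
        if int(dimensions) <= 0:
            raise ValueError("dimensions must be a positive integer")
        # FLOAT32 is an assumed element format, not taken from this topic.
        vector_type = f"VECTOR({int(dimensions)}, FLOAT32)"
    return (
        f"CREATE TABLE {table_name} (\n"
        "    id VARCHAR2(36) PRIMARY KEY,\n"
        "    content CLOB,\n"
        "    metadata JSON,\n"
        f"    embedding {vector_type}\n"
        ")"
    )


print(vector_table_ddl("TEST_VECTORS", dimensions=512))
```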

To add a GenAI Ingestion Task

Navigate to the Task tab in a workflow and view all the tasks that you can add using the Workflow Builder. See Access the Task Tab in Workflow Builder.
In the More Tasks dialog box, click GenAI Ingestion Task to add it to the workflow.
Click the task that you have added in the left pane. The Task tab in the right pane displays details about the task, such as its name and parameters. Next, let's provide details for the task.
In the Task Details group, enter the following information.
- Task Name: Mandatory. Enter a unique name for the task. The name must be between 1 and 128 characters in length. It can contain alphanumeric characters, underscores (_), and hyphens (-), but not spaces or other special characters.
- Task Reference: Mandatory. Enter a value to refer to the task within a workflow definition. This value must be unique within a workflow. The task reference value must be between 1 and 128 characters in length. It can contain alphanumeric characters, underscores (_), and hyphens (-), but not spaces or other special characters.
In the GenAI Ingestion Parameters group, provide the following information.
- Embedding Profile Name: Mandatory. Select the LLM definition that contains the embedding model configuration.
- Embedding Model: Mandatory. Select the embedding model to use for generating vector embeddings. The selected model must exist in the embedding profile.
- Data Store Profile: Mandatory. Select the database profile that connects to the Oracle Database instance where the vector table exists.
- Table Name: Mandatory. Enter the name of the vector table where MicroTx Workflows stores the generated embeddings.
- Chunk Size: Optional. Enter the target chunk size used when splitting input content. The default value is 512. Smaller chunks can improve retrieval precision but create more rows and embedding operations. Larger chunks preserve more context but can reduce retrieval precision.
- Dimensions: Optional. Enter the number of dimensions for the generated embedding vectors. The default value is 512. This value must match the embedding model output and the vector column definition in the database table.
- Min Chunk Size (chars): Optional. Enter the minimum chunk size in characters. The default value is 80 characters.
- Min Chunk Length to Embed: Optional. Enter the minimum length a chunk must have before MicroTx Workflows generates an embedding for it. The default value is 5.
- Max Num Chunks: Optional. Enter the maximum number of chunks to generate from the input content. The default value is 10000.
- Index Type: Optional. Select the type of vector index. Supported values are HNSW (Hierarchical Navigable Small World), IVF (Inverted File Flat), and NONE. The default value is HNSW.
- Distance Type: Optional. Select the distance metric used for vector similarity. Supported values are COSINE, DOT, EUCLIDEAN, MANHATTAN, and EUCLIDEAN_SQUARED. The default value is COSINE. For information about the different functions, see Vector Distance Metrics in Oracle AI Vector Search User's Guide.
- Keep Separator: Optional. Select this option to retain text separators, such as newline characters, during chunking. The default value is true.
- Data Source: Mandatory. Select the source of the content that you want to ingest.
  - WEB: Retrieve content from a web URL.
  - OCI: Retrieve content from an OCI URL.
  - LOCAL: Retrieve content from a file uploaded to local storage.
  - TEXT: Provide inline text to ingest.
- Enable Idempotency: Optional. Enable this option when you want the ingestion task to avoid duplicate ingestion for the same workflow idempotency key and task idempotency key. When enabled, MicroTx Workflows performs the vector store update and idempotency result update in a single database transaction. If ingestion fails, the transaction is rolled back. Idempotency is disabled by default.
- Idempotent Table Name: Optional. Enter the table name used to store task idempotency lock information. If you do not provide a value, MicroTx Workflows uses fenced_task_idempotency_lock.
- Idempotency Timeout (ms): Optional. Specifies how long MicroTx Workflows holds the task-level idempotency lock for this task. The lock prevents duplicate concurrent execution during retries or replays. The default is 600000 milliseconds, or 10 minutes.
Click Save to save the changes to the workflow.

MicroTx Workflows displays the changes in JSON code.
Review all the changes, and then click Confirm Save to save the changes.

If you do not want to save the changes, click Cancel, and then click Reset to undo all the changes that you have made since the workflow was last saved.

Example

When you enter information in the Task tab, the corresponding code of the task is updated in the JSON tab. The following sample shows the JSON code for a GenAI Ingestion task with sample values.

```
{
    "name": "sample_genai_ingest",
    "taskReferenceName": "genai_ingestion_ref",
    "inputParameters": {
        "embeddingModelProfile": {
            "name": "oci-cohere-embedding",
            "model": "cohere.embed-multilingual-image-v3.0"
        },
        "dataStoreProfile": "oracle-atp-123",
        "tableName": "test_vectors",
        "data": {
            "source": "local",
            "filePath": "oracle-transaction-manager-microservices-developer-guide.pdf"
        },
        "dimensions": "512",
        "chunkSize": "512",
        "minChunkSizeChars": "80",
        "minChunkLengthToEmbed": "5",
        "maxNumChunks": "10000",
        "keepSeparator": true,
        "indexType": "HNSW",
        "distanceType": "cosine"
    },
    "type": "GENAI_INGESTION"
}
```
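If you generate workflow definitions programmatically, a small helper can assemble this JSON and check the naming rules before you paste it into the JSON tab. The following Python sketch is an assumption-laden illustration: the function `build_ingestion_task` is hypothetical, the key names are copied from the sample above, and the name check reflects the rules stated for Task Name and Task Reference. It does not validate the definition against MicroTx Workflows itself.

```python
import json
import re

# 1 to 128 alphanumeric characters, underscores, or hyphens (per the naming rules).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,128}")


def build_ingestion_task(name, task_ref, profile, model, data_store_profile,
                         table_name, data, **optional):
    """Return a task definition dict shaped like the sample JSON (hypothetical helper)."""
    for label, value in (("name", name), ("taskReferenceName", task_ref)):
        if not NAME_PATTERN.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    params = {
        "embeddingModelProfile": {"name": profile, "model": model},
        "dataStoreProfile": data_store_profile,
        "tableName": table_name,
        "data": data,
    }
    # Optional settings such as chunkSize or indexType are passed through unchanged.
    params.update(optional)
    return {
        "name": name,
        "taskReferenceName": task_ref,
        "inputParameters": params,
        "type": "GENAI_INGESTION",
    }


task = build_ingestion_task(
    "sample_genai_ingest",
    "genai_ingestion_ref",
    profile="oci-cohere-embedding",
    model="cohere.embed-multilingual-image-v3.0",
    data_store_profile="oracle-atp-123",
    table_name="test_vectors",
    data={"source": "local",
          "filePath": "oracle-transaction-manager-microservices-developer-guide.pdf"},
    dimensions="512",
    chunkSize="512",
)
print(json.dumps(task, indent=4))
```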
