Process Unstructured Documents Intelligently

Processing unstructured documents can be time-consuming without intelligent automation. Take invoice PDFs as an example: you may need to extract key header information such as company name, date, invoice number, and address. You will likely also need to extract each line item, including its item number, description, quantity, unit price, and total. After extraction, this information must be posted to a system of record, such as a database or a target application like Oracle E-Business Suite, Oracle Fusion Cloud Financials, or Oracle Fusion Cloud SCM. Finally, your end users can act on the newly posted information, for example by building reports or a custom application.

You can implement this scenario by using Oracle Integration Cloud Service to orchestrate various Oracle Cloud Infrastructure (OCI) services. With Oracle Integration Cloud Service, you can connect to and integrate your systems of record, whether they run in OCI or elsewhere. OCI AI services such as OCI Document Understanding can be combined with Oracle Integration Cloud Service to support a variety of use cases.

You can apply this approach to other use cases that automate the processing of unstructured documents by using the prebuilt models for documents such as passports, driver's licenses, and receipts. Other document types can be processed by training a custom model in the OCI Document Understanding service.

Architecture

This architecture outlines how to use Oracle Integration Cloud Service to orchestrate OCI services to automate intelligent unstructured document processing.

The following diagram illustrates this reference architecture.



[Diagram: oic-process-documents-arch.zip]

The workflow of this architecture is as follows:

  1. Oracle Integration Cloud Service starts an integration that fetches new email attachments (PDFs, PNGs, JPGs, etc.) from either Microsoft Outlook or Gmail using prebuilt adapters.
  2. Attachments can be stored in Oracle Integration Cloud Service's embedded file server or in OCI Object Storage for short-term to long-term retention.
  3. OCI Document Understanding is invoked to pick up and process the newly uploaded files, returning structured JSON of the extracted key fields back to Oracle Integration Cloud Service.
  4. If the confidence score returned by OCI Document Understanding meets an acceptable threshold, the integration transforms and validates the extracted result by making additional calls to various systems or applications using any of its 100+ prebuilt adapters. Otherwise, the integration starts a process in Oracle Integration Cloud Service Process Automation to provide human-in-the-loop exception handling. As part of this process, a developer or analyst receives an email notification to review the document and either correct it before resubmitting, or manually identify the required key-value pairs so that the integration can continue.
  5. The extracted data is inserted into a system of record such as on-premises Oracle E-Business Suite through the use of OCI FastConnect and a connectivity agent, Oracle Fusion Cloud Financials over the Oracle backbone, an Oracle Autonomous Transaction Processing Database via a private endpoint, or other applications like Salesforce, SAP, and Workday.
  6. When the extracted and validated data is inserted into a private Oracle Autonomous Transaction Processing Database as part of the integration flow, you can use additional OCI capabilities to give your end users different ways to interact with the data. For instance, you could build a custom portal using Oracle APEX (a low-code platform included with Oracle Database). This portal could give business users the ability to query and update the extracted data through a custom UI.
  7. Optionally, you can connect the Oracle Autonomous Transaction Processing Database to an Oracle Analytics Cloud instance, where business users can build custom reports that uncover the most important trends in the processed documents.
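
The confidence check in step 4 can be sketched as follows. This is a minimal illustration, not the integration's actual logic: the field names, the 0.90 threshold, and the ExtractedField shape are assumptions made for the example, and the real response format of OCI Document Understanding is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical threshold; the workflow only says "an acceptable threshold".
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class ExtractedField:
    """One key field extracted from a document (hypothetical shape)."""
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def fields_needing_review(fields, threshold=CONFIDENCE_THRESHOLD):
    """Return the names of fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


def route_document(fields, threshold=CONFIDENCE_THRESHOLD):
    """Decide the next workflow step for one document.

    Returns "post" when every field meets the threshold (continue to step 5),
    otherwise "human_review" (the exception path in step 4).
    An empty extraction is treated as needing review.
    """
    if not fields or fields_needing_review(fields, threshold):
        return "human_review"
    return "post"


# Illustrative values only.
invoice = [
    ExtractedField("InvoiceId", "INV-1001", 0.98),
    ExtractedField("InvoiceDate", "2024-01-15", 0.95),
    ExtractedField("VendorName", "Example Supplies", 0.72),
]

print(route_document(invoice))         # human_review
print(fields_needing_review(invoice))  # ['VendorName']
```

Routing on the lowest-confidence field, rather than on an average, keeps a single poorly read value from being posted to the system of record unnoticed. Whether that is the right policy depends on your documents and downstream validation.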

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain shouldn't affect the other availability domains in the region.

  • Fault domains

    A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • On-premises network

    This network is the local network used by your organization. It is one of the spokes of the topology.

  • Dynamic routing gateway (DRG)

    The DRG is a virtual router that provides a path for private network traffic between VCNs in the same region, and between a VCN and a network outside the region, such as a VCN in another Oracle Cloud Infrastructure region, an on-premises network, or a network in another cloud provider.

  • Service gateway

    The service gateway provides access from a VCN to other services, such as Oracle Cloud Infrastructure Object Storage. The traffic from the VCN to the Oracle service travels over the Oracle network fabric and does not traverse the internet.

  • FastConnect

    Oracle Cloud Infrastructure FastConnect provides an easy way to create a dedicated, private connection between your data center and Oracle Cloud Infrastructure. FastConnect provides higher-bandwidth options and a more reliable networking experience when compared with internet-based connections.

  • Route table

    Virtual route tables contain rules to route traffic from subnets to destinations outside a VCN, typically through gateways.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" data that you need to access quickly and frequently. Use archive storage for "cold" data that you retain for long periods of time and rarely access.

  • Oracle Services Network

    The Oracle Services Network (OSN) is a conceptual network in Oracle Cloud Infrastructure that is reserved for Oracle services. These services have public IP addresses that you can reach over the internet. Hosts outside Oracle Cloud can access the OSN privately by using Oracle Cloud Infrastructure FastConnect or VPN Connect. Hosts in your VCNs can access the OSN privately through a service gateway.

  • Integration

    Oracle Integration is a fully managed service that allows you to integrate your applications, automate processes, gain insight into your business processes, and create visual applications.

  • Document Analysis

    Oracle Cloud Infrastructure Document Understanding is an AI service for performing deep-learning–based document analysis at scale. With prebuilt models available out of the box, developers can easily build intelligent document processing into their applications without machine learning (ML) expertise.

  • Analytics

    Oracle Analytics Cloud is a scalable and secure public cloud service that empowers business analysts with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing and generation. With Oracle Analytics Cloud, you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.

  • APEX Service

    Oracle APEX is a low-code development platform that enables you to build scalable, feature-rich, secure, enterprise apps that can be deployed anywhere that Oracle Database is installed. You don't need to be an expert in a vast array of technologies to deliver sophisticated solutions. Oracle APEX includes built-in features such as user interface themes, navigational controls, form handlers, and flexible reports that accelerate the application development process.

  • Autonomous Transaction Processing

    Oracle Autonomous Transaction Processing is a self-driving, self-securing, self-repairing database service that is optimized for transaction processing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

  • Identity and Access Management (IAM)

    Oracle Cloud Infrastructure Identity and Access Management (IAM) is the access control plane for Oracle Cloud Infrastructure (OCI) and Oracle Cloud Applications. The IAM API and the user interface enable you to manage identity domains and the resources within the identity domain. Each OCI IAM identity domain represents a standalone identity and access management solution or a different user population.

  • Logging

    Logging is a highly scalable and fully managed service that provides access to the following types of logs from your resources in the cloud:
    • Audit logs: Logs related to events emitted by the Audit service.
    • Service logs: Logs emitted by individual services such as API Gateway, Events, Functions, Load Balancing, Object Storage, and VCN flow logs.
    • Custom logs: Logs that contain diagnostic information from custom applications, other cloud providers, or an on-premises environment.

  • Audit

    The Oracle Cloud Infrastructure Audit service automatically records calls to all supported Oracle Cloud Infrastructure public application programming interface (API) endpoints as log events. Currently, all services support logging by Oracle Cloud Infrastructure Audit.
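
As an illustration of the VCN and subnets component above, the following sketch checks that a planned set of subnets falls inside the VCN's CIDR blocks and that no two subnets overlap. It uses only Python's standard ipaddress module; the function name and the CIDR values are hypothetical, IPv4 is assumed, and the sketch does not call any OCI API.

```python
import ipaddress
from itertools import combinations


def validate_subnets(vcn_cidrs, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is valid."""
    vcn_blocks = [ipaddress.ip_network(c) for c in vcn_cidrs]
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    problems = []

    # Each subnet must fall entirely inside one of the VCN's CIDR blocks.
    for subnet in subnets:
        if not any(subnet.subnet_of(block) for block in vcn_blocks):
            problems.append(f"{subnet} is outside the VCN")

    # No two subnets may share any addresses.
    for a, b in combinations(subnets, 2):
        if a.overlaps(b):
            problems.append(f"{a} overlaps {b}")

    return problems


# Hypothetical address plan.
print(validate_subnets(["10.0.0.0/16"], ["10.0.1.0/24", "10.0.2.0/24"]))
# []
print(validate_subnets(["10.0.0.0/16"],
                       ["10.0.1.0/24", "10.0.1.128/25", "192.168.0.0/24"]))
# ['192.168.0.0/24 is outside the VCN', '10.0.1.0/24 overlaps 10.0.1.128/25']
```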

Recommendations

Use the following recommendations as a starting point to implement intelligent document processing with Oracle Integration Cloud Service. Your requirements might differ from the architecture described here.

  • Restrict Access to an Oracle Integration Cloud Service Instance

    Restrict the networks that have access to your Oracle Integration Cloud Service instance by configuring an allowlist (formerly a whitelist). Only users from the specific IP addresses, classless inter-domain routing (CIDR) blocks, and virtual cloud networks that you specify can access the instance.

  • Connectivity

    When you deploy resources to OCI, you might start small, with a single connection to your on-premises network. This single connection could be through FastConnect or through IPSec VPN. To plan for redundancy, consider all the components (hardware devices, facilities, circuits, and power) between your on-premises network and OCI. Also consider diversity, to ensure that facilities are not shared between the paths.

  • Use the Connectivity Agent in High Availability Environments

    You can use the connectivity agent in high availability environments with Oracle Integration Cloud Service by installing the connectivity agent on two different hosts. The connectivity agents scale horizontally, which gives you the benefits of running multiple agents in an agent group: increased performance and failover if one agent becomes unavailable.

  • Use Private Endpoints

    A private endpoint lets your integrations connect to private resources in your virtual cloud network (VCN) without going through a connectivity agent. All traffic goes through a private channel that is set up within OCI. You can configure one private endpoint per instance.
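
To illustrate how the allowlist recommendation above behaves for IP addresses and CIDR blocks, the following sketch tests whether a client address matches any allowlist entry. It is a conceptual example only: the allowlist itself is configured on the Oracle Integration Cloud Service instance, the entries shown are documentation-range addresses chosen for illustration, and allowlist entries that refer to virtual cloud networks are not modeled.

```python
import ipaddress


def is_allowed(client_ip, allowlist):
    """Return True if client_ip matches any address or CIDR block in allowlist."""
    address = ipaddress.ip_address(client_ip)
    # A single address such as "203.0.113.10" parses as a /32 network.
    networks = (ipaddress.ip_network(entry) for entry in allowlist)
    return any(address in network for network in networks)


# Hypothetical allowlist: one CIDR block and one single address.
ALLOWLIST = ["198.51.100.0/24", "203.0.113.10"]

print(is_allowed("198.51.100.42", ALLOWLIST))  # True
print(is_allowed("203.0.113.10", ALLOWLIST))   # True
print(is_allowed("203.0.113.11", ALLOWLIST))   # False
```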

Acknowledgments

  • Authors: Nolan Trouvé, Jerry Mbamo
  • Contributor: Daryl Eicher