Build a secure OCI Data Integration environment with pre-built tasks from templates

Build secure and scalable data processing tasks from external sources to a target Oracle Autonomous Data Warehouse data store using Oracle Cloud Infrastructure Data Integration (OCI Data Integration) Service.

This reference architecture considers a scenario in which your business data is spread across on-premises data stores while your company has already begun migrating some applications to the cloud. OCI Data Integration can extend existing capabilities, whether on-premises or on other clouds, by using the network and data store connectivity available in the OCI fabric in a secure and scalable manner.

Architecture

This architecture depicts the different components that could be involved in the above scenario.

In a multicloud strategy, you may encounter technologies and data services hosted by other cloud providers; OCI provides reference architectures for connectivity to those providers. On-premises data stores span multiple technologies, from data stored in files to process-driven datasets in ERP systems.

The following diagram illustrates the reference architecture and data journey.



oci-data-integration-flow-oracle.zip

Here are the steps to safely ingest, process, and enrich data before it is stored as target information in the downstream database or lakehouse.

  1. Through Oracle Cloud Infrastructure FastConnect or Site-to-Site VPN, on-premises data sources can be ingested using OCI Data Integration Data Assets connectors.
  2. Similarly, the OCI Data Integration Data Assets connectors can pull datasets residing in other clouds (for example, custom applications, non-Oracle applications, Oracle databases running on third-party clouds, Oracle Fusion SaaS, third-party cloud services, and applications). When a source is not directly accessible through an OCI Data Integration Data Asset connector, data can also be uploaded as bulk load files into Oracle Cloud Infrastructure Object Storage buckets.

    Oracle has developed specific cloud connectivity solutions for other cloud providers such as Microsoft Azure, Amazon Web Services, and Google Cloud Platform. Where no such cloud-to-cloud interoperability exists, connectivity to services or applications can be established securely through a NAT gateway, which allows only outbound traffic to the internet. OCI mitigates data exposure on the internet by encrypting connectivity to the endpoints end to end. Beyond batch ingestion, OCI Data Integration Pipelines can orchestrate other types of data intake, such as high-volume real-time data streaming and data source replication with Oracle GoldenGate. Because pipelines can invoke the REST APIs of OCI services, they can also be combined with Events and Functions to detect file changes in OCI Object Storage buckets and ingest the data as a trickle feed.

  3. Once data is ingested into the OCI fabric, it is processed on exclusive virtual cloud networks (VCN) that can be further isolated from internet access. Through data flows, OCI Data Integration can perform multiple transformations in a code-free interface, mapping source and target entities and defining the respective transformations. As the data transformations occur, OCI Data Catalog catalogs the data to provide lineage. Data at rest in Oracle databases may be subject to privacy and compliance regulations. Oracle Data Safe evaluates the database security posture, identifies and categorizes risks, and masks information considered sensitive. Another resource for data and information safety, OCI Vault, stores and manages keys and secrets such as account information and passwords, encrypting them and simplifying the overall process of securing data.
  4. While OCI Data Integration pipelines and data flows enrich data assets within the service, REST operators provide secure access to other OCI services. In this capacity, OCI Data Integration orchestration can invoke notebooks in Data Science for machine learning or call artificial intelligence services to augment the data with Forecast or Anomaly Detection. It can also spin up Spark engines to burst large data processing jobs using OCI Data Flow within the same secure OCI fabric. All orchestration management services, such as Monitoring, Logging, and Notifications, are integrated through the same mechanism.
  5. OCI Data Integration writes to any Oracle store within OCI or on-premises, plus OCI data lake combinations and MySQL. Analytics can use the target stores immediately, with extensive resources for data visualization, business modeling, and pixel-perfect reporting.
  6. Consumers, producers, and developers of data are organized safely through fine-grained policies for data and resource access control.
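
The trickle ingestion mentioned in step 2 can be sketched as a small filter that decides whether an Object Storage event should start an ingestion task. This is a minimal illustration, not the service's own implementation: the payload layout (`eventType`, `data.resourceName`, `data.additionalDetails`) is an assumption modeled on the general shape of OCI Events messages, and the function name, the `landing/` prefix, and the returned parameter names are hypothetical. Verify the real payload in your tenancy before relying on any field.

```python
from typing import Optional

# Assumed event type for a newly created object; confirm it in your tenancy.
CREATE_OBJECT = "com.oraclecloud.objectstorage.createobject"

# File types that the hypothetical ingestion task knows how to load.
SUPPORTED_SUFFIXES = (".csv", ".json", ".parquet", ".avro")


def ingest_request(event: dict, prefix: str = "landing/") -> Optional[dict]:
    """Return task parameters for a new object, or None to ignore the event."""
    if event.get("eventType") != CREATE_OBJECT:
        return None
    data = event.get("data", {})
    details = data.get("additionalDetails", {})
    name = data.get("resourceName", "")
    # Only react to supported file types under the agreed landing prefix.
    if not name.startswith(prefix) or not name.lower().endswith(SUPPORTED_SUFFIXES):
        return None
    return {
        "namespace": details.get("namespace"),
        "bucket": details.get("bucketName"),
        "object": name,
    }


# Hypothetical sample payload used only to exercise the filter.
sample_event = {
    "eventType": CREATE_OBJECT,
    "data": {
        "resourceName": "landing/orders_2024.csv",
        "additionalDetails": {"namespace": "example_ns", "bucketName": "ingest"},
    },
}

print(ingest_request(sample_event))
```

In a real deployment, the returned parameters would be passed to whatever starts the ingestion task, for example a function that calls the task's REST endpoint. That call is deliberately left out here because its details depend on your workspace and application.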

The following architecture diagram drills further into the implementation, devising a suggested network subnet separation.



oci-data-integration-arch-oracle.zip

OCI Data Integration services provide out-of-the-box connectivity to many data sources, and micro-batches can process the data incrementally into the OCI environment. Similarly, other OCI services can be called to enrich and curate the datasets further.

  • Batch processing transforms large-scale data sets from source systems, leveraging OCI native services that seamlessly integrate with OCI Object Storage and allow you to create curated data for use cases such as data aggregation and enrichment, data warehouse ingestion, and machine learning and AI data use at scale.
  • OCI Data Integration is a fully-managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from various data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and OCI Object Storage.
  • OCI Data Integration orchestrates dependencies not only within the processing data flows but also with other Oracle Cloud Infrastructure services, such as OCI Artificial Intelligence and Oracle Machine Learning for data enrichment or further classification, and Data Safe for data security and compliance. Policies with granular access control govern service-to-service authentication and authorization.
  • OCI Data Integration Application Templates provide a set of OCI Data Integration Tasks (REST (API), SQL, Integration (data flow), and Pipelines) immediately available for usage. The tasks are fully parameterized, allowing them to be directly used. The tasks can also be saved into new projects and folders, allowing the design to be modified to accommodate further implementation details.
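
The dependency orchestration described above can be pictured as ordering tasks so that each one runs only after the tasks it depends on. The following sketch uses Python's standard `graphlib` module purely as an illustration of that idea; it does not call OCI Data Integration, and every task name in it is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_files_to_staging": set(),
    "mask_sensitive_columns": {"load_files_to_staging"},
    "classify_documents": {"load_files_to_staging"},
    "merge_into_warehouse": {"mask_sensitive_columns", "classify_documents"},
    "notify_consumers": {"merge_into_warehouse"},
}


def run_order(tasks: dict) -> list:
    """Return one valid execution order in which dependencies come first."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as exc:
        # A cycle means the pipeline can never start; report it clearly.
        raise ValueError(f"pipeline has a circular dependency: {exc.args[1]}") from exc


for step, task in enumerate(run_order(pipeline), start=1):
    print(step, task)
```

Several orders can be valid for the same graph (the masking and classification tasks are independent of each other here), so the sketch guarantees only that every task appears after its dependencies.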

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Data Integration

    Oracle Cloud Infrastructure Data Integration is a fully managed, multitenant, serverless, native cloud service that helps you with common ETL tasks such as ingesting data from different sources; cleansing, transforming, and reshaping that data; and efficiently loading it to target data sources on OCI.

    Ingestion of data from various sources (for example Amazon Redshift, Azure SQL Database, and Amazon S3) into Object Storage and Autonomous Data Warehouse is the first step in this process.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" data that you need to access quickly and frequently. Use archive storage for "cold" data that you retain for long periods of time and rarely access.

  • Data Science

    Oracle Cloud Infrastructure Data Science is a fully managed, serverless platform that data science teams can use to build, train, and manage machine learning (ML) models on Oracle Cloud Infrastructure (OCI). It can easily integrate with other OCI services such as Oracle Autonomous Data Warehouse, Oracle Cloud Infrastructure Object Storage, and more. You can build and evaluate high-quality machine learning models that increase business flexibility by putting enterprise-trusted data to work quickly, and you can support data-driven business objectives with easier deployment of ML models.

  • Oracle Machine Learning

    Oracle Machine Learning offers features to build, train, and deploy models for data in the database. Oracle Machine Learning provides a Zeppelin notebook interface that lets data scientists train models using the OML4Py Python client library. Oracle Machine Learning also offers a no-code approach to model training with the AutoML UI. The deployment of models as REST APIs can be done through Oracle Machine Learning Services. There is, however, limited support for open source software.

  • AI Services

    Oracle Cloud Infrastructure AI services provide a collection of pre-trained and customizable model APIs for use cases spanning language, vision, speech, decision, and forecasting. AI services provide model predictions that are accessible through REST API endpoints. These services provide state-of-the-art pre-trained models and should be considered and evaluated before you train custom machine learning models. Alternatively, Oracle Machine Learning services also provide a series of pre-trained models for language (topic, keywords, summary, similarity) and vision.

  • Data Safe

    Oracle Data Safe is a fully integrated, regional cloud service focused on data security. It provides a complete set of features for protecting sensitive and regulated data in Oracle databases. Data Safe also supports on-premises databases, Oracle Exadata Database Service on Cloud@Customer, and multicloud deployments. All Oracle Database customers can reduce the risk of a data breach and simplify compliance by using Oracle Data Safe to assess configuration and user risk, monitor and audit user activity, and discover, classify, and mask sensitive data.

  • Autonomous Data Warehouse

    Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.
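
The subnet rules described for the VCN component (each subnet sits inside the VCN's CIDR blocks and no two subnets overlap) can be checked before anything is provisioned. The following is a minimal sketch using Python's standard `ipaddress` module. The CIDR values and subnet names are made-up examples, and the check assumes IPv4 addresses throughout.

```python
import ipaddress
from itertools import combinations


def check_subnets(vcn_cidrs: list, subnet_cidrs: dict) -> list:
    """Return a list of problems; an empty list means the layout is consistent."""
    vcn_blocks = [ipaddress.ip_network(cidr) for cidr in vcn_cidrs]
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must fall inside at least one of the VCN's CIDR blocks.
    for name, net in subnets.items():
        if not any(net.subnet_of(block) for block in vcn_blocks):
            problems.append(f"{name} ({net}) is outside the VCN CIDR blocks")
    # No two subnets may share any address.
    for (name_a, net_a), (name_b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


# Hypothetical layout: one subnet per tier, all within a single /16 block.
good_layout = {
    "public": "10.0.0.0/24",
    "workspace": "10.0.1.0/24",
    "database": "10.0.2.0/24",
}
bad_layout = {
    "workspace": "10.0.1.0/24",
    "database": "10.0.1.128/25",   # collides with the workspace subnet
    "analytics": "192.168.0.0/24",  # not part of the VCN
}

print(check_subnets(["10.0.0.0/16"], good_layout))
for problem in check_subnets(["10.0.0.0/16"], bad_layout):
    print(problem)
```

The same helper could be extended to compare the VCN blocks against on-premises or other-cloud ranges before you set up private connections.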

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.
  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

  • OCI Data Integration templates

    Many daily management tasks can be automated by using or reusing template tasks. In addition, templates expand the data processing and management capabilities of OCI Data Integration by offering a distinct set of tasks tailored to assist data engineers. The templates provide building blocks for common use cases, such as calling Oracle Cloud Infrastructure AI Services for document classification, calling Oracle Data Safe to mask content before it is stored, and controlling and reporting on the incremental feed to Autonomous Data Warehouse.

    The following templates are currently available:

    • Oracle Object Store Management

      Application with REST tasks for Object Storage to copy, delete, and rename objects and to create and delete buckets.

    • Oracle Vision Image

      Application with REST tasks for performing OCI Vision Image Analysis. The tasks include image classification, object detection, and image text detection.

    • Oracle Vision Document

      Application with REST tasks for performing OCI Vision Document AI. The tasks include document classification, document key-value detection, document language classification, document table detection, and document text detection.

    • Oracle DataSafe Masking

      Application with parameterized tasks to generate an Oracle Data Safe sensitive data model and masking policy from a target Oracle database schema.

    • Load Files from Oracle Object Storage to ADW

      Application with tasks to load different file types from OCI Object Storage into Autonomous Data Warehouse: JSON, Parquet, CSV, Avro.

    • Oracle Database to Autonomous Data Warehouse Incremental Load (Customer Managed)

      Application that runs incremental load tasks based on the last execution, which is recorded and reported in a metadata table stored in an Autonomous Data Warehouse target schema.

    • Oracle Fusion Applications using Oracle Business Intelligence Publisher (BIP) to ADW Incremental Load

      Application that runs extracts from Oracle Fusion Applications through Oracle Business Intelligence Publisher (BIP) reports based on the last execution, which is recorded and reported in a metadata table stored in an Autonomous Data Warehouse target schema.
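
The incremental load templates rely on a metadata table that records the last execution. The general pattern can be sketched as follows; this is not the templates' actual implementation. An in-memory SQLite database stands in for both the source and the Autonomous Data Warehouse target, and all table, column, and task names are hypothetical.

```python
import sqlite3


def incremental_load(conn: sqlite3.Connection, task_name: str) -> int:
    """Copy source rows newer than the stored watermark, then advance it."""
    cur = conn.cursor()
    row = cur.execute(
        "SELECT last_loaded_at FROM load_metadata WHERE task_name = ?",
        (task_name,),
    ).fetchone()
    # ISO 8601 timestamps sort as text, so an empty string precedes all of them.
    watermark = row[0] if row else ""
    new_rows = cur.execute(
        "SELECT id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if new_rows:
        cur.executemany(
            "INSERT OR REPLACE INTO target_orders VALUES (?, ?, ?)", new_rows
        )
        # Record the newest timestamp loaded so the next run starts after it.
        cur.execute(
            "INSERT OR REPLACE INTO load_metadata VALUES (?, ?)",
            (task_name, new_rows[-1][2]),
        )
    conn.commit()
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
CREATE TABLE target_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
CREATE TABLE load_metadata (task_name TEXT PRIMARY KEY, last_loaded_at TEXT);
INSERT INTO source_orders VALUES (1, 10.0, '2024-01-01T00:00:00');
INSERT INTO source_orders VALUES (2, 20.0, '2024-01-02T00:00:00');
""")

first_run = incremental_load(conn, "orders_to_adw")    # loads both rows
second_run = incremental_load(conn, "orders_to_adw")   # nothing new to load
conn.execute("INSERT INTO source_orders VALUES (3, 30.0, '2024-01-03T00:00:00')")
third_run = incremental_load(conn, "orders_to_adw")    # loads only the new row
print(first_run, second_run, third_run)
```

Keeping the watermark in the target schema means a rerun after a failure resumes from the last committed load instead of reprocessing the full source.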

Considerations

When collecting, processing, and curating application data for analysis and machine learning, consider the following implementation options.

  • Data Processing
    • Oracle Cloud Infrastructure Data Integration provides a cloud native, serverless, fully-managed ETL platform that is scalable and cost effective.
    • Oracle Cloud Infrastructure Data Flow provides a serverless Spark environment to process data at scale with a pay-per-use, extremely elastic model.
    • Oracle Cloud Infrastructure Big Data Service provides enterprise-grade Hadoop-as-a-service with end-to-end security, high performance, and ease of management and upgradeability.
  • Data Persistence
    • Oracle Autonomous Data Warehouse is an easy-to-use, fully autonomous database that scales elastically, delivers fast query performance, and requires no database administration. It also offers direct access to data in object storage through external or hybrid partitioned tables.
    • Oracle Cloud Infrastructure Object Storage stores unlimited data in raw format.
  • Data Refinery

    Oracle Cloud Infrastructure Data Integration provides a cloud native, serverless, fully-managed ETL platform that is scalable and cost efficient.

Deploy

The Terraform code for this reference architecture is available in GitHub.

  1. Go to GitHub.
  2. Clone or download the repository to your local computer.
  3. Follow the instructions in the README document.

Acknowledgments

  • Author: Mario Miola