Implement a multicloud data lake integration architecture

This reference architecture shows how organizations can integrate data from multiple sources into an Oracle Cloud Infrastructure (OCI) data lake.

This reference architecture represents a use case for a large-scale enterprise whose business strategy includes acquiring new organizations as part of its long-term growth plan. The organization is building a data lake with an analytics platform, and cost analysis is one of the modules in scope.

The organization has implemented Oracle Fusion Cloud Applications for finance where the invoice data is stored.

The organization recently acquired a company that uses Amazon Web Services (AWS) to host its invoice processing application. The invoice data must be brought from AWS to Oracle Cloud Infrastructure (OCI), where the data lake is implemented, and the high-volume invoice data must be enriched with cost center and supplier information before it is loaded into the data lake. Cost center data is sourced from Oracle Fusion Cloud Applications, and supplier data is sourced from an on-premises MySQL database.
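
The enrichment step in this use case amounts to a lookup join of each invoice against two reference data sets. The following Python sketch only illustrates that logic; in the architecture the transformation is performed by OCI Data Integration, and the record fields, IDs, and sample values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    # Hypothetical invoice shape; the real schema is not given in this document.
    invoice_id: str
    cost_center_id: str
    supplier_id: str
    amount: float


def enrich_invoices(invoices, cost_centers, suppliers):
    """Attach cost center and supplier names to each invoice.

    cost_centers and suppliers are dicts keyed by ID. A missing key yields
    None, so unmatched invoices can be reviewed instead of silently dropped.
    """
    enriched = []
    for inv in invoices:
        enriched.append({
            "invoice_id": inv.invoice_id,
            "amount": inv.amount,
            "cost_center_id": inv.cost_center_id,
            "cost_center_name": cost_centers.get(inv.cost_center_id),
            "supplier_id": inv.supplier_id,
            "supplier_name": suppliers.get(inv.supplier_id),
        })
    return enriched


# Hypothetical sample data standing in for the AWS, Fusion, and MySQL sources.
invoices = [
    Invoice("INV-1", "CC-10", "SUP-7", 1200.0),
    Invoice("INV-2", "CC-99", "SUP-7", 80.5),
]
cost_centers = {"CC-10": "Finance Operations"}
suppliers = {"SUP-7": "Acme Supplies"}

rows = enrich_invoices(invoices, cost_centers, suppliers)
```

Keeping unmatched invoices with empty names (rather than dropping them, as an inner join would) is a design choice made for this sketch; the source does not state how unmatched records should be handled.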

Architecture

This reference architecture describes how you can bring data from different cloud providers and on-premises data sources into a data lake hosted in OCI. The architecture covers batch integration, data integration, real-time integration, and event-based integration scenarios.

The following diagram illustrates the data flow for this reference architecture.
[Figure: oci_multicloud_datalake_flow.png]

OCI Data Integration:

Connects and extracts data from:
- AWS services and Azure services through native adapters.
- On-premises data sources through private connectivity (FastConnect/VPN).
- Oracle SaaS applications through the BICC connector.
Performs transformation on the extracted data.
Loads data into the OCI data lake through adapters (ADB/Object Storage).

Oracle Integration Cloud:

Receives real-time data from source systems such as Oracle SaaS applications, IoT devices, streaming services, social media, on-premises systems, and other cloud providers through native adapters.
Performs transformation and orchestration logic.
Loads data into the OCI data lake through adapters (ADB/Object Storage).

The following diagram illustrates this reference architecture.

[Figure: oci_multicloud_datalake.png]

Oracle Data Integration Service is used for the following scenarios:

Consolidating data by capturing it from multiple, heterogeneous source systems and integrating it into a single persistent store. This is typically accomplished using extract, transform, and load (ETL) routines.
Extracting high-volume data from source systems (HDFS, Oracle Autonomous Database, MySQL, Oracle Database, Azure Synapse, Amazon Redshift, Object Storage, Amazon S3, Microsoft SQL Server, PostgreSQL, and so on) hosted in private or public networks, such as customer on-premises networks and third-party cloud networks (Azure VNet, AWS VPC), and then loading it into the OCI data lake.
Extracting data from Oracle Fusion Cloud Applications through the BICC or BI Publisher connector and then loading it into the OCI data lake.
Extracting high-volume data from multiple sources with an orchestration pattern.
Implementing scheduled ETL jobs (daily, weekly, monthly, cron expression, and so on).

Oracle Integration Cloud (OIC) is used for the following scenarios:

Receiving data in real time from Oracle Cloud applications, CRM, e-commerce, and on-premises or third-party cloud applications, and then loading it into the data lake.
Loading low-volume data into the data lake from a file generated by a data source.
Exposing Oracle Integration Cloud REST APIs to webhook platforms, receiving the data in real time, and loading it into the data lake.
Some IoT platforms (Geotab, CheckSafe, and so on) have webhook functionality and send data for new events to any HTTPS API, so they can connect directly to the API Gateway.
Receiving data from social media platforms (Facebook, LinkedIn, Twitter, Slack, and so on) and loading into the OCI data lake.
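
As a rough illustration of the webhook scenarios above, the following Python sketch wraps a raw JSON event body in a small envelope and serializes it as a single line suitable for appending to a newline-delimited JSON file. This is not how Oracle Integration Cloud implements its adapters; the function name, envelope fields, and sample payload are assumptions made for the example.

```python
import json
from datetime import datetime, timezone


def to_lake_record(payload: bytes, source: str, received_at: datetime) -> str:
    """Wrap a raw webhook JSON body in an envelope and return one JSON line.

    The envelope fields (source, received_at, event) are hypothetical.
    """
    event = json.loads(payload)
    record = {
        "source": source,
        # Normalize the receipt time to UTC so records from different
        # platforms sort consistently.
        "received_at": received_at.astimezone(timezone.utc).isoformat(),
        "event": event,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical IoT event body as it might arrive from a webhook call.
sample = b'{"deviceId": "truck-42", "speedKph": 87}'
line = to_lake_record(
    sample,
    source="iot-webhook",
    received_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Storing the untouched event inside the envelope keeps the raw data available in the data lake, while the added fields record where and when each event arrived.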

Oracle API Gateway is used for the following scenarios:

Publishing OIC APIs and application APIs with private endpoints that are accessible from within your network, or that you can expose to the public internet if required. The endpoints support API validation, request and response transformation, CORS, authentication and authorization, and request limiting.
Decoupling the security and business logic in API development.
Exposing APIs to restricted sources with security controls; these sources may feed data to the downstream data lake.

The architecture has the following components:

Region
An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).
Availability domains
Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.
Virtual cloud network (VCN) and subnets
A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.
Integration
Oracle Integration is a fully managed service that allows you to integrate your applications, automate processes, gain insight into your business processes, and create visual applications.
Oracle Data Integration
Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage. ETL (extract transform load) leverages fully-managed scale-out processing on Spark, and ELT (extract load transform) leverages full SQL push-down capabilities of the Autonomous Data Warehouse in order to minimize data movement and to improve the time to value for newly ingested data. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. Oracle Cloud Infrastructure Data Integration provides interactive exploration and data preparation and helps data engineers protect against schema drift by defining rules to handle schema changes.
Oracle Business Intelligence Cloud Connector
The Oracle BI Cloud Connector (BICC) is a useful tool for extracting data from Fusion and for storing it in shared resources like Oracle Universal Content Management (UCM) Server or cloud storage in CSV format.
OIC Connectivity Agent
With the OIC connectivity agent, you can create hybrid integrations and exchange messages between applications in private or on-premises networks and Oracle Integration Cloud.
Data Lake
A data lake is a scalable, centralized repository that stores raw data and enables an enterprise to keep all its data in a cost-effective, elastic environment. It provides a flexible storage mechanism for raw data.
Object storage
Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly and frequently. Use archive storage for "cold" storage that you retain for long periods of time and rarely access.
Autonomous Database
Oracle Cloud Infrastructure Autonomous Database is a fully managed, preconfigured database environment that you can use for transaction processing and data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.
Analytics
Oracle Analytics Cloud is a scalable and secure public cloud service that empowers business analysts with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing and generation. With Oracle Analytics Cloud, you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.
Data catalog
Oracle Cloud Infrastructure Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. It provides data engineers, data scientists, data stewards, and chief data officers a single collaborative environment to manage the organization's technical, business, and operational metadata.
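
The VCN description in the component list above states that each subnet is a contiguous address range that must not overlap with the other subnets in the VCN. The following Python sketch checks a planned layout for that property using the standard `ipaddress` module. The helper names and CIDR values are hypothetical, and the sketch validates a plan on paper only; it does not call any OCI API.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]


def all_within(vcn_cidr, subnet_cidrs):
    """Return True if every subnet lies inside the VCN CIDR block."""
    vcn = ipaddress.ip_network(vcn_cidr)
    return all(ipaddress.ip_network(s).subnet_of(vcn) for s in subnet_cidrs)


# Hypothetical plan: the last two subnets deliberately overlap.
vcn_cidr = "10.0.0.0/16"
planned_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.2.128/25"]

overlaps = find_overlaps(planned_subnets)
inside_vcn = all_within(vcn_cidr, planned_subnets)
```

Here `overlaps` reports the conflicting pair `10.0.2.0/24` and `10.0.2.128/25`, which would need to be corrected before the subnets are created.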

Recommendations

Use the following recommendation as a starting point. Your requirements might differ from the architecture described here.

Security
All connections are established through a private network, and all ETL transactions are routed through FastConnect for on-premises sources, Colt for AWS, and Azure Interconnect for Azure. Also use encryption at the source and decryption at the target to secure the data in transit.

Considerations

Consider the following points when deploying this reference architecture.

Security
Use OCI Identity and Access Management (IAM) policies to control who can access your cloud resources and what operations can be performed. To protect the database passwords or any other secrets, consider using the OCI Vault service.
- Assign least privilege access for IAM users and groups to resource types in dis-family.
- To minimize loss of data due to inadvertent deletes by an authorized user or malicious deletes, Oracle recommends assigning the DIS_WORKSPACE_DELETE permission to the smallest possible set of IAM users and groups. Assign the DIS_WORKSPACE_DELETE permission only to tenancy and compartment administrators.
- To protect your data sources from security vulnerabilities, provide credentials for read-only accounts only. Data Integration needs only read access to ingest data from data assets.
Cost
- If large volumes of data are transferred across the cloud boundary frequently, the direction of data flow becomes essential. Cloud providers typically do not charge for data ingress, but all providers charge a data egress fee. The data egress rates vary among cloud providers. It is crucial to account for egress cost in a multicloud design. In addition, data residency must be considered when moving data.
- OCI FastConnect: The cost of FastConnect is the same across all OCI regions.
- Microsoft Azure ExpressRoute: The Microsoft Azure ExpressRoute cost varies from one region to another. Azure has more than one SKU available for ExpressRoute. Oracle recommends using the Local setting, because it has no separate ingress or egress charges, and it starts at the minimum bandwidth of 1 Gbps. The Standard and Premium configurations offer lower bandwidth, but incur separate egress charges in a metered setup.
- Use the low-cost Archive Storage service to store data that is rarely accessed but that must be retained for a longer duration. Define lifecycle management policies to automatically move data to Archive Storage or delete data after a specified duration.
High availability
Every interconnect circuit (ExpressRoute and FastConnect) comes with a redundant circuit on the same POP but on a different physical router, providing high availability.
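
To make the egress point under Cost concrete, the following Python sketch estimates a monthly egress charge from a daily transfer volume. The rate, the free allowance, and the volume are placeholder numbers chosen for the arithmetic, not actual prices from any cloud provider; substitute the published rates for the source cloud and region.

```python
def monthly_egress_cost(gb_per_day, rate_per_gb, free_gb_per_month=0.0, days=30):
    """Estimate a monthly egress charge for a steady daily transfer volume.

    All inputs are assumptions supplied by the caller; this is simple
    arithmetic, not a pricing calculator.
    """
    total_gb = gb_per_day * days
    # Only the volume above the free allowance is billable.
    billable_gb = max(0.0, total_gb - free_gb_per_month)
    return billable_gb * rate_per_gb


# Hypothetical inputs: 500 GB/day leaving the source cloud,
# 0.09 currency units per GB, first 100 GB per month free.
cost = monthly_egress_cost(gb_per_day=500, rate_per_gb=0.09, free_gb_per_month=100)
```

With these placeholder inputs, 15,000 GB leave the source cloud in a month, 14,900 GB are billable, and the estimate comes to about 1,341 currency units. Running the same calculation for each direction of flow shows which side of the cloud boundary is cheaper to extract from.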

Acknowledgments

Author: Subburam Mathuraiveeran

Contributors: Wei Han, Phil Wilkins