Data Platform - Data Federation

A data lakehouse architecture will help you to effectively collect and analyze event data and streaming data from devices in real time and correlate it with a broad range of enterprise data resources to leverage your data investment and to gain the insights you want.

But how do you correlate data in your data lakehouse with data that resides on your heterogeneous data sources or that is deployed on other clouds or on-premises systems, without the need to duplicate data?

The answer is to leverage a data federation approach that combines your data lakehouse data with federated data from third-party cloud stores and data from cloud and on-premises databases, regardless of its physical location.

This reference architecture positions the technology solution within the overall business context:



Data federation is a technique that allows integration, unification, and governance of data that is stored in different data stores by using a federated query engine that translates a single query into subqueries that are shipped to the source data stores. The results are then merged and presented to the user or application, as depicted below.



data-platform-federation-overview-oracle.zip

Often data federation is interchanged with data virtualization. Data virtualization creates a unified view of the data from multiple sources without duplicating the data, and serves fresh data in real time.

For the purposes of this reference architecture, the term data federation is used, even though the capabilities described below address both federation and virtualization.

Using data federation simplifies data access for consumers, such as analytics and data science engines, by connecting them to a federated serving engine rather than to mutiple data sources, thereby increasing the reusability, governance, and security of federated data.

Analytics engines traditionally offer data federation capabilities that can be used as a complement to the federated serving engine. Usually, federated data serving engines have more capabilities to increase performance and address scalability which implicitly benefit those analytics engines.

This document describes one potential solution for data federation within Oracle Cloud Infrastructure, but there can be alternatives that use different technologies for different scenarios.

Functional Architecture

This architecture uses a data lakehouse to store and work with data, regardless of its shape or form. At the core of this architecture is the data warehouse deployed on Oracle Autonomous Data Warehouse.

Additionally, the architecture uses a unified query engine to federate curated data from selected sources with the data in lakehouse. Federated data is obtained by using mechanisms such as external tables, database links, and data sharing, depending on the data store.

A federation architecture that combines lakehouse data and existing data stores allows you to:

  • Join all data regardless of where it is stored
  • Support a multicloud and hybrid-cloud data platform that federates data stored on other clouds and on premises
  • Simplify the data consumer experience of obtaining and querying data from different engines
  • Increase security as a single data security model can be enforced at the federated query engine
  • Increase governance by using a data catalog that unifies metadata for entities stored in the lakehouse and federated with the query engine
  • Increase performance by leveraging data materialization and autonomous database caching
  • Expose unified and curated data to different consumers by using analytics dashboards, SQL interfaces, API endpoints and data sharing
  • Leverage a multi-model database as a federated query engine

The following diagram illustrates the functional architecture. For simplicity's sake, not all the capabilities of the lakehouse are shown.



data-platform-federation-functional-oracle.zip

Note this architecture depicts a federated data platform that uses mostly batch processing, but it can be augmented with the data lakehouse real-time capabilities to process streaming data.

Streaming data processing often needs to consume contextual data within its data pipelines. Contextual data might be stored on different data sources, but a data federation engine that serves all contextual data to the data pipelines simplifies those pipelines.

The architecture focuses on the following logical divisions:

Ingest, Transform

Ingests and refines the data for use in each of the data layers in the architecture.

Federated data is consumed on demand from cloud storage, databases, and data shares. Data isn't transformed on this layer because it has already been curated on the source data store.

Persist, Curate, Create

Facilitates access and navigation of the data to show the current business view. For relational technologies, data may be logically or physically structured in simple relational, longitudinal, dimensional or OLAP forms. For non-relational data, this layer contains one or more pools of data, either output from an analytical process or data optimized for a specific analytical task.

This layer contains the federated serving engine that unifies and serves data residing in the data warehouse, data lake, and federated data sources. It offers the ability to query federated data on demand and to materialize federated data for increased query performance.

The federating engine offers the ability to serve data by using SQL, REST APIs, or data sharing which increases interoperability while simplifying connectivity, as data consumers connect to a single serving engine rather than to several data stores.

Analyze, Learn, Predict

Abstracts the logical business view of the data for consumers. This abstraction facilitates agile approaches to development, migration to the target architecture, and the provision of a single reporting layer from multiple federated sources.

This layer leverages the serving engine to obtain federated data which can be further augmented with the data connectors available on this layer and provided by the visualization or data science services.

Using a federated query engine allows data consumer access to be abstracted from the underlying data stores, increasing productivity as data is federated once and consumed by many data consumers. This also makes the system more interoperable as any consumer that can interoperate with SQL, REST APIs or data sharing can use and join lakehouse and federated data.

The architecture has the following functional components:

Batch Ingest

Batch ingest is useful for data that can’t be ingested in real time or that is too costly to adapt for real-time ingestion. It is also important for transforming data into reliable and trustworthy information that can be curated and persisted for regular consumption.

Batch Ingest complements the data federation engine since it can ingest data that the federation engine doesn’t have native access to, or for specific use cases where the data needs to be transformed to conform to the lakehouse data model.

You can use the following services together or independently to achieve a highly flexible and effective data-integration and transformation workflow.

  • Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage. ETL (extract transform load) leverages fully-managed scale-out processing on Spark, and ELT (extract load transform) leverages full SQL push-down capabilities of the Autonomous Data Warehouse in order to minimize data movement and to improve the time to value for newly ingested data. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. Oracle Cloud Infrastructure Data Integration provides interactive exploration and data preparation and helps data engineers protect against schema drift by defining rules to handle schema changes.

  • Oracle Data Integrator provides comprehensive data integration from high-volume and high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. A declarative design approach ensures faster, simpler development and maintenance, and provides a unique approach to extract load transform (ELT) that helps guarantee the highest level of performance possible for data transformation and validation processes. Oracle data transforms use a web interface to simplify the configuration and execution of ELT and to help users build and schedule data and work flows using a declarative design approach.

  • Oracle Data Transforms enable ELT for selected supported technologies, simplifying the configuration and execution of data pipelines by using a web user interface that allows users to declaratively build and schedule data flows and workflows. Oracle Data Transforms is available as a fully-managed environment within Oracle Autonomous Data Warehouse to load and transform data from several data sources into an ADW instance.

Depending on the use case, these components can be used independently or together to achieve highly flexible and performant data integration and transformation.

Batch Processing

Batch processing transforms large scale data sets stored on the data lakehouse. Batch processing leverages Oracle Cloud Infrastructure native services that seamlessly integrate with Oracle Cloud Infrastructure Object Storage and allows you to create curated data for use cases such as data aggregation and enrichment, data warehouse ingestion, and machine learning and AI data use at scale.

Oracle Cloud Infrastructure Data Integration, described above, is a fully-managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Oracle Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage.

Oracle Cloud Infrastructure Data Flow is a fully-managed big data service that lets you run Apache Spark applications without having to deploy or manage infrastructure. It lets you deliver big data and AI applications faster, because you can focus on your applications without having to manage operations. Data flow applications are reusable templates that consist of a Spark application and its dependencies, default parameters, and a default run-time resource specification.

Serving

Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

After provisioning, you can scale the number of CPU cores or the storage capacity of the database at any time without impacting availability or performance.

ADW can also virtualize data that resides in object storage as external and hybrid partitioned tables so that you can join and consume data derived from other sources with the warehouse data. You can also move historical data from the warehouse into object storage and then consume it seamlessly by using hybrid partitioned tables.

ADW can use previously harvested metadata stored in the Oracle Cloud Infrastructure Data Catalog to create external tables, and can automatically synchronize metadata updates in the Oracle Cloud Infrastructure Data Catalog with the external tables definition to maintain consistency, simplify management, and reduce effort.

Analytic views, an Autonomous Database feature, provide a fast and efficient way to create analytic queries of data stored in existing database tables and views. Analytic views organize data using a dimensional model. They allow you to easily add aggregations and calculations to data sets and to present data in views that can be queried with relatively simple SQL. This feature allows you to semantically model a star or snowflake schema directly in ADW, using data stored internally and externally, and allows consumption of the model by using SQL and any SQL compliant data consumer.

ADW provides the ability to federate and query data stored on third-party cloud stores (namely AWS S3, Azure Blob and GCP CGS), on third-party cloud databases (namely AWS Redshift, Azure Synapse Analytics, Google BigQuery and Snowflake), third-party databases (namely IBM DB2, MongoDB, PostrgreSQL, Hive) and even SaaS applications.

In a single query, ADW can query and join data from cloud stores, cloud databases and other popular databases, simplifying data access to the serving engine consumers, as they are abstracted from the complexity of querying separately, several query engines, to obtain a unified result. It can also combine this data with data obtained from data shares provided by producers compliant with the Delta Sharing open protocol.

Cloud Storage

Oracle Cloud Infrastructure Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Oracle Cloud Infrastructure Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data. You can safely and securely store or retrieve data directly from the internet or from within the cloud platform. Multiple management interfaces let you easily start small and scale seamlessly, without experiencing any degradation in performance or service reliability.

Oracle Cloud Infrastructure Object Storage can also be used as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data by using hybrid tables in Oracle Autonomous Data Warehouse.

Visualize/Learn

Oracle Analytics Cloud is a scalable and secure public cloud service that provides a full set of capabilities to explore and perform collaborative analytics for you, your work group, and your enterprise. It supports citizen data scientists, advanced business analysts training, and executing machine learning (ML) models. Machine learning models can be executed on the analytics service or directly on Oracle Autonomous Data Warehouse as OML-embedded models for large-scale batch predictions that leverage the processing power, scalability, and elasticity of the warehouse and OCI AI services, such as Oracle Cloud Infrastructure Vision.

With Oracle Analytics Cloud you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.

Learn and predict

Oracle Cloud Infrastructure Data Science provides infrastructure, open source technologies, libraries, packages, and data science tools for data science teams to build, train, and manage machine learning (ML) models in Oracle Cloud Infrastructure. The collaborative and project-driven workspace provides an end-to-end cohesive user experience and supports the lifecycle of predictive models. Data Science enables data scientists and machine learning engineers to download and install packages directly from the Anaconda Repository at no cost and thus allowing them to innovate on their projects with a curated data science ecosystem of machine learning libraries.

The Data Science Jobs feature enables data scientists to define and run repeatable machine learning tasks on a fully-managed infrastructure.

The Data Science Model Deployment feature allows data scientists to deploy trained models as fully-managed HTTP endpoints that can provide predictions in real time, infusing intelligence into processes and applications, and allowing the business to react to relevant events as they occur.

Oracle Machine Learning provides powerful machine learning capabilities tightly integrated in Oracle Autonomous Database, with support for Python and AutoML. It supports models using open source and scalable, in-database algorithms that reduce data preparation and movement. AutoML helps data scientists speed up time to value of the company’s machine learning initiatives by using auto algorithm selection, adaptive data sampling, auto feature selection, and auto model tuning. With Oracle Machine Learning services available in Oracle Autonomous Data Warehouse, you can not only manage models but you can also deploy those models as REST endpoints in order to democratize real-time predictions within the company allowing business to react to relevant events as they occur, rather than after the fact.

AI services

Oracle Cloud Infrastructure AI Services services provide a set of ready-to-consume AI services that can be used to support a range of use cases from text analysis to predictive maintenance. These services have prebuilt, finely tuned models that you can integrate into data pipelines, analytics, and applications by using APIs.

Oracle Cloud Infrastructure Anomaly Detection provides a rich set of tools to identify undesirable events or observations in business data in real time so that you can take actions to avoid business disruptions.

Oracle Cloud Infrastructure AI Language performs sophisticated text analysis at scale. With pretrained and custom models, developers can process unstructured text and extract insights without data science expertise. Pretrained models support sentiment analysis, key phrase extraction, text classification, and named entity recognition. You can also train custom models for named entity recognition and text classification with domain specific data sets. Translation service enables you to translate text across 21 different languages.

Oracle Cloud Infrastructure Speech harnesses the power of spoken language by allowing you to easily convert media files containing human speech into highly accurate text transcriptions. OCI Speech can be used to transcribe customer service calls, automate subtitling, and generate metadata for media assets to create a fully searchable archive.

Oracle Cloud Infrastructure Vision performs image recognition and document analysis tasks such as classifying images, detecting and faces, extracting text, and recognizing tables. You can either leverage pretrained models or easily create custom vision models for industry- and customer-specific scenarios. Vision service is a fully-managed, multitenant, native cloud service that helps with all common computer vision tasks.

Oracle Cloud Infrastructure Document Understanding performs document analysis tasks such as extracting text and recognizing tables. OCI Document Understanding service is a fully-managed, multitenant, native cloud service that helps with all common document analysis tasks.

Data Enrichment

Data enrichment can improve the data that is used to train machine learning models to achieve better and more accurate prediction outcomes.

Oracle Cloud Infrastructure Data Labeling allows you to create and browse data sets, view data records (text or images), and apply labels for the purposes of building AI/ML models. The service also provides interactive user interfaces designed to aid in the labeling process. After records are labeled, the data set can be exported as line-delimited JSON for use in AI/ML model development.

API

The API layer allows you to infuse the intelligence derived from Data Science and Oracle Machine Learning into applications, business processes, and things to influence and improve their operation and function. The API layer provides secure consumption of the Data Science-deployed models to Oracle Machine Learning REST endpoints and the ability to govern the system to ensure the availability of run-time environments. You can also leverage functions to perform additional logic as needed.

Oracle Cloud Infrastructure API Gateway enables you to publish APIs with private endpoints that are accessible from within your network, and that you can expose with public IP addresses if you want them to accept internet traffic. The endpoints support API validation, request and response transformation, CORS, authentication and authorization, and request limiting. It allows API observability to monitor usage and guarantee SLAs. Usage plans can also be used to monitor and manage the API consumers and API clients that access APIs and to set up different access tiers for different customers in order to track data usage that is consumed by using APIs. Usage plans are a key feature to support data monetization.

Oracle Cloud Infrastructure Functions is a fully-managed, multitenant, highly-scalable, on-demand, functions-as-a-service platform. It is built on enterprise-grade Oracle Cloud Infrastructure and powered by the Fn Project open source engine.

Oracle REST Data Services (ORDS) is a Java application that enables developers with SQL and database skills to develop REST APIs for Oracle Database. Any application developer can use these APIs from any language environment, without installing and maintaining client drivers, in the same way they that they access other external services using REST, the most widely used API technology. ORDS is deployed as a fully-managed feature in ADW and can be used to expose lakehouse information by using APIs to data consumers.

Data Governance

Oracle Cloud Infrastructure Data Catalog provides visibility to where technical assets such as metadata and respective attributes reside and offers the ability to maintain a business glossary that is mapped to that technical metadata. Oracle Cloud Infrastructure Data Catalog can also serve metadata to Oracle Autonomous Data Warehouse Warehouse to facilitate external table creation in the data warehouse.

Data Security

Data security is crucial in exploring and using lakehouse data to the fullest extent. Leveraging a zero-trust security model with defense-in-depth and RBAC capabilities, and ensuring compliance with the most stringent regulation, data security provides preventive, detective, and corrective security controls to ensure that data exfiltration and breaches are prevented.

Oracle Data Safe is a fully-integrated Oracle Cloud service focused on data security. It provides a complete and integrated set of features for protecting sensitive and regulated data in Oracle Cloud databases, such as Oracle Autonomous Data Warehouse. Features include security assessment, user assessment, data discovery, data masking, and activity auditing.

Oracle Cloud Infrastructure Audit provides visibility into activities related to Oracle Cloud Infrastructure (OCI) resources and tenancies. Audit log events can be used for security audits to track usage of and changes to OCI resources and to help ensure compliance with standards and regulations.

Oracle Cloud Infrastructure Logging provides a highly-scalable and fully-managed single interface for all the logs in the tenancy, including audit logs. Use OCI Logging to access logs from all OCI resources so that you can enable, manage, and search them.

Oracle Cloud Infrastructure Vault is an encryption management service that stores and manages encryption keys and secrets to securely access resources. Enables customer managed keys to be used for Oracle Autonomous Data Warehouse and data lake encryption for increased data protection at rest. Enables secrets to securely store services and user credentials to improve your security posture and to ensure credentials aren’t compromised and used inappropriately.

Physical Architecture

The physical architecture for this data platform supports the following:
  • Oracle Autonomous Data Warehouse (ADW) obtains data from the federated data sources using Oracle-Managed Heterogeneous Connectivity features
  • ADW uses target databases accessible from the public internet that are configured and allow incoming SSL/TLS connections, so that Oracle-Managed Heterogeneous Connectivity can connect and query data securely
  • ADW reads data from Databricks using data shares
  • Databricks data shares are accessible by using the public internet but are secured using credential files provided by Databricks
  • Data from AWS S3, Azure Blob and Google Cloud Storage is either federated and read on demand using external tables or copied into ADW depending on the use case and requirements
  • Data from non federated data sources is ingested securely by using micro batch and files from relational and non-relational data sources that are not federated
  • Data is processed leveraging a combination of Oracle Cloud Infrastructure Data Integration and Oracle Cloud Infrastructure Data Flow
  • Data is stored in ADW and Oracle Cloud Infrastructure Object Storage and is organized according to its quality and value
  • ADW serves warehouse, lake and federated data securely to consumers
  • Oracle Analytics Cloud surfaces data to business users by using visualizations
  • Oracle Analytics Cloud is exposed by using Oracle Cloud Infrastructure Load Balancing that is secured by Oracle Cloud Infrastructure Web Application Firewall (WAF) to provide access by using the internet
  • Oracle Cloud Infrastructure Data Science is used to build, train, and deploy machine learning (ML) models
  • Oracle Cloud Infrastructure API Gateway is leveraged to govern the Data Science ML model deployments
  • Oracle Cloud Infrastructure Data Catalog harvests metadata from ADW and object storage
  • Oracle Cloud Infrastructure Bastion is used by administrators to manage private cloud resources

The following diagram illustrates the architecture:



data-platform-federation-physical-oracle.zip

The design for the physical architecture:

  • Leverages 2 VCNs, one for hub and another for the workload itself
  • On premises connectivity leverages both Oracle Cloud Infrastructure FastConnect and site-to-site VPN for redundancy
  • All incoming traffic from on premises and from the internet is first routed into the hub VCN and then into the workload VCN
  • All data is secure in transit and at rest
  • Services are deployed with private endpoints to increase the security posture
  • The VCN is segregated into several private subnets to increase the security posture
  • Lake data is segregated into several buckets in object storage, leveraging a medallion architecture
  • Federated data sources and cloud stores are accessed by using public connectivity and the NAT gateway attached to the workload VCN

Potential design improvements not depicted on this deployment for simplicity’s sake include:

  • Leverage Customer-Managed Heterogeneous Connectivity, using an Oracle Database Gateway, to connect to federated data sources using private connectivity
  • Leveraging a full CIS-compliant landing zone
  • Leverage a network firewall to improve the overall security posture by inspecting all traffic and by enforcing policies

Recommendations

Use the following recommendations as a starting point to use data from heterogeneous data sources for business analysis and machine learning.

Your requirements might differ from the architecture described here.

Oracle Autonomous Data Warehouse

This architecture uses Oracle Autonomous Data Warehouse (ADW) on shared infrastructure.

  • Consider using materialized views to increase the performance when accessing federated data.
  • Consider refreshing the materialized views with the frequency needed to avoid federated data to be stalled.
  • Consider creating views to query data by using database links from federated sources in order for those views to be harvested and catalogued in Oracle Cloud Infrastructure Data Catalog for increased data governance.
  • Consider storing the federated data source credentials in a secret in Oracle Cloud Infrastructure Vault to increase security posture.
  • Consider using the Autonomous Database security features, such as redaction (dynamic data masking), in views that surface federated data to increase data security.
  • Consider using data sharing to consume federated data from heterogeneous data sources that are compatible with the Delta Sharing open protocol.
  • Consider using data sharing to share curated data to consumers that are compatible with the Delta Sharing open protocol.
  • Consider using Customer-Managed Heterogeneous Connectivity with an Oracle Database Gateway to connect to federated data sources with private connectivity for which you need increased security, lower latency or both.

Oracle Analytics Cloud

This architecture leverages Oracle Analytics Cloud (OAC) for delivering augmented analytics to end users.

  • Consider using OAC's ample range of data sources to complement the federated data sources used by ADW.
  • Consider federating data sources needed in OAC on ADW for increased performance, caching, processing offload to the serving engine, and simplification of the analytical semantic layer.

Note this architecture relies on a data lakehouse reference architecture including the recommendations that are applicable to your architecture. See the Explore More section for a link to the lakehouse reference architecture and othe resources.

Considerations

When federating data for analysis, consider the following implementation options.

Guidance Recommended Other Options Rationale
Data Refinery

Oracle Cloud Infrastructure Data Integration

  • Oracle Data Integrator
  • Oracle Autonomous Database Data Transforms

Oracle Cloud Infrastructure Data Integration provides a cloud native, serverless, fully-managed ETL platform that is scalable and cost efficient.

Data Persistence
  • Oracle Autonomous Data Warehouse
  • Oracle Cloud Infrastructure Object Storage
Oracle Exadata Database Service

Oracle Autonomous Data Warehouse is an easy-to-use, fully autonomous database that scales elastically, delivers fast query performance, and requires no database administration. It also offers direct access to the data from object storage external or hybrid partitioned tables.

Oracle Cloud Infrastructure Object Storage stores unlimited data in raw format.

Data Processing
  • Oracle Cloud Infrastructure Data Integration
  • Oracle Cloud Infrastructure Data Flow
Third-party tools

Oracle Cloud Infrastructure Data Integration provides a cloud native, serverless, fully-managed ETL platform that is scalable and cost effective.

Oracle Cloud Infrastructure Data Flow provides a serverless Spark environment to process data at scale with a pay-per-use, extremely elastic model.

Access & Interpretation
  • Oracle Analytics Cloud
  • Oracle Cloud Infrastructure Data Science
  • Oracle Machine Learning
  • Oracle Cloud Infrastructure AI services
Third-party tools

Oracle Analytics Cloud is fully managed and tightly integrated with the curated data inOracle Autonomous Data Warehouse.

Data Science is a fully-managed, self-service platform for data science teams to build, train, and manage machine learning (ML) models in Oracle Cloud Infrastructure. The Data Science service provides infrastructure and data science tools such as AutoML and model deployment capabilities.

Oracle Machine Learning is a fully-managed, self service platform for data science available with Oracle Autonomous Data Warehouse that leverages the processing power of the warehouse to build, train, test, and deploy ML models at scale without the need to move the data outside of the warehouse.

Oracle Cloud Infrastructure AI services are a set of services that provide pre-built models specifically built and trained to perform tasks such as inferencing potential anomalies or detecting sentiment.

Acknowledgments

  • Author: José Cruz
  • Contributors: Robert Lies