Data Platform - Decentralized Data Platform

Use a data lakehouse to collect and analyze event and streaming data from devices in real time and correlate it with a broad range of enterprise data resources to gain the insights you want.

How can you best support and empower your organization’s various teams, such as marketing, finance, or logistics, with the flexibility to work with their domain-specific data while also enabling secure cross-domain data sharing and consumption without duplicating data or creating data silos?

Adopt a domain-driven data architecture that provides teams and departments across the organization with the agility and flexibility needed to efficiently use their data and develop the data products essential for their business.

This reference architecture positions the technology solution within the overall business context, where strategic intents drive the creation of measurable strategic outcomes. These outcomes generate new strategic intents, effectively delivering continuous, data-driven business improvements.



Each domain independently follows the high-level process shown above to create its domain data products. Domain-driven data architectures provide the flexibility that organizations require by avoiding reliance on a single point of contention, such as a fully centralized data platform and IT team, and by fostering agile innovation to produce trusted data products within each domain.



decentralized-data-platform-overview-oracle.zip

The objective of each domain is to acquire domain-related data and then to produce data products that are consumed by other domains or final data consumers.

The domains can be:

  • Source-aligned: Sources data directly from relevant domain data sources, such as enterprise applications, and produces data products that are consumed by aggregate or consumer-aligned domains. These data products represent the source of truth for a particular domain. The data is granular, curated, and foundational within and across domains.
  • Aggregate: Consumes and combines source-aligned data, creating aggregated and added-value data products that foster reuse, reduce duplication, and comprise foundational business logic needed by consumer-aligned domains.
  • Consumer-aligned: Consumes data from source-aligned and aggregate domains to create data products that serve specific use cases and address data consumers' needs within a given domain.

The data domain teams and their subject matter experts (SMEs) have the flexibility to choose the technology needed to curate their data products, reducing the friction and complexity of long technology selection processes, and reducing the time to deliver data products.

The chosen technology is usually determined at an enterprise level so that it adheres to security, scalability, resilience, and high-availability requirements. This architecture assumes that any Oracle Cloud Infrastructure (OCI) service used with a data lakehouse can be leveraged by any domain.

Data domain teams often use automation to deploy domain archetypes, making preconfigured technologies available to quickly onboard new domains while ensuring that enterprise-level requirements, such as security, are enforced.

After they are created, data products are then served to other domains or end users and applications. Data products are continuously curated to provide information and insights.

Data products can be of several types. A single data product can be served by using more than one interface.
  • Data sets
  • APIs
  • Dashboards
  • Streams
  • AI and machine learning (ML) models that address a specific need
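
As an illustration of one data product served through more than one interface, the following minimal Python sketch models a product descriptor. The class names, field names, and the `next_best_offer` product are hypothetical and are not part of any Oracle API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Interface(Enum):
    """Interface types through which a data product can be served."""
    DATA_SET = "data set"
    API = "API"
    DASHBOARD = "dashboard"
    STREAM = "stream"
    ML_MODEL = "AI/ML model"


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product owned by one domain."""
    name: str
    owner_domain: str
    interfaces: set[Interface] = field(default_factory=set)

    def serve_via(self, interface: Interface) -> None:
        # A product may accumulate several serving interfaces over time.
        self.interfaces.add(interface)


# Example: the same product is exposed both as an API and as a shared data set.
offers = DataProduct(name="next_best_offer", owner_domain="Marketing")
offers.serve_via(Interface.API)
offers.serve_via(Interface.DATA_SET)
```

Keeping the interfaces as a set on the product, rather than defining one product per interface, mirrors the idea that the product is the unit of ownership and the interface is only a delivery channel.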

This reference architecture primarily uses data sharing as the underlying mechanism to provide and consume data products between domains.

Oracle Autonomous Data Warehouse enables data sharing and allows live sharing of data between Autonomous Data Warehouse instances or with versioned data from any technology that is compliant with the Delta Sharing open protocol.

Functional Architecture

This architecture depicts a decentralized platform where each domain is a subset of the overall data platform and where each domain can choose the technologies and services used.

The architecture uses a data lakehouse to store and provide data, regardless of its shape or form. For simplicity's sake, the architecture will depict a few domains that use a subset of the available data lakehouse services.

A decentralized data platform that uses a data lakehouse architecture provides:

  • An interoperable and modular lakehouse architecture where data domains can ingest and curate any type of data for any use case
  • Flexibility for each data domain to use the Oracle Cloud Infrastructure (OCI) services needed to support the creation of their data products
  • Curation of data products that can be shared securely by using data sharing, streaming, APIs, dashboards, or applications
  • Agility in creating data products, reducing interdomain dependencies except those required for exchange of data products
  • Increased data domain isolation and reduced data interchange complexity by using accepted data interchange mechanisms and contracts to exchange data between domains
  • Increased data governance and data trust because knowledgeable subject matter experts (SMEs) curate data and data products for their domains
  • Ease of onboarding new data domains using infrastructure as code (IaC) to automate deployment using prebuilt and tested Terraform stacks
  • Resource and cost efficiency as data domain teams right-size the specific services they use to create data products
  • Appropriate cost accountability for each data domain with the option of fine-grained cost control within the specific domains

The following diagram illustrates the functional architecture. For simplicity's sake, only four data domains are shown and only some of the data lakehouse capabilities that can be used by data domains are shown.



decentralized-data-platform-logical-oracle.zip

Because the particular industry and organization that deploys a decentralized data platform determines the data domains, this reference architecture doesn't prescribe how data domains should be defined. The data domains depicted are just one example.

The architecture focuses on the following logical divisions used by all domains:

  • Connect, Ingest, Transform

    Connects to data sources and ingests and refines their data for use in each of the data layers in the architecture.

    Source-aligned data domains source data from internal and external data sources and from other domains consuming their data products. Aggregate and consumer-aligned data domains usually source their data from other domains' data products. All domains can source relevant domain data from external sources.

  • Persist, Curate, Create

    Facilitates access and navigation of the data to show the current business view. For relational technologies, data may be logically or physically structured in simple relational, longitudinal, dimensional or OLAP forms. For non-relational data, this layer contains one or more pools of data, either output from an analytical process or data optimized for a specific analytical task.

    In this layer, each data domain curates the data it uses to create and expose data products. Usually, data is curated and organized by using a medallion architecture that promotes data from bronze, to silver, to gold, according to its value and quality.

    Data products often serve data that is either in the gold or the silver layer. If the data product serves granular data, that data is served from the silver layer. If the data product serves data that is aggregated or is already a further augmented data set, that data is usually served from the gold layer.

  • Analyze, Learn, Predict

    Abstracts the logical business view of the data for consumers. This abstraction facilitates agile approaches to development, migration to the target architecture, and the provision of a single reporting layer from multiple data sources.

    Each data domain typically has its own data consumers, such as domain users, applications, or systems that consume curated data in the form of dashboards, data applications, streams, or APIs.

    Data domains can serve data products to other data domains and within their own domain as a way to organize cross-project data sharing.
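
The serving rule described for the Persist, Curate, Create layer (granular data from silver, aggregated or further augmented data from gold) can be sketched as follows. This is a minimal Python illustration; the enum and function names are hypothetical, and the text says "usually", so a real platform may make exceptions.

```python
from enum import Enum


class Layer(Enum):
    """Medallion layers, ordered by increasing value and quality."""
    BRONZE = 1
    SILVER = 2
    GOLD = 3


def serving_layer(is_aggregated_or_augmented: bool) -> Layer:
    """Pick the layer a data product is usually served from.

    Granular products come from silver; aggregated or further
    augmented products come from gold. Bronze is never served.
    """
    return Layer.GOLD if is_aggregated_or_augmented else Layer.SILVER


def promote(layer: Layer) -> Layer:
    """Promote data one step (bronze -> silver -> gold); gold stays gold."""
    return Layer(min(layer.value + 1, Layer.GOLD.value))
```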

The architecture has the following functional characteristics:

  • Four data domains are depicted. Each domain curates data specific to that domain, creates data products based on that curated data, and then shares those data products to other domains within the organization or to external entities.
  • Domains can source data from internal data sources, data products curated by other domains, or data shared by external entities.
  • The Customer and Finance domains are source-aligned domains that ingest and curate data from internal systems, have their own users, and curate data products to serve to other domains.
  • The Risk domain is an aggregate domain that sources data from the Customer and Finance domains to obtain customer profiles and augmented financial transactions, respectively. This data is used to build and train machine learning (ML) risk models and key performance indicators (KPIs) that are used by dashboards and are shared with the Marketing domain.
  • The Marketing domain is a consumer-aligned domain that sources customer profiles and risk propensity data exclusively from the Customer and Risk domains. This domain creates segmentation ML models that determine the best personalized offers. These are made available to internal applications by using inferencing APIs, and batch inferencing results are shared as a data product with partners that execute outbound campaigns.
  • All domains share a common data catalog that contains information about their data assets, data entities, and business glossaries.
  • Each data domain team and their data product owners maintain their specific data catalog objects. Security isolation is guaranteed by using Oracle Cloud Infrastructure Identity and Access Management policies that define which team can manage which data catalog entities.
  • Common data catalog entities, such as business glossary terms that are used across the organization, are maintained by a data governance body composed of all domain product owners.
  • Data products are marked in the data catalog so that they are searchable, contain their own semantics, and are related to the business glossary.
  • Data sharing is used to share live or versioned data products between domains. The choice of using either live or versioned data products depends on each data product and use case.
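
The dependencies between the four example domains form a directed graph: Customer and Finance feed Risk, and Customer and Risk feed Marketing. As a minimal sketch, assuming you want to derive an order in which data products could be refreshed, Python's standard `graphlib` module can sort that graph. The `UPSTREAM` mapping and `refresh_order` function are illustrative names only.

```python
from graphlib import TopologicalSorter

# Maps each domain to the domains whose data products it consumes,
# as described in the functional characteristics above.
UPSTREAM = {
    "Customer": set(),
    "Finance": set(),
    "Risk": {"Customer", "Finance"},
    "Marketing": {"Customer", "Risk"},
}


def refresh_order(upstream: dict[str, set[str]]) -> list[str]:
    """Return domains so that every producer precedes its consumers.

    Raises graphlib.CycleError if domains depend on each other in a loop.
    """
    return list(TopologicalSorter(upstream).static_order())


order = refresh_order(UPSTREAM)
```

The relative order of independent domains (here Customer and Finance) is not significant; only the producer-before-consumer constraint is guaranteed.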

The architecture's main functional components are:

  • Source-aligned domains: Customer and Finance

    These domains focus on curating customer and finance data that is derived from structured and unstructured data.

    The Customer domain uses the following capabilities to create a Customer Profiles data product:

    • Batch Ingest (Oracle Cloud Infrastructure Data Integration): Ingests data from CRM, website, and customer-facing applications.
    • Batch Processing (Oracle Cloud Infrastructure Data Integration, Oracle Cloud Infrastructure Data Flow): Processes structured and unstructured data by using low-code ELT, code-centric ETL, or both, to create the Customer Profiles data products.
    • Serving (Oracle Autonomous Data Warehouse): Curates and provides Customer Profiles data to the Risk and Marketing domains.
    • Cloud Storage/Data Lake (Oracle Cloud Infrastructure Object Storage): Stores customer documents, contracts or forms.
    • Visualize/Learn (Oracle Analytics Cloud): Serves domain end users augmented analytics, including customer-related KPIs such as lifetime value (LTV), retention rate, customer satisfaction score (CSAT), and net promoter score (NPS).
    • AI & Generative AI Services: Oracle Cloud Infrastructure Document Understanding extracts data from customer forms and documents and Oracle Cloud Infrastructure Language processes text data and enriches it with sentiment analysis, named entity recognition, or text classification.

    The Finance domain uses the following capabilities to create an Augmented Financial Transactions data product:

    • Real Time Ingest (Oracle Cloud Infrastructure GoldenGate): Captures financial transactions from the core banking system in near real time and in a non-intrusive way.
    • Batch Processing (Oracle Cloud Infrastructure Data Transforms): Uses low-code ELT to validate, shape, and transform raw data into a curated data product by categorizing and augmenting financial transactions data with spending categories, merchant details, or location data.
    • Serving (Oracle Autonomous Data Warehouse): Holds curated data and provides Augmented Transactions to the Risk domain.
    • Cloud Storage/Data Lake (Oracle Cloud Infrastructure Object Storage): Stores finance-related forms that are referenced in the financial transaction records stored in Oracle Autonomous Data Warehouse.
  • Aggregate domain: Risk

    This domain focuses on building, training and running machine learning models to detect risk based on internal data, such as customer profiles and augmented transactions, and external data such as economic and macroeconomic data.

    This domain has specialized SMEs in risk analysis and prevention and serves all the other domains that need its data products. The domain has internal users that consume augmented analytics, but the majority of their work is to share machine learning batch inferencing results. For example, batch inferencing might calculate the risk propensity of customers subscribing to financial services based on their lifestyle and spending, and on macroeconomic factors such as economic growth, inflation, or unemployment rate.

    This domain uses the following capabilities to create a risk propensity data product:

    • Serving (Oracle Autonomous Data Warehouse): Processes transformations and feature engineering to feed the ML models, stores the batch inferencing results, and produces risk-related KPIs. The Risk aggregate domain is a consumer of the customer profiles and augmented transactions data shared by the Customer and Finance domains, respectively. It provides risk propensity data to the Marketing domain.
    • Learn and Predict (Oracle Cloud Infrastructure Data Science): Covers the full machine learning operations lifecycle, from exploratory data analysis, model development, execution, to continuous improvement. It produces batch inferencing results that are the basis for the risk propensity shared data.
  • Consumer-aligned domain: Marketing

    This domain focuses on curating data to support personalized and targeted campaigns. It uses data shared from other domains as input, and it provides the segmentation and next best offer data in real time by using API-driven inferencing and by sharing data with third-party marketing partners that execute campaigns and share back the campaign execution results.

    This domain uses the following capabilities to create campaign segmentation data products:

    • Batch Processing (Oracle Cloud Infrastructure Data Transforms): Processes and shapes data consumed from the data shares. It can also be used to replicate data from the data shares into Oracle Autonomous Data Warehouse.
    • Serving (Oracle Autonomous Data Warehouse): Stores curated data, campaign information, segments, and targeted offers for a given campaign.
    • Cloud Storage/Data Lake (Oracle Cloud Infrastructure Object Storage): Stores any unstructured data used by the domain.
    • Visualize/Learn (Oracle Analytics Cloud): Serves domain end users augmented analytics such as campaign targets and execution KPIs.
    • Learn and Predict (Oracle Machine Learning): Covers the full machine learning operations lifecycle from exploratory data analysis to model deployment. Users leverage AutoML to speed up building and training models. Depending on the campaign, batch inferencing results are served by using data sharing to external partners that execute the campaigns, or models are served via Oracle Machine Learning deployments for real-time inferencing invoked by customer-facing applications.
    • API (Oracle Cloud Infrastructure API Gateway): Secures and governs the Oracle Machine Learning deployment API endpoints.
  • Shared services

    Services used by all domains for data governance and security include:

    • Data Governance (Oracle Cloud Infrastructure Data Catalog): Catalogs the business glossary and all domain data entities, categorizing which ones are data products so they can be discovered.
    • Data Security (Oracle Data Safe, OCI Audit, OCI Logging, OCI Vault): Increases the security posture of all domains.

Architecture Variant: Shared Deployment

A decentralized data platform doesn't necessarily require that cloud resources be completely decentralized for a given domain.

It is possible to have a decentralized platform running on a shared data platform, where a common set of service instances support the different data domain teams.

The primary architecture enables the highest level of isolation and flexibility for each domain and is highly scalable to address decentralized data platforms with a large number of domains. Requirements for a decentralized data platform may vary, and for specific use cases a different architecture pattern variant might be better suited.

The following diagram shows a shared deployment variation of the decentralized platform pattern.



decentralized-variant-shared-oracle.zip

A single Oracle Autonomous Data Warehouse instance is shared between all domains, which are isolated by using role-based access control (RBAC) and separate schemas. The data residing in the lake is also isolated for each domain by using Oracle Cloud Infrastructure Identity and Access Management policies and distinct compartments. Data products are curated within their respective schemas, cataloged, and shared by using live and versioned sharing.

For data ingestion and processing, Domains A and B use the same Oracle Cloud Infrastructure Data Integration and Oracle Cloud Infrastructure Data Flow instances and applications. Domains C and D have very specific requirements for data ingestion and processing and therefore have separate instances.

The same logic applies to the consumption layer, where Domains A and B share a single analytics cloud instance, segregated by using RBAC, while Domains C and D use their own service instances.

It is also possible to use a hybrid solution; instead of having a single instance for all domains or an instance per domain, some domains might be using a shared instance while others have a dedicated instance.

Such a hybrid solution is typically driven by non-functional requirements, such as performance, security, high availability, or disaster recovery. When these requirements are more demanding for some domains, separate instances address them without negatively impacting other domains' workloads.
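
The hybrid placement decision can be sketched as a simple rule: a domain gets a dedicated instance only when one of its non-functional requirements is more demanding than what the shared instance offers. The following minimal Python example uses hypothetical class, field, and function names, and reduces each requirement to a boolean flag purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class DomainRequirements:
    """Hypothetical flags for requirements that exceed the shared baseline."""
    name: str
    demanding_performance: bool = False
    demanding_security: bool = False
    demanding_ha_or_dr: bool = False


def placement(req: DomainRequirements) -> str:
    """Return 'dedicated' if any requirement exceeds the shared baseline."""
    demanding = (
        req.demanding_performance
        or req.demanding_security
        or req.demanding_ha_or_dr
    )
    return "dedicated" if demanding else "shared"


# Mirrors the example above: A and B share instances, C and D do not.
domains = [
    DomainRequirements("A"),
    DomainRequirements("B"),
    DomainRequirements("C", demanding_performance=True),
    DomainRequirements("D", demanding_ha_or_dr=True),
]
plan = {d.name: placement(d) for d in domains}
```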

Architecture Variant: Hub and Spoke

Often, large organizations with subsidiaries in different regions and countries need to run their data platforms independently, without a centralized data platform that serves all subsidiary workloads, while still needing to share data with headquarters for global visibility and key performance indicators (KPIs).

A decentralized data platform is a good solution for this scenario, where there is a hub (the headquarters) and several spokes (the subsidiaries) that need to interchange data securely and efficiently.

This variant uses geography as an example of a hub-and-spoke pattern, but the same pattern can also be applied to other examples, such as a holding company and its subsidiaries.

The spokes can be deployed in the same tenancy as the hub or in different tenancies.

The following diagram shows a hub and several spokes that are deployed in different regions and that use versioned shares, enabled by the Delta Sharing protocol, to exchange data. This diagram shows only the serving engine functional components. The rest of the functional architecture is similar to that shown in the primary functional architecture.



decentralized-variant-hub-spoke-oracle.zip

Because data is exchanged securely and is transmitted across regions over the internet, you should take latency into account. If data products shared between the spokes and the hub are aggregated data sets and KPIs, and not large volumes of granular data, then this pattern is simple to deploy, maintain, and operate.

An alternative approach is to use Oracle Autonomous Database Cloud Links, which allow seamless data sharing across instances, even if they are in other regions.

For cross-regional data sharing, the source Oracle Autonomous Data Warehouse instance must be cloned into the destination region so that it can be accessed seamlessly by the hub Autonomous Data Warehouse instance. Clones can be refreshed periodically, either manually or automatically, so that the hub Autonomous Data Warehouse can consume up-to-date data products shared by the spokes.

Because the hub will most likely consume data products that are a subset of the whole data set the spokes curate, the spokes can have a dedicated Autonomous Data Warehouse instance just to hold data products to be shared with the hub, optimizing the refreshable clone.

Network traffic for refreshable clones is routed through the Oracle backbone and has lower latency and higher bandwidth when moving large data products that reside on the spoke Autonomous Data Warehouse instances.

The choice between using versioned shares or cloud links is influenced primarily by performance and cost rather than by functional requirements.

Regardless of the option used, the hub and the spokes have their own local data platform which could use the decentralized approach shown in this architecture.

Architecture Variant: Heterogeneous Data Ecosystem

The primary reference architecture describes how to deploy a decentralized data platform for a single organization.

You can, however, use the same architecture to support a heterogeneous data ecosystem with different organizations sharing data using different technologies and for different purposes.

Use cases can include hospitals that share anonymized data with universities for research purposes, or suppliers sharing parts data with car manufacturers.

Organizations that use Oracle Autonomous Data Warehouse as the serving engine can provide and consume shared data from other technologies that support the Delta Sharing open protocol.

Delta Sharing is a good choice to support data ecosystems due to its broad support and due to the simplicity by which it provides and consumes data securely.
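
To make the consumer side of Delta Sharing more concrete, the following minimal Python sketch validates a bearer-token profile file and builds the table address that Delta Sharing clients use. The required field names and the `<profile>#<share>.<schema>.<table>` address format come from the open Delta Sharing protocol and its Python connector; the function names, file name, and share, schema, and table names are hypothetical, and a given provider may add further profile fields.

```python
import json

# Fields of a bearer-token profile as defined by the open protocol.
REQUIRED_FIELDS = {"shareCredentialsVersion", "endpoint", "bearerToken"}


def validate_profile(profile_text: str) -> dict:
    """Parse a profile file's JSON text and check the required fields."""
    profile = json.loads(profile_text)
    missing = REQUIRED_FIELDS - set(profile)
    if missing:
        raise ValueError(f"profile is missing fields: {sorted(missing)}")
    return profile


def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the address of a shared table for a Delta Sharing client."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Hypothetical names for a risk propensity data product.
url = table_url("risk.share", "risk_share", "products", "risk_propensity")
# With the open-source delta-sharing package installed, a consumer could
# then read the table with: delta_sharing.load_as_pandas(url)
```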

You can also share data by using other mechanisms, such as APIs or data streaming.

Physical Architecture

The physical architecture for this decentralized data platform supports the following:

  • Domain isolation using Oracle Cloud Infrastructure Identity and Access Management compartments and policies where the respective teams are only authorized to use and deploy cloud resources in their compartment
  • Domain deployment in their respective workload VCNs for a higher isolation level and increased security posture
  • Data ingestion, storage, processing, and serving processes managed by domain teams using cloud resources deployed in their compartment(s) and VCNs
  • Support for non-functional requirements such as scalability, high availability, disaster recovery, security, and service level objectives (SLOs) because each domain team uses separate cloud resources according to their specific domain requirements
  • Fine-grained cost control for each domain's cloud resource usage
  • Fully secure and private end-to-end traffic using private endpoints and instances deployed in private subnets

    It is also possible to have some services deployed with public endpoints on a per-domain basis while adhering to corporate security rules.

  • Data sharing enabled by Oracle Autonomous Data Warehouse using either live shares or versioned shares to serve up-to-date or versioned data, depending on the use case
  • Centralized data catalog for all domains, with the data catalog sub entities isolated per domain using Oracle Cloud Infrastructure Identity and Access Management policies, except for data products that need to be discoverable
  • Highly scalable deployment as each new domain can be onboarded by using infrastructure as code (IaC) automation without impacting the existing data domains
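
Compartment-based domain isolation is typically expressed as IAM policy statements of the form "Allow group ... to manage ... in compartment ...". As a minimal sketch, the following Python function generates such statements for one domain. The group and compartment naming scheme is hypothetical, and the resource family names shown should be checked against the OCI policy reference before use.

```python
# Assumed resource families for a domain's serving and lake resources;
# verify the exact names in the OCI policy reference.
DEFAULT_FAMILIES = ("autonomous-database-family", "object-family")


def domain_policies(domain: str, families=DEFAULT_FAMILIES) -> list[str]:
    """Build policy statements that confine a team to its own compartment.

    The '<domain>-data-team' group and '<domain>-domain' compartment
    names are illustrative conventions, not OCI requirements.
    """
    group = f"{domain}-data-team"
    compartment = f"{domain}-domain"
    return [
        f"Allow group {group} to manage {family} in compartment {compartment}"
        for family in families
    ]


statements = domain_policies("customer")
```

Generating the statements from a single function is one way to keep every newly onboarded domain consistent with the same isolation rules when the deployment is automated.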

The following diagram illustrates this reference architecture.



decentralized-data-platform-physical-oracle.zip

The physical architecture diagram depicts two domains to exemplify how the cloud networking and services are laid out for each domain. Typically, all domain networking and compartments are the same unless there is an exception driven by specific, non-functional requirements.

The design for the physical architecture:

  • Leverages a hub VCN and one VCN for each data domain that contains the workload for that domain
  • Leverages on-premises connectivity using both Oracle Cloud Infrastructure FastConnect and site-to-site VPN for redundancy
  • Routes all incoming traffic from on premises and from the internet first into the hub VCN and then into the data domain workload VCNs
  • Secures all data in transit and at rest
  • Deploys services with private endpoints to increase the security posture
  • Segregates VCNs into several private subnets to increase the security posture
  • Provides a compartment for each domain for resource isolation
  • Uses a dynamic routing gateway (DRG) so that cloud resources support inbound and outbound traffic to the other domains' VCNs
  • Places Autonomous Data Warehouse instances in the data private subnet for increased security; these instances can still provide and consume live and versioned shares from the other domains' Autonomous Data Warehouse instances if routes are established to enable that traffic
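
Routing traffic between the hub VCN and the domain VCNs generally requires that their address ranges do not overlap. The following minimal Python sketch carves one block per VCN out of a larger private range by using the standard `ipaddress` module. The `10.0.0.0/8` supernet, the `/16` block size, and the VCN names are assumptions for illustration and are not taken from the architecture.

```python
import ipaddress
from itertools import combinations, islice


def allocate_vcns(supernet: str, names: list[str], prefix: int = 16) -> dict:
    """Assign each VCN name the next free block of the given prefix length."""
    candidates = ipaddress.ip_network(supernet).subnets(new_prefix=prefix)
    blocks = list(islice(candidates, len(names)))
    if len(blocks) < len(names):
        raise ValueError("supernet is too small for the requested VCNs")
    return dict(zip(names, blocks))


def has_overlap(vcns: dict) -> bool:
    """Return True if any two allocated blocks overlap."""
    return any(a.overlaps(b) for a, b in combinations(vcns.values(), 2))


# Hypothetical layout: one hub VCN plus one workload VCN per domain.
vcns = allocate_vcns("10.0.0.0/8", ["hub", "customer", "finance"])
```

Each domain block can then be split further into private subnets, for example with `vcns["customer"].subnets(new_prefix=24)`.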

Potential design improvements not depicted in this deployment for simplicity's sake include:

  • Leveraging a full CIS-compliant landing zone
  • Deploying a network firewall in the hub VCN to improve the overall security posture by inspecting all traffic and by enforcing policies

Recommendations

The recommendations provided in this section focus specifically on decentralized data platforms and are additional to the recommendations provided in the data lakehouse reference architecture listed in the Explore More section.

Use the following recommendations as a starting point to share data securely. Your requirements might differ from the architecture described here.

Oracle Autonomous Data Warehouse

This architecture uses Oracle Autonomous Data Warehouse on shared infrastructure.

  • Use a medallion architecture for the lakehouse and create data products based on the silver (granular, augmented) and gold (enriched, aggregate) layers.
  • Consider sharing data products by using Autonomous Data Warehouse with its native support for heterogeneous data sharing to provide a simpler, more secure, and more reliable architecture.
  • Consider sharing external data, exposed in Autonomous Data Warehouse as external tables or hybrid tables, to benefit from the security features of versioned or live sharing.
  • Consider creating views for your data product tables to differentiate the base objects (tables) from the shared objects (views).
  • To increase security when sharing data with live shares, consider using name space and name values that are different from the underlying schemas and tables to hide internal object names.
  • To increase security when using live sharing with cloud links, have your data set registration administrator define the most restrictive data set scope for your use cases.
  • When using live sharing with cloud links, consider enabling caching for improving data consumer query performance.
  • When using live sharing with cloud links with a large volume of data products, consider offloading the queries to refreshable clones for improved data consumer performance and workload segregation.
  • If you have either a large number of domain Autonomous Data Warehouse instances or if your instance compute requirements are high, consider consolidating them into an elastic pool.
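
The recommendation to separate base objects (tables) from shared objects (views) can be sketched as a small generator for the view definitions. This Python example is illustrative only: the function name, the `_V` suffix, and the schema, table, and column names are hypothetical, and the generated statement is plain `CREATE OR REPLACE VIEW` SQL rather than any data sharing API call.

```python
import re

# Accept only simple unquoted identifiers to avoid building unsafe SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def shared_view_ddl(schema: str, base_table: str, columns: list[str],
                    suffix: str = "_V") -> str:
    """Build a view definition that exposes selected columns of a table."""
    for name in (schema, base_table, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    column_list = ", ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW {schema}.{base_table}{suffix} AS "
        f"SELECT {column_list} FROM {schema}.{base_table}"
    )


ddl = shared_view_ddl("CUSTOMER", "CUSTOMER_PROFILES", ["CUSTOMER_ID", "SEGMENT"])
```

Sharing the view instead of the table lets the domain change the base table, or expose only some of its columns, without changing what consumers see.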

OCI Object Storage

This architecture uses highly scalable and durable Oracle Cloud Infrastructure Object Storage as the lake storage.

Consider using multiple, granular compartments to organize the data domains and the teams within the data domains to help segregate their workloads with Oracle Cloud Infrastructure Identity and Access Management policies.

Oracle Cloud Infrastructure Data Catalog

This architecture uses Oracle Cloud Infrastructure Data Catalog to manage technical, business, and operational metadata for data products so that they are self-discoverable.

  • Consider using a single data catalog instance for all domains to centralize metadata and data products governance
  • Consider granting manage access to domain users for only their data assets
  • Consider granting read access to all users so they can find data products maintained across the organization
  • Consider using custom properties to enrich operational metadata with properties such as data product owner, availability, last updated date, and version
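
The operational metadata named in the last recommendation can be sketched as a small record that is flattened into name and value pairs before being attached to a catalog entry. This is a minimal Python illustration with hypothetical class, field, and function names; it does not call the Data Catalog API.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DataProductProperties:
    """Hypothetical operational metadata for one data product."""
    owner: str
    availability: str
    last_updated: date
    version: str


def as_custom_properties(props: DataProductProperties) -> dict[str, str]:
    """Flatten the record into string pairs suitable for custom properties."""
    return {key: str(value) for key, value in asdict(props).items()}


properties = as_custom_properties(
    DataProductProperties(
        owner="Risk domain product owner",
        availability="daily",
        last_updated=date(2024, 1, 31),
        version="1.2.0",
    )
)
```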

Data domains deployment

This architecture uses the Data Lakehouse pattern and available OCI services to support an end-to-end data, analytics, and AI workload.

  • Consider segregating domains by using separate VCNs for each domain to increase the security posture and domain flexibility when deploying cloud resources.
  • Consider segregating the different OCI services that each domain uses, leveraging compartments and IAM policies.

Data product sharing

  • If you need to serve data products by using APIs, consider using Oracle REST Data Services.
  • If you share data products by using Oracle REST Data Services, consider using Oracle Cloud Infrastructure API Gateway to secure the APIs.
  • If you need to stream data products, consider using Oracle Cloud Infrastructure GoldenGate and Oracle Cloud Infrastructure Streaming.

Acknowledgments

  • Author: José Cruz
  • Contributors: Massimo Castelli, Mike Blackmore, Larry Fumagalli, Robert Lies