Data Platform - Data Warehouse with Complex Integration

Enterprise application data is often distributed in multiple systems across the enterprise and can't easily be integrated and analyzed to produce actionable insights.

This reference architecture provides a framework to enrich enterprise application data with raw data from other sources and to use machine learning models to bring intelligence and predictive insights into business processes.

This reference architecture positions the technology solution within the overall business context.



As departments consolidate data from multiple sources into data marts to gain targeted insights, the enterprise data warehouse must change and adapt to be able to leverage available data marts and other structured and unstructured sources.

Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources. This facilitates querying and analyzing historical data in a business-oriented format that can survive changes to transactional systems. Leveraging warehoused data for machine learning and predictive analysis is key to infusing intelligence into business processes. Intelligent business processes help proactively manage critical business events, such as recommending the right products on the right channel to the right customers, or detecting potentially fraudulent events.

Architecture

This architecture collects and combines application data for analysis and machine learning to provide actionable insights.



oci-polyglot-architecture-oracle.zip

The following diagram shows a mapping of the architecture above to services provided on Oracle Cloud Infrastructure (OCI) using best practices.



oci-polyglot-physical-arch-oracle.zip

The architecture focuses on the following logical divisions:

  • Ingest, Transform

    Ingests and refines the data for use in each of the data layers in the architecture.

  • Persist, Curate, Create

    Facilitates access to and navigation of the data to show the current and historical business view. It contains both raw data and granular and aggregated curated data. For relational technologies, data may be logically or physically structured in simple relational, longitudinal, dimensional, or OLAP forms. For non-relational data, this layer contains one or more pools of data, either output from an analytical process or data optimized for a specific analytical task.

  • Analyze, Learn, Predict

    Abstracts the logical business view of the data for the consumers. This abstraction facilitates agile approaches to development, migration to the target architecture, and the provision of a single reporting layer from multiple federated sources.

The architecture has the following components:

  • Batch Ingest

    Batch ingest is useful for data that can't be ingested in real time or is too costly to adapt for real-time ingestion. It is also important for transforming data into reliable and trustworthy information that can be curated and persisted for regular consumption. You can use the following services together or independently to achieve a highly flexible and effective data integration and transformation workflow.

    • Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage. ETL (extract, transform, load) leverages fully managed scale-out processing on Spark, and ELT (extract, load, transform) leverages the full SQL push-down capabilities of Autonomous Data Warehouse to minimize data movement and improve the time to value for newly ingested data. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. Oracle Cloud Infrastructure Data Integration also provides interactive exploration and data preparation, and helps data engineers protect against schema drift by defining rules to handle schema changes.

    • Oracle Data Transforms is based on Oracle Data Integrator (ODI) and can be deployed from Database Actions (Data Studio) in Oracle Autonomous Database. It provides a fully unified solution for building, deploying, and managing complex data warehouses, or for use in data-centric architectures in a SOA or business intelligence environment. In addition, it combines all of the elements of data integration (data movement, data synchronization, data quality, and data management) to ensure that information is timely, accurate, and consistent across complex systems.

      Oracle Data Integrator provides comprehensive data integration, from high-volume and high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. A declarative design approach ensures faster, simpler development and maintenance, and provides a unique approach to extract, load, transform (ELT) that helps guarantee the highest level of performance possible for data transformation and validation processes. Oracle Data Transforms uses a web interface to simplify the configuration and execution of ELT, and helps users build and schedule data flows and workflows using a declarative design approach.


  • Real-time ingest

    Oracle Cloud Infrastructure GoldenGate is a fully managed service that ingests data from sources residing on premises or in any cloud. It leverages GoldenGate change data capture (CDC) technology for non-intrusive, efficient capture of data, and delivers that data to Oracle Autonomous Data Warehouse in real time and at scale, making relevant information available to consumers as quickly as possible.

  • Autonomous Data Warehouse

    Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

    Bulk or cold-storage data that resides in object storage can be joined with warehouse data as external tables and hybrid partitioned tables.

    Autonomous Data Warehouse can use previously harvested metadata stored in the Data Catalog to create external tables, and can automatically synchronize metadata updates in the Data Catalog with the external tables definition to maintain consistency, simplify management, and reduce effort.

    In addition, Data Lake Accelerator, a component of Oracle Autonomous Database, can seamlessly consume object storage data, scale processing to deliver fast queries, autoscale the database compute instance when needed, and reduce the impact on the database workload by isolating object storage queries from the database compute instance.
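As a hedged sketch of the external table approach described above, the following Python snippet builds the DBMS_CLOUD.CREATE_EXTERNAL_TABLE call that exposes Object Storage files as a table in Autonomous Data Warehouse. The table name, credential, bucket URI, and columns are illustrative placeholders, not values from this architecture.

```python
# Sketch: define an external table in Autonomous Data Warehouse over CSV files
# in Object Storage using the DBMS_CLOUD package. All names and URIs below are
# placeholders for illustration.

def external_table_plsql(table_name: str, credential: str,
                         file_uri: str, columns: str) -> str:
    """Build the PL/SQL block that creates the external table."""
    return f"""
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => '{table_name}',
    credential_name => '{credential}',
    file_uri_list   => '{file_uri}',
    format          => json_object('type' value 'csv', 'skipheaders' value '1'),
    column_list     => '{columns}'
  );
END;"""

plsql = external_table_plsql(
    "SALES_EXT",
    "OBJ_STORE_CRED",  # created beforehand with DBMS_CLOUD.CREATE_CREDENTIAL
    "https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/sales/o/*.csv",
    "sale_id NUMBER, sale_date DATE, amount NUMBER",
)
print(plsql)
```

You would execute the generated block against the warehouse with a driver such as python-oracledb (for example, cursor.execute(plsql)); once created, the external table can be joined with warehouse tables in ordinary SQL.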

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and seldom access.
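The hot/cold split above can be sketched as a simple tiering policy. This is an illustrative assumption, not an OCI default: the 90-day threshold is arbitrary, and the tier names mirror the Object Storage Standard and Archive tiers.

```python
# Sketch: route data sets to "hot" (Standard) or "cold" (Archive) Object
# Storage tiers based on how recently they were accessed. The 90-day cutoff
# is an assumed policy for illustration only.
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=90)  # assumed threshold, tune per workload

def storage_tier(last_accessed: date, today: date) -> str:
    """Return the Object Storage tier this data set should live in."""
    return "Standard" if today - last_accessed <= HOT_WINDOW else "Archive"

print(storage_tier(date(2024, 6, 1), today=date(2024, 6, 30)))   # Standard
print(storage_tier(date(2023, 1, 1), today=date(2024, 6, 30)))   # Archive
```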

  • Analytics

    Oracle Analytics Cloud is a scalable and secure public cloud service that provides a full set of capabilities to explore and perform collaborative analytics for you, your workgroup, and your enterprise. It supports citizen data scientists and advanced business analysts in training and using machine learning (ML) models. Machine learning models can be executed in the analytics service or directly in Oracle Autonomous Data Warehouse as OML-embedded models for large-scale batch predictions that leverage the processing power, scalability, and elasticity of the warehouse.

    With Oracle Analytics Cloud you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.

  • Machine learning

    Oracle Machine Learning provides powerful machine learning capabilities tightly integrated in Oracle Autonomous Database, with support for Python and AutoML. It supports models using open source and scalable, in-database algorithms that reduce data preparation and movement. AutoML helps data scientists speed up time to value of the company’s machine learning initiatives by using auto algorithm selection, adaptive data sampling, auto feature selection, and auto model tuning.

    With Oracle Machine Learning services available in Oracle Autonomous Data Warehouse, you can not only manage models but also deploy them as REST endpoints, democratizing real-time predictions within the company and allowing the business to react to relevant events as they occur rather than after the fact.
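As a hedged sketch of calling such a REST endpoint, the snippet below builds a scoring request for a model deployed with Oracle Machine Learning Services. The host, model URI, and token are placeholders; the /omlmod/v1/deployment/{uri}/score path and the "inputRecords" payload shape follow the OML Services REST API, but verify them against your service's documentation.

```python
# Sketch: build (but do not send) a scoring request against a model deployed
# as a REST endpoint by OML Services. Host, model URI, token, and the record
# fields are placeholders for illustration.
import json
from urllib import request

OML_HOST = "https://myadw.adb.us-ashburn-1.oraclecloudapps.com"  # placeholder
MODEL_URI = "fraud_model"                                        # placeholder

def scoring_request(records: list, token: str) -> request.Request:
    """Build a POST request carrying the records to score."""
    body = json.dumps({"inputRecords": records}).encode()
    return request.Request(
        url=f"{OML_HOST}/omlmod/v1/deployment/{MODEL_URI}/score",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )

req = scoring_request([{"AMOUNT": 5000, "CHANNEL": "web"}], token="...")
print(req.full_url)
```

Sending the request with urllib.request.urlopen(req) (with a valid OAuth token) returns per-record predictions, which is what lets applications react to events in real time.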

  • Data science

    Data Science provides infrastructure, open source technologies, libraries, packages, and data science tools for data science teams to build, train, and manage machine learning (ML) models in Oracle Cloud Infrastructure. The collaborative and project-driven workspace provides an end-to-end, cohesive user experience and supports the lifecycle of predictive models.

    The Data Science Model Deployment feature allows data scientists to deploy trained models as fully managed HTTP endpoints that can provide predictions in real time, infusing intelligence into processes and applications and allowing the business to react to relevant events as they occur.
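A model deployment exposes a predict endpoint whose URL is derived from the deployment's OCID and region. The sketch below only assembles that URL; the OCID and region are placeholders, and actual calls must be signed with OCI request signing (for example, via the OCI SDK's signer).

```python
# Sketch: the invoke URL for an OCI Data Science model deployment. OCID and
# region are placeholders; requests to this endpoint require OCI request
# signing and are not sent here.

REGION = "us-ashburn-1"
DEPLOYMENT_OCID = "ocid1.datasciencemodeldeployment.oc1..example"  # placeholder

def predict_url(region: str, deployment_ocid: str) -> str:
    """Build the HTTP endpoint that serves real-time predictions."""
    return (f"https://modeldeployment.{region}.oci.customer-oci.com/"
            f"{deployment_ocid}/predict")

print(predict_url(REGION, DEPLOYMENT_OCID))
```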

  • Data Catalog

    Oracle Cloud Infrastructure Data Catalog provides visibility into technical assets, such as metadata and metadata attributes, and allows you to maintain a business glossary that is mapped to that technical metadata. Oracle Cloud Infrastructure Data Catalog also provides metadata to Autonomous Data Warehouse to facilitate external table creation in the data warehouse.

Recommendations

Use the following recommendations as a starting point to collect and combine application data for analysis and machine learning.

Your requirements might differ from the architecture described here.

  • Oracle Autonomous Data Warehouse

    This architecture uses Oracle Autonomous Data Warehouse on shared infrastructure. Enable auto scaling to give the database workloads up to three times the processing power.

    Consider using Oracle Autonomous Data Warehouse on dedicated infrastructure if you want the self-service database capability within a private database cloud environment running on the public cloud.

    Consider using the Hybrid Partitioned Tables feature of Autonomous Data Warehouse for data that is accessed infrequently and doesn't require the same level of performance. With this feature you can move partitions of data to object storage and combine them with partitions stored in Autonomous Data Warehouse for seamless delivery.

    Consider using the External Tables feature to consume data stored in object storage in real time without replicating it to Autonomous Data Warehouse. This allows the data warehouse to consume curated data regardless of the format (Parquet, Avro, ORC, JSON, CSV, and so on).

    Consider using Data Lake Accelerator when consuming object storage data to deliver a better, faster experience to users who consume and join data across the data warehouse and the data lake.
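As a hedged sketch of the hybrid partitioned table recommendation above, the following snippet builds a DBMS_CLOUD.CREATE_HYBRID_PART_TABLE call in which an older partition resolves to an Object Storage file while current data stays in the database. Table, credential, column, and URI values are illustrative placeholders, and the partitioning clause should be checked against the DBMS_CLOUD documentation for your database version.

```python
# Sketch: a hybrid partitioned table in Autonomous Data Warehouse. The 2022
# partition lives in Object Storage; the current partition lives in the
# database. All names and URIs are placeholders for illustration.

def hybrid_table_plsql(table_name: str, credential: str, archive_uri: str) -> str:
    """Build the PL/SQL block that creates the hybrid partitioned table."""
    return f"""
BEGIN
  DBMS_CLOUD.CREATE_HYBRID_PART_TABLE(
    table_name      => '{table_name}',
    credential_name => '{credential}',
    format          => json_object('type' value 'csv'),
    column_list     => 'sale_id NUMBER, sale_date DATE, amount NUMBER',
    partitioning_clause =>
      'PARTITION BY RANGE (sale_date) (
         PARTITION p_2022 VALUES LESS THAN (DATE ''2023-01-01'')
           EXTERNAL LOCATION (''{archive_uri}''),
         PARTITION p_current VALUES LESS THAN (MAXVALUE)
       )'
  );
END;"""

plsql = hybrid_table_plsql(
    "SALES_HYBRID", "OBJ_STORE_CRED",
    "https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/archive/o/sales_2022.csv",
)
print(plsql)
```

Queries against SALES_HYBRID then transparently span both the database-resident and object storage partitions.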

  • Oracle Machine Learning and Oracle Cloud Infrastructure Data Science model deployment

    This architecture leverages Oracle Machine Learning and Oracle Cloud Infrastructure Data Science to run predictions in real time to provide results to people and to applications.

    Consider deploying an API Gateway if real-time predictions are being consumed by partners and external entities in order to secure and govern the consumption of the deployed model.

  • Data Catalog

    To have a complete, holistic, end-to-end view of the data stored on and flowing through the platform, consider harvesting not only the data stores that support the data persistence layer but also the source data stores. Mapping this harvested technical metadata to the business glossary and enriching it with custom properties allows you to map business concepts, and to document and govern security and access definitions.

    To facilitate the creation of external tables in Autonomous Data Warehouse that virtualize data stored in object storage, leverage the previously harvested metadata stored in Oracle Cloud Infrastructure Data Catalog. This simplifies the creation of external tables, enforces consistency of metadata across data stores, and is less prone to human error.

Considerations

When collecting and combining application data and streaming event data for analysis and machine learning, consider the following implementation options.

Guidance: Recommended

  • Data Refinery: Oracle Cloud Infrastructure Data Integration; Oracle Cloud Infrastructure GoldenGate
  • Data Persistence Platform: Oracle Autonomous Data Warehouse; Oracle Cloud Infrastructure Object Storage
  • Access & Interpretation: Oracle Analytics Cloud; Oracle Cloud Infrastructure Data Science; Oracle Machine Learning

Guidance: Other Options

  • Data Refinery: Oracle Data Integrator; Oracle Autonomous Database Data Transforms
  • Data Persistence Platform: Oracle Exadata Database Service
  • Access & Interpretation: Third-party tools
Rationale

Oracle Cloud Infrastructure Data Integration provides a cloud native, serverless, fully managed ETL platform that is scalable and cost efficient.

Oracle Cloud Infrastructure GoldenGate provides a cloud native, serverless, fully managed, non-intrusive data replication platform that is scalable, cost efficient and that can be deployed in hybrid environments.

Oracle Autonomous Data Warehouse is an easy-to-use, fully autonomous database that scales elastically, delivers fast query performance, and requires no database administration. It also offers direct access to the data from object storage by using external or hybrid partitioned tables.

Oracle Cloud Infrastructure Object Storage stores unlimited data in raw format.

Oracle Analytics Cloud is fully managed and tightly integrated with the curated data in Oracle Autonomous Data Warehouse.

Oracle Cloud Infrastructure Data Science is a fully managed, self-service platform for data science teams to build, train, and manage machine learning (ML) models in Oracle Cloud Infrastructure. The Data Science service provides infrastructure and data science tools such as AutoML and model deployment capabilities.

Oracle Machine Learning is a fully managed, self-service platform for data science available with Autonomous Data Warehouse that leverages the processing power of the warehouse to build, train, test, and deploy ML models at scale without moving the data outside of the warehouse.

Deploy

The Terraform code for this reference architecture is available on GitHub. You can pull the code into Oracle Cloud Infrastructure Resource Manager with a single click, create the stack, and deploy it. Alternatively, download the code from GitHub to your computer, customize the code, and deploy it using the Terraform CLI.

  • Deploy by using Oracle Cloud Infrastructure Resource Manager:
    1. Click Deploy to Oracle Cloud.

      If you aren't already signed in, enter the tenancy and user credentials.

    2. Review and accept the terms and conditions.
    3. Select the region where you want to deploy the stack.
    4. Follow the on-screen prompts and instructions to create the stack.
    5. After creating the stack, click Terraform Actions, and select Plan.
    6. Wait for the job to be completed, and review the plan.

      To make any changes, return to the Stack Details page, click Edit Stack, and make the required changes. Then, run the Plan action again.

    7. If no further changes are necessary, return to the Stack Details page, click Terraform Actions, and select Apply.
  • Deploy by using the Terraform CLI:
    1. Go to GitHub.
    2. Download or clone the code to your local computer.
    3. Follow the instructions in the README.

Change Log

This log lists significant changes: