Deploy GoldenGate Stream Analytics for Production Workloads

Your data lake stores huge volumes of persisted data with great potential for analytics, but you also want a simple architecture for gaining real-time insights from the streaming data that your production workloads generate.

By deploying Oracle GoldenGate Stream Analytics (GGSA), Kafka, and Spark independently in a highly available configuration, you can process and enrich streaming data with the redundancy needed for a production-quality deployment and then output the data to a number of different platforms simultaneously.

This reference architecture positions the technology solution within the overall business context:

Description of the illustration data-driven-business-context.png

The data needs of a business evolve continuously. Internet of Things (IoT), geolocation, social media, and other data-streaming technologies have changed the way data is collected, stored, distributed, processed, and analyzed. An effective streaming data architecture must support a complex, event-driven data pipeline, real-time decision making, and different types of end points.

A data lake enables an enterprise to store all of its data in a cost effective, elastic environment while providing the necessary processing, persistence, and analytic services to discover new business insights. A data lake stores and curates structured and unstructured data and provides methods for organizing large volumes of highly diverse data from multiple sources.

With a data warehouse, you perform data transformation and cleansing before you commit the data to the warehouse. With a data lake, you ingest data quickly and prepare it on the fly as people access it. A data lake supports operational reporting and business monitoring that require immediate access to data and flexible analysis to understand what is happening in the business while it is happening.

This architecture brings concepts of a data lake and a data warehouse together to provide a complete data ecosystem that delivers the benefits of both.

At a conceptual level, the technology solution addresses the problem as follows:

Description of the illustration goldengate-streamed-data-overview.png

Architecture

This architecture combines the capabilities of a data lake and a data warehouse to process and analyze streaming data from production workloads in a highly available and redundant environment.


The following diagram illustrates the functional architecture.

Description of the illustration goldengate-streamed-data-functional-architecture.png

goldengate-streamed-data-functional-architecture-oracle.zip

The following diagram illustrates the architecture topology.

Description of the illustration goldengate-streamed-data-oci-architecture.png

goldengate-streamed-data-oci-architecture-oracle.zip

The architecture focuses on the following logical divisions:

  • Data refinery

    Ingests and refines the data for use in each of the data layers in the architecture. The shape is intended to illustrate the differences in processing costs for storing and refining data at each level and for moving data between them.

  • Data persistence platform (curated information layer)

    Facilitates access and navigation of the data to show the current business view. For relational technologies, data may be logically or physically structured in simple relational, longitudinal, dimensional, or OLAP forms. For non-relational data, this layer contains one or more pools of data, either output from an analytical process or data optimized for a specific analytical task.

  • Access and interpretation

    Abstracts the logical business view of the data for the consumers. This abstraction facilitates agile approaches to development, migration to the target architecture, and the provision of a single reporting layer from multiple federated sources.

The architecture has the following components:

  • Stream processing

    GoldenGate Stream Analytics processes and analyzes large-scale, real-time information by using sophisticated correlation patterns, enrichment, and machine learning. Users can explore real-time data through live charts, maps, and visualizations, and can graphically build streaming pipelines without any hand coding. These pipelines execute in a scalable and highly available clustered big data environment, using Spark integrated with Oracle's continuous query engine to address the critical real-time use cases of modern enterprises.

    GoldenGate Stream Analytics requires the following components for a production deployment of the service.

    • Spark

      Apache Spark is an open-source platform used to process large amounts of batch and streaming data.

    • Kafka

      Apache Kafka is an open-source publish-subscribe platform that provides high-throughput, low-latency messaging capabilities.

    • Autonomous Transaction Processing

      Oracle Autonomous Transaction Processing is a self-driving, self-securing, self-repairing database service that is optimized for transaction processing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

    • File storage

      The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.

  • Autonomous Data Warehouse

    Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and seldom or rarely access.

  • Analytics

    Oracle Analytics Cloud is a scalable and secure public cloud service that provides a full set of capabilities to explore and perform collaborative analytics for you, your workgroup, and your enterprise. With Oracle Analytics Cloud you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.
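The core pattern these components implement can be sketched in plain Python. The sketch below is illustrative only (it is not the GoldenGate Stream Analytics API): each streaming event is enriched by a lookup against cached reference data, and the enriched result is fanned out to several sinks at once, which is how the architecture delivers the same data to multiple platforms simultaneously. All names (enrich_event, the customer fields, the sink lists) are hypothetical.

```python
# Conceptual sketch of a stream-enrichment pipeline with fan-out to
# multiple sinks. Plain Python stand-ins, not a specific product API.

reference_cache = {  # reference data, e.g. cached from File Storage
    "C001": {"name": "Acme", "region": "EMEA"},
    "C002": {"name": "Globex", "region": "APAC"},
}

def enrich_event(event, cache):
    """Join one streaming event with cached reference data."""
    details = cache.get(event["customer_id"], {})
    return {**event, **details}

def run_pipeline(events, sinks):
    """Enrich each event and deliver it to every output sink."""
    for event in events:
        enriched = enrich_event(event, reference_cache)
        for sink in sinks:
            sink.append(enriched)  # stand-ins for Kafka, ADW, Object Storage

warehouse, object_store = [], []
run_pipeline(
    [{"customer_id": "C001", "amount": 42.0}],
    sinks=[warehouse, object_store],
)
print(warehouse[0]["region"])  # -> EMEA
```

In the real deployment, the enrichment lookup runs inside the Spark-based pipeline and the sinks are Kafka topics, Autonomous Data Warehouse tables, or Object Storage buckets rather than Python lists.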

Recommendations

Use the following recommendations as a starting point to process streaming data and data from a broad range of enterprise data resources for business analysis and machine learning.

Choosing the correct datastore is important when you design a fast-data or streaming-data architecture. The value of streaming data typically decreases over time, so if your goal is to act immediately on the insights it provides, choose a datastore that supports fast ingestion and low-latency access.

The recommendations in this architecture take the following into consideration:

  • Fast data ingest rates (inserts)
  • In-memory indexing for fast and efficient lookups
  • Near real-time analytics on all ingested data with online analytical processing (OLAP)
  • Integrated machine learning capabilities to “learn” from previous events
  • High availability and replication to provide continuous value to the business
  • Linear scalability by simply adding nodes

Your requirements might differ from the architecture described here.
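Two of the criteria above, fast inserts and in-memory indexing for efficient lookups, can be illustrated with a minimal sketch. This is plain Python for illustration, not any specific Oracle product: an append-only log gives O(1) ingest, and a dictionary index over log positions gives fast key lookups.

```python
# Minimal sketch of a datastore with append-only ingest (fast inserts)
# and an in-memory index for efficient lookups. Illustrative only.

class FastStore:
    def __init__(self):
        self.log = []    # append-only event log: O(1) inserts
        self.index = {}  # in-memory index: key -> log positions

    def insert(self, key, value):
        self.log.append((key, value))
        self.index.setdefault(key, []).append(len(self.log) - 1)

    def lookup(self, key):
        return [self.log[i][1] for i in self.index.get(key, [])]

store = FastStore()
store.insert("sensor-1", 21.5)
store.insert("sensor-2", 19.0)
store.insert("sensor-1", 22.1)
print(store.lookup("sensor-1"))  # -> [21.5, 22.1]
```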

  • File Storage

    File storage is used to capture and maintain reference information that needs to be cached to optimize usage during the streaming enrichment process.

  • Autonomous Transaction Processing

    Oracle Autonomous Transaction Processing is used instead of a MySQL database instance to capture and maintain the metadata of the data pipelines developed using GoldenGate Stream Analytics. We recommend Autonomous Transaction Processing because it removes all database management overhead from the customer.

  • Autonomous Data Warehouse

    Data science model building and analytics reporting that combine multiple sources generally require longer-term data storage (more than 1 week). We recommend the combination of Oracle Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage (rather than Kafka) to store the input and output of the GoldenGate Stream Analytics process.

  • Oracle Analytics Cloud

    Oracle Analytics Cloud provides a complete solution for your analytic and reporting needs.
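The Autonomous Data Warehouse recommendation above hinges on routing data by age: events inside a short retention window stay in the hot streaming path, while anything older lands in long-term storage. A hedged sketch of that routing decision, in plain Python with hypothetical names (the seven-day window and the hot/cold lists are assumptions standing in for the streaming tier and Autonomous Data Warehouse or Object Storage):

```python
# Sketch of age-based storage routing: fresh events stay in the hot
# path; events older than the retention window go to long-term storage.

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # assumption: ~1 week of hot retention

def route(events, now, hot, cold):
    """Send each (timestamp, payload) pair to the hot or cold tier."""
    for ts, payload in events:
        (hot if now - ts <= RETENTION else cold).append(payload)

now = datetime(2024, 1, 15, tzinfo=timezone.utc)
hot, cold = [], []
route(
    [(now - timedelta(days=1), "fresh"), (now - timedelta(days=30), "stale")],
    now, hot, cold,
)
print(hot, cold)  # -> ['fresh'] ['stale']
```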

Considerations

When processing streaming data and a broad range of enterprise data resources for business analysis and machine learning, consider these implementation options.

  • Recommended:
    • Data refinery: GoldenGate Stream Analytics (with Spark and Kafka)
    • Data persistence platform: Oracle Cloud Infrastructure Object Storage, GoldenGate Stream Analytics, Oracle Cloud Infrastructure File Storage, and Oracle Autonomous Data Warehouse
    • Access and interpretation: Oracle Analytics Cloud
  • Other options:
    • Data refinery: Oracle Cloud Infrastructure Streaming service
    • Data persistence platform: Oracle Database Exadata Cloud Service, Cloudera CDP instantiated on Oracle Cloud Infrastructure, or Oracle Cloud Infrastructure Big Data service
    • Access and interpretation: Third-party tools
Rationale

The install script can create customer-managed clusters for Spark and Kafka. You can also configure GoldenGate Stream Analytics to use existing Spark and Kafka installations. When deploying, you can also choose to use the Oracle Cloud Infrastructure Streaming service rather than a customer-managed Kafka instance. Choose the best option based on your business requirements.

Oracle Cloud Infrastructure Object Storage stores unlimited data in raw format.

Oracle Autonomous Data Warehouse is an easy-to-use, fully autonomous database that scales elastically, delivers fast query performance, and requires no database administration. It also offers direct access to data in object storage through external tables.

Oracle Analytics Cloud is fully managed and tightly integrated with the curated data layer (Oracle Autonomous Data Warehouse).

Deploy

The Terraform code for this reference architecture is available as a sample stack in Oracle Cloud Infrastructure Resource Manager. You can also download the code from GitHub, and customize it to suit your specific requirements.

Note:

You can deploy the full solution or the small solution:
  • The full solution deploys all of the components in the architecture diagram.
  • The small solution deploys only GoldenGate Stream Analytics with Kafka, Spark, and File Storage.
  • Deploy using the sample stack in Oracle Cloud Infrastructure Resource Manager:
    1. Click Deploy to Oracle Cloud - Full Solution or Deploy to Oracle Cloud - Small Solution.

      If you aren't already signed in, enter the tenancy and user credentials.

    2. Select the region where you want to deploy the stack.
    3. Follow the on-screen prompts and instructions to create the stack.
    4. After creating the stack, click Terraform Actions, and select Plan.
    5. Wait for the job to complete, and review the plan.

      To make any changes, return to the Stack Details page, click Edit Stack, and make the required changes. Then, run the Plan action again.

    6. If no further changes are necessary, return to the Stack Details page, click Terraform Actions, and select Apply.
  • Deploy using the Terraform code in GitHub:
    1. Go to GitHub - Full Solution or GitHub - Small Solution.
    2. Clone or download the repository to your local computer.
    3. Follow the instructions in the README document.

Explore More

Learn more about the features of this architecture.

Best practices framework for Oracle Cloud Infrastructure

Change Log

This log lists significant changes: