Deploy an Open-source Data Lakehouse on OCI

A data lakehouse is a modern, open architecture that enables you to store, understand, and analyze all your data. It combines the abilities of a data lake and a data warehouse to process a broad range of enterprise data for advanced analytics and business insights. A data lakehouse offers an architecture that eliminates data silos, enabling you to analyze data across your data estate. You can build a data lakehouse from the ground up on Oracle Cloud Infrastructure (OCI).

Architecture

This architecture introduces a platform topology, component overview, recommended best practices, and Terraform automation to deploy an open-source data lakehouse on OCI.

A data lakehouse can store and aggregate enterprise application data. Data can be sent to the data lake or to the data warehouse. The data which is present in the data lake can either be processed and loaded to the data warehouse, or it can be read directly from the data lake for advanced analytics.

The following diagram illustrates this reference architecture.

Description of open-source-data-lakehouse.png follows
Description of the illustration open-source-data-lakehouse.png

open-source-data-lakehouse-oracle.zip

In this data lakehouse architecture on OCI, Oracle MySQL HeatWave is used for data warehousing. Oracle MySQL HeatWave is the only MySQL cloud service with a built-in, high performance, in-memory query accelerator. For database admins and application developers, it’s the only service that enables them to run OLTP and OLAP workloads directly from their MySQL database. Since MySQL is optimized for OLTP, many MySQL implementations use a separate OLAP database for business analytics.

Oracle MySQL HeatWave increases MySQL performance by orders of magnitude for analytics and mixed workloads, without any changes to existing applications. Oracle MySQL HeatWave provides a single, unified platform for transactional and analytics workloads. This eliminates the need for the complex, time-consuming, expensive ETL and integration with a separate analytics database. The MySQL Autopilot in Oracle MySQL HeatWave automates provisioning, data loading, query execution, and failure handling, which saves developers and DBAs significant time.

Oracle Cloud Infrastructure Object Storage serves as the data lake in this architecture. OCI Object Storage enables an enterprise to store all of its data in a cost effective, elastic environment while providing the necessary processing, persistence, and analytic services to discover new business insights. With a data lake on OCI Object Storage, you can store and curate structured and unstructured data and use methods for organizing large volumes of highly diverse data from multiple sources.

The presented architecture contains the following open-source components:

  • Apache Zeppelin

    Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.

    Zeppelin is used for data science and data exploration in this architecture. In Zeppelin, you can create notebooks and leverage the Zeppelin interpreter concept, which allows any language or back-end data processing system to be plugged in. With connectivity established from Zeppelin to MySQL and to Object Storage, you can run joint queries and retrieve data from the data warehouse and the data lake simultaneously, in a true data lakehouse querying experience.

  • Grafana

    Grafana is the open-source platform in this architecture. Grafana is a popular web-application, written in TypeScript (front-end) and Go (back-end). It provides charts and graphs for the supported data sources, one of them being MySQL. Numerous plugins are available online to enable you to extend Grafana.

Zeppelin and Grafana make use of a Network File System (NFS), which is shared by two virtual machine instances placed in a private subnet. These instances exist in two distinct fault domains within an availability domain. The file system exists in a dedicated private subnet featuring a Network Security Group (NSG), which allows access to the mount target from all instances. Users are able to access Zeppelin and Grafana through an OCI Load Balancer, which is part of this highly available design.

This architecture uses Oracle Cloud Infrastructure Data Integration to load files from Object Storage to MySQL. MySQL, Data Integration and Data Science for Marine Life Workshop walks through how to set up and run data flows with Object Storage as the source and MySQL as the destination.

The architecture has the following OCI components:

  • Tenancy

    A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for Oracle Cloud Infrastructure. You can create, organize, and administer your resources in Oracle Cloud within your tenancy. A tenancy is synonymous with a company or organization. Usually, a company will have a single tenancy and reflect its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually only has one tenancy.

  • Compartment

    Compartments are cross-region logical partitions within an Oracle Cloud Infrastructure tenancy. Use compartments to organize your resources in Oracle Cloud, control access to the resources, and set usage quotas. To control access to the resources in a given compartment, you define policies that specify who can access the resources and what actions they can perform.

  • Policy

    An Oracle Cloud Infrastructure Identity and Access Management policy specifies who can access which resources, and how. Access is granted at the group and compartment level, which means you can write a policy that gives a group a specific type of access within a specific compartment, or to the tenancy.

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Fault domains

    A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.

  • Object Storage

    OCI Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data. You can safely and securely store or retrieve data directly from the internet or from within the cloud platform. Multiple management interfaces let you easily start small and scale seamlessly, without experiencing any degradation in performance or service reliability.

    Use Object Storage as a cold storage layer for the data warehouse by storing data that is used infrequently and then joining it seamlessly with the most recent data with Apache Zeppelin. Use archive storage for files that need to be retained for long periods of time and seldom or rarely accessed.

  • MySQL HeatWave

    Oracle MySQL Database Service is a fully-managed database service that lets developers quickly develop and deploy secure, cloud-native applications using the world’s most popular open source database. Oracle MySQL HeatWave is a new, integrated, high-performance, in-memory query accelerator for Oracle MySQL Database Service that accelerates MySQL performance for analytics and transactional queries.

  • Data Integration

    Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services. ETL (extract transform load) leverages fully-managed scale-out processing on Spark. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. OCI Data Integration provides interactive exploration and data preparation and helps data engineers protect against schema drift by defining rules to handle schema changes.

  • Load balancer

    The Oracle Cloud Infrastructure Load Balancing service provides automated traffic distribution from a single entry point to multiple servers in the back end.

  • Compute instances

    The Oracle Cloud Infrastructure Compute service enables you to provision and manage compute hosts in the cloud. You can launch compute instances with shapes that meet your resource requirements for CPU, memory, network bandwidth, and storage. After creating a compute instance, you can access it securely, restart it, attach and detach volumes, and terminate it when you no longer need it.

  • File Storage

    The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.

  • Internet gateway

    The internet gateway allows traffic between the public subnets in a VCN and the public internet.

  • Network address translation (NAT) gateway

    A NAT gateway enables private resources in a VCN to access hosts on the internet, without exposing those resources to incoming internet connections.

  • Network security group (NSG)

    NSGs act as virtual firewalls for your cloud resources. With the zero-trust security model of Oracle Cloud Infrastructure, all traffic is denied, and you can control the network traffic inside a VCN. An NSG consists of a set of ingress and egress security rules that apply to only a specified set of VNICs in a single VCN.

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.
  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select an address range that doesn’t overlap with your on-premises network or any other network, so that you can set up a connection between the VCN and your on-premises network, if necessary.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

  • Security

    Use policies to restrict who can access the Oracle Cloud Infrastructure (OCI) resources that your company has and how they can access them.

    Specific policies are required for a successful security implementation. Consider using Oracle Cloud Infrastructure Vault for additional protection of your keys, certificates and secrets.

    The Networking service offers two virtual firewall features that use security rules to control traffic at the packet level: security lists and network security groups (NSG). An NSG consists of a set of ingress and egress security rules that apply only to a set of VNICs of your choice in a single VCN. For example, you can choose all compute instances that act as web servers in the web tier of a multitier application in your VCN.

    NSG security rules function the same as security list rules. However, for an NSG security rule's source or destination, you can specify an NSG instead of a CIDR block. So, you can easily write security rules to control traffic between two NSGs in the same VCN or traffic within a single NSG. When you create a database system, you can specify one or more NSGs. You can also update an existing database system to use one or more NSGs.

  • Compute

    Pick shapes with the appropriate OCPUs and memory combination, and provision local NVMe and/or block storage according to need, for each instance. Consider using the available flexible shapes, which provide more flexibility in combining OCPUs and memory.

Considerations

When deploying a data lakehouse on Oracle Cloud Infrastructure (OCI), consider the following:

  • Integrating data sources

    For the Object Storage integration with MySQL, we chose Oracle Cloud Infrastructure Data Integration. If you want to use an open-source tool, Talend is an option. Talend was not included in this architecture because its download cannot be automated (the download requires you to sign-up for Talend).

  • Data Science and Machine Learning

    We chose open-source Apache Zeppelin as our platform for data science and data discovery. Alternatively, you can use Oracle Cloud Infrastructure Data Science, a key enabler of advanced data-driven insights and applications which provides data scientists with access to automated workflows, JupyterLab notebooks and hundreds of open source tools, and a streamlined approach to building models. It's a collaborative platform, by enabling teams to work together with ways to share and reproduce models in a structured, and a secure way to achieve enterprise-grade results. The platform is fully managed, meeting the needs of the modern enterprise.

  • Analytics and reporting

    In this architecture, Grafana was chosen as the analytics platform. Consider using Oracle Analytics Cloud, an advanced managed service that can comprehensively fulfill your analytics and reporting requirements.

  • Instance availability and scalability

    In this example, in the highly-available architecture, we show two instances in different fault domains within the same availability domain. You can choose to place the instances in different availability domains (in regions, where available), for even higher fault tolerance.

    An alternative approach, which represents a more dynamic adaptation to more demanding and varying performance needs, is the use of an instance pool and autoscaling.

    Instance pools enable you to provision and create multiple Oracle Cloud Infrastructure Compute instances based on the same configuration within the same region.

    Autoscaling enables you to automatically adjust the number of Compute instances in an instance pool based on performance metrics, such as CPU utilization. Autoscaling helps you provide consistent performance for users during periods of high demand and helps you reduce your costs during periods of low demand.

  • Database availability and scalability

    Consider deploying an additional Oracle MySQL DB System as a standby replica, for high-availability. You can place the replica in a different fault domain, availability domain, or region.

  • Database backups

    Oracle MySQL Database Service supports two types of backup: full backup of all data contained in the database system and incremental backup of only the data that has been added or changed since the last full backup. Backups run in the following ways:

    • Manual: An action in the Console or request made through the API initiates the backup. You can retain manual backups for a minimum of one day and a maximum of 365 days.
    • Automatic: Automatically scheduled backups run without any required interaction at a time of your choosing. Automatic backups are retained for between one and 35 days. The default retention value is seven days. When defined, you can’t edit the retention period of an automatic backup.
  • File storage

    In this example, we're leveraging the Oracle Cloud Infrastructure File Storage service. Alternatively, a multi-node architecture can feature a shared block volume. A block volume can be attached to multiple instances as read-write. In that case, a cluster aware system or solution such as Oracle Cluster File System version 2 (OCFS2) must be installed.

  • Security

    Consider using Oracle Cloud Guard to monitor and maintain the security of your resources in OCI proactively. Oracle Cloud Guard uses detector recipes that you can define to examine your resources for security weaknesses and to monitor operators and users for risky activities. When any misconfiguration or insecure activity is detected, Oracle Cloud Guard recommends corrective actions and assists with those actions, based on responder recipes that you can define. For resources that require maximum security, Oracle recommends that you use security zones. A security zone is a compartment associated with an Oracle-defined recipe of security policies that are based on best practices. For example, the resources in a security zone must not be accessible from the public internet and they must be encrypted using customer-managed keys. When you create and update resources in a security zone, OCI validates the operations against the policies in the security-zone recipe, and denies operations that violate any of the policies.

  • Application development

    Oracle Application Express (Oracle APEX), Oracle's low-code development platform, is an example of a tool that you can use to build scalable and secure enterprise apps on top of the data lakehouse.

    Using Oracle APEX, developers can quickly develop and deploy compelling apps that solve real problems and provide immediate value. You won't need to be an expert in a vast array of technologies to deliver sophisticated solutions. Focus on solving the problem and let Oracle APEX take care of the rest. It has world-class features and you can deploy it anywhere.

Deploy

The Terraform code for this reference architecture is available on GitHub. You can pull the code into Oracle Cloud Infrastructure Resource Manager with a single click, create the stack, and deploy it. Alternatively, you can download the code from GitHub to your computer, customize the code, and deploy the architecture by using the Terraform CLI.

  • Deploy using the sample stack in Oracle Cloud Infrastructure Resource Manager:
    1. Click Deploy to Oracle Cloud

      If you aren't already signed in, enter the tenancy and user credentials.

    2. Review and accept the terms and conditions.
    3. Select the region where you want to deploy the stack.
    4. Follow the on-screen prompts and instructions to create the stack.
    5. After creating the stack, click Terraform Actions, and select Plan.
    6. Wait for the job to be completed, and review the plan.

      To make any changes, return to the Stack Details page, click Edit Stack, and make the required changes. Then, run the Plan action again.

    7. If no further changes are necessary, return to the Stack Details page, click Terraform Actions, and select Apply.
  • Deploy using the Terraform code in GitHub:
    1. Go to GitHub.
    2. Clone or download the repository to your local computer.
    3. Follow the instructions in the README document.

Acknowledgments

Author: Nuno Goncalves