Recommended Patterns for Cloud-Based Data Lakes

Depending on your use case, data lakes can be built on Object Storage or Hadoop. Both scale and integrate seamlessly with existing enterprise data and tools. Choose either the Greenfield or the Migration pattern based on whether you plan a completely new implementation or want to migrate your existing Big Data solution to Oracle Cloud.

The following workflow shows you the recommended patterns based on your requirements.

[Figure: data-lake-solution-pattern.png — recommended data lake solution patterns]

Note:

In this document, we focus on the migration of Big Data Appliance (BDA) and Big Data Cloud Services (BDCS) clusters to OCI based on Cloudera Distribution of Hadoop (CDH). However, the recommendations here are applicable to other on-premises and cloud Hadoop distributions.

Build New Data Platform on Oracle Cloud (Greenfield)

You have two options for building data lakes in Oracle Cloud for Greenfield projects: use Big Data Service (BDS) for HDFS-based data lakes, or use OCI cloud native data services for Object Storage-based data lakes that don't use HDFS.

Cloud Native Data Services

Build a data lake in OCI Object Storage and use Cloud Native Data and AI services. These services include Data Flow, Data Integration, Autonomous Data Warehouse, Data Catalog, and Data Science, among others.

Oracle recommends these services to build a new data lake:

  • Object Storage as the data lake store for all kinds of raw data
  • Data Flow service for Spark batch processes and for ephemeral Spark clusters
  • Data Integration service for ingesting data and for ETL jobs
  • Autonomous Data Warehouse (ADW) for the serving and presentation layer
  • Data Catalog for data discovery and governance
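
For example, here's a minimal sketch of landing a raw file in an Object Storage bucket with the OCI Python SDK. The bucket name (raw-zone), the file name, and the object path are hypothetical placeholders.

import oci

# Load credentials from the default OCI config file (~/.oci/config).
config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

# Every tenancy has a single Object Storage namespace.
namespace = object_storage.get_namespace().data

# Upload a raw source file into the data lake's landing bucket.
with open("events-2024-01-01.json", "rb") as f:
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="raw-zone",                                # hypothetical bucket
        object_name="landing/events/events-2024-01-01.json",  # hypothetical path
        put_object_body=f,
    )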

Oracle also recommends these additional services:

  • Streaming service for managed ingestion of real-time data
  • Data Transfer Appliance (DTA) service for one-time bulk transfer of data
  • GoldenGate service for Change Data Capture (CDC) data and for streaming analytics
  • Data Science service for machine learning requirements
  • Oracle Analytics Cloud (OAC) service for BI, analytics, and reporting requirements
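
As an illustration of the Streaming service as a managed ingestion layer, here's a minimal sketch of publishing a real-time event with the OCI Python SDK. The stream OCID and the messages endpoint are hypothetical placeholders.

import base64
import oci

config = oci.config.from_file()

# Each stream pool exposes a regional messages endpoint (hypothetical below).
stream_client = oci.streaming.StreamClient(
    config,
    service_endpoint="https://cell-1.streaming.us-ashburn-1.oci.oraclecloud.com",
)

# Streaming requires base64-encoded message keys and values.
def encode(text):
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")

details = oci.streaming.models.PutMessagesDetails(
    messages=[
        oci.streaming.models.PutMessagesDetailsEntry(
            key=encode("device-42"),
            value=encode('{"temp": 21.5, "ts": "2024-01-01T00:00:00Z"}'),
        )
    ]
)

# The stream OCID below is a hypothetical placeholder.
stream_client.put_messages("ocid1.stream.oc1..exampleuniqueID", details)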

Big Data Service

Build your data lake in HDFS using Oracle Big Data Service (BDS). BDS provides the most commonly used Hadoop components, including HDFS, Hive, HBase, Spark, and Oozie.

Oracle recommends these services to build a new data lake using Hadoop clusters:

  • Data Integration service for ingesting data and for ETL jobs
  • Data Transfer Appliance (DTA) service for one-time bulk transfer of data
  • GoldenGate service for CDC data and for streaming analytics
  • Data Catalog service for data discovery and governance
  • Data Science service for machine learning requirements
  • OAC service for BI, analytics, and reporting requirements
  • BDS for HDFS and other Hadoop components
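
As an illustration, here's a minimal PySpark sketch of the same pattern on a BDS cluster: raw files land in HDFS and are curated into a Hive table. The HDFS path and table name are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("curate-raw-events")
    .enableHiveSupport()  # use the cluster's Hive metastore
    .getOrCreate()
)

# Read raw JSON files from the HDFS landing zone (hypothetical path).
raw = spark.read.json("hdfs:///data/landing/events/")

# Lightly curate the data, then persist it as a managed Hive table.
(raw.dropDuplicates()
    .write
    .mode("overwrite")
    .saveAsTable("datalake.curated_events"))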

Greenfield Pattern Workflow

When you build a new data lake, follow this workflow from requirements through testing and validation:

  1. Requirements: List the requirements for new environments in OCI
  2. Assessment: Assess the required OCI services and tools
  3. Design: Design your solution architecture and sizing for OCI
  4. Plan: Create a detailed plan mapping your time and resources
  5. Provision: Provision and configure the required resources in OCI
  6. Implement: Implement your data and application workloads
  7. Automate Pipeline: Orchestrate and schedule workflow pipelines for automation
  8. Test and Validate: Perform validation, functional, and performance testing for the end-to-end solution
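
For step 7, any scheduler or orchestrator that can call the OCI SDK can trigger the pipeline. Here's a minimal sketch of starting a Data Flow Spark run from Python; the compartment and application OCIDs are hypothetical placeholders.

import oci

config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

# Both OCIDs below are hypothetical placeholders.
run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..exampleuniqueID",
    application_id="ocid1.dataflowapplication.oc1..exampleuniqueID",
    display_name="nightly-curation-run",
    arguments=["--date", "2024-01-01"],
)

run = data_flow.create_run(run_details).data
print(run.id, run.lifecycle_state)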

Migrate an Existing Data Platform to Oracle Cloud

You can migrate your existing BDA, BDCS, and other Hadoop clusters from an on-premises or cloud environment to Oracle Cloud Infrastructure (OCI). Choose one of these vetted migration patterns to move your existing Hadoop clusters to Oracle cloud-based data lakes: Rebuild, Replatform, or Rehost.

Rebuild Pattern

Use the Rebuild pattern if you don't want to use Hadoop clusters and want to migrate to cloud native services in Oracle Cloud Infrastructure (OCI). Start with a clean slate to architect and begin implementing from scratch in OCI. Leverage managed, cloud native services for all major components in your stack. For example, build a stack using Data Flow, Data Catalog, Data Integration, Streaming, Data Science, ADW, and OAC.

Oracle recommends these services to migrate to a cloud-based data lake without Hadoop clusters:

  • Object Storage service as the data lake store for all kinds of raw data

    Note:

    You can use Object Storage with an HDFS connector as the HDFS store, in place of HDFS within the Hadoop or Spark cluster, as shown in the sketch after this list.
  • Data Integration service for ingesting data and for ETL jobs
  • Streaming service for managed ingestion of real-time data, which can replace your self-managed Kafka or Flume services
  • Data Transfer Appliance for one-time bulk transfer of data
  • GoldenGate for CDC data and for streaming analytics
  • Data Flow service for Spark batch processes and for ephemeral Spark clusters
  • ADW for the serving and presentation layer
  • Data Catalog service for data discovery and governance
  • Data Science service for machine learning requirements
  • OAC service for BI, analytics, and reporting requirements
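
As noted above, Spark can address Object Storage in place of HDFS. Here's a minimal sketch of a job that reads and writes oci:// paths, as a Data Flow application can do natively (Hadoop clusters need the HDFS connector). The bucket names, namespace, and paths are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebuild-on-object-storage").getOrCreate()

# Read raw data directly from an Object Storage bucket
# (oci://<bucket>@<namespace>/<path>; all values here are hypothetical).
raw = spark.read.json("oci://raw-zone@mytenancy/landing/events/")

# Write the curated result back to a separate curated bucket.
(raw.dropDuplicates()
    .write
    .mode("overwrite")
    .parquet("oci://curated-zone@mytenancy/events/"))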

Replatform Pattern

Use the Replatform migration pattern if you want to use Hadoop clusters on the cloud and replace some of the components with cloud native services. Use Big Data Service for HDFS and other Hadoop components, and redesign part of your stack using our additional managed cloud native services.

To use the Replatform pattern, you may need to redesign your stack:

  • Include serverless cloud native services along with BDS in OCI
  • Leverage managed cloud native services where possible

Based on your needs, you can replace some of your existing components with these services:

  • BDS for HDFS and other Hadoop components such as Hive, HBase, Kafka, and Oozie
  • Data Integration service for ingesting data and for ETL jobs
  • Data Transfer Appliance service for one-time bulk transfer of data
  • GoldenGate service for CDC data and for streaming analytics
  • Data Catalog service for data discovery and governance
  • Data Science service for machine learning requirements
  • OAC service for BI, analytics, and reporting requirements

Rehost Pattern

Migrate your BDA, BDCS, and other Hadoop clusters to build your data lake in HDFS using Big Data Service (BDS). You can use a lift-and-shift approach with the Rehost pattern. All the commonly used Hadoop components, including HDFS, Hive, HBase, Spark, and Oozie, are available in the managed Hadoop clusters provided by BDS.

Migration Pattern Workflow

When you migrate your data lake to Oracle Cloud, follow this workflow from requirements through cutover to the new environment:

  1. Discovery and requirements: Discover and catalog the current system to list the requirements for the new OCI environment
  2. Assessment: Assess the required OCI services and tools
  3. Design: Design your solution architecture and sizing for OCI
  4. Plan: Create a detailed plan mapping your time and resources
  5. Provision: Provision and configure the required resources in OCI
  6. Migrate Data: Transfer the data and metadata into the data storage of your selected OCI services (see the sketch after this list)
  7. Migrate workload: Migrate your workloads and applications to OCI services using the migration pattern you selected
  8. Automate Pipeline: Orchestrate and schedule workflow pipelines for automation
  9. Test and Validate: Plan functional and performance testing and validation for the final OCI environment
  10. Cut over: Turn off the source environment and cut over to using only the new OCI-based environment
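
For step 6, smaller data sets can be copied over the network. Here's a minimal sketch that bulk-uploads an exported directory tree to Object Storage with the OCI Python SDK's UploadManager. The local export path and bucket name are hypothetical placeholders; for very large estates, use Data Transfer Appliance instead.

import os
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

# UploadManager transparently switches to multipart uploads for large files.
upload_manager = oci.object_storage.UploadManager(object_storage)

# Walk a local export of the source cluster's data and upload each file,
# preserving the relative directory layout as the object name.
export_root = "/data/hdfs-export"  # hypothetical export location
for dirpath, _, filenames in os.walk(export_root):
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        object_name = os.path.relpath(local_path, export_root)
        upload_manager.upload_file(namespace, "migrated-data", object_name, local_path)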