Learn About Hadoop-based Data Lakes

Oracle Big Data Service provides a Hadoop stack that includes Apache Ambari, Apache Hadoop, Apache HBase, Apache Hive, Apache Spark, and other services for working with and securing big data.

Big Data Service (BDS) makes it simple for enterprises to move workloads to the cloud while staying compatible with on-premises solutions. Moving data into Object Storage reduces storage costs and decouples compute resources from storage. You can access BDS using the OCI Console, OCI CLI, REST APIs, or SDKs, and you have full access to customize what is deployed on your BDS clusters.

Oracle Cloud SQL is an available add-on service that lets you run Oracle SQL queries on data in HDFS, Kafka, and Object Storage. Because any user, application, or analytics tool can work with these data stores directly, data movement is minimized and queries are faster. BDS also works with Data Integration, Data Science, and other analysis services. Developers can access data using Oracle SQL, so enterprises can eliminate data silos and ensure that data lakes are not isolated from other corporate data sources.

About Data Lakehouse

The Oracle Lakehouse pattern combines the best elements of data warehouses and data lakes. It provides an integrated platform of multiple Oracle cloud services that work together, with easy movement of data and unified governance, and it lets you use the best open source and commercial tools for your use cases and preferences.

Illustration: data-lake-house.png

Key elements of the Oracle Lakehouse pattern include:

Integration of data warehouse and data lake patterns.
Elimination of data silos – easy movement of data between warehouse and lake as needed.
Unified metadata and governance.
Support for popular open source and commercial tools.
Support for a wide variety of data sources, data formats, and data types (structured, semi-structured, and unstructured).
Support for diverse data consumers and workloads including big data analytics, SQL and BI, data science, and machine learning across all industries.

Key services in the platform that are used in this playbook include:

Big Data

Oracle Big Data Service provides clusters with a Hadoop environment. It simplifies the process of making Hadoop clusters both highly available and secure: it implements high availability and security based on Oracle's best practices, which reduces the need for advanced Hadoop skills. Big Data Service offers the commonly used Hadoop components, which makes it simple for enterprises to move workloads to the cloud and ensures compatibility with on-premises solutions.

Data Catalog

Oracle Cloud Infrastructure Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. Data catalogs are essential to an organization's ability to search for and find data to analyze. They help data professionals discover data and support data governance.

Use Data Catalog as a single collaborative environment to manage technical, business, and operational metadata. You can harvest technical metadata from a wide range of supported data sources that are accessible using public or private IP addresses. You can organize, find, access, understand, enrich, and activate this metadata. Use on-demand or schedule-based automatic harvesting to ensure that the data catalog always has up-to-date information. You benefit from all of the security, reliability, performance, and scale of Oracle Cloud.

Data Flow

Oracle Cloud Infrastructure Data Flow is a fully managed service for running Apache Spark applications. Data Flow applications are reusable templates consisting of a Spark application, its dependencies, default parameters, and a default runtime resource specification. You can manage all aspects of Data Flow and the application development lifecycle, including tracking and running Apache Spark jobs, by using the REST APIs through the API Gateway and available functions.
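The idea of an application as a reusable template can be illustrated with a small, self-contained Python sketch. This is not the Data Flow API: the class names, field names, file names, and shape names below are all hypothetical and exist only to show how a template's defaults might combine with per-run overrides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical runtime resource specification for one run."""
    driver_shape: str
    executor_shape: str
    num_executors: int


@dataclass(frozen=True)
class ApplicationTemplate:
    """Hypothetical model of a reusable Spark application template."""
    name: str
    spark_file: str                     # the Spark application itself
    default_resources: ResourceSpec     # default runtime resource specification
    dependencies: List[str] = field(default_factory=list)
    default_parameters: Dict[str, str] = field(default_factory=dict)

    def plan_run(
        self,
        parameters: Optional[Dict[str, str]] = None,
        resources: Optional[ResourceSpec] = None,
    ) -> dict:
        """Describe one run: template defaults plus per-run overrides."""
        # Build a new dict so the template's defaults are never mutated.
        merged = {**self.default_parameters, **(parameters or {})}
        return {
            "application": self.name,
            "spark_file": self.spark_file,
            "dependencies": list(self.dependencies),
            "parameters": merged,
            "resources": resources or self.default_resources,
        }


if __name__ == "__main__":
    # All values here are placeholders, not real shapes or paths.
    template = ApplicationTemplate(
        name="daily-aggregation",
        spark_file="aggregate.py",
        default_resources=ResourceSpec("example-shape", "example-shape", 2),
        dependencies=["helpers.zip"],
        default_parameters={"input": "raw/", "output": "curated/"},
    )
    print(template.plan_run(parameters={"input": "raw/2024-01-01/"}))
```

Because each run copies the defaults instead of changing them, the same template can be reused for many runs with different parameters or resources.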

Data Flow supports rapid application delivery by allowing developers to focus on application development. It provides log management and a runtime environment to run applications. You can integrate applications and workflows, and access APIs through the user interface. It eliminates the need to set up infrastructure, provision clusters, install software, or configure storage and security.

Autonomous Data Warehouse

Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

Data Integration

Oracle Cloud Infrastructure Data Integration is a fully managed, serverless cloud service to ingest and transform data for data science and analytics. With Oracle's Data Flow designer, Data Integration helps simplify complex extract, transform, and load (ETL/E-LT) processes that move data into data lakes and warehouses. It provides automated schema drift protection through rule-based integration flows, which helps you avoid broken integration flows and reduce maintenance as data schemas evolve.
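To make the idea of rule-based protection against schema drift concrete, here is a minimal generic sketch in Python. It does not reflect how Data Integration is implemented. The function names and column names are hypothetical. The assumption illustrated is that selecting attributes by a name pattern, rather than by a fixed list, lets a flow keep working when columns are added or removed upstream.

```python
import re
from typing import Dict, Iterable, List, Optional


def select_by_rule(
    columns: Iterable[str],
    include_pattern: str,
    exclude_pattern: Optional[str] = None,
) -> List[str]:
    """Return the columns whose full name matches the include rule
    and does not match the optional exclude rule (hypothetical helper)."""
    include = re.compile(include_pattern)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    selected = []
    for column in columns:
        if not include.fullmatch(column):
            continue
        if exclude is not None and exclude.fullmatch(column):
            continue
        selected.append(column)
    return selected


def project_row(
    row: Dict[str, object],
    include_pattern: str,
    exclude_pattern: Optional[str] = None,
) -> Dict[str, object]:
    """Keep only the fields of a record that the rule selects."""
    keep = select_by_rule(row.keys(), include_pattern, exclude_pattern)
    return {name: row[name] for name in keep}


if __name__ == "__main__":
    rule = r"id|amount_.*"
    # The second schema gains one column and loses another; the rule still applies.
    print(select_by_rule(["id", "amount_usd", "name"], rule))
    print(select_by_rule(["id", "amount_usd", "amount_eur"], rule))
```

A flow that hard-coded the column list would need an edit for each upstream change, while the pattern-based rule picks up a new `amount_eur` column without modification.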

Data Science

Oracle Cloud Infrastructure Data Science is a fully managed, serverless platform for data scientists to build, train, deploy, and manage machine learning models on Oracle Cloud Infrastructure. Data scientists can use the Accelerated Data Science (ADS) library, which Oracle has enhanced for Automated Machine Learning (AutoML), model evaluation, and model explanation.

Analytics

Oracle Analytics Cloud is a scalable and secure public cloud service that provides a full set of capabilities to explore and perform collaborative analytics for you, your workgroup, and your enterprise. With Oracle Analytics Cloud, you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.