Using Data Lake Capabilities with Autonomous Database

Provides information on using Autonomous Database as a data lakehouse.

About SQL Analytics on Data Lakes with Autonomous Database

Data lakes are a key part of current data management architectures with data stored across object store offerings from Oracle, Amazon, Azure, Google, and other vendors.

Data lakes augment and complement data warehouses:

  • As a data processing engine for ETL: This allows you to reduce the data warehousing workload.

  • By storing data that may not be appropriate for a data warehouse: This includes log files, sensor data, IoT data, and so on. This source data tends to be voluminous with low information density, so storing it in an object store may be more appropriate than storing it in a data warehouse, while the information derived from it is ideal for SQL analytics.

  • For use with data science and business analytics: It is easy to upload files into the data lake and then use various processing methods over that data (Spark, Python, and so on).

    Business analysts using Autonomous Database can easily take advantage of these data sets without ETL. You can combine the data sets with data in your warehouse to gain new insights. For example, an analyst uploads third party customer demographic files to object storage and then immediately uses that data with data in the warehouse to perform customer segmentation analyses, blending the demographic data with existing customer and sales data.

Autonomous Database's deep integration with the data lake represents a new category in modern data management: the data lakehouse. Autonomous Database simplifies access to the data lake by providing rich and highly performant SQL access and embedded analytics, including: machine learning, graph, spatial, JSON, and more. This open access allows any SQL-based business intelligence tool or application to benefit by using data stored in multiple places without needing to understand the complexities of the data lake.

Figure: Data lake architecture (data-lake-architecture.png)

Integrating Autonomous Database and the Data Lake

Autonomous Database supports integrating with data lakes not just on Oracle Cloud Infrastructure, but also on Amazon, Azure, Google, and more. You have the option of loading data into the database or querying the data directly in the source object store. Both approaches use the same tooling and APIs to access the data. Loading data into Autonomous Database will typically offer significant query performance gains when compared to querying object storage directly. However, querying the object store directly avoids the need to load data and allows for an agile approach to extending analytics to new data sources. Once those new sources are deemed to have proven value, you have the option to load the data into Autonomous Database.

Security Credentials for Accessing Data in Object Stores

Autonomous Database supports multiple cloud services and object stores, including Oracle Cloud Infrastructure, Azure, AWS, Google, and others. The first step in accessing these stores is to ensure that security policies are in place. For example, you must specify authorization rules to read and/or write files in a bucket on object storage. Each cloud has its own process for specifying role-based access control.

Review the following to set up the proper policies for your object storage platform:

Note:

Security principals are not required for data that is stored publicly.

Autonomous Database uses a credential object to represent the identity to access a cloud service. See CREATE_CREDENTIAL Procedure for more information on security credentials that are used to connect to external sources.

This identity may be an Oracle Cloud Infrastructure user or resource principal, an Amazon Web Services ARN, an Azure service principal, or a Google service account. The credential is included with each service request issued by Autonomous Database, and the cloud resource authorizes the requested action. Without proper authorization, the service request fails, typically with a "not found" error.
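
For example, the following is a minimal sketch of creating a credential for an Oracle Cloud Infrastructure user with DBMS_CLOUD.CREATE_CREDENTIAL; the credential name, user name, and auth token are placeholders, and other clouds use the same procedure with their own key values.

    BEGIN
      -- The credential name, user name, and auth token below are placeholders.
      DBMS_CLOUD.CREATE_CREDENTIAL(
        credential_name => 'OBJ_STORE_CRED',
        username        => 'oci_user@example.com',
        password        => 'my_auth_token'
      );
    END;
    /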

Review the following blog posts to learn more about using the different types of cloud service credentials:

After security policies are in place, you can use either the Autonomous Database Tools or the DBMS_CLOUD APIs to access these services.

Use Data Studio to Load and Link Data

Data Studio is designed for the business user. It provides data loading, which efficiently loads data from the data source, and data linking, which creates external tables that are used to query data dynamically.

You start by creating a pointer to a cloud storage location, which is a bucket in an object store. Next, use the Database Tools to map object storage sources to target tables by dragging and dropping the data.

When creating the mapping, the schemas (columns and data types) for the underlying object store objects are automatically derived to simplify the job definition. Under the covers, Data Studio uses the Autonomous Database DBMS_CLOUD APIs to link to and load data.

See The Data Load Page for more information.

Load Data

You can use DBMS_CLOUD data loading APIs as part of your ETL processing flows. DBMS_CLOUD APIs simplify the loading process by encapsulating the complexities of dealing with different data sources and producing highly efficient loading jobs. You specify the source properties (for example, data location, file type, and columns for unstructured files) and their mapping to a target table. The documentation covers loading data from a variety of sources and debugging data loads, including reviewing the processing steps and any bad records that were encountered.

Loading data involves the following:

  • Creating credentials
  • Loading different types of files, including JSON, Data Pump, delimited text, Parquet, Avro, and ORC.
  • Replication using Oracle GoldenGate, and more.

Some of the key DBMS_CLOUD APIs for loading data are:

  • COPY_DATA Procedure: Loads data from object storage sources. Specify the source and how that source maps to a target table. The API supports a variety of file types, including: delimited text, Parquet, Avro, and ORC. There are numerous features for logging the processing. A usage sketch follows this list.

  • COPY_COLLECTION Procedure: Loads JSON data from object storage into collections.

  • SEND_REQUEST Function and Procedure: The most flexible approach, DBMS_CLOUD.SEND_REQUEST sends and processes REST requests to web service providers. Use DBMS_CLOUD.SEND_REQUEST in your own PL/SQL procedures to load data that is not available in an accessible file store (for example, weather data from government services).
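
As an illustration, the following sketch loads delimited text files from object storage into an existing table using DBMS_CLOUD.COPY_DATA; the target table, object storage URI, and credential name are hypothetical.

    BEGIN
      DBMS_CLOUD.COPY_DATA(
        table_name      => 'SALES_STAGE',    -- existing target table (hypothetical)
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/mybucket/o/sales/*.csv',
        format          => JSON_OBJECT('type' VALUE 'csv', 'skipheaders' VALUE '1')
      );
    END;
    /

Processing steps and any rejected rows are logged, which supports the load debugging described above.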

You can also use data pipelines for continuous, incremental data loading from your data lake (as data arrives in the object store, it is loaded into a database table). These pipelines leverage the same underlying loading capabilities provided by DBMS_CLOUD.

The DBMS_CLOUD_PIPELINE package offers a set of APIs to create, manage, and run a pipeline, including:

See Using Data Pipelines for Continuous Load and Export for more information.
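
The following is a minimal sketch of a load pipeline, assuming the DBMS_CLOUD_PIPELINE create, set-attribute, and start procedures; the pipeline name, location, target table, and format values are placeholders.

    BEGIN
      -- Create a pipeline that loads new files as they arrive in object storage.
      DBMS_CLOUD_PIPELINE.CREATE_PIPELINE(
        pipeline_name => 'SALES_LOAD_PIPELINE',
        pipeline_type => 'LOAD'
      );

      -- Point the pipeline at a bucket prefix and a target table (values are hypothetical).
      DBMS_CLOUD_PIPELINE.SET_ATTRIBUTE(
        pipeline_name   => 'SALES_LOAD_PIPELINE',
        attribute_name  => 'credential_name',
        attribute_value => 'OBJ_STORE_CRED');
      DBMS_CLOUD_PIPELINE.SET_ATTRIBUTE(
        pipeline_name   => 'SALES_LOAD_PIPELINE',
        attribute_name  => 'location',
        attribute_value => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/mybucket/o/sales/');
      DBMS_CLOUD_PIPELINE.SET_ATTRIBUTE(
        pipeline_name   => 'SALES_LOAD_PIPELINE',
        attribute_name  => 'table_name',
        attribute_value => 'SALES_STAGE');
      DBMS_CLOUD_PIPELINE.SET_ATTRIBUTE(
        pipeline_name   => 'SALES_LOAD_PIPELINE',
        attribute_name  => 'format',
        attribute_value => '{"type":"csv", "skipheaders":"1"}');

      -- Start the pipeline; it then runs on its configured interval.
      DBMS_CLOUD_PIPELINE.START_PIPELINE(
        pipeline_name => 'SALES_LOAD_PIPELINE'
      );
    END;
    /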

Link Data

Autonomous Database external tables are used to link to data. An external table accesses files in the object store directly without loading the data. This is ideal for cases where you want to quickly explore data that is in object storage to understand its value. Applications query external tables as they would any other table in Autonomous Database; the location of the source data is transparent. This means that any SQL-based tool or application can query across the data warehouse and data lake, finding new insights that would otherwise have been difficult to achieve.

External tables support a variety of file types, including: delimited text, JSON, Parquet, Avro, and ORC. Review the documentation to learn more about external table concepts, including the external table types, validating sources, and leveraging Oracle Cloud Infrastructure Data Catalog to automatically generate external tables.
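
For example, a minimal sketch of creating and querying an external table follows; the table name, column list, credential, and file URI are hypothetical.

    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name      => 'CUSTOMER_DEMOGRAPHICS_EXT',
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/mybucket/o/demographics/*.csv',
        column_list     => 'customer_id NUMBER, age_band VARCHAR2(20), income_band VARCHAR2(20)',
        format          => JSON_OBJECT('type' VALUE 'csv', 'skipheaders' VALUE '1')
      );
    END;
    /

    -- Query the external table just like any other table; no data is loaded.
    SELECT age_band, COUNT(*) AS customer_count
    FROM   customer_demographics_ext
    GROUP  BY age_band;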

Listed below are a few of the DBMS_CLOUD APIs for working with external tables:

  • CREATE_EXTERNAL_TABLE: Create an external table over object storage data.

  • CREATE_EXTERNAL_PART_TABLE: Create a partitioned external table over object storage data.

    Partitioned external tables deliver performance benefits at query time. Partition pruning eliminates the need to scan data files that are not required for processing. For example, you may have 12 partitions, one for each month in 2021. A query for March 2021 only scans the data for that one month, or 1/12 of the data, dramatically improving performance.

    The DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE procedure understands standard data lake partitioning patterns, where each partition's data is stored in its own folder. The procedure automatically creates partitions based on the underlying data organization. Use SYNC_EXTERNAL_PART_TABLE to update the table's partitions when the underlying data has changed. A usage sketch follows this list.

  • CREATE_HYBRID_PART_TABLE: Create a hybrid partitioned table.

    A hybrid partitioned table allows you to specify both partition data that should be stored in Autonomous Database and partition data to be stored externally. This allows you to keep "hot" data in the database for faster query performance and updates while archived data is stored in object store. Applications query the hybrid table without the need to understand where data resides.
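
The following is a hedged sketch of a partitioned external table whose partitions are derived from the folder structure, assuming Parquet files organized in hive-style folders (for example, sales/month=2021-01/); all names and URIs are hypothetical.

    BEGIN
      -- Partitions and columns are derived from the folder names and Parquet metadata.
      DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
        table_name      => 'SALES_EXT_PART',
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/mybucket/o/sales/*.parquet',
        format          => JSON_OBJECT('type' VALUE 'parquet', 'schema' VALUE 'first')
      );

      -- Later, refresh the partition list after new folders arrive in the bucket.
      DBMS_CLOUD.SYNC_EXTERNAL_PART_TABLE(
        table_name => 'SALES_EXT_PART'
      );
    END;
    /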

Query Data Lakehouse Using SQL

After you have integrated Autonomous Database with the data lake, you can use the full breadth of Oracle SQL for querying data across both the database and object storage.

Figure: Data lake features (data-lake-features.png)

The location of data is completely transparent to the application. The application simply connects to Autonomous Database and then uses the full Oracle SQL query language to query across your data sets.
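
For example, a single query can join a table stored in the database with an external table over object storage; the table names below are hypothetical.

    -- SALES is a regular database table; CUSTOMER_DEMOGRAPHICS_EXT is an external
    -- table over object storage files. The join syntax is the same either way.
    SELECT d.age_band,
           SUM(s.amount_sold) AS total_sales
    FROM   sales s
    JOIN   customer_demographics_ext d
           ON s.customer_id = d.customer_id
    GROUP  BY d.age_band
    ORDER  BY total_sales DESC;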

This allows you to:

  • Correlate information from the data lake and the data warehouse.
  • Access data from any SQL tool or application.
  • Preserve your investment in tools and skill sets.
  • Safeguard sensitive data using Oracle Database advanced security policies.

Figure: Autonomous Database lakehouse (adb_lakehouse_graph.png)

Advanced Analytics

Integrating the various types of data allows business analysts to apply Autonomous Database's built-in analytics across all of the data, without the need to deploy specialized analytic engines.

Using Autonomous Database as a data lakehouse eliminates costly data replication, security challenges, and administration overhead. Most importantly, it enables cross-domain analytics. For example, machine learning models can easily take advantage of spatial analytics to answer questions like, "What is the predicted impact of the movement of a hurricane on our product shipments?"
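
As a simple illustration of cross-domain analysis, the following hedged sketch uses a spatial operator to filter rows by proximity; it assumes hypothetical CUSTOMERS and WAREHOUSES tables that each store a point location in an SDO_GEOMETRY column (with a spatial index on CUSTOMERS.LOCATION).

    -- Find customers located within 50 miles of a given warehouse.
    SELECT c.customer_id, c.customer_name
    FROM   customers c, warehouses w
    WHERE  w.warehouse_id = 101
    AND    SDO_WITHIN_DISTANCE(
             c.location,                    -- SDO_GEOMETRY column on CUSTOMERS
             w.location,                    -- SDO_GEOMETRY column on WAREHOUSES
             'distance=50 unit=MILE') = 'TRUE';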

The following sections provide a snapshot of the advanced analytic capabilities that you can take advantage of in Autonomous Database.

Graph Analytics

Graph algorithms: Library of 50+ built-in algorithms for detecting and evaluating communities, link prediction, path-finding, ranking and walking

Graph queries:

  • Compute new metrics based on relationships between entities
  • Evaluate prediction results, discover new connections

Tooling for working with graphs:

  • Graph modeling tool to map relational data to graphs
  • Browser-based notebooks for interactive analysis and collaboration
  • Integrated graph visualization

See Oracle Graph for more information.

Spatial Analytics

Geospatial data processing:

  • Convert address data or place names to geospatial data
  • Prepare, validate and cleanse geospatial data

Geospatial data analysis: Categorize or filter based on location and proximity

Map visualization: Interactive analysis and reporting

See Oracle Spatial for more information.

Machine Learning

ML algorithms:

  • Fast model building and real-time scoring
  • Choice of languages: SQL, R and Python

Collaborative tooling:

  • No-code AutoML automates model building, tuning and deployment
  • Notebooks offer interactive analyses and visualizations
  • Share notebooks and templates

See Oracle Machine Learning for more information.

Analytic Views

Shared business models:

  • Define business hierarchies
  • Create metrics that leverage business hierarchies
  • Model shared by tools and applications (Oracle Analytics Cloud, Tableau, and so on)
  • SQL enhanced to leverage model

Tools for working with Analytic Views: Data analysis tool simplifies creation of multidimensional model

See Analytic Views for more information.

Collaborate Using Common Metadata

Historically, one of the major challenges in any data management solution is understanding the data itself.

You may have important questions about your data:

  • What data is contained in a table?
  • What is the meaning of a column?
  • What is the reliability of the data? If it's not accurate enough for financial reporting, is it useful for marketing purposes?
  • When was the data last updated?
  • What is the schema for files contained in object storage?

Oracle Cloud Infrastructure offers a centralized data catalog, Oracle Cloud Infrastructure Data Catalog (Data Catalog), that provides the answers to these questions. Data Catalog allows you to quickly discover data across Object Storage, Autonomous Database, Oracle Databases, MySQL, Kafka, Hive, and more.

Automated metadata harvesting, for example from table and column definitions, derives technical metadata from these sources. It is also possible to derive metadata from unstructured sources in the object storage data lake. Once the sources have been harvested, business analysts and data stewards apply business terms and categories to the data. You now have a collaborative environment for data providers/consumers to search assets based on names, business terms, tags, and custom properties.

Autonomous Database integrates with Data Catalog, simplifying the administrative process and promoting semantic consistency across the lakehouse. The data lake metadata harvested by Data Catalog is automatically synchronized with Autonomous Database. This allows business analysts to immediately query data in the object store and combine that information with data in Autonomous Database using their favorite SQL-based tools and applications.

See the documentation for details about how to set up the connection to Data Catalog.
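
The following is a heavily hedged sketch of connecting to a Data Catalog instance and running a synchronization using the DBMS_DCAT package; the region, catalog OCID, and the choice to synchronize all assets are placeholders.

    BEGIN
      -- Associate this Autonomous Database with a Data Catalog instance (values are placeholders).
      DBMS_DCAT.SET_DATA_CATALOG_CONN(
        region     => 'us-ashburn-1',
        catalog_id => 'ocid1.datacatalog.oc1..exampleuniqueid'
      );

      -- Synchronize the harvested metadata so external tables are generated automatically.
      DBMS_DCAT.RUN_SYNC(
        synced_objects => '{"asset_list": ["*"]}'
      );
    END;
    /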

Spark on Autonomous Database

There is no single processing or analytic engine for object storage data. However, one of the leading processing engines is Apache Spark.

Spark is a distributed data processing framework that is used heavily in ETL and data science workloads. Spark is used by Oracle Cloud Infrastructure services, including:

Autonomous Database offers deep Spark integration. Spark on Oracle is an open source project that optimizes access to Autonomous Database from Spark-based applications. The Spark catalog is extended with the database data dictionary; think of the database as simply an additional namespace that is available. Queries executed through Spark are automatically translated into highly efficient Oracle SQL and rewritten against Autonomous Database. This integration goes well beyond what most Spark connectors provide (that is, pushing projections and filters to the underlying system): Spark on Oracle can push down entire analysis pipelines containing all of the analytical functions and operators of Spark SQL. For example, over 90% of TPC-DS queries were fully pushed down to Oracle Database for processing.

View the blog for Spark on Oracle and the open source project.