Provides information on using Autonomous Database as a data lakehouse.
About SQL Analytics on Data Lakes with Autonomous Database
Data lakes are a key part of current data management architectures with data stored across object store offerings from Oracle, Amazon, Azure, Google, and other vendors.
Data lakes augment and complement data warehouses:
As a data processing engine for ETL: This allows you to reduce the data warehousing workload.
By storing data that may not be appropriate for a data warehouse: This includes log files, sensor data, IoT data, and so on. These source data tend to be voluminous with low information density. Storing this data in an object store might be more appropriate than in a data warehouse, while the information derived from the data is ideal for SQL analytics.
For use with data science and business analytics: It is easy to upload files into the data lake and then use various processing methods over that data (Spark, Python, and so on).
Business analysts using Autonomous Database can easily take advantage of these data sets without ETL. You can combine the data sets with data in your warehouse to gain new insights. For example, an analyst uploads third party customer demographic files to object storage and then immediately uses that data with data in the warehouse to perform customer segmentation analyses, blending the demographic data with existing customer and sales data.
Autonomous Database's deep integration with the data lake represents a new category in modern data management: the data lakehouse. Autonomous Database simplifies access to the data lake by providing rich and highly performant SQL access and embedded analytics, including: machine learning, graph, spatial, JSON, and more. This open access allows any SQL based business intelligence tool or application to benefit by using data stored in multiple places without needing to understand the complexities of the data lake.
Description of the illustration data-lake-architecture.png
Integrating Autonomous Database and the Data Lake
Autonomous Database supports integrating with data lakes not just on Oracle Cloud Infrastructure, but also on Amazon, Azure, Google, and more. You have the option of loading data into the database or querying the data directly in the source object store. Both approaches use the same tooling and APIs to access the data. Loading data into Autonomous Database will typically offer significant query performance gains when compared to querying object storage directly. However, querying the object store directly avoids the need to load data and allows for an agile approach to extending analytics to new data sources. Once those new sources are deemed to have proven value, you have the option to load the data into Autonomous Database.
Security Credentials for Accessing Data in Object Stores
Autonomous Database supports multiple cloud services and object stores, including Oracle Cloud Infrastructure, Azure, AWS, Google, and others. The first step in accessing these stores is to ensure that security policies are in place. For example, you must specify authorization rules to read and/or write files in a bucket on object storage. Each cloud has its own process for specifying role based access control.
Review the following to set up the proper policies for your object storage platform:
Note:Security principals are not required for data that is stored publicly.
Autonomous Database uses a credential object to represent the identity to access a cloud service. See CREATE_CREDENTIAL Procedure for more information on security credentials that are used to connect to external sources.
This identity may be an Oracle Cloud Infrastructure user or a resource principal, an Amazon Web Service ARN, an Azure service principal, or a Google service account. The credential is included with each Autonomous Database issued service request and the user action is authorized by the cloud resource. Without proper authorization, the service request fails, typically with a "not found" error.
Review the following blog posts to learn more about using the different types of cloud service credentials:
- Accessing Oracle Cloud Infrastructure Resources from Your Autonomous Database using Resource Principal
- Autonomous Database Now Supports Accessing the Object Storage with OCI Native Authentication
- Using role Amazon Resource Names (ARNs) to securely access AWS Resources from your Autonomous Database
- How to Easily Access Azure
Resources from Your Autonomous Database: Once the security policies
are in place, you can use both the Database Tools and
DBMS_CLOUDAPIs to access those services.
After security policies are in place you can use either the Autonomous
Database Tools or
APIs to access services.
Use Data Studio to Load and Link Data
Data Studio is designed for the business user and provides features for data loading that efficiently loads data from the data source and data linking that creates external tables that are used to query data dynamically.
You can start by creating a pointer to a cloud storage location. A cloud storage location is a bucket in an object store. Next, use the Database Tools to drag and drop the data to map object storage sources to target tables.
When creating the mapping, the schemas (columns and data types) for the
underlying object store objects are automatically derived to simplify the job
definition. Under the covers, Data Studio uses the Autonomous Database
DBMS_CLOUD APIs to link to and
See The Data Load Page for more information.
You can use
DBMS_CLOUD data loading APIs as part of your ETL
APIs simplify the loading process by encapsulating the complexities of dealing with
different data sources and producing highly efficient loading jobs. You specify the
source properties (for example, data location, file type, and columns for
unstructured files) and their mapping to a target table. The documentation describes
the details of loading data from a variety of sources and describes debugging data
loads, including the processing steps and bad records that were encountered.
Loading data involves the following:
- Creating credentials
- Loading different types of files, including JSON, data pump, delimited text, Parquet, Avro, and ORC.
- Replication using Oracle GoldenGate, and more.
Some of the key
DBMS_CLOUD APIs for loading data are:
COPY_DATA Procedure: Loads data from object storage sources. Specify the source and how that source maps to a target table. The API supports a variety of file types, including: delimited text, Parquet, Avro, and ORC. There are numerous features for logging the processing.
COPY_COLLECTION Procedure: Loads JSON data from object storage into collections.
SEND_REQUEST Function and Procedure: The most flexible approach,
DBMS_CLOUD.SEND_REQUESTsends and processes REST requests to web service providers. Use
DBMS_CLOUD.SEND_REQUESTin your own PL/SQL procedures to load data that is not available in an accessible file store (for example, weather data from government services).
You can also use data pipelines for continuous incremental data loading
from your data lake (as data arrives on object store it is loaded to a database
table). Data import pipelines leverage the same underlying loading capabilities
package offers a set of APIs to create, manage, and run a pipeline, including:
CREATE_PIPELINE Procedure: Creates a new pipeline.
SET_ATTRIBUTE Procedure: Allows you to specify pipeline properties, including: a description of the source data and how it maps to a target, the load interval, the job priority and more.
START_PIPELINE Procedure: Starts a pipeline for continuous incremental data loading.
See Using Data Pipelines for Continuous Load and Export for more information.
Autonomous Database external tables are used to link to data. An external table accesses the files in object store directly without loading the data. This is ideal for cases where you may want to quickly explore data that is in object storage to understand its value. Applications query external tables as they would any other table in Autonomous Database, the location of the source is transparent. This means that any SQL-based tool or application can query across the data warehouse and data lake, finding new insights that would have otherwise been difficult to achieve.
External tables support a variety of file types, including: delimited text, JSON, Parquet, Avro, and ORC. Review the documentation to learn more about external concepts, including the external table types, validating sources, and leveraging Oracle Cloud Infrastructure Data Catalog to automatically generate external tables.
Listed below are a few of the
DBMS_CLOUD APIs for working with external tables:
CREATE_EXTERNAL_TABLE: Create an external table over object storage data.
CREATE_EXTERNAL_PART_TABLE: Create a partitioned external table over object storage data.
Partitioned external tables deliver performance benefits at query time. Partition pruning eliminates the need to scan data files that are not required for processing. For example, you may have 12 partitions, one for each month in 2021. A query for March 2021 only scans the data for that one month, or 1/12 of the data, dramatically improving performance.
DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLEprocedure understands standard data lake partitioning patterns, where each partition's data is stored in its own folder. The procedure automatically creates partitions based on the underlying data organization. Use SYNC_EXTERNAL_PART_TABLE to update the table's partitions when the underlying data has changed.
CREATE_HYBRID_PART_TABLE: Create a hybrid partitioned table.
A hybrid partitioned table allows you to specify both partition data that should be stored in Autonomous Database and partition data to be stored externally. This allows you to keep "hot" data in the database for faster query performance and updates while archived data is stored in object store. Applications query the hybrid table without the need to understand where data resides.
Query Data Lakehouse Using SQL
After you have integrated Autonomous Database with the data lake, you can use the full breadth of Oracle SQL for querying data across both the database and object storage.
Description of the illustration data-lake-features.png
The location of data is completely transparent to the application. The application simply connects to Autonomous Database and then uses all of the Oracle SQL query language to query across your data sets.
This allows you to:
- Correlate information from data lake and data warehouse.
- Access data from any SQL tool or application.
- Preserve your investment in tools and skill sets.
- Safeguard sensitive data using Oracle Database advanced security policies.
Description of the illustration adb_lakehouse_graph.png
Integrating the various types of data allows business analysts to apply Autonomous Database's built-in analytics across all data and you do not need to deploy specialized analytic engines.
Using Autonomous Database set up as a data lakehouse eliminates costly data replication, security challenges and administration overhead. Most importantly, it allows cross-domain analytics. For example, machine learning models can easily take advantage of spatial analytics to answer questions like, "What is the predicted impact of the movement of a hurricane on our product shipments?"
The following table provides a snapshot of the advanced analytic capabilities that you can take advantage of in Autonomous Database.
|Advanced Analytics Type
Graph algorithms: Library of 50+ built-in algorithms for detecting and evaluating communities, link prediction, path-finding, ranking and walking
Tooling for working with graphs:
See Oracle Graph for more information.
Geospatial data processing:
Geospatial data analysis: Categorize or filter based on location and proximity
Map visualization: Interactive analysis and reporting
See Oracle Spatial for more information.
See Oracle Machine Learning for more information.
Shared business models:
Tools for working with Analytic Views: Data analysis tool simplifies creation of multidimensional model
See Analytic Views for more information.
Collaborate Using Common Metadata
Historically, one of the major challenges in any data management solution is understanding the data itself.
You may have important questions about your data:
- What data is contained in a table?
- What is the meaning of a column?
- What is the reliability of the data? If it's not accurate enough for financial reporting, is it useful for marketing purposes?
- When was the data last updated?
- What is the schema for files contained in object storage?
Oracle Cloud Infrastructure offers a centralized data catalog, Oracle Cloud Infrastructure Data Catalog (Data Catalog), that provides the answers to these questions. Data Catalog allows you to quickly discover data across Object Storage, Autonomous Database, Oracle Databases, MySQL, Kafka, Hive, and more.
Automated metadata harvesting, for example from table and column definitions, derives technical metadata from these sources. It is also possible to derive metadata from unstructured sources in the object storage data lake. Once the sources have been harvested, business analysts and data stewards apply business terms and categories to the data. You now have a collaborative environment for data providers/consumers to search assets based on names, business terms, tags, and custom properties.
Autonomous Database integrates with Data Catalog, simplifying the administrative process and promoting semantic consistency across the lakehouse. The data lake metadata harvested by Data Catalog is automatically synchronized with Autonomous Database. This allows business analysts to immediately query data in the object store and combine that information with data in Autonomous Database using their favorite SQL based tool and application.
Spark on Autonomous Database
There is no single processing or analytic engine for object storage data. However, one of the leading processing engines is Apache Spark.
Spark is a distributed data processing framework that is used heavily in ETL and data science workloads. Spark is used by Oracle Cloud Infrastructure services, including:
- Oracle Cloud Infrastructure Data Flow: fully managed Apache Spark service.
- Oracle Cloud Infrastructure Data Integration: a fully managed cloud service that helps you with common extract, load, and transform (ETL) tasks such as ingesting data from different sources, cleansing, transforming, and reshaping that data, and then efficiently loading it to target data sources.
- Oracle Cloud Infrastructure Data Science: machine learning (ML) service that offers JupyterLab notebook environments and access to hundreds of popular open source tools and frameworks.Spark-based.
There is deep Spark integration with Autonomous Database. Spark on Oracle is an open source project that optimizes access to Autonomous Database from Spark-based applications. The Spark catalog is extended with the database data dictionary – think of the database as simply an additional namespace that is available. Queries that are executed thru Spark are automatically translated into highly efficient Oracle SQL by rewriting Spark against Autonomous Database. This means that on Autonomous Database the integration goes well beyond what most Spark 'connectors' provide (that is pushing projections and filters to the underlying system). Spark on Oracle is able to entirely push complex analysis pipelines containing all the analytical functions and operators of Spark SQL. For example, over 90% of TPC-DS queries were fully pushed down to Oracle Database for processing.