Automate Loading Data to a Data Warehouse Using a Serverless Application

Oracle Functions is a serverless, highly scalable, fully managed Functions-as-a-Service platform built on Oracle Cloud Infrastructure and powered by the open-source Fn Project engine. Developers can use Oracle Functions to write and deploy code that delivers business value without worrying about provisioning or managing the underlying infrastructure. Oracle Functions is container-native, with functions packaged as Docker container images.

Architecture

This reference architecture shows how you can use a serverless function to automate the process of extracting data from files generated by various databases or applications and loading the data into a data warehouse for analysis.

The architecture diagram shows the data flow. When zipped CSV files are uploaded to a specific bucket in Oracle Cloud Infrastructure Object Storage, the Events service is triggered. The emitted event invokes a function, which extracts the data from the uploaded files and loads the data into an Oracle Autonomous Data Warehouse instance. After the data is loaded to the data warehouse, the processed CSV files are moved to a different bucket in Oracle Cloud Infrastructure Object Storage.

Description of the illustration autoload-adw.png

autoload-adw-oracle.zip

The architecture has the following components:

Region
An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).
Availability domains
Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.
Virtual cloud network (VCN) and subnets
A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

In this reference architecture, the function is deployed in a private regional subnet.
Events and Functions
Oracle Cloud Infrastructure services emit events, which are structured messages that describe the changes in resources. Events are emitted for create, read, update, or delete (CRUD) operations, resource lifecycle state changes, and system events that affect cloud resources.

Oracle Functions enable you to deploy your code, call it directly, or trigger it in response to events. Oracle Functions uses Docker containers hosted in Oracle Cloud Infrastructure Registry.

In this architecture, when a file is uploaded to a bucket in Oracle Cloud Infrastructure Object Storage, an event is emitted. The event invokes a Python-based function, which extracts data from the file and loads it to Oracle Autonomous Data Warehouse. The processed file is then moved to another bucket.
Object storage
Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and seldom or rarely access.
Autonomous Data Warehouse
Oracle Autonomous Data Warehouse is a self-driving, self-securing, self-repairing database service that is optimized for data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

Recommendations

Your requirements might differ from the architecture described here. Use the following recommendations as a starting point.

VCN
When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.
Functions
The function used in this reference architecture is coded using Python. Oracle Functions supports Java, Node, Go, and Ruby as well. Use a programming language of your choice.
Oracle Autonomous Data Warehouse
In the reference architecture, the function accesses the data warehouse by using Simple Oracle Document Access (SODA). You can implement other access methods, such as SQL with cx_oracle and Oracle Client libraries, by modifying the function.

In the reference architecture, the function and the data warehouse are in the same subnet. You can deploy them in separate subnets if required.

Considerations

Scalability
- Database
  You can manually scale the number of CPU cores of the database up or down at any time. The autoscaling feature of autonomous databases allows your database to use up to three times the current base number of CPU cores at any time. As demand increases, autoscaling automatically increases the number of cores in use. Autonomous databases allow you to scale the storage capacity at any time without affecting availability or performance.
- Functions
  Oracle Functions creates and removes function containers automatically based on the request load. You pay only when the functions are invoked and for the duration that they run.
Security
- Use Oracle Cloud Infrastructure Identity and Access Management (IAM) policies to control who can access your cloud resources and what operations can be performed.
  You can assign either only function-invocation permissions or both function-management and function-invocation permissions to specific groups.
  
  You can use IAM policies to also control access from Oracle Functions to Oracle Cloud Infrastructure Object Storage.
- In Oracle Cloud Infrastructure Object Storage, encryption is enabled by default and can’t be turned off.
- This reference architecture shows the data warehouse and the function in the same subnet. For enhanced isolation, you can use a separate subnet for the data warehouse.
- To protect the database passwords or any other secrets, consider using the Oracle Cloud Infrastructure Vault service.

Deploy

The Terraform code for this reference architecture is available in GitHub. You can pull the code into Oracle Cloud Infrastructure Resource Manager with a single click, create the stack, and deploy it. Alternatively, download the code from GitHub to your computer, customize the code, and deploy the architecture by using the Terraform command line interface (CLI).

Deploy by using Oracle Cloud Infrastructure Resource Manager:
1. Click
  If you aren't already signed in, enter the tenancy and user credentials.
2. Review and accept the terms and conditions.
3. Select the region where you want to deploy the stack.
4. Follow the on-screen prompts and instructions to create the stack.
5. After creating the stack, click Terraform Actions, and select Plan.
6. Wait for the job to be completed, and review the plan.
  To make any changes, return to the Stack Details page, click Edit Stack, and make the required changes. Then, run the Plan action again.
7. If no further changes are necessary, return to the Stack Details page, click Terraform Actions, and select Apply.
Deploy by using the Terraform CLI:
1. Go to GitHub.
2. Download or clone the code to your local computer.
3. Follow the instructions in the README.

Explore More

Change Log

This log lists only the significant changes:

December 6, 2021

Updated the Deploy section to deploy the latest revision of the stack in GitHub directly to Oracle Cloud Infrastructure Resource Manager with a single click.
Added the option to download editable versions (.SVG and .DRAWIO) of the architecture diagram.

July 27, 2021

Added steps to deploy the architecture by using Oracle Cloud Infrastructure Resource Manager.
Removed information about specific service limits for Functions and Streaming. For the current limits, see the Service Limits documentation.
Added information about scaling the database and functions in the Considerations section.