Automate Loading Data to a Data Warehouse Using a Serverless Application

Oracle Functions is a serverless, highly scalable, fully managed Functions-as-a-Service platform built on Oracle Cloud Infrastructure and powered by the open-source Fn Project engine. Developers can use Oracle Functions to write and deploy code that delivers business value without worrying about provisioning or managing the underlying infrastructure. Oracle Functions is container-native, with functions packaged as Docker container images.

Architecture

This reference architecture shows how you can use a serverless function to automate the process of extracting data from files generated by various databases or applications and loading the data into a data warehouse for analysis.

The architecture diagram shows the data flow. When zipped CSV files are uploaded to a specific bucket in Oracle Cloud Infrastructure Object Storage, the Events service is triggered. The emitted event invokes a function, which extracts the data from the uploaded files and loads the data into an Oracle Autonomous Data Warehouse instance. After the data is loaded to the data warehouse, the processed CSV files are moved to a different bucket in Oracle Cloud Infrastructure Object Storage.

Description of autoload-adw.png follows
Description of the illustration autoload-adw.png

The architecture has the following components:

  • Region

    A region is a localized geographic area composed of one or more availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Virtual cloud network (VCN) and subnets

    A VCN is a software-defined network that you set up in an Oracle Cloud Infrastructure region. VCNs can be segmented into subnets, which can be specific to a region or to an availability domain. Both region-specific and availability domain-specific subnets can coexist in the same VCN. A subnet can be public or private.

    In this reference architecture, the function is deployed in a private regional subnet.

  • Events and Functions

    Oracle Cloud Infrastructure services emit events, which are structured messages that describe the changes in resources. Events are emitted for create, read, update, or delete (CRUD) operations, resource lifecycle state changes, and system events that affect cloud resources.

    Oracle Functions enable you to deploy your code, call it directly, or trigger it in response to events. Oracle Functions uses Docker containers hosted in Oracle Cloud Infrastructure Registry.

    In this architecture, when a file is uploaded to a bucket in Oracle Cloud Infrastructure Object Storage, an event is emitted. The event invokes a Python-based function, which extracts data from the file and loads it to Oracle Autonomous Data Warehouse. The processed file is then moved to another bucket.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and seldom or rarely access.

  • Autonomous data warehouse

    Oracle Autonomous Data Warehouse provides an easy-to-use, fully autonomous database that scales elastically, delivers fast query performance, and requires no database administration.

Recommendations

Your requirements might differ from the architecture described here. Use the following recommendations as a starting point.

  • VCN

    When you create the VCN, determine how many IP addresses your cloud resources in each subnet require. Using the Classless Inter-Domain Routing (CIDR) notation, specify a subnet mask and a network address range that's large enough for the required IP addresses. Use an address space that's within the standard private IP address blocks.

    Select an address range that doesn’t overlap with your on-premises network, so that you can set up a connection between the VCN and your on-premises network, if necessary.

  • Functions

    The function used in this reference architecture is coded using Python. Oracle Functions supports Java, Node, Go, and Ruby as well. Use a programming language of your choice.

  • Oracle Autonomous Data Warehouse

    In the reference architecture, the function accesses the data warehouse by using Simple Oracle Document Access (SODA). You can implement other access methods, such as SQL with cx_oracle and Oracle Client libraries, by modifying the function.

    In the reference architecture, the function and the data warehouse are in the same subnet. You can deploy them in separate subnets if required.

Considerations

  • Resource limits
    • The maximum object size is 10 TiB.
    • In a multipart object, the size of each part is limited to 50 GiB.
    • The Oracle Cloud Infrastructure Object Storage services uses the CIDR 134.70.0.0/17 for all regions.
    • You can create a maximum of 10 applications and 20 functions in a tenancy. If you need to create more applications or functions, you can request an increase in the service limit through the web console.
    • The maximum duration of a function is 2 minutes, and the maximum memory is 1 GB.
    • The Events service has a limit of 50 rules per tenancy.
  • Security
    • Use Oracle Cloud Infrastructure Identity and Access Management (IAM) policies to control who can access your cloud resources and what operations can be performed.

      You can assign either only function-invocation permissions or both function-management and function-invocation permissions to specific groups.

      You can use IAM policies to also control access from Oracle Functions to Oracle Cloud Infrastructure Object Storage.

    • In Oracle Cloud Infrastructure Object Storage, encryption is enabled by default and can’t be turned off.
    • This reference architecture shows the data warehouse and the function in the same subnet. For enhanced isolation, you can use a separate subnet for the data warehouse.
    • To protect the database passwords or any other secrets, consider using the Oracle Cloud Infrastructure Vault service.

Deploy

The code to deploy this reference architecture is available on GitHub.

  1. Go to GitHub.
  2. Copy the code to your local computer.
  3. Follow the instructions in the README document.