Query External Data with ORC, Parquet, or Avro Source Files

Autonomous Data Warehouse makes it easy to access ORC, Parquet, or Avro data stored in object store using external tables. ORC, Parquet, and Avro sources have metadata embedded in them and the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure can utilize this metadata to simplify the creation of external tables.

You don’t need to know the structure of the data, DBMS_CLOUD can examine the file and convert ORC, Parquet, or Avro contents into the equivalent Oracle columns and data types. You only need to know the location of the data in object store, specify its type, ORC, Parquet, or Avro, and have credentials to access the source file on your object store.

Note:

The steps to use external tables are very similar for ORC, Parquet, and Avro. These steps show working with a Parquet format source file.

The source file in this example, sales_extended.parquet, contains Parquet format data. To query this file in Autonomous Data Warehouse, do the following:

  1. Store your object store credentials, to access the object store, using the procedure DBMS_CLOUD.CREATE_CREDENTIAL:
    BEGIN
      DBMS_CLOUD.CREATE_CREDENTIAL(
        credential_name => 'DEF_CRED_NAME',
        username => 'adwc_user@example.com',
        password => 'password' );
    END;
    /
    

    This operation stores the credentials in the database in an encrypted format. You can use any name for the credential name. Note that this step is required only once unless your object store credentials change. Once you store the credentials you can then use the same credential name for creating external tables.

    See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.

  2. Create an external table for ORC, Parquet, or Avro on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE.

    The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in the supported cloud object storage services, including: Oracle Cloud Infrastructure Object Storage, Azure Blob Storage, Amazon S3, and Amazon S3-Compatible, including: Oracle Cloud Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage. The credential is a table level property; therefore, the external files must be on the same object store.

    By default, the columns created in the external table automatically map their data types to Oracle data types for the fields found in the source files and the external table column names match the source field names.

    BEGIN
       DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name =>'sales_extended_ext',
        credential_name =>'DEF_CRED_NAME',
        file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
        format =>  '{"type":"parquet", "schema": "first"}'
     );
    END;
    /     
    

    The parameters are:

    • table_name: is the external table name.

    • credential_name: is the name of the credential created in the previous step.

    • file_uri_list: is a comma delimited list of the source files you want to query.

    • format: defines the options to describe the format of the source file. For a Parquet file, use the format parameter to specify the type parquet. For an Avro file use the format parameter to specify the type avro. For an ORC file use the format parameter to specify the type orc.

      Note:

      Support for ORC format requires Oracle Database 19c. The format option type with value orc is not supported with Oracle Database 18c.

    In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

    By default the format schema parameter is set and the columns and data types are derived automatically and the fields in the source match the external table columns by name. Source data types are converted to the external table column Oracle data types according to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The valid schema parameter values are:

    • first: Analyze the schema of the first ORC, Parquet, or Avro file that DBMS_CLOUD finds in the specified file_uri_list (first is the default value for schema).

    • all: Analyze all the schemas for all the ORC, Parquet, or Avro files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file’s metadata is the same. For example, File1 may contain a field called “address”, while File2 may be missing that field. Examining each file to derive the columns is a bit more expensive but may be required if the first file does not contain all the required fields.

    Note:

    If the column_list parameter is specified, then you provide the column names and data types for the external table and the schema value, if specified is ignored. Using column_list you can limit the columns in the external table. If column_list is not specified then the schema default value is first.
  3. You can now run queries on the external table you created in the previous step:
    DESC sales_extended_ext;
    Name           Null? Type           
    -------------- ----- -------------- 
    PROD_ID              NUMBER(10)     
    CUST_ID              NUMBER(10)     
    TIME_ID              VARCHAR2(4000) 
    CHANNEL_ID           NUMBER(10)     
    PROMO_ID             NUMBER(10)     
    QUANTITY_SOLD        NUMBER(10)     
    AMOUNT_SOLD          NUMBER(10,2)   
    GENDER               VARCHAR2(4000) 
    CITY                 VARCHAR2(4000) 
    STATE_PROVINCE       VARCHAR2(4000) 
    INCOME_LEVEL         VARCHAR2(4000)
    
    SELECT prod_id, quantity_sold, gender, city, income_level 
         FROM sales_extended_ext 
         WHERE ROWNUM < 8;
    
         PROD_ID     QUANTITY_SOLD     GENDER    CITY              INCOME_LEVEL
    1    13          1                 M         Adelaide          K: 250,000 – 299,999
    2    13          1                 M         Dolores           L: 300,000 and above
    3    13          1                 M         Cayuga            F: 110,000 – 129,999
    4    13          1                 F         Bergen op Zoom    C: 50,000 – 69,999
    5    13          1                 F         Neuss             J: 190,000 – 249,999
    6    13          1                 F         Darwin            F: 110,000 – 129,999
    7    13          1                 M         Sabadell          K:250,000 – 299,999
    

    This query shows values for rows in the external table. If you want to query this data frequently, after examining the data you can load it into a table with DBMS_CLOUD.COPY_DATA.