Query External Partitioned Data with Folder Format Source File Organization

Use DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE to create an external partitioned table and generate the partitioning information from the Cloud Object Store file path.

When you create an external table with folder format data files, you have two options for specifying the types of the partition columns:

  • You can manually specify the columns and their data types with the column_list parameter (a minimal sketch of this option follows this list). See Query External Partitioned Data with Hive Format Source File Organization for another example using the column_list parameter.

  • You can let DBMS_CLOUD derive the data file columns and their types from the information in structured data files such as Avro, ORC, and Parquet. In this case, you use the partition_columns option of the format parameter to supply the names and data types of the partition columns, and you do not need to supply the column_list or field_list parameters.
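
For the first option, the following is a minimal sketch of a folder-format call that supplies the column types through column_list; the table name MYSALES_CSV, the CSV field layout (product and units), and the *.csv object names are illustrative assumptions. Because column_list already carries the data types, the partition_columns format option lists only the partition column names, in the same order as the folder levels.

    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
        table_name      => 'MYSALES_CSV',     -- hypothetical table name
        credential_name => 'DEF_CRED_NAME',
        -- hypothetical CSV objects organized as sales/<country>/<year>/<month>/
        file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales/*.csv',
        -- columns inside the files plus the columns derived from the object name
        column_list     => 'product VARCHAR2(100), units NUMBER, country VARCHAR2(100), year NUMBER, month VARCHAR2(2)',
        field_list      => 'product, units',  -- only the fields stored inside the files
        format          => json_object('type' value 'csv',
                                       'partition_columns' value json_array('country', 'year', 'month'))
        );
    END;
    /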

Consider the following sample source files in Object Store:

.../sales/USA/2020/01/sales1.parquet

.../sales/USA/2020/02/sales2.parquet

To create a partitioned external table with the Cloud Object Store file path defining the partitions from files with this sample folder format, do the following:

  1. Store your object store credentials using the procedure DBMS_CLOUD.CREATE_CREDENTIAL.

    For example:

    BEGIN
      DBMS_CLOUD.CREATE_CREDENTIAL (
        credential_name => 'DEF_CRED_NAME',
        username => 'adb_user@example.com',
        password => 'password' );
    END;
    /
    

    Creating a credential to access Oracle Cloud Infrastructure Object Store is not required if you enable resource principal credentials. See Use Resource Principal to Access Oracle Cloud Infrastructure Resources for more information.

    This operation stores the credentials in the database in an encrypted format. You can use any name for the credential. Note that this step is required only once unless your object store credentials change. Once you store the credentials, you can use the same credential name for creating external tables.

    See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.
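
    Optionally, you can confirm that the credential was stored by querying the USER_CREDENTIALS view, as in this minimal sketch (the password is never exposed):

    SELECT credential_name, username, enabled
    FROM user_credentials
    WHERE credential_name = 'DEF_CRED_NAME';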

  2. Create an external partitioned table on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE.

    The procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE supports external partitioned files in the supported cloud object storage services. The credential is a table-level property; therefore, the external files must all be in the same cloud object store.

    For example:

    BEGIN
      DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
        table_name => 'MYSALES',
        credential_name => 'DEF_CRED_NAME',
        file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales/*.parquet',
        format =>
            json_object('type' value 'parquet', 'schema' value 'first',
                        'partition_columns' value
                              json_array(
                                    json_object('name' value 'country', 'type' value 'varchar2(100)'),
                                    json_object('name' value 'year', 'type' value 'number'),
                                    json_object('name' value 'month', 'type' value 'varchar2(2)')
                              )
             )
        );
    END;
    /

    For structured data files such as Parquet, DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE does not require the column_list or field_list parameters. The column names and data types are derived from the first Parquet file that the procedure scans (therefore all files must have the same shape). The generated column list also includes the columns derived from the object name, and these columns have the data types specified with the partition_columns format parameter.
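
    As an optional sanity check after the table is created, you can list the derived columns from the data dictionary; for example:

    SELECT column_name, data_type
    FROM user_tab_columns
    WHERE table_name = 'MYSALES'
    ORDER BY column_id;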

    The parameters are:

    • table_name: is the external table name.

    • credential_name: is the name of the credential created in the previous step.

    • file_uri_list: is a comma-delimited list of source file URIs. There are two options for this list:

      • Specify a comma-delimited list of individual file URIs without wildcarding.

      • Specify a single file URI with wildcards, where the wildcards can only be after the last slash "/". The character "*" can be used as the wildcard for multiple characters, and the character "?" can be used as the wildcard for a single character. Both forms are illustrated in the sketch after this parameter list.

    • column_list: is a comma-delimited list of column names and data types for the external table. The list includes the columns that are inside the file as well as those derived from the object name.

      The column_list is not required when the data files are structured files (Parquet, Avro, or ORC).

    • field_list: Identifies the fields in the source files and their data types. The default value is NULL meaning the fields and their data types are determined by the column_list parameter.

      The field_list is not required when the data files are structured files (Parquet, Avro, or ORC).

    • format: defines the options you can specify to describe the format of the source file. The partition_columns format parameter specifies the names of the partition columns. See DBMS_CLOUD Package Format Options for more information.

      If the data in your source file is encrypted, decrypt the data by specifying the encryption format option. See Decrypt Data While Importing from Object Storage for more information on decrypting data.
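
    As an illustration of the two file_uri_list forms described above, either of the following values would cover the sample objects; namespace-string and bucketname remain placeholders for your own Object Storage namespace and bucket:

    -- Comma-delimited list of individual file URIs, no wildcards:
    file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales/USA/2020/01/sales1.parquet,https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales/USA/2020/02/sales2.parquet'

    -- Single URI with the "*" wildcard, as in the example above:
    file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales/*.parquet'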

    In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

    See CREATE_EXTERNAL_PART_TABLE Procedure for detailed information about the parameters.

    See DBMS_CLOUD URI Formats for more information on the supported cloud object storage services.

    If there are any rows in the source files that do not match the format options you specified, the query reports an error. You can use DBMS_CLOUD format options, such as rejectlimit, to suppress these errors. As an alternative, you can validate the external partitioned table you created to see the error messages and the rejected rows, and then change your format options accordingly. See Validate External Data and Validate External Partitioned Data for more information.
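
    For example, the following minimal sketch validates the table created above; DBMS_CLOUD.VALIDATE_EXTERNAL_PART_TABLE scans the source files and records the rows that do not match the format options so that you can review them before adjusting your settings:

    BEGIN
      DBMS_CLOUD.VALIDATE_EXTERNAL_PART_TABLE(
        table_name => 'MYSALES');
    END;
    /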

  3. You can now run queries on the external partitioned table you created in the previous step.

    Your Autonomous Database takes advantage of the partitioning information of your external partitioned table, ensuring that a query only accesses the relevant data files in the Object Store. For example, the following query reads data files from only one partition:

    SELECT year, month, product, units
    FROM mysales WHERE year = 2020 AND month = '02' AND country = 'USA';

    The external partitioned tables you create with DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE include two invisible columns file$path and file$name. These columns help identify which file a record is coming from. See External Table Metadata Columns for more information.
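
    For example, the following sketch shows which source object each row comes from; because the metadata columns are invisible, they must be selected explicitly:

    SELECT file$path, file$name, country, year, month
    FROM mysales
    WHERE country = 'USA' AND year = 2020
    FETCH FIRST 5 ROWS ONLY;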