DBMS_CLOUD Avro, ORC, and Parquet Support

This section covers the DBMS_CLOUD Avro, ORC, and Parquet support provided with Autonomous Database.

DBMS_CLOUD Package Format Options for Avro, ORC, or Parquet

The format argument in DBMS_CLOUD specifies the format of source files.

The two ways to specify the format argument are:

format => '{"format_option" : “format_value” }'  

And:

format => json_object('format_option' value 'format_value'))

Examples:

format => json_object('type' VALUE 'CSV')

To specify multiple format options, separate the values with a ",".

For example:

format => json_object('ignoremissingcolumns' value 'true', 'removequotes' value 'true', 'dateformat' value 'YYYY-MM-DD-HH24-MI-SS', 'blankasnull' value 'true')
Format Option Description Syntax

regexuri

When the value of regexuri is set to TRUE, you can use wildcards as well as regular expressions in the file names in Cloud source file URIs.

The characters "*" and "?" are considered wildcard characters when the regexuri parameter is set to FALSE. When the regexuri parameter is set to TRUE the characters "*" and "?" are part of the specified regular expression pattern.

Regular expression patterns are only supported for the file name or subfolder path in your URIs and the pattern matching is identical to that performed by the REGEXP_LIKE function. Regular expression patterns are not supported for directory names.

For external tables, this option is only supported with the tables that are created on a file in the Object Storage.

For example:

format => JSON_OBJECT('regexuri' value TRUE)

See REGEXP_LIKE Condition for more information on REGEXP_LIKE condition.

regexuri: True

Default value : False

type

Specifies the file type.

type : avro | orc | parquet

schema

When schema is set to first or all, the external table columns and data types are automatically derived from the Avro, ORC, or Parquet file metadata.

The column names will match those found in Avro, ORC, or Parquet. The data types are converted from Avro, ORC, or Parquet data types to Oracle data types. All columns are added to the table.

The value first specifies to use the metadata from the first Avro, ORC, or Parquet file in the file_uri_list to auto generate the columns and their data types. Use first if all of the files have the same schema.

The value all specifies to use the metadata from all Avro, ORC, or Parquet files in the file_uri_list to auto generate the columns and their data types. Use all (slower) if the files may have different schemas.

Default: If column_list is specified, then the schema value, if specified is ignored. If column_list is not specified then the schema default value is first.

Note: For Avro, ORC, or Parquet format files the schema format option is not available and the column_list parameter must be specified for partitioned external tables using the DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE procedure.

schema : first | all

DBMS_CLOUD Package Avro to Oracle Data Type Mapping

Describes the mapping of Avro data types to Oracle data types.

Note:

Complex types, such as maps, arrays, and structs are supported starting with Oracle Database 19c. See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Avro complex types.
Avro Type Oracle Type
INT NUMBER(10)
LONG NUMBER(19)
BOOL NUMBER(1)
UTF8 BYTE_ARRAY RAW(2000)
FLT BINARY_FLOAT
DBL BINARY_DOUBLE
DECIMAL(p) NUMBER(p)
DECIMAL(p,s) NUMBER(p,s)
DATE DATE
STRING VARCHAR2
TIME_MILLIS VARCHAR2(20 BYTE)
TIME_MICROS VARCHAR2(20 BYTE)
TIMESTAMP_MILLIS TIMESTAMP(3)
TIMESTAMP_MICROS TIMESTAMP(6)
ENUM VARCHAR2(n) Where: "n" is the actual maximum length of the AVRO ENUM's possible values
DURATION RAW(2000)
FIXED RAW(2000)
NULL VARCHAR2(1) BYTE

See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Avro complex types.

DBMS_CLOUD Package ORC to Oracle Data Type Mapping

Describes the mapping of ORC data types to Oracle data types.

See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using ORC complex types.

ORC Type Oracle Type More Information
array VARCHAR2(n) JSON format DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
bigint (64 bit) NUMBER(19)  
binary BLOB  
boolean (1 bit) NUMBER(1)  
char CHAR(n)  
date DATE  
double BINARY_DOUBLE  
float BINARY_FLOAT  
int (32 bit) NUMBER(10)  
list VARCHAR2(n) JSON format DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
map VARCHAR2(n) JSON format DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
smallint (16 bit) NUMBER(5)  
string VARCHAR2(4000)  
struct VARCHAR2(n) JSON format DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
timestamp TIMESTAMP  
tinyint (8 bit) NUMBER(3)  
union VARCHAR2(n) JSON format DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
varchar VARCHAR2(n)  

DBMS_CLOUD Package Parquet to Oracle Data Type Mapping

Describes the mapping of Parquet data types to Oracle data types.

Note:

Complex types, such as maps, arrays, and structs are supported starting with Oracle Database 19c. See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Parquet complex types.
Parquet Type Oracle Type
UINT_64 NUMBER(20)
INT_64 NUMBER(19)
UINT_32 NUMBER(10)
INT_32 NUMBER(10)
UINT_16 NUMBER(5)
INT_16 NUMBER(5)
UINT_8 NUMBER(3)
INT_8 NUMBER(3)
BOOL NUMBER(1)
UTF8 BYTE_ARRAY VARCHAR2(4000 BYTE)
FLT BINARY_FLOAT
DBL BINARY_DOUBLE
DECIMAL(p) NUMBER(p)
DECIMAL(p,s) NUMBER(p,s)
DATE DATE
STRING VARCHAR2(4000)
TIME_MILLIS VARCHAR2(20 BYTE)
TIME_MILLIS_UTC VARCHAR2(20 BYTE)
TIME_MICROS VARCHAR2(20 BYTE)
TIME_MICROS_UTC VARCHAR2(20 BYTE)
TIMESTAMP_MILLIS TIMESTAMP(3)
TIMESTAMP_MILLIS_UTC TIMESTAMP(3)
TIMESTAMP_MICROS TIMESTAMP(6)
TIMESTAMP_MICROS_UTC TIMESTAMP(6)
TIMESTAMP_NANOS TIMESTAMP(9)

See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Parquet complex types.

DBMS_CLOUD Package Oracle Data Type to Parquet Mapping

Describes the mapping of Oracle data types to Parquet data types.

Oracle Type Parquet Type
BINARY_DOUBLE DBL
BINARY_FLOAT FLT
DATE DATE
NUMBER(p,s) DECIMAL(p,s)
NUMBER(p) DECIMAL(p)
TIMESTAMP(3) TIMESTAMP_MILLIS
TIMESTAMP(3) TIMESTAMP_MILLIS_UTC
TIMESTAMP(6) TIMESTAMP_MICROS
TIMESTAMP(6) TIMESTAMP_MICROS_UTC
TIMESTAMP(9) TIMESTAMP_NANOS
VARCHAR2(4000) STRING

NLS Session Parameters

The NLS session parameters NLS_DATE_FORMAT, NLS_TIMESTAMP_FORMAT, NLS_TIMESTAMP_TZ_FORMAT and NLS_NUMERIC_CHARACTERS define how the date, timestamp, timestamp with time zone format, and radix separator for timestamp with decimal marker should be shown when a table with those column types are queried.

In addition, when you export data using DBMS_CLOUD.EXPORT_DATA and specify Parquet output, Autonomous Database reads the values of these parameters from the NLS_SESSION_PARAMETERS table. Autonomous Database uses these values to convert the Oracle data types DATE or TIMESTAMP to Parquet types.

The NLS_SESSION_PARAMETERS parameters support an RR format mask (two character year specification).

The RR format mask for the year is not supported for these parameters when you export data to Parquet with DBMS_CLOUD.EXPORT_DATA. An application error is raised if you attempt to export to parquet and the NLS_SESSION_PARAMETERS are set to use the RR format mask (the default value for the RR format depends on the value of the NLS_TERRITORY parameter).

When one of the parameters NLS_DATE_FORMAT, NLS_TIMESTAMP_FORMAT or NLS_TIMESTAMP_TZ_FORMAT uses the RR format mask, you must change the format value to supported value to export data to Parquet with DBMS_CLOUD.EXPORT_DATA. For example:

ALTER SESSION SET NLS_DATE_FORMAT = "MM/DD/YYYY";
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH:MI:SS.FF';
ALTER SESSION SET  NLS_TIMESTAMP_TZ_FORMAT='YYYY-MM-DD HH:MI:SS.FF TZH:TZM';
After you change the value, you can verify the change by querying the NLS_SESSION_PARAMETERS view:
SELECT value FROM NLS_SESSION_PARAMETERS 
       WHERE parameter IN ('NLS_DATE_FORMAT','NLS_TIMESTAMP_FORMAT','NLS_TIMESTAMP_TZ_FORMAT');

If NLS_DATE_FORMAT is set, it applies to the columns with DATE datatype. If NLS_TIMESTAMP_FORMAT is set, it applies to the columns with TIMESTAMP datattype. If NLS_TIMESTAMP_TZ_FORMAT is set, it applies to the columns with TIMESTAMP WITH TIME ZONE datatype.

See Date and Time Parameters and NLS Data Dictionary Views for more information.

DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types

Describes the mapping of Avro, ORC, and Parquet complex data types to Oracle data types.

Autonomous Database supports complex data types, including the following complex types:

  • struct

  • list

  • map

  • union

  • array

When you specify a source file type of Avro, ORC, or Parquet and the source file includes complex columns, Autonomous Database queries return JSON for the complex columns. This simplifies processing of query results; you can use Oracle's powerful JSON parsing features consistently across the file types and data types. The following table shows the format for the complex types in Autonomous Database:

Note:

The complex fields map to VARCHAR2 columns and VARCHAR2 size limits apply.
Type Parquet ORC Avro Oracle
List: sequence of values List List Array VARCHAR2 (JSON format)
Map: list of objects with single key Map Map Map VARCHAR2 (JSON format)
Union: values of different type Not Available Union Union VARCHAR2 (JSON format)
Object: zero or more key-value pairs Struct Struct Record VARCHAR2 (JSON format)

If your ORC, Parquet, or Avro source files contain complex types, then you can query the JSON output for these common complex types. For example, the following shows an ORC file, movie-info.orc, with a complex type (the same complex type handling applies for Parquet and Avro source files).

Consider the movie-info.orc file with the following schema:

id    int
original_title string
overview       string
poster_path    string
release_date   string
vote_count     int
runtime        int
popularity     double
genres         array<struct<id:int,name:string>

Notice that each movie is categorized by multiple genres using an array of genres. The genres array is an array of structs and each item has an id (int) and a name (string). The genres array is considered a complex type. You can create a table over this ORC file using DBMS_CLOUD.CREATE_EXTERNAL_TABLE as follows:

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name =>'movie_info',
        credential_name =>'OBJ_STORE_CRED',
        file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/movies/o/movie-info.orc',
        format => '{"type":"orc", "schema": "first"}');
END;
/

When you create the external table the database automatically generates the columns based on the schema in the ORC file (if you are using Avro or Parquet, the same applies). For this example, the DBMS_CLOUD.CREATE_EXTERNAL_TABLE creates a table in your database as follows:

CREATE TABLE "ADMIN"."MOVIE_INFO" 
    ( "ID"
      NUMBER(10,0), 
      "ORIGINAL_TITLE"  VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
      "OVERVIEW"        VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
      "POSTER_PATH"     VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
      "RELEASE_DATE"    VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
      "VOTE_COUNT"      NUMBER(10,0), 
      "RUNTIME"         NUMBER(10,0), 
      "POPULARITY"      BINARY_DOUBLE, 
      "GENRES"          VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP"
    )  DEFAULT COLLATION "USING_NLS_COMP"
    ORGANIZATION EXTERNAL 
     ( TYPE      ORACLE_BIGDATA
       DEFAULT DIRECTORY "DATA_PUMP_DIR"
       ACCESS PARAMETERS
       ( com.oracle.bigdata.credential.name=OBJ_STORE_CRED
         com.oracle.bigdata.fileformat=ORC
   )
       LOCATION
        (
      'https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/movies/o/movie-info.orc'
           )
     )
   REJECT LIMIT UNLIMITED 
   PARALLEL;
 )

Now you can query the movie data:

SELECT original_title, release_date, genres 
     FROM movie_info 
     WHERE release_date > '2000'
     ORDER BY original_title;

This produces the following output:


original_title              release_date   genres
(500) Days of Summer        2009           [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC                   2008           [{"id":6,"name":"Comedy"}]
11:14                       2003           [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours                   2010           [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30              2004           [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
1408                        2007           [{"id":45,"name":"Sci-Fi"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":6,"name":"Comedy"},{"id":18,"name":"War"}]

Notice that the complex type genres is returned as a JSON array.

To make the JSON data more useful, you can transform the column using Oracle's JSON functions. For example, you can use the JSON "." notation as well as the more powerful transform functions such as JSON_TABLE.

See Simple Dot-Notation Access to JSON Data for information on "." notation.

See SQL/JSON Function JSON_TABLE for information on JSON_TABLE.

The following example shows a query on the table that takes each value of the array and turns the value into a row in the result set:
SELECT original_title, release_date, m.genre_name, genres
    FROM movie_info mi,
       JSON_TABLE(mi.genres, '$.name[*]'
        COLUMNS (genre_name VARCHAR2(25) PATH
      '$')                 
                 ) AS m
 WHERE rownum < 10;

The JSON_TABLE creates a row for each value of the array, think outer join, and the struct is parsed to extract the name of the genre. This produces the following output:


original_title                   release_date         genre_name        genres
(500) Days of Summer             2009                 Drama             [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer             2009                 Comedy            [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer             2009                 Horror            [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer             2009                 Western           [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer             2009                 War               [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer             2009                 Romance           [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
                                                                        {"id":17,"name":"Horror"},{"id":19,"name":"Western"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC                        2008                 Comedy            [{"id":6,"name":"Comedy"}]
11:14                            2003                 Family            [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
11:14                            2003                 Thriller          [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours                        2010                 Comedy            [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
127 Hours                        2010                 Drama             [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30                   2004                 Romance           [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30                   2004                 Comedy            [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30                   2004                 War               [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30                   2004                 Drama             [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
                                                                        {"id":18,"name":"War"},{"id":15,"name":"Romance"}]

DBMS_CLOUD Package Avro, ORC, and Parquet to Oracle Column Name Mapping

Describes rules for how Avro, ORC, and Parquet column names are converted to Oracle column names.

The following are supported for Avro, ORC, and Parquet column names, but may require use of double quotes for Oracle SQL references in external tables. Thus, for ease of use and to avoid having to use double quotes when referencing column names, if possible do not use the following in Avro, ORC, and Parquet column names:

  • Embedded blanks

  • Leading numbers

  • Leading underscores

  • Oracle SQL reserved words

The following table shows various types of Avro, ORC, and Parquet column names, and rules for using the column names in Oracle column names in external tables.

Avro, ORC, or Parquet Name CREATE TABLE Name Oracle CATALOG Valid SQL Notes
part, Part, or PART part, Part, PART PART

select part

select Part

select paRt

select PART

Oracle implicitly uppercases unquoted column names
Ord No "Ord No" Ord No select "Ord No" Double quotes are required when there are embedded blanks, which also preserves the character case
__index_key__ "__index_key__" __index_key__ select "__index_key__" Double quotes are required when there is a leading underscore, which also preserves the character case
6Way "6Way" 6Way select "6Way" Double quotes are required when there is a leading numeric digit, which also preserves the character case
create, Create, or CREATE, and so on. (any case variation) partition, Partition, PARTITION, and so on (for an Oracle Reserved word) "CREATE" "PARTITION" CREATE PARTITION

select "CREATE"

select "PARTITION"

Double quotes are required around Oracle SQL Reserved words. These are forced to uppercase, but must always be double-quoted when used anywhere in SQL
rowid, Rowid, ROWid, and so on (for ROWID see notes) rowid  

select "rowid"

select "Rowid"

select "ROWid"

select "rowid"

For ROWID, any mixed or lower-case variation of ROWID preserves the case and must always be double-quoted and use the original case variations. Due to the inherent conflict with Oracle ROWID for the table, if you specify upper-case ROWID, it is automatically stored as lower-case "rowid" and must always be double-quoted when referenced.

Notes:

  • In general a column name in an external table can be referenced without double quotes.

  • Unless there is an embedded blank, a leading underscore ("_") or leading numeric digit ("0" through "9") in the column name, the original case of the column name is preserved, and it must always be referenced with double quotes and using the original case (upper, lower or mixed-case) of the Avro, ORC, or Parquet column name.

  • After using DBMS_CLOUD.CREATE_EXTERNAL_TABLE to create an external table with the format specified as avro, orc, or parquet, use the DESCRIBE command in SQL*Plus to view the table's column names.

  • When Oracle SQL Reserved Words are used in Avro, ORC, or Parquet column names, they must always be double-quoted when referenced anywhere in SQL. See Oracle SQL Reserved Words for more information.