DBMS_CLOUD Avro, ORC, and Parquet Support
This section
covers the DBMS_CLOUD
Avro, ORC, and Parquet support provided with Autonomous Database.
DBMS_CLOUD Package Format Options for Avro, ORC, or Parquet
The format argument in DBMS_CLOUD
specifies the format of source files.
The two ways to specify the format argument are:
format => '{"format_option" : “format_value” }'
And:
format => json_object('format_option' value 'format_value'))
Examples:
format => json_object('type' VALUE 'CSV')
To specify multiple format options, separate the values with a ",
".
For example:
format => json_object('ignoremissingcolumns' value 'true', 'removequotes' value 'true', 'dateformat' value 'YYYY-MM-DD-HH24-MI-SS', 'blankasnull' value 'true')
Format Option | Description | Syntax |
---|---|---|
|
When the value of The characters "*" and "?" are considered wildcard characters when the Regular expression patterns are only supported for the file name or subfolder path in your URIs and the pattern matching is identical to that performed by the For external tables, this option is only supported with the tables that are created on a file in the Object Storage. For example:
See REGEXP_LIKE Condition for more information on |
Default value : |
type |
Specifies the file type. |
|
|
When schema is set to The column names will match those found in Avro, ORC, or Parquet. The data types are converted from Avro, ORC, or Parquet data types to Oracle data types. All columns are added to the table. The value The value Default: If Note: For Avro, ORC, or Parquet format files the
|
|
DBMS_CLOUD Package Avro to Oracle Data Type Mapping
Describes the mapping of Avro data types to Oracle data types.
Note:
Complex types, such as maps, arrays, and structs are supported starting with Oracle Database 19c. See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Avro complex types.Avro Type | Oracle Type |
---|---|
INT | NUMBER(10) |
LONG | NUMBER(19) |
BOOL | NUMBER(1) |
UTF8 BYTE_ARRAY | RAW(2000) |
FLT | BINARY_FLOAT |
DBL | BINARY_DOUBLE |
DECIMAL(p) | NUMBER(p) |
DECIMAL(p,s) | NUMBER(p,s) |
DATE | DATE |
STRING | VARCHAR2 |
TIME_MILLIS | VARCHAR2(20 BYTE) |
TIME_MICROS | VARCHAR2(20 BYTE) |
TIMESTAMP_MILLIS | TIMESTAMP(3) |
TIMESTAMP_MICROS | TIMESTAMP(6) |
ENUM | VARCHAR2(n) Where: "n" is the actual maximum length of the AVRO ENUM's possible values |
DURATION | RAW(2000) |
FIXED | RAW(2000) |
NULL | VARCHAR2(1) BYTE |
See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Avro complex types.
DBMS_CLOUD Package ORC to Oracle Data Type Mapping
Describes the mapping of ORC data types to Oracle data types.
See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using ORC complex types.
ORC Type | Oracle Type | More Information |
---|---|---|
array | VARCHAR2(n) JSON format | DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types |
bigint (64 bit) | NUMBER(19) | |
binary | BLOB | |
boolean (1 bit) | NUMBER(1) | |
char | CHAR(n) | |
date | DATE | |
double | BINARY_DOUBLE | |
float | BINARY_FLOAT | |
int (32 bit) | NUMBER(10) | |
list | VARCHAR2(n) JSON format | DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types |
map | VARCHAR2(n) JSON format | DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types |
smallint (16 bit) | NUMBER(5) | |
string | VARCHAR2(4000) | |
struct | VARCHAR2(n) JSON format | DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types |
timestamp | TIMESTAMP | |
tinyint (8 bit) | NUMBER(3) | |
union | VARCHAR2(n) JSON format | DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types |
varchar | VARCHAR2(n) |
DBMS_CLOUD Package Parquet to Oracle Data Type Mapping
Describes the mapping of Parquet data types to Oracle data types.
Note:
Complex types, such as maps, arrays, and structs are supported starting with Oracle Database 19c. See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Parquet complex types.Parquet Type | Oracle Type |
---|---|
UINT_64 | NUMBER(20) |
INT_64 | NUMBER(19) |
UINT_32 | NUMBER(10) |
INT_32 | NUMBER(10) |
UINT_16 | NUMBER(5) |
INT_16 | NUMBER(5) |
UINT_8 | NUMBER(3) |
INT_8 | NUMBER(3) |
BOOL | NUMBER(1) |
UTF8 BYTE_ARRAY | VARCHAR2(4000 BYTE) |
FLT | BINARY_FLOAT |
DBL | BINARY_DOUBLE |
DECIMAL(p) | NUMBER(p) |
DECIMAL(p,s) | NUMBER(p,s) |
DATE | DATE |
STRING | VARCHAR2(4000) |
TIME_MILLIS | VARCHAR2(20 BYTE) |
TIME_MILLIS_UTC | VARCHAR2(20 BYTE) |
TIME_MICROS | VARCHAR2(20 BYTE) |
TIME_MICROS_UTC | VARCHAR2(20 BYTE) |
TIMESTAMP_MILLIS | TIMESTAMP(3) |
TIMESTAMP_MILLIS_UTC | TIMESTAMP(3) |
TIMESTAMP_MICROS | TIMESTAMP(6) |
TIMESTAMP_MICROS_UTC | TIMESTAMP(6) |
TIMESTAMP_NANOS | TIMESTAMP(9) |
See DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types for information on using Parquet complex types.
DBMS_CLOUD Package Oracle Data Type to Parquet Mapping
Describes the mapping of Oracle data types to Parquet data types.
Oracle Type | Parquet Type |
---|---|
BINARY_DOUBLE | DBL |
BINARY_FLOAT | FLT |
DATE | DATE |
NUMBER(p,s) | DECIMAL(p,s) |
NUMBER(p) | DECIMAL(p) |
TIMESTAMP(3) | TIMESTAMP_MILLIS |
TIMESTAMP(3) | TIMESTAMP_MILLIS_UTC |
TIMESTAMP(6) | TIMESTAMP_MICROS |
TIMESTAMP(6) | TIMESTAMP_MICROS_UTC |
TIMESTAMP(9) | TIMESTAMP_NANOS |
VARCHAR2(4000) | STRING |
NLS Session Parameters
The NLS session parameters NLS_DATE_FORMAT
,
NLS_TIMESTAMP_FORMAT
, NLS_TIMESTAMP_TZ_FORMAT
and NLS_NUMERIC_CHARACTERS
define how the date, timestamp,
timestamp with time zone format, and radix separator for timestamp with decimal
marker should be shown when a table with those column types are queried.
In addition, when you export data using DBMS_CLOUD.EXPORT_DATA
and specify Parquet output,
Autonomous Database reads the values
of these parameters from the NLS_SESSION_PARAMETERS
table. Autonomous Database uses these values to
convert the Oracle data types DATE
or TIMESTAMP
to
Parquet types.
The NLS_SESSION_PARAMETERS
parameters support an
RR
format mask (two character year specification).
The RR
format mask for the year is not supported for
these parameters when you export data to Parquet with DBMS_CLOUD.EXPORT_DATA
. An application error is
raised if you attempt to export to parquet and the
NLS_SESSION_PARAMETERS
are set to use the RR
format mask (the default value for the RR
format depends on the
value of the NLS_TERRITORY
parameter).
When one of the parameters NLS_DATE_FORMAT
,
NLS_TIMESTAMP_FORMAT
or
NLS_TIMESTAMP_TZ_FORMAT
uses the RR
format
mask, you must change the format value to supported value to export data to Parquet
with DBMS_CLOUD.EXPORT_DATA
. For
example:
ALTER SESSION SET NLS_DATE_FORMAT = "MM/DD/YYYY";
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH:MI:SS.FF';
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT='YYYY-MM-DD HH:MI:SS.FF TZH:TZM';
NLS_SESSION_PARAMETERS
view:SELECT value FROM NLS_SESSION_PARAMETERS
WHERE parameter IN ('NLS_DATE_FORMAT','NLS_TIMESTAMP_FORMAT','NLS_TIMESTAMP_TZ_FORMAT');
If NLS_DATE_FORMAT
is set, it applies to the columns with
DATE
datatype. If NLS_TIMESTAMP_FORMAT
is set,
it applies to the columns with TIMESTAMP
datattype. If
NLS_TIMESTAMP_TZ_FORMAT
is set, it applies to the columns with
TIMESTAMP WITH TIME ZONE
datatype.
See Date and Time Parameters and NLS Data Dictionary Views for more information.
DBMS_CLOUD Package Avro, ORC, and Parquet Complex Types
Describes the mapping of Avro, ORC, and Parquet complex data types to Oracle data types.
Autonomous Database supports complex data types, including the following complex types:
-
struct
-
list
-
map
-
union
-
array
When you specify a source file type of Avro, ORC, or Parquet and the source file includes complex columns, Autonomous Database queries return JSON for the complex columns. This simplifies processing of query results; you can use Oracle's powerful JSON parsing features consistently across the file types and data types. The following table shows the format for the complex types in Autonomous Database:
Note:
The complex fields map toVARCHAR2
columns and
VARCHAR2
size limits apply.
Type | Parquet | ORC | Avro | Oracle |
---|---|---|---|---|
List: sequence of values | List | List | Array | VARCHAR2 (JSON format)
|
Map: list of objects with single key | Map | Map | Map | VARCHAR2 (JSON format)
|
Union: values of different type | Not Available | Union | Union | VARCHAR2 (JSON format)
|
Object: zero or more key-value pairs | Struct | Struct | Record | VARCHAR2 (JSON format)
|
If your ORC, Parquet, or Avro source files contain complex types, then
you can query the JSON output for these common complex types. For example, the
following shows an ORC file, movie-info.orc
, with a complex
type (the same complex type handling applies for Parquet and Avro source files).
Consider the movie-info.orc
file with the following
schema:
id int
original_title string
overview string
poster_path string
release_date string
vote_count int
runtime int
popularity double
genres array<struct<id:int,name:string>
Notice that each movie is categorized by multiple genres
using an array of genres
. The genres
array is an
array
of structs
and each item has an
id
(int
) and a name
(string
). The genres
array is considered a
complex type. You can create a table over this ORC file using DBMS_CLOUD.CREATE_EXTERNAL_TABLE
as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE
(
table_name =>'movie_info',
credential_name =>'OBJ_STORE_CRED',
file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/movies/o/movie-info.orc',
format => '{"type":"orc", "schema": "first"}');
END;
/
When you create the external table the database automatically generates
the columns based on the schema in the ORC file (if you are using Avro or Parquet,
the same applies). For this example, the DBMS_CLOUD.CREATE_EXTERNAL_TABLE
creates a table in
your database as follows:
CREATE TABLE "ADMIN"."MOVIE_INFO"
( "ID"
NUMBER(10,0),
"ORIGINAL_TITLE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"OVERVIEW" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"POSTER_PATH" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"RELEASE_DATE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"VOTE_COUNT" NUMBER(10,0),
"RUNTIME" NUMBER(10,0),
"POPULARITY" BINARY_DOUBLE,
"GENRES" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP"
) DEFAULT COLLATION "USING_NLS_COMP"
ORGANIZATION EXTERNAL
( TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY "DATA_PUMP_DIR"
ACCESS PARAMETERS
( com.oracle.bigdata.credential.name=OBJ_STORE_CRED
com.oracle.bigdata.fileformat=ORC
)
LOCATION
(
'https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/movies/o/movie-info.orc'
)
)
REJECT LIMIT UNLIMITED
PARALLEL;
)
Now you can query the movie data:
SELECT original_title, release_date, genres
FROM movie_info
WHERE release_date > '2000'
ORDER BY original_title;
This produces the following output:
original_title release_date genres
(500) Days of Summer 2009 [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC 2008 [{"id":6,"name":"Comedy"}]
11:14 2003 [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours 2010 [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30 2004 [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
1408 2007 [{"id":45,"name":"Sci-Fi"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":6,"name":"Comedy"},{"id":18,"name":"War"}]
Notice that the complex type genres
is returned as a
JSON array.
To make the JSON data more useful, you can transform the column using
Oracle's JSON functions. For example, you can use the JSON "." notation as well as
the more powerful transform functions such as JSON_TABLE
.
See Simple Dot-Notation Access to JSON Data for information on "." notation.
See SQL/JSON Function JSON_TABLE
for information on JSON_TABLE
.
SELECT original_title, release_date, m.genre_name, genres
FROM movie_info mi,
JSON_TABLE(mi.genres, '$.name[*]'
COLUMNS (genre_name VARCHAR2(25) PATH
'$')
) AS m
WHERE rownum < 10;
The JSON_TABLE
creates a row for each value of the
array, think outer join, and the struct is parsed to extract the name of the genre.
This produces the following output:
original_title release_date genre_name genres
(500) Days of Summer 2009 Drama [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer 2009 Comedy [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer 2009 Horror [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer 2009 Western [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer 2009 War [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer 2009 Romance [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},
{"id":17,"name":"Horror"},{"id":19,"name":"Western"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC 2008 Comedy [{"id":6,"name":"Comedy"}]
11:14 2003 Family [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
11:14 2003 Thriller [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours 2010 Comedy [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
127 Hours 2010 Drama [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30 2004 Romance [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30 2004 Comedy [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30 2004 War [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30 2004 Drama [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},
{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
DBMS_CLOUD Package Avro, ORC, and Parquet to Oracle Column Name Mapping
Describes rules for how Avro, ORC, and Parquet column names are converted to Oracle column names.
The following are supported for Avro, ORC, and Parquet column names, but may require use of double quotes for Oracle SQL references in external tables. Thus, for ease of use and to avoid having to use double quotes when referencing column names, if possible do not use the following in Avro, ORC, and Parquet column names:
-
Embedded blanks
-
Leading numbers
-
Leading underscores
-
Oracle SQL reserved words
The following table shows various types of Avro, ORC, and Parquet column names, and rules for using the column names in Oracle column names in external tables.
Avro, ORC, or Parquet Name | CREATE TABLE Name | Oracle CATALOG | Valid SQL | Notes |
---|---|---|---|---|
part, Part, or PART | part, Part, PART | PART |
|
Oracle implicitly uppercases unquoted column names |
Ord No | "Ord No" | Ord No | select "Ord
No" |
Double quotes are required when there are embedded blanks, which also preserves the character case |
__index_key__ | "__index_key__" | __index_key__ | select
"__index_key__" |
Double quotes are required when there is a leading underscore, which also preserves the character case |
6Way | "6Way" | 6Way | select
"6Way" |
Double quotes are required when there is a leading numeric digit, which also preserves the character case |
create, Create, or CREATE, and so on. (any case variation) partition, Partition, PARTITION, and so on (for an Oracle Reserved word) | "CREATE" "PARTITION" | CREATE PARTITION |
|
Double quotes are required around Oracle SQL Reserved words. These are forced to uppercase, but must always be double-quoted when used anywhere in SQL |
rowid, Rowid, ROWid, and so on (for ROWID see notes) | rowid |
|
For ROWID, any mixed or lower-case variation of ROWID preserves the case and must always be double-quoted and use the original case variations. Due to the inherent conflict with Oracle ROWID for the table, if you specify upper-case ROWID, it is automatically stored as lower-case "rowid" and must always be double-quoted when referenced. |
Notes:
-
In general a column name in an external table can be referenced without double quotes.
-
Unless there is an embedded blank, a leading underscore ("_") or leading numeric digit ("0" through "9") in the column name, the original case of the column name is preserved, and it must always be referenced with double quotes and using the original case (upper, lower or mixed-case) of the Avro, ORC, or Parquet column name.
-
After using
DBMS_CLOUD.CREATE_EXTERNAL_TABLE
to create an external table with the format specified asavro
,orc
, orparquet
, use theDESCRIBE
command in SQL*Plus to view the table's column names. -
When Oracle SQL Reserved Words are used in Avro, ORC, or Parquet column names, they must always be double-quoted when referenced anywhere in SQL. See Oracle SQL Reserved Words for more information.
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Access to Oracle Support
Oracle customers that have purchased support have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.