6 Oracle Big Data SQL Reference

Find reference information for Oracle Big Data SQL here:

6.1 CREATE TABLE ACCESS PARAMETERS Clause

This section describes the properties that you use when creating an external table that uses the ORACLE_HDFS, ORACLE_HIVE, or ORACLE_BIGDATA access drivers. In a CREATE TABLE ORGANIZATION EXTERNAL statement, specify the parameters in the opaque_format_spec clause of ACCESS PARAMETERS.

This section contains the following topics:

6.1.1 Syntax Rules for Specifying Properties

The properties are set using keyword-value pairs in the SQL CREATE TABLE ACCESS PARAMETERS clause and in the configuration files.

The syntax must obey these rules:

  • The format of each keyword-value pair is a keyword, a colon or equal sign, and a value. The following are valid keyword-value pairs:

    keyword=value
    keyword:value

    The value is everything from the first non-whitespace character after the separator to the end of the line. Whitespace between the separator and the value is ignored. Trailing whitespace for the value is retained.

  • A property definition can be on one line or multiple lines.

  • A line terminator is a line feed, a carriage return, or a carriage return followed by a line feed.

  • When a property definition spans multiple lines, then precede the line terminators with a backslash (escape character), except on the last line. In this example, the value of the Keyword1 property is Value part 1 Value part 2 Value part 3.

    Keyword1=  Value part 1 \
               Value part 2 \
               Value part 3
  • The parser forms a logical line by stripping each physical line of leading whitespace and concatenating the lines. It then extracts the property names and values from the logical line.

  • You can embed special characters in a property name or property value by preceding a character with a backslash (escape character), indicating the substitution. Table 6-1 describes the special characters.
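
    For example, a hypothetical property whose value contains a tab and a literal backslash could be written with escape sequences as follows:

    Keyword2=PartA\tPartB\\PartC

    Here the parser stores the value as PartA, a tab character, PartB, a single backslash, and PartC.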

Table 6-1 Special Characters in Properties

Escape Sequence   Character

\b                Backspace (\u0008)

\t                Horizontal tab (\u0009)

\n                Line feed (\u000a)

\f                Form feed (\u000c)

\r                Carriage return (\u000d)

\"                Double quote (\u0022)

\'                Single quote (\u0027)

\\                Backslash (\u005c)
                  When multiple backslashes are at the end of the line, the parser
                  continues the value to the next line only for an odd number of
                  backslashes.

\uxxxx            2-byte, big-endian Unicode code point.
                  When a character requires two code points (4 bytes), the parser
                  expects \uxxxx for the second code point.

6.1.2 ORACLE_HDFS Access Parameters

The access parameters for the ORACLE_HDFS access driver provide the metadata needed to locate the data in HDFS and generate a Hive table over it.

6.1.2.1 Default Parameter Settings for ORACLE_HDFS

Describes default parameter settings for ORACLE_HDFS.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HDFS uses the following default values:

com.oracle.bigdata.rowformat=DELIMITED
com.oracle.bigdata.fileformat=TEXTFILE
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}
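
A minimal ORACLE_HDFS table can therefore rely entirely on these defaults. The following sketch (the table name, directory, and HDFS path are hypothetical) creates an external table over delimited text files, assuming the fields in the data files match the table columns in order:

CREATE TABLE sales_hdfs
  (sale_id NUMBER,
   amount  NUMBER)
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   LOCATION ('hdfs:/user/sales/*'))
  REJECT LIMIT UNLIMITED;
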
6.1.2.2 Optional Parameter Settings for ORACLE_HDFS

ORACLE_HDFS supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause. See "Full List of Access Parameters for ORACLE_HDFS and ORACLE_HIVE" for descriptions of each parameter.

Example 6-1 shows a CREATE TABLE statement in which multiple access parameters are set.

Example 6-1 Setting Multiple Access Parameters for ORACLE_HDFS

CREATE TABLE ORDER (CUST_NUM VARCHAR2(10), 
                    ORDER_NUM VARCHAR2(20), 
                    ORDER_DATE DATE,
                    ITEM_CNT NUMBER,
                    DESCRIPTION VARCHAR2(100),
                    ORDER_TOTAL NUMBER(8,2)) ORGANIZATION EXTERNAL 
             (TYPE ORACLE_HDFS
 ACCESS PARAMETERS (
  com.oracle.bigdata.fields: (CUST_NUM,              \
                              ORDER_NUM,             \
                              ORDER_DATE,            \
                              ORDER_LINE_ITEM_COUNT, \
                              DESCRIPTION,           \
                              ORDER_TOTAL)
  com.oracle.bigdata.colMap:     {"col":"item_cnt", \
                                  "field":"order_line_item_count"}
  com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                  "col":"DESCRIPTION"}
  com.oracle.bigdata.errorOpt:  [{"action":"replace", \
                                  "value":"INVALID NUM", \
                                  "col":["CUST_NUM","ORDER_NUM"]} , \
                                 {"action":"reject", \
                                  "col":"ORDER_TOTAL}]
 )
 LOCATION ("hdfs:/usr/cust/summary/*"));

6.1.3 ORACLE_HIVE Access Parameters

ORACLE_HIVE retrieves metadata about external data sources from the Hive catalog.

The default mapping of Hive data to columns in the external table is usually appropriate. However, some circumstances require special parameter settings, or you might want to override the default values for reasons of your own.

6.1.3.1 Default Parameter Settings for ORACLE_HIVE

Describes the default parameter settings for ORACLE_HIVE.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HIVE uses the following default values:

com.oracle.bigdata.tablename=name of external table
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}
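
Because com.oracle.bigdata.tablename defaults to the name of the external table, a minimal ORACLE_HIVE table needs no access parameters at all. A sketch (the table and column names are hypothetical), assuming a Hive table of the same name exists in the default Hive database:

CREATE TABLE order_summary
  (order_num   VARCHAR2(20),
   order_total NUMBER(8,2))
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR)
  REJECT LIMIT UNLIMITED;
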
6.1.3.2 Optional Parameter Values for ORACLE_HIVE

ORACLE_HIVE supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause. See "Full List of Access Parameters for ORACLE_HDFS and ORACLE_HIVE" for descriptions of each parameter.

Example 6-2 shows a CREATE TABLE statement in which multiple access parameters are set.

Example 6-2 Setting Multiple Access Parameters for ORACLE_HIVE

CREATE TABLE ORDER (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2)) ORGANIZATION EXTERNAL 
(TYPE oracle_hive
 ACCESS PARAMETERS (
  com.oracle.bigdata.tableName:  order_db.order_summary
  com.oracle.bigdata.colMap:     {"col":"ITEM_CNT", \
                                  "field":"order_line_item_count"}
  com.oracle.bigdata.overflow:   {"action":"ERROR", \
                                  "col":"DESCRIPTION"}
  com.oracle.bigdata.errorOpt:   [{"action":"replace", \
                                   "value":"INV_NUM" , \
                                   "col":["CUST_NUM","ORDER_NUM"]} ,\
                                  {"action":"reject", \
                                   "col":"ORDER_TOTAL}]
));

6.1.4 Full List of Access Parameters for ORACLE_HDFS and ORACLE_HIVE

6.1.4.1 com.oracle.bigdata.buffersize

Sets the buffer size in kilobytes for large record reads. Set this value if you need to read records that are greater than the default buffer size. 

Default Value

1000 KB

Syntax

com.oracle.bigdata.buffersize: n

Example

The following example sets the buffer size to 100 MB:

com.oracle.bigdata.buffersize: 100000
6.1.4.2 com.oracle.bigdata.datamode

Specifies the method that SmartScan uses to scan a Hadoop data source. The method can make a significant difference in performance.

Default Value

automatic

Syntax

One of these keywords:

com.oracle.bigdata.datamode= automatic | c | java

Semantics

automatic

Automatically selects the appropriate mode, based on the metadata. It selects c mode if possible, or java mode if the data contains formats that are not supported by c mode.

c

Uses Java to read the file buffers, but C code to process the data and convert it to Oracle format. Specify this mode for delimited data.

If the data contains formats that the C code does not support, then it returns an error.

java

Uses the Java SerDes and InputFormats to process the data and convert it to Oracle format. Specify this mode for Parquet, RCFile, and other data formats that require a SerDe.
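
For example, to force Java processing for data that requires a SerDe, you could set the following (a sketch; automatic is normally sufficient):

com.oracle.bigdata.datamode=java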

6.1.4.3 com.oracle.bigdata.colmap

Maps a column in the source data to a column in the Oracle external table. You can define one or multiple pairs of column mappings. Use this property when the source field names exceed the maximum length of Oracle column names, or when you want to use different column names in the external table.

Default Value

A column in the external table with the same name as the Hive column

Syntax

A JSON document with these keyword-value pairs:

colmap ::= colmap_entry | [colmap_entry[, colmap_entry...]]

colmap_entry ::= {"col":name, "field":name}

Semantics

"col":name

"col": The keyword must be lowercase and enclosed in quotation marks.

name: The name of a column in the Oracle external table. It is case sensitive and must be enclosed in quotation marks.

"field":name

"field": The keyword must be lowercase and enclosed in quotation marks.

name: The name of a field in the data source. It is not case sensitive, but it must be enclosed in quotation marks. See Syntax Rules for Specifying Properties.

Examples

This example maps a Hive column named ORDER_LINE_ITEM_COUNT to an Oracle column named ITEM_CNT:

com.oracle.bigdata.colMap={"col":"ITEM_CNT", \
                           "field":"order_line_item_count"}

The following example shows the mapping of multiple columns.

com.oracle.bigdata.colmap:[{"col":"KOL1", "field":"PROJECT_NAME"},{ "col":"KOL2","field":"wsdl_name"},{"col":"KOL3", "field":"method"}]
6.1.4.4 com.oracle.bigdata.erroropt

Describes how to handle errors that occur while the value of a column is calculated.

Default Value

{"action":"setnull"}

Syntax

A JSON document with these keyword-value pairs:

erroropt ::= error_element | [error_element[, error_element...]]

error_element ::= {"action":"reject" | "setnull" | "replace" [, "value":string] [, "col": name | [name[, name...]]]}

Semantics

The "action", "reject", "setnull", "replace", "value", and "col" keywords must be lowercase and enclosed in quotation marks. See Syntax Rules for Specifying Properties.

"action":value

value: One of these keywords:

  • "reject": Does not load any rows.

  • "setnull": Sets the column to NULL.

  • "replace": Sets the column to the specified value.

"value":string

string: Replaces a bad value in the external table. It must be enclosed in quotation marks.

"col":name

name: Identifies a column in an external table. The column name is case sensitive, must be enclosed in quotation marks, and can be listed only once.

Example

This example sets the value of the CUST_NUM or ORDER_NUM columns to INVALID if the Hive value causes an error. For any other column, an error causes the Hive value to be rejected.

com.oracle.bigdata.errorOpt: {"action":"replace",\
                              "value":"INVALID", \
                              "col":["CUST_NUM","ORDER_NUM"]}
6.1.4.5 com.oracle.bigdata.fields

Lists the field names and data types of the data source.

Default Value

Not defined

Syntax

A Hive field list with these elements:

fields ::= (field_name data_type [COMMENT col_comment] [, field_name data_type [COMMENT col_comment]...])

data_type ::= primitive_type
            | ARRAY<data_type>
            | MAP<primitive_type, data_type>
            | STRUCT<field_name : data_type [, field_name : data_type...]>
            | UNIONTYPE<data_type [, data_type...]>

primitive_type ::= INT | BIGINT | SMALLINT | TINYINT | BOOLEAN | FLOAT | DOUBLE | STRING

Semantics

The syntax is the same as a field list for a Hive table. If you split the field list across multiple lines, you must use a backslash to escape the new line characters.

field_name

The name of the Hive field. Use only alphanumeric characters and underscores (_). The maximum length is 128 characters. Field names are case-insensitive.

data_type

The data type of the Hive field. Optional; the default is STRING. The character set must be UTF8.

The data type can be complex or primitive:

Hive Complex Data Types

  • ARRAY: Indexable list

  • MAP: Key-value tuples

  • STRUCT: List of elements

  • UNIONTYPE: Multiple data types

Hive Primitive Data Types

  • INT: 4 byte integer

  • BIGINT: 8 byte integer

  • SMALLINT: 2 byte integer

  • TINYINT: 1 byte integer

  • BOOLEAN: TRUE or FALSE

  • FLOAT: single precision

  • DOUBLE: double precision

  • STRING: character sequence

See Also:

"Data Types" in the Apache Hive Language Manual at

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

COMMENT col_comment

A string literal enclosed in single quotation marks, which is stored as metadata for the Hive table (comment property of TBLPROPERTIES).
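
Putting the elements together, a complete field list with data types and a comment might look like this (an illustrative sketch; the field names and comment are hypothetical):

com.oracle.bigdata.fields: (CUST_NUM STRING COMMENT 'customer number', \
                            ITEM_CNT INT, \
                            ORDER_TOTAL DOUBLE)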

6.1.4.6 com.oracle.bigdata.fileformat

Describes the file format of the data source, based on the STORED AS clause for a Hive table generated by ORACLE_HDFS.

Default Value

TEXTFILE

Syntax

One of these keywords, or an INPUTFORMAT/OUTPUTFORMAT class pair:

fileformat ::= ORC | PARQUET | RCFILE | SEQUENCEFILE | TEXTFILE
             | INPUTFORMAT input_class OUTPUTFORMAT output_class

Semantics

ORC

Optimized row columnar file format

PARQUET

Column-oriented, binary file format

RCFILE

Record columnar file format

SEQUENCEFILE

Compressed file format

TEXTFILE

Plain text file format

INPUTFORMAT

Identifies a Java class that can extract records from the data file.

OUTPUTFORMAT

Identifies a Java class that can format the output records in the desired format.
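
For example, to declare that the source data is stored as sequence files (a sketch):

com.oracle.bigdata.fileformat=SEQUENCEFILE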

6.1.4.7 com.oracle.bigdata.log.exec

Specifies how the access driver generates log files for the C code of a query when it runs as parallel processes on CDH.

The access driver does not create or write log files when executing on a Hadoop cluster node; the parallel query processes write them. The log files from the Java code are controlled by log4j properties, which are specified in the configuration file or the access parameters. See "bigdata-log4j.properties".

Default Value

Not defined (no logging)

Syntax

[directory_object:]file_name_template

Semantics

directory_object

The Oracle directory object for the HDFS path on the Hadoop cluster where the log file is created.

file_name_template

A string used to generate file names. This table describes the optional variables that you can use in the template.

Table 6-2 Variables for com.oracle.bigdata.log.exec

Variable   Value

%p         Operating system process identifier (PID)

%a         A number that uniquely identifies the process

%%         A percent sign (%)

Example

The following example generates log file names that include the PID and a unique number, such as xtlogp_hive14_3413_57:

com.oracle.bigdata.log.exec= xtlogp_hive14_%p_%a
6.1.4.8 com.oracle.bigdata.log.qc

Specifies how the access driver generates log files for a query.

Default Value

Not defined (no logging)

Syntax

[directory_object:]file_name_template

Semantics

directory_object

Name of an Oracle directory object that points to the path where the log files are written. If this value is omitted, then the logs are written to the default directory for the external table.

file_name_template

A string used to generate file names. Table 6-3 describes the optional variables that you can use in the string.

Table 6-3 Variables for com.oracle.bigdata.log.qc

Variable   Value

%p         Operating system process identifier (PID)

%%         A percent sign (%)

Example

This example creates log file names that include the PID and a percent sign, such as xtlogp_hive213459_%:

com.oracle.bigdata.log.qc= xtlogp_hive21%p_%%
6.1.4.9 com.oracle.bigdata.overflow

Describes how to handle string data that is too long for the columns in the external table. The data source can be character or binary. For Hive, the data source can also be STRUCT, UNIONTYPE, MAP, or ARRAY.

Default Value

{"action":"error"}

Syntax

A JSON document with these keyword-value pairs:

overflow ::= overflow_element | [overflow_element[, overflow_element...]]

overflow_element ::= {"action":"truncate" | "error" [, "col": name | [name[, name...]]]}

Semantics

The "action", "truncate", "error", and "col" tags must be lowercase and enclosed in quotation marks. See Syntax Rules for Specifying Properties.

"action":value

The value of "action" can be one of the following keywords:

  • truncate: Shortens the data to fit the column.

  • error: Throws an error. The com.oracle.bigdata.erroropt property controls the result of the error.

"col":name

name: Identifies a column in the external table. The name is case sensitive and must be enclosed in quotation marks.

Example

This example truncates the source data for the DESCRIPTION column, if it exceeds the column width:

com.oracle.bigdata.overflow={"action":"truncate", \
                             "col":"DESCRIPTION"}
6.1.4.10 com.oracle.bigdata.rowformat

Provides the information the access driver needs to extract fields from the records in a file.

Important:

The com.oracle.bigdata.rowformat parameter is unrelated to the access parameter syntax of traditional external tables that use "TYPE ORACLE_LOADER." Keywords such as FIELDS and TERMINATED appear in both clauses, but the commonality in naming is coincidental and does not imply common functionality. The com.oracle.bigdata.rowformat access parameter is passed without change to the default Hive SerDe. The default Hive SerDe that extracts columns from rows is deliberately limited; complex cases are handled by specialized SerDes.

Default Value

DELIMITED

Syntax

A DELIMITED clause or a SERDE clause, following the Hive ROW FORMAT syntax:

rowformat ::= DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
                        [COLLECTION ITEMS TERMINATED BY char]
                        [MAP KEYS TERMINATED BY char]
                        [LINES TERMINATED BY char]
                        [NULL DEFINED AS char]
            | SERDE serde_name [WITH SERDEPROPERTIES (prop_list)]

Semantics

DELIMITED

Describes the characters used to delimit the fields in a record:

  • FIELDS TERMINATED BY: The character that delimits every field in the record. The optional ESCAPED BY character precedes the delimiter character when it appears within a field value.

  • COLLECTION ITEMS TERMINATED BY: The character that marks the end of an array element. Used when a column is a collection or a nested record. In this case the resulting value will be a JSON array.

  • MAP KEYS TERMINATED BY: The character that marks the end of an entry in a MAP field. Used when a column is a collection or a nested record. The resulting value is a JSON object.

  • LINES TERMINATED BY: The character that marks the end of a record.

  • NULL DEFINED AS: The character that indicates a null value.

SERDE

Identifies a SerDe that can parse the data and any properties of the SerDe that the access driver might need.

Example

This example specifies a SerDe for an Avro container file:

com.oracle.bigdata.rowformat:
   SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

The next example specifies a SerDe for a file containing regular expressions:

com.oracle.bigdata.rowformat=\
    SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' \
    WITH SERDEPROPERTIES \ 
    ("input.regex" = "(\\\\d{6}) (\\\\d{5}) (.{29}) .*") 
6.1.4.11 com.oracle.bigdata.tablename

The com.oracle.bigdata.tablename parameter identifies the Hive table that contains the source data.

Default Value

DEFAULT.external_table_name

Syntax

[hive_database_name.]table_name

Semantics

The maximum length of hive_database_name and table_name is 128 UTF-8 characters (512 bytes).

hive_database_name: The Hive database where the source data resides. DEFAULT is the name of the initial Hive database.

table_name: The Hive table with the data. If you omit table_name, then ORACLE_HIVE searches for a Hive table with the same name as the external table. Table names are case-insensitive.

Example

This setting indicates that the source data is in a table named ORDER_SUMMARY in the Hive ORDER_DB database:

com.oracle.bigdata.tablename=ORDER_DB.ORDER_SUMMARY

6.1.5 ORACLE_BIGDATA Access Parameters

There is a set of access parameters that are common to all file formats. There are also parameters that are unique to a specific file format.

Common Access Parameters

The following table lists and describes the parameters that are common to all file formats accessed through ORACLE_BIGDATA.

Table 6-4 Common Access Parameters

com.oracle.bigdata.credential.name

Specifies the credential object to use when accessing data files in an object store.

This access parameter is required for object store access. It is not needed for access to files through a directory object or for data stored in public buckets.

The name specified for the credential must be the name of a credential object in the same schema as the owner of the table. Granting a user SELECT or READ access to this table means that the credential will be used to access the table.

Use DBMS_CREDENTIAL.CREATE_CREDENTIAL in the DBMS_CREDENTIAL PL/SQL package to create the credential object:

exec dbms_credential.create_credential(credential_name => 'MY_CRED', username => '<username>', password => '<password>');

In the CREATE TABLE statement, set the value of the credential parameter to the name of the credential object:

com.oracle.bigdata.credential.name=MY_CRED

com.oracle.bigdata.fileformat

Specifies the format of the file. The value of this parameter identifies the reader that processes the file. Each reader can support additional access parameters that may or may not be supported by other readers.

Valid values: parquet, orc, textfile, avro, csv

Default: parquet

com.oracle.bigdata.log.opt

Specifies whether log messages should be written to a log file. When none is specified, no log file is created. If the value is normal, then a log file is created when the file reader decides to write a message. It is up to the file reader to decide what is written to the log file.

Valid values: normal, none

Default: none

com.oracle.bigdata.log.qc

Specifies the name of the log file created by the parallel query coordinator. This parameter is used only when com.oracle.bigdata.log.opt is set to normal. The valid values are the same as specified for com.oracle.bigdata.log.qc in ORACLE_HIVE and ORACLE_HDFS.

com.oracle.bigdata.log.exec

Specifies the name of the log file created during query execution. This value is used (and is required) only when com.oracle.bigdata.log.opt is set to normal. The valid values are the same as specified for com.oracle.bigdata.log.exec in ORACLE_HIVE and ORACLE_HDFS.

Avro Specific Access Parameters

In addition to the common access parameters, some parameters are valid only for the Avro file format. There is only one Avro-specific parameter at this time.

Table 6-5 Avro Specific Access Parameters

com.oracle.bigdata.avro.decimaltpe

Specifies the representation of a decimal stored in the byte array.

Valid values: int, integer, str, string

Default: If this parameter is not specified, an Avro decimal column is read assuming the byte arrays store the numerical representation of the values (that is, the default is int), as the Avro specification defines.

Parquet Specific Access Parameters

Some access parameters are valid only for the Parquet file format.

Table 6-6 Parquet Specific Access Parameters

com.oracle.bigdata.prq.binary_as_string

A boolean property that specifies whether binary is stored as a string.

Valid values: true, t, yes, y, 1, false, f, no, n, 0

Default: true

com.oracle.bigdata.prq.int96_as_timestamp

A boolean property that specifies whether int96 represents a timestamp.

Valid values: true, t, yes, y, 1, false, f, no, n, 0

Default: true

Textfile and CSV Specific Access Parameters

The textfile and csv file formats are similar to the Hive text file format: the reader extracts text and CSV data from delimited files. Big Data SQL automatically detects the line terminator (either \n, \r, or \r\n). By default, it assumes that the fields in the file are separated by commas and that the order of the fields in the file matches the order of the columns in the external table.

Example 1: CSV Data File:

This is a simple CSV example. The data file has comma-separated values with optional enclosing quotes.

-- Source CSV data in t.dat
1,"abc",
2,"xyz",

-- Create an external table over the CSV source data in t.dat
CREATE TABLE t
( c0 number,
  c1 varchar2(20)
)
ORGANIZATION external
( TYPE oracle_bigdata
  DEFAULT DIRECTORY DMPDIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.fileformat=csv
  )
  location
  (
    't.dat'
  )
) REJECT LIMIT 1
;

-- Select data from the external table
SQL> select c0, c1 from t;

 C0    C1
----  -----
 1     abc
 2     xyz

Example 2: CSV Data File:

This example shows how to create an external table over a CSV data source that uses '|' as the field separator, with the data file compressed with gzip, blanks treated as null, and a date format specified.

-- Source CSV data in t.dat
1|  |
2|Apr-99-30|

-- Create an external table over the CSV data source in t.dat
CREATE TABLE t
( c0 number,
  c1 date
)
ORGANIZATION external
( TYPE oracle_bigdata
  DEFAULT DIRECTORY DMPDIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.fileformat=textfile
    com.oracle.bigdata.compressiontype=gzip
    com.oracle.bigdata.csv.rowformat.separatorcharacter='|'
    com.oracle.bigdata.blankasnull=true
    com.oracle.bigdata.dateformat="MON-RR-DD HH:MI:SS"
  )
  location
  (
    't.dat.gz'
  )
) REJECT LIMIT 1
;

-- Select CSV data from the external table
SQL> select c0, c1 from t;

 C0       C1
------ ---------
  1
  2    30-APR-99

Example 3: JSON Data File:

This is a JSON file in which each row is a JSON document. The external table reads each row. Queries use Oracle SQL JSON functions to parse the data. Note that the field terminator is set to a new line (not a comma), so the reader reads to the end of the line and gets the whole document.

-- Source JSON documents. One for station 72 and the other for station 79.
{"station_id":"72","name":"W 52 St & 11 Ave","short_name":"6926.01","lat":40.76727216,"lon":-73.99392888,"region_id":71,"rental_methods":["CREDITCARD","KEY"],"capacity":39,"rental_url":"http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=72","eightd_has_key_dispenser":false}
{"station_id":"79","name":"Franklin  St & W Broadway","short_name":"5430.08","lat":40.71911552,"lon":-74.00666661,"region_id":71,"rental_methods":["CREDITCARD","KEY"],"capacity":33,"rental_url":"http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=79","eightd_has_key_dispenser":false}

-- Create the external table over the JSON source
CREATE TABLE stations_ext (
  doc varchar2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_BIGDATA
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS(
    com.oracle.bigdata.credential.name=MY_CRED
    com.oracle.bigdata.fileformat=textfile
    com.oracle.bigdata.csv.rowformat.fields.terminator='\n'
  )
  LOCATION ('https://swftobjectstorage.us-phoenix-1.oraclecloud.com/v1/mybucket/stations.json')
)
REJECT LIMIT UNLIMITED;

-- Select data from the external table
select s.doc.station_id,
       s.doc.name,
       s.doc.rental_methods[0]    -- the first item in the array of possible payments
from stations_ext s;

Station_id    Name                       rental_method (1st item in the array)
-----------   ------------------------   --------------------------------------
  72          W 52 St & 11 Ave           CREDITCARD
  79          Franklin St & W Broadway   CREDITCARD

Table 6-7 Textfile and CSV Specific Access Parameters

com.oracle.bigdata.buffersize

Specifies the size of the I/O buffer used for reading the file. The value is the size of the buffer in kilobytes. Note that the buffer size is also the largest size that a record can be. If a format reader encounters a record larger than this value, it will return an error.

Default: 1024

com.oracle.bigdata.blankasnull

When set to true, loads fields consisting of spaces as null.

Valid values: true, false

Default: false

Example: com.oracle.bigdata.blankasnull=true

com.oracle.bigdata.characterset

Specifies the character set of the source files.

Valid values: UTF-8

Default: UTF-8

Example: com.oracle.bigdata.characterset=UTF-8

com.oracle.bigdata.compressiontype

If this parameter is specified, then the code tries to decompress the data according to the compression scheme specified.

Valid values: gzip, bzip2 , zlib, detect

Default: no compression

If detect is specified, the format reader tries to determine which of the supported compression methods was used to compress the file.

com.oracle.bigdata.conversionerrors

If a row has data type conversion errors, then the related columns are stored as null or the row is rejected, depending on this setting.

Valid values: reject_record, store_null

Default: store_null

Example: com.oracle.bigdata.conversionerrors=reject_record

com.oracle.bigdata.csv.rowformat.nulldefinedas

Specifies the character used to indicate that the value of a field is NULL. If the parameter is not specified, then there is no value.

com.oracle.bigdata.csv.rowformat.fields.terminator

Specifies the character used to separate field values. The character value must be wrapped in single quotes, as in '|'.

Default: ','

com.oracle.bigdata.csv.rowformat.fields.escapedby

Specifies the character used to escape any embedded field terminators or line terminators in the value of a field. The character value must be wrapped in single quotes, as in '\'.

com.oracle.bigdata.dateformat

Specifies the date format in the source file. The format option AUTO checks for the following formats:

J, MM-DD-YYYYBC, MM-DD-YYYY, YYYYMMDD HHMISS, YYMMDD HHMISS, YYYY.DDD, YYYY-MM-DD

Default: yyyy-mm-dd hh24:mi:ss

Example: com.oracle.bigdata.dateformat="MON-RR-DD HH:MI:SS"

com.oracle.bigdata.fields

Specifies the order of fields in the data file. The value is the same as for com.oracle.bigdata.fields in ORACLE_HDFS, with one exception: in this case, the data type is optional. Because the data file is text, the text file reader ignores the data types for the fields and assumes all fields are text. Since the data type is optional, this parameter can be a list of field names.

com.oracle.bigdata.ignoreblanklines

When set to true, blank lines are ignored.

Valid values: true, false

Default: false

Example: com.oracle.bigdata.ignoreblanklines=true

com.oracle.bigdata.ignoremissingcolumns

Missing columns are stored as null.

Valid values: true

Default: true

Example: com.oracle.bigdata.ignoremissingcolumns=true

com.oracle.bigdata.quote

Specifies the quote character for the fields. The quote characters are removed during loading when specified.

Valid values: character

Default: NULL, meaning no quote

Example: com.oracle.bigdata.csv.rowformat.quotecharacter='"'

com.oracle.bigdata.rejectlimit

The operation errors out after the specified number of rows are rejected. This applies only when rejecting records due to conversion errors.

Valid values: number

Default: 0

Example: com.oracle.bigdata.rejectlimit=2

com.oracle.bigdata.removequotes

Removes any quotes that are around any field in the source file.

Valid values: true, false

Default: false

Example: com.oracle.bigdata.removequotes=true

com.oracle.bigdata.csv.skip.header

Specifies how many rows should be skipped from the start of the files.

Valid values: number

Default: 0 if not specified

Example: com.oracle.bigdata.csv.skip.header=1

com.oracle.bigdata.timestampformat

Specifies the timestamp format in the source file. The format option AUTO checks for the following formats:

YYYY-MM-DD HH:MI:SS.FF, YYYY-MM-DD HH:MI:SS.FF3, MM/DD/YYYY HH:MI:SS.FF3

Valid values: auto

Default: yyyy-mm-dd hh24:mi:ss.ff

Example: com.oracle.bigdata.timestampformat="auto"

com.oracle.bigdata.timestampltzformat

Specifies the timestamp with local time zone format in the source file. The format option AUTO checks for the following formats:

DD Mon YYYY HH:MI:SS.FF TZR, MM/DD/YYYY HH:MI:SS.FF TZR, YYYY-MM-DD HH:MI:SS+/-TZR, YYYY-MM-DD HH:MI:SS.FF3, DD.MM.YYYY HH:MI:SS TZR

Valid values: auto

Default: yyyy-mm-dd hh24:mi:ss.ff

Example: com.oracle.bigdata.timestampltzformat="auto"

com.oracle.bigdata.timestamptzformat

Specifies the timestamp with time zone format in the source file. The format option AUTO checks for the following formats:

DD Mon YYYY HH:MI:SS.FF TZR, MM/DD/YYYY HH:MI:SS.FF TZR, YYYY-MM-DD HH:MI:SS+/-TZR, YYYY-MM-DD HH:MI:SS.FF3, DD.MM.YYYY HH:MI:SS TZR

Valid values: auto

Default: yyyy-mm-dd hh24:mi:ss.ff

Example: com.oracle.bigdata.timestamptzformat="auto"

com.oracle.bigdata.trimspaces

Specifies how the leading and trailing spaces of the fields are trimmed.

Valid values: rtrim, ltrim, notrim, lrtrim, ldrtrim

Default: notrim

Example: com.oracle.bigdata.trimspaces=rtrim

com.oracle.bigdata.truncatecol

If the data in the file is too long for a field, then this option truncates the value of the field rather than rejecting the row or setting the field to NULL.

Valid values: true, false

Default: false

Example: com.oracle.bigdata.truncatecol=true

6.2 Static Data Dictionary Views for Hive

The Oracle Database catalog contains several static data dictionary views for Hive tables. You can query these data dictionary views to discover information about the Hive tables that you can access.

For you to access any Hive databases from Oracle Database, you must have read privileges on the ORACLE_BIGDATA_CONFIG directory object.

6.2.1 ALL_HIVE_DATABASES

ALL_HIVE_DATABASES describes all databases in the Hive metastore accessible to the current user.

Related Views

  • DBA_HIVE_DATABASES describes all the databases in the Hive metastore.

  • USER_HIVE_DATABASES describes the databases in the Hive metastore owned by the current user.

Column          Datatype         NULL       Description

CLUSTER_ID      VARCHAR2(4000)   NOT NULL   Hadoop cluster where the Hive metastore is located
DATABASE_NAME   VARCHAR2(4000)   NOT NULL   Hive database name
DESCRIPTION     VARCHAR2(4000)              Hive database description
DB_LOCATION     VARCHAR2(4000)   NOT NULL   Hive database location
HIVE_URI        VARCHAR2(4000)              Hive database URI
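
For example, the following query (an illustrative sketch) lists the Hive databases visible to the current user:

SELECT cluster_id, database_name, db_location
  FROM all_hive_databases;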

6.2.2 ALL_HIVE_TABLES

ALL_HIVE_TABLES describes all tables in the Hive metastore accessible to the current user.

The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".

Related Views

  • DBA_HIVE_TABLES describes all tables in the Hive metastore.

  • USER_HIVE_TABLES describes the tables in the database owned by the current user in the Hive metastore.

Column              Datatype        NULL      Description

CLUSTER_ID          VARCHAR2(4000)  NOT NULL  Hadoop cluster where the Hive metastore is located
DATABASE_NAME       VARCHAR2(4000)  NOT NULL  Name of the Hive database
TABLE_NAME          VARCHAR2(4000)  NOT NULL  Name of the Hive table
LOCATION            VARCHAR2(4000)            Physical location of the Hive table
NO_OF_COLS          NUMBER                    Number of columns in the Hive table
CREATION_TIME       DATE                      Time when the table was created
LAST_ACCESSED_TIME  DATE                      Time of most recent access
OWNER               VARCHAR2(4000)            Owner of the Hive table
TABLE_TYPE          VARCHAR2(4000)  NOT NULL  Type of Hive table, such as external or managed
PARTITIONED         VARCHAR2(4000)            Whether the table is partitioned (YES) or not (NO)
NO_OF_PART_KEYS     NUMBER                    Number of partitions
INPUT_FORMAT        VARCHAR2(4000)            Input format
OUTPUT_FORMAT       VARCHAR2(4000)            Output format
SERIALIZATION       VARCHAR2(4000)            SerDe serialization information
COMPRESSED          NUMBER                    Whether the table is compressed (YES) or not (NO)
HIVE_URI            VARCHAR2(4000)            Hive database URI
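
For example, the following query (an illustrative sketch) lists the partitioned Hive tables in the default database:

SELECT table_name, owner, no_of_part_keys
  FROM all_hive_tables
 WHERE database_name = 'default'
   AND partitioned = 'YES';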

6.2.3 ALL_HIVE_COLUMNS

ALL_HIVE_COLUMNS describes the columns of all Hive tables accessible to the current user.

The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".

Related Views

  • DBA_HIVE_COLUMNS describes the columns of all tables in the Hive metastore.

  • USER_HIVE_COLUMNS describes the columns of the tables in the Hive database owned by the current user.

Column              Datatype        NULL      Description

CLUSTER_ID          VARCHAR2(4000)  NOT NULL  Hadoop cluster where the Hive metastore is located
DATABASE_NAME       VARCHAR2(4000)  NOT NULL  Name of the Hive database; if blank, then the default database
TABLE_NAME          VARCHAR2(4000)  NOT NULL  Name of the Hive table
COLUMN_NAME         VARCHAR2(4000)  NOT NULL  Name of the Hive column
HIVE_COLUMN_TYPE    VARCHAR2(4000)  NOT NULL  Data type of the Hive column
ORACLE_COLUMN_TYPE  VARCHAR2(4000)  NOT NULL  Oracle data type equivalent to the Hive data type
LOCATION            VARCHAR2(4000)            Physical location of the Hive table
OWNER               VARCHAR2(4000)            Owner of the Hive table
CREATION_TIME       DATE                      Time when the table was created
HIVE_URI            VARCHAR2(4000)            Hive database URI
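
For example, the following query (an illustrative sketch; adjust the table name) shows how the Hive data types of a table's columns map to Oracle data types:

SELECT column_name, hive_column_type, oracle_column_type
  FROM all_hive_columns
 WHERE table_name = 'my_hive_table';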

6.2.4 DBA_HIVE_DATABASES

DBA_HIVE_DATABASES describes all the databases in the Hive metastore. Its columns are the same as those in ALL_HIVE_DATABASES.

See Also:

"ALL_HIVE_DATABASES"

6.2.5 DBA_HIVE_TABLES

DBA_HIVE_TABLES describes all tables in the Hive metastore. Its columns are the same as those in ALL_HIVE_TABLES.

The Oracle Big Data SQL configuration must identify the default Hive database for the current user. See "About the bigdata_config Directory".

See Also:

"ALL_HIVE_TABLES"

6.2.6 DBA_HIVE_COLUMNS

DBA_HIVE_COLUMNS describes the columns of all tables in the Hive metastore. Its columns are the same as those in ALL_HIVE_COLUMNS.

See Also:

"ALL_HIVE_COLUMNS"

6.2.7 USER_HIVE_DATABASES

USER_HIVE_DATABASES describes the databases in the Hive metastore owned by the current user. Its columns (except for OWNER) are the same as those in ALL_HIVE_DATABASES.

See Also:

"ALL_HIVE_DATABASES"

6.2.8 USER_HIVE_TABLES

USER_HIVE_TABLES describes the tables in the database owned by the current user in the Hive metastore. Its columns (except for OWNER) are the same as those in ALL_HIVE_TABLES.

The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".

See Also:

"ALL_HIVE_TABLES"

6.2.9 USER_HIVE_COLUMNS

USER_HIVE_COLUMNS describes the columns of the tables in the Hive database owned by the current user. Its columns (except for OWNER) are the same as those in ALL_HIVE_COLUMNS.

The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".

See Also:

"ALL_HIVE_COLUMNS"

6.3 DBMS_BDSQL PL/SQL Package

The DBMS_BDSQL PL/SQL package contains procedures to add and remove a user map.

This section contains the following topics:

In previous releases of Oracle Big Data SQL, all queries against Hadoop and Hive data were executed as the oracle user, and there was no option to change users. Although oracle is still the underlying user in all cases, you can now use Multi-User Authorization (based on Hadoop Secure Impersonation) to direct the oracle account to execute tasks on behalf of other designated users. This enables HDFS data access based on the user that is currently executing the query, rather than the single oracle user.

The DBMS_BDSQL package enables you to provide rules for identifying the currently connected user and to map the connected user to the user that is impersonated. Because there are numerous ways in which users can connect to Oracle Database, this user may be a database user, a user sourced from LDAP, from Kerberos, and so forth. Authorization rules on the files apply for that user and audits will reflect that user as well.

Note:

Grant the new BDSQL_ADMIN role to designated administrators in order to allow them to invoke these procedures.

6.3.1 ADD_USER_MAP

Use the ADD_USER_MAP procedure to specify the rules for identifying the actual user who is running the query.

At query execution time, the database performs a lookup on the BDSQL_USER_MAP table to determine the current database user (current_database_user). It then uses the syscontext_namespace and syscontext_parm_hadoop_user parameters to identify the actual user.

Syntax

procedure ADD_USER_MAP (  
    cluster_name                 IN VARCHAR2 DEFAULT '[DEFAULT]', 
    current_database_user        IN VARCHAR2 NOT NULL, 
    syscontext_namespace         IN VARCHAR2 DEFAULT NULL,
    syscontext_parm_hadoop_user  IN VARCHAR2 NOT NULL 
); 

Table 6-8 ADD_USER_MAP Parameters

cluster_name

The name of the Hadoop cluster where the map will be applied. [DEFAULT] as the cluster name designates the default cluster.

current_database_user

The current effective database user. This is what Oracle uses to check for authority. A value of '*' indicates that this row is to be used if no other rows fit the criteria. There is no default and the value may not be NULL.

syscontext_namespace

The SYS_CONTEXT namespace to check, if customer security is set up. Note that for the Oracle USERENV namespace, the only allowed values are GLOBAL_UID, CLIENT_IDENTIFIER, and AUTHENTICATED_IDENTITY.

If your Oracle Database installation uses Kerberos credentials, SSL, Active Directory, or LDAP for authentication, it is likely that your Hadoop system uses the same authentication framework. In that case, AUTHENTICATED_IDENTITY must be specified. This identifier includes only the username. The domain segment of the credential is truncated, as is the cluster name if included. For example, the username dirkrb will be used for authorization on the Hadoop cluster as the authenticated identity of dirkrb@HQ.TEST1.DBSEC2008.COM.

syscontext_parm_hadoop_user

The Hadoop user that will impersonate the current database user.

Note:

The values for current_database_user and syscontext_parm_hadoop_user can be the single asterisk character (*) or any string that meets the requirements of Oracle simple_sql_name assertion:

  • The name must begin with an alphabetic character. It may contain alphanumeric characters as well as the characters _, $, and # in the second and subsequent character positions.

  • Quoted SQL names are also allowed.

  • Quoted names must be enclosed in double quotes.

  • Quoted names allow any characters between the quotes.

  • Quotes inside the name are represented by two quote characters in a row, for example, "a name with "" inside" is a valid quoted name.

  • The input parameter may have any number of leading and/or trailing white space characters.
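
For example, the following call creates the SCOTT-to-hdpusr1 mapping shown in the last example of "Multi-User Authorization Security Table" (a sketch):

BEGIN
  DBMS_BDSQL.ADD_USER_MAP(
    cluster_name                => 'hadoop_cl_1',
    current_database_user       => 'SCOTT',
    syscontext_namespace        => NULL,
    syscontext_parm_hadoop_user => 'hdpusr1'
  );
END;
/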

6.3.2 REMOVE_USER_MAP

Use REMOVE_USER_MAP to remove a row from the BDSQL_USER_MAP table. This disassociates a specific Oracle Database user from a specific Hadoop user.

Syntax

procedure REMOVE_USER_MAP ( 
   cluster_name IN VARCHAR2 DEFAULT '[DEFAULT]', 
   current_database_user IN VARCHAR2 NOT NULL
); 
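
For example, to remove the mapping for the database user SCOTT on the hadoop_cl_1 cluster (a sketch):

BEGIN
  DBMS_BDSQL.REMOVE_USER_MAP(
    cluster_name          => 'hadoop_cl_1',
    current_database_user => 'SCOTT'
  );
END;
/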

See Also:

The reference page for ADD_USER_MAP describes the cluster_name and current_database_user parameters.

6.3.3 Multi-User Authorization Security Table

SYS.BDSQL_USER_MAP is the multi-user authorization security table.

Use the procedures ADD_USER_MAP and REMOVE_USER_MAP to update this table.

The primary key is (cluster_name, current_database_user).
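
Administrators can inspect the current mappings by querying the table directly, for example:

SELECT cluster_name, current_database_user,
       syscontext_namespace, syscontext_parm_hadoop_user
  FROM sys.bdsql_user_map;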

Table 6-9 SYS.BDSQL_USER_MAP

cluster_name (varchar2)

Name of the Hadoop cluster. The default is [DEFAULT].

current_database_user (varchar2)

The current effective database user (no default, not NULL). Oracle uses this column to check for the authorization rule that corresponds to the given Oracle Database user. A value of '*' in a row is a directive to use this row if no other rows fit the criteria.

syscontext_namespace (varchar2)

The optional specification for the Oracle SYS_CONTEXT namespace, if customer security is set up. Note that for the Oracle USERENV namespace, the only allowed values are 'GLOBAL_UID', 'CLIENT_IDENTIFIER', and 'AUTHENTICATED_IDENTITY'.

syscontext_parm_hadoop_user (varchar2)

This column value has alternate interpretations:

  • If syscontext_namespace has a value, then syscontext_parm_hadoop_user refers to the parameter that is specific to syscontext_namespace.

    However, when the value is '*', this is a directive to use the value of current_database_user for impersonation. The syscontext_namespace column must be NULL in this case.

  • If syscontext_namespace is NULL, then syscontext_parm_hadoop_user contains the Hadoop user who is impersonated prior to HDFS file access.

Here a customer is using Active Directory, Kerberos, SSL, or LDAP for logon authentication against the database. AUTHENTICATED_IDENTITY is specified in this case because the customer uses the same Active Directory framework for Hadoop user management.

The example below is similar to running the following SQL query for the currently connected user:

select sys_context('USERENV', 'AUTHENTICATED_IDENTITY') from dual;

In this example, only the username (without the "@<domain>" segment) is used for authorization on the Hadoop cluster. There may also be cases where the format of AUTHENTICATED_IDENTITY is <username>/<cluster>@<domain_name>.

cluster_name  current_database_user  syscontext_namespace  syscontext_parm_hadoop_user
[DEFAULT]     *                      USERENV               AUTHENTICATED_IDENTITY

In this example, "HRAPP" is an HR application that always connects to the database using the HRAPP database user and then programmatically sets the application user through the DBMS_SESSION.SET_IDENTIFIER procedure. There are a number of "lightweight" users who are designated with CLIENT_IDENTIFIER (as in sys_context('USERENV', 'CLIENT_IDENTIFIER'), which is similar to running select sys_context('USERENV', 'CLIENT_IDENTIFIER') from dual;).

The current database has other effective users who are enterprise users with logons managed by Oracle Internet Directory for Enterprise User Security. In these cases, the GLOBAL_UID is used for Hadoop impersonation.

cluster_name  current_database_user  syscontext_namespace  syscontext_parm_hadoop_user
[DEFAULT]     HRAPP                  USERENV               CLIENT_IDENTIFIER
[DEFAULT]     *                      USERENV               GLOBAL_UID

In this example, BIAPP is a business intelligence application whose own context is its username. For customers using the application, their designated ID is used for Hadoop access. In other words, when the effective current user is 'BIAPP', we use sys_context('BIVPD','USERID') for the impersonation. For the rest of the users, we simply designate [DEFAULT] * * in order to use their current database username for the impersonation.

cluster_name  current_database_user  syscontext_namespace  syscontext_parm_hadoop_user
[DEFAULT]     BIAPP                  BIVPD                 USERID
[DEFAULT]     *                                            *

In this example, the Oracle username SCOTT is impersonated by the hdpusr1 Hadoop user for HDFS access. The user ADAM is impersonated by hdpusr2 for HDFS access.

All other users have more limited access, so we use a syscontext_namespace value of 'lowprivuser' to designate these users.

cluster_name  current_database_user  syscontext_namespace  syscontext_parm_hadoop_user
hadoop_cl_1   SCOTT                                        hdpusr1
hadoop_cl_1   ADAM                   lowprivuser           hdpusr2
hadoop_cl_1   *

6.5 DBMS_HADOOP PL/SQL Package

The DBMS_HADOOP package contains a function to generate the CREATE EXTERNAL TABLE DDL for a Hive table.

6.5.1 CREATE_EXTDDL_FOR_HIVE

This function returns a SQL CREATE TABLE ORGANIZATION EXTERNAL statement for a Hive table. It uses the ORACLE_HIVE access driver.

Syntax

DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (
   cluster_id        IN    VARCHAR2,
   db_name           IN    VARCHAR2   := NULL,
   hive_table_name   IN    VARCHAR2,
   hive_partition    IN    BOOLEAN,
   table_name        IN    VARCHAR2   := NULL,
   perform_ddl       IN    BOOLEAN    DEFAULT FALSE,
   text_of_ddl       OUT   VARCHAR2
);

Parameters

Table 6-10 CREATE_EXTDDL_FOR_HIVE Function Parameters

Parameter Description

cluster_id

Hadoop cluster where the Hive metastore is located

db_name

Name of the Hive database

hive_table_name

Name of the Hive table

hive_partition

Whether the table is partitioned (TRUE) or not (FALSE)

table_name

Name of the Oracle external table to be created. It cannot already exist.

perform_ddl

Whether to execute the generated CREATE TABLE statement (TRUE) or just return the text of the command (FALSE).

Do not execute the command automatically if you want to review or modify it.

text_of_ddl

The generated CREATE TABLE ORGANIZATION EXTERNAL statement.

Usage Notes

The Oracle Database system must be configured for Oracle Big Data SQL. See "About Oracle Big Data SQL on the Database Server (Oracle Exadata Machine or Other)".

The data type conversions are based on the default mappings between Hive data types and Oracle data types. See "Hive to Oracle Data Type Conversions".

6.5.1.1 Example

The following PL/SQL block returns the CREATE EXTERNAL TABLE DDL for my_hive_table from the default Hive database. The connection to Hive is established using the configuration files in the ORACLE_BIGDATA_CONFIG directory, which identify the location of the HADOOP1 cluster.

DECLARE
   DDLtxt VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
     CLUSTER_ID=>'hadoop1',
     DB_NAME=>'default',
     HIVE_TABLE_NAME=>'my_hive_table',
     HIVE_PARTITION=>FALSE,
     TABLE_NAME=>'my_xt_oracle',
     PERFORM_DDL=>FALSE,
     TEXT_OF_DDL=>DDLtxt
   );
   dbms_output.put_line(DDLtxt);
END;
/

The block returns the text of the following SQL command:

CREATE TABLE my_xt_oracle
(
     c0 VARCHAR2(4000),
     c1 VARCHAR2(4000),
     c2 VARCHAR2(4000),
     c3 VARCHAR2(4000))
     ORGANIZATION EXTERNAL
        (TYPE ORACLE_HIVE
          DEFAULT DIRECTORY DEFAULT_DIR
          ACCESS PARAMETERS (
             com.oracle.bigdata.cluster=hadoop1
             com.oracle.bigdata.tablename=default.my_hive_table
          )
)
PARALLEL 2 REJECT LIMIT UNLIMITED