The DBMS_HADOOP package contains a function to generate the CREATE EXTERNAL TABLE DDL for a Hive table:

This function returns a SQL CREATE TABLE ORGANIZATION EXTERNAL statement for a Hive table. It uses the ORACLE_HIVE access driver.
Syntax
DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (
   cluster_id        IN  VARCHAR2,
   db_name           IN  VARCHAR2 := NULL,
   hive_table_name   IN  VARCHAR2,
   hive_partition    IN  BOOLEAN,
   table_name        IN  VARCHAR2 := NULL,
   perform_ddl       IN  BOOLEAN  DEFAULT FALSE,
   text_of_ddl       OUT VARCHAR2
);
Parameters
Table 4-1 CREATE_EXTDDL_FOR_HIVE Function Parameters
Parameter | Description |
---|---|
cluster_id | Hadoop cluster where the Hive metastore is located |
db_name | Name of the Hive database |
hive_table_name | Name of the Hive table |
hive_partition | Whether the table is partitioned (TRUE) or not (FALSE) |
table_name | Name of the Oracle external table to be created. It cannot already exist. |
perform_ddl | Whether to execute the generated CREATE TABLE statement. Do not execute the command automatically if you want to review or modify it. |
text_of_ddl | The generated CREATE TABLE ORGANIZATION EXTERNAL statement |
Usage Notes
The Oracle Database system must be configured for Oracle Big Data SQL. See "About Oracle Big Data SQL on the Database Server (Oracle Exadata Machine or Other)".
The data type conversions are based on the default mappings between Hive data types and Oracle data types. See "About Data Type Conversions".
The following example returns the CREATE EXTERNAL TABLE DDL for my_hive_table from the default Hive database. The connection to Hive is established using the configuration files in the ORACLE_BIGDATA_CONFIG directory, which identify the location of the HADOOP1 cluster.
DECLARE
   DDLtxt VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'my_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'my_xt_oracle',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLtxt
   );
   dbms_output.put_line(DDLtxt);
END;
/
The example returns the text of the following SQL command:
CREATE TABLE my_xt_oracle
(
   c0 VARCHAR2(4000),
   c1 VARCHAR2(4000),
   c2 VARCHAR2(4000),
   c3 VARCHAR2(4000))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS
      (
       com.oracle.bigdata.cluster=hadoop1
       com.oracle.bigdata.tablename=default.my_hive_table
      )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED
This section describes the properties that you use when creating an external table that uses the ORACLE_HDFS or ORACLE_HIVE access drivers. In a CREATE TABLE ORGANIZATION EXTERNAL statement, specify the parameters in the opaque_format_spec clause of ACCESS PARAMETERS.
This section contains the following topics:
Alphabetical list of properties
The properties are set using keyword-value pairs in the SQL CREATE TABLE ACCESS PARAMETERS clause and in the configuration files. The syntax must obey these rules:
The format of each keyword-value pair is a keyword, a colon or equal sign, and a value. The following are valid keyword-value pairs:
keyword=value
keyword:value
The value is everything from the first non-whitespace character after the separator to the end of the line. Whitespace between the separator and the value is ignored. Trailing whitespace for the value is retained.
A property definition can be on one line or multiple lines.
A line terminator is a line feed, a carriage return, or a carriage return followed by line feeds.
When a property definition spans multiple lines, precede each line terminator with a backslash (escape character), except on the last line. In this example, the value of the Keyword1 property is Value part 1 Value part 2 Value part 3.

Keyword1= Value part 1 \
Value part 2 \
Value part 3
You can create a logical line by stripping each physical line of leading whitespace and concatenating the lines. The parser extracts the property names and values from the logical line.
You can embed special characters in a property name or property value by preceding a character with a backslash (escape character), indicating the substitution. Table 4-2 describes the special characters.
Table 4-2 Special Characters in Properties
Escape Sequence | Character |
---|---|
\b | Backspace (0x08) |
\t | Horizontal tab (0x09) |
\n | Line feed (0x0A) |
\f | Form feed (0x0C) |
\r | Carriage return (0x0D) |
\" | Double quote (0x22) |
\' | Single quote (0x27) |
\\ | Backslash (0x5C). When multiple backslashes are at the end of the line, the parser continues the value to the next line only for an odd number of backslashes. |
\uxxxx | 2-byte, big-endian, Unicode code point. When a character requires two code points (4 bytes), the parser expects \u for the second code point. |
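The separator, whitespace, and line-continuation rules above can be sketched as a small parser. The following is an illustrative Python sketch only, not the access driver's actual parser; it handles the separator and backslash-continuation rules but not the escape substitutions of Table 4-2, and the name parse_properties is invented for the example.

```python
import re

def parse_properties(text):
    """Parse keyword-value pairs: '=' or ':' separates keyword from value,
    whitespace after the separator is ignored, trailing whitespace is kept,
    and a line ending in an odd number of backslashes continues on the next
    physical line (whose leading whitespace is stripped)."""
    logical, pending = [], None
    for line in text.splitlines():
        # Strip leading whitespace only on continuation lines.
        part = line.lstrip() if pending is not None else line
        m = re.search(r'\\+$', part)
        continued = m is not None and len(m.group()) % 2 == 1
        if continued:
            part = part[:-1]               # drop the continuation backslash
        pending = (pending or '') + part
        if not continued:
            logical.append(pending)
            pending = None
    if pending is not None:                # unterminated continuation
        logical.append(pending)
    props = {}
    for line in logical:
        m = re.match(r'\s*([^=:]+?)\s*[=:]\s*(.*)$', line)
        if m:
            props[m.group(1)] = m.group(2)  # trailing whitespace retained
    return props
```

Applied to the Keyword1 example above, the three physical lines collapse into the single value Value part 1 Value part 2 Value part 3.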
The access parameters for the ORACLE_HDFS access driver provide the metadata needed to locate the data in HDFS and generate a Hive table over it.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HDFS uses the following default values:
com.oracle.bigdata.rowformat=DELIMITED
com.oracle.bigdata.fileformat=TEXTFILE
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}
ORACLE_HDFS supports the following optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause:

Example 4-1 shows a CREATE TABLE statement in which multiple access parameters are set.
Example 4-1 Setting Multiple Access Parameters for ORACLE_HDFS
CREATE TABLE ORDER (CUST_NUM VARCHAR2(10),
                    ORDER_NUM VARCHAR2(20),
                    ORDER_DATE DATE,
                    ITEM_CNT NUMBER,
                    DESCRIPTION VARCHAR2(100),
                    ORDER_TOTAL NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE ORACLE_HDFS
       ACCESS PARAMETERS (
          com.oracle.bigdata.fields: (CUST_NUM, \
                                      ORDER_NUM, \
                                      ORDER_DATE, \
                                      ORDER_LINE_ITEM_COUNT, \
                                      DESCRIPTION, \
                                      ORDER_TOTAL)
          com.oracle.bigdata.colMap: {"col":"item_cnt", \
                                      "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                        "col":"DESCRIPTION"}
          com.oracle.bigdata.errorOpt: [{"action":"replace", \
                                         "value":"INVALID NUM", \
                                         "col":["CUST_NUM","ORDER_NUM"]}, \
                                        {"action":"reject", \
                                         "col":"ORDER_TOTAL"}]
       )
       LOCATION ("hdfs:/usr/cust/summary/*"));
ORACLE_HIVE retrieves metadata about external data sources from the Hive catalog. The default mapping of Hive data to columns in the external table is usually appropriate. However, some circumstances require special parameter settings, or you might want to override the default values.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HIVE uses the following default values:
com.oracle.bigdata.tablename=name of external table
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}
ORACLE_HIVE supports the following optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause:

Example 4-2 shows a CREATE TABLE statement in which multiple access parameters are set.
Example 4-2 Setting Multiple Access Parameters for ORACLE_HIVE
CREATE TABLE ORDER (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
      (TYPE oracle_hive
       ACCESS PARAMETERS (
          com.oracle.bigdata.tableName: order_db.order_summary
          com.oracle.bigdata.colMap: {"col":"ITEM_CNT", \
                                      "field":"order_line_item_count"}
          com.oracle.bigdata.overflow: {"action":"ERROR", \
                                        "col":"DESCRIPTION"}
          com.oracle.bigdata.errorOpt: [{"action":"replace", \
                                         "value":"INV_NUM", \
                                         "col":["CUST_NUM","ORDER_NUM"]}, \
                                        {"action":"reject", \
                                         "col":"ORDER_TOTAL"}]
       ));
Sets the buffer size in kilobytes for large record reads. Set this value if you need to read records that are greater than the default buffer size.
Default Value
1000 KB
Syntax
com.oracle.bigdata.buffersize: n
Example
The following example sets the buffer size to 100 MB:
com.oracle.bigdata.buffersize: 100000
Maps a column in the source data to a column in the Oracle external table. You can define one or multiple pairs of column mappings. Use this property when the source field names exceed the maximum length of Oracle column names, or when you want to use different column names in the external table.
Default Value
A column in the external table with the same name as the Hive column
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram:
colmap:
colmap_entry:
Semantics
"col":name
"col"
: The keyword must be lowercase and enclosed in quotation marks.
name
: The name of a column in the Oracle external table. It is case sensitive and must be enclosed in quotation marks.
"field":name
"field"
: The keyword must be lowercase and enclosed in quotation marks.
name
: The name of a field in the data source. It is not case sensitive, but it must be enclosed in quotation marks. See "Syntax Rules for Specifying Properties".
Examples
This example maps a Hive column named ORDER_LINE_ITEM_COUNT to an Oracle column named ITEM_CNT:

com.oracle.bigdata.colMap={"col":"ITEM_CNT", \
                           "field":"order_line_item_count"}
The following example shows the mapping of multiple columns.
com.oracle.bigdata.colmap:[{"col":"KOL1", "field":"PROJECT_NAME"},{ "col":"KOL2","field":"wsdl_name"},{"col":"KOL3", "field":"method"}]
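The lookup that a colmap value describes can be sketched in Python. This is a minimal illustration of the mapping semantics, assuming the default of a like-named column when no entry matches; field_for_column is an invented name, not an Oracle API.

```python
import json

def field_for_column(colmap_value, col):
    """Return the source field mapped to an Oracle column. The colmap
    value may be a single JSON object or a JSON array of objects; with
    no matching entry, the field defaults to the column's own name
    (lowercased here, since field names are not case sensitive)."""
    doc = json.loads(colmap_value)
    entries = doc if isinstance(doc, list) else [doc]
    for entry in entries:
        if entry["col"] == col:   # column names are case sensitive
            return entry["field"]
    return col.lower()
```

For the multiple-column example above, looking up KOL2 returns wsdl_name.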
Specifies the method that SmartScan uses to scan a Hadoop data source. The method can make a significant difference in performance.
Default Value
automatic
Syntax
A JSON document with the keyword-value pairs shown in the following diagram:
datamode:
Semantics
automatic

Automatically selects the appropriate mode, based on the metadata. It selects c mode if possible, or java mode if the data contains formats that are not supported by c mode.

c

Uses Java to read the file buffers, but C code to process the data and convert it to Oracle format. Specify this mode for delimited data.

If the data contains formats that the C code does not support, then it returns an error.

java

Uses the Java SerDes and InputFormats to process the data and convert it to Oracle format. Specify this mode for Parquet, RCFile, and other data formats that require a SerDe.
Describes how to handle errors that occur while the value of a column is calculated.
Default Value
{"action":"setnull"}
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram:
erroropt:
error_element:
Semantics
The "action"
, "reject"
, "setnull"
, "replace"
, "value"
, and "col"
keywords must be lowercase and enclosed in quotation marks. See "Syntax Rules for Specifying Properties".
"action":value
value
: One of these keywords:
"reject"
: Does not load any rows.
"setnull"
: Sets the column to NULL
.
"replace"
: Sets the column to the specified value.
"value":string
string
: Replaces a bad value in the external table. It must be enclosed in quotation marks.
"col":name
name
: Identifies a column in an external table. The column name is case sensitive, must be enclosed in quotation marks, and can be listed only once.
Example
This example sets the value of the CUST_NUM or ORDER_NUM columns to INVALID if the Hive value causes an error. For any other columns, an error causes the Hive value to be rejected.

com.oracle.bigdata.errorOpt: {"action":"replace",\
                              "value":"INVALID", \
                              "col":["CUST_NUM","ORDER_NUM"]}
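The reject/setnull/replace semantics above can be sketched as a row filter. This is a hedged Python illustration, not the driver's implementation; apply_erroropt and its argument shapes are invented for the example.

```python
def apply_erroropt(row, bad_cols, rules):
    """Apply erroropt-style rules to one row. `row` maps column name to
    value; `bad_cols` is the set of columns whose value calculation failed.
    reject drops the row, setnull nulls the column (the default), and
    replace substitutes the rule's value."""
    for col in bad_cols:
        # Find the first rule that names this column; default is setnull.
        action, value = "setnull", None
        for rule in rules:
            cols = rule.get("col", [])
            cols = [cols] if isinstance(cols, str) else cols
            if col in cols:
                action, value = rule["action"], rule.get("value")
                break
        if action == "reject":
            return None                    # row is not loaded
        row[col] = value if action == "replace" else None
    return row
```

With the rules from the example above, a failed CUST_NUM becomes INVALID while a failed column with no rule is set to null.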
Lists the field names and data types of the data source.
Default Value
Not defined
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram:
fields:
data_type:
primitive_type:
Semantics
The syntax is the same as a field list for a Hive table. If you split the field list across multiple lines, you must use a backslash to escape the new line characters.
field_name
The name of the Hive field. Use only alphanumeric characters and underscores (_). The maximum length is 128 characters. Field names are case-insensitive.
data_type
The data type of the Hive field. Optional; the default is STRING. The character set must be UTF8.
The data type can be complex or primitive:
Hive Complex Data Types

ARRAY: Indexable list

MAP: Key-value tuples

STRUCT: List of elements

UNIONTYPE: Multiple data types

Hive Primitive Data Types

INT: 4-byte integer

BIGINT: 8-byte integer

SMALLINT: 2-byte integer

TINYINT: 1-byte integer

BOOLEAN: TRUE or FALSE

FLOAT: single precision

DOUBLE: double precision

STRING: character sequence
See Also:
"Data Types" in the Apache Hive Language Manual at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
COMMENT col_comment
A string literal enclosed in single quotation marks, which is stored as metadata for the Hive table (comment property of TBLPROPERTIES).
Describes the row format of the data source, based on the ROW FORMAT clause for a Hive table generated by ORACLE_HDFS.
Default Value
TEXTFILE
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram.
fileformat:
Semantics
ORC: Optimized row columnar file format

PARQUET: Column-oriented, binary file format

RCFILE: Record columnar file format

SEQUENCEFILE: Compressed file format

TEXTFILE: Plain text file format

INPUTFORMAT: Identifies a Java class that can extract records from the data file.

OUTPUTFORMAT: Identifies a Java class that can format the output records in the desired format.
Specifies how the access driver generates log files from the C code for a query, when the query runs as parallel processes on CDH.

The access driver does not create or write log files when executing on a Hadoop cluster node; the parallel query processes write them. The log files from the Java code are controlled by log4j properties, which are specified in the configuration file or the access parameters. See "bigdata-log4j.properties".
Default Value
Not defined (no logging)
Syntax
[directory_object:]file_name_template
Semantics
directory_object

The Oracle directory object for the HDFS path on the Hadoop cluster where the log file is created.

file_name_template

A string used to generate file names. Table 4-3 describes the optional variables that you can use in the template.
Table 4-3 Variables for com.oracle.bigdata.log.exec
Variable | Value |
---|---|
%p | Operating system process identifier (PID) |
%a | A number that uniquely identifies the process |
%% | A percent sign (%) |
Example
The following example generates log file names that include the PID and a unique number, such as xtlogp_hive14_3413_57:

com.oracle.bigdata.log.exec= xtlogp_hive14_%p_%a
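The template expansion of Table 4-3 can be sketched in Python. The %a uniquifier below is a simple counter, which is an assumption; the driver only guarantees a number unique to the process, and expand_log_template is an invented name.

```python
import os
import itertools

_counter = itertools.count()   # stand-in for the driver's uniquifier

def expand_log_template(template):
    """Expand the template variables of Table 4-3: %p becomes the OS
    process id, %a a number unique to this process, %% a literal
    percent sign; everything else is copied through."""
    out, i = [], 0
    while i < len(template):
        ch = template[i]
        if ch == '%' and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == 'p':
                out.append(str(os.getpid())); i += 2; continue
            if nxt == 'a':
                out.append(str(next(_counter))); i += 2; continue
            if nxt == '%':
                out.append('%'); i += 2; continue
        out.append(ch)
        i += 1
    return ''.join(out)
```

Expanding xtlogp_hive14_%p_%a yields a name like xtlogp_hive14_3413_57.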
Specifies how the access driver generates log files for a query.
Default Value
Not defined (no logging)
Syntax
[directory_object:]file_name_template
Semantics
directory_object
Name of an Oracle directory object that points to the path where the log files are written. If this value is omitted, then the logs are written to the default directory for the external table.
file_name_template
A string used to generate file names. Table 4-4 describes the optional variables that you can use in the string.
Table 4-4 Variables for com.oracle.bigdata.log.qc
Variable | Value |
---|---|
%p | Operating system process identifier (PID) |
%% | A percent sign (%) |
Example
This example creates log file names that include the PID and a percent sign, such as xtlogp_hive213459_%:
com.oracle.bigdata.log.qc= xtlogp_hive21%p_%%
Describes how to handle string data that is too long for the columns in the external table. The data source can be character or binary. For Hive, the data source can also be STRUCT, UNIONTYPE, MAP, or ARRAY.
Default Value
{"action":"error"}
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram:
overflow ::=
overflow_element ::=
Semantics
The "action"
, "truncate"
, "error"
, and "col"
tags must be lowercase and enclosed in quotation marks. See "Syntax Rules for Specifying Properties".
"action":value
The value of "action"
can be one of the following keywords:
truncate
: Shortens the data to fit the column.
error
: Throws an error. The com.oracle.bigdata.erroropt
property controls the result of the error.
"col":name
name
: Identifies a column in the external table. The name is case sensitive and must be enclosed in quotation marks.
Example
This example truncates the source data for the DESCRIPTION column, if it exceeds the column width:

com.oracle.bigdata.overflow={"action":"truncate", \
                             "col":"DESCRIPTION"}
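The truncate/error choice can be sketched as follows. This is a Python illustration of the semantics only; apply_overflow and its arguments are invented for the example, and the default action, per above, is error.

```python
def apply_overflow(row, widths, rules):
    """Apply overflow-style handling to one row: truncate shortens an
    over-wide value to the column width; error raises. Columns without
    a matching rule use {"action": "error"}, the documented default."""
    for col, value in row.items():
        width = widths.get(col)
        if width is None or len(value) <= width:
            continue
        action = "error"
        for rule in rules:
            cols = rule.get("col", [])
            cols = [cols] if isinstance(cols, str) else cols
            if col in cols:
                action = rule["action"].lower()
                break
        if action == "truncate":
            row[col] = value[:width]
        else:
            raise ValueError(f"value too long for column {col}")
    return row
```

With the rule above, an over-long DESCRIPTION is cut to the column width while any other over-long column raises an error.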
Provides the information the access driver needs to extract fields from the records in a file.
Important:

The com.oracle.bigdata.rowformat parameter is unrelated to the access parameter syntax of traditional external tables that use "type ORACLE_LOADER." Keywords such as FIELDS, TERMINATED, and others appear in both clauses, but the commonality in naming is coincidental and does not imply common functionality. The com.oracle.bigdata.rowformat access parameter is passed without change to the default Hive SerDe. The Hive SerDe used to extract columns from rows is deliberately limited; complex cases are handled by specialized SerDes.

Default Value

DELIMITED
Syntax
A JSON document with the keyword-value pairs is shown in the following diagram.
rowformat:
Semantics
DELIMITED

Describes the characters used to delimit the fields in a record:

FIELDS TERMINATED BY: The character that delimits every field in the record. The optional ESCAPED BY character precedes the delimiter character when it appears within a field value.

COLLECTION ITEMS TERMINATED BY: The character that marks the end of an array element. Used when a column is a collection or a nested record; the resulting value is a JSON array.

MAP KEYS TERMINATED BY: The character that marks the end of an entry in a MAP field. Used when a column is a collection or a nested record; the resulting value is a JSON object.

LINES TERMINATED BY: The character that marks the end of a record.

NULL DEFINED AS: The character that indicates a null value.
SERDE
Identifies a SerDe that can parse the data and any properties of the SerDe that the access driver might need.
Example
This example specifies a SerDe for an Avro container file:

com.oracle.bigdata.rowformat: SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
The next example specifies a SerDe for a file containing regular expressions:
com.oracle.bigdata.rowformat=\
    SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' \
    WITH SERDEPROPERTIES \
    ("input.regex" = "(\\\\d{6}) (\\\\d{5}) (.{29}) .*")
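The quadruple backslashes pass through two levels of unescaping before reaching the SerDe, so the effective pattern is assumed here to be (\d{6}) (\d{5}) (.{29}) .*. A Python check of that assumed pattern against a record shaped to match:

```python
import re

# Assumed effective regex after both levels of backslash unescaping
# (an assumption for illustration, not stated by the access driver docs).
pattern = re.compile(r"(\d{6}) (\d{5}) (.{29}) .*")

# Six digits, five digits, then a 29-character field.
sample = "123456 78901 " + "X" * 29 + " trailing data"
match = pattern.match(sample)
```

match.groups() yields the three captured fields of the record.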
Identifies the Hive table that contains the source data.
Default Value
DEFAULT.external_table_name
Syntax
[hive_database_name.]table_name
Semantics
The maximum length of hive_database_name and table_name is 128 UTF-8 characters (512 bytes).
hive_database_name
The Hive database where the source data resides. DEFAULT
is the name of the initial Hive database.
table_name
The Hive table with the data. If you omit table_name, then ORACLE_HIVE
searches for a Hive table with the same name as the external table. Table names are case-insensitive.
Example
This setting indicates that the source data is in a table named ORDER_SUMMARY in the Hive ORDER_DB database:

com.oracle.bigdata.tablename=ORDER_DB.ORDER_SUMMARY
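The name resolution described above can be sketched in Python. resolve_tablename is an invented name, and the sketch assumes a simple dot-qualified hive_database_name.table_name format.

```python
def resolve_tablename(setting, external_table_name):
    """Resolve a com.oracle.bigdata.tablename value to (database, table):
    an omitted hive_database_name qualifier defaults to DEFAULT, and an
    omitted setting falls back to the external table's own name."""
    if not setting:
        return ("DEFAULT", external_table_name)
    db, sep, table = setting.partition(".")
    if sep:
        return (db, table)
    return ("DEFAULT", setting)
```

For the example above, ORDER_DB.ORDER_SUMMARY resolves to the ORDER_SUMMARY table in the ORDER_DB database.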
The Oracle Database catalog contains several static data dictionary views for Hive tables. You can query these data dictionary views to discover information about the Hive tables that you can access.
For you to access any Hive databases from Oracle Database, you must have read privileges on the ORACLE_BIGDATA_CONFIG directory object.
ALL_HIVE_DATABASES describes all databases in the Hive metastore accessible to the current user.
Related Views
DBA_HIVE_DATABASES describes all the databases in the Hive metastore.

USER_HIVE_DATABASES describes the databases in the Hive metastore owned by the current user.
Column | Datatype | NULL | Description |
---|---|---|---|
CLUSTER_ID | VARCHAR2(4000) | NOT NULL | Hadoop cluster where the Hive metastore is located |
DATABASE_NAME | VARCHAR2(4000) | NOT NULL | Hive database name |
DESCRIPTION | VARCHAR2(4000) | | Hive database description |
DB_LOCATION | VARCHAR2(4000) | NOT NULL | Hive database location |
HIVE_URI | VARCHAR2(4000) | | Hive database URI |
ALL_HIVE_TABLES describes all tables in the Hive metastore accessible to the current user.
The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".
Related Views
DBA_HIVE_TABLES describes all tables in the Hive metastore.

USER_HIVE_TABLES describes the tables in the database owned by the current user in the Hive metastore.
Column | Datatype | NULL | Description |
---|---|---|---|
CLUSTER_ID | VARCHAR2(4000) | NOT NULL | Hadoop cluster where the Hive metastore is located |
DATABASE_NAME | VARCHAR2(4000) | NOT NULL | Name of the Hive database |
TABLE_NAME | VARCHAR2(4000) | NOT NULL | Name of the Hive table |
LOCATION | VARCHAR2(4000) | | |
NO_OF_COLS | NUMBER | | Number of columns in the Hive table |
CREATION_TIME | DATE | | Time when the table was created |
LAST_ACCESSED_TIME | DATE | | Time of most recent access |
OWNER | VARCHAR2(4000) | | Owner of the Hive table |
TABLE_TYPE | VARCHAR2(4000) | | Type of Hive table, such as external or managed |
PARTITIONED | VARCHAR2(4000) | | Whether the table is partitioned (YES) or not (NO) |
NO_OF_PART_KEYS | NUMBER | | Number of partitions |
INPUT_FORMAT | VARCHAR2(4000) | | Input format |
OUTPUT_FORMAT | VARCHAR2(4000) | | Output format |
SERIALIZATION | VARCHAR2(4000) | | SerDe serialization information |
COMPRESSED | VARCHAR2(4000) | | Whether the table is compressed (YES) or not (NO) |
HIVE_URI | VARCHAR2(4000) | | Hive database URI |
ALL_HIVE_COLUMNS describes the columns of all Hive tables accessible to the current user.
The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".
Related Views
DBA_HIVE_COLUMNS describes the columns of all tables in the Hive metastore.

USER_HIVE_COLUMNS describes the columns of the tables in the Hive database owned by the current user.
Column | Datatype | NULL | Description |
---|---|---|---|
CLUSTER_ID | VARCHAR2(4000) | NOT NULL | Hadoop cluster where the Hive metastore is located |
DATABASE_NAME | VARCHAR2(4000) | | Name of the Hive database; if blank, then the default database |
TABLE_NAME | VARCHAR2(4000) | NOT NULL | Name of the Hive table |
COLUMN_NAME | VARCHAR2(4000) | NOT NULL | Name of the Hive column |
HIVE_COLUMN_TYPE | VARCHAR2(4000) | NOT NULL | Data type of the Hive column |
ORACLE_COLUMN_TYPE | VARCHAR2(4000) | | Oracle data type equivalent to Hive data type |
LOCATION | VARCHAR2(4000) | | |
OWNER | VARCHAR2(4000) | | Owner of the Hive table |
CREATION_TIME | DATE | | Time when the table was created |
HIVE_URI | VARCHAR2(4000) | | Hive database URI |
DBA_HIVE_DATABASES describes all the databases in the Hive metastore. Its columns are the same as those in ALL_HIVE_DATABASES.
DBA_HIVE_TABLES describes all tables in the Hive metastore. Its columns are the same as those in ALL_HIVE_TABLES.
The Oracle Big Data SQL configuration must identify the default Hive database for the current user. See "About the bigdata_config Directory".
DBA_HIVE_COLUMNS describes the columns of all tables in the Hive metastore. Its columns are the same as those in ALL_HIVE_COLUMNS.
USER_HIVE_DATABASES describes the databases in the Hive metastore owned by the current user. Its columns (except for OWNER) are the same as those in ALL_HIVE_DATABASES.
USER_HIVE_TABLES describes the tables in the database owned by the current user in the Hive metastore. Its columns (except for OWNER) are the same as those in ALL_HIVE_TABLES.
The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".
USER_HIVE_COLUMNS describes the columns of the tables in the Hive database owned by the current user. Its columns (except for OWNER) are the same as those in ALL_HIVE_COLUMNS.
The Oracle Big Data SQL configuration must identify the default Hive database for the current user. The current user must also have READ privileges on the ORA_BIGSQL_CONFIG database directory. See "About the bigdata_config Directory".